The Whole Project
It's two files. Make a directory and create the Dockerfile:
deploy-gemma-to-cloud-run/
├── Dockerfile        # this step
└── web/
    └── index.html    # Step 5
The Dockerfile
FROM ollama/ollama:latest
# Cloud Run routes traffic to $PORT (8080). Ollama defaults to 127.0.0.1:11434,
# so we must rebind it to listen on 0.0.0.0:8080.
ENV OLLAMA_HOST=0.0.0.0:8080
# Allow a browser on ANY origin to call this service (CORS).
# For production, set this to your exact site, e.g. https://you.github.io
ENV OLLAMA_ORIGINS=*
# Keep the model resident in GPU memory between requests (don't unload on idle).
ENV OLLAMA_KEEP_ALIVE=-1
# Allow a few concurrent requests. Match this to --concurrency on deploy.
ENV OLLAMA_NUM_PARALLEL=4
# Store models at a fixed path so the pull below lands in an image layer.
ENV OLLAMA_MODELS=/models
# Bake the weights into the image: start the server, wait until it's ready, pull the model.
# Now the weights ship INSIDE the image — no download on cold start.
# Swap to gemma4:e2b for a lighter/faster build.
RUN ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && ollama pull gemma4:e4b
ENTRYPOINT ["ollama", "serve"]
That's the entire server. Five environment variables, one bake step, one entrypoint.
Line by Line
- `FROM ollama/ollama:latest`: the official Ollama image. It already contains the NVIDIA runtime bits; Cloud Run supplies the GPU drivers. Nothing to install.
- `OLLAMA_HOST=0.0.0.0:8080`: without this, Ollama listens only on localhost on port 11434 and Cloud Run can't route to it. This is the single most common reason a deploy "succeeds" but never serves traffic.
- `OLLAMA_ORIGINS=*`: the CORS switch. With it set, Ollama answers the browser's preflight `OPTIONS` and returns `Access-Control-Allow-Origin`. Without it, a webpage on any other origin gets a CORS error and never reaches the model. (`*` is safe here precisely because the endpoint is unauthenticated: there are no cookies or credentials to protect.)
- `OLLAMA_KEEP_ALIVE=-1`: keeps the model loaded in the GPU so the second request doesn't pay the load cost again. The instance still scales to zero when idle; this only governs behavior while the instance is awake.
- `OLLAMA_NUM_PARALLEL=4`: how many requests one instance handles at once. We mirror this with `--concurrency 4` at deploy time.
- The `RUN` line: starts Ollama in the background, waits until it's accepting requests, then runs `ollama pull gemma4:e4b`. Because `OLLAMA_MODELS=/models` is set, the weights are written into the image. This is the "bake", covered next.
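The wait loop in the RUN line polls with no upper bound, so a server that never comes up would stall the build until the builder's own timeout. A minimal sketch of a bounded version of the same polling idea follows. The helper name `wait_for` and the retry count are assumptions for illustration; they are not part of Ollama or Docker.

```shell
# wait_for MAX_TRIES CMD...: run CMD about once per second until it succeeds,
# giving up with a non-zero status after MAX_TRIES failed attempts.
# Hypothetical helper, shown only to illustrate a bounded readiness poll.
wait_for() {
  max_tries=$1
  shift
  tries=0
  until "$@" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}
```

Called as `wait_for 60 ollama list`, it would stop after roughly a minute instead of looping forever. To use it inside a RUN step, the function would have to be defined in that same shell command; that wiring is left out here.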
Why Bake Instead of Pull at Startup
You have two options for getting the weights into a running container:
- Pull at startup (the official tutorial's approach): ship the bare Ollama image and `ollama pull` when the container boots. Simple, but every cold start re-downloads ~9.6 GB before it can answer, which is slow and repeated, since the service scales to zero.
- Bake into the image (what we do): download the weights once at build time so they're already there when the container starts. Cold starts are fast and predictable.
Google's own guidance: baking is the recommended path for models under ~10 GB. gemma4:e4b is 9.6 GB — right at the line, and the better default for a demo where the first impression is the cold-start latency.
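To get a feel for what pulling at startup costs, the download time can be estimated from the model size and a throughput figure. The sketch below is back-of-the-envelope arithmetic only: the helper name `download_seconds` is hypothetical, and the 1 Gbit/s rate is an assumed value, not a measured or documented one.

```shell
# download_seconds SIZE_GB GBIT_PER_S: seconds needed to transfer SIZE_GB
# gigabytes at a sustained GBIT_PER_S gigabits per second (8 bits per byte).
download_seconds() {
  awk -v gb="$1" -v gbps="$2" 'BEGIN { printf "%.1f\n", gb * 8 / gbps }'
}

# Assumed throughput of 1 Gbit/s, chosen purely for illustration.
download_seconds 9.6 1
```

At that assumed rate the 9.6 GB pull takes about 77 seconds, paid again on every cold start and before the weights are even loaded into the GPU. Baking moves that cost to build time.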
Which Gemma to Use
gemma4:e4b (the default above) is the sweet spot for a single L4: 9.6 GB on disk, ~4.5B effective parameters, a 128K-token context window, text + image + audio in. It leaves comfortable room in the L4's 24 GB of VRAM for the context.
Swap the tag in the RUN line (and later in your requests) if you want a different trade-off:
| Tag | Size | Notes |
|---|---|---|
| `gemma4:e2b` | 7.2 GB | Lighter and faster; builds quicker. Great when speed matters more than peak quality. |
| `gemma4:e4b` | 9.6 GB | Recommended default. Best quality that still bakes cleanly and fits an L4 with headroom. |
| `gemma4:12b` | 7.6 GB | 256K context. Fits an L4, but less KV-cache headroom for long chats. |
| `gemma4:26b` / `:31b` | 18–20 GB | Do not use on an L4. No room left for context. These need a bigger GPU (e.g. nvidia-rtx-pro-6000). |
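The headroom argument behind the table is simple subtraction. The sketch below makes it concrete, assuming the on-disk size roughly equals the VRAM the weights occupy. That is an approximation: real usage depends on the runtime and on how much context is allocated. The helper name `l4_headroom_gb` is hypothetical.

```shell
# l4_headroom_gb SIZE_GB: GB left on a 24 GB L4 once the weights are loaded,
# under the rough assumption that on-disk size equals VRAM used by weights.
L4_VRAM_GB=24
l4_headroom_gb() {
  awk -v vram="$L4_VRAM_GB" -v size="$1" 'BEGIN { printf "%.1f\n", vram - size }'
}

l4_headroom_gb 9.6   # the default, gemma4:e4b
l4_headroom_gb 20    # upper end of the largest row in the table
```

Under that assumption the default leaves about 14.4 GB for context, while a 20 GB model leaves only 4.0 GB.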
The model tag is load-bearing: whatever you `ollama pull` here must exactly match the `model` field in your requests in Steps 4 and 5, or you'll get a `404 model not found`.
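One way to keep the two in sync is to read the tag straight from the Dockerfile instead of retyping it. A minimal sketch, assuming the Dockerfile has a single RUN line containing `ollama pull <tag>`; the helper name `pulled_tag` is hypothetical:

```shell
# pulled_tag FILE: print the model tag that follows "ollama pull" on the
# first RUN line of FILE that contains it. Comment lines are ignored
# because only lines starting with "RUN " are considered.
pulled_tag() {
  sed -n '/^RUN /s/.*ollama pull \([^ ]*\).*/\1/p' "$1" | head -n 1
}
```

Running `pulled_tag Dockerfile` against the file above would print the baked tag, which can then be substituted into the `model` field of each request so the two cannot drift apart.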
A .dockerignore
One thing worth knowing: gcloud run deploy --source decides what to upload from .gcloudignore (or, if that's absent, your .gitignore) — not .dockerignore. We include a .dockerignore anyway so a local docker build stays lean, and it's harmless here regardless: this Dockerfile has no COPY/ADD, so nothing from the context ever enters the image. (Your Dockerfile is always used when present — the patterns below never match it.)
web
README.md
*.md
deploy.sh
.git
.gitignore
What You Have Now
- A `Dockerfile` that bakes `gemma4:e4b` and turns on CORS
- A `.dockerignore`
- No image built yet; that happens in one command next
Reference: Ollama on Docker · Cloud Run GPU best practices (model storage) · Ollama environment variables (FAQ) · Gemma 4 on Ollama