Step 3: Bake Gemma into a Container

The Whole Project

The project is two files (plus an optional .dockerignore, covered at the end of this step). Make a directory and create the Dockerfile:

deploy-gemma-to-cloud-run/
├── Dockerfile        # this step
└── web/
    └── index.html    # Step 5
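
As a minimal sketch, the layout above can be created from a shell like this. The empty files are only placeholders: the Dockerfile is filled in below, and index.html in Step 5.

```shell
# Create the project skeleton shown in the tree above.
mkdir -p deploy-gemma-to-cloud-run/web

# Empty placeholders for the two project files.
touch deploy-gemma-to-cloud-run/Dockerfile
touch deploy-gemma-to-cloud-run/web/index.html
```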

The Dockerfile

FROM ollama/ollama:latest
 
# Cloud Run routes traffic to $PORT (8080). Ollama defaults to 127.0.0.1:11434,
# so we must rebind it to listen on 0.0.0.0:8080.
ENV OLLAMA_HOST=0.0.0.0:8080
 
# Allow a browser on ANY origin to call this service (CORS).
# For production, set this to your exact site, e.g. https://you.github.io
ENV OLLAMA_ORIGINS=*
 
# Keep the model resident in GPU memory between requests (don't unload on idle).
ENV OLLAMA_KEEP_ALIVE=-1
 
# Allow a few concurrent requests. Match this to --concurrency on deploy.
ENV OLLAMA_NUM_PARALLEL=4
 
# Store models at a fixed path so the pull below lands in an image layer.
ENV OLLAMA_MODELS=/models
 
# Bake the weights into the image: start the server, wait until it's ready, pull the model.
# Now the weights ship INSIDE the image — no download on cold start.
# Swap to gemma4:e2b for a lighter/faster build.
RUN ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && ollama pull gemma4:e4b
 
ENTRYPOINT ["ollama", "serve"]

That's the entire server. Five environment variables, one bake step, one entrypoint.
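
One thing to be aware of: the until loop in the RUN line has no upper bound, so if the server never comes up the build waits indefinitely. A bounded version of the same polling idea might look like the sketch below. The helper name wait_until_ready and the 60-second budget in the usage comment are assumptions for illustration, not part of Ollama or of this project.

```shell
# Poll a probe command once per second until it succeeds,
# giving up after a maximum number of seconds.
#   $1      = maximum seconds to wait
#   $2...   = probe command and its arguments
wait_until_ready() {
  max_seconds=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$max_seconds" ]; then
      echo "not ready after ${max_seconds}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Hypothetical usage, mirroring the RUN line:
#   ollama serve & wait_until_ready 60 ollama list && ollama pull gemma4:e4b
```

The trade-off is a slightly longer script in exchange for a build that fails loudly instead of hanging.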

Line by Line

FROM ollama/ollama:latest — the official Ollama image. It already contains the NVIDIA runtime bits; Cloud Run supplies the GPU drivers. Nothing to install.
OLLAMA_HOST=0.0.0.0:8080 — without this, Ollama listens only on localhost on port 11434 and Cloud Run can't route to it. This is the single most common reason a deploy "succeeds" but never serves traffic.
OLLAMA_ORIGINS=* — the CORS switch. With it set, Ollama answers the browser's preflight OPTIONS and returns Access-Control-Allow-Origin. Without it, a webpage on any other origin gets a CORS error and never reaches the model. (* is safe here precisely because the endpoint is unauthenticated — there are no cookies or credentials to protect.)
OLLAMA_KEEP_ALIVE=-1 — keeps the model loaded in the GPU so the second request doesn't pay the load cost again. The instance still scales to zero when idle; this only governs behavior while the instance is awake.
OLLAMA_NUM_PARALLEL=4 — how many requests one instance handles at once. We mirror this with --concurrency 4 at deploy time.
The RUN line — starts Ollama in the background, waits until it's accepting requests, then ollama pull gemma4:e4b. Because OLLAMA_MODELS=/models is set, the weights are written into the image. This is the "bake" — covered next.
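
To see the CORS behavior described above for yourself once a container is running, you can send the preflight by hand. This is a sketch under assumptions: check_cors is a hypothetical helper, the base URL is wherever your container happens to be reachable, and https://example.com stands in for your page's origin. It targets /api/generate, Ollama's text-generation endpoint.

```shell
# Send a CORS preflight (OPTIONS) and succeed only if the response
# includes an Access-Control-Allow-Origin header.
#   $1 = base URL of the service, e.g. http://localhost:8080
#   $2 = origin to pretend the browser page is served from
check_cors() {
  base_url=$1
  origin=$2
  curl -s -i -X OPTIONS \
    -H "Origin: $origin" \
    -H "Access-Control-Request-Method: POST" \
    -H "Access-Control-Request-Headers: content-type" \
    "$base_url/api/generate" | grep -qi '^access-control-allow-origin:'
}

# Hypothetical usage:
#   check_cors http://localhost:8080 https://example.com && echo "CORS ok"
```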

Why Bake Instead of Pull at Startup

You have two options for getting the weights into a running container:

Pull at startup (the official tutorial's approach): ship the bare Ollama image and ollama pull when the container boots. Simple, but every cold start re-downloads ~9.6 GB before it can answer — slow and repeated, since the service scales to zero.
Bake into the image (what we do): download the weights once at build time so they're already there when the container starts. Cold starts are fast and predictable.
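
To put a rough number on the pull-at-startup cost, here is a back-of-envelope estimate. The 1000 Mbit/s sustained rate is purely an assumed figure for illustration, not a measured one, and the helper name is hypothetical; plug in whatever throughput you actually observe.

```shell
# Rough transfer time in whole seconds.
#   $1 = size in MB (decimal), $2 = sustained link speed in Mbit/s
download_seconds() {
  size_mb=$1
  link_mbps=$2
  echo $(( size_mb * 8 / link_mbps ))
}

# 9.6 GB is about 9600 MB; 1000 Mbit/s is an assumed rate.
download_seconds 9600 1000   # prints 76
```

Even under that optimistic assumption, every cold start would spend over a minute downloading before the model can begin loading.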

Google's own guidance: baking is the recommended path for models under ~10 GB. gemma4:e4b is 9.6 GB, right at that line, so baking is the better default for a demo, where cold-start latency is the first impression.

Which Gemma to Use

gemma4:e4b (the default above) is the sweet spot for a single L4: 9.6 GB on disk, ~4.5B effective parameters, a 128K-token context window, text + image + audio in. It leaves comfortable room in the L4's 24 GB of VRAM for the context.

Swap the tag in the RUN line (and later in your requests) if you want a different trade-off:

Tag	Size	Notes
`gemma4:e2b`	7.2 GB	Lighter and faster; builds quicker. Great when speed matters more than peak quality.
`gemma4:e4b`	9.6 GB	Recommended default. Best quality that still bakes cleanly and fits an L4 with headroom.
`gemma4:12b`	7.6 GB	256K context. Fits an L4, but less KV-cache headroom for long chats.
`gemma4:26b` / `:31b`	18–20 GB	Do not use on an L4. No room left for context. These need a bigger GPU (e.g. `nvidia-rtx-pro-6000`).
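
The headroom reasoning in the table can be made concrete with simple subtraction. Treat the result as an upper bound only: it assumes the loaded weights take about as much VRAM as the on-disk size, which ignores runtime overhead. The helper name is hypothetical.

```shell
# Upper-bound estimate of VRAM (GB) left on a 24 GB L4
# after loading weights of the given on-disk size (GB).
l4_headroom_gb() {
  awk -v size="$1" 'BEGIN { printf "%.1f\n", 24 - size }'
}

l4_headroom_gb 9.6   # gemma4:e4b -> prints 14.4
l4_headroom_gb 20    # the largest tags -> prints 4.0
```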

The model tag is load-bearing: whatever you ollama pull here must exactly match the model field in your requests in Steps 4 and 5, or you'll get a 404 model not found.
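
One way to keep the two in sync is to read the tag out of the Dockerfile instead of typing it twice. The sketch below assumes the pull happens on a RUN line, as in the Dockerfile above; baked_tag and request_body are hypothetical helper names, and the prompt text is a placeholder.

```shell
# Print the first model tag that a Dockerfile's RUN line pulls.
#   $1 = path to the Dockerfile
baked_tag() {
  sed -n '/^RUN/s/.*ollama pull \([^ ]*\).*/\1/p' "$1" | head -n 1
}

# Build a JSON body for Ollama's /api/generate using that tag.
#   $1 = model tag
request_body() {
  printf '{"model":"%s","prompt":"Hello","stream":false}\n' "$1"
}

# Hypothetical usage, run from the project directory:
#   request_body "$(baked_tag Dockerfile)"
```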

A `.dockerignore`

One thing worth knowing: gcloud run deploy --source decides what to upload from .gcloudignore (or, if that's absent, your .gitignore) — not .dockerignore. We include a .dockerignore anyway so a local docker build stays lean, and it's harmless here regardless: this Dockerfile has no COPY/ADD, so nothing from the context ever enters the image. (Your Dockerfile is always used when present — the patterns below never match it.)

web
*.md
deploy.sh
.git
.gitignore
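
Since gcloud reads .gcloudignore rather than .dockerignore, you may also want the same patterns there so the source upload stays small. This is optional and an assumption on our part, not something the later steps depend on. .gcloudignore uses gitignore-style patterns, so the list carries over as-is:

```shell
# Write a .gcloudignore that mirrors the .dockerignore patterns above.
# The Dockerfile is deliberately not listed, so it is still uploaded.
cat > .gcloudignore <<'EOF'
web
*.md
deploy.sh
.git
.gitignore
EOF
```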

What You Have Now

A Dockerfile that bakes gemma4:e4b and turns on CORS
A .dockerignore
No image built yet — that happens in one command next

Reference: Ollama on Docker · Cloud Run GPU best practices (model storage) · Ollama environment variables (FAQ) · Gemma 4 on Ollama