What It Costs

Cloud Run GPU is billed per second while an instance is awake, and there is no always-free GPU tier — the free monthly allotment covers CPU, memory, and requests, but every GPU-second is billed from the first one.

State | Cost
Idle (scaled to zero) | $0 (no instance, no GPU, no charge)
Awake and serving | ~$1.40/active-hour (roughly $0.67 GPU + ~$0.75 for the 8 vCPU / 32 GiB alongside it)
Pinned awake 24/7 | ~$800–1,000/month (don't do this for a demo)

The whole point of scale-to-zero: a GPU instance shuts down after a short idle period (roughly 10–15 minutes) with no traffic. So a build-test-demo-teardown session — even with a few hundred requests — lands in the low single-dollar range. The danger isn't normal use; it's an instance left awake, or one woken repeatedly by traffic you didn't expect.

Per-second prices vary by region and change over time. Check the live Cloud Run pricing page for exact figures before relying on them.
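As a rough worked example of that arithmetic, the sketch below multiplies awake time by the approximate ~$1.40/active-hour figure quoted above. The helper name estimate_cost and the sample durations are illustrative assumptions, and the rate is only this article's ballpark, not a live price.

```shell
# Ballpark only: approximate combined rate (GPU + vCPU + memory) in USD per awake hour.
RATE_PER_HOUR=1.40

# estimate_cost MINUTES -> prints the approximate USD cost for that many awake minutes.
# Hypothetical helper for illustration. Awake time includes the idle tail
# before the instance scales back to zero.
estimate_cost() {
  awk -v m="$1" -v r="$RATE_PER_HOUR" 'BEGIN { printf "%.2f\n", m / 60 * r }'
}

estimate_cost 30    # half-hour session  -> 0.70
estimate_cost 90    # 90-minute session  -> 2.10
```

Swap in the current per-second prices for your region before trusting the output for anything beyond a sanity check.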

The Open-Endpoint Risk

--allow-unauthenticated means anyone with the URL can run inference on your GPU. There's no key to leak because there's no key at all. For a workshop or a personal demo that's fine — but understand what you've exposed, and bound it.

The guardrails you already have, and what each actually does:

  • --max-instances 1: your hard ceiling. No matter how much traffic arrives, at most one GPU instance ever exists. This is the single most important cost control. (Cloud Run's default is 100; never leave that on a public GPU service.)
  • Scale-to-zero: as long as you don't set --min-instances, idle time costs nothing, and abuse stops billing a short while (~10–15 minutes) after the last request.
  • --timeout + --concurrency: cap how long each request may run and how many requests a single instance handles at once, limiting the blast radius of one abuser.
  • A budget alert: set one now.
    # In the console: Billing → Budgets & alerts → Create budget
    But know its limit: a budget alert only emails you after spend crosses a threshold. It does not cap or stop anything. A real hard stop requires wiring the budget to a Pub/Sub topic and a function that disables billing. That is out of scope here, but Google documents it.
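The guardrails above could be applied to an already-deployed service roughly as follows. This is a sketch, not the deploy command from the earlier steps: the timeout (300 s), the concurrency (4), the budget amount and thresholds, the budget display name, and the BILLING_ACCOUNT_ID variable are all assumptions to adjust. The commands are wrapped in functions so nothing runs until you call them.

```shell
# Assumed: service name "gemma" and REGION as used elsewhere in this guide.
# The timeout and concurrency numbers are illustrative, not recommendations.
apply_guardrails() {
  gcloud run services update gemma --region "${REGION:-us-central1}" \
    --max-instances 1 \
    --min-instances 0 \
    --timeout 300 \
    --concurrency 4
}

# CLI counterpart of the console budget flow. BILLING_ACCOUNT_ID is a placeholder
# you must export yourself; 10 USD and the 50%/90% thresholds are assumptions.
# Like the console version, this only sends alerts. It does not cap spend.
create_budget_alert() {
  gcloud billing budgets create \
    --billing-account="${BILLING_ACCOUNT_ID:?set BILLING_ACCOUNT_ID first}" \
    --display-name="gemma-demo-budget" \
    --budget-amount=10USD \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.9
}
```

Call apply_guardrails once after deploying, and create_budget_alert once per billing account.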

What doesn't help: a shared "API key" checked in your page's JavaScript. Anyone can read it straight from the page source, so it stops nobody. Real auth has to live server-side — which means the next section.

Locking It Down for Production

When you're past the demo, pick one:

Option A — make it private again (simplest). Drop public access and require a Google identity token. Now only callers you grant roles/run.invoker can reach it:

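# Remove the public (allUsers) invoker binding so unauthenticated calls are rejected.
gcloud run services remove-iam-policy-binding gemma --region "$REGION" \
  --member="allUsers" --role="roles/run.invoker"

# Grant invoke access to a specific identity.
gcloud run services add-iam-policy-binding gemma --region "$REGION" \
  --member="user:you@gmail.com" --role="roles/run.invoker"

# Then call it with a token:
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  "$SERVICE_URL/v1/chat/completions" -d '{...}'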

A browser can't hold that token safely, so this means putting a small backend of your own in front — which is the honest answer for a real product.
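For that small backend, one common pattern is to give it its own service account and grant only that identity the invoker role. The sketch below assumes a hypothetical account name, gemma-backend, and that PROJECT_ID and REGION are exported in your shell and match your active gcloud project; none of these names come from the earlier steps.

```shell
# Hypothetical service-account name for your own backend; pick any name you like.
BACKEND_SA_NAME="gemma-backend"

grant_backend_invoker() {
  # Service-account emails follow NAME@PROJECT_ID.iam.gserviceaccount.com.
  sa_email="${BACKEND_SA_NAME}@${PROJECT_ID:?set PROJECT_ID first}.iam.gserviceaccount.com"

  # Create a dedicated identity for the backend (in the active gcloud project).
  gcloud iam service-accounts create "$BACKEND_SA_NAME" \
    --display-name="Gemma demo backend"

  # Allow only that identity to invoke the private service.
  gcloud run services add-iam-policy-binding gemma --region "${REGION:-us-central1}" \
    --member="serviceAccount:${sa_email}" --role="roles/run.invoker"
}
```

The backend then runs as that service account, fetches an identity token for the service URL, and forwards browser requests with it.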

Option B — keep it browser-callable but guarded. Front the service with an external HTTPS load balancer, lock direct *.run.app ingress (--ingress=internal-and-cloud-load-balancing), and attach Cloud Armor rate-limiting rules. More moving parts, but it keeps the public-browser shape while throttling abuse.
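The ingress and Cloud Armor pieces of Option B might look like the sketch below. It deliberately omits building the load balancer itself (serverless NEG, backend service, URL map, proxy, forwarding rule) and assumes a backend service named gemma-backend-svc already exists. That name, the policy name, and the limit of 60 requests per minute per client IP are illustrative assumptions.

```shell
POLICY="gemma-throttle"               # hypothetical policy name
BACKEND_SERVICE="gemma-backend-svc"   # hypothetical; must already exist on your load balancer

lock_down_option_b() {
  # 1. Accept traffic only via the load balancer (and internal sources),
  #    not the raw *.run.app URL.
  gcloud run services update gemma --region "${REGION:-us-central1}" \
    --ingress internal-and-cloud-load-balancing

  # 2. Create a Cloud Armor policy with a per-client-IP throttle rule.
  gcloud compute security-policies create "$POLICY" \
    --description "Rate limit for the Gemma demo"
  gcloud compute security-policies rules create 1000 \
    --security-policy "$POLICY" \
    --src-ip-ranges "*" \
    --action throttle \
    --rate-limit-threshold-count 60 \
    --rate-limit-threshold-interval-sec 60 \
    --conform-action allow \
    --exceed-action deny-429 \
    --enforce-on-key IP

  # 3. Attach the policy to the load balancer's backend service.
  gcloud compute backend-services update "$BACKEND_SERVICE" \
    --security-policy "$POLICY" --global
}
```

Throttled clients receive HTTP 429 from the load balancer, so the request never reaches the GPU instance.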

For most "I just want a demo" cases, neither is needed — you tear it down instead.

Teardown

Opened a new terminal since Step 4? Re-set your vars first:

export REGION=us-central1
export PROJECT_ID=$(gcloud config get-value project)
export SERVICE_URL=$(gcloud run services describe gemma --region $REGION --format='value(status.url)')

Two commands end all billing for this blueprint.

# 1. Delete the service — stops all GPU/CPU billing immediately.
gcloud run services delete gemma --region $REGION --quiet
 
# 2. Delete the image so it stops costing Artifact Registry storage.
#    (Cloud Run's --source build creates a repo named "cloud-run-source-deploy".)
gcloud artifacts repositories delete cloud-run-source-deploy \
  --location $REGION --quiet
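To confirm that both deletions took effect, a quick check along these lines can help. It is a sketch that assumes the same REGION as above; the helper name verify_teardown is illustrative.

```shell
verify_teardown() {
  # Should no longer include "gemma".
  gcloud run services list --region "${REGION:-us-central1}" \
    --format="value(metadata.name)"

  # Should no longer include "cloud-run-source-deploy".
  gcloud artifacts repositories list --location "${REGION:-us-central1}" \
    --format="value(name)"
}
```

If neither name appears in the output, nothing from this blueprint is still running or stored in that region.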

Or nuke everything — service, image, project, the lot — in one shot:

gcloud projects delete $PROJECT_ID

That sends the project to a 30-day recovery window (undo with gcloud projects undelete), then it's gone for good. It's the cleanest "I'm done" button there is.

Where to Go From Here

  • Swap the model size. Change gemma4:e4b to gemma4:e2b (lighter/faster) in the Dockerfile and your requests, then redeploy. Same shape, different trade-off.
  • Higher throughput? When one Ollama instance isn't enough, vLLM serves the same OpenAI-compatible API with much higher concurrency. Google ships a prebuilt vLLM image for Gemma. The webpage doesn't change.
  • One-click path. Google AI Studio has a "Deploy to Cloud Run" button for Gemma that does a version of all this for you — handy once you understand what it's doing under the hood.
  • A real frontend. The streaming loop in Step 5 drops straight into a React/Next.js component — same fetch + getReader logic.
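For the model swap in the first bullet, the request side might change as sketched below. It assumes SERVICE_URL is exported, the service is still public, and the image has been rebuilt with the new tag. The helper name ask and the sample prompt are illustrative.

```shell
MODEL="gemma4:e2b"   # tag from the first bullet; must match what the Dockerfile pulls

ask() {
  # $1 = prompt text. Keep it free of double quotes: this sketch does no JSON escaping.
  curl -sS "${SERVICE_URL:?set SERVICE_URL first}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}"
}

# Usage: ask "Say hello in five words."
```

Only the model field changes between variants; the endpoint and message shape stay the same.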

Key Takeaways

  • A GPU is one flag on Cloud Run. --gpu 1 --gpu-type nvidia-l4 is the whole story; the rest is a normal deploy.
  • Bake the model into the image for fast, repeatable cold starts — viable for any Gemma variant under ~10 GB.
  • OLLAMA_ORIGINS is the CORS switch that makes a browser call work with no proxy. --allow-unauthenticated is the open switch. Together they're the entire "callable from a webpage" trick.
  • Open + GPU = real money exposed. --max-instances 1 and scale-to-zero bound it; a budget alert only warns you; teardown is the true off switch.
  • The blueprint is the unit of work. What you built in six steps — open model, serverless GPU, public URL, streaming webpage — is a complete, demoable system. Everything above is where you take it next.

Reference: Cloud Run pricing · Configure maximum instances · Cap spend with budget notifications · Managing access on Cloud Run · Delete and restore projects