What You Built

Four moving pieces, each independently replaceable:

  • Chunker in plain Python. Replace it with semchunk or chonkie when you want smarter splits.
  • Embeddings through Vertex AI. Swap the model to text-embedding-large-exp-03-07 (3072 dims) for higher quality, or text-multilingual-embedding-002 for non-English content. Migrate the embedding column dim along with it.
  • Storage in Cloud SQL + pgvector. The hardest piece to replace later — and the most boring, on purpose.
  • Generation through Gemini Flash on Vertex AI. Swap to gemini-pro-latest for harder synthesis, or to Anthropic via Vertex AI Model Garden without leaving GCP.

You can stop here. Or you can take it further.

Lock Down /admin/ingest

The first thing to fix before exposing this to anyone real. Two clean options on Cloud Run:

Option A — Cloud Run IAM (simplest, GCP-native). Drop --allow-unauthenticated, then grant roles/run.invoker only to people/services that should be able to call it. Callers send an ID token in the Authorization header.

gcloud run services remove-iam-policy-binding rag \
  --region=$REGION \
  --member="allUsers" \
  --role="roles/run.invoker"
 
gcloud run services add-iam-policy-binding rag \
  --region=$REGION \
  --member="user:you@gmail.com" \
  --role="roles/run.invoker"

Then to call it:

curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  $SERVICE_URL/ask -d '...'

Option B — Split into two services. Keep the public /ask open. Move /admin/ingest to a second Cloud Run service that's behind IAM. Now the public surface area is exactly one POST endpoint, and the ingest job runs on a Cloud Scheduler cron without exposing anything human-callable.

Option B is what real apps end up doing. Option A is enough for everything pre-launch.

Private IP + Direct VPC Egress

Right now your Cloud SQL instance has a public IP (the Cloud SQL Python Connector still tunnels through Google's network, so traffic is private — but the IP exists). For production, switch to private IP only and connect from Cloud Run via Direct VPC egress.

The change in code is one line — flip ip_type=IPTypes.PRIVATE in src/db.py. The change in infra:

  1. Create a VPC network if you don't have one
  2. Add a private IP to the SQL instance: gcloud sql instances patch rag-db --no-assign-ip --network=projects/$PROJECT_ID/global/networks/default
  3. Add --network=default --subnet=default to your gcloud run deploy

After that, the database has no public IP at all. Even an exposed credential couldn't reach it from the internet.

Move to pgvector Halfvec or Binary for Scale

At ~100K chunks, a 768-dim float embedding column is ~300 MB. At 10M chunks, it's 30 GB. pgvector 0.7+ has two compressed types:

  • halfvec(N) — 16-bit floats. Half the storage. ~2% recall loss in practice.
  • bit(N) with <%> operator — 1-bit-per-dim binary quantization with a re-rank step. ~10× storage savings. Often paired with a re-rank against vector(N).

You'd alter the column type and rebuild the HNSW index. The application code doesn't change — pgvector operators behave the same way against all three types.

When to bother: when storage cost or query latency becomes visible. Below 1M chunks, just stay with vector(768).

Pure vector search misses exact-match queries — names, IDs, specific terminology. The fix is a hybrid query:

WITH semantic AS (
  SELECT id, embedding <=> %s AS distance
  FROM chunks ORDER BY distance LIMIT 20
),
keyword AS (
  SELECT id, ts_rank(to_tsvector('english', content),
                      plainto_tsquery('english', %s)) AS rank
  FROM chunks
  WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
  ORDER BY rank DESC LIMIT 20
)
SELECT * FROM (
  SELECT id, 1.0 / (1 + distance) AS score FROM semantic
  UNION ALL
  SELECT id, rank AS score FROM keyword
) combined
ORDER BY score DESC
LIMIT 5;

That's Reciprocal Rank Fusion. Postgres' built-in full-text search does the keyword half — no extra service. Add a GIN index on to_tsvector('english', content) and it's millisecond-fast.

Stand Up Evals

Two things to measure, in order:

  1. Retrieval recall@K. Build a tiny set of (question, expected_chunk_id) pairs. For each question, run retrieval with K=10. Track what fraction of expected chunks land in the top-10. Below ~80% is where you should look at chunker quality, embedding model, or query rewriting.
  2. Answer correctness. Same set, but with (question, expected_answer). Use Gemini-as-a-judge: "Given this question, this expected answer, and this generated answer — does the generated answer match? Yes/No." Re-run on every code change.

A 30-row evaluation set catches more issues than a 300-row one you never run.

Cloud Scheduler + Pub/Sub for Periodic Ingest

For docs that change (Notion, a Google Drive folder, a GitHub wiki), set up a daily refresh:

  • Cloud Scheduler cron job hits a /admin/refresh endpoint
  • Endpoint enumerates new/changed docs, embeds them, upserts into Postgres
  • Behind IAM auth so only Scheduler's service account can invoke it

Cloud Scheduler is dirt cheap — first three jobs are free, additional jobs are pennies a month.

Swap Cloud Run for Vertex AI Agent Engine

When you grow into a multi-step agent that needs sessions, tracing, and auth out of the box, look at Vertex AI Agent Engine. It's a managed runtime for ADK and LangChain agents. The Cloud SQL retrieval code doesn't change — you reuse it from inside the agent.

The matching blueprint here is Multi-Agent A2A on Cloud Run — same Cloud Run shape, multiple agents talking to each other.

Move Infra into Terraform

Everything in Steps 2, 3, and 7 is a one-time bootstrap. Once it works, codify it. The full set of resources is small enough to fit in one file:

resource "google_sql_database_instance" "rag" { ... }
resource "google_sql_database" "rag" { ... }
resource "google_sql_user" "human" { ... }
resource "google_sql_user" "runtime" { ... }
resource "google_service_account" "runtime" { ... }
resource "google_project_iam_member" "runtime_cloudsql_client" { ... }
resource "google_project_iam_member" "runtime_aiplatform" { ... }
resource "google_cloud_run_v2_service" "rag" { ... }

~80 lines of HCL replaces ~30 gcloud commands. terraform destroy becomes the new cleanup.

The Companion Repo

The runnable mirror of everything in the blueprint lives at github.com/maraja/rag-on-cloud-sql-pgvector. Clone it, fill in your .env, apply both SQL migrations, and gcloud run deploy --source . to get a working service in under fifteen minutes.

git clone https://github.com/maraja/rag-on-cloud-sql-pgvector.git
cd rag-on-cloud-sql-pgvector
cp .env.example .env  # fill it in
uv sync

The README in that repo is short on purpose. The blueprint above is the documentation.

Where to Go From Here

Key Takeaways

  • One database for everything. Vectors live next to your rows in Postgres. One backup, one connection, one set of credentials.
  • IAM auth all the way down. Humans, services, and the Cloud Run runtime all authenticate to Cloud SQL with the same mechanism — no passwords anywhere.
  • Vertex AI is a single dependency for embeddings and generation. Two API surfaces, one auth, one bill.
  • Cloud Run scales to zero. A hobby RAG app costs as much as leaving Cloud SQL running — and you can stop that too.
  • The blueprint is the unit of work. What you built in 8 steps is enough to ship a useful internal tool. Everything in this last step is optimization.