**One-Line Summary:** Embeddings turn arbitrary text into high-dimensional vectors (768 to 3072 dims, depending on the model); cosine-similarity search over those vectors finds rows by meaning, not by exact keys — essential when the agent doesn't know what to ask for by name.
**Prerequisites:** Lesson `01-structured-retrieval-bigquery-as-a-tool.md`.
## What's the Concept?
Structured retrieval works when the agent already knows the lookup key — a customer ID, an order number, a date range. Semantic retrieval handles the other half: "find documents that discuss billing disputes" or "look up tickets that sound like this new one."
The mechanism: pre-compute an embedding — a high-dimensional vector — for every chunk of text in the relevant gold table. At query time, embed the agent's search string the same way, then find the rows whose vectors are closest to the query vector (usually by cosine similarity). The result is a ranked list of semantically similar rows.
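To make "closest" concrete: BigQuery's built-in `ML.DISTANCE` function computes the same cosine distance that `VECTOR_SEARCH` uses. A toy sketch, with 4-dim literal arrays standing in for real 768-dim embeddings:

```sql
-- Cosine distance = 1 - cosine similarity, so smaller means more similar.
-- The 4-dim literals below are toy stand-ins for real embeddings.
SELECT
  ML.DISTANCE([0.1, 0.8, 0.3, 0.2], [0.1, 0.7, 0.4, 0.2], 'COSINE') AS near_pair,
  ML.DISTANCE([0.1, 0.8, 0.3, 0.2], [0.9, 0.0, 0.1, 0.8], 'COSINE') AS far_pair;
```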
On GCP, this is a two-service combination:
- Vertex AI Embeddings API to compute embeddings (`text-embedding-005`, `text-embedding-004`, the earlier gecko family).
- BigQuery's native vector search (`ARRAY<FLOAT64>` embedding columns plus the `VECTOR_SEARCH` function) for storage and search, or Vertex AI Vector Search for high-scale managed ANN.
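Before `ML.GENERATE_EMBEDDING` can call the Embeddings API from SQL, BigQuery needs a remote model pointing at the Vertex AI endpoint through a Cloud resource connection. A one-time setup sketch; the dataset and connection names (`embedding_models`, `us.vertex_conn`) are placeholders:

```sql
-- One-time setup: a remote model that proxies the Vertex AI
-- text-embedding-005 endpoint. `myco.us.vertex_conn` is a hypothetical
-- BigQuery Cloud resource connection with access to Vertex AI.
CREATE OR REPLACE MODEL `myco.embedding_models.text_embedding_005`
  REMOTE WITH CONNECTION `myco.us.vertex_conn`
  OPTIONS (ENDPOINT = 'text-embedding-005');
```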
## How It Works
The pipeline:
```
silver ───▶ gold.support_tickets
            ┌───────────┬─────────┬──────┬──────────────────┐
            │ ticket_id │ subject │ body │ embedding        │
            │           │         │      │ ARRAY<FLOAT64>   │
            └───────────┴─────────┴──────┴──────────────────┘
                                                 ▲
                                                 │ daily refresh
                                        ┌────────┴───────┐
                                        │   Vertex AI    │
                                        │ Embeddings API │
                                        └────────────────┘

                  ┌────────────────┐
user query ─────▶ │   Vertex AI    │ ──▶ query_vec ARRAY<FLOAT64> (768 dims)
                  │ Embeddings API │
                  └────────────────┘
                           │
                           ▼
                  VECTOR_SEARCH(
                    TABLE gold.support_tickets,
                    'embedding',
                    query_vec,
                    top_k => 5
                  )
                           │
                           ▼
                  top 5 similar ticket rows
```

The gold table generation in BigQuery uses the built-in `ML.GENERATE_EMBEDDING` table function:
```sql
CREATE OR REPLACE TABLE `myco.gold.support_tickets`
PARTITION BY DATE(created_at)
AS
SELECT
  ticket_id,
  customer_id,
  subject,
  body,
  status,
  created_at,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `myco.embedding_models.text_embedding_005`,
  (
    -- ML.GENERATE_EMBEDDING embeds the column named `content` and
    -- passes every other input column through to its output.
    SELECT
      ticket_id,
      customer_id,
      subject,
      body,
      status,
      created_at,
      subject || '\n\n' || body AS content
    FROM `myco.silver.tickets`
  ),
  STRUCT(TRUE AS flatten_json_output)
);
```

Search at query time, also in SQL:
```sql
SELECT base.ticket_id, base.subject, base.body, distance
FROM VECTOR_SEARCH(
  TABLE `myco.gold.support_tickets`,
  'embedding',
  (
    -- Embed the search string with the same model used for the table.
    -- Aliasing the output to `embedding` matches column_to_search,
    -- which is the default query column name.
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `myco.embedding_models.text_embedding_005`,
      (SELECT 'customer complained about a double-charge' AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
);
```

The agent's tool wraps this query the same way structured tools wrap their queries.
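In practice the tool should never splice user text into the SQL string. A minimal sketch of the parameterized form the tool would actually execute; the `@search_text` name is a placeholder bound through your BigQuery client's query-parameter API:

```sql
-- Same search, with the user's text bound as a named query parameter
-- rather than interpolated into the SQL string.
SELECT base.ticket_id, base.subject, distance
FROM VECTOR_SEARCH(
  TABLE `myco.gold.support_tickets`,
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `myco.embedding_models.text_embedding_005`,
      (SELECT @search_text AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
);
```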
## Why It Matters
- The agent can search by meaning. A user asking "did anyone else hit this bug?" doesn't know which ticket IDs to look up; semantic search does.
- Chunks become first-class data. Long documents get split into paragraph-sized chunks, each embedded separately. The agent retrieves the relevant chunks, not whole documents (see the chunking sketch after this list).
- In-warehouse search keeps the architecture flat. BigQuery's `VECTOR_SEARCH` runs alongside your regular SQL. No separate vector DB to operate until you genuinely need one.
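The chunking step can also stay in SQL. A sketch that splits documents on blank lines, assuming a hypothetical `myco.silver.documents` table with `doc_id` and `body` columns:

```sql
-- Explode each document into paragraph-sized chunks, one row per chunk.
-- chunk_idx preserves the chunk's position within its document.
SELECT
  d.doc_id,
  chunk_idx,
  chunk
FROM `myco.silver.documents` AS d,
UNNEST(SPLIT(d.body, '\n\n')) AS chunk WITH OFFSET AS chunk_idx
WHERE TRIM(chunk) != '';
```

Each chunk row then flows through the same `ML.GENERATE_EMBEDDING` pattern as the tickets table above.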
## Key Technical Details
- Embedding dimensions: 768 for `text-embedding-005`, 3072 for `gemini-embedding-001`. Higher dims are slightly more accurate but cost more in storage and search latency.
- BigQuery vector index types: IVF (cheap, fast build, fine for <10M rows) and `TREE_AH`, a ScaNN-based index (recommended for >10M); see the DDL sketch after this list. Both index types are eventually consistent — there's a small delay between insert and searchability.
- `text-embedding-005` is strong on English and code and cheap (~$0.0001 per 1,000 input chars). Re-embedding a million rows of customer-support data is single-digit dollars.
- Distance metric: cosine for normalized text embeddings (always with Vertex AI text models), L2/Euclidean for image embeddings.
- Embedding cost is recurring — every refresh re-embeds new or changed rows. Cache aggressively; we'll cover that in Module 07.
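The index itself is one DDL statement. A sketch for the IVF case; the index name is arbitrary, and swapping `index_type` to `'TREE_AH'` gives the ScaNN-based variant:

```sql
-- Optional ANN index. Without one, VECTOR_SEARCH brute-forces the scan:
-- exact results, but it reads every row's embedding.
CREATE VECTOR INDEX ticket_embedding_idx
ON `myco.gold.support_tickets` (embedding)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE'
);
```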
## Common Misconceptions
"Embeddings replace structured search." They complement it. Structured retrieval is faster, cheaper, and exact for known-key lookups. Use both — see 03-hybrid-retrieval-structured-plus-semantic.md.
"Bigger embedding model = better retrieval." Not always. Domain match matters more than parameter count. A general-purpose 768-dim model often beats a specialist 1536-dim model on out-of-domain content.
"Just use a vector database." You eventually might, but starting with BigQuery's native vector search means one fewer service to operate, monitor, and pay for. Graduate when scale forces it.
## Connections to Other Concepts
- `03-hybrid-retrieval-structured-plus-semantic.md` — Combining the two retrieval modes.
- Course `04-refinement-in-bigquery/02-silver-to-gold-modeling-for-agents.md` — Where the embedded gold table is built.
- Course `07-operating-the-system/02-cost-control-on-bigquery-and-vertex-ai.md` — Embedding cost dominates this part of the bill.
## Further Reading
- Anthropic, "Introducing Contextual Retrieval" (Sept 2024) — Prepend a model-generated 50–100 token "context" to each chunk before embedding. The simplest large win in retrieval quality this decade; bake it into your chunking step. https://www.anthropic.com/research/contextual-retrieval
- Muennighoff et al., "MTEB: Massive Text Embedding Benchmark" (2022; leaderboard continuously updated) — How you actually pick an embedding model in 2026. The Hugging Face leaderboard at huggingface.co/spaces/mteb/leaderboard is the source of truth; always re-test the shortlist on your own corpus.
- Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (2020) — Foundational dual-encoder paper; the conceptual base under every Vertex AI text-embedding model. https://arxiv.org/abs/2004.04906
- Khattab & Zaharia, "ColBERT: Late Interaction over BERT" (2020) — Multi-vector retrieval as a stronger but more expensive alternative to single-vector embeddings. https://arxiv.org/abs/2004.12832
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — Why ranking matters even when you can fit everything; informs your top-k choice. https://arxiv.org/abs/2307.03172
- Google Cloud, "Vertex AI text embeddings" + "BigQuery vector embeddings and similarity search" — Authoritative product docs; check often, the model family rotates fast.
- Cohere Rerank v3 / Voyage rerank-2 docs — Cross-encoder rerankers; the standard two-stage retrieval (ANN candidate set → reranker) is the production pattern when relevance trumps latency.
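The contextual-retrieval trick from the Anthropic entry above can also run in-warehouse before the embedding step. A hedged sketch using `ML.GENERATE_TEXT`, assuming a hypothetical remote Gemini model `myco.llm_models.gemini_flash` and a hypothetical `myco.gold.doc_chunks` table carrying each chunk alongside its parent `doc_body`:

```sql
-- For each chunk, generate a short situating context, then embed
-- context + chunk instead of the bare chunk (Contextual Retrieval).
SELECT
  doc_id,
  chunk_idx,
  ml_generate_text_llm_result || '\n\n' || chunk AS content  -- feed to ML.GENERATE_EMBEDDING
FROM ML.GENERATE_TEXT(
  MODEL `myco.llm_models.gemini_flash`,  -- hypothetical remote Gemini model
  (
    SELECT
      doc_id,
      chunk_idx,
      chunk,
      'Document:\n' || doc_body ||
      '\n\nIn 2-3 sentences, situate this chunk within the document:\n' ||
      chunk AS prompt
    FROM `myco.gold.doc_chunks`          -- hypothetical chunk table
  ),
  STRUCT(TRUE AS flatten_json_output)
);
```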