One-Line Summary: Two Dataform models — one silver, one gold-with-embeddings — turn the bronze JSONL into a chunked, embedded, filterable BigQuery table the retrieval tool can query directly.
Prerequisites: Lesson 02-step-one-ingesting-product-docs.md and Module 04.
What's the Concept?
Bronze docs are useless to an agent as-is — they're long Markdown files. The refinement step does three things:
- Parse — extract frontmatter (title, tags, section) and clean the markdown body.
- Chunk — split each doc into ~500-token chunks (~2,000 characters) with ~50 tokens of overlap, so the retrieval tool can return a precise passage.
- Embed — compute one vector per chunk using Vertex AI.
The output is gold.docs_chunks: one row per chunk, with text, metadata, and a vector column ready for hybrid retrieval.
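Concretely, the contract looks roughly like this. A sketch only: Dataform materializes the real table from the models below, and the column types here are assumptions read off those models.

```sql
-- Sketch of gold.docs_chunks (types assumed from the models below;
-- Dataform creates the real table, so you never run this yourself)
CREATE TABLE IF NOT EXISTS gold.docs_chunks (
  chunk_id      STRING,          -- SHA256 of doc_path + chunk offset
  doc_path      STRING,
  title         STRING,
  summary       STRING,
  chunk_offset  INT64,
  chunk_text    STRING,
  embedding     ARRAY<FLOAT64>,  -- 768-dim, text-embedding-005
  content_hash  STRING,
  _commit       STRING,
  _chunked_at   TIMESTAMP
)
PARTITION BY DATE(_chunked_at)
CLUSTER BY doc_path;
```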
How It Works
Two Dataform models do all of this.
Silver: silver_docs.sqlx — one row per doc, cleaned:
config {
  type: "table",
  bigquery: { partitionBy: "DATE(_pulled_at)" }
}
SELECT
  doc_path,
  -- Extract the One-Line Summary and H1 title from the frontmatter
  REGEXP_EXTRACT(content, r"\*\*One-Line Summary\*\*:\s*(.+?)\n") AS summary,
  REGEXP_EXTRACT(content, r"^#\s+(.+?)\n") AS title,
  -- Strip code fences for embedding purposes (keep full body for retrieval responses)
  REGEXP_REPLACE(content, r"```[\s\S]*?```", "") AS body_for_embed,
  content AS body_full,
  content_hash,
  _commit,
  _pulled_at
FROM ${ref({schema: "bronze", name: "github_docs_raw"})}
WHERE content IS NOT NULL
-- Keep only the latest pull of each doc
QUALIFY ROW_NUMBER() OVER (PARTITION BY doc_path ORDER BY _pulled_at DESC) = 1

Gold: gold_docs_chunks.sqlx — chunked and embedded:
config {
  type: "incremental",
  uniqueKey: ["chunk_id"],
  bigquery: { partitionBy: "DATE(_chunked_at)", clusterBy: ["doc_path"] }
}
WITH base AS (
  SELECT
    s.doc_path, s.title, s.summary, s.content_hash, s._commit,
    -- Naive chunker: split into ~2000-char chunks with 200-char overlap
    -- Production would use a smarter sentence-aware splitter
    STRUCT(
      chunk_offset AS offset,
      SUBSTR(s.body_for_embed, chunk_offset, 2200) AS text
    ) AS chunk
  FROM ${ref("silver_docs")} s,
    UNNEST(GENERATE_ARRAY(1, LENGTH(s.body_for_embed), 2000)) AS chunk_offset
  ${when(incremental(), `
    WHERE s.content_hash NOT IN (
      SELECT content_hash FROM ${self()}
    )
  `)}
),
with_ids AS (
  SELECT
    *,
    TO_HEX(SHA256(CONCAT(doc_path, "::", CAST(chunk.offset AS STRING))))
      AS chunk_id
  FROM base
  WHERE LENGTH(chunk.text) > 200  -- skip near-empty trailing chunks
),
embedded AS (
  -- ML.GENERATE_EMBEDDING takes the whole query as its input, passes the
  -- input columns through, and appends ml_generate_embedding_result (the
  -- vector). The input query must expose the text to embed as `content`.
  SELECT
    *,
    ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL ${ref({schema: "embedding_models", name: "text_embedding_005"})},
    (SELECT *, chunk.text AS content FROM with_ids),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  chunk_id,
  doc_path,
  title,
  summary,
  chunk.offset AS chunk_offset,
  chunk.text AS chunk_text,
  embedding,
  content_hash,
  _commit,
  CURRENT_TIMESTAMP() AS _chunked_at
FROM embedded

The incremental logic — WHERE s.content_hash NOT IN (SELECT content_hash FROM ${self()}) — is the critical cost control. After the initial run, only chunks whose underlying doc has changed get re-embedded. A typical hourly refresh re-embeds 0–20 chunks instead of all 5,000.
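One way to keep an eye on that gate: because CURRENT_TIMESTAMP() is fixed for the duration of a query, `_chunked_at` doubles as a run id, so per-run embed counts are one GROUP BY away. A sketch against the table above:

```sql
-- How many chunks did recent runs actually re-embed?
-- Healthy hourly refreshes show small counts after the initial backfill.
SELECT
  _chunked_at AS run_at,
  COUNT(*) AS chunks_embedded,
  COUNT(DISTINCT doc_path) AS docs_touched
FROM gold.docs_chunks
GROUP BY run_at
ORDER BY run_at DESC
LIMIT 10;
```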
Dataform scheduling glues this together — both models tagged capstone:docs, both running hourly, with the embedding model itself versioned as a separate Dataform model:
# workflow_settings.yaml
schedules:
  - name: capstone_docs
    cron: "10 * * * *"
    target:
      tags: ["capstone:docs"]
    notifications:
      onFailure: ["data-alerts@myco.com"]

Why It Matters
- Chunking is a product decision, not an implementation detail. Chunks too small → no context; too large → low retrieval precision. ~500 tokens with ~50 overlap is a defensible default.
- Incremental embedding is the only way the bill scales. Re-embedding every chunk every hour is technically possible and financially insane. Hash-gated incremental is the production pattern.
- The gold table is the contract. Tomorrow's retrieval tool reads from gold.docs_chunks; its shape is the agent's API (a sketch of that consumer query follows this list).
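To make the contract concrete, here is a sketch of the consumer query. The question text is hypothetical; the model and table names are the ones used in this lesson, and it assumes the vector index from Key Technical Details already exists:

```sql
-- Embed the user's question with the same model, then fetch nearest chunks
SELECT
  base.doc_path,
  base.title,
  base.chunk_text,
  distance
FROM VECTOR_SEARCH(
  TABLE gold.docs_chunks,
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `embedding_models.text_embedding_005`,
      (SELECT "How do I rotate an API key?" AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
);
```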
Key Technical Details
- This naive chunker splits on character boundaries; production-grade chunkers split on paragraphs or sentences (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter). Worth upgrading once the rest works.
- text-embedding-005 produces 768-dim vectors and costs ~$0.20 for the initial run, then near-zero per hourly refresh.
- BigQuery's VECTOR_SEARCH requires a vector index for ANN; create it after gold is populated: CREATE VECTOR INDEX docs_chunks_idx ON gold.docs_chunks(embedding) (a fuller statement is sketched after this list).
- Capture the embedding model version in the gold table so re-embedding can be triggered when you upgrade. Without it, mixed-vintage embeddings degrade search quality silently.
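The last two bullets are short in practice. A sketch, assuming the names used in this lesson; index_type and distance_type are choices to confirm against your workload:

```sql
-- ANN index: BigQuery requires an index_type; IVF is a common choice
CREATE VECTOR INDEX IF NOT EXISTS docs_chunks_idx
ON gold.docs_chunks(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

-- And in gold_docs_chunks.sqlx, stamp each row with the model version:
--   "text-embedding-005" AS embedding_model
-- so an upgrade can re-embed only rows whose embedding_model is stale.
```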
Common Misconceptions
"Chunking is solved." It's not — chunk strategy meaningfully affects retrieval quality, and the right strategy is domain-specific. Treat the chunker as a hyperparameter you'll iterate on.
"Re-embed everything on schema change." Often you don't need to. If only the metadata schema changed (new tag column), keep existing embeddings. Re-embed only on changes to the text being embedded.
"Bigger chunks are better." Bigger chunks mean each retrieved result carries more context — but also more noise, and you can fit fewer of them in the agent's context budget. Test on real queries; don't pick on intuition alone.
Connections to Other Concepts
- Course 05-serving-data-to-agents/02-semantic-retrieval-embeddings-and-vector-search.md — The semantic-retrieval theory behind this lesson's practice.
- Course 04-refinement-in-bigquery/04-incremental-and-idempotent-pipelines.md — Where the incremental embedding pattern comes from.
- 04-step-three-wiring-the-retrieval-tool.md — Where this gold table gets consumed.
Further Reading
- LlamaIndex docs on chunking strategies — Sentence-aware, semantic, and hierarchical chunkers explained.
- "Lost in the Middle" (Liu et al., 2023) — Why chunk placement in the agent's prompt matters as much as the chunk itself.
- Vertex AI's "text-embedding-005" model card — Detailed perf characteristics.