One-Line Summary: The bronze → silver step is where shapeless raw payloads become typed, deduplicated, one-row-per-thing tables you can actually rely on.
Prerequisites: Module 03.
What's the Concept?
Bronze data is honest but useless. JSON blobs, mixed types, occasional nulls where you didn't expect them, duplicates from at-least-once delivery. Silver is the layer where you decide: "for this entity, there's exactly one row per ID, the columns mean what they say, and we vouch for it."
This is the unglamorous core of data engineering. Most production-impact bugs come from a missed deduplication, a wrong type coercion, or a join that quietly drops 3% of rows. The bronze → silver step is where rigor pays.
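To make that concrete: two bronze rows for the same logical order, produced by at-least-once delivery, the second carrying a string where a number belongs (illustrative payloads, not real Stripe events):

```
{"id": "ord_1", "customer": "cus_9", "amount": 1250,    "created": 1700000000}
{"id": "ord_1", "customer": "cus_9", "amount": "12.50", "created": 1700000000}
```

Silver's job is to collapse these into exactly one correctly typed row for `ord_1`.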
How It Works
A bronze → silver BigQuery query, end-to-end:

```sql
-- silver.orders, built from bronze.stripe_charges_raw
CREATE OR REPLACE TABLE `myco.silver.orders`
PARTITION BY DATE(created_at)
CLUSTER BY customer_id
AS
WITH parsed AS (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer') AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    JSON_VALUE(payload, '$.currency') AS currency,
    JSON_VALUE(payload, '$.status') AS status,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at,
    _source_file
  FROM `myco.bronze.stripe_charges_raw`
  WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- incremental window
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
),
deduped AS (
  SELECT *
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) AS rn
    FROM parsed
  )
  WHERE rn = 1
)
SELECT
  order_id,
  customer_id,
  amount_cents,
  currency,
  status,
  created_at,
  _ingested_at,
  _source_file
FROM deduped;
```

Three responsibilities visible in that query:
- Parsing + typing. `JSON_VALUE` + `SAFE_CAST` — the `SAFE_` family returns NULL on cast failure instead of throwing, so bad rows degrade gracefully. Bad rows can be counted separately and alerted on.
- Deduplication. `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)` plus `WHERE rn = 1` collapses multiple bronze records of the same logical entity into the freshest one. This pattern works for both naturally-duplicate data and CDC change streams.
- Provenance. The `_ingested_at` and `_source_file` columns let any silver row be traced back to its bronze origin. Cheap, invaluable.
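The "count bad rows separately" idea can be a standalone check query. A minimal sketch against the example bronze table (the threshold and alert wiring are up to you):

```sql
-- Rows where a raw amount is present but fails the INT64 cast.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(
    JSON_VALUE(payload, '$.amount') IS NOT NULL
    AND SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) IS NULL
  ) AS unparseable_amount_rows
FROM `myco.bronze.stripe_charges_raw`
WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```

If `unparseable_amount_rows` crosses a threshold you choose, fail the pipeline run instead of silently shipping NULLs downstream.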
Why It Matters
- Silver is what every downstream consumer trusts. Gold tables read from it. Analysts query it. Agents call tools that wrap it. If silver is wrong, everything downstream is wrong.
- Type safety propagates. A `SAFE_CAST` failure surfaces as NULL, which silver-layer tests catch. Without it, the same bad data would land in gold and the agent would see "NaN" or "12.50" as a string.
- Deduplication is non-negotiable. At-least-once ingestion guarantees duplicates. Silver is where you commit to a single source of truth per entity.
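The difference is easy to see in isolation. In BigQuery, plain `CAST` aborts the whole query on one bad value, while `SAFE_CAST` degrades that value to NULL:

```sql
SELECT
  SAFE_CAST('1250' AS INT64)  AS good_value,  -- 1250
  SAFE_CAST('12.50' AS INT64) AS bad_value;   -- NULL: "12.50" is not a valid INT64;
                                              -- plain CAST would raise an error here
```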
Key Technical Details
- Always partition silver tables by a meaningful date column (usually `created_at` or `event_date`). Partitioning is what makes queries cheap as the table grows.
- Cluster on the most common filter dimension (usually entity ID or user ID). Clustering is free in BigQuery and dramatically reduces scan cost.
- Use `CREATE OR REPLACE TABLE` for full rebuilds when the table is small (<100M rows). For larger tables, use `MERGE` for incremental updates (see lesson `04-incremental-and-idempotent-pipelines.md`).
- Capture a row count per partition into a small `_pipeline_metrics` table after each silver run. This is your first line of defense for data quality monitoring.
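That metrics capture can be one small statement at the end of each silver run. A sketch, assuming a hypothetical `myco.ops._pipeline_metrics` table with columns `(table_name STRING, partition_date DATE, row_count INT64, recorded_at TIMESTAMP)`:

```sql
INSERT INTO `myco.ops._pipeline_metrics` (table_name, partition_date, row_count, recorded_at)
SELECT
  'silver.orders',
  DATE(created_at),
  COUNT(*),
  CURRENT_TIMESTAMP()
FROM `myco.silver.orders`
WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY DATE(created_at);
```

A sudden drop in `row_count` for a recent partition is usually the first visible symptom of an upstream outage.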
Common Misconceptions
"Just SELECT * from the bronze view." That works for one source and one table. The minute you have ten sources, you'll wish you'd codified the cleaning logic explicitly.
"Deduplication can wait until gold." Every gold computation runs on top of silver. If silver has duplicates, all gold computations are wrong. Dedupe at silver, once, in one place.
"Use UPSERT/MERGE for everything." `CREATE OR REPLACE` is fine until tables get large. Most silver tables can be fully rebuilt in seconds; the simplicity is worth the cost.
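At scale, the incremental alternative looks roughly like this — a sketch keyed on `order_id`, assuming new rows have already been parsed and deduplicated into a hypothetical `myco.silver.orders_staging` table (the full pattern is in lesson `04-incremental-and-idempotent-pipelines.md`):

```sql
MERGE `myco.silver.orders` AS t
USING `myco.silver.orders_staging` AS s
ON t.order_id = s.order_id
-- Only overwrite when the incoming record is fresher.
WHEN MATCHED AND s._ingested_at > t._ingested_at THEN
  UPDATE SET status = s.status, amount_cents = s.amount_cents, _ingested_at = s._ingested_at
WHEN NOT MATCHED THEN
  INSERT ROW;
```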
Connections to Other Concepts
- `02-silver-to-gold-modeling-for-agents.md` — The next layer up.
- `03-dbt-for-versioned-transforms.md` — Putting these queries under version control.
- `04-incremental-and-idempotent-pipelines.md` — When full rebuilds get too expensive.
Further Reading
- Google Cloud, "Common SQL patterns for BigQuery" — Reference for `JSON_VALUE`, `SAFE_CAST`, window functions.
- dbt project documentation — The "staging models" concept in dbt is exactly silver-layer modeling.
- Maxime Beauchemin, "Functional Data Engineering" — Why idempotent, immutable transformations are the durable pattern.