One-Line Summary: The bronze → silver step is where shapeless raw payloads become typed, deduplicated, one-row-per-thing tables you can actually rely on.
Prerequisites: Module 03.
What's the Concept?
Bronze data is honest but useless. JSON blobs, mixed types, occasional nulls where you didn't expect them, duplicates from at-least-once delivery. Silver is the layer where you decide: "for this entity, there's exactly one row per ID, the columns mean what they say, and we vouch for it."
This is the unglamorous core of data engineering. Most production-impact bugs come from a missed deduplication, a wrong type coercion, or a join that quietly drops 3% of rows. The bronze → silver step is where rigor pays.
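To make that concrete: two bronze rows for the same logical order, produced by at-least-once delivery, the second carrying a string where a number belongs (illustrative payloads, not real Stripe events):

```
{"id": "ord_1", "customer": "cus_9", "amount": 1250,    "created": 1700000000}
{"id": "ord_1", "customer": "cus_9", "amount": "12.50", "created": 1700000000}
```

Silver's job is to collapse these into exactly one correctly typed row for `ord_1`.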
How It Works
A bronze → silver BigQuery query, end-to-end:

```sql
-- silver.orders, built from bronze.stripe_charges_raw
CREATE OR REPLACE TABLE `myco.silver.orders`
PARTITION BY DATE(created_at)
CLUSTER BY customer_id
AS
WITH parsed AS (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer') AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    JSON_VALUE(payload, '$.currency') AS currency,
    JSON_VALUE(payload, '$.status') AS status,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at,
    _source_file
  FROM `myco.bronze.stripe_charges_raw`
  WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- incremental window
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
),
deduped AS (
  SELECT *
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) AS rn
    FROM parsed
  )
  WHERE rn = 1
)
SELECT
  order_id,
  customer_id,
  amount_cents,
  currency,
  status,
  created_at,
  _ingested_at,
  _source_file
FROM deduped;
```

Three responsibilities visible in that query:
- Parsing + typing. `JSON_VALUE` + `SAFE_CAST` — the `SAFE_` family returns NULL on cast failure instead of throwing, so bad rows degrade gracefully. Bad rows can be counted separately and alerted on.
- Deduplication. `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)` plus `WHERE rn = 1` collapses multiple bronze records of the same logical entity into the freshest one. This pattern works for both naturally-duplicate data and CDC change streams.
- Provenance. The `_ingested_at` and `_source_file` columns let any silver row be traced back to its bronze origin. Cheap, invaluable.
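The "count bad rows separately" idea can be a standalone check query. A minimal sketch against the example bronze table (the threshold and alert wiring are up to you):

```sql
-- Rows where a raw amount is present but fails the INT64 cast.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(
    JSON_VALUE(payload, '$.amount') IS NOT NULL
    AND SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) IS NULL
  ) AS unparseable_amount_rows
FROM `myco.bronze.stripe_charges_raw`
WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```

If `unparseable_amount_rows` crosses a threshold you choose, fail the pipeline run instead of silently shipping NULLs downstream.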
Why It Matters
- Silver is what every downstream consumer trusts. Gold tables read from it. Analysts query it. Agents call tools that wrap it. If silver is wrong, everything downstream is wrong.
- Type safety propagates. A `SAFE_CAST` failure surfaces as NULL, which silver-layer tests catch. Without it, the same bad data would land in gold and the agent would see "NaN" or "12.50" as a string.
- Deduplication is non-negotiable. At-least-once ingestion guarantees duplicates. Silver is where you commit to a single source of truth per entity.
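The difference is easy to see in isolation. In BigQuery, plain `CAST` aborts the whole query on one bad value, while `SAFE_CAST` degrades that value to NULL:

```sql
SELECT
  SAFE_CAST('1250' AS INT64)  AS good_value,  -- 1250
  SAFE_CAST('12.50' AS INT64) AS bad_value;   -- NULL: "12.50" is not a valid INT64;
                                              -- plain CAST would raise an error here
```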
Key Technical Details
- Always partition silver tables by a meaningful date column (usually `created_at` or `event_date`). Partitioning is what makes queries cheap as the table grows.
- Cluster on the most common filter dimension (usually entity ID or user ID). Clustering is free in BigQuery and dramatically reduces scan cost.
- Use `CREATE OR REPLACE TABLE` for full rebuilds when the table is small (<100M rows). For larger tables, use `MERGE` for incremental updates (see lesson `04-incremental-and-idempotent-pipelines.md`).
- Capture a row count per partition into a small `_pipeline_metrics` table after each silver run. This is your first line of defense for data quality monitoring.
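That metrics capture can be one small statement at the end of each silver run. A sketch, assuming a hypothetical `myco.ops._pipeline_metrics` table with columns `(table_name STRING, partition_date DATE, row_count INT64, recorded_at TIMESTAMP)`:

```sql
INSERT INTO `myco.ops._pipeline_metrics` (table_name, partition_date, row_count, recorded_at)
SELECT
  'silver.orders',
  DATE(created_at),
  COUNT(*),
  CURRENT_TIMESTAMP()
FROM `myco.silver.orders`
WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY DATE(created_at);
```

A sudden drop in `row_count` for a recent partition is usually the first visible symptom of an upstream outage.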
Common Misconceptions
"Just SELECT * from the bronze view." That works for one source and one table. The minute you have ten sources, you'll wish you'd codified the cleaning logic explicitly.
"Deduplication can wait until gold." Every gold computation runs on top of silver. If silver has duplicates, all gold computations are wrong. Dedupe at silver, once, in one place.
"Use UPSERT/MERGE for everything." `CREATE OR REPLACE` is fine until tables get large. Most silver tables can be fully rebuilt in seconds; the simplicity is worth the cost.
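At scale, the incremental alternative looks roughly like this — a sketch keyed on `order_id`, assuming new rows have already been parsed and deduplicated into a hypothetical `myco.silver.orders_staging` table (the full pattern is in lesson `04-incremental-and-idempotent-pipelines.md`):

```sql
MERGE `myco.silver.orders` AS t
USING `myco.silver.orders_staging` AS s
ON t.order_id = s.order_id
-- Only overwrite when the incoming record is fresher.
WHEN MATCHED AND s._ingested_at > t._ingested_at THEN
  UPDATE SET status = s.status, amount_cents = s.amount_cents, _ingested_at = s._ingested_at
WHEN NOT MATCHED THEN
  INSERT ROW;
```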
Connections to Other Concepts
- `02-silver-to-gold-modeling-for-agents.md` — The next layer up.
- `03-dbt-for-versioned-transforms.md` — Putting these queries under version control.
- `04-incremental-and-idempotent-pipelines.md` — When full rebuilds get too expensive.
Further Reading
- Google Cloud, "Common SQL patterns for BigQuery" — Reference for `JSON_VALUE`, `SAFE_CAST`, window functions.
- dbt project documentation — The "staging models" concept in dbt is exactly silver-layer modeling.
- Maxime Beauchemin, "Functional Data Engineering" — Why idempotent, immutable transformations are the durable pattern.