One-Line Summary: A three-tier lake-and-warehouse layout — bronze for raw, silver for cleaned, gold for serve-ready — that keeps refinement steps inspectable and reversible.

Prerequisites: Lesson 01-the-agent-data-flywheel.md.

What's the Concept?

The medallion pattern is a naming convention for what every mature data pipeline ends up doing anyway: keep the raw source separate from the cleaned form, and the cleaned form separate from the product. The colors are mnemonic — bronze is the least valuable but the closest to truth; gold is the most valuable but the most opinionated.

It's worth committing to the names because they become shared vocabulary on the team. "Is that a bronze issue or a silver issue?" instantly tells you whether the bug is in ingestion or transformation.

How It Works

                ┌─────────────┐
   external →   │   BRONZE    │   raw, as-received, immutable
                │   (GCS)     │   no schema enforcement
                └──────┬──────┘
                       │  parse, validate, conform types
                       ▼
                ┌─────────────┐
                │   SILVER    │   one-row-per-thing, deduped
                │  (BigQuery) │   conformed schema, lossless
                └──────┬──────┘
                       │  aggregate, join, summarize, embed
                       ▼
                ┌─────────────┐
                │    GOLD     │   agent-ready, lossy on purpose
                │  (BigQuery) │   one table per use case
                └─────────────┘

Bronze is your tape backup of the universe. JSON blobs, CSV files, webhook payloads — landed in GCS exactly as received, with a partition key (usually ingestion_date=YYYY-MM-DD) and a source identifier. The discipline: never overwrite, never edit. If a downstream layer breaks, you replay from bronze.
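
The landing rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed layout: the path scheme (source=.../ingestion_date=.../), the content-hash filename, and the function names are all assumptions, and a plain dict stands in for the GCS bucket so the sketch runs without credentials.

```python
import hashlib
from datetime import datetime, timezone


def bronze_object_path(source: str, payload: bytes, received_at: datetime) -> str:
    """Build a partitioned object path for one raw payload (hypothetical layout).

    received_at is expected to be timezone-aware; it is normalized to UTC.
    """
    ingestion_date = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Naming the object after its content hash means a redelivered payload maps
    # to the same path, and a different payload is very unlikely to collide.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"source={source}/ingestion_date={ingestion_date}/{digest}.raw"


def land_in_bronze(bucket: dict, source: str, payload: bytes, received_at: datetime) -> str:
    """Store the payload exactly as received; never replace an existing object."""
    path = bronze_object_path(source, payload, received_at)
    if path not in bucket:  # never overwrite, never edit
        bucket[path] = payload
    return path
```

Landing the same payload twice leaves a single object in the bucket, which keeps replays idempotent. Against real GCS, one way to get the same create-only behavior with the google-cloud-storage client is to pass if_generation_match=0 on upload, so the write fails if the object already exists.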

Silver is the layer where you've decided what the data is. Types are correct, duplicates collapsed, foreign keys consistent, columns named the way your team agreed. A silver table answers: "what is one row, and what does each column mean?" One silver table per source entity is the usual carve-up — a customers table, an orders table, a support_tickets table.
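
As a rough sketch of that bronze-to-silver step, the code below parses raw payloads, conforms types, and collapses duplicates to one row per order. The field names (id, customer, amount, updated_at), the decimal-string amount, and the "latest updated_at wins" dedupe rule are assumptions made for illustration; a real source will differ.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SilverOrder:
    order_id: str
    customer_id: str
    amount_cents: int
    updated_at: datetime


def parse_order(raw: bytes) -> Optional[SilverOrder]:
    """Parse one bronze payload; return None if it cannot be conformed."""
    try:
        doc = json.loads(raw)
        return SilverOrder(
            order_id=str(doc["id"]),
            customer_id=str(doc["customer"]),
            # Conform the type: the hypothetical source sends decimal strings.
            amount_cents=int(Decimal(str(doc["amount"])) * 100),
            updated_at=datetime.fromisoformat(doc["updated_at"]),
        )
    except (KeyError, TypeError, ValueError, ArithmeticError):
        return None


def to_silver(raw_payloads: Iterable[bytes]) -> List[SilverOrder]:
    """Collapse bronze payloads to one row per order_id, keeping the latest version."""
    latest: Dict[str, SilverOrder] = {}
    for raw in raw_payloads:
        row = parse_order(raw)
        if row is None:
            continue  # the rejected payload is still in bronze, so nothing is lost
        current = latest.get(row.order_id)
        if current is None or row.updated_at > current.updated_at:
            latest[row.order_id] = row
    return sorted(latest.values(), key=lambda r: r.order_id)
```

Because rejected payloads remain in bronze, a later fix to parse_order can be applied by re-running to_silver over the same inputs.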

Gold is the layer where you've decided what the data is for. A gold table is opinionated: it joins silver tables together, aggregates them, computes embeddings, drops columns the agent doesn't need. There's usually one gold table per agent use case. A "billing agent" gets a gold_billing_agent_context table; a "product agent" gets a different one. Don't try to make a universal gold layer — that's a silver table.
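
A silver-to-gold step for the billing example might look like the sketch below. The column names (plan, signup_ip, amount_cents) and the function name are hypothetical, silver rows are represented as plain dicts, and embedding computation is left out to keep the example self-contained.

```python
from collections import defaultdict
from typing import Dict, List


def build_gold_billing_context(customers: List[dict], orders: List[dict]) -> List[dict]:
    """Join and aggregate silver rows into one row per customer for a billing agent."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for order in orders:
        totals[order["customer_id"]] += order["amount_cents"]
        counts[order["customer_id"]] += 1

    gold = []
    for customer in customers:
        cid = customer["customer_id"]
        # Only the columns the billing agent needs are carried forward;
        # everything else on the silver row (here, signup_ip) is dropped on purpose.
        gold.append({
            "customer_id": cid,
            "plan": customer["plan"],
            "order_count": counts[cid],
            "lifetime_spend_cents": totals[cid],
        })
    return gold
```

The output is lossy by design: individual orders are no longer recoverable from it, which is acceptable because they still live in silver.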

Why It Matters

  • Recovery becomes cheap. A bad transformation in gold is a non-event when you can re-derive it from silver in minutes. A bad transformation in silver is a non-event when you can re-derive it from bronze.
  • Refinement steps become reviewable. Each transition is a pull request: bronze → silver is a parsing logic change; silver → gold is a business logic change.
  • Reasoning about freshness gets easier. Bronze freshness is "how often do we ingest?"; silver freshness is "how often do we transform?"; gold freshness is "how stale can the agent tolerate?"
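
The freshness questions above reduce to one comparison per layer. The sketch below assumes you already record a last-refreshed timestamp for each layer; the function name and the idea of a per-layer tolerance table are illustrative, not part of the pattern itself.

```python
from datetime import datetime, timedelta
from typing import Dict, List

LAYERS = ("bronze", "silver", "gold")


def stale_layers(
    last_refreshed: Dict[str, datetime],
    max_age: Dict[str, timedelta],
    now: datetime,
) -> List[str]:
    """Return the layers whose last refresh is older than their allowed age."""
    return [
        layer for layer in LAYERS
        if now - last_refreshed[layer] > max_age[layer]
    ]
```

Since each layer is derived from the one before it, the first stale layer in the returned list is usually the place to start looking.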

Key Technical Details

  • Bronze data should be retained for at least 90 days, often longer for compliance; GCS lifecycle rules can auto-archive to colder storage.
  • Silver tables in BigQuery should be partitioned (usually by event date) and clustered (usually by entity ID). Neither feature carries an extra charge, and both can sharply reduce scan cost for queries that filter on those columns.
  • A single gold table rarely needs more than 50 columns. If yours has 200, it's probably a silver table wearing a gold name tag.
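
The three details above can be written down as configuration. The sketch below is an assumption-laden illustration: the COLDLINE storage class, the table and column names, and the helper functions are hypothetical, and the lifecycle rule is shown in the shape of a GCS lifecycle configuration, so check which wrapper your tooling expects before applying it.

```python
import json

# Hypothetical policy: move bronze objects to colder storage once they are
# 90 days old. The target storage class is an assumption.
BRONZE_LIFECYCLE = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 90},  # days since object creation
        }
    ]
}


def silver_table_ddl(table: str, columns: dict, partition_ts: str, cluster_by: str) -> str:
    """Render BigQuery DDL for a silver table partitioned by date and clustered by ID."""
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        f"PARTITION BY DATE({partition_ts})\n"
        f"CLUSTER BY {cluster_by}"
    )


def looks_too_wide(column_count: int, limit: int = 50) -> bool:
    """Flag a gold table whose width suggests it is really a silver table."""
    return column_count > limit
```

Calling json.dumps(BRONZE_LIFECYCLE, indent=2) produces a file you could review alongside the DDL in the same pull request; the 50-column limit is the rule of thumb from the list above, not a BigQuery constraint.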

Common Misconceptions

"Bronze is wasted storage." GCS is cheap; bad data decisions are expensive. The bronze layer is your audit trail, and its cost is a rounding error next to the recovery time it saves the first time you need to replay.

"Just one big table is simpler." It's simpler until the day you want to change a transformation rule and the auditor asks for the original payload from six months ago.

Connections to Other Concepts

  • Course 03-the-raw-data-lake/01-cloud-storage-as-a-lake.md — How bronze is laid out in GCS.
  • Course 04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md — The bronze→silver step in detail.
  • Course 04-refinement-in-bigquery/02-silver-to-gold-modeling-for-agents.md — How gold tables are shaped for agent retrieval.

Further Reading

  • Databricks, "Medallion Architecture" docs — The reference description of the pattern.
  • Maxime Beauchemin, "The Rise of the Data Engineer" (2017) — Background on why this pattern emerged.
  • Google Cloud, "Data lake architecture on GCP" — Google's framing of the same idea in their vocabulary.