One-Line Summary: The choice of when to enforce a schema — at ingest time or at query time — sets the rigidity-versus-flexibility tradeoff for your entire pipeline.
Prerequisites: Lesson 02-bucket-layouts-and-partitioning.md.
What's the Concept?
Every byte of data eventually gets a schema applied to it before anything useful happens. The strategic question is when — and where — that schema gets enforced.
Schema-on-write means the schema is enforced at the moment data is ingested. If a row doesn't match, it's rejected. The classic data warehouse uses schema-on-write: you CREATE TABLE customers (id INT, ...) and any INSERT that doesn't fit fails.
Schema-on-read means the data is stored as-is and the schema is applied lazily, when something queries it. The classic data lake uses schema-on-read: dump JSON into a bucket; figure out columns later.
The practical pattern, for both lake and warehouse on GCP: schema-on-read at bronze, schema-on-write at silver. Bronze stays flexible; silver enforces.
How It Works
At the bronze layer in GCS:
- Files are JSONL, Parquet, or CSV — whatever the source produced.
- No type validation. A field that was a string yesterday and a number today both land successfully.
- An external-table view in BigQuery can read the files for ad-hoc inspection, but the table definition is generated, not authoritative.
- If a new field appears in the source payload, it just shows up in tomorrow's files. Bronze doesn't care.
At the silver layer in BigQuery:
- The table has an explicit
CREATE TABLEwith typed columns. - The bronze → silver transformation has to cast, coerce, drop, or default every field. Bad data triggers an alert; well-formed data lands.
- If the source adds a new field, the bronze→silver code stays unchanged until you decide to surface that field. It's a choice, not an automatic propagation.
- Schema changes here are reviewed, versioned, deployed deliberately.
bronze (schema-on-read) silver (schema-on-write)
──────────────────────── ─────────────────────────
{ id: 42, id INT64
amount: "12.50", amount_cents INT64
status: null, ───▶ status STRING
fee_breakdown: {...}, fee_total NUMERIC
...30 more fields... } updated_at TIMESTAMP
(5 columns; the other 30
stay in bronze for now)Why It Matters
- Bronze can keep ingesting through schema drift. When a vendor adds a new field at 2am, your pipeline doesn't break. You notice in the morning, decide if you want it, propagate to silver during business hours.
- Silver is the contract. Once a column is in silver, downstream agents and pipelines depend on it. You owe stability there. Bronze owes nothing.
- Debugging stays cheap. When silver looks wrong, you can re-query the same row in bronze to see what was actually received. Schema-on-write would have rejected or coerced before you could see.
Key Technical Details
- BigQuery's
EXTERNAL TABLEover GCS supports schema auto-detection (AUTO_DETECT) which is the schema-on-read interface — convenient for inspection, not for production queries. - BigQuery's
LOAD DATAinto a native table is the schema-on-write step; you can pass--source_format=NEWLINE_DELIMITED_JSON --schema=schema.jsonor rely on auto-detection if you're feeling brave. - For long-lived silver tables, define the schema in code (Terraform / Dataform / dbt YAML), not via console. Schema is part of the pipeline source code.
- Tracking "what's in bronze but not in silver" — sometimes called the dark schema — is a useful exercise quarterly. It's where new product opportunities hide.
Common Misconceptions
"Schema-on-read is for sloppy teams." Schema-on-read at the lake is what makes the whole system robust to upstream change. Sloppy is letting upstream changes break your warehouse silently. The pattern above is rigorous about where rigidity lives.
"Just use Avro/Protobuf for everything." Strongly-typed serialization at ingest is a fine choice when you control the producer. For third-party APIs and webhooks, you don't control the producer; JSONL with schema-on-read is the realistic answer.
Connections to Other Concepts
- Course
04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md— The transformation that enforces the silver schema. 04-data-governance-from-day-one.md— How schema decisions interact with governance.- Course
07-operating-the-system/01-observability-and-data-quality-monitoring.md— How you watch for schema drift in production.
Further Reading
- Martin Kleppmann, "Designing Data-Intensive Applications" — chapter 4 on schemas in data systems is the definitive primer.
- Google Cloud, "Schema auto-detection in BigQuery" docs.
- "Schema-on-Read vs Schema-on-Write: A Pragmatic Take" — Confluent blog post that captures the modern-stack consensus.