One-Line Summary: CDC reads a source database's transaction log directly, turning every insert/update/delete into a stream of events the warehouse can mirror in near-real-time.
Prerequisites: Lesson 02-event-streams-with-pub-sub.md.
What's the Concept?
When your source of truth is a production database — Postgres holding customer accounts, MySQL holding orders — you have two ways to bring it into the lake. You can issue periodic SELECT * FROM customers WHERE updated_at > ? queries (batch). Or you can read the database's own change log and replay every row mutation as it happens (CDC).
CDC is the production answer when you need near-real-time data in the warehouse without putting query load on the OLTP database. The database is already writing every change to a write-ahead log for its own crash recovery. CDC tools tail that log and republish the events.
How It Works
On GCP, the canonical CDC service is Datastream. It's a managed agent that:
- Connects to your Postgres / MySQL / Oracle source via a private connection.
- Reads the WAL (write-ahead log) — the `pgoutput` plugin for Postgres, the binlog for MySQL.
- Publishes change events to a destination, typically GCS bronze or BigQuery directly.
```
Postgres (production)        Datastream            BigQuery
─────────────────────        ──────────            ────────
COMMIT ──── WAL ────▶ tail ────▶ events ──▶ bronze.orders_raw
                                            (one row per change)
                                                   │
                                                   │ silver step
                                                   ▼
                                            silver.orders
                                            (one row per order,
                                             latest version)
```

Each CDC event has at least:
- An operation: `INSERT`, `UPDATE`, or `DELETE`.
- A primary key.
- The new row state (and often the previous row state).
- A commit timestamp (the LSN/binlog position is the authoritative ordering).
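Concretely, one change event can be pictured as a small record like the following. This is a sketch only — the field names (`op`, `before`, `after`, `commit_ts`) are illustrative, not Datastream's exact payload schema:

```python
# A hypothetical CDC change event for an UPDATE to an orders row.
# Field names are illustrative, not Datastream's exact schema.
change_event = {
    "op": "UPDATE",                    # INSERT / UPDATE / DELETE
    "table": "public.orders",
    "key": {"id": 42},                 # primary key of the changed row
    "before": {"id": 42, "status": "pending", "total": 99.50},
    "after":  {"id": 42, "status": "shipped", "total": 99.50},
    "commit_ts": "2024-05-01T12:00:03Z",
    "lsn": "0/16B374D8",               # authoritative ordering position
}
```

A delete event would carry `op="DELETE"`, the key, and typically only the `before` state.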
The bronze table is append-only and contains every change. The silver table collapses the change history into the current state per entity, using `ROW_NUMBER() OVER (PARTITION BY id ORDER BY commit_ts DESC)` or a similar merge pattern.
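The bronze-to-silver collapse can be sketched in Python. This is a minimal illustration of the pattern, assuming a simplified event shape of `(commit_ts, op, key, row)` tuples — in practice this step is a BigQuery `MERGE` or the `ROW_NUMBER()` query above:

```python
def collapse_to_current(events):
    """Collapse an append-only CDC change log into current state per key.

    Later commit_ts wins, and a DELETE removes the key entirely —
    mirroring the ROW_NUMBER() ... ORDER BY commit_ts DESC pattern.
    """
    current = {}
    for commit_ts, op, key, row in sorted(events, key=lambda e: e[0]):
        if op == "DELETE":
            current.pop(key, None)   # deletes drop the entity from silver
        else:                        # INSERT or UPDATE: keep latest state
            current[key] = row
    return current

# Hypothetical bronze change log for two orders.
bronze = [
    (1, "INSERT", 42, {"id": 42, "status": "pending"}),
    (2, "UPDATE", 42, {"id": 42, "status": "shipped"}),
    (1, "INSERT", 7,  {"id": 7, "status": "pending"}),
    (3, "DELETE", 7,  None),
]
silver = collapse_to_current(bronze)
# → {42: {"id": 42, "status": "shipped"}}
```

Note that order 7 vanishes from silver entirely — exactly the delete case a batch `SELECT *` snapshot could never observe.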
Why It Matters
- Zero load on the source database. Reading the WAL is something the DB does anyway for replication. CDC piggybacks on that infrastructure.
- Captures deletes, which batch can't. A `SELECT *` query has no way to notice a row that's been deleted. CDC explicitly emits delete events.
- Sub-minute freshness, end-to-end. Datastream + BigQuery typically lands changes within 30–60 seconds of commit.
- Audit trail comes free. Because bronze is append-only, you have a complete history of every row mutation — useful for debugging and for time-travel queries the agent might need.
Key Technical Details
- Postgres requires `wal_level=logical` and at least one replication slot. The slot retains WAL until it's consumed — if Datastream goes down, the source's disk usage grows. Monitor this.
- MySQL needs `binlog_format=ROW` and at least 7 days of binlog retention. Otherwise a Datastream pause can lose data.
- DDL changes (schema migrations) need careful handling. Datastream can detect column adds; column drops and type changes usually break the stream and need manual recovery.
- Cost model is per-GB processed. For a moderately busy OLTP database, expect $50–200/month.
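To make the slot-retention risk concrete: Postgres reports positions as `X/Y` LSN strings (high and low 32 bits, both hex), and the retained WAL is the byte distance between the server's current LSN and the slot's `restart_lsn`. A minimal sketch of that arithmetic — the LSN values here are made up; in production you would read them from `pg_replication_slots`:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte offset.

    The part before the slash is the high 32 bits, the part after
    is the low 32 bits, both hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def retained_wal_bytes(current_lsn: str, slot_restart_lsn: str) -> int:
    """WAL the server must keep on disk for a lagging replication slot."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(slot_restart_lsn)

# Hypothetical values: if Datastream stalls, this number keeps growing.
lag = retained_wal_bytes("16/B374D848", "16/B0000000")
print(lag // (1024 * 1024), "MiB retained")   # → 55 MiB retained
```

Alerting when this number climbs past a threshold is the "monitor this" from the bullet above.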
Common Misconceptions
"CDC means I don't need batch." You still might. CDC is great for tables that change frequently; for slow-changing reference data, a nightly batch is simpler and fresh enough.
"CDC is real-time replication." It's near-real-time. There's a small lag from commit to lake — usually under a minute — and that lag can spike during high write volume or replication-slot back-pressure.
"CDC will replace my warehouse ETL." No — CDC fills bronze. Silver and gold modeling are still entirely your job; CDC just makes the input fresher.
Connections to Other Concepts
- Course `04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md` — how the CDC stream gets merged into a usable silver table.
- `02-event-streams-with-pub-sub.md` — Datastream itself runs over Pub/Sub semantics; understanding the bus helps debug CDC issues.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring.md` — replication lag is a first-class metric.
Further Reading
- Google Cloud, "Datastream overview" docs.
- Martin Kleppmann, "Designing Data-Intensive Applications" (O'Reilly) — chapter 11 on log-based replication is the best primer on what's happening inside Datastream.
- Debezium documentation — the open-source equivalent; reading their docs gives a clearer mental model of CDC mechanics.