One-Line Summary: CDC reads a source database's transaction log directly, turning every insert/update/delete into a stream of events the warehouse can mirror in near-real-time.
Prerequisites: Lesson 02-event-streams-with-pub-sub.md.
What's the Concept?
When your source of truth is a production database — Postgres holding customer accounts, MySQL holding orders — you have two ways to bring it into the lake. You can issue periodic SELECT * FROM customers WHERE updated_at > ? queries (batch). Or you can read the database's own change log and replay every row mutation as it happens (CDC).
CDC is the production answer when you need near-real-time data in the warehouse without putting query load on the OLTP database. The database is already writing every change to a write-ahead log for its own crash recovery. CDC tools tail that log and republish the events.
How It Works
On GCP, the canonical CDC service is Datastream. It's a managed agent that:
- Connects to your Postgres / MySQL / Oracle source via a private connection.
- Reads the WAL (write-ahead log) — the `pgoutput` plugin for Postgres, the binlog for MySQL.
- Publishes change events to a destination, typically GCS bronze or BigQuery directly.
```
Postgres (production)        Datastream            BigQuery
─────────────────────        ──────────            ────────
COMMIT ──── WAL ────▶ tail ────▶ events ──▶ bronze.orders_raw
                                            (one row per change)
                                                   │
                                                   │ silver step
                                                   ▼
                                            silver.orders
                                            (one row per order,
                                             latest version)
```

Each CDC event has at least:
- An operation: `INSERT`, `UPDATE`, or `DELETE`.
- A primary key.
- The new row state (and often the previous row state).
- A commit timestamp (the LSN/binlog position is the authoritative ordering).
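Concretely, one change event can be pictured as a small record like the following. This is a sketch only — the field names (`op`, `before`, `after`, `commit_ts`) are illustrative, not Datastream's exact payload schema:

```python
# A hypothetical CDC change event for an UPDATE to an orders row.
# Field names are illustrative, not Datastream's exact schema.
change_event = {
    "op": "UPDATE",                    # INSERT / UPDATE / DELETE
    "table": "public.orders",
    "key": {"id": 42},                 # primary key of the changed row
    "before": {"id": 42, "status": "pending", "total": 99.50},
    "after":  {"id": 42, "status": "shipped", "total": 99.50},
    "commit_ts": "2024-05-01T12:00:03Z",
    "lsn": "0/16B374D8",               # authoritative ordering position
}
```

A delete event would carry `op="DELETE"`, the key, and typically only the `before` state.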
The bronze table is append-only and contains every change. The silver table collapses the change history into the current state per entity, using `ROW_NUMBER() OVER (PARTITION BY id ORDER BY commit_ts DESC)` or a similar merge pattern.
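The bronze-to-silver collapse can be sketched in Python. This is a minimal illustration of the pattern, assuming a simplified event shape of `(commit_ts, op, key, row)` tuples — in practice this step is a BigQuery `MERGE` or the `ROW_NUMBER()` query above:

```python
def collapse_to_current(events):
    """Collapse an append-only CDC change log into current state per key.

    Later commit_ts wins, and a DELETE removes the key entirely —
    mirroring the ROW_NUMBER() ... ORDER BY commit_ts DESC pattern.
    """
    current = {}
    for commit_ts, op, key, row in sorted(events, key=lambda e: e[0]):
        if op == "DELETE":
            current.pop(key, None)   # deletes drop the entity from silver
        else:                        # INSERT or UPDATE: keep latest state
            current[key] = row
    return current

# Hypothetical bronze change log for two orders.
bronze = [
    (1, "INSERT", 42, {"id": 42, "status": "pending"}),
    (2, "UPDATE", 42, {"id": 42, "status": "shipped"}),
    (1, "INSERT", 7,  {"id": 7, "status": "pending"}),
    (3, "DELETE", 7,  None),
]
silver = collapse_to_current(bronze)
# → {42: {"id": 42, "status": "shipped"}}
```

Note that order 7 vanishes from silver entirely — exactly the delete case a batch `SELECT *` snapshot could never observe.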
Why It Matters
- Zero load on the source database. Reading the WAL is something the DB does anyway for replication. CDC piggybacks on that infrastructure.
- Captures deletes, which batch can't. A `SELECT *` query has no way to notice a row that's been deleted. CDC explicitly emits delete events.
- Sub-minute freshness, end-to-end. Datastream + BigQuery typically lands changes within 30–60 seconds of commit.
- Audit trail comes free. Because bronze is append-only, you have a complete history of every row mutation — useful for debugging and for time-travel queries the agent might need.
Key Technical Details
- Postgres requires `wal_level=logical` and at least one replication slot. The slot retains WAL until it's consumed — if Datastream goes down, the source's disk usage grows. Monitor this.
- MySQL needs `binlog_format=ROW` and at least 7 days of binlog retention. Otherwise a Datastream pause can lose data.
- DDL changes (schema migrations) need careful handling. Datastream can detect column adds; column drops and type changes usually break the stream and need manual recovery.
- Cost model is per-GB processed. For a moderately busy OLTP database, expect $50–200/month.
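To make the slot-retention risk concrete: Postgres reports positions as `X/Y` LSN strings (high and low 32 bits, both hex), and the retained WAL is the byte distance between the server's current LSN and the slot's `restart_lsn`. A minimal sketch of that arithmetic — the LSN values here are made up; in production you would read them from `pg_replication_slots`:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte offset.

    The part before the slash is the high 32 bits, the part after
    is the low 32 bits, both hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def retained_wal_bytes(current_lsn: str, slot_restart_lsn: str) -> int:
    """WAL the server must keep on disk for a lagging replication slot."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(slot_restart_lsn)

# Hypothetical values: if Datastream stalls, this number keeps growing.
lag = retained_wal_bytes("16/B374D848", "16/B0000000")
print(lag // (1024 * 1024), "MiB retained")   # → 55 MiB retained
```

Alerting when this number climbs past a threshold is the "monitor this" from the bullet above.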
Common Misconceptions
"CDC means I don't need batch." You still might. CDC is great for tables that change frequently; for slow-changing reference data, a nightly batch is simpler and fresh enough.
"CDC is real-time replication." It's near-real-time. There's a small lag from commit to lake — usually under a minute — and that lag can spike during high write volume or replication-slot back-pressure.
"CDC will replace my warehouse ETL." No — CDC fills bronze. Silver and gold modeling are still entirely your job; CDC just makes the input fresher.
Connections to Other Concepts
- Course `04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md` — how the CDC stream gets merged into a usable silver table.
- `02-event-streams-with-pub-sub.md` — Datastream itself runs over Pub/Sub semantics; understanding the bus helps debug CDC issues.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring.md` — replication lag is a first-class metric.
Further Reading
- Google Cloud, "Datastream overview" docs.
- Martin Kleppmann, "Designing Data-Intensive Applications" (O'Reilly) — chapter 11 on log-based replication is the best primer on what's happening inside Datastream.
- Debezium documentation — the open-source equivalent; reading their docs gives a clearer mental model of CDC mechanics.