One-Line Summary: Once your tables grow past a hundred million rows, full rebuilds get too expensive — incremental and idempotent pipelines update only what changed, safely re-runnable on any failure.
Prerequisites: Lesson 03-dbt-for-versioned-transforms.md.
What's the Concept?
Two properties matter together:
Incremental — the pipeline only processes the data that's new or changed since the last run, instead of re-scanning the entire source. The win is cost and time: a 1 TB silver table becomes a 10 GB nightly update.
Idempotent — running the same pipeline twice with the same inputs produces the same outputs. No duplicate rows, no orphaned partitions, no "did the previous run finish?" anxiety. If a job fails halfway, you re-run it; the result is the same as if it had succeeded the first time.
These properties travel together because incrementality without idempotency is dangerous. An incremental pipeline that double-inserts on retry is worse than a slow-but-deterministic full rebuild.
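To see the failure mode concretely, here is a minimal sketch using the same bronze and silver tables as the example below: a watermark-filtered append is incremental, but nothing about it makes a retry safe.

```sql
-- Anti-pattern: incremental but NOT idempotent.
-- The first run inserts today's rows; a retry after a mid-run failure
-- inserts them AGAIN, because nothing checks what already landed.
INSERT INTO `myco.silver.orders` (order_id, amount_cents)
SELECT
  JSON_VALUE(payload, '$.id') AS order_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents
FROM `myco.bronze.stripe_charges_raw`
WHERE _ingestion_date = @run_date;  -- incremental, yes; safe to re-run, no
```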
How It Works
The canonical idempotent incremental pattern in BigQuery uses MERGE:
```sql
MERGE `myco.silver.orders` AS target
USING (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer') AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at
  FROM `myco.bronze.stripe_charges_raw`
  -- The watermark window — overlap is intentional
  WHERE _ingestion_date BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
  UPDATE SET
    customer_id = source.customer_id,
    amount_cents = source.amount_cents,
    created_at = source.created_at,
    _ingested_at = source._ingested_at
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount_cents, created_at, _ingested_at)
  VALUES (source.order_id, source.customer_id, source.amount_cents, source.created_at, source._ingested_at);
```

Three properties this query gets right:
- Watermarked. Only reads the last day plus today's bronze partition. Cheap.
- Lossless on overlap. The two-day window guarantees no row falls through the cracks if yesterday's job ran late.
- Idempotent. Running it twice does nothing extra — the `MATCHED` clause checks `_ingested_at`, so older versions never overwrite newer ones.
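One way to check the idempotency claim empirically: run the MERGE twice with the same `@run_date`, then look for duplicate merge keys. A minimal verification query against the same table:

```sql
-- Should return zero rows after any number of re-runs: the merge key
-- is unique by construction, so duplicates mean idempotency was violated.
SELECT order_id, COUNT(*) AS copies
FROM `myco.silver.orders`
GROUP BY order_id
HAVING COUNT(*) > 1;
```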
For dbt users, the equivalent is the incremental materialization:
```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'}
) }}

SELECT ...
FROM {{ source('bronze', 'stripe_charges_raw') }}
{% if is_incremental() %}
WHERE _ingestion_date >= (SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY) FROM {{ this }})
{% endif %}
```

dbt generates the MERGE statement automatically; you only write the SELECT and declare the merge key.
Why It Matters
- Cost stays linear with new data, not total data. A pipeline running on 100 GB/day stays roughly constant in cost whether the table holds 1 TB or 100 TB of history.
- Retries are safe. When orchestration tools (Airflow, Composer) retry a failed task, idempotency means you don't have to manually clean up partial state.
- Backfills compose. "Re-run last week's pipeline" is a parameter substitution, not a special procedure.
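As a concrete sketch of that, the script below re-runs the last seven days as a BigQuery scripting loop. It reuses the idempotent MERGE from earlier, trimmed to two columns for brevity; only the date moves between iterations, and idempotency makes every iteration safe to repeat.

```sql
-- Backfill as parameter substitution: loop the same statement over dates.
DECLARE run_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

WHILE run_date <= CURRENT_DATE() DO
  MERGE `myco.silver.orders` AS target
  USING (
    SELECT
      JSON_VALUE(payload, '$.id') AS order_id,
      _ingestion_timestamp AS _ingested_at
    FROM `myco.bronze.stripe_charges_raw`
    WHERE _ingestion_date BETWEEN DATE_SUB(run_date, INTERVAL 1 DAY) AND run_date
      AND JSON_VALUE(payload, '$.id') IS NOT NULL
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
  ) AS source
  ON target.order_id = source.order_id
  WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
    UPDATE SET _ingested_at = source._ingested_at
  WHEN NOT MATCHED THEN
    INSERT (order_id, _ingested_at) VALUES (source.order_id, source._ingested_at);

  SET run_date = DATE_ADD(run_date, INTERVAL 1 DAY);
END WHILE;
```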
Key Technical Details
- Always overlap your watermark window by at least one full ingestion period. Late-arriving data is real; an exact-edge watermark will lose rows.
- `MERGE` on a partitioned table prunes partitions automatically if you include a constant filter on the partition column in the `ON` clause (see the sketch after this list). Skip that, and `MERGE` scans the whole table — surprise full-table cost.
- For very high-cardinality tables (billions of rows), append-only patterns can outperform `MERGE` — write new rows to an "insert" table, deduplicate at read time. But this is an optimization; reach for `MERGE` first.
- `INSERT ... SELECT` is only idempotent if something deduplicates on the merge key — and BigQuery doesn't enforce unique constraints, so even with a deterministic source query a retry simply appends the same rows again. Don't rely on it casually.
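Here is a hedged sketch of that pruning trick, assuming the target is partitioned on `created_at` as in the dbt config above. One caveat: the constant bound on `target.created_at` must be wide enough to cover every row the source can contain, or old rows falling outside it will be treated as unmatched and re-inserted as duplicates.

```sql
MERGE `myco.silver.orders` AS target
USING (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at
  FROM `myco.bronze.stripe_charges_raw`
  WHERE _ingestion_date BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
) AS source
-- The created_at bound is the pruning predicate: without it, the join
-- key alone forces a scan of every target partition. The 3-day window
-- is a hypothetical choice; it must exceed your late-arrival horizon.
ON target.order_id = source.order_id
   AND target.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
  UPDATE SET created_at = source.created_at, _ingested_at = source._ingested_at
WHEN NOT MATCHED THEN
  INSERT (order_id, created_at, _ingested_at)
  VALUES (source.order_id, source.created_at, source._ingested_at);
```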
Common Misconceptions
"Incremental means we don't need bronze." No — bronze stays the system of record. Incremental silver pipelines read from bronze; if silver gets corrupted, you replay from bronze with a wider window.
"MERGE is slow." MERGE on a partitioned, clustered table is fast and competitive with append-only patterns up to billions of rows. Profile before optimizing prematurely.
"Idempotency is for streaming." It's for everything. Batch failures, orchestrator retries, and human-triggered backfills all rely on the same property.
Connections to Other Concepts
- `01-bronze-to-silver-cleaning-and-conforming.md` — Full rebuilds are the starting point; incremental is the optimization.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md` — Idempotent tasks make orchestration sane.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring.md` — Track row counts per run to catch silent incremental failures.
Further Reading
- Maxime Beauchemin, "Functional Data Engineering" — The foundational essay on idempotency in data pipelines.
- BigQuery docs, "DML statements: MERGE" — Reference for the SQL feature.
- dbt docs, "Incremental models" — How dbt generates merge logic from a declarative config.