One-Line Summary: Once your tables grow past a hundred million rows, full rebuilds get too expensive — incremental and idempotent pipelines update only what changed, safely re-runnable on any failure.
Prerequisites: Lesson 03-dbt-for-versioned-transforms.md.
What's the Concept?
Two properties matter together:
Incremental — the pipeline only processes the data that's new or changed since the last run, instead of re-scanning the entire source. The win is cost and time: a 1 TB silver table becomes a 10 GB nightly update.
Idempotent — running the same pipeline twice with the same inputs produces the same outputs. No duplicate rows, no orphaned partitions, no "did the previous run finish?" anxiety. If a job fails halfway, you re-run it; the result is the same as if it had succeeded the first time.
These properties travel together because incrementality without idempotency is dangerous. An incremental pipeline that double-inserts on retry is worse than a slow-but-deterministic full rebuild.
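To see the failure mode concretely, here is a minimal sketch using the same bronze and silver tables as the example below: a watermark-filtered append is incremental, but nothing about it makes a retry safe.

```sql
-- Anti-pattern: incremental but NOT idempotent.
-- The first run inserts today's rows; a retry after a mid-run failure
-- inserts them AGAIN, because nothing checks what already landed.
INSERT INTO `myco.silver.orders` (order_id, amount_cents)
SELECT
  JSON_VALUE(payload, '$.id') AS order_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents
FROM `myco.bronze.stripe_charges_raw`
WHERE _ingestion_date = @run_date;  -- incremental, yes; safe to re-run, no
```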
How It Works
The canonical idempotent incremental pattern in BigQuery uses MERGE:
```sql
MERGE `myco.silver.orders` AS target
USING (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer') AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at
  FROM `myco.bronze.stripe_charges_raw`
  -- The watermark window — overlap is intentional
  WHERE _ingestion_date BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
  UPDATE SET
    customer_id = source.customer_id,
    amount_cents = source.amount_cents,
    created_at = source.created_at,
    _ingested_at = source._ingested_at
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount_cents, created_at, _ingested_at)
  VALUES (source.order_id, source.customer_id, source.amount_cents, source.created_at, source._ingested_at);
```

Three properties this query gets right:
- Watermarked. Only reads the last day plus today's bronze partition. Cheap.
- Lossless on overlap. The two-day window guarantees no row falls through the cracks if yesterday's job ran late.
- Idempotent. Running it twice does nothing extra — the `MATCHED` clause checks `_ingested_at`, so older versions never overwrite newer ones.
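One way to check the idempotency claim empirically: run the MERGE twice with the same `@run_date`, then look for duplicate merge keys. A minimal verification query against the same table:

```sql
-- Should return zero rows after any number of re-runs: the merge key
-- is unique by construction, so duplicates mean idempotency was violated.
SELECT order_id, COUNT(*) AS copies
FROM `myco.silver.orders`
GROUP BY order_id
HAVING COUNT(*) > 1;
```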
For dbt users, the equivalent is the incremental materialization:
```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'}
) }}

SELECT ...
FROM {{ source('bronze', 'stripe_charges_raw') }}
{% if is_incremental() %}
WHERE _ingestion_date >= (SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY) FROM {{ this }})
{% endif %}
```

dbt generates the MERGE statement automatically; you only write the SELECT and declare the merge key.
Why It Matters
- Cost stays linear with new data, not total data. A pipeline running on 100 GB/day stays roughly constant in cost whether the table holds 1 TB or 100 TB of history.
- Retries are safe. When orchestration tools (Airflow, Composer) retry a failed task, idempotency means you don't have to manually clean up partial state.
- Backfills compose. "Re-run last week's pipeline" is a parameter substitution, not a special procedure.
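As a concrete sketch of that, the script below re-runs the last seven days as a BigQuery scripting loop. It reuses the idempotent MERGE from earlier, trimmed to two columns for brevity; only the date moves between iterations, and idempotency makes every iteration safe to repeat.

```sql
-- Backfill as parameter substitution: loop the same statement over dates.
DECLARE run_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

WHILE run_date <= CURRENT_DATE() DO
  MERGE `myco.silver.orders` AS target
  USING (
    SELECT
      JSON_VALUE(payload, '$.id') AS order_id,
      _ingestion_timestamp AS _ingested_at
    FROM `myco.bronze.stripe_charges_raw`
    WHERE _ingestion_date BETWEEN DATE_SUB(run_date, INTERVAL 1 DAY) AND run_date
      AND JSON_VALUE(payload, '$.id') IS NOT NULL
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
  ) AS source
  ON target.order_id = source.order_id
  WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
    UPDATE SET _ingested_at = source._ingested_at
  WHEN NOT MATCHED THEN
    INSERT (order_id, _ingested_at) VALUES (source.order_id, source._ingested_at);

  SET run_date = DATE_ADD(run_date, INTERVAL 1 DAY);
END WHILE;
```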
Key Technical Details
- Always overlap your watermark window by at least one full ingestion period. Late-arriving data is real; an exact-edge watermark will lose rows.
- `MERGE` on a partitioned table prunes partitions automatically if you include a constant filter on the partition column in the `ON` clause (see the sketch after this list). Skip that, and `MERGE` scans the whole table — surprise full-table cost.
- For very high-cardinality tables (billions of rows), append-only patterns can outperform `MERGE` — write new rows to an "insert" table, deduplicate at read time. But this is an optimization; reach for `MERGE` first.
- `INSERT ... SELECT` is only idempotent if something deduplicates on the merge key — and BigQuery doesn't enforce unique constraints, so even with a deterministic source query a retry simply appends the same rows again. Don't rely on it casually.
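Here is a hedged sketch of that pruning trick, assuming the target is partitioned on `created_at` as in the dbt config above. One caveat: the constant bound on `target.created_at` must be wide enough to cover every row the source can contain, or old rows falling outside it will be treated as unmatched and re-inserted as duplicates.

```sql
MERGE `myco.silver.orders` AS target
USING (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at
  FROM `myco.bronze.stripe_charges_raw`
  WHERE _ingestion_date BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
    AND JSON_VALUE(payload, '$.id') IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1
) AS source
-- The created_at bound is the pruning predicate: without it, the join
-- key alone forces a scan of every target partition. The 3-day window
-- is a hypothetical choice; it must exceed your late-arrival horizon.
ON target.order_id = source.order_id
   AND target.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
WHEN MATCHED AND source._ingested_at > target._ingested_at THEN
  UPDATE SET created_at = source.created_at, _ingested_at = source._ingested_at
WHEN NOT MATCHED THEN
  INSERT (order_id, created_at, _ingested_at)
  VALUES (source.order_id, source.created_at, source._ingested_at);
```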
Common Misconceptions
"Incremental means we don't need bronze." No — bronze stays the system of record. Incremental silver pipelines read from bronze; if silver gets corrupted, you replay from bronze with a wider window.
"MERGE is slow." MERGE on a partitioned, clustered table is fast and competitive with append-only patterns up to billions of rows. Profile before optimizing prematurely.
"Idempotency is for streaming." It's for everything. Batch failures, orchestrator retries, and human-triggered backfills all rely on the same property.
Connections to Other Concepts
- `01-bronze-to-silver-cleaning-and-conforming.md` — Full rebuilds are the starting point; incremental is the optimization.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md` — Idempotent tasks make orchestration sane.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring.md` — Track row counts per run to catch silent incremental failures.
Further Reading
- Maxime Beauchemin, "Functional Data Engineering" — The foundational essay on idempotency in data pipelines.
- BigQuery docs, "DML statements: MERGE" — Reference for the SQL feature.
- dbt docs, "Incremental models" — How dbt generates merge logic from a declarative config.