One-Line Summary: dbt (or Google's native Dataform) turns SQL transformations into version-controlled, testable code with dependency graphs — the difference between an ad-hoc warehouse and a maintained one.

Prerequisites: Lesson 02-silver-to-gold-modeling-for-agents.md.

What's the Concept?

A pile of CREATE OR REPLACE TABLE queries in a folder is a maintenance nightmare. Which runs first? What depends on what? How do I test a change before deploying it? How do I roll back?

dbt (data build tool) is the answer the industry standardized on around 2020. It's simple: write each table definition as a .sql file with templating, declare its dependencies, and the tool builds a DAG, runs them in order, and tests them along the way.

GCP also ships Dataform, a native equivalent integrated with BigQuery. The mental model is identical; the choice is mostly about whether you want a cloud-managed UI (Dataform) or an open-source ecosystem (dbt). For this course we describe dbt because the patterns are more widely transferable.

How It Works

A dbt project mirrors the medallion layout:

dbt-project/
├── models/
│   ├── staging/           ← silver models (one .sql per silver table)
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_tickets.sql
│   └── marts/             ← gold models (one .sql per agent use case)
│       ├── billing_agent_context.sql
│       └── support_agent_context.sql
├── tests/
│   ├── orders_no_negative_amounts.sql
│   └── customers_unique_id.sql
└── dbt_project.yml
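The dbt_project.yml at the root wires the folders to materialization defaults. A minimal sketch for this layout — the project name, profile name, and schema names are assumptions, not part of the lesson's example:

```yaml
# dbt_project.yml -- minimal sketch; 'agent_warehouse' and the schema names are illustrative
name: agent_warehouse
version: '1.0.0'
profile: agent_warehouse   # must match a profile in ~/.dbt/profiles.yml

model-paths: ["models"]
test-paths: ["tests"]

models:
  agent_warehouse:
    staging:
      +materialized: table   # silver: full tables, not views
      +schema: silver
    marts:
      +materialized: table   # gold: one table per agent use case
      +schema: gold
```

Folder-level `+materialized` defaults mean individual models only need a config() block when they deviate (for partitioning, clustering, or incremental loads).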

Each model is a single SELECT statement; dbt wraps it in the CREATE OR REPLACE TABLE (or CREATE VIEW) statement at run time:

-- models/staging/stg_orders.sql
{{ config(
    materialized='table',
    partition_by={'field': 'created_at', 'data_type': 'timestamp'},
    cluster_by=['customer_id']
) }}
 
WITH parsed AS (
  SELECT
    JSON_VALUE(payload, '$.id')           AS order_id,
    JSON_VALUE(payload, '$.customer')     AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp                  AS _ingested_at
  FROM {{ source('bronze', 'stripe_charges_raw') }}
  -- note: with materialized='table', each rebuild keeps only this 7-day window
  WHERE _ingestion_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT * FROM parsed
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingested_at DESC) = 1

Gold models reference silver models with {{ ref('stg_orders') }} instead of hard-coded table names. dbt parses these references, builds the DAG, and runs models in correct order. If stg_orders.sql fails, dependent gold models are skipped — no half-built warehouse.
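A gold model using ref() might look like the following sketch. The join shape and the stg_customers columns (email) are assumptions for illustration; only stg_orders' columns come from the example above:

```sql
-- models/marts/billing_agent_context.sql  (illustrative sketch)
{{ config(materialized='table') }}

SELECT
  c.customer_id,
  c.email,                              -- assumed column on stg_customers
  COUNT(o.order_id)   AS order_count,
  SUM(o.amount_cents) AS lifetime_cents,
  MAX(o.created_at)   AS last_order_at
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o USING (customer_id)
GROUP BY c.customer_id, c.email
```

Because both upstream tables are pulled in via ref(), dbt knows this model must run after stg_customers and stg_orders without any explicit ordering configuration.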

Tests are SQL queries that return rows when something is wrong:

-- tests/orders_no_negative_amounts.sql
SELECT order_id FROM {{ ref('stg_orders') }}
WHERE amount_cents < 0

dbt test runs all tests; any returned row is a failure. Built-in generic tests (unique, not_null, accepted_values, relationships) are declared in YAML rather than written as SQL.
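The YAML declarations live alongside the models. A minimal sketch for stg_orders, using the column names from the example above:

```yaml
# models/staging/schema.yml -- generic tests declared next to the models
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:        # every order's customer must exist in stg_customers
              to: ref('stg_customers')
              field: customer_id
```

Each declaration compiles into a query of the same shape as the hand-written test: select the rows that violate the rule, fail if any come back.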

Why It Matters

  • The DAG is explicit. New engineers can see the entire transformation pipeline in one diagram. Implicit-dependency bugs disappear.
  • Tests catch regressions before production. A silver-layer schema change that breaks gold gets flagged in the CI run, not at 3am.
  • Backfills become safe. dbt run --select stg_orders+ rebuilds stg_orders and everything downstream of it, in order.
  • The warehouse becomes a code repository. PRs, code review, history, blame — all the engineering practices apply.
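The backfill workflow from the list above, as CLI commands (model names are the ones used in this lesson; the `+` graph operator selects downstream or upstream dependencies):

```shell
# Rebuild one silver model and everything downstream of it
dbt run --select stg_orders+

# Rebuild a gold model and everything upstream it depends on
dbt run --select +billing_agent_context

# Run only the tests attached to the affected models
dbt test --select stg_orders+
```

Selection syntax is what makes backfills safe: you never hand-order a rebuild, you name a node and let the DAG decide the rest.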

Key Technical Details

  • dbt comes in a free open-source CLI (dbt Core) and a paid hosted variant (dbt Cloud). The open-source CLI plus a GitHub Action for CI is the cheap production setup.
  • Dataform (Google's native) is free with BigQuery and offers similar features with tighter GCP integration; pick it if you want fewer moving parts.
  • Models default to view materialization. For anything queried more than once, use table (full rebuild) or incremental (append/merge new data only).
  • Source freshness checks (dbt source freshness) detect upstream pipeline lag — useful as an early warning.
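Incremental materialization from the list above, sketched for stg_orders — the unique_key and timestamp column come from the earlier example, while the merge details are the subject of lesson 04:

```sql
-- models/staging/stg_orders.sql, switched to incremental (sketch)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT
  JSON_VALUE(payload, '$.id')                         AS order_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
  _ingestion_timestamp                                AS _ingested_at
FROM {{ source('bronze', 'stripe_charges_raw') }}
{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already loaded
  WHERE _ingestion_timestamp > (SELECT MAX(_ingested_at) FROM {{ this }})
{% endif %}
```

On the first run the if-block compiles away and the table is built from scratch; on later runs only new rows are merged in by order_id.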

Common Misconceptions

"dbt is just SQL with extra steps." It's SQL with a dependency graph, a test framework, documentation generation, and CI integration. The "extra steps" are the actual engineering practice.

"We're too small for dbt." dbt is more useful at small scale than at large — that's when changes are most frequent and least documented. Start as soon as you have more than ~5 transformation queries.

"dbt and Dataform are competing tools." They serve the same role with different ecosystems. Pick one based on team preference; don't run both.

Connections to Other Concepts

  • 04-incremental-and-idempotent-pipelines.md — How dbt's incremental materialization works in practice.
  • Course 06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md — Wiring dbt into Airflow.
  • Course 07-operating-the-system/01-observability-and-data-quality-monitoring.md — dbt tests as the first tier of data quality monitoring.

Further Reading

  • dbt Labs, dbt Best Practices Guide — Community-maintained, current, updated continuously. The closest thing to canonical for project structure, naming, and testing strategy.
  • Tristan Handy et al., dbt official documentation and "dbt Fundamentals" course (free) — Start here if you've never used dbt.
  • Google Cloud, "Dataform overview" + "Dataform SQLX reference" — The native equivalent inside BigQuery; free, tighter integration than dbt.
  • dbt Semantic Layer (GA 2024) — Worth knowing about: declarative metric definitions on top of dbt models, queryable from BI tools and (now) agent tool calls. Maps onto the agent-mart pattern in this course's Module 05.
  • Google Cloud, "BigQuery managed Apache Iceberg tables" (GA 2024) — Iceberg-formatted tables natively in BigQuery, queryable from Spark/Trino too. Worth knowing about if you'll ever need cross-engine reads on the same gold tables.