One-Line Summary: Dataform is GCP's built-in dbt — version-controlled SQL pipelines that run entirely inside BigQuery, with scheduling, dependencies, and testing as a managed service.
Prerequisites: Lessons 01-orchestrating-with-cloud-composer.md and 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md.
What's the Concept?
For pipelines that live entirely inside BigQuery — silver and gold transformations, embedding refreshes, materialized aggregations — you don't strictly need Composer or Dataflow. Dataform gives you the same dbt-style developer experience (SQL with ref(), dependency graph, tests, scheduling) without leaving BigQuery.
The pitch is operational: no Airflow cluster to manage, no Python environment to maintain, no service-account juggling. You commit SQL to a Git repo; Dataform runs it on a schedule.
For an agent pipeline whose only non-SQL step is the ingestion (often handled by Datastream or a Cloud Run puller), Dataform alone can cover the entire silver → gold → embed → quality-check chain.
How It Works
A Dataform project looks like:

```
my-agent-pipeline/
├── definitions/
│   ├── sources.js                 ← bronze source declarations
│   ├── silver_orders.sqlx         ← silver model
│   ├── gold_billing_context.sqlx  ← gold model
│   └── tests/
│       └── unique_order_id.sqlx
├── includes/
│   └── helpers.js                 ← shared JS helpers
└── workflow_settings.yaml
```
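The `ref({schema: "bronze", name: "stripe_charges_raw"})` you'll see below resolves against a source declaration. The tree keeps declarations in `sources.js`, but they can equivalently be written in `.sqlx`; a minimal sketch of that form:

```sql
-- definitions/sources.sqlx (equivalent .sqlx form of the JS declaration)
config {
  type: "declaration",
  schema: "bronze",
  name: "stripe_charges_raw"
}
```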
A `.sqlx` model is SQL with a small config header:

```sql
-- definitions/silver_orders.sqlx
config {
  type: "incremental",
  uniqueKey: ["order_id"],
  bigquery: {
    partitionBy: "DATE(created_at)",
    clusterBy: ["customer_id"]
  },
  description: "One row per order. Built from bronze.stripe_charges_raw."
}

SELECT
  JSON_VALUE(payload, '$.id') AS order_id,
  JSON_VALUE(payload, '$.customer') AS customer_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
  TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
  _ingestion_timestamp AS _ingested_at
FROM ${ref({schema: "bronze", name: "stripe_charges_raw"})}
${when(incremental(), `
  WHERE _ingestion_date >= (
    SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY)
    FROM ${self()}
  )
`)}
```

Gold models reference silver with `${ref("silver_orders")}`. Dataform compiles the project, builds the dependency DAG, runs models in order, runs tests, and reports the results.
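For concreteness, here is a minimal gold model following the file names in the project tree above (the aggregation and column choices are illustrative, not from the lesson):

```sql
-- definitions/gold_billing_context.sqlx (illustrative sketch)
config {
  type: "table",
  bigquery: { clusterBy: ["customer_id"] },
  description: "Per-customer billing context for the agent."
}

SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount_cents) AS lifetime_amount_cents,
  MAX(created_at) AS last_order_at
FROM ${ref("silver_orders")}
GROUP BY customer_id
```

Because the SELECT uses `${ref("silver_orders")}`, Dataform knows to build this table only after `silver_orders` succeeds.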
Scheduling is configured per workflow. In the managed GCP service, schedules live in a workflow configuration (created in the console, via the API, or with Terraform) rather than in `workflow_settings.yaml`, which holds project defaults. A schedule boils down to:

```yaml
# Workflow configuration (shown schematically; it is created in the
# Dataform console, API, or Terraform rather than checked into the repo)
name: hourly_billing
cron: "0 * * * *"
tags: ["billing_agent"]
# Failure notifications (e.g. to data-alerts@myco.com) are wired up as
# Cloud Monitoring alerts on workflow invocation logs.
```

Tags on models (`tags: ["billing_agent"]`) let you run subsets of the DAG on different schedules.
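On the model side, a tag is just another key in the config block; for example, tagging the gold model so the `hourly_billing` schedule above picks it up:

```sql
-- Config excerpt (hypothetical): adding gold_billing_context to the
-- billing_agent tag selected by the schedule above.
config {
  type: "table",
  tags: ["billing_agent"]
}
```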
Why It Matters
- One less moving part. No Airflow cluster, no DAG file translation. Everything is BigQuery-native.
- Free. Dataform itself costs nothing. You only pay for the BigQuery queries it runs.
- Tight Git integration. Dataform repositories connect to GitHub, GitLab, or Bitbucket (or can be hosted by Dataform itself); the Dataform UI shows live diffs, run history, and test results inline.
- Same mental model as dbt. Engineers who know dbt are productive in Dataform immediately.
Key Technical Details
- Dataform supports the same materializations as dbt: `view`, `table`, `incremental`. For incremental models, the `uniqueKey` config tells Dataform how to MERGE new rows into the existing table (see the sketch after this list).
- Tests in Dataform are assertion queries — same as dbt's "tests should return zero rows when passing" (see the assertion example after this list).
- The `.sqlx` format includes a templating language (JavaScript-based) for the config block and dynamic SQL. Cleaner than dbt's Jinja for most things.
- For cross-tool orchestration (Dataform plus a Cloud Run job, say), you can still trigger Dataform from Composer with the `DataformCreateWorkflowInvocationOperator`. Use Dataform for what it's good at, and Composer just for the cross-system bits.
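To make the `uniqueKey` point concrete, here is roughly the MERGE that an incremental run of `silver_orders` compiles to (schematic: the project and dataset names are assumed, and the real generated SQL differs in its details):

```sql
-- Schematic compiled output for an incremental run with uniqueKey: ["order_id"]
MERGE `my-project.silver.silver_orders` AS t
USING (
  SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer') AS customer_id,
    SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
    TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
    _ingestion_timestamp AS _ingested_at
  FROM `my-project.bronze.stripe_charges_raw`
  WHERE _ingestion_date >= (
    SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY)
    FROM `my-project.silver.silver_orders`
  )
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  customer_id = s.customer_id,
  amount_cents = s.amount_cents,
  created_at = s.created_at,
  _ingested_at = s._ingested_at
WHEN NOT MATCHED THEN INSERT ROW
```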
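And a minimal assertion matching the `tests/unique_order_id.sqlx` file in the project tree: the query selects violating rows, so the test passes when it returns nothing.

```sql
-- definitions/tests/unique_order_id.sqlx
config { type: "assertion" }

SELECT order_id, COUNT(*) AS n
FROM ${ref("silver_orders")}
GROUP BY order_id
HAVING n > 1
```

For common cases like this, Dataform can also generate the assertion for you via `assertions: { uniqueKey: ["order_id"] }` inside the model's own config block.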
Common Misconceptions
"Pick dbt or Dataform — never both." Some teams use Dataform for warehouse-internal transforms and dbt elsewhere. The conventions are similar enough that engineers move between them easily.
"Dataform is just dbt-lite." Functionally close, but tightly coupled to BigQuery. If your warehouse is BigQuery and you're not multi-cloud, Dataform's tight integration is a win, not a limitation.
"Dataform replaces Composer entirely." It replaces Composer for SQL-only pipelines. The minute you have a non-SQL step (calling an external API, kicking off a Dataflow job), you need Composer or a Cloud Workflows definition to glue them together.
Connections to Other Concepts
- Course `04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md` — the dbt equivalent.
- Course `01-orchestrating-with-cloud-composer.md` — when you need cross-system orchestration on top.
- Course `07-operating-the-system/01-observability-and-data-quality-monitoring.md` — Dataform tests as quality monitors.
Further Reading
- Google Cloud, "Dataform overview" docs.
- "Dataform SQLX reference" — Syntax + config options.
- Dataform's product blog — Comparisons with dbt and migration guides.