One-Line Summary: Dataform is GCP's built-in dbt — version-controlled SQL pipelines that run entirely inside BigQuery, with scheduling, dependencies, and testing as a managed service.

Prerequisites: Lessons 01-orchestrating-with-cloud-composer.md and 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md.

What's the Concept?

For pipelines that live entirely inside BigQuery — silver and gold transformations, embedding refreshes, materialized aggregations — you don't strictly need Composer or Dataflow. Dataform gives you the same dbt-style developer experience (SQL with ref(), dependency graph, tests, scheduling) without leaving BigQuery.

The pitch is operational: no Airflow cluster to manage, no Python environment to maintain, no service-account juggling. You commit SQL to a Git repo; Dataform runs it on a schedule.

For an agent pipeline whose only non-SQL step is the ingestion (often handled by Datastream or a Cloud Run puller), Dataform alone can cover the entire silver → gold → embed → quality-check chain.

How It Works

A Dataform project looks like:

my-agent-pipeline/
├── definitions/
│   ├── sources.js                 ← bronze source declarations
│   ├── silver_orders.sqlx         ← silver model
│   ├── gold_billing_context.sqlx  ← gold model
│   └── tests/
│       └── unique_order_id.sqlx
├── includes/
│   └── helpers.js                 ← shared JS helpers
└── workflow_settings.yaml
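The sources.js file registers bronze tables that Dataform does not build itself, so downstream models can reference them through ref(). A minimal sketch matching the tree above (declare() is Dataform's built-in declaration helper; the table names are the ones used in this lesson):

```js
// definitions/sources.js
// Declare the bronze table so models can ${ref(...)} it.
declare({
  schema: "bronze",
  name: "stripe_charges_raw",
});
```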

A .sqlx model is SQL with a small config header:

-- definitions/silver_orders.sqlx
config {
  type: "incremental",
  uniqueKey: ["order_id"],
  bigquery: {
    partitionBy: "DATE(created_at)",
    clusterBy: ["customer_id"]
  },
  description: "One row per order. Built from bronze.stripe_charges_raw."
}
 
SELECT
  JSON_VALUE(payload, '$.id')        AS order_id,
  JSON_VALUE(payload, '$.customer')  AS customer_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
  TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
  _ingestion_timestamp               AS _ingested_at
FROM ${ref({schema: "bronze", name: "stripe_charges_raw"})}
 
${when(incremental(), `
  WHERE DATE(_ingestion_timestamp) >= (
    SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY)
    FROM ${self()}
  )
`)}

Gold models reference silver with ${ref("silver_orders")}. Dataform compiles, builds the DAG, runs models in order, runs tests, and reports.
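A gold model is just another .sqlx file downstream of silver. An illustrative sketch (the aggregation and output columns are assumptions, extending the silver model above):

```sql
-- definitions/gold_billing_context.sqlx
config {
  type: "table",
  tags: ["billing_agent"],
  description: "Per-customer billing context served to the agent."
}

SELECT
  customer_id,
  COUNT(*)          AS order_count,
  SUM(amount_cents) AS lifetime_cents,
  MAX(created_at)   AS last_order_at
FROM ${ref("silver_orders")}
GROUP BY customer_id
```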

Scheduling lives in the Dataform service rather than in the repo. You create a release configuration (which compiles a Git branch or tag on a cadence) and a workflow configuration (a cron schedule that executes the compiled graph, optionally filtered by tags), either in the console or via the API. The repo's workflow_settings.yaml holds compilation defaults:

# workflow_settings.yaml
defaultProject: my-gcp-project
defaultLocation: us-central1
defaultDataset: analytics
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: "3.0.0"

Tags on models (tags: ["billing_agent"]) let you run subsets of the DAG on different schedules.

Why It Matters

  • One less moving part. No Airflow cluster, no DAG file translation. Everything is BigQuery-native.
  • Free. Dataform itself costs nothing. You only pay for the BigQuery queries it runs.
  • Tight Git integration. Dataform repos connect to GitHub, GitLab, or Bitbucket (or use a Dataform-managed repository); the Dataform UI shows live diffs, run history, and test results inline.
  • Same mental model as dbt. Engineers who know dbt are productive in Dataform immediately.

Key Technical Details

  • Dataform supports the core materializations you know from dbt: view, table, and incremental. For incremental models, the uniqueKey config tells Dataform which columns to MERGE on.
  • Tests in Dataform are assertion queries — same as dbt's "tests should return zero rows when passing."
  • The .sqlx format includes a templating language (JavaScript-based) for the config and dynamic SQL. Cleaner than dbt's Jinja for most things.
  • For cross-tool orchestration (Dataform + a Cloud Run job), you can still trigger Dataform from Composer with the Google provider's Dataform operators (DataformCreateCompilationResultOperator, then DataformCreateWorkflowInvocationOperator). Use Dataform for what it's good at, and Composer just for the cross-system bits.
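The tests/unique_order_id.sqlx file from the project tree can be written as a standalone assertion: a query that must return zero rows for the run to pass. A sketch:

```sql
-- definitions/tests/unique_order_id.sqlx
config {
  type: "assertion",
  description: "order_id must be unique in silver_orders."
}

-- Any duplicated order_id surfaces here and fails the assertion.
SELECT order_id, COUNT(*) AS dupes
FROM ${ref("silver_orders")}
GROUP BY order_id
HAVING COUNT(*) > 1
```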

Common Misconceptions

"Pick dbt or Dataform — never both." Some teams use Dataform for warehouse-internal transforms and dbt elsewhere. The conventions are similar enough that engineers move between them easily.

"Dataform is just dbt-lite." Functionally close, but tightly coupled to BigQuery. If your warehouse is BigQuery and you're not multi-cloud, Dataform's tight integration is a win, not a limitation.

"Dataform replaces Composer entirely." It replaces Composer for SQL-only pipelines. The minute you have a non-SQL step (calling an external API, kicking off a Dataflow job), you need Composer or a Cloud Workflows definition to glue them together.

Connections to Other Concepts

  • Course 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md — The dbt equivalent.
  • 01-orchestrating-with-cloud-composer.md — When you need cross-system orchestration on top.
  • Course 07-operating-the-system/01-observability-and-data-quality-monitoring.md — Dataform tests as quality monitors.

Further Reading

  • Google Cloud, "Dataform overview" docs.
  • "Dataform SQLX reference" — Syntax + config options.
  • Dataform's product blog — Comparisons with dbt and migration guides.