One-Line Summary: Dataform is GCP's built-in dbt — version-controlled SQL pipelines that run entirely inside BigQuery, with scheduling, dependencies, and testing as a managed service.

Prerequisites: Lessons 01-orchestrating-with-cloud-composer.md and 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md.

What's the Concept?

For pipelines that live entirely inside BigQuery — silver and gold transformations, embedding refreshes, materialized aggregations — you don't strictly need Composer or Dataflow. Dataform gives you the same dbt-style developer experience (SQL with ref(), dependency graph, tests, scheduling) without leaving BigQuery.

The pitch is operational: no Airflow cluster to manage, no Python environment to maintain, no service-account juggling. You commit SQL to a Git repo; Dataform runs it on a schedule.

For an agent pipeline whose only non-SQL step is the ingestion (often handled by Datastream or a Cloud Run puller), Dataform alone can cover the entire silver → gold → embed → quality-check chain.

How It Works

A Dataform project looks like:

my-agent-pipeline/
├── definitions/
│   ├── sources.js                 ← bronze source declarations
│   ├── silver_orders.sqlx         ← silver model
│   ├── gold_billing_context.sqlx  ← gold model
│   └── tests/
│       └── unique_order_id.sqlx
├── includes/
│   └── helpers.js                 ← shared JS helpers
└── workflow_settings.yaml
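The sources.js file registers bronze tables that Dataform does not build itself, so downstream models can reference them through ref(). A minimal sketch matching the tree above (declare() is Dataform's built-in declaration helper; the table names are the ones used in this lesson):

```js
// definitions/sources.js
// Declare the bronze table so models can ${ref(...)} it.
declare({
  schema: "bronze",
  name: "stripe_charges_raw",
});
```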

A .sqlx model is SQL with a small config header:

-- definitions/silver_orders.sqlx
config {
  type: "incremental",
  uniqueKey: ["order_id"],
  bigquery: {
    partitionBy: "DATE(created_at)",
    clusterBy: ["customer_id"]
  },
  description: "One row per order. Built from bronze.stripe_charges_raw."
}
 
SELECT
  JSON_VALUE(payload, '$.id')        AS order_id,
  JSON_VALUE(payload, '$.customer')  AS customer_id,
  SAFE_CAST(JSON_VALUE(payload, '$.amount') AS INT64) AS amount_cents,
  TIMESTAMP_SECONDS(SAFE_CAST(JSON_VALUE(payload, '$.created') AS INT64)) AS created_at,
  _ingestion_timestamp               AS _ingested_at
FROM ${ref({schema: "bronze", name: "stripe_charges_raw"})}
 
${when(incremental(), `
  WHERE DATE(_ingestion_timestamp) >= (
    SELECT DATE_SUB(MAX(DATE(_ingested_at)), INTERVAL 1 DAY)
    FROM ${self()}
  )
`)}

Gold models reference silver with ${ref("silver_orders")}. Dataform compiles, builds the DAG, runs models in order, runs tests, and reports.
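A gold model is just another .sqlx file downstream of silver. An illustrative sketch (the aggregation and output columns are assumptions, extending the silver model above):

```sql
-- definitions/gold_billing_context.sqlx
config {
  type: "table",
  tags: ["billing_agent"],
  description: "Per-customer billing context served to the agent."
}

SELECT
  customer_id,
  COUNT(*)          AS order_count,
  SUM(amount_cents) AS lifetime_cents,
  MAX(created_at)   AS last_order_at
FROM ${ref("silver_orders")}
GROUP BY customer_id
```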

Scheduling lives in the Dataform service rather than in the repo. You create a release configuration (which compiles a Git branch or tag on a cadence) and a workflow configuration (a cron schedule that executes the compiled graph, optionally filtered by tags), either in the console or via the API. The repo's workflow_settings.yaml holds compilation defaults:

# workflow_settings.yaml
defaultProject: my-gcp-project
defaultLocation: us-central1
defaultDataset: analytics
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: "3.0.0"

Tags on models (tags: ["billing_agent"]) let you run subsets of the DAG on different schedules.

Why It Matters

  • One less moving part. No Airflow cluster, no DAG file translation. Everything is BigQuery-native.
  • Free. Dataform itself costs nothing. You only pay for the BigQuery queries it runs.
  • Tight Git integration. Dataform repos connect to GitHub, GitLab, or Bitbucket (or use a Dataform-managed repository); the Dataform UI shows live diffs, run history, and test results inline.
  • Same mental model as dbt. Engineers who know dbt are productive in Dataform immediately.

Key Technical Details

  • Dataform supports the core materializations you know from dbt: view, table, and incremental. For incremental models, the uniqueKey config tells Dataform which columns to MERGE on.
  • Tests in Dataform are assertion queries — same as dbt's "tests should return zero rows when passing."
  • The .sqlx format includes a templating language (JavaScript-based) for the config and dynamic SQL. Cleaner than dbt's Jinja for most things.
  • For cross-tool orchestration (Dataform + a Cloud Run job), you can still trigger Dataform from Composer with the Google provider's Dataform operators (DataformCreateCompilationResultOperator, then DataformCreateWorkflowInvocationOperator). Use Dataform for what it's good at, and Composer just for the cross-system bits.
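The tests/unique_order_id.sqlx file from the project tree can be written as a standalone assertion: a query that must return zero rows for the run to pass. A sketch:

```sql
-- definitions/tests/unique_order_id.sqlx
config {
  type: "assertion",
  description: "order_id must be unique in silver_orders."
}

-- Any duplicated order_id surfaces here and fails the assertion.
SELECT order_id, COUNT(*) AS dupes
FROM ${ref("silver_orders")}
GROUP BY order_id
HAVING COUNT(*) > 1
```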

Common Misconceptions

"Pick dbt or Dataform — never both." Some teams use Dataform for warehouse-internal transforms and dbt elsewhere. The conventions are similar enough that engineers move between them easily.

"Dataform is just dbt-lite." Functionally close, but tightly coupled to BigQuery. If your warehouse is BigQuery and you're not multi-cloud, Dataform's tight integration is a win, not a limitation.

"Dataform replaces Composer entirely." It replaces Composer for SQL-only pipelines. The minute you have a non-SQL step (calling an external API, kicking off a Dataflow job), you need Composer or a Cloud Workflows definition to glue them together.

Connections to Other Concepts

  • Course 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.md — The dbt equivalent.
  • 01-orchestrating-with-cloud-composer.md — When you need cross-system orchestration on top.
  • Course 07-operating-the-system/01-observability-and-data-quality-monitoring.md — Dataform tests as quality monitors.

Further Reading

  • Google Cloud, "Dataform overview" docs.
  • "Dataform SQLX reference" — Syntax + config options.
  • Dataform's product blog — Comparisons with dbt and migration guides.