One-Line Summary: An agent pipeline you can't see is an agent pipeline that breaks silently — and silent breakage is much worse than the loud kind because the agent keeps confidently answering with stale or wrong data.
Prerequisites: Module 06.
What's the Concept?
Three layers of observability matter for an agent data pipeline, and each watches a different failure mode:
- Pipeline observability — did each job run? How long did it take? Did it fail?
- Data quality observability — does the output look right? Row counts, freshness, null rates, distribution shifts.
- Agent-side observability — how often is each retrieval tool called, what does it return, when do the answers fail validation?
Pipeline observability catches outages. Data quality catches corruption. Agent observability catches the gap between "the pipeline ran" and "the agent gave a useful answer."
How It Works
The minimum stack:
Pipeline observability (Cloud Logging + Monitoring). Every Composer DAG, Dataflow job, and Cloud Run service emits structured logs. A handful of standing dashboards:
- "Pipelines green/red" — current run state of every DAG, color-coded.
- "Run duration over time" — line chart per DAG. Drift up means something's slowing.
- "Failure rate by task" — bar chart, last 7 days, sorted descending. The top is what to fix.
Alerting policies fire on three things: DAG failure (immediate), DAG late (an hour past expected start), task duration >2× baseline.
Data quality (a `_pipeline_metrics` table + dbt tests). Every silver and gold model writes one row per run to a small metrics table:

```sql
CREATE TABLE `myco.ops.pipeline_metrics` (
  pipeline        STRING,     -- e.g., "silver_orders"
  run_id          STRING,
  ran_at          TIMESTAMP,
  row_count       INT64,
  bytes_processed INT64,
  null_rate       FLOAT64,    -- per critical column, optional struct
  pass            BOOL,
  reason          STRING
)
PARTITION BY DATE(ran_at);
```

After each pipeline run, an inline SQL block writes the row. A second job — running every 15 minutes — checks for anomalies: row count more than 30% off the 7-day trailing average, freshness lag past the SLA, dbt tests with new failures. Anomalies fire alerts the same way pipeline failures do.
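As a sketch of the row-count part of that anomaly check (the 30% threshold comes from the text; treating each pipeline's latest run as the candidate, and the exact query shape, are assumptions):

```sql
-- Scheduled every 15 minutes: flag the latest run of each pipeline whose
-- row count is more than 30% off that pipeline's 7-day trailing average.
-- Freshness-lag and dbt-test checks would be analogous queries.
WITH latest AS (
  SELECT pipeline, run_id, ran_at, row_count
  FROM `myco.ops.pipeline_metrics`
  WHERE ran_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  QUALIFY ROW_NUMBER() OVER (PARTITION BY pipeline ORDER BY ran_at DESC) = 1
),
baseline AS (
  SELECT pipeline, AVG(row_count) AS avg_row_count
  FROM `myco.ops.pipeline_metrics`
  WHERE ran_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  GROUP BY pipeline
)
SELECT
  l.pipeline,
  l.run_id,
  l.row_count,
  b.avg_row_count,
  'row_count_anomaly' AS reason
FROM latest AS l
JOIN baseline AS b USING (pipeline)
WHERE ABS(l.row_count - b.avg_row_count) > 0.3 * b.avg_row_count;
```

Any rows this returns become alerts, routed through the same policy as pipeline failures.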
Agent-side observability. Every tool call gets logged with: tool name, input parameters (PII redacted), response size, latency, and a hash of the response. Aggregate dashboards show:
- "Tool calls per minute" by tool.
- "Empty response rate" — when the tool returns nothing, the agent likely fails.
- "Response p95 latency" — slowness propagates straight to user experience.
- "Validation failures" — responses that don't match the declared output schema (these are bugs).
Together these three layers tell you, within minutes, whether something's wrong and roughly where to look.
Why It Matters
- Silent data degradation is the worst failure mode. A pipeline that fails loudly gets fixed in an hour. A pipeline that runs but produces subtly wrong data poisons agent answers for weeks.
- Freshness is a first-class metric. "The agent's answer was wrong" usually traces back to "the data was three days old." Track lag-since-source-event for every gold table.
- Schema drift hides at the boundary. A vendor adds a field; bronze ingests it cleanly; silver ignores it; the agent never learns about a feature that exists. Schema-change detection should fire alerts even when nothing is technically broken.
- Costs come from somewhere. Without observability, you discover the $30,000 BigQuery bill at month end. With it, you catch the runaway query an hour after it starts.
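The schema-drift point above can be checked mechanically. A minimal sketch, assuming the dataset is named `bronze` and that a snapshot table `myco.ops.schema_snapshot` (columns `table_name`, `column_name`, `data_type`) is refreshed after each reviewed schema change; both names are illustrative:

```sql
-- Columns present in the live bronze dataset but absent from the reviewed
-- snapshot are new, unreviewed fields -- alert on any rows returned.
SELECT
  c.table_name,
  c.column_name,
  c.data_type
FROM `myco.bronze.INFORMATION_SCHEMA.COLUMNS` AS c
LEFT JOIN `myco.ops.schema_snapshot` AS s
  ON s.table_name = c.table_name AND s.column_name = c.column_name
WHERE s.column_name IS NULL;
```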
Key Technical Details
- BigQuery `INFORMATION_SCHEMA.JOBS_BY_PROJECT` gives you query-level metrics — bytes scanned, slot time, error codes — that feed straight into a quality dashboard.
- For freshness alerts, the metric is `TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE)` per table; alert when it crosses the SLA.
- dbt + Dataform both emit "test result" records you can pipe into the same `_pipeline_metrics` table.
- Cloud Monitoring's "alert policy" syntax accepts SQL queries against BigQuery as a metric source — useful for data-quality alerts without a separate alerting service.
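For instance, a query against `INFORMATION_SCHEMA.JOBS_BY_PROJECT` that surfaces the most expensive jobs of the last hour, which is how you catch the runaway query before month end, might look like this (the `region-us` qualifier and the 100 GB threshold are placeholders to adjust for your project):

```sql
-- Most expensive queries of the last hour, by bytes billed.
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  total_slot_ms,
  error_result.reason AS error_reason
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 100 * 1024 * 1024 * 1024   -- flag anything over ~100 GB billed
ORDER BY total_bytes_billed DESC
LIMIT 20;
```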
Common Misconceptions
"We'll add monitoring later." "Later" usually arrives after a major customer-facing incident traceable to bad data. Build the metrics writes into pipeline code from day one; they're 5 lines per pipeline.
"Tests in dbt are enough." Tests catch defined failure modes. Observability dashboards catch the undefined ones — the row count is right but the distribution suddenly skewed, the schema added a field, the timezone shifted by an hour. Tests + observability is the production combination.
"More dashboards = better observability." Three good dashboards beat thirty unloved ones. The test: which dashboards do you actually look at in an incident? Make those great; archive the rest.
Connections to Other Concepts
- `02-cost-control-on-bigquery-and-vertex-ai.md` — Observability is the input to cost control.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md` — Composer's UI is one of these dashboards out of the box.
- Course `05-serving-data-to-agents/04-the-retrieval-contract-between-pipeline-and-agent.md` — The contract's freshness SLA is what you alert against.
Further Reading
- Barr Moses & Lior Gavish, Data Quality Fundamentals (O'Reilly, 2022) — Co-authored by Monte Carlo's founders. The clearest single book on the data-observability discipline as it's currently practiced.
- Monte Carlo / Synq / Acryl data-observability blogs — Vendor-flavored but consistently the highest-quality material on data quality monitoring; the SLA + freshness + volume + schema + lineage framing they popularized is now industry standard.
- Charity Majors et al., Observability Engineering (O'Reilly, 2022) — General-purpose observability; the mental model (events, high cardinality, exploratory queries) transfers directly to data pipelines.
- Google Cloud, "Cloud Monitoring overview" + "BigQuery INFORMATION_SCHEMA.JOBS_BY_PROJECT" docs — The metric sources you build alerts on.
- dbt Labs, "Testing in dbt" docs — Tests as the lowest tier of data quality monitoring; necessary but not sufficient.
- Great Expectations / Soda documentation — Open-source data-quality frameworks; useful even if you only borrow the test taxonomy (`expect_column_values_to_not_be_null`, etc.).