One-Line Summary: An agent pipeline you can't see is an agent pipeline that breaks silently — and silent breakage is much worse than the loud kind because the agent keeps confidently answering with stale or wrong data.
Prerequisites: Module 06.
What's the Concept?
Three layers of observability matter for an agent data pipeline, and each watches a different failure mode:
- Pipeline observability — did each job run? How long did it take? Did it fail?
- Data quality observability — does the output look right? Row counts, freshness, null rates, distribution shifts.
- Agent-side observability — how often is each retrieval tool called, what does it return, when do the answers fail validation?
Pipeline observability catches outages. Data quality catches corruption. Agent observability catches the gap between "the pipeline ran" and "the agent gave a useful answer."
How It Works
The minimum stack:
Pipeline observability (Cloud Logging + Monitoring). Every Composer DAG, Dataflow job, and Cloud Run service emits structured logs. A handful of standing dashboards:
- "Pipelines green/red" — current run state of every DAG, color-coded.
- "Run duration over time" — line chart per DAG. Drift up means something's slowing.
- "Failure rate by task" — bar chart, last 7 days, sorted descending. The top is what to fix.
Alerting policies fire on three things: DAG failure (immediate), DAG late (an hour past expected start), task duration >2× baseline.
Data quality (a `_pipeline_metrics` table + dbt tests). Every silver and gold model writes one row per run to a small metrics table:

```sql
CREATE TABLE `myco.ops.pipeline_metrics` (
  pipeline        STRING,     -- e.g., "silver_orders"
  run_id          STRING,
  ran_at          TIMESTAMP,
  row_count       INT64,
  bytes_processed INT64,
  null_rate       FLOAT64,    -- per critical column, optional struct
  pass            BOOL,
  reason          STRING
)
PARTITION BY DATE(ran_at);
```

After each pipeline run, an inline SQL block writes the row. A second job — running every 15 minutes — checks for anomalies: row count more than 30% off the 7-day trailing average, freshness lag past the SLA, dbt tests with new failures. Anomalies fire alerts the same way pipeline failures do.
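As a sketch of the row-count part of that anomaly check (the 30% threshold comes from the text; treating each pipeline's latest run as the candidate, and the exact query shape, are assumptions):

```sql
-- Scheduled every 15 minutes: flag the latest run of each pipeline whose
-- row count is more than 30% off that pipeline's 7-day trailing average.
-- Freshness-lag and dbt-test checks would be analogous queries.
WITH latest AS (
  SELECT pipeline, run_id, ran_at, row_count
  FROM `myco.ops.pipeline_metrics`
  WHERE ran_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  QUALIFY ROW_NUMBER() OVER (PARTITION BY pipeline ORDER BY ran_at DESC) = 1
),
baseline AS (
  SELECT pipeline, AVG(row_count) AS avg_row_count
  FROM `myco.ops.pipeline_metrics`
  WHERE ran_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  GROUP BY pipeline
)
SELECT
  l.pipeline,
  l.run_id,
  l.row_count,
  b.avg_row_count,
  'row_count_anomaly' AS reason
FROM latest AS l
JOIN baseline AS b USING (pipeline)
WHERE ABS(l.row_count - b.avg_row_count) > 0.3 * b.avg_row_count;
```

Any rows this returns become alerts, routed through the same policy as pipeline failures.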
Agent-side observability. Every tool call gets logged with: tool name, input parameters (PII redacted), response size, latency, and a hash of the response. Aggregate dashboards show:
- "Tool calls per minute" by tool.
- "Empty response rate" — when the tool returns nothing, the agent likely fails.
- "Response p95 latency" — slowness propagates straight to user experience.
- "Validation failures" — responses that don't match the declared output schema (these are bugs).
Together these three layers tell you, within minutes, whether something's wrong and roughly where to look.
Why It Matters
- Silent data degradation is the worst failure mode. A pipeline that fails loudly gets fixed in an hour. A pipeline that runs but produces subtly wrong data poisons agent answers for weeks.
- Freshness is a first-class metric. "The agent's answer was wrong" usually traces back to "the data was three days old." Track lag-since-source-event for every gold table.
- Schema drift hides at the boundary. A vendor adds a field; bronze ingests it cleanly; silver ignores it; the agent never learns about a feature that exists. Schema-change detection should fire alerts even when nothing is technically broken.
- Costs come from somewhere. Without observability, you discover the $30,000 BigQuery bill at month end. With it, you catch the runaway query an hour after it starts.
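The schema-drift point above can be checked mechanically. A minimal sketch, assuming the dataset is named `bronze` and that a snapshot table `myco.ops.schema_snapshot` (columns `table_name`, `column_name`, `data_type`) is refreshed after each reviewed schema change; both names are illustrative:

```sql
-- Columns present in the live bronze dataset but absent from the reviewed
-- snapshot are new, unreviewed fields -- alert on any rows returned.
SELECT
  c.table_name,
  c.column_name,
  c.data_type
FROM `myco.bronze.INFORMATION_SCHEMA.COLUMNS` AS c
LEFT JOIN `myco.ops.schema_snapshot` AS s
  ON s.table_name = c.table_name AND s.column_name = c.column_name
WHERE s.column_name IS NULL;
```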
Key Technical Details
- BigQuery `INFORMATION_SCHEMA.JOBS_BY_PROJECT` gives you query-level metrics — bytes scanned, slot time, error codes — that feed straight into a quality dashboard.
- For freshness alerts, the metric is `TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE)` per table; alert when it crosses the SLA.
- dbt + Dataform both emit "test result" records you can pipe into the same `_pipeline_metrics` table.
- Cloud Monitoring's "alert policy" syntax accepts SQL queries against BigQuery as a metric source — useful for data-quality alerts without a separate alerting service.
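For instance, a query against `INFORMATION_SCHEMA.JOBS_BY_PROJECT` that surfaces the most expensive jobs of the last hour, which is how you catch the runaway query before month end, might look like this (the `region-us` qualifier and the 100 GB threshold are placeholders to adjust for your project):

```sql
-- Most expensive queries of the last hour, by bytes billed.
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  total_slot_ms,
  error_result.reason AS error_reason
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 100 * 1024 * 1024 * 1024   -- flag anything over ~100 GB billed
ORDER BY total_bytes_billed DESC
LIMIT 20;
```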
Common Misconceptions
"We'll add monitoring later." "Later" usually arrives after a major customer-facing incident traceable to bad data. Build the metrics writes into pipeline code from day one; they're 5 lines per pipeline.
"Tests in dbt are enough." Tests catch defined failure modes. Observability dashboards catch the undefined ones — the row count is right but the distribution suddenly skewed, the schema added a field, the timezone shifted by an hour. Tests + observability is the production combination.
"More dashboards = better observability." Three good dashboards beat thirty unloved ones. The test: which dashboards do you actually look at in an incident? Make those great; archive the rest.
Connections to Other Concepts
- `02-cost-control-on-bigquery-and-vertex-ai.md` — Observability is the input to cost control.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md` — Composer's UI is one of these dashboards out of the box.
- Course `05-serving-data-to-agents/04-the-retrieval-contract-between-pipeline-and-agent.md` — The contract's freshness SLA is what you alert against.
Further Reading
- Barr Moses & Lior Gavish, Data Quality Fundamentals (O'Reilly, 2022) — Co-authored by Monte Carlo's founders. The clearest single book on the data-observability discipline as it's currently practiced.
- Monte Carlo / Synq / Acryl data-observability blogs — Vendor-flavored but consistently the highest-quality material on data quality monitoring; the SLA + freshness + volume + schema + lineage framing they popularized is now industry standard.
- Charity Majors et al., Observability Engineering (O'Reilly, 2022) — General-purpose observability; the mental model (events, high cardinality, exploratory queries) transfers directly to data pipelines.
- Google Cloud, "Cloud Monitoring overview" + "BigQuery INFORMATION_SCHEMA.JOBS_BY_PROJECT" docs — The metric sources you build alerts on.
- dbt Labs, "Testing in dbt" docs — Tests as the lowest tier of data quality monitoring; necessary but not sufficient.
- Great Expectations / Soda documentation — Open-source data-quality frameworks; useful even if you only borrow the test taxonomy (`expect_column_values_to_not_be_null`, etc.).