Build Along · Module 38·6 min build

Build: An Agentic ETL Pipeline

The companion build to Agentic ETL. Seventy lines that show the only ETL shape that survives real data: the fuzzy LLM agent in the middle, wrapped in deterministic code on both sides.

What you’ll build

  • Two vendor sources with the same data in totally different shapes.
  • An agent transform that maps any shape onto one canonical schema.
  • A deterministic validation gate that quarantines bad rows.
  • The sandwich: flexibility in the middle, strict checks on the edges.

§ 00 · THE TWO-LAYER SANDWICHWhy pure-agent ETL fails

Hand an LLM your whole pipeline and it’ll map fields beautifully for the first hundred rows, then silently corrupt row 101. Hand it nothing and you’re back to brittle hard-coded parsers. The two-layer sandwichtwo-layer sandwich. An ETL architecture where an LLM agent handles the fuzzy transform in the middle, but is wrapped by deterministic code on both sides: normalization/extraction before, and hard schema validation after. The agent never reaches the output without passing a strict check. keeps the agent’s flexibility and bounds its blast radius.

§ 01 · MESSY SOURCESSame record, two shapes

VENDOR_A = [  # snake_case, full names
    {"customer_name": "Ada Lovelace", "email_address": "ada@calc.io", "country": "UK"},
]
VENDOR_B = [  # camelCase, split names, different country key
    {"firstName": "Grace", "lastName": "Hopper", "mail": "grace@navy.mil", "nation": "United States"},
    {"firstName": "", "lastName": "", "mail": "noname@example.com", "nation": "Canada"},
]

CANONICAL = ["name", "email", "country"]

§ 02 · THE AGENT TRANSFORMThe fuzzy part, isolated

Field mapping is exactly what an LLM is good at: customer_name and firstName + lastName both mean name. We mock it deterministically so the repo runs offline; swap in a constrained model call and nothing downstream changes.

def transform(vendor, record):
    hints = FIELD_HINTS[vendor]          # canonical field -> source keys
    out = {}
    for field, source_keys in hints.items():
        parts = [str(record.get(k, "")).strip() for k in source_keys]
        out[field] = " ".join(p for p in parts if p)
    return out

§ 03 · THE VALIDATION GATEThe layer you never let the model write

def is_valid(record):
    if set(record.keys()) != set(CANONICAL):
        return False, "wrong fields"
    if not all(record[f] for f in CANONICAL):
        return False, "empty field"
    if "@" not in record["email"]:
        return False, "bad email"
    return True, "ok"
$ python pipeline.py
Loaded 3 clean rows:
  {'name': 'Ada Lovelace', 'email': 'ada@calc.io', 'country': 'UK'}
  {'name': 'Grace Hopper',  'email': 'grace@navy.mil', 'country': 'United States'}
  ...
Quarantined 1 bad rows (NOT loaded):
  [empty field] {'firstName': '', 'lastName': '', ...}

The agent mapped four differently-shaped records onto one schema; the gate caught the empty-name row before it polluted the table. The gate is plain, boring, deterministic code — and that’s the point.

CHECKIn the two-layer sandwich, which part should an LLM be responsible for?

§ · FURTHER READINGReferences & deeper sources

  1. dlt Hub (2025). Building data pipelines with LLMs · dlthub.com
  2. Anthropic (2024). Tool use (function calling) and structured output · Anthropic Docs
  3. Shankar et al. (2024). SPADE: Synthesizing Assertions for Large Language Model Pipelines · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.