One-Line Summary: A curated, annotated reading list — the books, papers, blog posts, and GCP product launches that are actively reshaping how data engineering for AI agents is done as of 2026.
Prerequisites: Comfort with the rest of the course is helpful but not required — this lesson stands alone as a reference.
What's the Concept?
The course teaches a stack that works today. The literature below tells you where the field is heading. Some entries are foundational (you can't really understand the stack without them); some are leading edge (2024–2026 work that's still settling); some are pragmatic blog posts that captured a hard-won lesson better than any paper. Each is annotated with which module it complements.
Use this as a syllabus for the next month of self-study, or as a reference when you hit a problem the course didn't explicitly cover.
Foundational Textbooks
These are the "if you only read four books on the topic" picks. Modern in framing, durable in content.
- Joe Reis & Matt Housley, Fundamentals of Data Engineering (O'Reilly, 2022) — The current canonical textbook. Replaces the older Kimball as the default reference for the modern data-engineer's mental model. Maps cleanly onto the bronze/silver/gold approach we use. Complements: Modules 01, 03, 04.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — Still the deepest, clearest treatment of the systems-level concerns: replication, partitioning, consistency, batch vs. stream. Chapters 10–11 are required reading before you operate any production pipeline. Complements: Modules 02, 04, 06.
- Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems (O'Reilly, 2018) — Co-authored by the people who built Pub/Sub and Dataflow at Google. The conceptual model for windows, watermarks, and triggers that all modern streaming systems use. Complements: Modules 02, 06.
- Lakshmanan & Tigani, Google BigQuery: The Definitive Guide (O'Reilly, 2019) — Dated in API specifics but still the best deep-dive on the warehouse internals. Skip the chapters that mention deprecated features; the cost/perf/architecture chapters age well. Complements: Modules 04, 07.
Modern Data-Platform Practice (2022–2026)
The discipline has settled on a few specific practices that didn't exist clearly five years ago.
- Chad Sanderson, Driving Data Quality with Data Contracts (Packt, 2023) — The book that made "data contracts" a first-class engineering term. The framing of producer-consumer interfaces with schema and SLA matches what Module 05's retrieval contract is doing. Complements: Module 05, especially 04-the-retrieval-contract-between-pipeline-and-agent.
- Zhamak Dehghani, Data Mesh (O'Reilly, 2022) — The decentralization argument. Whether or not you adopt mesh wholesale, the "data as a product" framing changes how you think about gold tables and tool contracts. Complements: Modules 03, 05.
- Maxime Beauchemin, "Functional Data Engineering" (2018 essay) — The original argument for immutable, idempotent, partition-keyed pipelines. Free online; under 4,000 words; load-bearing for the rest of the field. Complements: Module 04, especially 04-incremental-and-idempotent-pipelines.
- dbt Labs, dbt Best Practices Guide — Community-maintained, kept current. Worth re-reading annually. Complements: 04-refinement-in-bigquery/03-dbt-for-versioned-transforms.
- DAMA International, DAMA-DMBOK2 (2017) — Reference manual, not a read-through. Look up "data governance," "master data management," "metadata management" as needed. Complements: 03-the-raw-data-lake/04-data-governance-from-day-one.
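Beauchemin's core idea — immutable, idempotent, partition-keyed loads — fits in a few lines. This is a toy in-memory sketch (the dict stands in for a date-partitioned warehouse table; all names are illustrative, not from the essay):

```python
from datetime import date

# Toy in-memory "warehouse": partition key -> rows. In production this
# would be a date-partitioned BigQuery table.
warehouse: dict[date, list[dict]] = {}

def load_partition(run_date: date, source_rows: list[dict]) -> None:
    """Idempotent load: each run fully overwrites its own partition.

    Re-running the same date produces the same state (no appends,
    no double counting) -- the core of the functional argument.
    """
    transformed = [
        {**row, "ds": run_date.isoformat()}  # stamp the partition key
        for row in source_rows
        if row.get("amount", 0) >= 0         # pure, deterministic filter
    ]
    warehouse[run_date] = transformed        # overwrite, never append

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]
load_partition(date(2026, 1, 1), rows)
load_partition(date(2026, 1, 1), rows)  # replay is safe: same result
```

The overwrite-not-append move is what makes backfills and retries boring, which is the point.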
Retrieval Research (2020–2026)
Embedding-based retrieval is the part of the stack moving fastest. These are the papers and posts shaping what's possible in 2026.
- Anthropic, "Introducing Contextual Retrieval" (Sept 2024) — The single most actionable improvement to RAG in recent memory. Before chunking, prepend a model-generated 50–100 token "context" that locates the chunk in the broader document. Reduces retrieval failures by ~49% on Anthropic's benchmarks. Should be the default for any production agent corpus. Complements: Module 05, 02-semantic-retrieval-embeddings-and-vector-search. https://www.anthropic.com/research/contextual-retrieval
- Anthropic, "Building Effective Agents" (Dec 2024) — Anthropic's canonical post on agent patterns: prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer. The data-engineering implications — what retrieval each pattern needs — are explicit. Complements: Module 05. https://www.anthropic.com/research/building-effective-agents
- Anthropic, "Natural Language Autoencoders" (May 2026) — The new interpretability technique that translates model activations into English. Relevant here because it's a glimpse at where retrieval is headed: "find activations like this thought" instead of "find rows like this text." Complements: Module 05. https://transformer-circuits.pub/2026/nla/
- Microsoft Research, "GraphRAG" (Edge et al., 2024) — Constructs an entity-relationship graph from the corpus at index time; query-time retrieval traverses the graph as well as the vector index. Stronger on multi-hop questions; expensive to build. Complements: Module 05, especially the hybrid-retrieval lesson. https://arxiv.org/abs/2404.16130
- Niklas Muennighoff et al., "MTEB: Massive Text Embedding Benchmark" (2022, continuously updated leaderboard) — How you actually pick an embedding model in 2026. The leaderboard at huggingface.co/spaces/mteb/leaderboard is the source of truth; rerun your shortlist on your own corpus before committing. Complements: 05/02-semantic-retrieval-embeddings-and-vector-search.
- Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (2020) — Foundational. Establishes the dual-encoder pattern still used by most production embedding models. Complements: Module 05. https://arxiv.org/abs/2004.04906
- Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" (2020) — Late-interaction retrieval. More accurate than single-vector embeddings, more expensive. Worth knowing about; not yet the default. Complements: Module 05. https://arxiv.org/abs/2004.12832
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — Why retrieval ranking matters even when everything fits in context. Information in the middle of a long prompt gets attended to less. Let it inform your top-k tuning. Complements: 05/03-hybrid-retrieval-structured-plus-semantic. https://arxiv.org/abs/2307.03172
- Cohere, "Rerank v3" docs / Voyage AI "rerank-2" — Cross-encoder rerankers that go on top of an ANN candidate set. Two-stage retrieval (cheap ANN → expensive rerank) is the production pattern when relevance matters more than latency. Complements: Module 05.
- Anthropic, "Prompt caching" docs (2024) — Not retrieval research per se, but it changes the cost calculus. Caching the system prompt + tool definitions makes large prompts cheap on repeat calls — relevant to how aggressively you can pre-stuff agent context. Complements: Modules 05, 07.
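Contextual Retrieval, from the first entry above, is simple enough to sketch. This toy version stubs the context-generation step (in production it is a cheap LLM call per chunk); the function names are illustrative, not Anthropic's API:

```python
def generate_context(document_title: str, chunk: str) -> str:
    """Stub for the model-generated context. In production this is an
    LLM call ("situate this chunk within the document"); here we fake
    it with the document title so the sketch runs offline."""
    return f"From '{document_title}'. "

def contextualize_chunks(document_title: str, chunks: list[str]) -> list[str]:
    # Prepend a short locating context to every chunk BEFORE embedding
    # and BEFORE building any BM25 index -- both indexes see the
    # augmented text; the agent can still be shown the original chunk.
    return [generate_context(document_title, c) + c for c in chunks]

chunks = ["Revenue grew 3% over the prior quarter.",
          "Headcount was flat."]
augmented = contextualize_chunks("ACME Q2 2025 10-Q", chunks)
```

The payoff is that a chunk like "Revenue grew 3%" becomes retrievable for queries about ACME's Q2, which the bare chunk never mentions.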
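The two-stage pattern behind the cross-encoder rerankers can also be sketched offline. Both scorers here are deliberately crude stand-ins — a real system uses an ANN index for stage 1 and a reranker model for stage 2 — and every name is hypothetical:

```python
def cheap_score(query: str, doc: str) -> float:
    """Stage 1 stand-in for an ANN index: fast, approximate.
    Here: fraction of query words appearing anywhere in the doc."""
    words = query.lower().split()
    return sum(w in doc.lower() for w in words) / len(words)

def expensive_rerank(query: str, doc: str) -> float:
    """Stage 2 stand-in for a cross-encoder: slower, sharper.
    Here: an exact-phrase bonus on top of the cheap score."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return cheap_score(query, doc) + bonus

def two_stage_search(query: str, corpus: list[str],
                     candidates: int = 3, final_k: int = 1) -> list[str]:
    # Stage 1: score everything cheaply, keep a small candidate set.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:candidates]
    # Stage 2: rerank only the shortlist with the expensive scorer.
    return sorted(shortlist, key=lambda d: expensive_rerank(query, d),
                  reverse=True)[:final_k]

corpus = ["refund policy for damaged goods",
          "shipping times by region",
          "goods damaged in transit: refund within 30 days"]
top = two_stage_search("refund policy", corpus)
```

The shape is what matters: the expensive scorer only ever sees the small candidate set, which is why this is affordable at production scale.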
Streaming and Lakehouse Architecture
The boundary between "warehouse" and "lake" is dissolving fast, especially on GCP.
- Google Cloud, "BigQuery managed Apache Iceberg tables" (GA 2024) — Iceberg is now first-class inside BigQuery, with full DML, time-travel, and external-engine compatibility. Means you can have one storage layer that both BigQuery and Spark/Trino read natively. The "lake-house" pattern, productized. Complements: Modules 03, 04.
- Google Cloud, "BigQuery Continuous Queries" (GA 2024) — Streaming SQL inside BigQuery itself, no Dataflow needed for many use cases. SQL that runs forever against an arriving stream. Reduces the number of pipelines that need to step out to Dataflow. Complements: Module 06, 02-dataflow-for-heavy-transforms.
- Google Cloud, "Pub/Sub to BigQuery direct subscriptions" (GA 2023) — No-code streaming ingest. Should be the default for raw bronze landing of event streams. Complements: 02-ingestion-patterns/02-event-streams-with-pub-sub.
- Tyler Akidau, "The Beam Model" (blog, 2015–present) — Akidau keeps writing about streaming model evolution; the most recent posts cover lambda → kappa → unified, and how managed services have changed the calculus. Complements: Module 06.
- Armbrust et al., "Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores" (VLDB, 2020) — The Delta Lake paper; the conceptual sibling of Iceberg. Even if you don't use Delta, the paper is the clearest explanation of how transactional semantics layer over object storage. Complements: Module 03.
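The event-time windowing idea at the heart of Streaming Systems can be shown in a few lines. A minimal sketch, assuming tumbling (fixed-size, non-overlapping) windows and ignoring watermarks and triggers entirely:

```python
from collections import defaultdict

def tumbling_window_counts(events: list[tuple[int, str]],
                           window_secs: int = 60) -> dict[int, int]:
    """Assign each event to a fixed-size event-time window and count.

    `events` are (event_time_seconds, payload) pairs. Windowing by
    event time (when it happened) rather than processing time (when
    it arrived) is the distinction the whole book builds on.
    """
    counts: dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_secs) * window_secs
        counts[window_start] += 1
    return dict(counts)

# Out-of-order arrival: the 30s event shows up last in the stream but
# still lands in the [0, 60) window -- event time decides, not order.
events = [(65, "a"), (10, "b"), (130, "c"), (30, "d")]
counts = tumbling_window_counts(events)
```

Watermarks and triggers are what tell a real system when a window like [0, 60) is "done" despite stragglers; this sketch sidesteps that, which is exactly the hard part the book covers.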
GCP-Specific Product Launches Worth Tracking
The GCP data stack is evolving meaningfully every quarter; these are the launches that materially change the course's recommended architecture.
- Vertex AI Vector Search — Formerly Matching Engine, now integrated end-to-end with Vertex AI Embeddings. Use it when you outgrow BigQuery's native VECTOR_SEARCH (~10M+ vectors with strict latency SLOs).
- text-embedding-005 and gemini-embedding-001 — The current Vertex AI embedding models as of 2026. text-embedding-005 is the cheap workhorse; gemini-embedding-001 is higher-quality and multilingual. Both are directly callable from BigQuery via ML.GENERATE_EMBEDDING.
- Dataform native in BigQuery — Now free, with no separate billing, and integrated with Cloud Source Repositories and GitHub. The lowest-friction option for SQL-only pipelines.
- Vertex AI Agent Builder — Google's first-party agent framework. Calls tools the same way other frameworks do; this course's retrieval-tool contract pattern works against it unchanged.
- BigFunctions / Remote Functions — Lets you call external services (including non-Vertex AI models) from BigQuery SQL. Useful for in-warehouse enrichment without standing up a Dataflow job.
- Cloud Workflows — Worth knowing as a Composer alternative for orchestration patterns that don't need a full Airflow. Cheaper for low-volume pipelines.
- Datastream for BigQuery — CDC from Postgres/MySQL/Oracle straight into BigQuery, no GCS intermediary required. The course recommends GCS intermediary for replay-ability, but the direct path is faster and equally fresh.
Production Blogs and Newsletters
The fastest way to stay current. Subscribe to two or three.
- Maxime Beauchemin's writing (Airflow's creator, Preset cofounder) — Almost everything he writes about the data engineering profession is worth reading.
- Locally Optimistic (locallyoptimistic.com) — Analytics engineering thinking from the trenches; less about tools, more about how teams actually work.
- Monte Carlo / Synq / Acryl blogs — Data observability vendors; vendor-flavored but consistently the best material on data quality monitoring.
- The MAD (Machine Learning, AI, Data) Landscape (Matt Turck, mad.firstmark.com) — Annual landscape of the entire space. Useful for spotting which categories are consolidating vs. fragmenting.
- Tristan Handy (dbt Labs founder), Analytics Engineering Roundup — Monthly newsletter; the closest thing to a "state of the discipline" digest.
- Pinterest / Netflix / Uber engineering blogs — Three of the few teams that publish substantive data-platform writing. Pinterest's posts on Iceberg adoption, Netflix's on event sourcing, and Uber's on their internal warehouse are all worth reading once.
Agent + Data Intersection
The papers and writings most directly relevant to how data engineering meets agent design.
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2022) — The reason-act-observe loop that every modern agent runs. Worth understanding before designing any retrieval tool. Complements: Module 05. https://arxiv.org/abs/2210.03629
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023) — The training-time perspective on tool use. Even if you're not training, the framing of "tools as a controllable distribution over actions" is useful. Complements: Module 05.
- Anthropic Tool Use Cookbook — Practical, current; the closest thing to a reference for "what does a well-shaped tool actually look like." Updated alongside model releases.
- Vertex AI Agent Builder docs — Google's tool-use framework. Same conceptual shape; different SDK.
- LangChain & LlamaIndex docs — Practical patterns for chunking, retrieval composition, and agent orchestration. Move fast; check publish dates.
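The ReAct reason-act-observe loop is worth internalizing in code. A toy sketch with both the model and the retrieval tool stubbed out — all names are illustrative; a real agent replaces fake_model with an LLM call and retrieval_tool with your retrieval contract:

```python
def retrieval_tool(query: str) -> str:
    """Stub tool: in a real agent this is your retrieval contract
    (structured + semantic search over the gold layer)."""
    kb = {"capital of france": "Paris"}
    return kb.get(query.lower(), "no results")

def fake_model(history: list[str]) -> str:
    """Stub model: emits one tool call, then answers from the
    observation. A real agent is an LLM call here."""
    if not any(h.startswith("Observation:") for h in history):
        return "Action: retrieval_tool[capital of france]"
    last_obs = [h for h in history if h.startswith("Observation:")][-1]
    return f"Answer: {last_obs.removeprefix('Observation: ')}"

def react_loop(question: str, max_steps: int = 3) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = fake_model(history)                 # reason
        if step.startswith("Action:"):
            query = step.split("[", 1)[1].rstrip("]")
            obs = retrieval_tool(query)            # act
            history.append(f"Observation: {obs}")  # observe
        else:
            return step.removeprefix("Answer: ")
    return "gave up"

answer = react_loop("What is the capital of France?")
```

Notice that the tool's output goes straight into the model's context on the next step — which is why the shape and size of what your retrieval tool returns is a data-engineering decision, not just an agent-design one.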
How to Use This Reading List
- If you've never built a production pipeline: Reis & Housley + Kleppmann chapters 10–11 + Beauchemin's "Functional Data Engineering" essay. That's ~600 pages of reading and it sets up everything else.
- If you're shipping an agent next month: Skip to Anthropic's "Contextual Retrieval" and "Building Effective Agents" posts, plus the Vertex AI Agent Builder docs. ~3 hours; pays for itself immediately.
- If you're debugging a slow / wrong retrieval: "Lost in the Middle," the MTEB leaderboard, and the Cohere/Voyage reranker docs. The combination of those three usually explains 80% of "the agent doesn't find the right answer."
- If you're operating a pipeline that's grown beyond toy size: Monte Carlo / Synq blogs for observability, plus Akidau's recent streaming posts for the cost/freshness trade-offs.
A Note on Half-Lives
References don't age uniformly. Foundational papers (DPR, ColBERT, ReAct) and systems-level books (Kleppmann, Akidau) age slowly — measure their relevance in decades. Vendor docs and blog posts age in months — re-check publish dates before relying on them. The MTEB leaderboard updates weekly; "best embedding model" is a moving target.
The course's architecture (bronze → silver → gold; structured + semantic retrieval; tools as contracts) is built on the slow-aging ideas. The specific GCP services and embedding models will rotate; the shape of the pipeline shouldn't.
Connections to Other Concepts
- Every preceding lesson — this lesson is the appendix to the whole course.
- The Brain Drip on nla (Natural Language Autoencoders) — covers the interpretability research that's adjacent to retrieval.
- The Brain Drip course on Multi-Skill Agents — the agent-side counterpart to this data-side course.
Further Reading
This whole lesson is further reading. The four annotated foundational textbooks at the top are the right place to start if you want a structured month of study.