One-Line Summary: A map of the GCP services this course relies on, what each does, and where it sits in the bronze → silver → gold flow.

Prerequisites: Lesson 03-the-medallion-pattern-bronze-silver-gold.md.

What's the Concept?

GCP has a lot of services, and most data engineering tasks can be done five different ways. To keep this course coherent we pick one canonical option per job. You can substitute later — the patterns transfer — but the lessons assume the stack below.

How It Works

The reference stack for this course:

LAYER             SERVICE                       ROLE IN THE PIPELINE
─────             ───────                       ────────────────────
Ingest            Cloud Run + APIs              custom HTTP pullers
                  Pub/Sub                       event streams + buffering
                  Datastream                    CDC from Postgres/MySQL
                  Storage Transfer Service      bulk file moves into GCS
 
Bronze            Cloud Storage (GCS)           raw, immutable, partitioned
 
Silver / Gold     BigQuery                      SQL-native warehouse
                  Dataflow                      streaming / heavy transforms
                  Dataform (or dbt)             versioned SQL pipelines
 
Embed / Serve     Vertex AI Embeddings API      text-embedding-gecko etc.
                  BigQuery VECTOR + ANN         in-warehouse vector search
                  Vertex AI Vector Search       managed ANN at scale
 
Orchestrate       Cloud Composer (Airflow)      cross-service DAGs
                  Cloud Scheduler + Eventarc    lighter event/cron triggers
 
Operate           Cloud Logging + Monitoring    pipeline observability
                  IAM + VPC-SC                  security perimeter
                  Cloud DLP                     PII detection / redaction

For most of the course you'll touch four services in earnest: GCS (bronze landing), BigQuery (silver / gold + structured retrieval), Vertex AI (embeddings + vector search), and Cloud Composer or Cloud Run (the thing that runs the pipeline). The rest are situational.
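To make "raw, immutable, partitioned" concrete, here is a minimal sketch of how a bronze object name might be built before a file lands in GCS. The `bronze/<source>/<entity>/ingest_date=.../hour=...` layout and the function name `bronze_object_path` are assumptions for illustration, not a convention the course prescribes.

```python
from datetime import datetime, timezone


def bronze_object_path(source: str, entity: str, ingested_at: datetime, filename: str) -> str:
    """Build a Hive-style, date-partitioned object name for a raw landing file.

    The layout is a hypothetical example; any stable, partitioned scheme works.
    """
    # Normalize to UTC so partitions do not depend on the writer's time zone.
    ts = ingested_at.astimezone(timezone.utc)
    return (
        f"bronze/{source}/{entity}/"
        f"ingest_date={ts:%Y-%m-%d}/hour={ts:%H}/{filename}"
    )


path = bronze_object_path(
    "crm",
    "contacts",
    datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc),
    "part-0001.json",
)
print(path)  # bronze/crm/contacts/ingest_date=2024-05-17/hour=09/part-0001.json
```

Because the path is derived only from the source, entity, and ingest time, re-running a load writes to a predictable location and existing objects never need to be edited in place.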

Why It Matters

  • GCP's strongest opinion is that BigQuery is the warehouse. This is unusually concentrated — on AWS the equivalent role splits across Redshift, Athena, and S3 — and it's an advantage. Most of your transformation logic lives in one place.
  • Vertex AI is the embedding + vector tier, not a separate database. Treating it as a managed feature of the warehouse, not a competing system, simplifies the architecture a lot.
  • You can do the entire pipeline on the cheap, then scale. GCS and BigQuery on-demand both have always-free usage tiers, and Vertex AI embeddings are billed per use, so a prototype costs very little. Small production stacks can start under $50/month and grow only when usage justifies it.
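A rough way to reason about "cheap, then scale" is to estimate how much data your queries scan each month. The sketch below does that arithmetic. The price per TiB, the free allowance, the table size, and the query pattern are all assumptions chosen for illustration, so substitute current list prices and your own numbers.

```python
TIB = 1024 ** 4  # bytes in one tebibyte

# Assumptions for illustration only; substitute current list prices.
ASSUMED_PRICE_PER_TIB_USD = 6.25
ASSUMED_FREE_TIB_PER_MONTH = 1.0


def on_demand_query_cost(bytes_scanned, price_per_tib, free_tib=0.0):
    """Cost of scanning `bytes_scanned` in a month, after a free allowance."""
    billable_tib = max(bytes_scanned / TIB - free_tib, 0.0)
    return billable_tib * price_per_tib


table_bytes = 2 * TIB      # hypothetical table with 365 equal daily partitions
queries_per_month = 30     # e.g. one scheduled query per day

# Without partition pruning, every query scans the whole table.
full_scan_bytes = queries_per_month * table_bytes
# With a filter on the partition column, each query reads only the last 7 days.
pruned_bytes = queries_per_month * (table_bytes * 7 // 365)

full_cost = on_demand_query_cost(
    full_scan_bytes, ASSUMED_PRICE_PER_TIB_USD, ASSUMED_FREE_TIB_PER_MONTH
)
pruned_cost = on_demand_query_cost(
    pruned_bytes, ASSUMED_PRICE_PER_TIB_USD, ASSUMED_FREE_TIB_PER_MONTH
)
print(f"full scans:   ${full_cost:,.2f}/month")
print(f"pruned scans: ${pruned_cost:,.2f}/month")
```

Under these assumed numbers, the same daily query moves from hundreds of dollars a month to under a dollar once it reads only the partitions it needs.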

Key Technical Details

  • BigQuery's on-demand pricing charges for the bytes each query scans (priced per TiB), not for how large the stored table is; storage is billed separately. Partitioning + clustering is how you keep that bill rational.
  • Pub/Sub delivers at-least-once by default, so the same message can arrive more than once; downstream processors need to be idempotent (we'll cover this in Module 02).
  • Cloud Composer is a managed Airflow, with the same DAG semantics. It has cold-start latency and a non-trivial minimum cost; for small pipelines, Cloud Scheduler + Cloud Run is often cheaper.
  • Vertex AI's text-embedding-gecko produces 768-dim vectors. BigQuery has no dedicated VECTOR column type: embeddings are stored as ARRAY<FLOAT64>, support up to 2048 dims, and are indexed with IVF or TreeAH (the ScaNN-based index type).
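The at-least-once point above is the one that most often bites in practice, so here is a minimal in-memory sketch of an idempotent consumer. The class name `IdempotentSink` and the `event_id` field are hypothetical. A real pipeline would keep the seen-key state somewhere durable (for example, by merging on a key in the warehouse) rather than in a Python set.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """Applies each event at most once, keyed by a stable event ID."""

    seen_ids: set = field(default_factory=set)
    rows: list = field(default_factory=list)

    def handle(self, payload: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = payload["event_id"]  # hypothetical business key set by the producer
        if event_id in self.seen_ids:
            return False  # redelivery: safe to acknowledge and skip
        self.rows.append(payload)
        self.seen_ids.add(event_id)
        return True


sink = IdempotentSink()
deliveries = [
    {"event_id": "evt-1", "amount": 10},
    {"event_id": "evt-2", "amount": 25},
    {"event_id": "evt-1", "amount": 10},  # simulated duplicate delivery
]
applied = [sink.handle(msg) for msg in deliveries]
print(applied)         # [True, True, False]
print(len(sink.rows))  # 2
```

Keying on an ID chosen by the producer, rather than on arrival order or timestamps, means a replayed message produces the same end state no matter how many times it is delivered.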

Common Misconceptions

"You need Dataflow for everything." You don't. Dataflow is the right tool for high-volume streaming or complex stateful transforms. For most batch SQL transforms, BigQuery itself plus Dataform/dbt is simpler and cheaper.

"You need a separate vector database." Not at the scale of typical agent workloads. BigQuery's native vector search handles tens of millions of vectors comfortably; you only graduate to Vertex AI Vector Search when query latency or scale forces it.
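For intuition about what any vector search tier is doing, the sketch below implements exact (brute-force) cosine top-k in plain Python and estimates raw storage for a corpus of 768-dim vectors. It is not how BigQuery or Vertex AI implement search; ANN indexes approximate this result while avoiding a full scan. The 8-bytes-per-value figure assumes embeddings stored as 64-bit floats, and the tiny two-dimensional corpus is made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k):
    """Exact nearest neighbours: score every vector, keep the k best IDs."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def raw_vector_bytes(num_vectors, dims=768, bytes_per_value=8):
    """Raw size of the embedding values alone, ignoring keys and metadata."""
    return num_vectors * dims * bytes_per_value


corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], corpus, k=2))  # ['a', 'c']
print(raw_vector_bytes(10_000_000))    # 61440000000
```

Ten million 768-dim vectors come to roughly 61 GB of raw values under that assumption, which is ordinary warehouse-table territory and one reason an in-warehouse index is a reasonable starting point.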

Connections to Other Concepts

  • Course 02-ingestion-patterns/* — Detailed coverage of the ingest services.
  • Course 05-serving-data-to-agents/* — Vertex AI Embeddings + BigQuery ANN as the retrieval tier.
  • Course 06-pipeline-orchestration/* — Composer, Dataflow, Eventarc compared.

Further Reading

  • Google Cloud, "Data Analytics" product family overview.
  • "Google Cloud Reference Architectures: Data lake to data warehouse" — Official narrative for this stack.
  • Lak Lakshmanan & Jordan Tigani, "Google BigQuery: The Definitive Guide" (O'Reilly) — Deep reference on the warehouse layer.