The GCP Data Stack at a Glance

What's the Concept?

GCP has a lot of services, and most data engineering tasks can be done five different ways. To keep this course coherent we pick one canonical option per job. You can substitute later — the patterns transfer — but the lessons assume the stack below.

How It Works

The reference stack for this course:

LAYER             SERVICE                       ROLE IN THE PIPELINE
─────             ───────                       ────────────────────
Ingest            Cloud Run + APIs              custom HTTP pullers
                  Pub/Sub                       event streams + buffering
                  Datastream                    CDC from Postgres/MySQL
                  Storage Transfer Service      bulk file moves into GCS
 
Bronze            Cloud Storage (GCS)           raw, immutable, partitioned
 
Silver / Gold     BigQuery                      SQL-native warehouse
                  Dataflow                      streaming / heavy transforms
                  Dataform (or dbt)             versioned SQL pipelines
 
Embed / Serve     Vertex AI Embeddings API      text-embedding-gecko etc.
                  BigQuery VECTOR + ANN         in-warehouse vector search
                  Vertex AI Vector Search       managed ANN at scale
 
Orchestrate       Cloud Composer (Airflow)      cross-service DAGs
                  Cloud Scheduler + Eventarc    lighter event/cron triggers
 
Operate           Cloud Logging + Monitoring    pipeline observability
                  IAM + VPC-SC                  security perimeter
                  Cloud DLP                     PII detection / redaction

For most of the course you'll touch four services in earnest: GCS (bronze landing), BigQuery (silver / gold + structured retrieval), Vertex AI (embeddings + vector search), and Cloud Composer or Cloud Run (the thing that runs the pipeline). The rest are situational.

Why It Matters

GCP's strongest opinion is that BigQuery is the warehouse. This is unusually concentrated — on AWS the equivalent role splits across Redshift, Athena, and S3 — and it's an advantage. Most of your transformation logic lives in one place.
Vertex AI is the embedding + vector tier, not a separate database. Treating it as a managed feature of the warehouse, not a competing system, simplifies the architecture a lot.
You can do the entire pipeline on the cheap, then scale. GCS and BigQuery on-demand both have free usage tiers, and Vertex AI embeddings are billed by input volume, so small workloads cost little. A small stack that avoids always-on components such as Cloud Composer can start under $50/month and only grow when usage justifies it.
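
As a rough illustration of the cost point above, here is a back-of-the-envelope sketch of a monthly BigQuery on-demand query bill. The price per TiB and the free monthly allowance are assumed placeholder values, not quoted from this course, and the 200 GiB table is hypothetical; check the current pricing page before relying on any number.

```python
TIB = 2**40  # bytes in one tebibyte


def on_demand_query_cost(bytes_scanned_per_month, price_per_tib=6.25, free_tib=1.0):
    """Estimate a monthly on-demand query bill in USD.

    price_per_tib and free_tib are ASSUMED values for illustration only.
    """
    tib_scanned = bytes_scanned_per_month / TIB
    billable_tib = max(0.0, tib_scanned - free_tib)  # free allowance comes off first
    return billable_tib * price_per_tib


# Hypothetical workload: a 200 GiB table queried 30 times per month.
table_bytes = 200 * 2**30

# Every query scans the whole table.
full_scans = on_demand_query_cost(30 * table_bytes)

# Same 30 queries, but a date-partition filter lets each read about 1/30 of the table.
pruned = on_demand_query_cost(30 * table_bytes / 30)

print(f"full scans: ${full_scans:.2f}, partition-pruned: ${pruned:.2f}")
```

Under these assumptions the unpruned workload scans about 5.9 TiB per month, while the pruned one stays inside the assumed free allowance. The exact figures matter less than the shape: the bill tracks bytes read, not rows returned.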

Key Technical Details

BigQuery's on-demand pricing is per TiB scanned by each query, not per table stored (storage is billed separately). Partitioning and clustering reduce the bytes a query has to read, which is how you keep that bill rational.
Pub/Sub delivers at least once by default, so the same message can arrive more than once; downstream processors need to be idempotent (we'll cover this in Module 02).
Cloud Composer is managed Airflow, with the same DAG semantics. It has cold-start latency and a non-trivial minimum cost; for small pipelines, Cloud Scheduler + Cloud Run is often cheaper.
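Vertex AI's text-embedding-gecko produces 768-dim vectors. BigQuery has no dedicated VECTOR type: embeddings of up to 2048 dims are stored in ARRAY<FLOAT64> columns, and a vector index can use the IVF or TREE_AH index type, the latter built on ScaNN.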
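
The at-least-once point above is easier to see in code. This is a minimal sketch of an idempotent consumer, assuming deduplication on a message ID. `Message` and `IdempotentSink` are hypothetical stand-ins written for this example, not types from the Pub/Sub client library, and the in-memory set stands in for whatever durable store a real pipeline would use.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for a delivered message (not the Pub/Sub client type)."""
    message_id: str
    data: dict


@dataclass
class IdempotentSink:
    """Applies each message at most once, keyed on message_id."""
    seen_ids: set = field(default_factory=set)   # assumption: durable in real use
    totals: dict = field(default_factory=dict)

    def handle(self, msg: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if msg.message_id in self.seen_ids:
            return False  # redelivery: acknowledge it, but do not apply it again
        user = msg.data["user"]
        self.totals[user] = self.totals.get(user, 0) + msg.data["amount"]
        self.seen_ids.add(msg.message_id)
        return True


sink = IdempotentSink()
deliveries = [
    Message("m1", {"user": "ana", "amount": 5}),
    Message("m2", {"user": "ana", "amount": 7}),
    Message("m1", {"user": "ana", "amount": 5}),  # simulated redelivery of m1
]
results = [sink.handle(m) for m in deliveries]
print(results, sink.totals)
```

Without the ID check, the redelivered message would be counted twice. Keying on a business identifier carried in the payload is often the sturdier choice, since a publisher-side retry can produce a new message ID for the same event.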

Common Misconceptions

"You need Dataflow for everything." You don't. Dataflow is the right tool for high-volume streaming or complex stateful transforms. For most batch SQL transforms, BigQuery itself plus Dataform/dbt is simpler and cheaper.

"You need a separate vector database." Not at the scale of typical agent workloads. BigQuery's native vector search handles tens of millions of vectors comfortably; you only graduate to Vertex AI Vector Search when query latency or scale forces it.

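For intuition about what vector search computes, here is a small sketch of exact (brute-force) nearest-neighbour search by cosine distance. It is not BigQuery's or Vertex AI's implementation: an ANN index approximates this result without comparing the query against every stored vector. The 3-dim toy vectors and document IDs are made up for the example and stand in for real 768-dim embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Exact search: score every vector, then keep the k closest IDs."""
    scored = [(cosine_distance(query, vec), doc_id) for doc_id, vec in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored)[:k]]


# Toy corpus with hypothetical document IDs.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

nearest = top_k([1.0, 0.05, 0.0], corpus, k=2)
print(nearest)
```

The full scan costs time proportional to the number of stored vectors, which is why an index starts to matter as the corpus and latency requirements grow.
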
Connections to Other Concepts

Ingestion Patterns — Detailed coverage of the ingest services.
Serving Data To Agents — Vertex AI Embeddings + BigQuery ANN as the retrieval tier.
Pipeline Orchestration — Composer, Dataflow, Eventarc compared.