One-Line Summary: GCS is the GCP-native object store, and with a consistent path convention plus lifecycle rules it functions as the bronze layer of your warehouse without any other moving parts.

Prerequisites: Module 02.

What's the Concept?

A "data lake" sounds like a separate product. In GCP, it's a GCS bucket with conventions. There's no proprietary lake service to provision, no special filesystem to learn. The lake is just bytes in buckets; what makes it a lake is the discipline you bring to organizing those bytes.

The core idea: raw data lands in one well-known bucket (or a small set of buckets), in a deterministic path layout, in immutable files. Anything downstream — BigQuery, Dataflow, dbt, Vertex AI — can read it without coordinating with anything upstream. That write-once, read-many separation is what makes the lake valuable.
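
As a sketch of that contract in practice (using the google-cloud-storage Python client; land_page and the sample records are hypothetical, and the bucket name comes from the layout below), an ingester writes each page to a deterministic path and refuses to overwrite:

import gzip
import json

from google.cloud import storage

def land_page(records, source, entity, ingestion_date, page_num):
    """Write one page of raw records to bronze, write-once."""
    client = storage.Client()
    blob = client.bucket("myco-lake-bronze").blob(
        f"source={source}/entity={entity}/"
        f"ingestion_date={ingestion_date}/page={page_num:05d}.jsonl.gz")
    body = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8"))
    # if_generation_match=0 fails the upload if the object already
    # exists, which is exactly the immutability the lake depends on.
    blob.upload_from_string(
        body, content_type="application/gzip", if_generation_match=0)

land_page([{"id": "ch_1", "amount": 4200}],
          "stripe", "charges", "2026-05-13", page_num=0)

Downstream readers just list objects under the prefix; they never ask the ingester what exists.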

How It Works

A typical small-to-medium production lake is just two buckets:

gs://myco-lake-bronze/        ← raw, as-received
   source=stripe/
      entity=charges/
         ingestion_date=2026-05-13/
            page=00000.jsonl.gz
            page=00001.jsonl.gz
      entity=customers/
         ingestion_date=2026-05-13/
            page=00000.jsonl.gz
   source=salesforce/
      entity=opportunities/
         ...
 
gs://myco-lake-curated/       ← (optional) intermediate Parquet
   silver/
      orders/
         year=2026/month=05/day=13/
            part-000.parquet

The bronze layout encodes everything downstream needs to find data (a small path helper is sketched after the list):

  • source= — where it came from (Stripe, Salesforce, etc.)
  • entity= — what kind of record (charges, customers, orders)
  • ingestion_date= — when we received it (not the same as event timestamp)
  • the file inside is whatever the ingester produced
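
A minimal sketch of that path contract in Python — pure string handling, with hypothetical helper names:

def bronze_prefix(source, entity, ingestion_date):
    """Build the deterministic prefix downstream readers list against."""
    return f"source={source}/entity={entity}/ingestion_date={ingestion_date}/"

def parse_bronze_path(path):
    """Recover the partition keys from any object path under bronze."""
    parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return parts["source"], parts["entity"], parts["ingestion_date"]

assert parse_bronze_path(
    "source=stripe/entity=charges/ingestion_date=2026-05-13/page=00000.jsonl.gz"
) == ("stripe", "charges", "2026-05-13")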

You'll see this pattern called Hive-style partitioning. It's not Hive-specific — BigQuery, Dataflow, Spark, and most modern tools all parse it natively. Using it consistently means downstream code never needs to know the specifics of any one source.
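
For instance, BigQuery can read the layout through an external table. A sketch with the google-cloud-bigquery client (project, dataset, and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = [
    "gs://myco-lake-bronze/source=stripe/entity=charges/*"]
external_config.compression = "GZIP"
external_config.autodetect = True  # infer the record schema from the JSONL

# Path segments after the prefix (here, ingestion_date=...) become
# queryable columns, so readers can filter partitions in plain SQL.
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = (
    "gs://myco-lake-bronze/source=stripe/entity=charges/")
external_config.hive_partitioning = hive_opts

table = bigquery.Table("myco-analytics.bronze.stripe_charges")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)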

The optional curated/ bucket holds Parquet-converted bronze for sources that benefit from columnar reads. You only need it if you run analytical queries directly against the lake; if everything refines into BigQuery, you can skip it.
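
If you do keep curated/, the conversion itself stays small. A sketch with pandas (gcsfs handles the gs:// paths, pyarrow writes the Parquet; the exact paths are illustrative):

import pandas as pd

# One bronze page in, one columnar part out.
df = pd.read_json(
    "gs://myco-lake-bronze/source=stripe/entity=charges/"
    "ingestion_date=2026-05-13/page=00000.jsonl.gz",
    lines=True, compression="gzip")
df.to_parquet(
    "gs://myco-lake-curated/silver/charges/"
    "year=2026/month=05/day=13/part-000.parquet")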

Why It Matters

  • Convention replaces coordination. Anyone who knows the path convention can find data without asking. New downstream consumers don't need new APIs.
  • Storage costs are negligible. Standard storage runs roughly $0.02 per GB-month and Archive roughly $0.0012, so with proper tiering even years of bronze cost almost nothing to keep.
  • The lake is the system's archaeology. Every transformation downstream is replayable as long as bronze exists. This is your insurance policy.

Key Technical Details

  • One project, one or a few buckets. Multi-project sprawl is the most common misstep — you'll regret it when you need cross-source joins.
  • Region: keep bronze in the same region as BigQuery to avoid egress costs. us-central1 and us (multi-region) are the common picks.
  • Lifecycle rules: Standard → Nearline at 30 days, Nearline → Coldline at 90, optional Coldline → Archive at 365 for compliance retention (a configuration sketch follows this list).
  • Object versioning ON for bronze. The cost is small; the safety net is large.
  • Set a retention policy if you have compliance requirements — it prevents accidental deletion at the bucket level.
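
A one-time setup sketch covering the last three items, using the google-cloud-storage Python client (the 10-year retention period is a placeholder, not a recommendation):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("myco-lake-bronze")

# Lifecycle: tier objects down as they age.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)

# Versioning: overwritten or deleted objects keep a noncurrent copy.
bucket.versioning_enabled = True

# Retention: block deletion before the window elapses (compliance only).
bucket.retention_period = 10 * 365 * 24 * 60 * 60  # seconds

bucket.patch()  # one API call pushes all of the above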

Common Misconceptions

"I should use Dataplex / a fancy lake catalog from day one." Catalog services (Dataplex, Data Catalog) are useful at scale — when you have hundreds of datasets and many teams. Skip them until you actually feel the pain of not having them.

"Raw means unstructured." Bronze data has structure; it just hasn't been conformed. Most bronze ends up as JSONL or Parquet, both highly structured. "Raw" here means "as received," not "without schema."

Connections to Other Concepts

  • 02-bucket-layouts-and-partitioning.md — More detail on path conventions and the partition contract.
  • 03-schema-on-read-vs-on-write.md — The strategic question of how much structure to impose on bronze, and when.
  • Course 04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md — The first transformation that reads from this lake.

Further Reading

  • Google Cloud, "Building a data lake on Google Cloud" — Reference architecture guide.
  • "Hive-style partitioning" — Apache documentation on the convention.
  • Lakshmanan & Tigani, "BigQuery: The Definitive Guide" — chapter on external tables explains how BigQuery reads this layout natively.