One-Line Summary: GCS is the GCP-native object store, and with a consistent path convention plus lifecycle rules it functions as the bronze layer of your warehouse without any other moving parts.
Prerequisites: Module 02.
What's the Concept?
A "data lake" sounds like a separate product. In GCP, it's a GCS bucket with conventions. There's no proprietary lake service to provision, no special filesystem to learn. The lake is just bytes in buckets; what makes it a lake is the discipline you bring to organizing those bytes.
The core idea: raw data lands in one well-known bucket (or a small set of buckets), in a deterministic path layout, in immutable files. Anything downstream — BigQuery, Dataflow, dbt, Vertex AI — can read it without coordinating with anything upstream. That separation of write-once, read-many is what makes the lake valuable.
How It Works
A typical small-to-medium production lake is just two buckets:
```
gs://myco-lake-bronze/                      ← raw, as-received
  source=stripe/
    entity=charges/
      ingestion_date=2026-05-13/
        page=00000.jsonl.gz
        page=00001.jsonl.gz
    entity=customers/
      ingestion_date=2026-05-13/
        page=00000.jsonl.gz
  source=salesforce/
    entity=opportunities/
      ...

gs://myco-lake-curated/                     ← (optional) intermediate Parquet
  silver/
    orders/
      year=2026/month=05/day=13/
        part-000.parquet
```

The bronze layout encodes everything downstream needs to find data:

- `source=` — where it came from (Stripe, Salesforce, etc.)
- `entity=` — what kind of record (charges, customers, orders)
- `ingestion_date=` — when we received it (not the same as the event timestamp)
- the file inside is whatever the ingester produced
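Because the layout is deterministic, an ingester can compute every object path up front. A minimal sketch of such a path builder (the function name `bronze_object_path` is illustrative, not from this module):

```python
def bronze_object_path(source: str, entity: str, ingestion_date: str, page: int) -> str:
    """Build a deterministic bronze object path:
    source=<s>/entity=<e>/ingestion_date=<YYYY-MM-DD>/page=<NNNNN>.jsonl.gz
    """
    return (
        f"source={source}/entity={entity}/"
        f"ingestion_date={ingestion_date}/page={page:05d}.jsonl.gz"
    )

print(bronze_object_path("stripe", "charges", "2026-05-13", 0))
# source=stripe/entity=charges/ingestion_date=2026-05-13/page=00000.jsonl.gz
```

Zero-padding the page number keeps lexicographic listing order equal to ingestion order, which is why the example files above are named `00000`, `00001`, and so on.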
You'll see this pattern called Hive-style partitioning. It's not Hive-specific — BigQuery, Dataflow, Spark, and most modern tools all parse it natively. Using it consistently means downstream code never needs to know the specifics of any one source.
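To make the convention concrete, here is a minimal sketch of recovering the `key=value` partition segments from an object path, the same parse that BigQuery and Spark do natively (the function name `parse_partitions` is hypothetical):

```python
def parse_partitions(object_path: str) -> dict:
    """Extract Hive-style key=value path segments, skipping the data file itself."""
    parts = {}
    for segment in object_path.split("/"):
        # Data files like page=00000.jsonl.gz also contain '=' — exclude them.
        if "=" in segment and not segment.endswith((".gz", ".parquet")):
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

path = "source=stripe/entity=charges/ingestion_date=2026-05-13/page=00000.jsonl.gz"
print(parse_partitions(path))
# {'source': 'stripe', 'entity': 'charges', 'ingestion_date': '2026-05-13'}
```

Note that the parser knows nothing about Stripe or Salesforce; the path alone carries the metadata, which is the whole point of the convention.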
The optional curated/ bucket holds Parquet-converted bronze for sources that benefit from columnar reads. You only need it if you're doing analytical queries directly against bronze; if everything refines into BigQuery, you can skip it.
Why It Matters
- Convention replaces coordination. Anyone who knows the path convention can find data without asking. New downstream consumers don't need new APIs.
- Storage costs are negligible. Standard storage lists at roughly $0.02/GB-month and Archive at roughly $0.0012/GB-month, so a terabyte of bronze runs about $20/month on Standard, and a small fraction of that with proper tiering.
- The lake is the system's archaeology. Every transformation downstream is replayable as long as bronze exists. This is your insurance policy.
Key Technical Details
- One project, one or a few buckets. Multi-project sprawl is the most common mis-step — you'll regret it when you need cross-source joins.
- Region: keep bronze in the same region as BigQuery to avoid egress costs. `us-central1` and `us` (multi-region) are the common picks.
- Lifecycle rules: Standard → Nearline at 30 days, Nearline → Coldline at 90, optional Coldline → Archive at 365 for compliance retention.
- Object versioning ON for bronze. The cost is small; the safety net is large.
- Set a retention policy if you have compliance requirements — it prevents accidental deletion at the bucket level.
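As a sketch, the lifecycle schedule above maps onto a GCS lifecycle configuration like the following (the action and condition field names come from the GCS JSON API; the file name `lifecycle.json` is illustrative). You would apply it with `gsutil lifecycle set lifecycle.json gs://myco-lake-bronze`:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 365, "matchesStorageClass": ["COLDLINE"]}
    }
  ]
}
```

The `matchesStorageClass` condition keeps each rule from re-firing on objects that have already been tiered down.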
Common Misconceptions
"I should use Dataplex / a fancy lake catalog from day one." Catalog services (Dataplex, Data Catalog) are useful at scale — when you have hundreds of datasets and many teams. Skip them until you actually feel the pain of not having them.
"Raw means unstructured." Bronze data has structure; it just hasn't been conformed. Most bronze ends up as JSONL or Parquet, both highly structured. "Raw" here means "as received," not "without schema."
Connections to Other Concepts
- `02-bucket-layouts-and-partitioning.md` — More detail on path conventions and the partition contract.
- `03-schema-on-read-vs-on-write.md` — The strategic question behind keeping bronze structureless or not.
- Course `04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md` — The first transformation that reads from this lake.
Further Reading
- Google Cloud, "Building a data lake on Google Cloud" — Reference architecture guide.
- "Hive-style partitioning" — Apache documentation on the convention.
- Lakshmanan & Tigani, "BigQuery: The Definitive Guide" — chapter on external tables explains how BigQuery reads this layout natively.