One-Line Summary: GCS is the GCP-native object store, and with a consistent path convention plus lifecycle rules it functions as the bronze layer of your warehouse without any other moving parts.

Prerequisites: Module 02.

What's the Concept?

A "data lake" sounds like a separate product. In GCP, it's a GCS bucket with conventions. There's no proprietary lake service to provision, no special filesystem to learn. The lake is just bytes in buckets; what makes it a lake is the discipline you bring to organizing those bytes.

The core idea: raw data lands in one well-known bucket (or a small set of buckets), in a deterministic path layout, in immutable files. Anything downstream — BigQuery, Dataflow, dbt, Vertex AI — can read it without coordinating with anything upstream. That write-once, read-many separation is what makes the lake valuable.
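
As a sketch of that contract in practice (using the google-cloud-storage Python client; land_page and the sample records are hypothetical, and the bucket name comes from the layout below), an ingester writes each page to a deterministic path and refuses to overwrite:

import gzip
import json

from google.cloud import storage

def land_page(records, source, entity, ingestion_date, page_num):
    """Write one page of raw records to bronze, write-once."""
    client = storage.Client()
    blob = client.bucket("myco-lake-bronze").blob(
        f"source={source}/entity={entity}/"
        f"ingestion_date={ingestion_date}/page={page_num:05d}.jsonl.gz")
    body = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8"))
    # if_generation_match=0 fails the upload if the object already
    # exists, which is exactly the immutability the lake depends on.
    blob.upload_from_string(
        body, content_type="application/gzip", if_generation_match=0)

land_page([{"id": "ch_1", "amount": 4200}],
          "stripe", "charges", "2026-05-13", page_num=0)

Downstream readers just list objects under the prefix; they never ask the ingester what exists.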

How It Works

A typical small-to-medium production lake is just two buckets:

gs://myco-lake-bronze/        ← raw, as-received
   source=stripe/
      entity=charges/
         ingestion_date=2026-05-13/
            page=00000.jsonl.gz
            page=00001.jsonl.gz
      entity=customers/
         ingestion_date=2026-05-13/
            page=00000.jsonl.gz
   source=salesforce/
      entity=opportunities/
         ...
 
gs://myco-lake-curated/       ← (optional) intermediate Parquet
   silver/
      orders/
         year=2026/month=05/day=13/
            part-000.parquet

The bronze layout encodes everything downstream needs to find data (a small path helper is sketched after the list):

  • source= — where it came from (Stripe, Salesforce, etc.)
  • entity= — what kind of record (charges, customers, orders)
  • ingestion_date= — when we received it (not the same as event timestamp)
  • the file inside is whatever the ingester produced
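
A minimal sketch of that path contract in Python — pure string handling, with hypothetical helper names:

def bronze_prefix(source, entity, ingestion_date):
    """Build the deterministic prefix downstream readers list against."""
    return f"source={source}/entity={entity}/ingestion_date={ingestion_date}/"

def parse_bronze_path(path):
    """Recover the partition keys from any object path under bronze."""
    parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return parts["source"], parts["entity"], parts["ingestion_date"]

assert parse_bronze_path(
    "source=stripe/entity=charges/ingestion_date=2026-05-13/page=00000.jsonl.gz"
) == ("stripe", "charges", "2026-05-13")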

You'll see this pattern called Hive-style partitioning. It's not Hive-specific — BigQuery, Dataflow, Spark, and most modern tools all parse it natively. Using it consistently means downstream code never needs to know the specifics of any one source.
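
For instance, BigQuery can read the layout through an external table. A sketch with the google-cloud-bigquery client (project, dataset, and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = [
    "gs://myco-lake-bronze/source=stripe/entity=charges/*"]
external_config.compression = "GZIP"
external_config.autodetect = True  # infer the record schema from the JSONL

# Path segments after the prefix (here, ingestion_date=...) become
# queryable columns, so readers can filter partitions in plain SQL.
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = (
    "gs://myco-lake-bronze/source=stripe/entity=charges/")
external_config.hive_partitioning = hive_opts

table = bigquery.Table("myco-analytics.bronze.stripe_charges")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)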

The optional curated/ bucket holds Parquet-converted bronze for sources that benefit from columnar reads. You only need it if you run analytical queries directly against the lake; if everything refines into BigQuery, you can skip it.
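
If you do keep curated/, the conversion itself stays small. A sketch with pandas (gcsfs handles the gs:// paths, pyarrow writes the Parquet; the exact paths are illustrative):

import pandas as pd

# One bronze page in, one columnar part out.
df = pd.read_json(
    "gs://myco-lake-bronze/source=stripe/entity=charges/"
    "ingestion_date=2026-05-13/page=00000.jsonl.gz",
    lines=True, compression="gzip")
df.to_parquet(
    "gs://myco-lake-curated/silver/charges/"
    "year=2026/month=05/day=13/part-000.parquet")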

Why It Matters

  • Convention replaces coordination. Anyone who knows the path convention can find data without asking. New downstream consumers don't need new APIs.
  • Storage costs are negligible. Standard storage runs roughly $0.02 per GB-month and Archive roughly $0.0012, so with proper tiering even years of bronze cost almost nothing to keep.
  • The lake is the system's archaeology. Every transformation downstream is replayable as long as bronze exists. This is your insurance policy.

Key Technical Details

  • One project, one or a few buckets. Multi-project sprawl is the most common misstep — you'll regret it when you need cross-source joins.
  • Region: keep bronze in the same region as BigQuery to avoid egress costs. us-central1 and us (multi-region) are the common picks.
  • Lifecycle rules: Standard → Nearline at 30 days, Nearline → Coldline at 90, optional Coldline → Archive at 365 for compliance retention (a configuration sketch follows this list).
  • Object versioning ON for bronze. The cost is small; the safety net is large.
  • Set a retention policy if you have compliance requirements — it prevents accidental deletion at the bucket level.
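
A one-time setup sketch covering the last three items, using the google-cloud-storage Python client (the 10-year retention period is a placeholder, not a recommendation):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("myco-lake-bronze")

# Lifecycle: tier objects down as they age.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)

# Versioning: overwritten or deleted objects keep a noncurrent copy.
bucket.versioning_enabled = True

# Retention: block deletion before the window elapses (compliance only).
bucket.retention_period = 10 * 365 * 24 * 60 * 60  # seconds

bucket.patch()  # one API call pushes all of the above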

Common Misconceptions

"I should use Dataplex / a fancy lake catalog from day one." Catalog services (Dataplex, Data Catalog) are useful at scale — when you have hundreds of datasets and many teams. Skip them until you actually feel the pain of not having them.

"Raw means unstructured." Bronze data has structure; it just hasn't been conformed. Most bronze ends up as JSONL or Parquet, both highly structured. "Raw" here means "as received," not "without schema."

Connections to Other Concepts

  • 02-bucket-layouts-and-partitioning.md — More detail on path conventions and the partition contract.
  • 03-schema-on-read-vs-on-write.md — The strategic question of how much structure to impose on bronze, and when.
  • Course 04-refinement-in-bigquery/01-bronze-to-silver-cleaning-and-conforming.md — The first transformation that reads from this lake.

Further Reading

  • Google Cloud, "Building a data lake on Google Cloud" — Reference architecture guide.
  • "Hive-style partitioning" — Apache documentation on the convention.
  • Lakshmanan & Tigani, "BigQuery: The Definitive Guide" — chapter on external tables explains how BigQuery reads this layout natively.