One-Line Summary: File-based ingestion — partners drop CSV/Parquet/JSON files into a bucket, you pick them up and land them into bronze — is the boring, durable ingestion pattern that handles most enterprise data exchange.
Prerequisites: Lesson 01-batch-ingestion-from-apis.md.
What's the Concept?
A surprising amount of real-world data moves as files. Partners SFTP a daily CSV; an internal system dumps a Parquet export to a shared bucket; a vendor's product feed lands as a JSON file every morning. None of these are APIs and none are streams. You need an ingestion path that recognizes "new file arrived" and lands it in bronze, no matter how big or weird it is.
GCS is built for this. A bucket can be the drop zone, and you can detect new files either by polling or — more elegantly — by reacting to GCS object-creation events.
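The polling flavor is nothing more than "list the bucket and compare timestamps." A minimal sketch, assuming the example inbox bucket and partner prefix used later in this lesson; a real job would persist the watermark somewhere durable (a state file or a small table) rather than passing it in by hand:

```python
# Polling for new files: list the inbox and keep anything created since the
# last run. Bucket/prefix defaults are the lesson's examples, not requirements.
import datetime

from google.cloud import storage


def new_objects_since(watermark: datetime.datetime,
                      bucket: str = "myco-inbox",
                      prefix: str = "partners/"):
    """Yield names of objects created after the previous poll.

    `watermark` must be timezone-aware (GCS timestamps are UTC).
    """
    client = storage.Client()
    for blob in client.list_blobs(bucket, prefix=prefix):
        if blob.time_created > watermark:
            yield blob.name
```

The event-driven version below avoids the scheduling and watermark bookkeeping entirely.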
How It Works
The event-driven version (preferred):
```
partner SFTP / direct PUT
                 │
                 ▼
┌──────────────────────────────────┐
│ GCS bucket: myco-inbox           │     triggers
│   /partners/acme/orders.csv      │ ──────────────▶  Eventarc /
└──────────────────────────────────┘                   Cloud Run
                 │
                 │ validate + relocate
                 ▼
┌──────────────────────────────────────────┐
│ myco-lake-bronze                         │
│   /source=acme/entity=orders/            │
│     ingestion_date=2026-05-14/orders.csv │
└──────────────────────────────────────────┘
```

The Cloud Run handler does three things, no more:
- Validate the file. Sanity-check the format (CSV header? Parquet schema? non-empty?). Reject corrupt files into a `quarantine/` prefix; you'll want to look at those later.
- Move to bronze with the canonical path. Same convention as every other source: `source=…/entity=…/ingestion_date=…/…`. This is what makes downstream silver jobs uniform.
- Emit a manifest event. Drop a tiny JSON record (file name, byte count, row count if cheap to compute, hash) into a manifest topic. Downstream consumers listen to manifests instead of polling. (A minimal handler sketch follows this list.)
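Here is a minimal sketch of such a handler, assuming an Eventarc push delivery to a small Flask app, environment variables `BRONZE_BUCKET` and `MANIFEST_TOPIC` (a full Pub/Sub topic path), and the `partners/<source>/<file>` inbox layout from the diagram. None of those names are fixed by this lesson, and a production handler would stream large payloads rather than pulling them into memory just to hash them.

```python
# Sketch only: validate, relocate to bronze, emit a manifest. Error handling,
# streaming downloads, and schema checks are deliberately omitted.
import datetime
import hashlib
import json
import os

from flask import Flask, request
from google.cloud import pubsub_v1, storage

app = Flask(__name__)
gcs = storage.Client()
publisher = pubsub_v1.PublisherClient()


@app.route("/", methods=["POST"])
def handle_finalized():
    # Eventarc delivers the finalized object's metadata as a JSON body
    # containing at least "bucket" and "name".
    event = request.get_json()
    inbox = gcs.bucket(event["bucket"])
    blob = inbox.blob(event["name"])
    data = blob.download_as_bytes()  # fine for small files; stream multi-GB payloads instead

    # 1. Validate. Here just "non-empty"; real checks would sniff CSV headers
    #    or Parquet magic bytes. Rejected files land under quarantine/.
    if not data:
        inbox.copy_blob(blob, inbox, f"quarantine/{event['name']}")
        return "quarantined", 200

    # 2. Relocate to bronze under the canonical path, parsing the partner
    #    name out of the inbox layout partners/<source>/<filename>.
    _, source, filename = event["name"].split("/", 2)
    entity = filename.rsplit(".", 1)[0]
    bronze_path = (
        f"source={source}/entity={entity}/"
        f"ingestion_date={datetime.date.today().isoformat()}/{filename}"
    )
    inbox.copy_blob(blob, gcs.bucket(os.environ["BRONZE_BUCKET"]), bronze_path)

    # 3. Emit a tiny manifest record for downstream consumers.
    manifest = {
        "file": bronze_path,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    publisher.publish(os.environ["MANIFEST_TOPIC"],
                      json.dumps(manifest).encode("utf-8")).result()
    return "ok", 200
```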
For very large bulk loads — initial migration of a terabyte from a partner, for instance — Storage Transfer Service is the right tool. It handles parallel transfers, resumes, and checksums; you don't want to roll your own.
Why It Matters
- Files are the universal interchange format. Almost every external system can produce one. Almost every internal system can consume one. The format is boring; the operational simplicity is the payoff.
- GCS has no realistic size limit. Single objects up to 5 TiB. You can land tens of millions of files in a bucket and still query them as a single dataset.
- Cost is dominated by storage, not movement. GCS writes are cheap; the recurring cost is the bytes you keep around. Lifecycle rules archive old bronze to Coldline / Archive class for cents-per-GB-month.
Key Technical Details
- Use object versioning on the inbox bucket so a partner overwriting `orders.csv` doesn't silently destroy the previous payload.
- Set a lifecycle rule on bronze to transition objects older than 90 days to Nearline, and older than 365 days to Coldline. Bronze rarely needs to be queried directly past a few months. (See the bucket-policy sketch after this list.)
- Eventarc lets you wire `google.cloud.storage.object.v1.finalized` events directly to Cloud Run, with retries built in.
- For multi-gigabyte files, prefer Parquet over CSV/JSON; BigQuery reads Parquet 5–10× faster.
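A sketch of the versioning and lifecycle setup using the `google-cloud-storage` client; the bucket names are this lesson's examples, and the 90/365-day thresholds simply mirror the guidance above (tune them to your retention needs).

```python
# One-time bucket policy setup for the inbox and bronze buckets.
from google.cloud import storage

client = storage.Client()

# Inbox: keep prior generations when a partner overwrites the same file name.
inbox = client.get_bucket("myco-inbox")
inbox.versioning_enabled = True
inbox.patch()

# Bronze: step aging objects down to cheaper storage classes.
bronze = client.get_bucket("myco-lake-bronze")
bronze.add_lifecycle_set_storage_class_rule("NEARLINE", age=90)
bronze.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bronze.patch()
```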
Common Misconceptions
"Just point BigQuery at the bucket and use external tables." Sometimes fine for read-once exploration. For pipelines, copy into BigQuery (or a bronze table backed by Parquet) — external table queries are slower and harder to monitor.
"SFTP is legacy." It's also the dominant data-exchange protocol in finance, healthcare, government, and large enterprise. If your data lives in those worlds, expect SFTP. The pattern above wraps it cleanly.
Connections to Other Concepts
- Course `03-the-raw-data-lake/02-bucket-layouts-and-partitioning.md` — Why the path convention matters.
- Course `06-pipeline-orchestration/04-event-driven-pipelines-with-eventarc.md` — The trigger plumbing in detail.
- Lesson `01-batch-ingestion-from-apis.md` — The same shape, just with HTTP instead of the file system as the input.
Further Reading
- Google Cloud, "Eventarc + Cloud Run triggers" docs.
- "Storage Transfer Service" docs — for the bulk-load tier.
- Google Cloud blog, "Patterns for data ingestion on GCP" — a survey across batch, stream, and file approaches.