One-Line Summary: A scheduled puller that hits an external HTTP API, paginates through the response, and lands raw JSON in GCS — the simplest and most common ingestion shape.
Prerequisites: Module 01.
What's the Concept?
The most common source for an agent's data is a third-party API: Stripe billing, Salesforce records, a partner's product catalog, an internal microservice that holds the source of truth. None of these are warehouses. You can't query them directly without rate-limit pain. The job is to copy their data — on a schedule that matches your freshness requirement — into bronze, where everything else can take over.
The naive version is a Python script in a cron job. The production version handles pagination, rate limits, retries, schema drift, late-arriving data, and provenance. The shape is otherwise identical.
How It Works
A batch puller's anatomy:
```python
# Pseudo-code for a typical batch puller. `http`, `auth_headers`,
# `gcs_write_jsonl`, `today`, and `now_iso` are stand-ins for your
# HTTP client, auth, and GCS helpers.
from datetime import datetime


def pull_orders(since: datetime) -> int:
    """Pull orders from the source API into GCS bronze."""
    bucket = "myco-lake-bronze"
    prefix = f"source=salesforce/entity=orders/ingestion_date={today()}"
    cursor = None
    page = 0
    written = 0
    while True:
        resp = http.get(
            "https://api.salesforce.com/orders",
            params={"updated_since": since, "cursor": cursor, "limit": 500},
            headers=auth_headers(),
            timeout=30,
        )
        resp.raise_for_status()  # let retries handle 5xx
        body = resp.json()  # parse once; used for both rows and cursor
        rows = body["data"]
        if not rows:
            break
        # Land raw — do not transform here
        gcs_write_jsonl(
            bucket=bucket,
            path=f"{prefix}/page={page:05d}.jsonl.gz",
            rows=rows,
            metadata={"source_cursor": cursor, "pulled_at": now_iso()},
        )
        written += len(rows)
        cursor = body.get("next_cursor")
        if not cursor:
            break
        page += 1
    return written
```

Key choices visible above:
- Land JSONL, gzipped, partitioned by `ingestion_date`. Bronze is append-only; the partition key is the date of ingestion, not the date of the event (those are different and both matter).
- Watermark on `updated_since`. The puller pulls only what's changed since its last run, plus a small overlap window for safety. This is how you keep batch jobs cheap as the dataset grows (see the sketch after this list).
- Don't transform. Every transformation belongs downstream of bronze. Save the raw payload and a few `_meta` fields. That's it.
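
The watermark bullet implies a read/advance cycle around each run. Here is a minimal sketch of that cycle against a BigQuery watermark table, using the `google-cloud-bigquery` client; the `myco.ops.ingest_watermarks` table, the `salesforce.orders` source key, and the ten-minute overlap are illustrative assumptions, not fixed conventions:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

OVERLAP = timedelta(minutes=10)  # safety window for late-arriving updates
WATERMARKS = "myco.ops.ingest_watermarks"  # hypothetical table: (source STRING, watermark TIMESTAMP)


def read_watermark(client: bigquery.Client, source: str) -> datetime:
    """Return the last committed watermark, or a backfill start if none exists."""
    job = client.query(
        f"SELECT watermark FROM `{WATERMARKS}` WHERE source = @source",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("source", "STRING", source)]
        ),
    )
    rows = list(job.result())
    return rows[0]["watermark"] if rows else datetime(2020, 1, 1, tzinfo=timezone.utc)


def advance_watermark(client: bigquery.Client, source: str, mark: datetime) -> None:
    """Upsert the watermark; call this only after the pull has fully landed."""
    job = client.query(
        f"""
        MERGE `{WATERMARKS}` t
        USING (SELECT @source AS source, @mark AS watermark) s
        ON t.source = s.source
        WHEN MATCHED THEN UPDATE SET watermark = s.watermark
        WHEN NOT MATCHED THEN INSERT (source, watermark) VALUES (s.source, s.watermark)
        """,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("source", "STRING", source),
                bigquery.ScalarQueryParameter("mark", "TIMESTAMP", mark),
            ]
        ),
    )
    job.result()


# One run cycle: pull from (last watermark - overlap), advance only on success.
client = bigquery.Client()
since = read_watermark(client, "salesforce.orders") - OVERLAP
run_started = datetime.now(timezone.utc)
pull_orders(since)  # the puller sketched above
advance_watermark(client, "salesforce.orders", run_started)
```

Capturing `run_started` before the pull (rather than after) means anything updated mid-run is re-pulled next time, which is the safe direction to err in.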
For GCP, the typical home for this puller is Cloud Run (HTTP service) triggered by Cloud Scheduler or, for heavier jobs, a Cloud Composer Airflow DAG. Both write to the same GCS bucket using the same path convention.
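
As a concrete shape for the Cloud Run option, here is a minimal Flask wrapper that Cloud Scheduler can hit on a cron; the route name and wiring are illustrative, and it reuses the puller and watermark helpers sketched above:

```python
import os
from datetime import datetime, timezone

from flask import Flask, jsonify
from google.cloud import bigquery

# pull_orders, read_watermark, advance_watermark, OVERLAP: see the sketches above.

app = Flask(__name__)


@app.route("/run", methods=["POST"])
def run():
    """Entry point for a Cloud Scheduler HTTP job (e.g. an hourly "0 * * * *" cron)."""
    client = bigquery.Client()
    since = read_watermark(client, "salesforce.orders") - OVERLAP
    written = pull_orders(since)
    advance_watermark(client, "salesforce.orders", datetime.now(timezone.utc))
    return jsonify({"rows_written": written}), 200


if __name__ == "__main__":
    # Cloud Run injects PORT; default to 8080 for local runs.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

In production you would require authentication on the Cloud Run service and have Cloud Scheduler attach an OIDC token, so only the scheduler's service account can invoke the endpoint.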
Why It Matters
- APIs are the dominant data source for SaaS-shaped businesses. Most agent stacks ingest from at least one API; many ingest from five or ten.
- Batch is "good enough" for most freshness requirements. Hourly batches give sub-hour freshness, which is fine for the vast majority of agent use cases.
- Done right, batch is cheap. A Cloud Run puller pulling 100k rows hourly costs single-digit dollars per month.
Key Technical Details
- Always pass an idempotency key when writing to GCS — the path itself is fine. Re-running the same job should overwrite the same file, not create a parallel one.
- Respect rate limits with backoff (exponential, capped at ~60s). Most APIs publish a `Retry-After` header; honor it (see the sketch after this list).
- Keep watermarks in a small BigQuery table (`ingest_watermarks`) or Firestore. Don't store them in environment variables — they need to survive deploys.
- Standard timeout / retry budget: connect 5s, read 30s, three retries, total request budget ≤2 minutes.
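
Here is what that retry shape looks like as a sketch with `requests`, where the connect/read timeouts, retry count, and 60-second cap mirror the bullets above:

```python
import time

import requests


def get_with_backoff(url: str, params: dict, headers: dict, max_retries: int = 3) -> requests.Response:
    """GET with exponential backoff, honoring Retry-After on 429s and 5xx."""
    for attempt in range(max_retries + 1):
        resp = requests.get(url, params=params, headers=headers, timeout=(5, 30))  # connect 5s, read 30s
        if resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # any other 4xx is a real error: fail fast
            return resp
        if attempt == max_retries:
            resp.raise_for_status()  # retries exhausted
        retry_after = resp.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = int(retry_after)  # seconds form; the HTTP-date form isn't handled in this sketch
        else:
            delay = min(2 ** attempt, 60)  # 1s, 2s, 4s, ... capped at 60s
        time.sleep(delay)
    raise AssertionError("unreachable")


# A production version would also enforce the overall <=2-minute budget,
# e.g. by checking elapsed time before each retry.
```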
Common Misconceptions
"Just full-load every time." Works at small scale; becomes infeasible at production scale. Watermarked incremental pulls are the production pattern.
"Schema validation belongs in the puller." No — validate in the bronze→silver step. The puller's job is to capture, not judge. Schema drift should produce a silver-layer alert, not a dropped record.
Connections to Other Concepts
- `02-event-streams-with-pub-sub.md` — When batch isn't fresh enough.
- `04-files-and-bulk-loads-into-gcs.md` — The same shape for files instead of APIs.
- Course `06-pipeline-orchestration/01-orchestrating-with-cloud-composer.md` — Scheduling and managing the puller in production.
Further Reading
- Google Cloud, "Build a data pipeline with Cloud Run + Cloud Scheduler" tutorial.
- Singer.io spec — A common protocol for incremental API extraction, worth reading even if you don't use it.
- Fivetran / Airbyte docs — Hosted versions of this same pattern, useful for understanding the production behaviors you'll need.