One-Line Summary: A Cloud Run service, scheduled hourly, clones the docs repo, packages each Markdown file as a JSONL record, and writes the batch into the bronze GCS bucket.

Prerequisites: Lesson 01-capstone-overview-and-architecture.md and Module 02.

What's the Concept?

The first leg of the pipeline is the boring but load-bearing one: get the Markdown files out of GitHub and into a GCS bucket on a predictable schedule. No transformation, no parsing of any depth — just capture.

The source is a GitHub repo (braindrip-docs). Every Markdown file under /docs/** is a doc the agent might need. The capture mechanism: a Cloud Run service that, on each invocation, clones the repo at the current default-branch commit, walks the file tree, packages files into a JSONL batch, and writes it to bronze.

How It Works

The Cloud Run service, in outline:

# main.py — the entire ingester is about 120 lines
import os, json, hashlib, subprocess, datetime, gzip, tempfile, pathlib
from google.cloud import storage
 
REPO = "https://github.com/myorg/braindrip-docs.git"
BUCKET = os.environ["BRONZE_BUCKET"]   # e.g., "myco-lake-bronze"
 
def ingest_docs() -> dict:
    today = datetime.date.today().isoformat()
    pulled_at = datetime.datetime.utcnow().isoformat() + "Z"
 
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", REPO, tmp], check=True)
        commit = subprocess.check_output(
            ["git", "-C", tmp, "rev-parse", "HEAD"]
        ).decode().strip()
 
        records = []
        for path in pathlib.Path(tmp, "docs").rglob("*.md"):
            text = path.read_text(encoding="utf-8")
            rel = str(path.relative_to(tmp))
            records.append({
                "doc_path": rel,
                "content": text,
                "content_hash": hashlib.sha256(text.encode()).hexdigest(),
                "_repo": REPO,
                "_commit": commit,
                "_pulled_at": pulled_at,
            })
 
    storage_client = storage.Client()
    bucket = storage_client.bucket(BUCKET)
    blob_path = (
        f"source=github/entity=docs/"
        f"ingestion_date={today}/"
        f"commit={commit[:8]}/docs.jsonl.gz"
    )
    blob = bucket.blob(blob_path)
 
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    blob.upload_from_string(gzip.compress(body), content_type="application/gzip")
 
    return {
        "files_written": 1,
        "records_written": len(records),
        "commit": commit,
        "blob_path": blob_path,
    }
 
# Cloud Run handler
from flask import Flask, jsonify
app = Flask(__name__)
 
@app.route("/", methods=["GET", "POST"])
def handler():
    result = ingest_docs()
    return jsonify(result), 200
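
To exercise the service locally before deploying, a standard Flask entrypoint that reads the PORT environment variable is enough. This is not part of the listing above, just a minimal sketch; in production, Cloud Run injects PORT into the container for you:

# Local entrypoint: Cloud Run sets PORT, so default to 8080 only for local runs
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))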

The choices worth highlighting:

  • Idempotent path key. The bronze path includes the ingestion date and the commit hash, so re-running the ingester at the same commit on the same day overwrites the same blob. No duplicates. (A small early-exit sketch follows this list.)
  • No transformation. Every doc lands verbatim with its file path, content, and a content hash. Parsing — chunking, frontmatter, link rewriting — is a downstream Dataform concern.
  • Provenance. _repo, _commit, _pulled_at are baked into every record. The agent's eventual response can attribute back to a specific commit if it needs to.
  • One blob per run. A few thousand docs at typical doc-page sizes fit comfortably in a single gzipped JSONL file under 10 MB. Splitting into pages would be premature optimization here.
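
To make the idempotency concrete: because the blob path is keyed by ingestion date and commit, a re-run can even detect that the current commit has already been captured and return early. This early-exit check is an optional addition, not part of the service above; the sketch assumes the same bucket object and path convention as ingest_docs():

# Optional early exit: skip the upload if this commit was already captured today.
# Assumes the same bucket object and path convention as ingest_docs() above.
def already_ingested(bucket, commit: str, today: str) -> bool:
    blob_path = (
        f"source=github/entity=docs/"
        f"ingestion_date={today}/"
        f"commit={commit[:8]}/docs.jsonl.gz"
    )
    return bucket.blob(blob_path).exists()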

Deployment

The full deploy, end to end:

# 1. Build and deploy the Cloud Run service
gcloud run deploy docs-ingester \
  --source . \
  --region us-central1 \
  --service-account ingest-docs-sa@myco-prod.iam.gserviceaccount.com \
  --set-env-vars BRONZE_BUCKET=myco-lake-bronze \
  --no-allow-unauthenticated \
  --memory 512Mi \
  --timeout 600
 
# 2. Schedule it hourly with Cloud Scheduler
gcloud scheduler jobs create http docs-ingester-hourly \
  --location us-central1 \
  --schedule "0 * * * *" \
  --uri "https://docs-ingester-<hash>-uc.a.run.app/" \
  --http-method POST \
  --oidc-service-account-email scheduler-sa@myco-prod.iam.gserviceaccount.com

Two service accounts, each with exactly one role: scheduler-sa holds Cloud Run Invoker on the docs-ingester service (so Scheduler can call it), and ingest-docs-sa, the identity the service runs as, holds Storage Object Creator on the bronze bucket. Nothing else.
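
To trigger a run by hand, outside Scheduler, the caller needs the same kind of OIDC identity token that Scheduler presents. A sketch, assuming the caller runs as a service account that holds Cloud Run Invoker on the service (for example on a GCE VM or via service-account impersonation); the service URL placeholder is the same one used in the scheduler command:

# invoke_once.py: manually trigger the ingester with an OIDC identity token.
# Assumes the caller's service account has roles/run.invoker on docs-ingester.
import requests
import google.auth.transport.requests
import google.oauth2.id_token

SERVICE_URL = "https://docs-ingester-<hash>-uc.a.run.app/"

auth_request = google.auth.transport.requests.Request()
# For Cloud Run, the token audience is the service URL itself.
token = google.oauth2.id_token.fetch_id_token(auth_request, SERVICE_URL)

resp = requests.post(
    SERVICE_URL,
    headers={"Authorization": f"Bearer {token}"},
    timeout=600,
)
print(resp.status_code, resp.json())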

Why It Matters

  • The boring tier is what makes the rest possible. Without reliable bronze ingestion, every downstream model is suspect.
  • The pipeline holds the line between "source" and "warehouse." Once a doc is in bronze, you have a permanent record of "what we ingested, when, and from which commit." That trail is invaluable for debugging.

Key Technical Details

  • Cloud Run's maximum request timeout is 60 minutes (the deploy above sets 600 seconds); this puller finishes in under 30 seconds for a few-thousand-doc repo.
  • The --no-allow-unauthenticated flag means the service requires an OIDC token to invoke — protects against random callers.
  • For private repos, pull the GitHub PAT from Secret Manager (mounted or fetched at runtime) and pass it in the clone URL: https://<user>:<pat>@github.com/... (a sketch follows this list).
  • Cloud Scheduler's OIDC integration means Scheduler signs each request with its own service-account identity — no shared secret to manage.
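
For the private-repo case above, one way to wire it up is to fetch the PAT from Secret Manager at runtime and build the authenticated clone URL in code. A sketch only; the project ID, secret name, and helper function are hypothetical:

# Hypothetical helper: read a GitHub PAT from Secret Manager and build the clone URL.
# The project ID and secret name are placeholders.
from google.cloud import secretmanager

def authenticated_repo_url() -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = "projects/myco-prod/secrets/github-docs-pat/versions/latest"
    pat = client.access_secret_version(request={"name": name}).payload.data.decode("utf-8")
    # GitHub validates the PAT supplied as the password; the username portion is not checked.
    return f"https://git:{pat}@github.com/myorg/braindrip-docs.git"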

Common Misconceptions

"Why not just have the agent read from GitHub directly?" Latency, rate limits, no provenance, and no consistent snapshot across the corpus. Bronze gives all of those.

"Hourly is too frequent — docs don't change that fast." Hourly is a sensible default. Bronze writes are cheap. The downstream steps detect "no change" via content hash and skip work — so frequent ingestion costs almost nothing when there's nothing to ingest.

Connections to Other Concepts

  • Course 02-ingestion-patterns/01-batch-ingestion-from-apis.md — The general pattern this lesson specializes.
  • Course 03-the-raw-data-lake/02-bucket-layouts-and-partitioning.md — Why the path convention matters.
  • 03-step-two-refining-into-an-agent-ready-corpus.md — Where bronze gets read next.

Further Reading

  • Google Cloud, "Cloud Run quickstart" + "Cloud Scheduler quickstart."
  • GitHub's REST API docs — Alternative ingestion path if you want fewer git dependencies.
  • "Functional Data Engineering" — Why immutable bronze blobs are the right shape.