One-Line Summary: PII the agent doesn't need shouldn't reach the agent at all; pipelines that detect, redact, or tokenize sensitive data before it enters gold tables are how you keep that data both compliant and useful.

Prerequisites: Lesson 03-iam-and-security-for-agent-data-paths.md.

What's the Concept?

Some PII is essential — an agent helping a customer needs to know who the customer is. Other PII is incidental — a free-text support ticket might mention a credit card number, an address, a phone number that's irrelevant to the answer. Letting incidental PII flow into the agent context creates legal and reputational risk without product benefit.

The pattern: at the silver or gold layer, run a PII detection + redaction step that replaces incidental PII with safe tokens before the agent ever sees the row. Required PII stays (typed columns like email, phone); free-text fields get cleaned.
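In code, that transition looks roughly like the sketch below. The field names are invented for illustration, and a simple regex stands in for the real detector (a production pipeline would call Cloud DLP instead):

```python
import re

# Stand-in detector for illustration only; production would call Cloud DLP.
# Pattern: 13-16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def silver_to_gold(row: dict) -> dict:
    """Keep required, typed PII; scrub incidental PII out of free text."""
    return {
        "customer_id": row["customer_id"],  # required PII: agent must know who
        "email": row["email"],              # required PII: typed column, kept
        "body_redacted": CARD_RE.sub("[CREDIT_CARD_NUMBER]", row["body"]),
    }

row = {
    "customer_id": "c-1042",
    "email": "ana@example.com",
    "body": "Hi, my card 4111 1111 1111 1111 was charged twice",
}
gold = silver_to_gold(row)
# gold["body_redacted"] == "Hi, my card [CREDIT_CARD_NUMBER] was charged twice"
```

The required columns pass through untouched; only the free-text field is rewritten, and the original body never lands in gold.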

Cloud DLP (now part of Google Cloud's Sensitive Data Protection) is GCP's purpose-built service for this. It detects 150+ built-in info types, can de-identify in place, and integrates natively with BigQuery, Pub/Sub, and GCS.

How It Works

A typical PII redaction step in the silver → gold transition:

# Pseudocode for a Dataflow PII redaction step
from google.cloud import dlp_v2
 
dlp_client = dlp_v2.DlpServiceClient()
parent = "projects/myco-prod/locations/us-central1"
 
def redact_ticket_body(row: dict) -> dict:
    """Strip PII from a support ticket's body field before it lands in gold."""
    response = dlp_client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {
                            "info_types": [
                                {"name": "CREDIT_CARD_NUMBER"},
                                {"name": "US_SOCIAL_SECURITY_NUMBER"},
                                {"name": "PHONE_NUMBER"},
                                {"name": "EMAIL_ADDRESS"},
                                {"name": "STREET_ADDRESS"},
                            ],
                            "primitive_transformation": {
                                "replace_with_info_type_config": {}
                                # Replaces "555-1212" with "[PHONE_NUMBER]"
                            },
                        }
                    ]
                }
            },
            # In production, also pass an inspect_config listing the same
            # info_types so detection stays in sync with the transformations.
            "item": {"value": row["body"]},
        }
    )
    row["body_redacted"] = response.item.value
    return row

A ticket like "Hi, my card 4111 1111 1111 1111 was charged twice" becomes "Hi, my card [CREDIT_CARD_NUMBER] was charged twice". The agent can still understand the question; the sensitive data is gone before retrieval.

Three styles of redaction, suited to different needs:

  1. Replace with info type — the example above. Best for free text the agent reads.
  2. Tokenize (FPE — format-preserving encryption) — replaces sensitive data with a deterministic surrogate that preserves format. Useful when you need to join on the value later but never expose it.
  3. Mask (partial) — keep some characters, mask others (****-1212). Useful when the agent needs to identify but not learn.
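As DLP primitive_transformation payloads, the three styles look roughly like this. The field names follow the Cloud DLP v2 API, but the key material, key name, and exact values are placeholders:

```python
# 1. Replace with info type: "555-1212" -> "[PHONE_NUMBER]".
replace = {"replace_with_info_type_config": {}}

# 2. Tokenize with format-preserving encryption: a deterministic, joinable
#    surrogate, reversible only with the KMS-wrapped key (placeholders below).
tokenize = {
    "crypto_replace_ffx_fpe_config": {
        "crypto_key": {
            "kms_wrapped": {
                "wrapped_key": "<base64-wrapped key bytes>",  # placeholder
                "crypto_key_name": "projects/myco-prod/locations/us/keyRings/dlp/cryptoKeys/fpe",
            }
        },
        "common_alphabet": "NUMERIC",
    }
}

# 3. Partial mask: "555-1212" -> "***-1212" (mask the first 3 characters).
mask = {
    "character_mask_config": {
        "masking_character": "*",
        "number_to_mask": 3,
        "reverse_order": False,  # True would mask from the end instead
    }
}
```

Each dict slots into the "primitive_transformation" field of an info-type transformation, exactly where replace_with_info_type_config sits in the example above.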

In BigQuery you can use dynamic data masking for column-level redaction with no copy. The mechanism is policy tags: define a taxonomy, attach a data policy with a masking rule to a tag, then apply that tag to the column. Policy tags are attached through the schema (console, API, or bq) rather than plain DDL; an illustrative sketch, with placeholder taxonomy and tag IDs:

# Update the table schema to attach a policy tag to the ssn column
bq update myco-prod:silver.customers schema.json
# where schema.json defines the column as:
#   {"name": "ssn", "type": "STRING",
#    "policyTags": {"names": [
#      "projects/myco-prod/locations/us/taxonomies/TAXONOMY_ID/policyTags/TAG_ID"]}}

Principals granted the Masked Reader role on the data policy see masked values; those with Fine-Grained Reader on the policy tag see real values. The agent's service account gets Masked Reader; the human compliance reviewer gets Fine-Grained Reader.

Why It Matters

  • Regulators care, even if you don't have a breach. GDPR, HIPAA, PCI, CCPA all impose strict requirements on what data can be processed and for what purpose. An agent that can quote a customer's SSN is a finding waiting to happen.
  • Even non-regulated PII is a foot-gun. A model that learns to confidently quote phone numbers from a CRM is one prompt-injection away from leaking them on the wrong channel.
  • Redaction is a pipeline concern, not an agent concern. Asking the model to "never reveal X" via prompt is not a control. Removing X before it ever reaches the model is.

Key Technical Details

  • DLP's de-identification operates per-record; for batch redaction of millions of rows, Dataflow + DLP is the standard pattern (there's a Google-provided template).
  • DLP detection isn't free — pricing is per byte inspected. Run it once per row during the silver → gold step, then store the redacted output. Don't re-DLP on every read.
  • Custom info types let you add domain-specific patterns (your internal account numbers, product codes that count as confidential).
  • For deterministic tokenization (same input → same token), use FPE with a managed KMS key. Cloud DLP supports this directly.
  • Column-level masking with policy tags is the cheapest control — no pipeline change, just a SQL ALTER and an IAM grant.
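For example, a custom info type rides alongside the built-ins in the inspect_config; the name and pattern below are invented for illustration:

```python
# Hypothetical internal account number, e.g. "ACCT-20394857".
custom_account_id = {
    "info_type": {"name": "INTERNAL_ACCOUNT_ID"},
    "regex": {"pattern": r"ACCT-\d{8}"},
    "likelihood": "LIKELY",  # raise to cut false positives
}

inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
    "custom_info_types": [custom_account_id],
}
```

The same inspect_config is then passed to deidentify_content alongside the deidentify_config, so detection and redaction cover the custom type too.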

Common Misconceptions

"Redact at the agent layer." You can, but you shouldn't. The pipeline layer is the only place where redaction is unbypassable. A clever prompt or a misconfigured tool can defeat agent-layer rules; it cannot un-redact data that was never stored.

"DLP catches everything." It catches well-defined patterns. Free-form admissions of personally-identifying detail ("I live next to the church on 5th Street") will slip through. Treat DLP as the floor, not the ceiling.

"We can re-derive original values when needed." Sometimes yes: FPE tokenization is reversible with the key. Often no: replace-with-info-type is lossy. Plan the round-trip up front. Lossy redaction is the right default; reversible is the exception.
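The property that makes tokenization joinable (same input, same token) can be shown with a keyed HMAC. This is a local stand-in for illustration, not DLP's FPE: HMAC output is neither format-preserving nor reversible at all:

```python
import hashlib
import hmac

KEY = b"demo-only-key"  # placeholder; never hardcode keys in production

def token(value: str) -> str:
    """Deterministic surrogate: same input -> same token, so joins still work."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Same SSN tokenizes identically in two tables, so you can join on it...
assert token("123-45-6789") == token("123-45-6789")
# ...but unlike FPE there is no decrypt path: the original is gone for good.
assert token("123-45-6789") != token("987-65-4321")
```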

Connections to Other Concepts

  • 03-iam-and-security-for-agent-data-paths.md — Defense in depth; redaction is one tier.
  • Course 03-the-raw-data-lake/04-data-governance-from-day-one.md — Tagging PII at the source is what makes downstream redaction routable.
  • Course 04-refinement-in-bigquery/02-silver-to-gold-modeling-for-agents.md — The right place to bake redaction into the pipeline.

Further Reading

  • Google Cloud, "Cloud DLP overview" + "Deidentification with Cloud DLP" docs.
  • "BigQuery dynamic data masking" docs.
  • IAPP (International Association of Privacy Professionals) materials on data minimization — why redaction is the legally recognized framing.
  • NIST SP 800-188 "De-Identifying Government Datasets" — reference for what counts as de-identified.