Harness Engineering
When your agent fails, the model is rarely the problem. The harness — the code wrapping the model call — is where reliability is decided. Five layers cover most of it, and Datadog’s March 2026 data tells you which ones matter most.
§ 00 · THE HARNESS IS THE SYSTEM
The model is a function. The harness is the system.
Most production AI outages don’t look like model failures. They look like one of: the model returned a 429, the retry loop ran forever, the cost dashboard exploded, a single tenant starved everyone else, a downstream tool failed and the agent looped on it for 47 attempts. None of these are problems with the model. They’re problems with what surrounds the model.
Sarah Chen’s April 2026 essay made the framing explicit: the model is a stateless function with no error handling and no guarantees about availability. Everything else — retries, circuit breakers, rate budgets, fallback routing, observability, cost controls — lives in the harness. The harness is where reliability is decided. The model is just one of the dependencies it has to keep alive.
This essay walks through the five layers that show up in every production harness worth copying. Each layer addresses a different failure mode. None of them is sufficient on its own; stacked together, they make a 4% upstream error rate invisible to the user. The interactive lab in §08 lets you watch the stack work end-to-end.
§ 01 · RETRIES WITH SANE BACKOFF
The cheapest reliability layer, and the easiest to break
Bounded retries handle the “the upstream blipped” case. Three attempts max. Exponential backoff (250ms, 1s, 4s feels right). Jitter on every delay so a thundering herd of clients doesn’t retry in lockstep. Retry only on the codes that make retry sensible — 429, 503, network timeouts — not on 400-class errors that will always fail.
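// Helpers the snippet relies on. The error shape (a numeric `status` or a
// string `code`) is an assumption; adapt isRetryable to your client library.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
function isRetryable(err: unknown): boolean {
  const e = err as { status?: number; code?: string } | null;
  return e?.status === 429 || e?.status === 503 || e?.code === "ETIMEDOUT";
}

async function withRetry<T>(fn: () => Promise<T>, max = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < max; i++) {
    try {
      return await fn();
    } catch (err) {
      // 400-class and other non-retryable errors fail immediately.
      if (!isRetryable(err)) throw err;
      lastErr = err;
      // Don't sleep after the final attempt; there is nothing left to wait for.
      if (i === max - 1) break;
      const base = 250 * Math.pow(4, i); // 250ms, 1s, 4s, ...
      const jitter = Math.random() * base * 0.25; // up to 25% extra delay
      await sleep(base + jitter);
    }
  }
  throw lastErr;
}

The mistake most teams make: unbounded retries, or retries that don’t know what to retry. An agent that loops on a 400 Bad Request 50 times costs you 50× the tokens without any chance of succeeding. Cap the count, gate on the error type, and trust the next layer.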
§ 02 · CIRCUIT BREAKERS
Stop retrying once the upstream is clearly down
Retries handle a single bad request. They don’t handle the case where 30% of your requests are failing because the upstream is genuinely struggling. In that scenario, every retry adds load to a system that’s already overloaded — a textbook amplification. The circuit breaker pattern, lifted directly from Hystrix circa 2014 and now standard in agent harnesses, fixes this.
The state machine is simple. The breaker has three states:
- Closed. Normal operation. Requests flow through. Failures and successes are tallied in a rolling 60-second window.
- Open. Once the failure rate exceeds a threshold (25% is a reasonable default), the breaker opens. Subsequent requests fail fast without ever calling the upstream. This is what stops the retry storm.
- Half-open. After a cool-down (30–60 seconds), one probe request is allowed through. If it succeeds, the breaker closes and traffic resumes. If it fails, the breaker stays open and the cool-down resets.
Per-tool circuit breakers — not one global breaker — are the right granularity for agentic systems. The agent might be using five different MCP servers, three of which are healthy. A global breaker would stall the whole agent because one tool is down. Per-tool breakers let the agent route around the failure and keep working.
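The three-state machine and the per-tool granularity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production breaker: the names CircuitBreaker, breakerFor, and callWithBreaker are hypothetical, and the minSamples floor is an addition made here so that a single early failure does not trip the breaker. The 60-second window, 25% threshold, and 30-second cool-down are the defaults mentioned in the text.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private events: { at: number; ok: boolean }[] = [];
  private openedAt = 0;
  private readonly now: () => number;
  private readonly windowMs = 60_000; // rolling window for the failure rate
  private readonly threshold = 0.25; // failure rate that opens the breaker
  private readonly coolDownMs = 30_000; // wait before allowing a probe
  private readonly minSamples = 10; // assumption: ignore tiny samples

  constructor(now: () => number = () => Date.now()) {
    this.now = now; // injectable clock, mainly for tests
  }

  getState(): BreakerState {
    return this.state;
  }

  // True if the caller may hit the upstream; false means fail fast.
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "half-open") return false; // a probe is already in flight
    if (this.now() - this.openedAt < this.coolDownMs) return false;
    this.state = "half-open"; // cool-down elapsed: this caller is the probe
    return true;
  }

  // Report the outcome of a request that allowRequest() let through.
  record(ok: boolean): void {
    if (this.state === "half-open") {
      if (ok) {
        this.state = "closed";
        this.events = [];
      } else {
        this.trip(); // probe failed: reopen and restart the cool-down
      }
      return;
    }
    if (this.state === "open") return; // late result from before the trip
    const t = this.now();
    this.events.push({ at: t, ok });
    this.events = this.events.filter((e) => t - e.at <= this.windowMs);
    const failures = this.events.filter((e) => !e.ok).length;
    const rate = failures / this.events.length;
    if (this.events.length >= this.minSamples && rate > this.threshold) this.trip();
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.events = [];
  }
}

// One breaker per tool name, created lazily.
const breakers = new Map<string, CircuitBreaker>();
function breakerFor(tool: string): CircuitBreaker {
  let breaker = breakers.get(tool);
  if (!breaker) {
    breaker = new CircuitBreaker();
    breakers.set(tool, breaker);
  }
  return breaker;
}

async function callWithBreaker<T>(tool: string, fn: () => Promise<T>): Promise<T> {
  const breaker = breakerFor(tool);
  if (!breaker.allowRequest()) throw new Error(`circuit open for ${tool}`);
  try {
    const result = await fn();
    breaker.record(true);
    return result;
  } catch (err) {
    breaker.record(false);
    throw err;
  }
}
```

Because each tool name gets its own breaker, a tripped breaker on one MCP server only fails calls to that server; calls to the healthy tools keep flowing.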
§ 03 · CAPACITY ENGINEERING
Treat LLM capacity like any other constrained resource
Here’s the number that should change how you think about AI reliability: 60% of all LLM errors observed by Datadog in March 2026 were rate-limit errors. Not model errors. Not network errors. Rate limits. 8.4 million of them in a single month, across production AI workloads.
Your prompt is fine. Your throughput is the bottleneck. The team that ships the most reliable agent isn’t the one with the best prompt — it’s the one that thinks about their token budget the way a database engineer thinks about connection pools. Three patterns matter:
- Per-key (sub-key) rate limits. Your provider gives you a top-level quota. You partition it among tenants / features / cron jobs so no single consumer can starve the rest. A noisy batch job at 2am shouldn’t take down interactive traffic.
- Backpressure on the queue. When you can’t serve, reject the work immediately rather than queueing it for later. A queue that grows unbounded during a capacity incident becomes a multi-hour outage even after the incident ends.
- Token-aware admission. Reject incoming requests if the expected token cost would push you past your current per-minute budget. Better to fail one request fast than to fail every request later.
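The three patterns above can be combined in a single admission check, sketched here under some simplifying assumptions. The names MinuteBudget and AdmissionController are hypothetical, the budgets use a fixed one-minute window rather than a sliding one, and a cap on in-flight requests stands in for a bounded queue. The caller is assumed to supply its own estimate of a request's token cost.

```typescript
type Decision =
  | { ok: true }
  | { ok: false; reason: "queue_full" | "key_budget" | "global_budget" };

// Fixed-window token budget: resets every 60 seconds.
class MinuteBudget {
  private used = 0;
  private windowStart: number;
  private readonly limit: number;
  private readonly now: () => number;

  constructor(limit: number, now: () => number) {
    this.limit = limit;
    this.now = now;
    this.windowStart = now();
  }

  tryConsume(tokens: number): boolean {
    const t = this.now();
    if (t - this.windowStart >= 60_000) {
      this.windowStart = t;
      this.used = 0;
    }
    if (this.used + tokens > this.limit) return false;
    this.used += tokens;
    return true;
  }

  refund(tokens: number): void {
    this.used = Math.max(0, this.used - tokens);
  }
}

class AdmissionController {
  private inFlight = 0;
  private readonly maxInFlight: number;
  private readonly global: MinuteBudget;
  private readonly perKey = new Map<string, MinuteBudget>();

  constructor(
    globalLimit: number,
    keyLimits: Record<string, number>,
    maxInFlight: number,
    now: () => number = () => Date.now(),
  ) {
    this.maxInFlight = maxInFlight;
    this.global = new MinuteBudget(globalLimit, now);
    for (const [key, limit] of Object.entries(keyLimits)) {
      this.perKey.set(key, new MinuteBudget(limit, now));
    }
  }

  admit(key: string, estimatedTokens: number): Decision {
    // Backpressure: reject immediately instead of queueing without bound.
    if (this.inFlight >= this.maxInFlight) return { ok: false, reason: "queue_full" };
    // Per-key limit: unknown keys get no budget at all.
    const keyBudget = this.perKey.get(key);
    if (!keyBudget || !keyBudget.tryConsume(estimatedTokens)) {
      return { ok: false, reason: "key_budget" };
    }
    // Token-aware admission against the shared top-level quota.
    if (!this.global.tryConsume(estimatedTokens)) {
      keyBudget.refund(estimatedTokens);
      return { ok: false, reason: "global_budget" };
    }
    this.inFlight++;
    return { ok: true };
  }

  // Call once per admitted request when it finishes.
  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

A rejected request returns a reason, so the caller can tell a noisy tenant ("key_budget") apart from a genuine capacity incident ("global_budget" or "queue_full").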
§ 04 · BOUNDED SCOPE
The agent that refuses things is the one that ships
Reliability isn’t just “does it return successfully under load.” It’s also “does it return the right thing.” The Data Science Collective’s April 2026 piece on bounded-scope agents reframed this: the best production agents are narrow, and they know what they don’t own.
A support agent handles tickets. It doesn’t touch billing. The boundary is the safety mechanism. When a user asks the support agent to refund a charge, the right behavior is to refuse and route to the billing system — not to attempt the refund and hope nothing breaks.
The implementation pattern is an allow-list of actions enforced at the harness level, not at the model level. The model can’t hallucinate its way past it because the tool invocation goes through a router that checks the action against the allow-list before any side effect runs.
// Assumed registry of tool handlers (the snippet references TOOLS without
// defining it): map each allowed tool name to its implementation here.
const TOOLS: Record<string, (args: unknown) => Promise<unknown>> = {};

const SUPPORT_AGENT_ALLOWS = new Set([
  "read_ticket",
  "search_kb",
  "create_internal_note",
  "escalate_to_billing",
]);

async function callTool(toolName: string, args: unknown) {
  // Check the allow-list before any side effect runs.
  if (!SUPPORT_AGENT_ALLOWS.has(toolName)) {
    return {
      ok: false,
      error: `Tool '${toolName}' not in this agent's scope. Use escalate_to_billing.`,
    };
  }
  return await TOOLS[toolName](args);
}

Refusal rate becomes a quality metric. An agent that never refuses is suspicious: it’s either operating with too broad a scope or it’s papering over things it shouldn’t. An agent that refuses cleanly and routes the request elsewhere is operating inside its lane.
§ 05 · MODEL ROUTING
Sonnet for the hard parts. Haiku for the rest.
Most production agents pay frontier prices for tasks a smaller model handles fine. Classifying a support ticket. Summarizing a meeting transcript. Extracting a structured field from a form. These don’t need Sonnet. Sending every request to Sonnet is the “everything is a select * query” of production AI — it works, and it’s 10× more expensive than it needs to be.
Model routing is the harness layer that decides per request which tier to call. The classic shape:
- Classify the request (a small cheap model labels difficulty: trivial / standard / complex).
- Route by tier — Haiku for trivial, Sonnet for standard, Opus for complex.
- Quality fallback: if Haiku’s output fails an eval check, retry with Sonnet. The fallback is rare if the classifier is honest.
- Measure per tier. Track cost and quality separately for each routing tier so you can tune the classifier’s thresholds.
The published economics: most teams report ~70% cost reduction from honest routing, with no measurable quality loss on the routed tiers. The fallback catches the edge cases.
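The classify, route, fall back, and measure steps can be sketched as one small function. Everything injected through RouterDeps is hypothetical: the classifier, the model call, the eval check, and the metrics hook are placeholders for whatever your harness provides. The tier strings are labels, not real API model IDs, and escalating one step further from Sonnet to Opus is an assumption that goes slightly beyond the single fallback described above.

```typescript
type Difficulty = "trivial" | "standard" | "complex";
type Tier = "haiku" | "sonnet" | "opus";

// Tier chosen for each difficulty label.
const TIER_FOR: Record<Difficulty, Tier> = {
  trivial: "haiku",
  standard: "sonnet",
  complex: "opus",
};

// Next tier to try when an output fails the eval check (null = nowhere to go).
const ESCALATE: Record<Tier, Tier | null> = {
  haiku: "sonnet",
  sonnet: "opus",
  opus: null,
};

interface RouterDeps {
  classify: (input: string) => Promise<Difficulty>; // small cheap model
  callModel: (tier: Tier, input: string) => Promise<string>;
  passesEval: (output: string) => boolean; // quality gate
  recordMetric: (tier: Tier, event: "served" | "escalated") => void;
}

async function route(
  input: string,
  deps: RouterDeps,
): Promise<{ tier: Tier; output: string }> {
  let tier: Tier = TIER_FOR[await deps.classify(input)];
  for (;;) {
    const output = await deps.callModel(tier, input);
    const next = ESCALATE[tier];
    // Serve the output if it passes, or if there is no larger tier left.
    if (deps.passesEval(output) || next === null) {
      deps.recordMetric(tier, "served");
      return { tier, output };
    }
    deps.recordMetric(tier, "escalated");
    tier = next;
  }
}
```

Recording "served" and "escalated" per tier gives the data needed to tune the classifier: a high escalation count on the Haiku tier suggests it is labeling too many requests as trivial.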
§ 06 · PROMPT CACHING
90% discount most teams skip
Anthropic’s prompt caching, plus the equivalents at OpenAI and Google, change the economics of long system prompts. Mark a block as cacheable with one flag (cache_control: { type: "ephemeral" }) and subsequent calls reading the same prefix cost ~10% of the normal input price.
The pattern that fits most production agents:
- The system prompt + AGENTS.md + any stable retrieved context sits at the start of the request, marked cacheable.
- The user’s current turn and any volatile context comes after the cache breakpoint.
- Verify cache hits via the API’s response metadata: you’ll see a cache_read_input_tokens field that distinguishes cached from uncached input tokens.
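The layout above might look like this as a request body in the shape of Anthropic's Messages API. This is a sketch that only builds the payload and makes no network call. The helper names buildRequest and cacheHitRatio are hypothetical, the model string is a placeholder, and the hit-ratio calculation assumes the three usage counters (input_tokens, cache_creation_input_tokens, cache_read_input_tokens) are disjoint.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Stable blocks go first; the last one carries the cache breakpoint, so the
// whole prefix up to and including it is eligible for caching.
function buildRequest(stablePrefix: string[], userTurn: string) {
  const system = stablePrefix.map((text, i): TextBlock =>
    i === stablePrefix.length - 1
      ? { type: "text", text, cache_control: { type: "ephemeral" } }
      : { type: "text", text },
  );
  return {
    model: "MODEL_ID_PLACEHOLDER", // placeholder, not a real model ID
    max_tokens: 1024,
    system,
    // Volatile content comes after the breakpoint.
    messages: [{ role: "user", content: userTurn }],
  };
}

interface Usage {
  input_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Fraction of input tokens served from the cache for one response.
function cacheHitRatio(usage: Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

Logging the hit ratio per request is a cheap way to catch a prefix that has silently stopped being stable, for example because a timestamp crept into the system prompt.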
For an agent making 1M calls a day with 8K of stable prefix each, that’s roughly $21,600/month in input-token cost saved from the cache alone. Layer multiple breakpoints (system prompt → KB → recent context → current turn) to maximize cache hits across different stability tiers.
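The savings figure depends heavily on the input price, which this essay does not state, so it helps to make the arithmetic explicit. The sketch below uses a hypothetical helper, monthlyCacheSavings, and assumes a 30-day month, a cached read priced at 10% of normal input, and every call hitting the cache; it ignores any surcharge for cache writes. Under those assumptions, 1M calls a day with an 8K prefix is 240,000 million tokens a month, so the saving is 216,000 times the price per million tokens: about $21,600 at $0.10 per million, and proportionally more at higher prices.

```typescript
interface CacheSavingsInput {
  callsPerDay: number;
  prefixTokens: number; // stable, cacheable tokens per call
  pricePerMTok: number; // normal input price in dollars per million tokens
  cachedPriceRatio?: number; // cached read price as a fraction of normal
  daysPerMonth?: number;
}

// Dollars saved per month on the cached prefix, assuming every call is a hit.
function monthlyCacheSavings(input: CacheSavingsInput): number {
  const {
    callsPerDay,
    prefixTokens,
    pricePerMTok,
    cachedPriceRatio = 0.1,
    daysPerMonth = 30,
  } = input;
  const mtokPerMonth = (callsPerDay * prefixTokens * daysPerMonth) / 1_000_000;
  return mtokPerMonth * pricePerMTok * (1 - cachedPriceRatio);
}
```

Plugging in your own provider's price and a realistic hit rate gives a better estimate than any headline number.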
§ 07 · AGENTS.MD AS THE CONVENTION LAYER
One file. Every tool. Source-controlled.
The harness isn’t just runtime code. It’s also the conventions the agent follows — codebase rules, style, the contract between the agent and the project. April 2026 brought a quiet but real win here: Anthropic’s CLAUDE.md, OpenAI’s agent.md, and Cursor’s rules format converged on a single shared spec: AGENTS.md. GitHub Copilot, Cursor, Claude Code, and Cline all read it natively.
The shape is mundane and that’s the point. A markdown file at the root of the repo, four sections deep — the spec (what AGENTS.md must include), the codebase conventions, the constraints the agent must follow, and a list of tools and their access scopes. Every major agent CLI reads it automatically. Commit it, review it, evolve it like any other code artifact.
Treating conventions as code is what makes the rest of the harness possible. A retry policy means nothing if a different engineer’s prompt expects a different error format. Bounded scope means nothing if a new tool gets added without updating the allow-list. AGENTS.md is the shared substrate every other harness layer leans against.
§ 08 · THE FIVE LAYERS IN ONE DIAGRAM
Toggle each one and watch the system change
The lab below ties the layers together. The scenario is a production agent handling 10 million requests per month at the Datadog-observed median upstream error rate of 4%. Toggle each layer on or off and watch three numbers move: visible failure rate (what the user sees), p95 latency, and monthly cost. Sweep the upstream error rate slider to see what each layer does under an incident.
[Interactive lab. Default state: all layers off, which is what shipping straight against the model API looks like under load.]
The Datadog State of AI Engineering (March 2026) measured 8.4 million rate-limit errors in a single month across production AI workloads. 60% of all LLM errors observed were rate limits — which means per-key budgets and fallback models carry more reliability weight than people realize. Stack all five and a 4% upstream error becomes invisible to the user.
Three things to notice.
- Fallback model is the biggest single reliability win. Routing 429s to Haiku takes the visible failure rate from “perceivable” to “invisible” under typical incident conditions, and cheaply.
- Prompt caching doesn’t change reliability but pays for the rest of the stack. The cost reduction is enough that the harness layers above it are essentially free.
- Per-key rate limits prevent the most common outage class, a single tenant’s burst exhausting your shared quota, without any need for code in the agent itself.
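A back-of-the-envelope model shows why stacking layers makes a 4% upstream error rate hard to see. The helper below, visibleFailureRate, is a hypothetical illustration that assumes every attempt fails independently with the same probability. That assumption breaks during a real incident, when failures are correlated, which is exactly the case circuit breakers and a fallback model are there for.

```typescript
// Probability the user sees a failure after `attempts` tries on the primary
// model, optionally followed by one try on a fallback model.
// Assumes independent failures; fallbackErrorRate = 1 means "no fallback".
function visibleFailureRate(
  upstreamErrorRate: number,
  attempts: number,
  fallbackErrorRate = 1,
): number {
  return Math.pow(upstreamErrorRate, attempts) * fallbackErrorRate;
}

const retriesOnly = visibleFailureRate(0.04, 3); // about 0.0064%
const withFallback = visibleFailureRate(0.04, 3, 0.04); // about 0.00026%
```

At 10 million requests a month, the retries-only figure is roughly 640 visible failures instead of 400,000 with no harness at all, under the independence assumption.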
§ · FURTHER READING
References & deeper sources
- (2026). Harness Engineering: The Layer That Decides Reliability · harness-engineering.ai
- (2026). State of AI Engineering — March 2026 · Datadog Engineering
- (2026). Bounded Scope: Production Agents That Know What They Don't Own · Data Science Collective
- (2026). Circuit Breakers for AI Tools — When the Tool Fails, the Agent Loops · Google Developers Blog
- (2025). Prompt Caching with cache_control · Anthropic API Documentation
- (2026). AGENTS.md — A Unified Convention for Agent Context Files · Spec, April 2026
- (2014). Latency and Fault Tolerance for Distributed Systems · Netflix Tech Blog