Agent Harness Engineering: Building the Deterministic Shell Around Your LLM
TL;DR — A non-deterministic model is only as reliable as the deterministic shell you wrap around it. This practical guide covers the engineering of the agent harness — tool surface, approval gates, context management, and orchestration — with vendor-neutral patterns and concrete examples.
An LLM is brilliant and unpredictable; the harness is what makes it shippable.
Agent harness engineering is the discipline of building the deterministic "shell" around a non-deterministic model — the tool surface, the permission and sandbox boundaries, the context injection and compaction logic, the approval gates, the verification steps, the tracing, and the sub-agent orchestration. The model decides what to do; the harness decides what the model is allowed to do, what it can see, and what happens after it acts. Get the harness right and a probabilistic model becomes a controllable, auditable production system. Get it wrong and you have a clever demo that occasionally deletes the wrong row.
The notes below were cross-verified across three independent model families — Anthropic's Claude (Opus 4.8), an OpenAI-family model, and a Google-family model — and the patterns held up consistently across all three, which is a useful signal that they are architectural rather than vendor-specific.
The Core Concept: You Own the Boundaries
The model never touches your database, your filesystem, or your payment provider. It emits tool calls — structured requests — and your harness executes them. That gap is the entire game. Everything the harness can intercept, validate, gate, render, or log lives in the shape of those tool calls.
A useful mental model: the model is a very capable contractor who works only through written work orders. If every work order is the same opaque format ("run this shell command"), you can't tell a safe read from a destructive write. If work orders are typed by intent (refund_order, search_contracts, deploy), your harness can treat each one according to its real risk.
Best Practices
1. Design tools as an agent-computer interface (ACI)
Treat your tool set the way you'd treat a public API — because the model is the consumer. Each tool definition should state, in its description:
- When to use it (and when not to)
- The input schema with per-field descriptions and enums for fixed value sets
- Side effects — does it mutate state, send a message, spend money?
- Retry-safety — is it idempotent?
- Error modes — what failures can it return, and how should the model react?
Promote business actions to dedicated, typed tools when you need to gate, render, audit, or parallelize them: refund_order(order_id, amount_cents, reason) is trivially gated; bash -c "curl -X POST ..." is not. Keep a restricted bash/sandbox tool for open-ended exploration where breadth matters more than control. The rule of thumb: start with bash for reach, promote to dedicated tools for safety.
A dedicated tool the harness can gate and the model can't misformat:

```json
{
  "name": "refund_order",
  "description": "Refund a customer order. Use ONLY after confirming the order ID and amount with the user. Side effect: issues a real payment reversal. Not retry-safe — do not call twice for the same order.",
  "input_schema": {
    "type": "object",
    "properties": {
      "order_id": {"type": "string"},
      "amount_cents": {"type": "integer"},
      "reason": {"type": "string", "enum": ["defective", "not_received", "duplicate"]}
    },
    "required": ["order_id", "amount_cents", "reason"]
  }
}
```
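Because the tool is typed, the harness can reject a malformed call before anything executes. The following is a minimal sketch, assuming a hand-rolled validator that covers only the schema features used in the refund_order example (required fields, string/integer types, enums). The names `REFUND_SCHEMA` and `validate_input` are hypothetical; a production harness would more likely use a full JSON Schema library.

```python
from typing import Any

# Hypothetical copy of the refund_order input schema from the example above.
REFUND_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
        "reason": {"type": "string", "enum": ["defective", "not_received", "duplicate"]},
    },
    "required": ["order_id", "amount_cents", "reason"],
}

# Only the two primitive types this example needs.
_PY_TYPES = {"string": str, "integer": int}


def validate_input(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems: list[str] = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
    return problems
```

Returning the problem list to the model as the tool result, instead of raising, gives it a chance to correct the call on the next turn.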
2. Treat context as a finite, ordered resource
Context isn't free storage. It's a budget: every token costs money, adds latency, and competes for the model's attention. Order it by stability:
- Static first: system prompt, tool schemas. These rarely change and should sit at the front so they can be cached.
- Dynamic last: user data, timestamps, per-request IDs. Volatile content at the front silently invalidates any prompt cache behind it.
For long-running agents, plan for compaction (summarizing earlier turns), tool-result clearing (pruning stale outputs), and external memory (writing durable state to a file or store the agent re-reads). When you measure tokens, use the model provider's own counting endpoint — a count_tokens API for the model you're actually calling — rather than a tokenizer borrowed from a different vendor, which can be off by 15–20% or more and quietly wreck your budget math.
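Two of these techniques, tool-result clearing and compaction, can be sketched in a few lines. This is an illustration under assumptions, not a provider API: the `Message` shape, the role names, and the `summarize` callback (which would normally be a model call) are all hypothetical. Facts that must survive, such as decisions already made or IDs already assigned, are passed in as `pinned` and carried through verbatim instead of being left to the summary.

```python
from typing import Callable, TypedDict


class Message(TypedDict):
    role: str      # hypothetical role names: "user", "assistant", "tool"
    content: str


CLEARED = "[tool result cleared to save context]"


def clear_stale_tool_results(history: list[Message], keep_last: int = 2) -> list[Message]:
    """Return a copy of history with all but the newest keep_last tool results stubbed out."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    stale = set(tool_idx[: max(len(tool_idx) - keep_last, 0)])
    return [
        Message(role=m["role"], content=CLEARED) if i in stale else m
        for i, m in enumerate(history)
    ]


def compact(
    history: list[Message],
    pinned: list[str],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 4,
) -> list[Message]:
    """Summarize older messages, carrying pinned facts through verbatim."""
    split = len(history) - keep_recent
    if split <= 0:
        return list(history)  # nothing old enough to compact
    facts = "Pinned facts (verbatim):\n" + "\n".join(f"- {p}" for p in pinned)
    summary = "Summary of earlier turns: " + summarize(history[:split])
    return [
        Message(role="user", content=facts),
        Message(role="user", content=summary),
        *history[split:],
    ]
```

Both functions return new lists and leave the original history untouched, so the full transcript can still be written to a trace or to external memory.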
3. Put approval gates at side-effect boundaries
The harness should pause and ask a human before any action that is hard to reverse:
| Action class | Examples | Gate |
|---|---|---|
| Destructive | delete, drop table, rm -rf | Always confirm |
| Financial | refund, charge, transfer | Always confirm |
| Deploy / release | push to prod, publish | Always confirm |
| Bulk mutation | edit N files, mass update | Confirm above a threshold |
| Sensitive external calls | privileged API / MCP tool | Confirm or scope-limit |
| Read-only | search, glob, get | Auto-allow |
Implement gates with a pause/resume loop: when the model emits a gated tool call, the harness suspends, surfaces the request to the user, and only executes on approval. A clean manual agentic loop makes this natural — you inspect each tool_use before running it, rather than letting an auto-runner fire everything.
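A minimal sketch of that gate, assuming a hypothetical policy table keyed by tool name and two callbacks supplied by the application: `execute` runs the tool and `ask_human` blocks until someone approves or declines. None of these names come from a specific framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical policy table: tool name -> gate.
GATES = {"search_contracts": "auto", "refund_order": "confirm", "deploy": "confirm"}


def run_tool_call(
    call: ToolCall,
    execute: Callable[[ToolCall], str],
    ask_human: Callable[[ToolCall], bool],
) -> str:
    """Execute a tool call, pausing for approval when the policy requires it."""
    # Unknown tools fail closed: they are treated as needing confirmation.
    gate = GATES.get(call.name, "confirm")
    if gate == "confirm" and not ask_human(call):
        # The denial is returned as the tool result so the model can adapt.
        return f"DENIED: a human declined {call.name}"
    return execute(call)
```

Defaulting unknown tools to "confirm" is a deliberate choice in this sketch: a newly added tool stays gated until someone decides it is safe to auto-allow.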
4. Order the prompt for cache hits
Caching is a prefix match: any byte change invalidates everything after it. Freeze the system prompt and tool list, serialize tool schemas deterministically (sort keys), and keep timestamps/IDs after the last cache breakpoint. The payoff is large — cache reads typically cost a fraction of fresh input tokens — but it evaporates the moment you interpolate now() into your system prompt.
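One way to keep the prefix byte-stable is to serialize it deterministically and log a short fingerprint with every request. The sketch below assumes the harness builds the cacheable prefix itself; `stable_prefix` and `prefix_fingerprint` are hypothetical helper names, and sorting tools by name is only one possible fixed order.

```python
import hashlib
import json


def stable_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the cacheable prefix so identical inputs always yield identical bytes."""
    # Fix the tool order and the key order so dict or list ordering can't change the output.
    ordered = sorted(tools, key=lambda t: t["name"])
    return system_prompt + "\n" + json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per request; a change between requests signals a cache bust."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint changes between two requests that should share a cache entry, something volatile has leaked into the prefix.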
5. Reach for sub-agents only when the work genuinely differs
A single agent is cheaper, easier to trace, and easier to debug. Split into sub-agents only when the sub-task needs a different policy, tool set, model, or isolated context — for example, a read-only "explore" agent on a cheaper model, or a fresh-context verifier that grades the main agent's output. Each sub-agent adds prompt surface, approval surface, and trace surface; pay that cost deliberately.
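Writing the differences down as data makes the cost of each sub-agent visible. The sketch below is illustrative only: `SubAgentSpec`, the model identifier, and the tool names are hypothetical, and the spec would still need to be wired into whatever loop actually runs the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentSpec:
    name: str
    model: str                      # hypothetical model identifier
    allowed_tools: frozenset[str]   # the sub-agent's entire tool surface
    fresh_context: bool = True      # start without the parent's history


def tools_for(spec: SubAgentSpec, all_tools: list[dict]) -> list[dict]:
    """Return only the tool definitions this sub-agent is allowed to see."""
    return [t for t in all_tools if t["name"] in spec.allowed_tools]


# A read-only explorer: it never sees the mutating tools at all.
EXPLORER = SubAgentSpec(
    name="explore",
    model="small-cheap-model",
    allowed_tools=frozenset({"search_contracts", "read_file"}),
)
```

If a proposed sub-agent's spec ends up identical to the parent's, that is a hint the split is not buying anything.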
Common Misconceptions
"One bash tool is enough." It gives the model maximum reach and gives your harness minimum leverage. Every action arrives as the same opaque string, so you can't enforce per-action permissions, validate inputs, render custom UI, mark read-only calls as parallel-safe, or produce a meaningful audit trail. Bash is a starting point, not an architecture.
"The framework handles harness design." Frameworks give you a loop and some glue. They do not decide your tool boundaries, your approval policy, your compaction strategy, or your trace schema — those are your application's design, and they're the parts that determine whether the agent is safe in production. Treat traceable prompt/tool/state boundaries as first-class code you own.
"Bigger context means stuff it all in." More tokens cost more, add latency, and dilute the model's focus. A lean, well-ordered context usually beats a maximal one. Curate; don't dump.
"Split everything into multi-agent." Multi-agent is a tax, not a default. It multiplies the surfaces you have to prompt, approve, and trace. Reserve it for genuinely independent or parallel workstreams.
"Compaction is lossless enough." The most dangerous failure mode is dropping load-bearing facts during compaction — decisions already made, IDs already assigned, open blockers, and verification results. Pin those explicitly so a summarization pass can't quietly discard them. When in doubt, write them to external memory rather than trusting the summary.
Verification and Tracing: the Non-Negotiables
A harness without logging is a harness you can't debug. Record every tool call, its arguments, its result, the approval decision, and the model's stated reasoning where available. Capture a request ID per model call so you can trace a failure end-to-end. And build verification into the loop — run tests, re-read a file after editing it, or spin up a fresh-context checker — rather than trusting the agent's self-report that it "fixed" something.
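A trace can start as one JSON line per tool call. The sketch below assumes a locally generated request ID and a hypothetical record layout; if your provider returns its own request ID, log that alongside it. The 2000-character truncation limit is arbitrary.

```python
import json
import time
import uuid
from typing import Any, Optional


def new_request_id() -> str:
    """Locally generated ID used to tie one model call to its tool calls."""
    return uuid.uuid4().hex


def trace_line(
    request_id: str,
    tool: str,
    args: dict[str, Any],
    result: str,
    approval: Optional[str],
) -> str:
    """Build one JSON Lines record for a tool call."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "tool": tool,
        "args": args,
        "result": result[:2000],  # truncate large outputs
        "approval": approval,     # e.g. "auto", "approved", "denied"
    }
    return json.dumps(record, sort_keys=True)
```

Appending these lines to a file or log stream is enough to reconstruct what the agent did, in what order, and who approved it.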
A Guiding Principle
Every piece of harness logic has an expiration date. Models get better at instruction-following, providers ship new primitives (server-side compaction, tool search, managed orchestration), and yesterday's defensive scaffolding becomes today's overtriggering prompt. Favor simple, composable patterns over elaborate machinery you'll have to unwind in six months. A typed tool, a clear gate, a deterministic prompt order, and an honest log will outlast almost any clever abstraction.
Wrap-Up
Agent harness engineering is where a non-deterministic model becomes a dependable product. The five moves that matter most: design tools as a typed agent-computer interface with explicit side-effect and retry semantics; treat context as a finite, cache-aware, stability-ordered resource; place approval gates at every hard-to-reverse boundary; use sub-agents only when policy or context truly diverges; and log and verify everything so you can trace and trust the system. The misconceptions to retire — "bash is enough," "the framework handles it," "bigger context is better," and "split everything" — all share a root cause: outsourcing decisions that are fundamentally your application's design.
Start small. Take one agent you already run, list its tools, and ask of each: can the harness gate it, render it, audit it, and parallelize it safely? Promote the ones that fail. Add an approval gate at your most destructive action. Put a request ID in your logs. Those three changes alone will move you measurably toward an agent you can ship — and explain the next time something goes wrong.
For the adjacent pieces of this series, see our companion guides on building agent skills and optimizing LLM token cost.