Harness Engineering
Harness engineering is the discipline of designing environments, specifying intent, and building feedback loops that allow agents to do reliable, high-quality work. The term comes from OpenAI's framing of what changes when AI can write code: the primary engineering job shifts from writing to enabling — breaking large goals into well-scoped tasks, giving agents the skills and context to execute them, and building the feedback mechanisms that make quality measurable and improvable.
The harness is what turns a capable model into a reliable system. A model is a function — tokens in, tokens out. The harness is the environment that function runs inside: intent specified, context grounded, tools wired, feedback routed, behavior bounded. Two teams can use the identical model and ship wildly different systems based on harness quality alone.
This section covers both dimensions: the discipline (how to think about environment design, intent specification, and feedback loops) and the implementation (the runtime code — loops, state, tools, safety, observability — that puts those ideas into production).
The harness, layer by layer
The layers nest from the outside in. Maintaining the harness keeps the whole system healthy over time. Observability wraps every run. Orchestration runs many loops. Each run moves through a task lifecycle and a core loop. On each turn, the core loop assembles context (drawing on the durable grounding beneath it), dispatches tools through a safety layer, and calls a model through a gateway. The sections below build from the inside out.
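To make the layering concrete, here is a minimal sketch of the innermost loop in Python. It is an illustration under stated assumptions, not a reference implementation: `Harness`, `ModelReply`, `fake_gateway`, and the `add` tool are all hypothetical names, and the "model" is a scripted stand-in rather than a real provider call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Every name below is hypothetical and exists only for illustration.

@dataclass
class ModelReply:
    text: str
    tool_name: Optional[str] = None  # set when the model requests a tool
    tool_arg: str = ""

@dataclass
class Harness:
    gateway: Callable[[list], ModelReply]        # model access layer
    tools: dict                                  # tool registry: name -> callable
    allowed: set                                 # safety layer: permitted tool names
    grounding: list = field(default_factory=list)  # durable context
    trace: list = field(default_factory=list)      # observability record

    def run(self, task: str, max_turns: int = 5) -> str:
        history = [task]
        for turn in range(max_turns):            # turn budget enforced in code
            context = self.grounding + history   # context assembly
            reply = self.gateway(context)        # model call through the gateway
            self.trace.append(f"turn {turn}: {reply.tool_name or 'final'}")
            if reply.tool_name is None:
                return reply.text                # model produced a final answer
            if reply.tool_name not in self.allowed:
                history.append(f"denied: {reply.tool_name}")
                continue
            result = self.tools[reply.tool_name](reply.tool_arg)
            history.append(f"{reply.tool_name} -> {result}")
        return "halted: turn budget exhausted"

def fake_gateway(context: list) -> ModelReply:
    # Scripted stand-in for a model: request the tool once, then answer.
    if not any(line.startswith("add ->") for line in context):
        return ModelReply(text="", tool_name="add", tool_arg="2 3")
    return ModelReply(text=f"done: {context[-1]}")

def add(arg: str) -> str:
    a, b = arg.split()
    return str(int(a) + int(b))

harness = Harness(
    gateway=fake_gateway,
    tools={"add": add},
    allowed={"add"},
    grounding=["You are a calculator agent."],
)
result = harness.run("What is 2 + 3?")

# Same setup, but the safety layer permits no tools at all.
locked = Harness(gateway=fake_gateway, tools={"add": add}, allowed=set())
locked_result = locked.run("What is 2 + 3?", max_turns=3)
```

In this sketch the model only chooses what to request. The turn limit, the permission check, and the trace all live in ordinary code, so each can be unit-tested without calling a model.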
The sections
| Group | What it covers |
|---|---|
| The Run Loop | The core execution cycle — turn structure, budgets, halting, error recovery, and streaming & steering |
| Context Management | Assembling the prompt each turn, compacting under a token budget, and persisting session state |
| Grounding | The durable knowledge agents can't infer — the context hierarchy from a file, to a repo's .agents/, to a system, to the org |
| Task Lifecycle | Bounding a task — scope and feature lists, verification and definition-of-done, initialization and clean handoff |
| Tools & Capabilities | Dispatching tool calls, plus tool registry and discovery that don't flood the context window |
| Safety & Trust | Sandboxing, least-privilege permissions and approval gates, and prompt-injection defense |
| Orchestration | Sub-agents, concurrency limits, durable resume, and scheduled or triggered runs |
| Model Access | A gateway that abstracts providers, routes models, and exploits prompt caching |
| Observability | The metrics that matter, plus traces, telemetry, cost accounting, and replay-driven online evaluation |
And one cross-cutting page: Maintaining the Harness — auditing, ablating, and simplifying the harness as models improve, because it is software and rots like software.
Two rules
- The model decides, the harness controls. Anything that must be reliable — budgets, permissions, retries, persistence, limits — belongs in the harness, where it's deterministic and testable, not in a prompt where it's a suggestion.
- Specify intent precisely; measure outcomes honestly. An agent's output quality is bounded by the clarity of its task definition and the rigor of its feedback loop. Vague prompts and unmeasured results are a harness problem, not a model problem.
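The two rules can be sketched together in a few lines of Python. This is a hedged illustration, not a prescribed design: `Budget`, `Task`, `attempt`, and the scripted `propose` function are hypothetical names, and the fixed per-try token cost is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Every name below is hypothetical and exists only for illustration.

class BudgetExceeded(Exception):
    """Raised by the harness when a charge would exceed the budget."""

@dataclass
class Budget:
    max_tokens: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        # Rule 1: the limit is enforced in code, not requested in a prompt.
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.used + tokens} > {self.max_tokens}")
        self.used += tokens

@dataclass
class Task:
    intent: str                   # what the agent is asked to do
    done: Callable[[str], bool]   # Rule 2: definition of done as an executable check

def attempt(
    task: Task,
    propose: Callable[[str, int], str],
    budget: Budget,
    cost_per_try: int = 100,      # assumed flat cost, for simplicity
) -> Tuple[Optional[str], int]:
    """Retry until the check passes or the budget runs out."""
    tries = 0
    while True:
        try:
            budget.charge(cost_per_try)
        except BudgetExceeded:
            return None, tries    # the harness halts the run, not the model
        tries += 1
        output = propose(task.intent, tries)
        if task.done(output):
            return output, tries

def propose(intent: str, attempt_no: int) -> str:
    # Scripted stand-in for a model: wrong on the first try, right afterwards.
    return "hello" if attempt_no == 1 else "HELLO"

task = Task(
    intent="Return 'hello' in upper case.",
    done=lambda out: out == "HELLO",
)

output, tries = attempt(task, propose, Budget(max_tokens=300))
starved_output, starved_tries = attempt(task, propose, Budget(max_tokens=100))
```

With a budget of 300 tokens the second try passes the check. With a budget of 100 the run stops after one failed try and returns `None`, so a missed outcome is reported as a miss rather than passed off as success.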