Harness Engineering

Harness engineering is the discipline of designing environments, specifying intent, and building feedback loops that allow agents to do reliable, high-quality work. The term comes from OpenAI's framing of what changes when AI can write code: the primary engineering job shifts from writing to enabling — breaking large goals into well-scoped tasks, giving agents the skills and context to execute them, and building the feedback mechanisms that make quality measurable and improvable.

The harness is what turns a capable model into a reliable system. A model is a function — tokens in, tokens out. The harness is the environment that function runs inside: intent specified, context grounded, tools wired, feedback routed, behavior bounded. Two teams can use the identical model and ship wildly different systems based on harness quality alone.

This section covers both dimensions: the discipline (how to think about environment design, intent specification, and feedback loops) and the implementation (the runtime code — loops, state, tools, safety, observability — that puts those ideas into production).

The harness, layer by layer

Read from the outside in: maintenance keeps the whole harness healthy over time, observability wraps every run, and orchestration runs many loops. Each run moves through a task lifecycle and a core loop. That loop assembles context (drawing on the durable grounding beneath it), dispatches tools through a safety layer, and calls a model through a gateway. The sections below build from the inside out.
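To make the innermost layers concrete, here is a minimal sketch of a core loop. It is an illustration under stated assumptions, not a reference implementation: `Harness`, `ModelReply`, `fake_gateway`, and the `result[...]` / `denied[...]` history format are all hypothetical names invented for this example, and the gateway is a stub standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Every name in this block is hypothetical and exists only for illustration.

@dataclass
class ModelReply:
    text: str
    tool_name: Optional[str] = None  # set when the model requests a tool call
    tool_arg: str = ""

@dataclass
class Harness:
    gateway: Callable[[list], ModelReply]   # model access: one call per turn
    tools: dict                             # tool registry: name -> callable
    allowed_tools: set                      # safety layer: a simple allowlist
    grounding: list = field(default_factory=list)  # durable context
    max_turns: int = 5                      # budget enforced by the loop itself
    trace: list = field(default_factory=list)      # observability: one entry per turn

    def run(self, task: str) -> str:
        history = []
        for turn in range(self.max_turns):
            # Assemble context: grounding first, then the task, then prior results.
            context = [*self.grounding, f"task: {task}", *history]
            reply = self.gateway(context)
            self.trace.append(f"turn {turn}: {reply.tool_name or 'final'}")
            if reply.tool_name is None:
                return reply.text  # the model chose to finish
            # Tool calls pass through the safety layer before dispatch.
            if reply.tool_name not in self.allowed_tools or reply.tool_name not in self.tools:
                history.append(f"denied[{reply.tool_name}]")
                continue
            result = self.tools[reply.tool_name](reply.tool_arg)
            history.append(f"result[{reply.tool_name}]: {result}")
        return "halted: turn budget exhausted"

def fake_gateway(context: list) -> ModelReply:
    """Stub model: asks for the 'add' tool once, then answers."""
    if any(line.startswith("result[") for line in context):
        return ModelReply(text="the sum is 5")
    return ModelReply(text="", tool_name="add", tool_arg="2 3")

def add(arg: str) -> str:
    a, b = arg.split()
    return str(int(a) + int(b))

harness = Harness(gateway=fake_gateway, tools={"add": add}, allowed_tools={"add"})
print(harness.run("add 2 and 3"))
```

The model only chooses the next step. Turn counting, the allowlist check, and the trace all live in ordinary code, so swapping the stub for a real gateway would not change how the loop is bounded.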

The sections

Group	What it covers
The Run Loop	The core execution cycle — turn structure, budgets, halting, error recovery, and streaming & steering
Context Management	Assembling the prompt each turn, compacting under a token budget, and persisting session state
Grounding	The durable knowledge agents can't infer — the context hierarchy from a file, to a repo's `.agents/`, to a system, to the org
Task Lifecycle	Bounding a task — scope and feature lists, verification and definition-of-done, initialization and clean handoff
Tools & Capabilities	Dispatching tool calls, and registry and discovery without flooding the window
Safety & Trust	Sandboxing, least-privilege permissions and approval gates, and prompt-injection defense
Orchestration	Sub-agents, concurrency limits, durable resume, and scheduled or triggered runs
Model Access	A gateway that abstracts providers, routes models, and exploits prompt caching
Observability	The metrics that matter, plus traces, telemetry, cost accounting, and replay-driven online evaluation

And one cross-cutting page: Maintaining the Harness — auditing, ablating, and simplifying the harness as models improve, because it is software and rots like software.

Two rules

The model decides, the harness controls. Anything that must be reliable — budgets, permissions, retries, persistence, limits — belongs in the harness, where it's deterministic and testable, not in a prompt where it's a suggestion.
Specify intent precisely; measure outcomes honestly. An agent's output quality is bounded by the clarity of its task definition and the rigor of its feedback loop. Vague prompts and unmeasured results are a harness problem, not a model problem.
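The first rule can be sketched in code. The example below assumes a hypothetical `Controller` that sits between the model's requested tool call and the tool itself; `Policy`, `TransientError`, `BudgetExceeded`, and `PermissionDenied` are likewise names invented here. It shows permissions, a call budget, and retries enforced deterministically, whatever the prompt says.

```python
from dataclasses import dataclass

# All names here are hypothetical; this is a sketch, not a prescribed design.

class BudgetExceeded(Exception):
    pass

class PermissionDenied(Exception):
    pass

class TransientError(Exception):
    """Raised by a tool for failures that are worth retrying."""

@dataclass(frozen=True)
class Policy:
    max_tool_calls: int
    allowed_tools: frozenset
    max_retries: int = 2

class Controller:
    def __init__(self, policy: Policy, tools: dict):
        self.policy = policy
        self.tools = tools
        self.calls_made = 0

    def dispatch(self, name: str, arg: str) -> str:
        # Permission check first, so a denied call does not consume budget.
        if name not in self.policy.allowed_tools:
            raise PermissionDenied(name)
        if self.calls_made >= self.policy.max_tool_calls:
            raise BudgetExceeded(f"limit of {self.policy.max_tool_calls} tool calls reached")
        self.calls_made += 1
        # Retries are bounded by policy, not by how persistent the model is.
        for attempt in range(self.policy.max_retries + 1):
            try:
                return self.tools[name](arg)
            except TransientError:
                if attempt == self.policy.max_retries:
                    raise  # retries exhausted: surface the failure
```

Because each limit is plain code, it can be unit-tested without a model in the loop: construct a `Controller` with a small `Policy`, call `dispatch` with a stub tool, and assert that the expected exception is raised.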
