Harness Engineering (1): A Framework Designed for Long-Running Application Development


The layer beyond the model is becoming the core infrastructure of the AI era.


It's Almost Never the Model

Two years building Agents, and I keep running into the same pattern.

Every time an Agent breaks, everyone's first instinct is — the model isn't smart enough. But when you actually sit down to fix it, it's almost never the model you end up fixing.

Earlier Coding Agents didn't have MCP, Skills, or any of that. It was just what people jokingly called "talking to God in the left sidebar" — early Coding Agents had only Tools and Rules (Cursor, for example). Back then, working on a large project meant every module ended up with a functionally identical utils file. The codebase piled up into mountains of duplicated code.

Surprisingly, everyone kept blaming the same thing: "the model isn't smart enough." But that wasn't actually the problem. So prompt engineering started getting serious — people stuffed more and more stylistically diverse prompts into Agents. Context ballooned. Problems went unsolved. Hallucinations grew worse.

Earlier this year, I stumbled across an article about Cloudflare using AI to replicate Next.js. When I read the source, the implementation was poor — but the interesting part was that the tests were lifted almost verbatim. So I started adding tests to my own code, using Vitest. Token consumption went up, but the output nearly matched requirements, and correctness shot up. From there I built Vite-browser, a self-review CLI for debugging Vite apps with AI. Then more tools came into view: Agent-browser, next-browser — a wave of CLI tools for Agent self-review and debugging, all adapting their frameworks to work better with AI. That was the first shift: once humans stop writing the code, the toolchain should adapt to the Agent, not the human.

Other signals followed. Documentation sites started adding two buttons — "copy as markdown" and "open in chat." And projects like Superpower, Spec-Kit, and Ralph — the earliest I noticed — were trying to fit more of the development workflow around Agents, building conventions that let Agents run autonomously.

In 2026, this layer got a name.


Agent = Model + Harness

Before Mitchell Hashimoto named this practice in February 2026, many engineers were already doing this work — they just didn't have a shared word for it.

He noticed a habit that had developed while using AI Agents to write code: whenever an Agent made a mistake, he never fixed that specific output. Instead, he made a permanent fix in the Agent's environment. If the Agent kept leaving ticket numbers out of commit messages, he wouldn't remind the Agent each time — he'd add a rule to AGENTS.md, making that mistake impossible going forward. He called this "engineering the harness."

Within weeks, both OpenAI and Anthropic published engineering blogs expanding on the term. Thoughtworks' Birgitta Böckeler compressed it into one line:

Agent = Model + Harness

The Harness is everything in an Agent except the model itself. Tools, prompts, repo structure, feedback loops, validation mechanisms, execution environment — all of it is Harness. Harness Engineering means treating this layer as the primary engineering object to design and optimize.

The definition is useful because it cleanly separates three distinct layers of engineering.

Prompt engineering operates on a single turn — how you talk to the model in one API call. Changing "review this code" to "act as a senior engineer. Review this code for security vulnerabilities, performance issues, and maintainability. List each finding with severity and suggested fix." That's prompt engineering — making the instruction clear in one request.

Context engineering operates on a single task — what the model should and shouldn't see. When asking an Agent to fix a bug, you decide whether to include the full related file, whether to add test files, whether to include git blame, whether to add the last three commit messages. These decisions determine what cards the model is holding, but not yet how it plays them.

Harness engineering operates across the entire Agent lifecycle — how the whole environment is designed when the Agent runs autonomously for hours and makes hundreds of decisions. Imagine an Agent dropped into an unfamiliar repo, tasked with building a product prototype with 14 features in 6 hours. It needs to know where to find requirements, what order to tackle them in, how to save progress after each feature, how the next Agent picks up where this one left off, and whether to retry or escalate on test failure. None of these questions can be answered by "how to write the prompt." This is Harness territory.

The three layers aren't substitutes for each other. Their scope expands outward, and no amount of perfection at one layer can rescue a failure at another.

You can perfect prompt engineering — every system prompt tuned through iterations, loaded with few-shot examples, role settings, output format constraints. You can perfect context engineering — RAG hit rate at 95%, sending only the most relevant chunks to the model. But if your Harness is broken — if the Agent loses context on every restart, if subagents don't share memory, if failed tasks have no structured retry mechanism — the Agent still won't run stably. Every individual call will be brilliant. The larger task that spans calls will never complete.

The reverse is equally true. In an environment where the Harness is solid, a middling model can complete surprisingly complex tasks. OpenAI wrote a million lines of production code with Codex — not their strongest model — by building the entire environment around it.


When Something Fails, Don't Try Harder

In February 2026, OpenAI published an internal experiment. A three-person team, five months, a million lines of production code written with Codex, 1,500 PRs, no human writing code throughout. Per engineer, that's 3.5 PRs merged to main per day.

The numbers alone are striking. But the real value is in how they attributed failure:

"When something failed, the fix was almost never 'try harder.' Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: 'what capability is missing, and how do we make it both legible and enforceable for the agent?'"

— OpenAI, Harness engineering: leveraging Codex in an agent-first world

Translated into engineering actions, this explicitly rules out three things:

Don't write longer prompts. Imagine this: Agent can't find a workspace dependency in a monorepo, so you add a line in the prompt: "check the workspace's package.json for dependencies." Next time the Agent hits the same issue in a new workspace, you add another line. Three months later your prompt is 2,000 words, 1,500 of which document past mistakes. That's not a fix — it's a patch job.

Don't switch to a stronger model. If you fail with Sonnet, switching to Opus might temporarily help, but the same class of problem will resurface at a new complexity threshold. You've only delayed it by three months.

Don't just retry more. If the same task fails two out of three runs, the problem isn't in the lucky successful run — it's in a systemic design flaw. Retrying is just probabilistic cover.

What it demands instead: install a permanent fix in the Agent's environment, so that this class of error can never occur again.

For the "can't find dependency" example, a permanent fix means: add an entry to AGENTS.md at the repo root — "monorepo workspace dependencies live in each workspace's own package.json." Or more robustly, write an agent-callable tool find_dependency(package_name) so the Agent can locate dependencies without knowing project structure. The error can no longer happen, because the environment doesn't allow it.
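To make the tool variant concrete, here's a sketch of what such a find_dependency tool could look like. Everything beyond the tool's name is an assumption: the article doesn't specify an implementation, and the workspace layout (packages under apps/ and packages/) is invented for illustration.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of an agent-callable find_dependency tool for a monorepo.
// Scans each workspace's package.json and reports which workspaces
// declare the package, so the Agent never has to guess project structure.
function findDependency(repoRoot: string, packageName: string): string[] {
  const hits: string[] = [];
  for (const group of ["apps", "packages"]) {
    const groupDir = path.join(repoRoot, group);
    if (!fs.existsSync(groupDir)) continue;
    for (const ws of fs.readdirSync(groupDir)) {
      const pkgPath = path.join(groupDir, ws, "package.json");
      if (!fs.existsSync(pkgPath)) continue;
      const pkg = JSON.parse(fs.readFileSync(pkgPath, "utf8"));
      // Merge runtime and dev dependencies; either counts as "declared here".
      const deps = { ...pkg.dependencies, ...pkg.devDependencies };
      if (packageName in deps) hits.push(path.join(group, ws));
    }
  }
  return hits;
}
```

Exposed as a tool, the Agent calls it instead of reading the root package.json and failing — the environment answers the question structurally.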

Anthropic's Context Anxiety Case

Anthropic's long-task harness blog includes a very concrete example.

They were running Claude Sonnet 4.5 on a long task — asking the Agent to build a 2D retro game maker with React + Vite + FastAPI, including a level editor, sprite editor, entity behavior system, and game mode. One prompt, six continuous hours.

They observed a specific failure pattern: when the context window approached what the Agent perceived as its limit, it would start wrapping up sloppily. Not because the work was done — because it "felt" like space was running out. The last few features in a sprint started getting stubbed out, tests became perfunctory, documentation shrank from detailed to TODO. Anthropic called this context anxiety.

If you stop at "the model isn't smart enough," the solution is to wait for the next generation. Anthropic didn't wait.

They changed the Harness. Specifically: after each sprint, the Agent actively cleared its context, spawned a fresh Agent to continue, and carried state across sessions through a structured handoff file (recording sprint deliverables, current code state, open bugs, and the next sprint's goals). The new Agent starts with a clean context, doesn't see "the previous session consumed 80k tokens," and doesn't trigger anxiety.

This combination — which they called context reset — let Sonnet 4.5 complete the six-hour task it couldn't finish otherwise.
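Anthropic's post doesn't publish the handoff file's format, but based on the fields described above (sprint deliverables, current code state, open bugs, next sprint's goals), a handoff artifact might look something like this sketch — every detail below is invented for illustration:

```markdown
# Handoff: Sprint 3 → Sprint 4

## Delivered this sprint
- Level editor: tile placement, save/load

## Current code state
- `src/editor/` complete and tested; `src/runtime/` still stubbed

## Open bugs
- Sprite palette drag-and-drop drops frames on Safari

## Next sprint goals
- Implement the entity behavior system
```

The point isn't the exact sections — it's that the state crossing the session boundary is explicit and structured, rather than an auto-generated summary.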

When they iterated to Opus 4.6, they found the model could handle cross-session task decomposition on its own. 4.6 no longer exhibited anxiety in long context — it could judge "how much space is left and how to allocate it" independently. So they did something that looked like a step backward: they deleted the sprint structure entirely. The Generator — the Agent that writes the code — now runs a continuous session of two-plus hours without switching sessions.

But the Evaluator — a separate Agent that verifies the Generator's output from the outside — stayed. On tasks 4.6 can handle independently, it's overhead. But on tasks at the edge of model capability, it still provides critical lift. An Agent wrote a DAW application — visually polished, features seemingly complete — but when the Evaluator went through it with Playwright, it found: "Audio recording is still stub-only. Clip resize by edge drag not implemented." That's the kind of deep bug the model won't catch on its own. You need an external verifier with a different perspective.

The line from Anthropic's original post is worth keeping:

"The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."

Harness isn't scaffolding that disappears once models get stronger. It's an operating layer at the frontier — as models can do more, Harness does work at the new frontier. The sprint segmentation that was necessary in the Sonnet 4.5 era isn't needed in the 4.6 era. But the 4.6-era Evaluator might not even have been possible to tune in the 4.5 era.


Compaction is the Devil

Geoffrey Huntley saw the same thing from a completely different angle.

Ralph is an Agent coding technique he proposed in May 2025 that caught fire in the community by early 2026. Its core insight came from a line he said in a podcast:

"Compaction is the devil."

Compaction is the mechanism Claude Code and similar tools use to automatically compress conversation history when context fills up. It sounds reasonable — context windows are finite, and long conversations need a way to be managed.

But Huntley noticed something. Imagine an Agent working on a long task, with a critical instruction earlier in the context: "all API endpoints must use zod schema for input validation." By round 80, that instruction has been compressed to "use type validation." The Agent reads this vague summary and decides TypeScript interfaces are sufficient — after all, "type validation." The next 20 endpoints all miss the runtime zod validation, but the Agent thinks it did the right thing.

Once history is compressed, the Agent can only rely on its own generated summaries to continue. Those summaries are lossy compression — understanding of original intent starts to drift. One drift you won't notice. After five drifts, the Agent is doing something entirely different from what you originally asked for.

Anthropic's solution: context reset + handoff artifact — actively controlling when to clear, using structured documents to carry critical state. Huntley's solution is more extreme: just have the Agent exit after completing each task, and start fresh next round. Never give compaction a chance to happen.

Two teams, two completely different solutions. But the same root cause: an Agent that runs unstably isn't evidence of a weak model — it's evidence that the Harness failed to handle a specific situation at a specific point. Find that point. Make a permanent fix.


Ralph and Spec-Kit: Two Very Different Engineering Paths

Before Harness Engineering had a name, a set of independent engineering practices was already emerging in parallel.

Multiple teams hitting the same problem in different contexts and arriving at similar solutions — that's a signal. This layer of engineering is a real need, not hype.

Ralph and Spec-Kit are the two most-discussed practices right now. They represent two very different directions.

Ralph: Execution-First Persistent Loop

Ralph's core implementation is simple enough to not look like a technique:

```bash
while :; do cat PROMPT.md | claude -p; done
```

An infinite bash loop. Each iteration starts a fresh Claude Code process, reads PROMPT.md, does something, then exits. Next iteration starts again.

The simplicity ends there. The cleverness begins.

Ralph's state isn't in conversation history — it's in a few files on disk:

  • PROMPT.md — the main instruction, loaded every iteration
  • AGENTS.md — project conventions and build instructions
  • specs/ — spec file directory
  • IMPLEMENTATION_PLAN.md — current task list, maintained by the Agent itself

Imagine a concrete Ralph run. Iteration 1: Agent starts, reads the instruction from PROMPT.md — "check IMPLEMENTATION_PLAN.md, pick an incomplete task, implement it, run tests, commit if tests pass, then exit." Agent reads the plan, sees the first task is "implement user login endpoint," writes code, runs tests, passes, commits, exits.

Iteration 2: a brand new Claude Code process starts — it has no idea what happened in iteration 1. But it reads PROMPT.md and IMPLEMENTATION_PLAN.md, sees the first task is marked done, and the second task is "implement user logout endpoint," so it continues with that. It sees src/routes/auth/login.ts already exists and uses it as a style reference for logout.ts. Knowledge isn't in its context — knowledge is in the repo.

Run this for 50 or 100 iterations, and the entire feature gets built. No single iteration's Agent context ever exceeds what it can handle, but the project completes in its entirety.
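A PROMPT.md driving a run like this might look as follows — a sketch assembled from the instruction quoted in the walkthrough, not Huntley's actual template:

```markdown
# Ralph loop instruction (loaded fresh every iteration)

1. Read AGENTS.md for project conventions and build commands.
2. Read IMPLEMENTATION_PLAN.md and pick the first incomplete task.
3. Implement ONLY that task. Do not start anything else.
4. Run the test suite. If tests fail, fix and re-run until they pass.
5. Mark the task done in IMPLEMENTATION_PLAN.md, commit, then exit.
```

Every constraint the loop depends on — one task at a time, tests before commit, exit when done — is written down where every fresh process will read it.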

Huntley summed up the core insight himself:

"carve off small bits of work into independent context windows."

Cut work into small pieces, each with its own clean context. This isn't about making the Agent run longer — it's accepting as fact that Agents drift over long runs, then using loops and disk state to route around it.

Control flow lives in three places:

  • Scope discipline — PROMPT.md explicitly requires "do one thing at a time, commit when tests pass." The Agent won't try to do everything at once; scope is hardcoded to "one at a time."
  • Backpressure — if tests or build fail, the Agent must fix before committing. If login endpoint tests fail, the Agent can't exit — it has to keep debugging. Failure itself is feedback; the loop mechanism handles retries automatically.
  • Natural completion — the Agent exits naturally when a task is done, and the loop naturally moves to the next task. No "global progress controller" — progress lives in the files.

Ralph's aesthetic is right here — deterministically bad in a nondeterministic world (Huntley's words). It doesn't pursue correctness in a single iteration; it pursues convergence over enough iterations. Each iteration might go wrong, might be low quality, might take a longer path — but state accumulates on disk, tests stay as gatekeepers, and the system converges toward the goal.

Spec-Kit: Specification-First Structured Flow

Spec-Kit is a toolkit GitHub open-sourced in September 2025, and it goes almost the opposite direction from Ralph. Its core isn't a loop — it's a strict front-loaded process.

Imagine you're developing a feature: letting users subscribe to a blog via email. The Spec-Kit flow looks roughly like this:

Step 1, /specify. You tell the Agent "I want readers to be able to subscribe to the blog and automatically receive new articles each week." The Agent doesn't start writing code. It produces a spec.md: "User story: as a blog reader, I want to enter my email to subscribe, so I receive new articles weekly. Acceptance criteria: ① subscription form appears at the bottom of articles ② user receives confirmation email after subscribing ③ every Sunday at 9am send an email with this week's new articles ④ one-click unsubscribe." The Agent might ask: "Should the unsubscribe link be in the email or on a separate page?" — that kind of back-and-forth is normal spec-phase iteration.

Step 2, /plan. The Agent reads your finalized spec and produces plan.md: "Stack: Next.js + Resend. Storage: new subscribers table in Supabase (email, confirmed_at, unsubscribed_at, token). Core flow: ① POST /api/subscribe receives email, writes to database, sends confirmation via Resend ② confirmation link with token goes to /confirm?token=xxx, updates confirmed_at ③ Vercel cron triggers /api/weekly-digest every Sunday, queries articles from the past week, sends to all confirmed subscribers." It also produces contracts/subscribe.ts defining API schema and data-model.md describing the database table structure.

Step 3, /tasks. The Agent reads spec and plan, produces tasks.md:

```text
T1. Create subscribers table migration
T2. Implement POST /api/subscribe
T3. Implement Resend email template (confirmation)
T4. Implement /confirm page
T5. Implement /api/weekly-digest cron [depends on T1, T2]
T6. Implement Resend email template (weekly digest)
T7. Implement unsubscribe link
```

And marks which tasks can run in parallel (T3 and T4 have no dependency, can run concurrently).

Step 4, implementation. An Executor agent (or a Ralph-style loop) starts completing tasks from tasks.md one by one. Each completed task gets marked done and committed.

There's also a cross-feature constitution.md — non-negotiable project-wide principles. Things like "all user input must be validated with zod," "all email templates must be previewed in the Resend dashboard," "all cron jobs must have observability." These don't belong to any single feature — they belong to the entire project.

GitHub's core belief about this flow is stated directly: Specs as executable artifacts — specs are no longer documents sitting unread in Confluence. They're ground truth the Agent reads and uses to make decisions on every run.

In sharp contrast to Ralph, Spec-Kit invests heavily in spec alignment before the Agent writes a line of code. But once the spec is done, the Agent has a clear, structured contract to execute against — no guessing intent during implementation. "Where should this button go?" "Which email service?" "How should the database table be structured?" — all settled in the spec phase. In the implementation phase, the Agent just executes.

Behind the Differences: Three Shared Underlying Designs

Ralph and Spec-Kit look like opposites, but they share three underlying designs. These are more worth noting than the differences.

All state lives in git-versioned markdown files.

Not in conversation history, not in memory, not in some orchestrator's database. Right in the repo. The Agent reads files, modifies files, commits files.

The power of this design: the Agent's work now fits perfectly into human engineering workflows. You can git log to see what the Agent did, git diff to see what changed, git blame to understand why a line is the way it is, git revert to undo it. PR review workflows don't need changing. Code review tools don't need changing. CI/CD doesn't need changing. The Agent is like an engineer who works very fast but needs constraints — its output is structurally identical to a human engineer's.

Both distinguish artifacts with different lifecycles.

Ralph has PROMPT.md (loaded every iteration), AGENTS.md (stable conventions), specs/ (specifications), IMPLEMENTATION_PLAN.md (dynamic tasks). Spec-Kit has constitution.md (cross-feature constitution), spec.md (single-feature what), plan.md (technical plan), tasks.md (task queue).

Notice both are distinguishing information by "rate of change." constitution.md and AGENTS.md might change a few times a year; spec.md stays stable during a feature development cycle; tasks.md changes every day. Mix all this into one file and you're committing tiny changes to a 1,000-line file every day. Git history becomes useless. Separate them and each file's commit history tells a coherent story — when was a rule added, when did a feature complete, when was a bug fixed.

Both arrived at the same realization: stuffing all information into one file is wrong, because this information changes at different rates.

Neither is a framework.

They're a combination of conventions + scripts + markdown templates. No runtime dependencies, no SDK, no "install this library and hand it your agent."

Ralph's entire implementation fits into 20 lines of bash script and a few markdown templates. Spec-Kit is a Python CLI, but all it does is copy template files into your repo and register a few slash commands the Agent can invoke. No agent runtime, no orchestration engine, no state machine.

Harness engineering currently wins through conventions and artifacts, not code abstractions. That's a meaningful signal — it tells us this layer is still in the "discovering the right patterns" phase, nowhere near "encapsulating into a framework." Anything claiming to be "THE Harness Framework" right now is almost certainly too early.


Three Files, Three Lifecycles

Community practice is still mostly focused on single-feature development flow — spec → plan → task. For a long-lived evolving project, that granularity isn't enough.

While developing z0, I evolved a three-file division of responsibility. Later I found Amazon's Kiro uses the exact same three filenames: requirements.md / design.md / tasks.md.

Two teams independently converging on the same structure in different contexts. That itself suggests this is a natural engineering solution.

requirements.md — What This Feature Needs to Deliver

Describes the what from a user perspective. No implementation details.

Three core sections: user stories, acceptance criteria, explicit non-goals.

User stories in "As a ... I want to ... so that ..." form — "As a blog reader, I want to subscribe with my email, so that I can get weekly updates without visiting the site."

Acceptance criteria specific enough to describe "what can the user do, and what do they see after" — "Within 3 seconds of clicking subscribe, the user receives a confirmation email. After clicking the link in that email, the page shows 'Subscription confirmed.'"

Non-goals matter more than goals. This is the strongest defense against later scope creep. "This feature does not support category-based subscriptions, does not support RSS, and will not implement paid membership." Each non-goal is a guardrail that deflects one "should we also add..." impulse later in the feature's life. Features without non-goals always balloon into half a product.

This file stays stable through the feature's lifecycle, changing only when requirements change. Its readers: the planner Agent, product reviewers, newly onboarded engineers.

design.md — How to Implement It Technically

Describes the how from an engineering perspective.

Four core sections: architecture decisions, key interfaces, tradeoff record, risks and open questions.

Architecture decisions include technology stack, data flow, module structure — "frontend uses Next.js 16 app router, backend uses Supabase Edge Functions. Email via Resend. Cron via Vercel Cron Job."

Key interfaces include API schema, data model, event formats. The subscribe API accepts {email: string} and returns {subscribed: boolean, confirmationSent: boolean}. The subscribers table has five columns: email, confirmed_at, unsubscribed_at, token, created_at.
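As a sketch, the contract described here can be written down as types plus a runtime check. The article's conventions call for zod; the hand-rolled validator below is just a dependency-free stand-in for that schema:

```typescript
type SubscribeRequest = { email: string };
type SubscribeResponse = { subscribed: boolean; confirmationSent: boolean };

// Runtime validation of the request body. In the article's conventions this
// would be a zod schema; this stand-in keeps the example self-contained.
function parseSubscribeRequest(body: unknown): SubscribeRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be an object");
  }
  const email = (body as Record<string, unknown>).email;
  if (typeof email !== "string" || !email.includes("@")) {
    throw new Error("email must be a valid address");
  }
  return { email };
}
```

Checked into design.md (or a contracts/ file), this is exactly the kind of interface an executor Agent can implement against without guessing.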

Tradeoff records use ADR (Architecture Decision Record) style: why A over B. "Why Resend instead of SendGrid — Resend's React Email templates write in JSX, matching our tech stack; SendGrid requires an additional template language. Tradeoff: Resend's deliverability data isn't as rich as SendGrid's." The value of these records: three months later when you (or another engineer, or another Agent) come back to modify this code, you know "why this decision was made then," avoiding blind reversals.

Risks and open questions should be explicitly marked — don't pretend everything has been figured out. "Risk: if subscribers exceed 10,000, Resend's free tier won't be enough. Unresolved: what threshold triggers upgrading to a paid plan?" This kind of annotation lets future developers or Agents recognize "this is a known but unresolved point" — they won't mistake it for intentional design.

This file updates densely during the design phase, then stays relatively stable once implementation begins. Its readers: executor Agents, code reviewers, engineers who take over later.

tasks.md — What Independent Steps to Break Into

Describes the what to do next from an execution perspective.

Core content: a task list where each task is scoped to "completable within one Agent session."

For the blog subscription feature, tasks.md looks roughly like this:

```markdown
## Phase 1: Infrastructure
- [x] T1. Create subscribers table migration
- [x] T2. Set up Resend API integration

## Phase 2: Subscription Flow
- [ ] T3. Implement POST /api/subscribe endpoint [deps: T1, T2]
- [ ] T4. Design confirmation email template [parallel with T3]
- [ ] T5. Implement /confirm?token=xxx page [deps: T1]

## Phase 3: Weekly Digest
- [ ] T6. Implement /api/weekly-digest cron handler [deps: T1]
- [ ] T7. Design weekly digest email template [parallel with T6]
- [ ] T8. Configure Vercel cron [deps: T6]

## Phase 4: Unsubscribe
- [ ] T9. Implement unsubscribe link and endpoint [deps: T1]
```

Each task has four elements: status (todo/doing/done), description, dependencies, acceptance criteria.

Status lets the Agent know which task to pick — always the first todo task with no incomplete dependencies.

Dependencies make parallelism safe. T3 and T4 have no dependency on each other — they can be handed to two subagents to run in parallel. T5 must wait for T1.

Acceptance criteria are usually "what test has to pass." T3's acceptance criteria: "pnpm test api/subscribe passes." This lets the Agent verify its own work without human review before moving to the next task.
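The selection rule — first todo task with no incomplete dependencies — is mechanical enough to sketch in code. The parser below assumes exactly the line format used in the tasks.md sample above; the function names are mine:

```typescript
interface Task {
  id: string;
  done: boolean;
  deps: string[];
}

// Parse task lines of the form "- [x] T1. Description [deps: T2, T3]".
// Lines that don't match (phase headers, blanks) are skipped.
function parseTasks(md: string): Task[] {
  const tasks: Task[] = [];
  for (const line of md.split("\n")) {
    const m = line.match(/^- \[([ x])\] (T\d+)\..*?(?:\[deps: ([^\]]+)\])?$/);
    if (!m) continue;
    tasks.push({
      id: m[2],
      done: m[1] === "x",
      deps: m[3] ? m[3].split(",").map((s) => s.trim()) : [],
    });
  }
  return tasks;
}

// The rule from the text: first todo task whose dependencies are all done.
function nextTask(tasks: Task[]): Task | undefined {
  const done = new Set(tasks.filter((t) => t.done).map((t) => t.id));
  return tasks.find((t) => !t.done && t.deps.every((d) => done.has(d)));
}
```

Whether this logic lives in a script or in the Agent's instructions, the point is the same: the next action is derivable from the file, not from conversation history.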

This file changes every day. Its primary reader is the executor Agent — the entire workflow revolves around it.


Why Three Files, Not One or Five

Not one. These three types of information differ by an order of magnitude in change frequency.

Imagine merging all three into a single feature.md. Every day the Agent needs to update task status a dozen times — "mark T3 as doing," "T3 done, mark done," "discovered T3.1 subtask, add it." After a month, the file has 500 commits, 480 of which are task status updates.

Then you want to look back at "why we chose Resend over SendGrid." git blame on that line shows the most recent task-status commit. Tracing back to the actual architecture decision commit requires manually scrolling through hundreds of entries. Git history has effectively ceased to function.

With three files, requirements.md's commit history only covers requirement changes, design.md's only covers technical decision evolution, tasks.md's only covers execution progress. Each file's history tells a coherent narrative.

Not five. These three layers exactly correspond to the three contexts an Agent needs when making decisions — what problem am I solving (requirements), how am I solving it (design), what do I do next (tasks).

More files fragment the context loaded at each step. Someone might suggest a separate risks.md for risks, a separate decisions.md for tradeoff records. But risks and tradeoffs are inherently part of design decisions — pull them out of design.md and the Agent now needs to load three files simultaneously for complete design context. Every additional file adds another "might get missed" risk.

Three isn't an aesthetic choice — it's the natural minimum partition for this information.

Directory Structure

These three files don't sit at the repo root. They're organized by feature:

```text
specs/
├── 001-user-auth/
│   ├── requirements.md
│   ├── design.md
│   └── tasks.md
├── 002-payment-flow/
│   ├── requirements.md
│   ├── design.md
│   └── tasks.md
├── 003-email-subscription/
│   ├── requirements.md
│   ├── design.md
│   └── tasks.md
└── ...
AGENTS.md          # cross-feature navigation and constraints
```

Each feature is self-contained in its own directory. When working on feature-003, the Agent is scoped to load only files under specs/003-*/ — it won't be contaminated by feature-001 or 002's implementation details. It can't see user-auth implementation specifics — those details are irrelevant to its current task, and loading them only wastes context and dilutes attention.

This is Ralph's "independent context window" insight applied at the project level. Ralph makes each loop iteration an independent context. We make each feature an independent context. The same principle applied at different layers.


AGENTS.md is a Signpost, Not a Manual

This three-file system operates within a single feature's scope. Each feature has its own set under specs/{feature-id}/.

The entire repo also needs a stable cross-feature layer. That role belongs to AGENTS.md (or CLAUDE.md). It answers a different question: what should an Agent know first when entering this repo for the first time?

AGENTS.md has three core content types:

  • Project navigation — this repo's structure, where to find what. "apps/web is the frontend, apps/api is the backend, packages/ui is shared components, specs/ is feature specs."
  • Core constraints — what must be done, what is forbidden, applicable across all features. "All APIs must use zod for input validation." "No any types." "Commit messages must reference ticket numbers."
  • Index — pointers to specs/ directory, design decisions in docs/design/, domain-specific rule files.

Ideally AGENTS.md should be very short — 50 to 200 lines. It's a signpost, not a manual.

OpenAI specifically addressed this lesson in their harness engineering blog. They initially stuffed all conventions into a single massive AGENTS.md — test standards, code style, architecture principles, debug checklists, common pitfalls — all in one file. The result: AGENTS.md grew to 3,000 lines.

Problems appeared immediately. Every time an Agent entered context, those 3,000 lines loaded first, consuming roughly 15k tokens. The actual task, relevant code, and historical context had very little space left. Agents frequently missed key constraints or over-optimized for irrelevant rules buried in the 3,000-line file.

They identified three failure modes:

"Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs — so the agent either misses key constraints or starts optimizing for the wrong ones."

"Too much guidance becomes non-guidance. When everything is 'important,' nothing is."

"It rots instantly. A monolithic manual turns into a graveyard of stale rules."

Their solution: progressive disclosure — the top-level AGENTS.md holds only navigation and core constraints. Specific conventions are distributed into small files in submodules. When an Agent enters apps/api/, it reads API design standards from apps/api/AGENTS.md. When it enters packages/ui/, it reads component naming conventions from packages/ui/AGENTS.md. Information loaded at each step is relevant to the current task.

A general Harness artifact design principle emerges: using directory structure for implicit context loading beats using large files for explicit full loads.

Directory structure itself is a form of routing. Wherever an Agent enters, it automatically loads that level's rules. It doesn't need to figure out "which rules in AGENTS.md should I ignore right now" — it only sees rules relevant to its current directory level. A rule's scope is determined by file location, not file content.

This is a deeper Harness principle: move constraints from text into structure. The same rule — "backend code must use zod for validation" — can live in the root AGENTS.md (text constraint) or in apps/api/AGENTS.md (structural constraint). The former requires the Agent to judge each time "am I currently writing backend code?" The latter means the Agent only sees this rule when writing backend code. The latter is more reliable, because it removes "judging applicability" from the Agent's reasoning entirely.
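Progressive disclosure can be implemented with nothing more than a directory walk. A sketch — the function name and layout are mine, not from OpenAI's post: starting from the Agent's working directory, collect every AGENTS.md up to the repo root, so broader rules load first and the most specific last:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Collect AGENTS.md files from workDir up to repoRoot (inclusive).
// Returned order: repo root first, most specific directory last —
// deeper rules can refine or override broader ones.
function collectAgentRules(repoRoot: string, workDir: string): string[] {
  const found: string[] = [];
  const root = path.resolve(repoRoot);
  let current = path.resolve(workDir);
  while (true) {
    const candidate = path.join(current, "AGENTS.md");
    if (fs.existsSync(candidate)) found.unshift(candidate);
    if (current === root) break;
    const parent = path.dirname(current);
    if (parent === current) break; // hit the filesystem root; stop defensively
    current = parent;
  }
  return found;
}
```

An Agent entering apps/api/ would load the root AGENTS.md plus apps/api/AGENTS.md, and nothing from packages/ui/ — the routing is done by the directory tree, not by the Agent's judgment.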


What's Next

This is the first piece in the series. I've done one thing: explained what Harness Engineering is, why it emerged, and what the current mainstream practices look like.

A lot of important territory wasn't covered here. The series will address each in turn:

  • Context Engineering in Practice — context as a resource with budget, lifecycle, and structure. How do you allocate a 200k token context budget for an Agent task? What information deserves a permanent slot? What should be loaded only on demand?
  • MCP in Practice — tool calls evolving from ad hoc implementations into a protocol. What actually changed, and what didn't. From custom tool schemas to standard MCP servers — when is the switch worth making?
  • Skill as Rules — turning project rules from natural language descriptions into activatable declarations. In a multi-person Codex collaboration project at school, this approach dropped norm-violation rates significantly below the pure-AGENTS.md baseline. Specific data and eval design will be in that piece.
  • Skill as Memory — turning project context into on-demand-loadable memory units. A skill isn't just rules — it can be domain knowledge, a debugging workflow, an API reference.
  • Memory Stratification — storage strategies, retrieval mechanisms, and invalidation rules for the Rules / Memory / Session three-layer architecture. Why a single vector database can't solve all memory problems.
  • Domain Harness — frontend, backend, trading, contract auditing, code review — each domain's constraints have their own shape. Frontend Harness needs to handle visual regression, backend needs API contract enforcement, trading needs financial safety boundaries. These are different concrete expressions of the same theory.
  • Planner / Executor Architecture — the actual decisions made in z0. When this layering is worth it, when it isn't, the mistakes I made, and the tradeoffs that survived.

Each piece will include concrete implementations, evals, and mistakes already made.


References

  • Mitchell Hashimoto, Engineer the Harness (2026.02)
  • OpenAI, Harness engineering: leveraging Codex in an agent-first world (2026.02)
  • Anthropic, Harness design for long-running application development (2026.03)
  • Martin Fowler / Birgitta Böckeler, Harness engineering for coding agent users (2026.04)
  • Geoffrey Huntley, Ralph Wiggum as a software engineer (2025.05)
  • GitHub, Spec-Driven Development with AI (github.com/github/spec-kit)
  • Amazon Kiro, IDE documentation (aws.amazon.com/kiro)