# Evolution Tree Search for Agent Harness Optimization
An LLM agent's performance depends not only on the model but on the harness—the code that constructs prompts, defines tools, and orchestrates execution. Meta-Harness showed that an LLM proposer with full trace access can optimize harnesses automatically. But on TerminalBench-2, their optimization loop failed 7 out of 8 times, finding only one useful change after ~$2,400 in compute. The trace-access insight is right. The search strategy is wrong.
EvoHarness keeps Meta-Harness's trace access and better-harness's surface decomposition, but replaces linear hill climbing with evolution tree search. In one iteration, we turned 9 previously-failing tasks into passes—more than Meta-Harness achieved across all 8 of its iterations.
| Rank | Agent | Model | Org | Accuracy |
|---|---|---|---|---|
| 1 | Pilot | Claude Opus 4.6 | QuantFlow | 82.9% ± 1.4 |
| 2 | ForgeCode | GPT-5.4 | OpenAI | 81.8% ± 2.0 |
| 3 | ForgeCode | Claude Opus 4.6 | Anthropic | 81.8% ± 1.7 |
| 4 | TongAgents | Gemini 3.1 Pro | — | 80.2% ± 2.6 |
| 5 | SageAgent | GPT-5.3-Codex | OpenAI | 78.4% ± 2.2 |
| 6 | ForgeCode | Gemini 3.1 Pro | — | 78.4% ± 1.8 |
| 7 | EvoHarness (ours, 1 iter) | Claude Opus 4.6 | — | 77.3% |
| 7 | Droid | GPT-5.3-Codex | OpenAI | 77.3% ± 2.2 |
| 8 | Capy | Claude Opus 4.6 | Anthropic | 75.3% ± 2.4 |
After just 1 iteration (~$400), EvoHarness ties rank 7 on the official leaderboard. Meta-Harness achieved 76.4% after 8 iterations (~$2,400).
Meta-Harness on TerminalBench-2 (8 iterations):

- Iter 1: bundled prompt + structural fixes → REGRESSION
- Iter 2: different bundle → REGRESSION
- Iter 3: (diagnosed confound)
- Iter 4: isolated completion logic fix → REGRESSION
- Iter 5: another completion logic fix → REGRESSION
- Iter 6: yet another completion logic fix → REGRESSION
- Iter 7: additive env bootstrapping → ACCEPTED (+1.7%)
- Iter 8: compose with prior → NEUTRAL

8 iterations. 1 useful change. ~$2,400 spent.
Three structural problems:

1. **Linear hill climbing.** One proposal at a time, each validated against the full benchmark, so every dead end cost a full evaluation run.
2. **No memory.** The proposer had no record of prior failures, so iterations 4 through 6 each re-attempted a fix to `completion_logic`. All regressed.
3. **No fragility tracking.** Nothing flagged `completion_logic` as a high-risk surface after repeated regressions, so the loop kept returning to it.
```
baseline (69%)
├── branch_001: system_prompt change ── 9 flipped ── ACCEPT
├── branch_002: env_bootstrap change ── 1 unique flip ── ACCEPT
└── merge(001 + 002) ── full eval, 89 tasks ── +7 net ── Iter 2 Base (77%)
    ├── branch_003: (next iteration)
    └── branch_004: ...
```
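The branch-prune-merge loop in the tree above can be sketched in a few lines of Python. All function names and signatures here are illustrative stand-ins, not the actual EvoHarness API:

```python
# Illustrative sketch of one evolution-tree iteration: spawn branches
# from the current base, prescreen cheaply, evaluate survivors, gate,
# and merge accepted branches into the next base.

def run_iteration(base, propose, prescreen, evaluate, merge, n_branches=2):
    accepted = []
    for _ in range(n_branches):
        branch = propose(base)                 # ONE change to ONE surface
        if prescreen(branch) == 0:             # 0/3 prescreen tasks flip
            continue                           # prune early, ~$9 spent
        flips, regressions = evaluate(branch)  # failing tasks + canaries
        if flips > 0 and regressions <= 1:     # the accept gate
            accepted.append(branch)
    return merge(base, accepted) if accepted else base

# Toy run with stub callbacks: both branches pass the gate and get merged.
next_base = run_iteration(
    base="baseline",
    propose=lambda b: f"{b}+edit",
    prescreen=lambda br: 1,       # at least one prescreen task flips
    evaluate=lambda br: (9, 1),   # 9 flips, 1 regression
    merge=lambda b, acc: (b, tuple(acc)),
)
```

Because rejected branches never touch the base, a bad proposal costs at most one branch evaluation, never a lost iteration.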
```
PROPOSE    Claude Code agent reads failing traces + notebook.
           Proposes ONE change to ONE surface.
           Avoids HIGH-fragility surfaces and dead ends.
    |
PRESCREEN  Run on 3 failing tasks ($9).
           0/3 flip → PRUNE (done, $9 wasted)
           1+ flip  → proceed
    |
EVAL       Run on all failing tasks + 5 passing canaries ($75).
           Count flips and regressions.
    |
GATE       pass_count improved AND regressions ≤ 1 → ACCEPT
           otherwise → REJECT
    |
NOTEBOOK   ACCEPT: findings.md updated, fragility stays low
           REJECT: dead_ends.md updated, fragility increases
```
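The GATE and NOTEBOOK steps amount to a small decision function. A minimal sketch, assuming a per-surface fragility counter; the types and in-memory stand-ins for `findings.md` and `dead_ends.md` are ours, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class BranchResult:
    surface: str      # e.g. "system_prompt"
    flips: int        # previously failing tasks that now pass
    regressions: int  # previously passing canaries that now fail

@dataclass
class Notebook:
    findings: list = field(default_factory=list)   # stands in for findings.md
    dead_ends: list = field(default_factory=list)  # stands in for dead_ends.md
    fragility: dict = field(default_factory=dict)  # per-surface risk score

def gate(result: BranchResult, notebook: Notebook) -> bool:
    """ACCEPT iff the pass count improved and there is at most one regression."""
    accepted = result.flips > 0 and result.regressions <= 1
    if accepted:
        notebook.findings.append(result.surface)
    else:
        notebook.dead_ends.append(result.surface)
        # Failed attempts make a surface look more fragile,
        # steering future proposers away from it.
        notebook.fragility[result.surface] = (
            notebook.fragility.get(result.surface, 0) + 1
        )
    return accepted
```

The key property is that rejections are not discarded: every REJECT leaves a trace that the next proposer reads before choosing a surface.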
Each proposer is a Claude Code agent with full filesystem access to a
structured workspace. It can grep across traces, drill into
specific failures, and check the notebook before proposing.
```
proposer_workspace/
├── TASK.md                  ← Instructions
├── surfaces/
│   ├── manifest.json        ← Risk ratings, fragility scores per surface
│   ├── system_prompt.txt    ← Current prompt
│   └── env_bootstrap.py     ← Current bootstrap code
├── traces/                  ← Full traces for every FAILING task
│   ├── build-pmars/
│   │   └── trial_0.json     ← Every command, output, error
│   ├── cancel-async-tasks/
│   └── differential.json    ← Passing vs failing comparison
└── notebook/
    ├── findings.md          ← What works
    ├── dead_ends.md         ← What doesn't (DO NOT repeat)
    └── surface_risk.md      ← Fragility ratings
```
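For illustration, here is one plausible shape for the fragility check a proposer might run against `surfaces/manifest.json`. The schema and the helper function are assumptions for the sketch, not the actual manifest format:

```python
# Hypothetical manifest contents: per-surface fragility scores that a
# proposer consults before choosing what to modify.
manifest = {
    "system_prompt":    {"fragility": 0},
    "env_bootstrap":    {"fragility": 0},
    "completion_logic": {"fragility": 3},  # regressed repeatedly
}

def eligible_surfaces(manifest, max_fragility=1):
    """Surfaces a proposer may target, lowest risk first."""
    return sorted(
        (s for s, meta in manifest.items()
         if meta["fragility"] <= max_fragility),
        key=lambda s: manifest[s]["fragility"],
    )
```

Under this sketch, a surface like `completion_logic` with three recorded regressions is simply never offered to the proposer, which is exactly the failure mode Meta-Harness hit in iterations 4 through 6.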
Every iteration produces a self-contained harness. No file swapping. Any iteration can be independently re-evaluated:
```
runs/experiment/
├── iter_001_branch_001/
│   ├── harness/                                 ← Complete, runnable
│   │   ├── agent.py
│   │   ├── prompt-templates/terminus-kira.txt   ← Modified
│   │   └── pyproject.toml
│   └── proposal.json                            ← What was proposed and why
├── iter_001_branch_002/
│   └── harness/                                 ← Different modification
└── iter_002_base/
    └── harness/                                 ← Merged harness
```
```shell
# Re-evaluate any iteration:
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5
```
Starting from the same baseline as Meta-Harness: Terminus-KIRA + environment bootstrapping, Claude Opus 4.6, 89 tasks.
**Branch 001: `system_prompt` (9 tasks flipped)**

Proposer diagnosis:
A Claude Code agent ($1.50, ~5 min) grepped across 23 failing traces and found that agents call `task_complete` without running the test command stated in the task. Evidence: `build-pmars` had an explicit test command that the agent never ran.
Change:

A system-prompt instruction to run the task's stated test command before calling `task_complete`.

Result:
9/23 flipped, including 4 hard tasks. `scientific-computing`: 38% → 75%.
**Branch 002: `env_bootstrap` (4 tasks flipped)**

Proposer diagnosis:
A second Claude Code agent ran in parallel and found that agents waste early turns discovering installed packages.
Change:
Expanded `_gather_env_snapshot()`: GPU, disk space, 20 Python packages, extra tools (ffmpeg, make, mips-linux-gnu-gcc).
Result:
4/23 flipped. `dna-assembly` flipped only with the bootstrap change, confirming the two changes are orthogonal.
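Orthogonality here just means the two branches' flip sets are (nearly) disjoint, so merging them is additive. As a toy check with Python sets; task names beyond those in the text are placeholders:

```python
# Flip sets per branch. build-pmars, scientific-computing, and
# dna-assembly come from the write-up; "placeholder-task" does not.
system_prompt_flips = {"build-pmars", "scientific-computing", "placeholder-task"}
env_bootstrap_flips = {"dna-assembly"}

# A merge is worth its full-eval cost when a branch contributes
# flips the other branch cannot produce.
unique_to_bootstrap = env_bootstrap_flips - system_prompt_flips
merged_flips = system_prompt_flips | env_bootstrap_flips
```

If the flip sets overlapped heavily, the merge would pay full-eval cost for redundant gains, which is why unique flips (like `dna-assembly`) are the signal the gate looks for.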
| | Flipped (fail→pass) | Regressed (pass→fail) | Net |
|---|---|---|---|
| Count | 8 | 1 (mailman) | +7 |
Seven of the eight apparent regressions were resource-sensitive: they recovered when re-run on matched resources. Only `mailman` is a genuine regression.
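This triage can be expressed as a small classifier. A sketch, assuming a hypothetical rerun-on-matched-resources callback (not part of the actual tooling):

```python
def classify_regressions(apparent, passes_on_matched_resources):
    """Split apparent regressions into genuine vs resource-sensitive.
    `passes_on_matched_resources` is a hypothetical callback: True if
    the task recovers when re-run on resources matched to baseline."""
    genuine, resource_sensitive = [], []
    for task in apparent:
        if passes_on_matched_resources(task):
            resource_sensitive.append(task)
        else:
            genuine.append(task)
    return genuine, resource_sensitive

# The numbers from the text above: 8 apparent regressions, 7 recover
# on matched resources, only mailman is genuine.
apparent = ["mailman"] + [f"flaky_{i}" for i in range(7)]
genuine, flaky = classify_regressions(apparent, lambda t: t != "mailman")
```

Separating flaky from genuine regressions matters for the gate: without it, resource noise alone would exceed the "regressions ≤ 1" threshold and reject good branches.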
| | Meta-Harness | EvoHarness |
|---|---|---|
| Baseline eval | $267 | $267 |
| Search iterations | 8 × $267 = $2,136 | 2 branches × $69 = $138 |
| Full validation | (same benchmark) | $267 |
| Total | ~$2,443 | ~$675 |
| Useful changes | 1 | 2 |
| Net improvement | +1.7% | +8.0% |
This experiment was run during a one-day hackathon. Several limitations apply:
- Only two surfaces were exercised; `tool_definitions` and other surfaces remain untested.

Subsequent iterations continue the tree from the new base:

```
Iter 2 Base (77.3%, ~20 failing)
├── iter_002_branch_001: proposer targets remaining failures
├── iter_002_branch_002: different surface
├── iter_002_branch_003: different approach
└── merge → Iter 3 Base → ...
```
```shell
# 1. Baseline
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 \
  --agent-import-path agent:AgentHarness -e modal -k 1 -n 89 \
  -o jobs/baseline -y --env-file .env

# 2. Import traces
python -m meta import jobs/baseline/baseline --output runs/experiment

# 3. Evolve (propose + apply + eval)
python -m meta.run_iteration --experiment-dir runs/experiment \
  --iteration 1 --parent-variant baseline \
  --proposer-model sonnet --eval-env modal

# 4. Validate best harness (leaderboard conditions)
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5
```
Built at the Paradigm Automated Research Hackathon. Extends Meta-Harness (Lee et al.), better-harness (LangChain), and AutoHarness (DeepMind). Inner agent: Terminus-KIRA by KRAFTON AI. Evaluated on TerminalBench-2.