EvoHarness

Evolution Tree Search for Agent Harness Optimization

Paradigm Automated Research Hackathon — April 9, 2026

An LLM agent's performance depends not only on the model but on the harness—the code that constructs prompts, defines tools, and orchestrates execution. Meta-Harness showed that an LLM proposer with full trace access can optimize harnesses automatically. But on TerminalBench-2, their optimization loop failed 7 out of 8 times, finding only one useful change after ~$2,400 in compute. The trace-access insight is right. The search strategy is wrong.

EvoHarness keeps Meta-Harness's trace access and better-harness's surface decomposition, but replaces linear hill climbing with evolution tree search. In one iteration, we turned 9 previously-failing tasks into passes—more than Meta-Harness achieved across all 8 of its iterations.

TerminalBench-2 Leaderboard (after 1 iteration)

Rank  Agent                      Model            Org        Accuracy
1     Pilot                      Claude Opus 4.6  QuantFlow  82.9% ± 1.4
2     ForgeCode                  GPT-5.4          OpenAI     81.8% ± 2.0
3     ForgeCode                  Claude Opus 4.6  Anthropic  81.8% ± 1.7
4     TongAgents                 Gemini 3.1 Pro   Google     80.2% ± 2.6
5     SageAgent                  GPT-5.3-Codex    OpenAI     78.4% ± 2.2
6     ForgeCode                  Gemini 3.1 Pro   Google     78.4% ± 1.8
7     EvoHarness (ours, 1 iter)  Claude Opus 4.6             77.3%
7     Droid                      GPT-5.3-Codex    OpenAI     77.3% ± 2.2
8     Capy                       Claude Opus 4.6  Anthropic  75.3% ± 2.4

After just 1 iteration (~$400), EvoHarness ties for rank 7 on the official leaderboard. Meta-Harness achieved 76.4% after 8 iterations (~$2,400).

The Problem with Linear Hill Climbing

Meta-Harness on TerminalBench-2 (8 iterations):

  Iter 1: bundled prompt + structural fixes    → REGRESSION
  Iter 2: different bundle                     → REGRESSION
  Iter 3: (diagnosed confound)
  Iter 4: isolated completion logic fix        → REGRESSION
  Iter 5: another completion logic fix         → REGRESSION
  Iter 6: yet another completion logic fix     → REGRESSION
  Iter 7: additive env bootstrapping           → ACCEPTED (+1.7%)
  Iter 8: compose with prior                   → NEUTRAL

  8 iterations. 1 useful change. ~$2,400 spent.

Three structural problems:

1. No memory of dead ends. Iterations 4, 5, and 6 all tried modifying completion_logic. All regressed. The proposer had no record of prior failures.
   Our fix: surface fragility tracking + a research notebook. Each surface gets an auto-updated fragility score, and the notebook records why approaches failed, preventing the same class of mistake on any surface.
2. No pre-screening. Every proposal costs $267 (a full 89-task eval), even the bad ones.
   Our fix: 3-task pre-screening ($9). If no task flips, prune immediately. Rejecting a bad idea becomes 30x cheaper.
3. Linear chain blocks exploration. One proposal at a time: no parallel exploration, no composing of orthogonal improvements.
   Our fix: tree search with parallel proposals and branch merging. Thompson sampling selects which branches to extend, and accepted branches are merged to test whether improvements compose.
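The branch-selection step can be sketched with a Beta-posterior Thompson sampler. The `wins`/`losses` bookkeeping below is hypothetical (e.g. proposals that flipped tasks vs. ones that were pruned or regressed), not EvoHarness's actual schema:

```python
import random

def thompson_select(branches):
    """Pick the branch to extend next via Thompson sampling.

    Each branch carries hypothetical win/loss counts from its past proposals.
    """
    best, best_draw = None, -1.0
    for branch in branches:
        # Draw a plausible success rate from a Beta(wins+1, losses+1) posterior.
        draw = random.betavariate(branch["wins"] + 1, branch["losses"] + 1)
        if draw > best_draw:
            best, best_draw = branch, draw
    return best

branches = [
    {"name": "branch_001", "wins": 9, "losses": 1},
    {"name": "branch_002", "wins": 4, "losses": 2},
]
print(thompson_select(branches)["name"])  # usually branch_001, sometimes branch_002
```

Exploration falls out naturally: a branch with few evaluations has a wide posterior, so it still gets sampled occasionally instead of being starved by the current leader.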

How It Works

The evolution tree

baseline (69%)
├── branch_001: system_prompt change ── 9 flipped ── ACCEPT
├── branch_002: env_bootstrap change ── 1 unique flip ── ACCEPT
└── merge(001 + 002) ── full eval 89 tasks ── +7 net ── Iter 2 Base (77%)
    ├── branch_003: (next iteration)
    └── branch_004: ...

Each iteration

  PROPOSE    Claude Code agent reads failing traces + notebook
             Proposes ONE change to ONE surface
             Avoids HIGH-fragility surfaces and dead ends
                 |
  PRESCREEN  Run on 3 failing tasks ($9)
             0/3 flip → PRUNE (done, $9 wasted)
             1+ flip → proceed
                 |
  EVAL       Run on all failing tasks + 5 passing canaries ($75)
             Count flips and regressions
                 |
  GATE       pass_count improved AND regressions ≤ 1 → ACCEPT
             otherwise → REJECT
                 |
  NOTEBOOK   ACCEPT: findings.md updated, fragility stays low
             REJECT: dead_ends.md updated, fragility increases
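The prescreen and gate decisions above reduce to two small predicates; a minimal sketch, with function names of our own invention:

```python
def prescreen(probe_flips):
    """Pre-screen verdict: any flip among the 3 probe tasks earns a full eval.

    Zero flips means the proposal is pruned after only ~$9 of compute.
    """
    return probe_flips > 0

def gate(parent_passes, child_passes, regressions, max_regressions=1):
    """Acceptance gate: pass count must improve and regressions stay within budget."""
    return child_passes > parent_passes and regressions <= max_regressions
```

For example, Iteration 1's merge (61 → 68 passes with 1 regression) clears the gate, while any change with 2+ regressions is rejected regardless of flips.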

The proposer workspace

Each proposer is a Claude Code agent with full filesystem access to a structured workspace. It can grep across traces, drill into specific failures, and check the notebook before proposing.

proposer_workspace/
  TASK.md               ← Instructions
  surfaces/
    manifest.json       ← Risk ratings, fragility scores per surface
    system_prompt.txt   ← Current prompt
    env_bootstrap.py    ← Current bootstrap code
  traces/               ← Full traces for every FAILING task
    build-pmars/
      trial_0.json      ← Every command, output, error
    cancel-async-tasks/
      differential.json ← Passing vs failing comparison
  notebook/
    findings.md         ← What works
    dead_ends.md        ← What doesn't (DO NOT repeat)
    surface_risk.md     ← Fragility ratings
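Fragility tracking can be as simple as a per-surface counter persisted in manifest.json. A hypothetical updater (the real schema and thresholds may differ):

```python
import json
from pathlib import Path

def bump_fragility(manifest_path, surface, accepted):
    """Update a surface's fragility score in manifest.json after a verdict.

    The schema here (fragility counter plus a HIGH/LOW risk label) is
    hypothetical, sketched from the workspace layout above.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    entry = manifest.setdefault(surface, {"fragility": 0})
    if accepted:
        # Accepted changes keep (or ease) the surface's fragility.
        entry["fragility"] = max(0, entry["fragility"] - 1)
    else:
        # Rejected changes make the surface look riskier to future proposers.
        entry["fragility"] += 1
    entry["risk"] = "HIGH" if entry["fragility"] >= 3 else "LOW"
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

After three rejected completion_logic edits (as in Meta-Harness's iterations 4–6), the surface would be marked HIGH and the proposer instructed to avoid it.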

Reproducible harness directories

Every iteration produces a self-contained harness. No file swapping. Any iteration can be independently re-evaluated:

runs/experiment/
  iter_001_branch_001/
    harness/                 ← Complete, runnable
      agent.py
      prompt-templates/terminus-kira.txt  ← Modified
      pyproject.toml
    proposal.json            ← What was proposed and why
  iter_001_branch_002/
    harness/                 ← Different modification
  iter_002_base/
    harness/                 ← Merged harness

# Re-evaluate any iteration:
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5

Experiment: TerminalBench-2

Starting from the same baseline as Meta-Harness: Terminus-KIRA + environment bootstrapping, Claude Opus 4.6, 89 tasks.

Baseline

61/88   Baseline (1 trial, Modal)
76.4%   Published Meta-Harness (5 trials)
27      Failing tasks with full traces

Iteration 1, Branch 1: Verification Protocol

Surface: system_prompt (9 tasks flipped)

Claude Code agent ($1.50, ~5 min) grepped across 23 failing traces. Found: agents call task_complete without running the test command stated in the task. Evidence: build-pmars had an explicit test command—agent never ran it.

prompt-templates/terminus-kira.txt
- Before calling task_complete, verify minimal state changes...
+ Before calling task_complete, you MUST:
+ 1. Run the task's test command (if provided)
+ 2. Verify required outputs exist
+ 3. Do NOT delete solution files

9/23 flipped, including 4 hard tasks. scientific-computing: 38% → 75%.

Iteration 1, Branch 2: Expanded Bootstrap

Surface: env_bootstrap (4 tasks flipped)

Second Claude Code agent ran in parallel. Found: agents waste early turns discovering installed packages.

Expanded _gather_env_snapshot() to report GPU availability, disk space, 20 common Python packages, and extra tools (ffmpeg, make, mips-linux-gnu-gcc).

4/23 flipped. dna-assembly only flipped with the bootstrap change—confirming the two changes are orthogonal.
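A simplified sketch of what such a snapshot might gather; the real _gather_env_snapshot() surely differs, and the tool and package lists here are illustrative:

```python
import shutil
from importlib import metadata

def gather_env_snapshot():
    """Sketch of an expanded environment snapshot for the agent's first turn."""
    snapshot = {}
    # Free disk space on the root filesystem, in GB.
    snapshot["disk_free_gb"] = round(shutil.disk_usage("/").free / 1e9, 1)
    # Which extra tools are already on PATH.
    snapshot["tools"] = {
        tool: shutil.which(tool) is not None
        for tool in ("ffmpeg", "make", "mips-linux-gnu-gcc")
    }
    # Installed versions of commonly needed Python packages (None if absent).
    snapshot["packages"] = {}
    for pkg in ("numpy", "pandas", "requests"):
        try:
            snapshot["packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            snapshot["packages"][pkg] = None
    return snapshot
```

Injecting this into the first prompt spares the agent from spending its early turns on `which` and `pip list` reconnaissance.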

Iteration 2 Base: Merged Harness

Full validation on all 89 tasks (default resources): complete

Flipped (fail→pass):    8
Regressed (pass→fail):  1 (mailman)
Net:                    +7

Seven apparent regressions were resource-sensitive and recovered when re-run on matched resources; only mailman is a genuine regression.
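Counting flips and regressions between a parent and a merged harness reduces to a per-task diff. A minimal sketch, assuming results are `{task_name: passed}` dicts:

```python
def diff_results(parent, child):
    """Per-task diff between a parent harness and a child/merged harness.

    Both arguments are hypothetical {task_name: passed} mappings.
    """
    flipped = [t for t, ok in child.items() if ok and not parent.get(t, False)]
    regressed = [t for t, ok in child.items() if not ok and parent.get(t, False)]
    return flipped, regressed, len(flipped) - len(regressed)
```

On Iteration 1's numbers this would report 8 flips, 1 regression (mailman), and a net of +7.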

Results

69.3%   Baseline (61/88)
77.3%   After Iteration 1 (68/88, +7 net)

Cost comparison

                   Meta-Harness        EvoHarness
Baseline eval      $267                $267
Search iterations  8 × $267 = $2,136   2 branches × $69 = $138
Full validation    (same benchmark)    $267
Total              ~$2,443             ~$675
Useful changes     1                   2
Net improvement    +1.7%               +8.0%

Limitations

This experiment was run during a one-day hackathon. Several limitations apply: only one full search iteration was completed, results cover a single benchmark and inner model, and the EvoHarness score lacks the multi-trial error bars reported for official leaderboard entries.

Future work

Iter 2 Base (77.3%, ~20 failing)
├── iter_002_branch_001: proposer targets remaining failures
├── iter_002_branch_002: different surface
├── iter_002_branch_003: different approach
└── merge → Iter 3 Base → ...

Reproducing

# 1. Baseline
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 \
  --agent-import-path agent:AgentHarness -e modal -k 1 -n 89 \
  -o jobs/baseline -y --env-file .env

# 2. Import traces
python -m meta import jobs/baseline/baseline --output runs/experiment

# 3. Evolve (propose + apply + eval)
python -m meta.run_iteration --experiment-dir runs/experiment \
  --iteration 1 --parent-variant baseline \
  --proposer-model sonnet --eval-env modal

# 4. Validate best harness (leaderboard conditions)
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5

Built at the Paradigm Automated Research Hackathon. Extends Meta-Harness (Lee et al.), better-harness (LangChain), and AutoHarness (DeepMind). Inner agent: Terminus-KIRA by KRAFTON AI. Evaluated on TerminalBench-2.