EvoHarness

Evolution Tree Search for Agent Harness Optimization

Paradigm Automated Research Hackathon — April 9, 2026

An LLM agent's performance depends not only on the model but on the harness—the code that constructs prompts, defines tools, and orchestrates execution. Meta-Harness showed that an LLM proposer with full trace access can optimize harnesses automatically. But on TerminalBench-2, their optimization loop failed 7 out of 8 times, finding only one useful change after ~$2,400 in compute. The trace-access insight is right. The search strategy is wrong.

EvoHarness keeps Meta-Harness's trace access and better-harness's surface decomposition, but replaces linear hill climbing with evolution tree search. In one iteration, we turned 9 previously-failing tasks into passes—more than Meta-Harness achieved across all 8 of its iterations.

TerminalBench-2 Leaderboard (after 1 iteration)

Rank  Agent                      Model            Org        Accuracy
1     Pilot                      Claude Opus 4.6  QuantFlow  82.9% ± 1.4
2     ForgeCode                  GPT-5.4          OpenAI     81.8% ± 2.0
3     ForgeCode                  Claude Opus 4.6  Anthropic  81.8% ± 1.7
4     TongAgents                 Gemini 3.1 Pro   Google     80.2% ± 2.6
5     SageAgent                  GPT-5.3-Codex    OpenAI     78.4% ± 2.2
6     ForgeCode                  Gemini 3.1 Pro   Google     78.4% ± 1.8
7     EvoHarness (ours, 1 iter)  Claude Opus 4.6             77.3%
7     Droid                      GPT-5.3-Codex    OpenAI     77.3% ± 2.2
8     Capy                       Claude Opus 4.6  Anthropic  75.3% ± 2.4

After just 1 iteration (~$400), EvoHarness ties for rank 7 on the official leaderboard. Meta-Harness achieved 76.4% after 8 iterations (~$2,400).

The Problem with Linear Hill Climbing

Meta-Harness on TerminalBench-2 (8 iterations):

  Iter 1: bundled prompt + structural fixes    → REGRESSION
  Iter 2: different bundle                     → REGRESSION
  Iter 3: (diagnosed confound)
  Iter 4: isolated completion logic fix        → REGRESSION
  Iter 5: another completion logic fix         → REGRESSION
  Iter 6: yet another completion logic fix     → REGRESSION
  Iter 7: additive env bootstrapping           → ACCEPTED (+1.7%)
  Iter 8: compose with prior                   → NEUTRAL

  8 iterations. 1 useful change. ~$2,400 spent.

Three structural problems:

1. No memory of dead ends. Iterations 4, 5, and 6 all tried modifying completion_logic. All regressed. The proposer had no record of prior failures.
   Our fix: surface fragility tracking + a research notebook. Each surface gets an auto-updated fragility score, and the notebook records why approaches failed, preventing the same class of mistake on any surface.
2. No pre-screening. Every proposal costs $267 (a full 89-task eval), even the bad ones.
   Our fix: 3-task pre-screening ($9). If no task flips, prune immediately. Rejecting a bad idea becomes 30x cheaper.
3. Linear chain blocks exploration. One proposal at a time: no parallel exploration, no composing of orthogonal improvements.
   Our fix: tree search with parallel proposals and branch merging. Thompson sampling selects which branches to extend, and accepted branches are merged to test whether improvements compose.
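The branch-selection step can be sketched with a Beta-posterior Thompson sampler. The `wins`/`losses` bookkeeping below is hypothetical (e.g. proposals that flipped tasks vs. ones that were pruned or regressed), not EvoHarness's actual schema:

```python
import random

def thompson_select(branches):
    """Pick the branch to extend next via Thompson sampling.

    Each branch carries hypothetical win/loss counts from its past proposals.
    """
    best, best_draw = None, -1.0
    for branch in branches:
        # Draw a plausible success rate from a Beta(wins+1, losses+1) posterior.
        draw = random.betavariate(branch["wins"] + 1, branch["losses"] + 1)
        if draw > best_draw:
            best, best_draw = branch, draw
    return best

branches = [
    {"name": "branch_001", "wins": 9, "losses": 1},
    {"name": "branch_002", "wins": 4, "losses": 2},
]
print(thompson_select(branches)["name"])  # usually branch_001, sometimes branch_002
```

Exploration falls out naturally: a branch with few evaluations has a wide posterior, so it still gets sampled occasionally instead of being starved by the current leader.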

How It Works

The evolution tree

baseline (69%)
├── branch_001: system_prompt change ── 9 flipped ── ACCEPT
├── branch_002: env_bootstrap change ── 1 unique flip ── ACCEPT
└── merge(001 + 002) ── full eval 89 tasks ── +7 net ── Iter 2 Base (77%)
    ├── branch_003: (next iteration)
    └── branch_004: ...

Each iteration

  PROPOSE    Claude Code agent reads failing traces + notebook
             Proposes ONE change to ONE surface
             Avoids HIGH-fragility surfaces and dead ends
                 |
  PRESCREEN  Run on 3 failing tasks ($9)
             0/3 flip → PRUNE (done, $9 wasted)
             1+ flip → proceed
                 |
  EVAL       Run on all failing tasks + 5 passing canaries ($75)
             Count flips and regressions
                 |
  GATE       pass_count improved AND regressions ≤ 1 → ACCEPT
             otherwise → REJECT
                 |
  NOTEBOOK   ACCEPT: findings.md updated, fragility stays low
             REJECT: dead_ends.md updated, fragility increases
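The prescreen and gate decisions above reduce to two small predicates; a minimal sketch, with function names of our own invention:

```python
def prescreen(probe_flips):
    """Pre-screen verdict: any flip among the 3 probe tasks earns a full eval.

    Zero flips means the proposal is pruned after only ~$9 of compute.
    """
    return probe_flips > 0

def gate(parent_passes, child_passes, regressions, max_regressions=1):
    """Acceptance gate: pass count must improve and regressions stay within budget."""
    return child_passes > parent_passes and regressions <= max_regressions
```

For example, Iteration 1's merge (61 → 68 passes with 1 regression) clears the gate, while any change with 2+ regressions is rejected regardless of flips.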

The proposer workspace

Each proposer is a Claude Code agent with full filesystem access to a structured workspace. It can grep across traces, drill into specific failures, and check the notebook before proposing.

proposer_workspace/
  TASK.md               ← Instructions
  surfaces/
    manifest.json       ← Risk ratings, fragility scores per surface
    system_prompt.txt   ← Current prompt
    env_bootstrap.py    ← Current bootstrap code
  traces/               ← Full traces for every FAILING task
    build-pmars/
      trial_0.json      ← Every command, output, error
    cancel-async-tasks/
      differential.json ← Passing vs failing comparison
  notebook/
    findings.md         ← What works
    dead_ends.md        ← What doesn't (DO NOT repeat)
    surface_risk.md     ← Fragility ratings
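Fragility tracking can be as simple as a per-surface counter persisted in manifest.json. A hypothetical updater (the real schema and thresholds may differ):

```python
import json
from pathlib import Path

def bump_fragility(manifest_path, surface, accepted):
    """Update a surface's fragility score in manifest.json after a verdict.

    The schema here (fragility counter plus a HIGH/LOW risk label) is
    hypothetical, sketched from the workspace layout above.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    entry = manifest.setdefault(surface, {"fragility": 0})
    if accepted:
        # Accepted changes keep (or ease) the surface's fragility.
        entry["fragility"] = max(0, entry["fragility"] - 1)
    else:
        # Rejected changes make the surface look riskier to future proposers.
        entry["fragility"] += 1
    entry["risk"] = "HIGH" if entry["fragility"] >= 3 else "LOW"
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

After three rejected completion_logic edits (as in Meta-Harness's iterations 4–6), the surface would be marked HIGH and the proposer instructed to avoid it.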

Reproducible harness directories

Every iteration produces a self-contained harness. No file swapping. Any iteration can be independently re-evaluated:

runs/experiment/
  iter_001_branch_001/
    harness/                 ← Complete, runnable
      agent.py
      prompt-templates/terminus-kira.txt  ← Modified
      pyproject.toml
    proposal.json            ← What was proposed and why
  iter_001_branch_002/
    harness/                 ← Different modification
  iter_002_base/
    harness/                 ← Merged harness

# Re-evaluate any iteration:
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5

Experiment: TerminalBench-2

Starting from the same baseline as Meta-Harness: Terminus-KIRA + environment bootstrapping, Claude Opus 4.6, 89 tasks.

Baseline

61/88   Baseline (1 trial, Modal)
76.4%   Published Meta-Harness (5 trials)
27      Failing tasks with full traces

Iteration 1, Branch 1: Verification Protocol

Surface: system_prompt (9 tasks flipped)

Claude Code agent ($1.50, ~5 min) grepped across 23 failing traces. Found: agents call task_complete without running the test command stated in the task. Evidence: build-pmars had an explicit test command—agent never ran it.

prompt-templates/terminus-kira.txt
- Before calling task_complete, verify minimal state changes...
+ Before calling task_complete, you MUST:
+ 1. Run the task's test command (if provided)
+ 2. Verify required outputs exist
+ 3. Do NOT delete solution files

9/23 flipped, including 4 hard tasks. scientific-computing: 38% → 75%.

Iteration 1, Branch 2: Expanded Bootstrap

Surface: env_bootstrap (4 tasks flipped)

Second Claude Code agent ran in parallel. Found: agents waste early turns discovering installed packages.

Expanded _gather_env_snapshot() to report GPU availability, disk space, 20 common Python packages, and extra tools (ffmpeg, make, mips-linux-gnu-gcc).

4/23 flipped. dna-assembly only flipped with the bootstrap change—confirming the two changes are orthogonal.
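A simplified sketch of what such a snapshot might gather; the real _gather_env_snapshot() surely differs, and the tool and package lists here are illustrative:

```python
import shutil
from importlib import metadata

def gather_env_snapshot():
    """Sketch of an expanded environment snapshot for the agent's first turn."""
    snapshot = {}
    # Free disk space on the root filesystem, in GB.
    snapshot["disk_free_gb"] = round(shutil.disk_usage("/").free / 1e9, 1)
    # Which extra tools are already on PATH.
    snapshot["tools"] = {
        tool: shutil.which(tool) is not None
        for tool in ("ffmpeg", "make", "mips-linux-gnu-gcc")
    }
    # Installed versions of commonly needed Python packages (None if absent).
    snapshot["packages"] = {}
    for pkg in ("numpy", "pandas", "requests"):
        try:
            snapshot["packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            snapshot["packages"][pkg] = None
    return snapshot
```

Injecting this into the first prompt spares the agent from spending its early turns on `which` and `pip list` reconnaissance.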

Iteration 2 Base: Merged Harness

Full validation on all 89 tasks (default resources): complete

Flipped (fail→pass):    8
Regressed (pass→fail):  1 (mailman)
Net:                    +7

Seven apparent regressions were resource-sensitive and recovered when re-run on matched resources; only mailman is a genuine regression.
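Counting flips and regressions between a parent and a merged harness reduces to a per-task diff. A minimal sketch, assuming results are `{task_name: passed}` dicts:

```python
def diff_results(parent, child):
    """Per-task diff between a parent harness and a child/merged harness.

    Both arguments are hypothetical {task_name: passed} mappings.
    """
    flipped = [t for t, ok in child.items() if ok and not parent.get(t, False)]
    regressed = [t for t, ok in child.items() if not ok and parent.get(t, False)]
    return flipped, regressed, len(flipped) - len(regressed)
```

On Iteration 1's numbers this would report 8 flips, 1 regression (mailman), and a net of +7.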

Results

69.3%   Baseline (61/88)
77.3%   After Iteration 1 (68/88, +7 net)

Cost comparison

                   Meta-Harness        EvoHarness
Baseline eval      $267                $267
Search iterations  8 × $267 = $2,136   2 branches × $69 = $138
Full validation    (same benchmark)    $267
Total              ~$2,443             ~$675
Useful changes     1                   2
Net improvement    +1.7%               +8.0%

Limitations

This experiment was run during a one-day hackathon. Several limitations apply: only one full search iteration was completed, results cover a single benchmark and inner model, and the EvoHarness score lacks the multi-trial error bars reported for official leaderboard entries.

Future work

Iter 2 Base (77.3%, ~20 failing)
├── iter_002_branch_001: proposer targets remaining failures
├── iter_002_branch_002: different surface
├── iter_002_branch_003: different approach
└── merge → Iter 3 Base → ...

Reproducing

# 1. Baseline
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 \
  --agent-import-path agent:AgentHarness -e modal -k 1 -n 89 \
  -o jobs/baseline -y --env-file .env

# 2. Import traces
python -m meta import jobs/baseline/baseline --output runs/experiment

# 3. Evolve (propose + apply + eval)
python -m meta.run_iteration --experiment-dir runs/experiment \
  --iteration 1 --parent-variant baseline \
  --proposer-model sonnet --eval-env modal

# 4. Validate best harness (leaderboard conditions)
cd runs/experiment/iter_001_branch_001/harness
harbor run -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -k 5

Built at the Paradigm Automated Research Hackathon. Extends Meta-Harness (Lee et al.), better-harness (LangChain), and AutoHarness (DeepMind). Inner agent: Terminus-KIRA by KRAFTON AI. Evaluated on TerminalBench-2.