The Eval Engine — SubQ Code Eval Framework

Deep Dives

Data pipeline from task YAML to comparison report. Orchestrator, adapters, enrichers, normalizer—plus the full type lattice.

The Three Agents

Head-to-head capability matrix. 10 features, 3 agents, verdict badges. Why each adapter exists.

The Hermetic Loop

Solve et coagula. Five-stage self-evolution cycle: execute, analyze, build, evaluate, select. The hero page.

Prior Art & Lineage

EvoSkill, Hermes, GEPA, LangChain harness engineering, and 7 more projects that shaped this framework.

The Full Plan

The complete 1,182-line specification: architecture, types, CLI commands, security model, and all 12 reviewer findings.

The Build Sequence

Seven phases, 14 modules, 15+ types, 3 agents—from foundation types to self-evolving prompt optimization. Phase 1 is built and tested. The remaining six phases lay a path from adapter integration through the Hermetic Loop.

Foundation: Complete

Adapters: Planned

Enricher & Normalizer: Planned

Orchestrator & CLI: Planned

Comparison & Judge: Planned

Reporting: Planned

Self-Evolution: Planned

Phase 1 · Foundation

Types, Tasks, Worktrees, Cost

Core type system with TypeBox validation. Task loader for YAML definitions. Git worktree allocator that creates worktrees one at a time. Per-agent cost tracking.

types.ts task.ts worktree.ts cost.ts

Phase 2–3 · Adapters & Enrichment

Process Spawning & JSONL Parsing

One adapter per agent manages process lifecycle. Enrichers parse RawAgentOutput into canonical EnrichedEvent[] via Bun.JSONL.parseChunk().

pi-agent.ts claude-code.ts enricher.ts
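A minimal sketch of the enrichment step, assuming each agent writes one JSON object per line. It uses plain JSON.parse on split lines as a stand-in for Bun.JSONL.parseChunk(), and the raw field names (type, t, path) and the EnrichedEvent fields are hypothetical, not taken from the specification.

```typescript
// Hypothetical event shape; the real EnrichedEvent type would live in types.ts.
interface EnrichedEvent {
  agent: string;
  kind: string; // e.g. "file_read", "file_edit", "test_run"
  atSeconds: number; // seconds since the agent process started
  detail?: string; // file path or command, when present
}

// Parses the complete JSONL lines in a buffer. Returns the events plus any
// trailing partial line so the caller can prepend it to the next chunk.
function enrichChunk(
  agent: string,
  buffer: string,
): { events: EnrichedEvent[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element is an incomplete line or ""
  const events: EnrichedEvent[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let raw: unknown;
    try {
      raw = JSON.parse(trimmed);
    } catch {
      continue; // skip malformed lines instead of aborting the run
    }
    if (typeof raw !== "object" || raw === null) continue;
    const rec = raw as Record<string, unknown>;
    const kind = rec["type"];
    const t = rec["t"];
    const path = rec["path"];
    if (typeof kind !== "string" || typeof t !== "number") continue;
    const event: EnrichedEvent = { agent, kind, atSeconds: t };
    if (typeof path === "string") event.detail = path;
    events.push(event);
  }
  return { events, rest };
}
```

Returning the unparsed remainder keeps the function usable on a stream, where a chunk boundary can fall in the middle of a line.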

Phase 4–6 · Orchestration & Analysis

CLI, Comparison, Reporting

Commander.js wiring into TOP_LEVEL_COMMANDS. Milestone alignment, quality rubric, LLM judge. EvalReport with --robot JSON output.

orchestrator.ts compare.ts reporter.ts

Phase 7a–c · Self-Evolution

The Hermetic Loop

EvoSkill three-agent loop + GEPA prompt mutation. Frontier-based selection on git branches. Tiered optimization: skills → tools → prompts.

proposer.ts builder.ts frontier.ts
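The select stage can be pictured as keeping a small frontier of the best candidates, each on its own git branch. This is a sketch only, assuming the frontier is simply the top-k candidates by evaluation score; the Candidate shape, the branch names, and k are hypothetical, and the actual selection rule in frontier.ts may differ.

```typescript
interface Candidate {
  branch: string; // git branch holding this variant (hypothetical naming)
  score: number; // aggregate evaluation score, higher is better
}

// Returns the new frontier: the k best candidates drawn from the old frontier
// plus the newly evaluated ones. A branch that appears twice keeps its best
// observed score. Ties keep the candidate that was seen first (stable sort).
function updateFrontier(
  frontier: Candidate[],
  evaluated: Candidate[],
  k: number,
): Candidate[] {
  const byBranch = new Map<string, Candidate>();
  for (const c of [...frontier, ...evaluated]) {
    const seen = byBranch.get(c.branch);
    if (seen === undefined || c.score > seen.score) byBranch.set(c.branch, c);
  }
  return [...byBranch.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}
```

Branches that fall off the frontier are not deleted here; whether to prune them is a separate decision.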

Sample Run

subq eval run — fix-auth-race.yaml

$ subq eval run fix-auth-race.yaml --agents pi-agent,claude-code --runs 3

worktree Creating eval/run-001/pi-agent... allocating

worktree Creating eval/run-001/claude-code... allocating

pi-agent Spawning: subq code --json "Fix the auth race condition..." running

claude Spawning: claude -p "Fix the auth race..." --bare running

-----

pi-agent ◆ first_file_read at 4.2s (src/auth/middleware.ts)

claude ◆ first_file_read at 2.1s (src/auth/middleware.ts)

pi-agent ◆ first_test_run at 38.7s (bun run test -- --filter auth)

claude ◆ first_file_edit at 12.4s (src/auth/middleware.ts)

compare Milestone alignment complete. pi-agent: 7/9, claude-code: 8/9

report EvalReport written to eval-runs/run-001/report.json complete
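A rough sketch of how a milestone tally such as the ones in the compare line could be derived, assuming the task defines a list of expected milestone names and each agent's trace yields the milestones it reached. The Milestone and Alignment types and the function name are hypothetical; the milestone names and timings in the usage note come from the sample run above.

```typescript
interface Milestone {
  name: string; // e.g. "first_file_read"
  atSeconds: number; // time since the agent was spawned
}

interface Alignment {
  reached: number;
  total: number;
  missing: string[];
}

// Counts how many of the expected milestones appear in an agent's trace.
// Assumes the expected names are unique.
function alignMilestones(expected: string[], observed: Milestone[]): Alignment {
  const seen = new Set(observed.map((m) => m.name));
  const missing = expected.filter((name) => !seen.has(name));
  return {
    reached: expected.length - missing.length,
    total: expected.length,
    missing,
  };
}

const expectedMilestones = ["first_file_read", "first_file_edit", "first_test_run"];
const piAgentAlignment = alignMilestones(expectedMilestones, [
  { name: "first_file_read", atSeconds: 4.2 },
  { name: "first_test_run", atSeconds: 38.7 },
]);
```

On the three milestones visible in the log excerpt, piAgentAlignment reports 2 of 3 reached, with first_file_edit missing.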

CLI Commands

subq eval — command reference

eval run: Execute agents on a task YAML. Flags: --agents, --runs, --timeout.

eval list: List available task definitions from a directory.

eval compare: Generate a comparison report from completed eval runs.

eval report: Render an EvalReport as a terminal table or as --robot JSON.

eval clean: Prune stale worktrees and archived eval runs.

eval knob: View or mutate the KnobConfig for an agent. Flags: --set, --diff.

Design Principles

Three-Knob Model

System prompt, tools, and middleware—the only levers that matter. Each tunable independently, each measurable.

LangChain: 13.7pt gain from harness alone
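To make "tunable independently" concrete, here is a small sketch that reports which knobs differ between two configurations. The KnobConfig fields shown are an assumed shape covering the three knobs, not the type from the specification, and changedKnobs is a hypothetical helper.

```typescript
// Assumed shape: one field per knob.
interface KnobConfig {
  systemPrompt: string;
  tools: string[];
  middleware: string[];
}

type KnobName = keyof KnobConfig;

// Order-sensitive list comparison, since middleware order can matter.
function sameList(x: string[], y: string[]): boolean {
  return x.length === y.length && x.every((v, i) => v === y[i]);
}

// Returns the names of the knobs that differ between two configurations.
// A run pair that differs in exactly one knob lets a score change be
// attributed to that knob.
function changedKnobs(a: KnobConfig, b: KnobConfig): KnobName[] {
  const changed: KnobName[] = [];
  if (a.systemPrompt !== b.systemPrompt) changed.push("systemPrompt");
  if (!sameList(a.tools, b.tools)) changed.push("tools");
  if (!sameList(a.middleware, b.middleware)) changed.push("middleware");
  return changed;
}
```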

Worktree Isolation

Every agent gets a dedicated git worktree. Sequential allocation avoids the .git/config.lock race. Clean state per run.

Consensus primitive across 4 projects
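Sequential allocation can be sketched as a loop that awaits each git command before starting the next. The runner is injected so the sketch does not depend on a particular process API, and the path and branch naming is an assumption modeled on the sample run output.

```typescript
// Runs one git command. Injected by the caller (Bun.spawn, child_process, or a fake).
type GitRunner = (args: string[]) => Promise<void>;

// Assumed naming scheme: the worktree path and its branch share one name.
function worktreeName(runId: string, agent: string): string {
  return `eval/${runId}/${agent}`;
}

// Arguments for `git worktree add -b <branch> <path>`.
function worktreeArgs(runId: string, agent: string): string[] {
  const name = worktreeName(runId, agent);
  return ["worktree", "add", "-b", name, name];
}

// Creates worktrees one at a time. Awaiting inside the loop, rather than
// starting every command with Promise.all, means no two git processes
// contend for the repository's lock files.
async function allocateWorktrees(
  runId: string,
  agents: string[],
  run: GitRunner,
): Promise<string[]> {
  const created: string[] = [];
  for (const agent of agents) {
    await run(worktreeArgs(runId, agent));
    created.push(worktreeName(runId, agent));
  }
  return created;
}
```

The cost is a little startup latency per agent; the agents themselves can still run in parallel once their worktrees exist.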

Trace Auditing

Every eval run captures full JSONL transcripts. Anti-gaming heuristics detect benchmark cheating. Scores without traces are worthless.

Berkeley/Penn: top 3 submissions cheated

Env Allowlist

Child processes receive only approved environment variables. No API key leakage across agent boundaries. TypeBox validates all input.

Security-first subprocess isolation
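The allowlist idea fits in a few lines. This is a sketch under assumptions: filterEnv is a hypothetical helper, and the variable names in the usage note are illustrative rather than the framework's actual allowlist.

```typescript
type Env = Record<string, string | undefined>;

// Copies only the allowlisted variables that are actually set on the parent.
// Everything else, including keys that belong to other agents, never reaches
// the child process.
function filterEnv(parent: Env, allowlist: readonly string[]): Record<string, string> {
  const child: Record<string, string> = {};
  for (const name of allowlist) {
    // Own-property check so inherited names like "toString" are never copied.
    if (!Object.prototype.hasOwnProperty.call(parent, name)) continue;
    const value = parent[name];
    if (typeof value === "string") child[name] = value;
  }
  return child;
}
```

For example, filterEnv(parentEnv, ["PATH", "HOME", "ANTHROPIC_API_KEY"]) yields an object that can be passed as the env option when spawning one agent, while a different allowlist is used for the next.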