SubQ

The Eval Engine

Comparative evaluation of coding agents on identical tasks in isolated worktrees. Head-to-head analysis of SubQ Code, Claude Code, and Codex CLI—system prompts, tool usage, timing, and failure patterns.

Deep Dives

The Build Sequence

Seven phases, 14 modules, 15+ types, 3 agents—from foundation types to self-evolving prompt optimization. Phase 1 is built and tested. The remaining six phases lay a path from adapter integration through the Hermetic Loop.

Phase 1 · Foundation · Complete
Phase 2 · Adapters · Planned
Phase 3 · Enricher & Normalizer · Planned
Phase 4 · Orchestrator & CLI · Planned
Phase 5 · Comparison & Judge · Planned
Phase 6 · Reporting · Planned
Phase 7 · Self-Evolution · Planned

Phase 1 · Foundation

Types, Tasks, Worktrees, Cost

Core type system with TypeBox validation. Task loader for YAML definitions. Git worktree allocator that creates worktrees sequentially. Per-agent cost tracking.

types.ts task.ts worktree.ts cost.ts
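Per-agent cost tracking can be pictured as a small accumulator keyed by agent name. The sketch below is illustrative only: `createCostTracker`, the `Pricing` and `UsageRecord` shapes, and the per-million-token pricing unit are all assumptions, not the contents of cost.ts.

```typescript
// Hypothetical shapes for illustration; the real cost.ts is not shown in the source.
interface Pricing {
  inputPerMillion: number; // USD per 1M input tokens (assumed unit)
  outputPerMillion: number; // USD per 1M output tokens (assumed unit)
}

interface UsageRecord {
  agent: string;
  inputTokens: number;
  outputTokens: number;
}

function createCostTracker(pricing: Record<string, Pricing>) {
  const totals = new Map<string, number>();
  return {
    // Convert one usage record to USD and add it to that agent's running total.
    record(usage: UsageRecord): void {
      const price = pricing[usage.agent];
      if (price === undefined) {
        throw new Error(`No pricing configured for agent "${usage.agent}"`);
      }
      const cost =
        (usage.inputTokens * price.inputPerMillion +
          usage.outputTokens * price.outputPerMillion) /
        1_000_000;
      totals.set(usage.agent, (totals.get(usage.agent) ?? 0) + cost);
    },
    // Agents with no recorded usage report a total of 0.
    total(agent: string): number {
      return totals.get(agent) ?? 0;
    },
  };
}
```

Refusing to record usage for an agent without configured pricing is a deliberate choice in this sketch: a silently missing price would make one agent look free in a head-to-head comparison.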

Phase 2–3 · Adapters & Enrichment

Process Spawning & JSONL Parsing

One adapter per agent manages process lifecycle. Enrichers parse RawAgentOutput into canonical EnrichedEvent[] via Bun.JSONL.parseChunk().

pi-agent.ts claude-code.ts enricher.ts
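To make the enrichment step concrete, here is a minimal sketch of incremental JSONL parsing followed by normalization. It deliberately uses a hand-written line buffer as a stand-in rather than the Bun API named above, and the `EnrichedEvent` fields plus the raw `type` and `ts` keys are assumptions, since the source does not show the actual shapes.

```typescript
// Hypothetical event shape; the project's EnrichedEvent type is not shown in the source.
interface EnrichedEvent {
  agent: string;
  type: string;
  atMs: number;
}

// Incremental JSONL parser: feed it arbitrary stdout chunks, get complete records back.
function createJsonlParser() {
  let buffer = "";
  return {
    push(chunk: string): unknown[] {
      buffer += chunk;
      const lines = buffer.split("\n");
      // The last piece may be an incomplete line; keep it for the next chunk.
      buffer = lines.pop() ?? "";
      const records: unknown[] = [];
      for (const line of lines) {
        const trimmed = line.trim();
        if (trimmed === "") continue;
        try {
          records.push(JSON.parse(trimmed));
        } catch {
          // Malformed lines are skipped in this sketch; a real enricher might log them.
        }
      }
      return records;
    },
  };
}

// Map one raw record to the canonical shape, or null if required fields are missing.
// The raw keys "type" and "ts" are assumed for illustration.
function toEnrichedEvent(agent: string, raw: unknown): EnrichedEvent | null {
  if (typeof raw !== "object" || raw === null) return null;
  const rec = raw as Record<string, unknown>;
  const type = rec["type"];
  const ts = rec["ts"];
  if (typeof type !== "string" || typeof ts !== "number") return null;
  return { agent, type, atMs: ts };
}
```

Buffering the trailing partial line matters because a child process can flush stdout in the middle of a record; parsing each chunk independently would drop or corrupt events that straddle a chunk boundary.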

Phase 4–6 · Orchestration & Analysis

CLI, Comparison, Reporting

Commander.js wiring into TOP_LEVEL_COMMANDS. Milestone alignment, quality rubric, LLM judge. EvalReport with --robot JSON output.

orchestrator.ts compare.ts reporter.ts
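Milestone alignment can be sketched as a lookup of the first time each agent reached each expected milestone, plus a per-agent count of milestones reached. The function name, the `TimedEvent` shape, and the milestone names below are assumptions for illustration; compare.ts itself is not shown in the source.

```typescript
// Hypothetical event shape: a milestone name and seconds since the run started.
interface TimedEvent {
  type: string;
  atSeconds: number;
}

// For each expected milestone, record when each agent first reached it (null if never),
// then count how many milestones each agent reached.
function alignMilestones(milestones: string[], runs: Record<string, TimedEvent[]>) {
  const rows = milestones.map((milestone) => {
    const firstSeen: Record<string, number | null> = {};
    for (const [agent, events] of Object.entries(runs)) {
      const hits = events
        .filter((event) => event.type === milestone)
        .map((event) => event.atSeconds);
      firstSeen[agent] = hits.length > 0 ? Math.min(...hits) : null;
    }
    return { milestone, firstSeen };
  });

  const reached: Record<string, number> = {};
  for (const agent of Object.keys(runs)) {
    reached[agent] = rows.filter((row) => row.firstSeen[agent] != null).length;
  }
  return { rows, reached };
}
```

Each row puts the agents side by side for one milestone, so a table renderer can show both the timing gap and any milestone an agent skipped.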

Phase 7a–c · Self-Evolution

The Hermetic Loop

EvoSkill three-agent loop + GEPA prompt mutation. Frontier-based selection on git branches. Tiered optimization: skills → tools → prompts.

proposer.ts builder.ts frontier.ts
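The source does not define how the frontier is computed. One common reading is a Pareto frontier: keep every candidate branch that no other candidate beats on all objectives at once. The sketch below assumes two objectives, a quality score (higher is better) and a cost in USD (lower is better); the `Candidate` shape and both function names are hypothetical and may not match frontier.ts.

```typescript
// Hypothetical candidate: one git branch with its measured score and cost.
interface Candidate {
  branch: string;
  score: number; // higher is better (assumed objective)
  costUsd: number; // lower is better (assumed objective)
}

// a dominates b when a is at least as good on both objectives and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.score >= b.score && a.costUsd <= b.costUsd;
  const strictlyBetter = a.score > b.score || a.costUsd < b.costUsd;
  return noWorse && strictlyBetter;
}

// Keep only the candidates that no other candidate dominates.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (candidate) => !candidates.some((other) => dominates(other, candidate)),
  );
}
```

This pairwise check is quadratic in the number of candidates, which is fine for the handful of branches a single evolution round is likely to produce.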

Sample Run

subq eval run — fix-auth-race.yaml
$ subq eval run fix-auth-race.yaml --agents pi-agent,claude-code --runs 3
worktree Creating eval/run-001/pi-agent... allocating
worktree Creating eval/run-001/claude-code... allocating
pi-agent Spawning: subq code --json "Fix the auth race condition..." running
claude Spawning: claude -p "Fix the auth race..." --bare running
-----
pi-agent ◆ first_file_read at 4.2s (src/auth/middleware.ts)
claude ◆ first_file_read at 2.1s (src/auth/middleware.ts)
pi-agent ◆ first_test_run at 38.7s (bun run test -- --filter auth)
claude ◆ first_file_edit at 12.4s (src/auth/middleware.ts)
compare Milestone alignment complete. pi-agent: 7/9, claude-code: 8/9
report EvalReport written to eval-runs/run-001/report.json complete

CLI Commands

subq eval — command reference
eval run       Execute agents on a task YAML. Flags: --agents, --runs, --timeout.
eval list      List available task definitions from a directory.
eval compare   Generate comparison report from completed eval runs.
eval report    Render EvalReport as terminal table or --robot JSON.
eval clean     Prune stale worktrees and archived eval runs.
eval knob      View or mutate KnobConfig for an agent. Flags: --set, --diff.

Design Principles

Three-Knob Model

System prompt, tools, and middleware—the only levers that matter. Each tunable independently, each measurable.

LangChain: 13.7pt gain from harness alone

Worktree Isolation

Every agent gets a dedicated git worktree. Worktrees are allocated one at a time to avoid racing on .git/config.lock. Each run starts from clean state.

Consensus primitive across 4 projects
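One way to get sequential allocation is to chain every request onto a single promise, so the next `git worktree add` starts only after the previous one has finished. This is a sketch under assumptions: `createWorktreeAllocator` and the injected `runGit` function are hypothetical names, and the real worktree.ts may serialize differently.

```typescript
// Injected command runner, so the sketch stays independent of any process API.
// In practice this would spawn `git` with the given arguments.
type RunGit = (args: string[]) => Promise<void>;

function createWorktreeAllocator(runGit: RunGit) {
  // Tail of the queue; each allocation waits for the one before it.
  let tail: Promise<void> = Promise.resolve();

  return function allocate(path: string, branch: string): Promise<void> {
    const next = tail.then(() => runGit(["worktree", "add", "-b", branch, path]));
    // Swallow failures on the queue itself so one failed allocation
    // does not block later ones; the caller still sees the rejection via `next`.
    tail = next.catch(() => undefined);
    return next;
  };
}
```

Callers can still fire off all allocations at once with `Promise.all`; the chain guarantees that only one git process touches the repository's shared lock files at a time.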

Trace Auditing

Every eval run captures full JSONL transcripts. Anti-gaming heuristics detect benchmark cheating. Scores without traces are worthless.

Berkeley/Penn: top 3 submissions cheated

Env Allowlist

Child processes receive only approved environment variables. No API key leakage across agent boundaries. TypeBox validates all input.

Security-first subprocess isolation
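The allowlist idea reduces to building the child's environment from scratch instead of copying the parent's. The helper below is a minimal sketch; `filterEnv` and the variable names in the usage note are illustrative assumptions, not the project's actual allowlist.

```typescript
// Build a child environment containing only the allowlisted variables
// that are actually set in the parent environment.
function filterEnv(
  env: Record<string, string | undefined>,
  allowlist: readonly string[],
): Record<string, string> {
  const allowed: Record<string, string> = {};
  for (const name of allowlist) {
    const value = env[name];
    if (value !== undefined) {
      allowed[name] = value;
    }
  }
  return allowed;
}
```

For example, an adapter for one agent might call `filterEnv(parentEnv, ["PATH", "HOME", "ANTHROPIC_API_KEY"])` and pass the result as the spawned process's environment, so a key belonging to a different agent's provider never reaches it. Iterating over the allowlist rather than the parent environment keeps the default closed: a variable is dropped unless someone explicitly approved it.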