SubQ Code

The Eval Engine

Comparative evaluation of coding agents on identical tasks in isolated worktrees. Head-to-head analysis of SubQ Code, Claude Code, and Codex CLI—system prompts, tool usage, timing, and failure patterns.

Deep Dives

The Build Sequence

Seven phases, 14 modules, 15+ types, 3 agents—from foundation types to self-evolving prompt optimization. Phase 1 is built and tested. The remaining six phases lay a path from adapter integration through the Hermetic Loop.

1 · Foundation · Complete
2 · Adapters · Planned
3 · Enricher & Normalizer · Planned
4 · Orchestrator & CLI · Planned
5 · Comparison & Judge · Planned
6 · Reporting · Planned
7 · Self-Evolution · Planned

Phase 1 · Foundation

Types, Tasks, Worktrees, Cost

Core type system with TypeBox validation. Task loader for YAML definitions. Git worktree allocator that creates worktrees one at a time. Per-agent cost tracking.

types.ts task.ts worktree.ts cost.ts
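To make the task-loading step concrete, here is a minimal sketch of validating a task definition after its YAML has been parsed. It is an illustration only: the field names (id, prompt, timeoutSec) and the validateTask helper are assumptions, since the real shapes in types.ts and task.ts are not shown here, and the check is written by hand rather than with TypeBox so the example stays self-contained.

```typescript
// Hypothetical task shape; the real fields in task.ts are not shown in this document.
interface TaskDefinition {
  id: string;
  prompt: string;
  timeoutSec: number;
}

type ValidationResult =
  | { ok: true; task: TaskDefinition }
  | { ok: false; errors: string[] };

// Validate an already-parsed YAML document (an unknown value) against the shape above.
// Hand-written stand-in for a schema library such as TypeBox.
function validateTask(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, errors: ["task must be a mapping"] };
  }
  const record = input as Record<string, unknown>;
  const id = record["id"];
  const prompt = record["prompt"];
  const timeoutSec = record["timeoutSec"];

  // Collect every problem instead of stopping at the first one.
  const errors: string[] = [];
  if (typeof id !== "string" || id.length === 0) {
    errors.push("id must be a non-empty string");
  }
  if (typeof prompt !== "string" || prompt.length === 0) {
    errors.push("prompt must be a non-empty string");
  }
  if (typeof timeoutSec !== "number" || !Number.isFinite(timeoutSec) || timeoutSec <= 0) {
    errors.push("timeoutSec must be a positive number");
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    task: { id: id as string, prompt: prompt as string, timeoutSec: timeoutSec as number },
  };
}
```

Returning a tagged result rather than throwing lets a loader report every invalid field in a task file at once.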

Phase 2–3 · Adapters & Enrichment

Process Spawning & JSONL Parsing

One adapter per agent manages process lifecycle. Enrichers parse RawAgentOutput into canonical EnrichedEvent[] via Bun.JSONL.parseChunk().

pi-agent.ts claude-code.ts enricher.ts
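The enrichment step can be pictured as two stages: split a stream of JSONL text into complete records, then map each record to a canonical event. The sketch below is a plain TypeScript stand-in and does not use Bun.JSONL.parseChunk(). The EnrichedEvent fields, the raw record keys (type, t), and the JsonlBuffer and enrich names are all assumptions made for illustration.

```typescript
// Hypothetical canonical event; the real EnrichedEvent fields are not shown here.
interface EnrichedEvent {
  agent: string;
  kind: string;
  atMs: number;
}

// Buffers partial lines between chunks, so a record split across two reads
// is parsed only once it is complete.
class JsonlBuffer {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an unfinished line (or "" if the chunk ended on a newline).
    this.pending = lines.pop() ?? "";
    const values: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed.length === 0) continue;
      try {
        values.push(JSON.parse(trimmed));
      } catch {
        // Skip malformed lines rather than aborting the whole transcript.
      }
    }
    return values;
  }
}

// Map raw records to canonical events. The keys "type" and "t" are assumed
// for this sketch; each agent's real output format will differ.
function enrich(agent: string, raw: unknown[]): EnrichedEvent[] {
  const events: EnrichedEvent[] = [];
  for (const value of raw) {
    if (typeof value !== "object" || value === null) continue;
    const record = value as Record<string, unknown>;
    const kind = record["type"];
    const atMs = record["t"];
    if (typeof kind === "string" && typeof atMs === "number") {
      events.push({ agent, kind, atMs });
    }
  }
  return events;
}
```

Keeping the buffering separate from the mapping means one adapter per agent can supply its own mapping while sharing the line handling.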

Phase 4–6 · Orchestration & Analysis

CLI, Comparison, Reporting

Commander.js wiring into TOP_LEVEL_COMMANDS. Milestone alignment, quality rubric, LLM judge. EvalReport with --robot JSON output.

orchestrator.ts compare.ts reporter.ts
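One plausible reading of milestone alignment is: for each named milestone, find the first time each agent reached it, then count how many milestones each agent hit. The sketch below follows that reading. The MilestoneEvent and AlignmentRow shapes and both function names are assumptions, not the contents of compare.ts.

```typescript
// Hypothetical event shape: a milestone name and the time it occurred.
interface MilestoneEvent {
  milestone: string;
  atSec: number;
}

// One row per milestone, with each agent's first time (null if never reached).
interface AlignmentRow {
  milestone: string;
  times: Record<string, number | null>;
}

function alignMilestones(
  milestones: string[],
  runs: Record<string, MilestoneEvent[]>,
): AlignmentRow[] {
  return milestones.map((milestone) => {
    const times: Record<string, number | null> = {};
    for (const agent of Object.keys(runs)) {
      const events = runs[agent] ?? [];
      const hits = events
        .filter((event) => event.milestone === milestone)
        .map((event) => event.atSec);
      // Use the earliest occurrence as the milestone time.
      times[agent] = hits.length > 0 ? Math.min(...hits) : null;
    }
    return { milestone, times };
  });
}

// Number of milestones an agent reached at least once.
function reachedCount(rows: AlignmentRow[], agent: string): number {
  return rows.filter((row) => typeof row.times[agent] === "number").length;
}
```

A per-agent tally in the form "reached / total" then comes from reachedCount(rows, agent) and rows.length.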

Phase 7a–c · Self-Evolution

The Hermetic Loop

EvoSkill three-agent loop + GEPA prompt mutation. Frontier-based selection on git branches. Tiered optimization: skills → tools → prompts.

proposer.ts builder.ts frontier.ts
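Frontier-based selection is not defined further on this page. A common interpretation is a Pareto frontier: keep every candidate that no other candidate beats on all objectives. The sketch below assumes two objectives (higher score, lower cost) and identifies candidates by branch name. The Candidate shape, the branch names in the usage note, and the choice of objectives are assumptions, not a description of frontier.ts.

```typescript
// Hypothetical candidate: one prompt/skill variant living on its own git branch.
interface Candidate {
  branch: string;
  score: number;   // higher is better
  costUsd: number; // lower is better
}

// a dominates b if it is no worse on both objectives and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.score >= b.score && a.costUsd <= b.costUsd;
  const strictlyBetter = a.score > b.score || a.costUsd < b.costUsd;
  return noWorse && strictlyBetter;
}

// Keep the candidates that no other candidate dominates.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (candidate) => !candidates.some((other) => dominates(other, candidate)),
  );
}
```

For example, given candidates on branches evo/a (score 0.8, cost 1.0), evo/b (0.6, 0.5), evo/c (0.5, 0.9), and evo/d (0.8, 1.2), the frontier keeps evo/a and evo/b: evo/c is dominated by evo/b and evo/d by evo/a. The quadratic scan is adequate for the small candidate pools a loop like this would typically hold.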

Sample Run

subq eval run — fix-auth-race.yaml
$ subq eval run fix-auth-race.yaml --agents pi-agent,claude-code --runs 3
worktree Creating eval/run-001/pi-agent... allocating
worktree Creating eval/run-001/claude-code... allocating
pi-agent Spawning: subq code --json "Fix the auth race condition..." running
claude Spawning: claude -p "Fix the auth race..." --bare running
-----
pi-agent ◆ first_file_read at 4.2s (src/auth/middleware.ts)
claude ◆ first_file_read at 2.1s (src/auth/middleware.ts)
pi-agent ◆ first_test_run at 38.7s (bun run test -- --filter auth)
claude ◆ first_file_edit at 12.4s (src/auth/middleware.ts)
compare Milestone alignment complete. pi-agent: 7/9, claude-code: 8/9
report EvalReport written to eval-runs/run-001/report.json complete

CLI Commands

subq eval — command reference
eval run      Execute agents on a task YAML. Flags: --agents, --runs, --timeout.
eval list     List available task definitions from a directory.
eval compare  Generate a comparison report from completed eval runs.
eval report   Render an EvalReport as a terminal table or --robot JSON.
eval clean    Prune stale worktrees and archived eval runs.
eval knob     View or mutate the KnobConfig for an agent. Flags: --set, --diff.

Design Principles

Three-Knob Model

System prompt, tools, and middleware—the only levers that matter. Each tunable independently, each measurable.
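Tuning each knob independently implies being able to tell which knobs differ between two configurations. The sketch below shows one way to do that. KnobConfig is named in the command reference, but its fields here (systemPrompt, tools, middleware) and the helper names are assumptions.

```typescript
// Hypothetical shape for the three knobs; the real KnobConfig is not shown here.
interface KnobConfig {
  systemPrompt: string;
  tools: string[];
  middleware: string[];
}

type KnobName = keyof KnobConfig;

// Order-sensitive comparison of two string lists.
function sameList(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((item, index) => item === b[index]);
}

// Names of the knobs whose values differ between two configurations.
function diffKnobs(before: KnobConfig, after: KnobConfig): KnobName[] {
  const changed: KnobName[] = [];
  if (before.systemPrompt !== after.systemPrompt) changed.push("systemPrompt");
  if (!sameList(before.tools, after.tools)) changed.push("tools");
  if (!sameList(before.middleware, after.middleware)) changed.push("middleware");
  return changed;
}

// A score difference can be attributed to one knob only if exactly one changed.
function isSingleKnobChange(before: KnobConfig, after: KnobConfig): boolean {
  return diffKnobs(before, after).length === 1;
}
```

Checking isSingleKnobChange before comparing two runs is one way to keep a measured difference attributable to a single lever.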

Worktree Isolation

Every agent gets a dedicated git worktree. Worktrees are allocated one at a time to avoid a race on .git/config.lock. Each run starts from a clean state.

Consensus primitive across 4 projects
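Sequential allocation can be sketched as a loop that awaits each git invocation before starting the next, instead of launching them all with Promise.all. The git command is injected as a function so the example runs without a repository. The GitRunner type, the allocateWorktrees name, the path layout (borrowed from the sample run), and the use of --detach are assumptions, not the contents of worktree.ts.

```typescript
// Injected command runner: receives the arguments that would follow `git`.
// A real implementation would spawn the git process; this type is hypothetical.
type GitRunner = (args: string[]) => Promise<void>;

// Create one worktree per agent, strictly one at a time.
async function allocateWorktrees(
  runId: string,
  agents: string[],
  git: GitRunner,
): Promise<string[]> {
  const paths: string[] = [];
  for (const agent of agents) {
    // Assumed layout, mirroring the paths shown in the sample run.
    const path = `eval/${runId}/${agent}`;
    // Awaiting inside the loop keeps a single `git worktree add` in flight,
    // so two invocations never contend for the repository's lock files.
    await git(["worktree", "add", "--detach", path]);
    paths.push(path);
  }
  return paths;
}
```

The cost is that setup time grows linearly with the number of agents. Since worktree creation is usually brief compared with an agent run, that trade seems reasonable.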

Trace Auditing

Every eval run captures full JSONL transcripts. Anti-gaming heuristics detect benchmark cheating. Scores without traces are worthless.

Env Allowlist

Child processes receive only approved environment variables. No API key leakage across agent boundaries. TypeBox validates all input.

Security-first subprocess isolation
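An environment allowlist amounts to building the child's environment from scratch and copying over only the named variables. A minimal sketch follows. The filterEnv name and the variable names in the usage note are assumptions, and the real allowlist is not shown on this page.

```typescript
// Parent environment as exposed by most runtimes: values may be undefined.
type Env = Record<string, string | undefined>;

// Build a child environment containing only allowlisted variables that are set.
function filterEnv(parent: Env, allowlist: readonly string[]): Record<string, string> {
  const child: Record<string, string> = {};
  for (const name of allowlist) {
    const value = parent[name];
    // Skip names that are absent, so the child never sees "undefined" strings.
    if (typeof value === "string") {
      child[name] = value;
    }
  }
  return child;
}
```

For instance, with an allowlist of PATH, HOME, and ANTHROPIC_API_KEY, a parent environment that also contains OPENAI_API_KEY yields a child environment without it. The result would be passed as the env option when spawning the agent process, replacing the inherited environment rather than extending it.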