Context
SubQ Code is an AI coding agent orchestration and measurement platform that already parses sessions from 12 different agents and computes leverage metrics. However, there is no way to run agents head-to-head on the same task and deeply compare their full transcripts—system prompts, tool usage, token efficiency, timing, and failure patterns.
The goal: build a system that orchestrates headless sessions of SubQ Code, Claude Code, and Codex CLI on identical tasks in isolated git worktrees, captures enriched JSONL transcripts, and generates side-by-side analysis reports. The ultimate purpose—identify what it would take to make SubQ Code a more competent coding agent, primarily by analyzing where its behavior diverges from stronger agents and modifying its system prompts at various pipeline stages.
Architecture
Six-stage data pipeline from task definition to comparison report.
File Structure
14 files under apps/cli/src/eval/, reduced from 27 in the original plan.
Core Types
15+ interfaces form the type lattice. Every type uses epoch-ms timestamps for safe JSON round-tripping. TypeBox schemas validate all YAML-loaded types at parse boundaries.
EvalAgentId
Identity
Extract<AgentType, "claude-code" | "codex" | "pi-agent">
RawAgentOutput
Capture
stdout bytes, stderr bytes, exit code, timestamps, optional disk JSONL path
EnrichedEvent
Canonical
Agent-agnostic event: message, tool_call, tool_result, thinking, system_prompt, error, cost, milestone
EnrichedSession
Complete
Messages, milestones, system prompt injections, token usage, wall clock time, exit code, task resolved
Milestone
Progress
9 kinds: first_file_read, first_file_edit, first_test_run, first_bash_command, first_search, first_error_recovery, task_completion, verification_pass, verification_fail
QualityRubric
Scoring
5 dimensions: correctness, completeness, code quality, minimal diff, verification. Each 0.0–1.0
TokenUsage
Input, output, cache read/write tokens, total, cost USD. Per-message and per-session aggregation.
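The per-message to per-session aggregation can be sketched as a plain reduction; the field names below are assumptions, not the project's actual interface:

```typescript
// Sketch of per-session aggregation over per-message TokenUsage records.
// Field names (inputTokens, cacheReadTokens, ...) are illustrative.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
  costUsd: number;
}

function aggregateTokenUsage(perMessage: TokenUsage[]): TokenUsage {
  return perMessage.reduce(
    (acc, u) => ({
      inputTokens: acc.inputTokens + u.inputTokens,
      outputTokens: acc.outputTokens + u.outputTokens,
      cacheReadTokens: acc.cacheReadTokens + u.cacheReadTokens,
      cacheWriteTokens: acc.cacheWriteTokens + u.cacheWriteTokens,
      totalTokens: acc.totalTokens + u.totalTokens,
      costUsd: acc.costUsd + u.costUsd,
    }),
    { inputTokens: 0, outputTokens: 0, cacheReadTokens: 0, cacheWriteTokens: 0, totalTokens: 0, costUsd: 0 },
  );
}
```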
SystemPromptInjection
5 sources: initial, reminder, compaction, tool_result, other. Captures content, label, turn index.
EvalTask (TypeBox)
YAML task definition: id, name, repoPath, baseCommit, prompt, verifyCommand, timeout, complexity, tags.
KnobConfig (TypeBox)
Per-agent knob overrides: system prompt file/append, tool allow/deny lists, budget cap, model override, custom env.
Key Design Decisions
Implementation Phases
Phase 1: Foundation
Types + Tasks + Worktrees
All enriched type definitions + TypeBox schemas. YAML task loading with js-yaml v4. Worktree allocation/cleanup with serialized creation and crash handlers. Cost normalization.
Phase 2: Adapters
SubQ + Claude Code
AgentAdapter and ProcessSpawner interfaces. SubQ adapter with --json stdout pipe. Claude adapter with --bare --stream-json --bypassPermissions. Env allowlist. ANTHROPIC_API_KEY stripped.
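The env allowlist can be sketched as a filter over the parent environment; the variable list is illustrative, only the ANTHROPIC_API_KEY stripping is from the plan:

```typescript
// Sketch: build a child env from an explicit allowlist instead of
// inheriting process.env wholesale. The allowlist entries are assumptions.
const ENV_ALLOWLIST = ["PATH", "HOME", "LANG", "TERM", "SHELL"];

function buildChildEnv(
  parentEnv: Record<string, string | undefined>,
  extra: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of ENV_ALLOWLIST) {
    const value = parentEnv[key];
    if (value !== undefined) env[key] = value;
  }
  // Secrets like ANTHROPIC_API_KEY are never copied: anything not
  // allowlisted simply does not reach the child process.
  return { ...env, ...extra };
}
```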
Phase 3: Enricher + Normalizer + Milestones
Stream Processing
Unified enrichment via Bun.JSONL.parseChunk(). Canonical tool name map (42 aliases → 8 canonical names). Milestone detection from enriched event stream with fixed TEST_COMMAND_PATTERNS.
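The alias map can be sketched as a lookup table with a lowercase fallback; the entries below are illustrative, not the real 42-alias table:

```typescript
// Sketch of alias -> canonical tool-name normalization. The real table maps
// 42 aliases onto 8 canonical names; these entries are assumptions.
const CANONICAL_TOOL_NAMES: Record<string, string> = {
  Read: "file_read",
  ReadFile: "file_read",
  Edit: "file_edit",
  str_replace_editor: "file_edit",
  Bash: "shell",
  Grep: "search",
  Glob: "search",
};

function canonicalToolName(raw: string): string {
  // Unknown tools pass through lowercased so they still group consistently.
  return CANONICAL_TOOL_NAMES[raw] ?? raw.toLowerCase();
}
```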
Phase 4: Orchestrator + CLI
Command Surface
Parallel Bun.spawn() execution with AbortSignal timeout. Ops layer following ops/leverage.ts pattern. Commander.js commands: run, compare, list, prompt-diff, tasks, clean. --robot on all subcommands.
Phase 5: Comparison + Prompt Analysis + LLM Judge
Analysis Engine
Milestone alignment on elapsedMs. 5-dimension quality rubric. System prompt segmentation into 10 semantic categories. Three diff formats: matrix, unified, imperative extraction. Dual LLM judge with position-swap protocol (Claude + Kimi K2.5).
Phase 6: Reporting
JSON --robot First
EvalReport with schemaVersion. formatRobot() as JSON.stringify(report, null, "\t"). Streaming progress events for long eval runs.
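The robot formatter is small enough to sketch directly; the EvalReport fields beyond schemaVersion are assumptions:

```typescript
// Sketch of the --robot JSON formatter: schemaVersion first, tab-indented
// output per the plan. Fields other than schemaVersion are illustrative.
interface EvalReport {
  schemaVersion: number;
  task: string;
  agents: string[];
  [key: string]: unknown;
}

function formatRobot(report: EvalReport): string {
  return JSON.stringify(report, null, "\t");
}
```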
Phase 7a: Self-Evolution Loop
EvoSkill Integration
The capstone: transform subq eval from measurement tool to evolution engine. Proposer analyzes failure traces, Skill Builder implements proposals, Frontier maintains top-N configurations as git branches. Feedback memory prevents circular proposals.
Phase 7b: GEPA-Style Prompt Evolution
Reflective Analysis
Imperative extraction + mutation of individual prompt rules. Section reordering for cache efficiency. Prompt A/B testing with statistical comparison across tasks.
Phase 7c: Extensions
Codex + TUI + SWE-bench
Codex CLI adapter via proc.exited. Ink TUI dashboard. HTML report template. SWE-bench task source. wterm observation mode for live browser-based session viewing.
CLI Commands
Security Model
System Prompt Analysis
The key deliverable for the stated goal—“make SubQ Code more competent.” Four operations on system prompts across agents.
1. Extract
Capture system prompts at every injection point for each agent: initial, reminder, compaction, tool_result, other.
2. Segment
10 semantic categories: identity, environment, workflow, tool_instructions, behavioral_rules, safety, error_recovery, output_format, injected_context, meta.
3. Diff
Three formats: semantic section matrix (presence/absence/token count), unified diff (line-level), imperative extraction (all “Do X” / “Never Y” rules compared).
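Imperative extraction can be sketched as a line-level regex pass; the trigger-word list is an assumption, not the project's actual pattern set:

```typescript
// Sketch of imperative-rule extraction: pull "Do X" / "Never Y" style lines
// out of a system prompt for cross-agent comparison.
const IMPERATIVE_RE = /^\s*[-*]?\s*(Always|Never|Do not|Don't|You must|Do)\b.*$/gim;

function extractImperatives(prompt: string): string[] {
  return (prompt.match(IMPERATIVE_RE) ?? []).map((line) =>
    line.replace(/^\s*[-*]\s*/, "").trim(),
  );
}
```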
4. Correlate
Map prompt sections to success/failure patterns via LLM trace attribution. Produces actionable prompt patches.
Research Findings
12 review agents and 3 external research integrations produced these critical corrections and insights.
Critical Corrections
Blocking Issues (3)
EvalAgentId must be derived via Extract<AgentType, …> rather than declared as a parallel union (flagged by 5 reviewers). "eval" must be added to TOP_LEVEL_COMMANDS (flagged by 3). Child-process env must be built from an allowlist, not inherited from process.env.
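The Extract fix can be sketched as follows; the AgentType union here is a stand-in for the project's real one, and the runtime guard is an illustrative addition:

```typescript
// Deriving the eval agent id from the existing AgentType union via Extract,
// so the two types cannot drift apart. AgentType below is a stand-in.
type AgentType =
  | "claude-code" | "codex" | "pi-agent"
  | "aider" | "goose"; // ...plus the other parsed agents

type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

const EVAL_AGENT_IDS: readonly EvalAgentId[] = ["claude-code", "codex", "pi-agent"];

// Runtime narrowing for values arriving from CLI args or JSONL.
function isEvalAgentId(id: string): id is EvalAgentId {
  return (EVAL_AGENT_IDS as readonly string[]).includes(id);
}
```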
Evaluation Methodology
Statistical Rigor
Non-determinism is the defining property. Minimum 3 runs per comparison. Multi-dimensional rubrics, not single metrics. Position-swap protocol for pairwise judges. Self-enhancement bias: use PoLL pattern (Claude + non-Claude judge).
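The position-swap protocol can be sketched as two judge passes with the candidates reordered; a verdict is only trusted when both orders agree. In the real pipeline the judge is an async LLM call; a sync signature keeps the sketch self-contained:

```typescript
type Verdict = "A" | "B" | "tie";

// Run the pairwise judge twice with positions swapped; disagreement
// between the two passes indicates position bias and is scored as a tie.
function judgeWithSwap(
  judge: (first: string, second: string) => Verdict,
  a: string,
  b: string,
): Verdict {
  const pass1 = judge(a, b);                 // a in first position
  const pass2 = judge(b, a);                 // swapped order
  const pass2Mapped: Verdict = pass2 === "A" ? "B" : pass2 === "B" ? "A" : "tie";
  return pass1 === pass2Mapped ? pass1 : "tie";
}
```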
ProcessSpawner Injection
Bun.spawn() is not available in Vitest. Injectable ProcessSpawner interface enables testing with synthetic JSONL fixtures.
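The seam can be sketched as a small interface plus a fixture-backed test double; the names and signature are assumptions:

```typescript
// Sketch of the injectable spawner seam: production code wires Bun.spawn(),
// tests substitute a fake that replays JSONL fixtures.
interface SpawnResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

interface ProcessSpawner {
  spawn(cmd: string[], env: Record<string, string>): Promise<SpawnResult>;
}

// Test double: returns a canned transcript instead of launching a process.
function fixtureSpawner(fixtureJsonl: string): ProcessSpawner {
  return {
    spawn: async () => ({ stdout: fixtureJsonl, stderr: "", exitCode: 0 }),
  };
}
```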
TEST_PATTERNS Fix
The original regex /test|pytest/ matches any command containing the substring, e.g. cat test.txt. Fixed: match against the command field and require test as the program name or a subcommand token.
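A corrected detector might look like the following; the runner list and patterns are assumptions, not the project's actual TEST_COMMAND_PATTERNS:

```typescript
// Sketch of the corrected test-command detector: "test" must appear as the
// program or a whole subcommand token, so `cat test.txt` no longer matches.
const TEST_COMMAND_PATTERNS: RegExp[] = [
  /^(?:\S*\/)?(?:pytest|jest|vitest)(?=\s|$)/, // runner as the program itself
  /^\S+\s+(?:run\s+)?test(?=\s|$)/,            // `bun test`, `go test`, `npm run test`
];

function isTestCommand(command: string): boolean {
  const trimmed = command.trim();
  return TEST_COMMAND_PATTERNS.some((re) => re.test(trimmed));
}
```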
Cache Impact Metrics
Prompt modifications invalidate Anthropic’s prefix cache. Static sections must precede dynamic. Track baselineCacheableTokens vs patchedCacheableTokens.
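The metric can be sketched as a prefix sum that stops at the first dynamic section, since the first non-static section breaks the prefix cache for everything after it; the section shape is an assumption:

```typescript
// Sketch of the cache-impact metric: count how many leading prompt tokens
// remain cacheable. Field names are illustrative.
interface PromptSection {
  tokens: number;
  static: boolean; // unchanged across sessions => cacheable while the prefix holds
}

function cacheableTokens(sections: PromptSection[]): number {
  let total = 0;
  for (const s of sections) {
    if (!s.static) break; // first dynamic section ends the cacheable prefix
    total += s.tokens;
  }
  return total;
}
```

Comparing this value for the baseline and patched prompts gives baselineCacheableTokens vs patchedCacheableTokens directly.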
Crash Recovery
SIGKILL prevents cleanup handlers. Worktrees accumulate. Fix: process.on("exit") + SIGTERM handlers + subq eval clean scans for stale worktrees.
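The stale-worktree scan behind subq eval clean can be sketched as a liveness check on an owner PID embedded in the directory name; the naming scheme and helper are assumptions:

```typescript
// Sketch: each eval worktree dir is assumed to embed the PID of the run
// that created it; a dir whose owner PID is dead is stale and removable.
function staleWorktrees(
  dirs: string[],
  isPidAlive: (pid: number) => boolean,
): string[] {
  return dirs.filter((dir) => {
    const match = /eval-(\d+)-/.exec(dir);
    if (!match) return false; // not one of ours: leave it alone
    return !isPidAlive(Number(match[1]));
  });
}
```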
Dependencies
Runtime
Required
js-yaml (v4, safe by default). Claude Code binary. SubQ Code (in-repo). Bun runtime.
Phase 7
Deferred
Codex CLI (v0.124.0). wterm (observation mode). SWE-bench task source.
Reference Only
Not Runtime
EvoSkill (Apache 2.0)—pattern reimplemented in TypeScript. Hermes Self-Evolution (MIT)—tiered optimization patterns.