
The Full Plan

The complete specification behind the visualization suite. Seven phases, three agents, one self-evolution loop—from foundation types to autonomous prompt optimization.

Context

SubQ Code is an AI coding agent orchestration and measurement platform that already parses sessions from 12 different agents and computes leverage metrics. However, there is no way to run agents head-to-head on the same task and deeply compare their full transcripts—system prompts, tool usage, token efficiency, timing, and failure patterns.

The goal: build a system that orchestrates headless sessions of SubQ Code, Claude Code, and Codex CLI on identical tasks in isolated git worktrees, captures enriched JSONL transcripts, and generates side-by-side analysis reports. The ultimate purpose—identify what it would take to make SubQ Code a more competent coding agent, primarily by analyzing where its behavior diverges from stronger agents and modifying its system prompts at various pipeline stages.

Design principle: LangChain’s harness engineering work proved that system prompt + tools + middleware optimization alone moved an agent 13.7 points on Terminal Bench 2.0 (52.8% → 66.5%) without changing the model. This “three-knob model” is the core design axis—each knob tunable and measurable independently.

Architecture

Six-stage data pipeline from task definition to comparison report.

subq eval run <task> --agents pi-agent,claude-code --runs 3
1. orchestrator: Parse task YAML, allocate worktrees sequentially, launch agents in parallel
2. adapters: One per agent. Manage process lifecycle. Return RawAgentOutput (stdout + exit code + timestamps)
3. enrichers: Parse RawAgentOutput into EnrichedEvent[] via Bun.JSONL.parseChunk(). Reuse parsers/ internals
4. normalizer: EnrichedEvent[] → EnrichedSession with canonical tool names and detected milestones
5. comparison: Align sessions by milestone. Compute efficiency metrics. Diff system prompts. Run LLM judge
6. reporter: Render TUI dashboard, JSON (--robot), or HTML report
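As a rough sketch of how stages 2–4 hand data off, the shapes might look like this (type and function names are illustrative, not the actual apps/cli/src/eval API):

```typescript
// Hypothetical stage shapes for illustration only.
interface RawAgentOutput { stdout: string; exitCode: number; startedAt: number; endedAt: number }
interface EnrichedEvent { kind: string; ts: number }
interface EnrichedSession { events: EnrichedEvent[]; wallClockMs: number; exitCode: number }

// Stage 3 sketch: one JSON object per stdout line (JSONL); skip unparseable lines.
function enrich(raw: RawAgentOutput): EnrichedEvent[] {
  return raw.stdout
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .flatMap((line) => {
      try { return [JSON.parse(line) as EnrichedEvent]; } catch { return []; }
    });
}

// Stage 4 sketch: fold raw capture plus enriched events into a session.
function normalize(raw: RawAgentOutput, events: EnrichedEvent[]): EnrichedSession {
  return { events, wallClockMs: raw.endedAt - raw.startedAt, exitCode: raw.exitCode };
}
```

The point of the split is that adapters stay parsing-free: everything after stage 2 operates on captured bytes, so enrichment is replayable offline.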

File Structure

14 files under apps/cli/src/eval/, reduced from 27 in the original plan.

apps/cli/src/eval/
core types.ts — all types + TypeBox schemas
core task.ts — YAML task loading (custom only for MVP)
core worktree.ts — git worktree allocation/cleanup
core orchestrator.ts — parallel agent execution with timeout
-----
adapters types.ts — AgentAdapter, AgentRunHandle, ProcessSpawner
adapters index.ts — ALL_ADAPTERS barrel + registry
adapters pi-agent.ts — SubQ: spawn --json, pipe stdout
adapters claude-code.ts — Claude: spawn -p stream-json, pipe stdout
-----
pipeline enricher.ts — unified enrichment (reuses parsers/ internals)
pipeline normalizer.ts — events → EnrichedSession + canonical tool names
pipeline milestones.ts — milestone detection (tool-name-agnostic)
analysis comparison.ts — alignment + scoring + prompt diff
output report.ts — EvalReport + formatRobot()
output cost.ts — cost normalization across providers

Core Types

15+ interfaces form the type lattice. Every type uses epoch-ms timestamps for safe JSON round-tripping. TypeBox schemas validate all YAML-loaded types at parse boundaries.

EvalAgentId

Identity

Extract<AgentType, "claude-code" | "codex" | "pi-agent">

derived type · 5 reviewers flagged
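The derivation, plus a runtime guard that stays in sync with it. The extra AgentType members below are placeholders standing in for the real 12-agent union:

```typescript
// Illustrative parent union — the real AgentType in SubQ Code covers 12 agents.
type AgentType = "claude-code" | "codex" | "pi-agent" | "aider" | "cursor";

// Deriving via Extract means a renamed member in AgentType becomes a compile
// error here, instead of silently drifting as a parallel string union would.
type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

const EVAL_AGENT_IDS: readonly EvalAgentId[] = ["claude-code", "codex", "pi-agent"];

// Runtime narrowing for CLI input like --agents pi-agent,claude-code.
function isEvalAgentId(id: string): id is EvalAgentId {
  return (EVAL_AGENT_IDS as readonly string[]).includes(id);
}
```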

RawAgentOutput

Capture

stdout bytes, stderr bytes, exit code, timestamps, optional disk JSONL path

adapter output · no parsing

EnrichedEvent

Canonical

Agent-agnostic event: message, tool_call, tool_result, thinking, system_prompt, error, cost, milestone

8 kinds · agent-agnostic

EnrichedSession

Complete

Messages, milestones, system prompt injections, token usage, wall clock time, exit code, task resolved

extends Session full picture

Milestone

Progress

9 kinds: first_file_read, first_file_edit, first_test_run, first_bash_command, first_search, first_error_recovery, task_completion, verification_pass, verification_fail

epoch-ms · 9 kinds

QualityRubric

Scoring

5 dimensions: correctness, completeness, code quality, minimal diff, verification. Each 0.0–1.0

multi-dimensional · heuristic + LLM

TokenUsage

Input, output, cache read/write tokens, total, cost USD. Per-message and per-session aggregation.

cost attribution · cache-aware
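A minimal sketch of cache-aware cost attribution. The Pricing shape mirrors the TokenUsage fields; actual rates would be loaded per model by cost.ts, so none are baked in here:

```typescript
interface TokenUsage { input: number; output: number; cacheRead: number; cacheWrite: number }

// Per-million-token rates, supplied by the caller — cost.ts would own these.
interface Pricing { input: number; output: number; cacheRead: number; cacheWrite: number }

function costUsd(u: TokenUsage, p: Pricing): number {
  return (
    (u.input * p.input +
      u.output * p.output +
      u.cacheRead * p.cacheRead +     // cache reads typically billed at a discount
      u.cacheWrite * p.cacheWrite) / 1_000_000
  );
}

// Session total is the field-wise sum of per-message usage.
function aggregate(usages: TokenUsage[]): TokenUsage {
  return usages.reduce(
    (a, u) => ({
      input: a.input + u.input,
      output: a.output + u.output,
      cacheRead: a.cacheRead + u.cacheRead,
      cacheWrite: a.cacheWrite + u.cacheWrite,
    }),
    { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  );
}
```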

SystemPromptInjection

5 sources: initial, reminder, compaction, tool_result, other. Captures content, label, turn index.

prompt archaeology · per-turn

EvalTask (TypeBox)

YAML task definition: id, name, repoPath, baseCommit, prompt, verifyCommand, timeout, complexity, tags.

validated at parse · js-yaml v4
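A hypothetical task file matching the fields above — every value here is illustrative:

```yaml
# Hypothetical EvalTask YAML; field names follow the schema above, values are made up.
id: fix-flaky-test
name: Fix the flaky date-parsing test
repoPath: ./fixtures/sample-repo
baseCommit: abc1234
prompt: |
  The test in src/date.test.ts fails intermittently. Find the root cause
  and fix it with a minimal diff.
verifyCommand: bun test src/date.test.ts
timeout: 600
complexity: medium
tags: [bugfix, flaky]
```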

KnobConfig (TypeBox)

Per-agent knob overrides: system prompt file/append, tool allow/deny lists, budget cap, model override, custom env.

three-knob model · per-agent

Key Design Decisions

Decision
Rationale
Source
Bun.spawn() over Node execFile
Need streaming stdout; existing spawn.ts buffers everything
Framework docs
EvalAgentId = Extract<AgentType, ...>
Prevents type drift from parallel string unions
5 reviewers
Git worktrees over Docker
Consensus primitive; faster, lighter, native macOS
Research
Serialized worktree creation
.git/config.lock race condition (Claude Code #47266)
Best practices
Milestone alignment (not message index)
Agents take different paths; milestones capture functional progress
Research
--runs 3 default
Non-determinism is the defining property of agent evaluation
Eval methodology
Env allowlist for child processes
Prevents API key leakage to spawned agents
Security
Epoch-ms timestamps
Date objects break on JSON round-trip; number is serialization-safe
Code architect
Frontier-based selection (top-N)
Maintains diversity, prevents premature convergence
EvoSkill
Human-gated deployment
Evolution discovers candidates; human decides what ships
Hermes guardrails

Implementation Phases

Phase 1: Foundation

Types + Tasks + Worktrees

All enriched type definitions + TypeBox schemas. YAML task loading with js-yaml v4. Worktree allocation/cleanup with serialized creation and crash handlers. Cost normalization.

types.ts task.ts worktree.ts cost.ts

Phase 2: Adapters

SubQ + Claude Code

AgentAdapter and ProcessSpawner interfaces. SubQ adapter with --json stdout pipe. Claude adapter with --bare --stream-json --bypassPermissions. Env allowlist. ANTHROPIC_API_KEY stripped.

pi-agent.ts claude-code.ts ProcessSpawner
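A sketch of the allowlist approach: only named variables cross into the child process, so ANTHROPIC_API_KEY never leaks unless an adapter lists it deliberately:

```typescript
// Base allowlist per the security model; agent-specific keys are opted in.
const BASE_ALLOWLIST = ["PATH", "HOME", "TERM"];

function childEnv(
  parent: Record<string, string | undefined>,
  agentKeys: string[] = [],
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of [...BASE_ALLOWLIST, ...agentKeys]) {
    const value = parent[key];
    if (value !== undefined) out[key] = value; // everything else is dropped
  }
  return out;
}
```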

Phase 3: Enricher + Normalizer + Milestones

Stream Processing

Unified enrichment via Bun.JSONL.parseChunk(). Canonical tool name map (42 aliases → 8 canonical names). Milestone detection from enriched event stream with fixed TEST_COMMAND_PATTERNS.

enricher.ts normalizer.ts milestones.ts
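The canonical tool name map might look like this — a handful of the 42 aliases, chosen for illustration:

```typescript
// Illustrative subset of the alias → canonical map (real map: 42 aliases, 8 names).
const CANONICAL_TOOL_NAMES: Record<string, string> = {
  Read: "file_read", read_file: "file_read", cat: "file_read",
  Edit: "file_edit", str_replace_editor: "file_edit", apply_patch: "file_edit",
  Bash: "shell", run_terminal_cmd: "shell",
  Grep: "search", Glob: "search",
};

function canonicalToolName(raw: string): string {
  return CANONICAL_TOOL_NAMES[raw] ?? raw; // unknown tools pass through unchanged
}
```

Passing unknown names through (rather than throwing) keeps the normalizer forward-compatible with new agent tool sets.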

Phase 4: Orchestrator + CLI

Command Surface

Parallel Bun.spawn() execution with AbortSignal timeout. Ops layer following ops/leverage.ts pattern. Commander.js commands: run, compare, list, prompt-diff, tasks, clean. --robot on all subcommands.

orchestrator.ts ops/eval.ts commands/eval.ts

Phase 5: Comparison + Prompt Analysis + LLM Judge

Analysis Engine

Milestone alignment on elapsedMs. 5-dimension quality rubric. System prompt segmentation into 10 semantic categories. Three diff formats: matrix, unified, imperative extraction. Dual LLM judge with position-swap protocol (Claude + Kimi K2.5).

comparison.ts · PoLL pattern · anti-gaming
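Milestone alignment on elapsedMs can be sketched as a join on milestone kind — a simplified two-session version with hypothetical shapes:

```typescript
interface Milestone { kind: string; elapsedMs: number }

// Align by milestone kind, not message index: agents take different paths,
// but the same functional point (e.g. first_test_run) is comparable.
function alignMilestones(
  a: Milestone[],
  b: Milestone[],
): { kind: string; aMs?: number; bMs?: number; deltaMs?: number }[] {
  const byKind = (ms: Milestone[]) => new Map(ms.map((m) => [m.kind, m.elapsedMs]));
  const mapA = byKind(a);
  const mapB = byKind(b);
  const kinds = [...new Set([...mapA.keys(), ...mapB.keys()])];
  return kinds.map((kind) => {
    const aMs = mapA.get(kind);
    const bMs = mapB.get(kind);
    return {
      kind, aMs, bMs,
      // Negative delta: agent B reached the milestone earlier than agent A.
      deltaMs: aMs !== undefined && bMs !== undefined ? bMs - aMs : undefined,
    };
  });
}
```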

Phase 6: Reporting

JSON --robot First

EvalReport with schemaVersion. formatRobot() as JSON.stringify(report, null, "\t"). Streaming progress events for long eval runs.

report.ts --robot schemaVersion

Phase 7a: Self-Evolution Loop

EvoSkill Integration

The capstone: transform subq eval from measurement tool to evolution engine. Proposer analyzes failure traces, Skill Builder implements proposals, Frontier maintains top-N configurations as git branches. Feedback memory prevents circular proposals.

proposer.ts skill-builder.ts frontier.ts guardrails.ts loop.ts

Phase 7b: GEPA-Style Prompt Evolution

Reflective Analysis

Imperative extraction + mutation of individual prompt rules. Section reordering for cache efficiency. Prompt A/B testing with statistical comparison across tasks.

imperative extraction · cache impact · prompt-ab

Phase 7c: Extensions

Codex + TUI + SWE-bench

Codex CLI adapter via proc.exited. Ink TUI dashboard. HTML report template. SWE-bench task source. wterm mode for live browser-based observation.

codex.ts TUI SWE-bench wterm

CLI Commands

subq eval — all commands support --robot
run subq eval run <task> [--agents pi-agent,claude-code] [--runs 3] [--knobs-file <path>] [--timeout <s>]
compare subq eval compare <run-id> [--robot] [--html <path>]
list subq eval list [--days <n>] [--robot]
prompt-diff subq eval prompt-diff <run-id> [--robot]
tasks subq eval tasks [--source custom] [--robot]
clean subq eval clean [--dry-run]
-----
evolve subq eval evolve <task-dir> [--iterations 10] [--frontier-size 3] [--knob-types prompt,skill,tool]
frontier subq eval frontier [--robot] | frontier deploy <id> | frontier diff <a> <b>
prompt-ab subq eval prompt-ab <task> --variant-a baseline.txt --variant-b evolved.txt --runs 5

Security Model

Risk
Mitigation
Phase
API key leakage
Env allowlist—only PATH, HOME, TERM + agent-specific keys
Phase 2
YAML deserialization
js-yaml v4 load() safe by default + TypeBox validation
Phase 1
Agent escapes worktree
--bare on Claude Code skips hooks/CLAUDE.md; HOME set to temp dir
Phase 2
Benchmark gaming
Anti-gaming trace audit heuristics + private holdout tasks (.gitignored)
Phase 5
Session data exposure
Eval run dirs created with 0o700; --redact flag in Phase 7
Phase 6
Budget overrun
--max-budget-usd always passed to Claude adapter
Phase 2

System Prompt Analysis

The key deliverable for the stated goal—“make SubQ Code more competent.” Four operations on system prompts across agents.

1. Extract

Capture system prompts at every injection point for each agent: initial, reminder, compaction, tool_result, other.

per-turn capture · 5 sources

2. Segment

10 semantic categories: identity, environment, workflow, tool_instructions, behavioral_rules, safety, error_recovery, output_format, injected_context, meta.

semantic sections · 10 categories

3. Diff

Three formats: semantic section matrix (presence/absence/token count), unified diff (line-level), imperative extraction (all “Do X” / “Never Y” rules compared).

matrix · unified · imperatives

4. Correlate

Map prompt sections to success/failure patterns via LLM trace attribution. Produces actionable prompt patches.

trace attribution · prompt patches

Imperative extraction algorithm: For each sentence, detect imperative mood (starts with a verb or contains must/should/never/always). Classify as REQUIRE | PROHIBIT | PREFER | RECOVER. Compare across agents: rules only in Claude Code, only in SubQ, or in both with different phrasing. This directly produces prompt patches.
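A hedged sketch of that classifier — the keyword lists below are illustrative, not the full rule set:

```typescript
type RuleClass = "REQUIRE" | "PROHIBIT" | "PREFER" | "RECOVER";

// Heuristic imperative detection + classification, per the algorithm above.
// Keyword lists are a small illustrative subset.
function classifyImperative(sentence: string): RuleClass | null {
  const s = sentence.trim().toLowerCase();
  const imperative =
    /^(do|use|run|always|never|must|should|avoid|prefer|if|when|retry)\b/.test(s) ||
    /\b(must|should|never|always)\b/.test(s);
  if (!imperative) return null;
  if (/\b(never|do not|don't|avoid)\b/.test(s)) return "PROHIBIT";
  if (/\b(if .*fail|on error|retry|recover)\b/.test(s)) return "RECOVER";
  if (/\b(prefer|should|when possible)\b/.test(s)) return "PREFER";
  return "REQUIRE";
}
```

Cross-agent comparison then reduces to diffing the classified rule sets: rules present only in one agent's prompt become candidate patches.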

Research Findings

12 review agents and 3 external research integrations produced these critical corrections and insights.

Critical Corrections

Blocking Issues (3)

EvalAgentId must use Extract<AgentType, ...>—5 reviewers flagged. "eval" must be added to TOP_LEVEL_COMMANDS—3 flagged. Child process env must use an allowlist, not process.env inheritance.

BLOCKING · 5+3+3 reviewers

Evaluation Methodology

Statistical Rigor

Non-determinism is the defining property. Minimum 3 runs per comparison. Multi-dimensional rubrics, not single metrics. Position-swap protocol for pairwise judges. Self-enhancement bias: use PoLL pattern (Claude + non-Claude judge).

3+ runs PoLL panel position swap

ProcessSpawner Injection

Bun.spawn() is not available in Vitest. Injectable ProcessSpawner interface enables testing with synthetic JSONL fixtures.

testability · HIGH priority

TEST_PATTERNS Fix

Original regex /test|pytest/ matches cat test.txt. Fixed: match against command field, require test as program name or subcommand.

false positive elimination
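A sketch of the corrected patterns, anchored to the program name rather than matched anywhere in the string. The command list is illustrative, not the shipped TEST_COMMAND_PATTERNS:

```typescript
// Before: /test|pytest/ over the whole string — matches `cat test.txt`.
// After: the test runner must be the program name or a subcommand.
const TEST_COMMAND_PATTERNS = [
  /^(npx\s+)?(vitest|jest|pytest|bun\s+test|go\s+test|cargo\s+test)\b/,
  /^npm\s+(run\s+)?test\b/,
];

function isTestCommand(command: string): boolean {
  const trimmed = command.trim();
  return TEST_COMMAND_PATTERNS.some((p) => p.test(trimmed));
}
```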

Cache Impact Metrics

Prompt modifications invalidate Anthropic’s prefix cache. Static sections must precede dynamic. Track baselineCacheableTokens vs patchedCacheableTokens.

cost optimization · section ordering
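One way to approximate the metric: measure the byte-identical prefix shared by the baseline and patched prompts, since only the unchanged prefix remains cacheable. The chars/4 token estimate is a crude assumption for illustration, not a real tokenizer:

```typescript
// Estimate cacheable tokens as the length of the common prompt prefix.
// Static sections placed before dynamic ones maximize this value.
function cacheablePrefixTokens(baseline: string, patched: string): number {
  let i = 0;
  while (i < baseline.length && i < patched.length && baseline[i] === patched[i]) i++;
  return Math.floor(i / 4); // rough chars-per-token heuristic (assumption)
}
```

Comparing baselineCacheableTokens against patchedCacheableTokens then shows how much cache a prompt patch destroys.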

Crash Recovery

SIGKILL prevents cleanup handlers. Worktrees accumulate. Fix: process.on("exit") + SIGTERM handlers + subq eval clean scans for stale worktrees.

orphan detection · stale cleanup

Dependencies

Runtime

Required

js-yaml (v4, safe by default). Claude Code binary. SubQ Code (in-repo). Bun runtime.

js-yaml claude subq

Phase 7

Deferred

Codex CLI (v0.124.0). wterm (observation mode). SWE-bench task source.

codex wterm SWE-bench

Reference Only

Not Runtime

EvoSkill (Apache 2.0)—pattern reimplemented in TypeScript. Hermes Self-Evolution (MIT)—tiered optimization patterns.

EvoSkill Hermes GEPA