Context
SubQ Code is an AI coding agent orchestration and measurement platform that already parses sessions from 12 different agents and computes leverage metrics. However, there is no way to run agents head-to-head on the same task and deeply compare their full transcripts—system prompts, tool usage, token efficiency, timing, and failure patterns.
The goal: build a system that orchestrates headless sessions of SubQ Code, Claude Code, and Codex CLI on identical tasks in isolated git worktrees, captures enriched JSONL transcripts, and generates side-by-side analysis reports. The ultimate purpose—identify what it would take to make SubQ Code a more competent coding agent, primarily by analyzing where its behavior diverges from stronger agents and modifying its system prompts at various pipeline stages.
Architecture
Six-stage data pipeline from task definition to comparison report.
File Structure
14 files under apps/cli/src/eval/, reduced from 27 in the original plan.
Core Types
15+ interfaces form the type lattice. Every type uses epoch-ms timestamps for safe JSON round-tripping. TypeBox schemas validate all YAML-loaded types at parse boundaries.
EvalAgentId
Identity
Extract<AgentType, "claude-code" | "codex" | "pi-agent">
RawAgentOutput
Capture
stdout bytes, stderr bytes, exit code, timestamps, optional disk JSONL path
EnrichedEvent
Canonical
Agent-agnostic event: message, tool_call, tool_result, thinking, system_prompt, error, cost, milestone
EnrichedSession
Complete
Messages, milestones, system prompt injections, token usage, wall clock time, exit code, task resolved
Milestone
Progress
9 kinds: first_file_read, first_file_edit, first_test_run, first_bash_command, first_search, first_error_recovery, task_completion, verification_pass, verification_fail
QualityRubric
Scoring
5 dimensions: correctness, completeness, code quality, minimal diff, verification. Each 0.0–1.0
TokenUsage
Input, output, cache read/write tokens, total, cost USD. Per-message and per-session aggregation.
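The per-message to per-session aggregation can be sketched as a plain reduction; the field names below are assumptions, not the project's actual interface:

```typescript
// Sketch of per-session aggregation over per-message TokenUsage records.
// Field names (inputTokens, cacheReadTokens, ...) are illustrative.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
  costUsd: number;
}

function aggregateTokenUsage(perMessage: TokenUsage[]): TokenUsage {
  return perMessage.reduce(
    (acc, u) => ({
      inputTokens: acc.inputTokens + u.inputTokens,
      outputTokens: acc.outputTokens + u.outputTokens,
      cacheReadTokens: acc.cacheReadTokens + u.cacheReadTokens,
      cacheWriteTokens: acc.cacheWriteTokens + u.cacheWriteTokens,
      totalTokens: acc.totalTokens + u.totalTokens,
      costUsd: acc.costUsd + u.costUsd,
    }),
    { inputTokens: 0, outputTokens: 0, cacheReadTokens: 0, cacheWriteTokens: 0, totalTokens: 0, costUsd: 0 },
  );
}
```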
SystemPromptInjection
5 sources: initial, reminder, compaction, tool_result, other. Captures content, label, turn index.
EvalTask (TypeBox)
YAML task definition: id, name, repoPath, baseCommit, prompt, verifyCommand, timeout, complexity, tags.
KnobConfig (TypeBox)
Per-agent knob overrides: system prompt file/append, tool allow/deny lists, budget cap, model override, custom env.
Key Design Decisions
Implementation Phases
Phase 1: Foundation
Types + Tasks + Worktrees
All enriched type definitions + TypeBox schemas. YAML task loading with js-yaml v4. Worktree allocation/cleanup with serialized creation and crash handlers. Cost normalization.
Phase 2: Adapters
SubQ + Claude Code
AgentAdapter and ProcessSpawner interfaces. SubQ adapter with --json stdout pipe. Claude adapter with --bare --stream-json --bypassPermissions. Env allowlist. ANTHROPIC_API_KEY stripped.
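The env allowlist can be sketched as a filter over the parent environment; the variable list is illustrative, only the ANTHROPIC_API_KEY stripping is from the plan:

```typescript
// Sketch: build a child env from an explicit allowlist instead of
// inheriting process.env wholesale. The allowlist entries are assumptions.
const ENV_ALLOWLIST = ["PATH", "HOME", "LANG", "TERM", "SHELL"];

function buildChildEnv(
  parentEnv: Record<string, string | undefined>,
  extra: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of ENV_ALLOWLIST) {
    const value = parentEnv[key];
    if (value !== undefined) env[key] = value;
  }
  // Secrets like ANTHROPIC_API_KEY are never copied: anything not
  // allowlisted simply does not reach the child process.
  return { ...env, ...extra };
}
```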
Phase 3: Enricher + Normalizer + Milestones
Stream Processing
Unified enrichment via Bun.JSONL.parseChunk(). Canonical tool name map (42 aliases → 8 canonical names). Milestone detection from enriched event stream with fixed TEST_COMMAND_PATTERNS.
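The alias map can be sketched as a lookup table with a lowercase fallback; the entries below are illustrative, not the real 42-alias table:

```typescript
// Sketch of alias -> canonical tool-name normalization. The real table maps
// 42 aliases onto 8 canonical names; these entries are assumptions.
const CANONICAL_TOOL_NAMES: Record<string, string> = {
  Read: "file_read",
  ReadFile: "file_read",
  Edit: "file_edit",
  str_replace_editor: "file_edit",
  Bash: "shell",
  Grep: "search",
  Glob: "search",
};

function canonicalToolName(raw: string): string {
  // Unknown tools pass through lowercased so they still group consistently.
  return CANONICAL_TOOL_NAMES[raw] ?? raw.toLowerCase();
}
```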
Phase 4: Orchestrator + CLI
Command Surface
Parallel Bun.spawn() execution with AbortSignal timeout. Ops layer following ops/leverage.ts pattern. Commander.js commands: run, compare, list, prompt-diff, tasks, clean. --robot on all subcommands.
Phase 5: Comparison + Prompt Analysis + LLM Judge
Analysis Engine
Milestone alignment on elapsedMs. 5-dimension quality rubric. System prompt segmentation into 10 semantic categories. Three diff formats: matrix, unified, imperative extraction. Dual LLM judge with position-swap protocol (Claude + Kimi K2.5).
Phase 6: Reporting
JSON --robot First
EvalReport with schemaVersion. formatRobot() as JSON.stringify(report, null, "\t"). Streaming progress events for long eval runs.
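The robot formatter is small enough to sketch directly; the EvalReport fields beyond schemaVersion are assumptions:

```typescript
// Sketch of the --robot JSON formatter: schemaVersion first, tab-indented
// output per the plan. Fields other than schemaVersion are illustrative.
interface EvalReport {
  schemaVersion: number;
  task: string;
  agents: string[];
  [key: string]: unknown;
}

function formatRobot(report: EvalReport): string {
  return JSON.stringify(report, null, "\t");
}
```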
Phase 7a: Self-Evolution Loop
EvoSkill Integration
The capstone: transform subq eval from measurement tool to evolution engine. Proposer analyzes failure traces, Skill Builder implements proposals, Frontier maintains top-N configurations as git branches. Feedback memory prevents circular proposals.
Phase 7b: GEPA-Style Prompt Evolution
Reflective Analysis
Imperative extraction + mutation of individual prompt rules. Section reordering for cache efficiency. Prompt A/B testing with statistical comparison across tasks.
Phase 7c: Extensions
Codex + TUI + SWE-bench
Codex CLI adapter via proc.exited. Ink TUI dashboard. HTML report template. SWE-bench task source. wterm observation mode for live browser-based session viewing.
CLI Commands
Security Model
System Prompt Analysis
The key deliverable for the stated goal—“make SubQ Code more competent.” Four operations on system prompts across agents.
1. Extract
Capture system prompts at every injection point for each agent: initial, reminder, compaction, tool_result, other.
2. Segment
10 semantic categories: identity, environment, workflow, tool_instructions, behavioral_rules, safety, error_recovery, output_format, injected_context, meta.
3. Diff
Three formats: semantic section matrix (presence/absence/token count), unified diff (line-level), imperative extraction (all “Do X” / “Never Y” rules compared).
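Imperative extraction can be sketched as a line-level regex pass; the trigger-word list is an assumption, not the project's actual pattern set:

```typescript
// Sketch of imperative-rule extraction: pull "Do X" / "Never Y" style lines
// out of a system prompt for cross-agent comparison.
const IMPERATIVE_RE = /^\s*[-*]?\s*(Always|Never|Do not|Don't|You must|Do)\b.*$/gim;

function extractImperatives(prompt: string): string[] {
  return (prompt.match(IMPERATIVE_RE) ?? []).map((line) =>
    line.replace(/^\s*[-*]\s*/, "").trim(),
  );
}
```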
4. Correlate
Map prompt sections to success/failure patterns via LLM trace attribution. Produces actionable prompt patches.
Research Findings
12 review agents and 3 external research integrations produced these critical corrections and insights.
Critical Corrections
Blocking Issues (3)
EvalAgentId must be derived via Extract<AgentType, …> rather than declared as a parallel union (flagged by 5 reviewers). "eval" must be added to TOP_LEVEL_COMMANDS (flagged by 3). Child-process env must be built from an allowlist, not inherited from process.env.
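The Extract fix can be sketched as follows; the AgentType union here is a stand-in for the project's real one, and the runtime guard is an illustrative addition:

```typescript
// Deriving the eval agent id from the existing AgentType union via Extract,
// so the two types cannot drift apart. AgentType below is a stand-in.
type AgentType =
  | "claude-code" | "codex" | "pi-agent"
  | "aider" | "goose"; // ...plus the other parsed agents

type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

const EVAL_AGENT_IDS: readonly EvalAgentId[] = ["claude-code", "codex", "pi-agent"];

// Runtime narrowing for values arriving from CLI args or JSONL.
function isEvalAgentId(id: string): id is EvalAgentId {
  return (EVAL_AGENT_IDS as readonly string[]).includes(id);
}
```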
Evaluation Methodology
Statistical Rigor
Non-determinism is the defining property. Minimum 3 runs per comparison. Multi-dimensional rubrics, not single metrics. Position-swap protocol for pairwise judges. Self-enhancement bias: use PoLL pattern (Claude + non-Claude judge).
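The position-swap protocol can be sketched as two judge passes with the candidates reordered; a verdict is only trusted when both orders agree. In the real pipeline the judge is an async LLM call; a sync signature keeps the sketch self-contained:

```typescript
type Verdict = "A" | "B" | "tie";

// Run the pairwise judge twice with positions swapped; disagreement
// between the two passes indicates position bias and is scored as a tie.
function judgeWithSwap(
  judge: (first: string, second: string) => Verdict,
  a: string,
  b: string,
): Verdict {
  const pass1 = judge(a, b);                 // a in first position
  const pass2 = judge(b, a);                 // swapped order
  const pass2Mapped: Verdict = pass2 === "A" ? "B" : pass2 === "B" ? "A" : "tie";
  return pass1 === pass2Mapped ? pass1 : "tie";
}
```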
ProcessSpawner Injection
Bun.spawn() is not available in Vitest. Injectable ProcessSpawner interface enables testing with synthetic JSONL fixtures.
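The seam can be sketched as a small interface plus a fixture-backed test double; the names and signature are assumptions:

```typescript
// Sketch of the injectable spawner seam: production code wires Bun.spawn(),
// tests substitute a fake that replays JSONL fixtures.
interface SpawnResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

interface ProcessSpawner {
  spawn(cmd: string[], env: Record<string, string>): Promise<SpawnResult>;
}

// Test double: returns a canned transcript instead of launching a process.
function fixtureSpawner(fixtureJsonl: string): ProcessSpawner {
  return {
    spawn: async () => ({ stdout: fixtureJsonl, stderr: "", exitCode: 0 }),
  };
}
```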
TEST_PATTERNS Fix
The original regex /test|pytest/ matches any command containing the substring, e.g. cat test.txt. Fixed: match against the command field and require test as the program name or a subcommand token.
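A corrected detector might look like the following; the runner list and patterns are assumptions, not the project's actual TEST_COMMAND_PATTERNS:

```typescript
// Sketch of the corrected test-command detector: "test" must appear as the
// program or a whole subcommand token, so `cat test.txt` no longer matches.
const TEST_COMMAND_PATTERNS: RegExp[] = [
  /^(?:\S*\/)?(?:pytest|jest|vitest)(?=\s|$)/, // runner as the program itself
  /^\S+\s+(?:run\s+)?test(?=\s|$)/,            // `bun test`, `go test`, `npm run test`
];

function isTestCommand(command: string): boolean {
  const trimmed = command.trim();
  return TEST_COMMAND_PATTERNS.some((re) => re.test(trimmed));
}
```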
Cache Impact Metrics
Prompt modifications invalidate Anthropic’s prefix cache. Static sections must precede dynamic. Track baselineCacheableTokens vs patchedCacheableTokens.
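The metric can be sketched as a prefix sum that stops at the first dynamic section, since the first non-static section breaks the prefix cache for everything after it; the section shape is an assumption:

```typescript
// Sketch of the cache-impact metric: count how many leading prompt tokens
// remain cacheable. Field names are illustrative.
interface PromptSection {
  tokens: number;
  static: boolean; // unchanged across sessions => cacheable while the prefix holds
}

function cacheableTokens(sections: PromptSection[]): number {
  let total = 0;
  for (const s of sections) {
    if (!s.static) break; // first dynamic section ends the cacheable prefix
    total += s.tokens;
  }
  return total;
}
```

Comparing this value for the baseline and patched prompts gives baselineCacheableTokens vs patchedCacheableTokens directly.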
Crash Recovery
SIGKILL prevents cleanup handlers. Worktrees accumulate. Fix: process.on("exit") + SIGTERM handlers + subq eval clean scans for stale worktrees.
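The stale-worktree scan behind subq eval clean can be sketched as a liveness check on an owner PID embedded in the directory name; the naming scheme and helper are assumptions:

```typescript
// Sketch: each eval worktree dir is assumed to embed the PID of the run
// that created it; a dir whose owner PID is dead is stale and removable.
function staleWorktrees(
  dirs: string[],
  isPidAlive: (pid: number) => boolean,
): string[] {
  return dirs.filter((dir) => {
    const match = /eval-(\d+)-/.exec(dir);
    if (!match) return false; // not one of ours: leave it alone
    return !isPidAlive(Number(match[1]));
  });
}
```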
Dependencies
Runtime
Required
js-yaml (v4, safe by default). Claude Code binary. SubQ Code (in-repo). Bun runtime.
Phase 7
Deferred
Codex CLI (v0.124.0). wterm (observation mode). SWE-bench task source.
Reference Only
Not Runtime
EvoSkill (Apache 2.0)—pattern reimplemented in TypeScript. Hermes Self-Evolution (MIT)—tiered optimization patterns.