The Pipeline
Data flows inward through four concentric layers. The orchestrator parses task definitions and allocates worktrees. Adapters spawn agent processes and capture raw output. Enrichers parse JSONL streams into canonical events. The core normalizes, compares, and reports.
Orchestrator
Parses task YAML, allocates git worktrees sequentially (avoids .git/config.lock race), launches adapters in parallel. Manages run lifecycle: start, timeout, cleanup. Entry point: subq eval run
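The allocate-then-launch ordering can be sketched as below. This is a minimal illustration, not the framework's code: allocateWorktree and runAdapter are hypothetical stand-ins for the real git worktree add call and adapter launch, and the trace array exists only to make the ordering visible.

```typescript
type AgentId = "claude-code" | "codex" | "pi-agent";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for running `git worktree add`; records start/end
// so the ordering is observable.
async function allocateWorktree(agent: AgentId, trace: string[]): Promise<string> {
  trace.push(`alloc-start:${agent}`);
  await sleep(5); // simulated git I/O
  trace.push(`alloc-end:${agent}`);
  return `worktrees/${agent}`;
}

// Hypothetical stand-in for launching one adapter inside its worktree.
async function runAdapter(agent: AgentId, worktree: string, trace: string[]): Promise<string> {
  trace.push(`run-start:${agent}`);
  await sleep(5); // simulated agent run
  return `${agent} ran in ${worktree}`;
}

async function runAll(agents: AgentId[]): Promise<{ results: string[]; trace: string[] }> {
  const trace: string[] = [];
  const allocated: { agent: AgentId; worktree: string }[] = [];

  // Phase 1: one allocation at a time, so no two git commands contend for the lock.
  for (const agent of agents) {
    allocated.push({ agent, worktree: await allocateWorktree(agent, trace) });
  }

  // Phase 2: every adapter starts before any of them is awaited.
  const results = await Promise.all(
    allocated.map(({ agent, worktree }) => runAdapter(agent, worktree, trace)),
  );
  return { results, trace };
}
```

Awaiting inside the for loop is what serializes allocation; Promise.all over the mapped calls is what parallelizes execution.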
Adapters
One per agent. Each adapter wraps Bun.spawn() with agent-specific flags. Returns RawAgentOutput: stdout bytes, exit code, timestamps. Injectable via ProcessSpawner interface for testing.
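A minimal sketch of the injection seam, assuming a ProcessSpawner shape that the source does not spell out. The spawn signature, the runAgent helper, and the injectable clock are all illustrative; a production spawner would wrap Bun.spawn(), while the fake below lets a test run without launching any process.

```typescript
interface RawAgentOutput {
  stdout: Uint8Array;
  exitCode: number;
  startMs: number;
  endMs: number;
}

// Assumed interface shape: the source names ProcessSpawner but not its members.
interface ProcessSpawner {
  spawn(cmd: string[]): Promise<{ stdout: Uint8Array; exitCode: number }>;
}

// Hypothetical adapter core: timestamps the run and returns the raw contract.
async function runAgent(
  spawner: ProcessSpawner,
  cmd: string[],
  now: () => number = Date.now,
): Promise<RawAgentOutput> {
  const startMs = now();
  const { stdout, exitCode } = await spawner.spawn(cmd);
  return { stdout, exitCode, startMs, endMs: now() };
}

// Test double: echoes the command instead of spawning anything.
const fakeSpawner: ProcessSpawner = {
  async spawn(cmd) {
    return { stdout: new TextEncoder().encode(`ran: ${cmd.join(" ")}\n`), exitCode: 0 };
  },
};
```

Because the adapter only sees the interface, the same runAgent path is exercised in tests and in real runs.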
Enrichers
Parse RawAgentOutput into EnrichedEvent[] using Bun.JSONL.parseChunk(). Extract milestones, system prompts, tool calls. Reuse existing parsers/ internals from SubQ Code.
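Chunked JSONL parsing has one subtlety worth showing: a chunk can end mid-line, so the unfinished tail must be carried into the next call. The sketch below is a plain-TypeScript stand-in written for illustration; it is not Bun's parser and makes no claim about that API's signature. The event fields in the usage are invented.

```typescript
// Illustrative stand-in for a chunked JSONL parser (not Bun's API).
// `carry` is the unfinished tail left over from the previous chunk.
function parseJsonlChunk(carry: string, chunk: string): { values: unknown[]; rest: string } {
  const lines = (carry + chunk).split("\n");
  // The last element is either an incomplete line or "" when the chunk ended on "\n".
  const rest = lines.pop() ?? "";
  const values: unknown[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines
    values.push(JSON.parse(trimmed));
  }
  return { values, rest };
}

// Usage: the second record is split across two chunks.
const firstPass = parseJsonlChunk("", '{"type":"tool_call","name":"read"}\n{"type":"mes');
const secondPass = parseJsonlChunk(firstPass.rest, 'sage"}\n');
```

After the first call one complete record is available and the partial line waits in rest; the second call completes it.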
Judge Layer
Error analysis pipeline classifies failures into 9 categories. Five binary LLM judges (one per dimension) with anti-hallucination grounding. PoLL multi-model panel with position-swap protocol. Blends with heuristic scores.
Normalizer → Comparison → Reporter
Normalizes EnrichedSessions. Aligns milestones. Applies quality rubric. Blends heuristic + LLM judge scores via per-dimension weights. Produces EvalReport with Rogan-Gladen corrected rates.
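One plausible reading of per-dimension blending is a convex combination of the two scores. The sketch assumes scores and weights in [0, 1] and a linear blend; the source does not state the actual formula, and the blend function name is illustrative.

```typescript
const DIMENSIONS = ["correctness", "completeness", "codeQuality", "minimalDiff", "verification"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type Scores = Record<Dimension, number>;

// Assumed formula: blended = (1 - w) * heuristic + w * judge, per dimension,
// where w is the weight given to the LLM judge for that dimension.
function blend(heuristic: Scores, judge: Scores, judgeWeight: Scores): Scores {
  const out = {} as Scores;
  for (const dim of DIMENSIONS) {
    const w = judgeWeight[dim];
    out[dim] = (1 - w) * heuristic[dim] + w * judge[dim];
  }
  return out;
}
```

A weight of 0 keeps the heuristic score for that dimension untouched, and a weight of 1 defers entirely to the judge.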
The Type Lattice
Fifteen interconnected interfaces form the framework's type system. TypeBox schemas validate all YAML-loaded data. Epoch-millisecond timestamps ensure serializable types across JSON boundaries.
Core Identity
EvalAgentId
Extract<AgentType, "claude-code" | "codex" | "pi-agent">
Derived from AgentType via Extract<>, not a parallel union. Single source of truth.
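A small sketch of the derivation. The fourth AgentType member is hypothetical, added only to show that Extract<> keeps the listed literals and nothing else; EVAL_AGENTS and isEvalAgent are illustrative helpers.

```typescript
// "some-other-agent" is a hypothetical member used for illustration.
type AgentType = "claude-code" | "codex" | "pi-agent" | "some-other-agent";

// Keeps only the members of AgentType that appear in the second union.
type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

// Runtime mirror of the type, so plain AgentType values can be narrowed.
const EVAL_AGENTS: readonly EvalAgentId[] = ["claude-code", "codex", "pi-agent"];

function isEvalAgent(agent: AgentType): agent is EvalAgentId {
  return (EVAL_AGENTS as readonly string[]).includes(agent);
}
```

If "codex" were renamed in AgentType, it would silently drop out of EvalAgentId and the EVAL_AGENTS array literal would stop compiling, which is where the drift surfaces.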
Raw Output
RawAgentOutput
stdout: Uint8Array, exitCode: number, startMs: number, endMs: number. The adapter's contract—enrichment happens externally.
Canonical Event
EnrichedEvent
Agent-agnostic event with canonical tool names, normalized timestamps. Basis for all cross-agent comparison.
Session Container
EnrichedSession
Messages[], milestones[], system prompts[], token usage. The complete enriched transcript of one agent's run.
Cost Tracking
TokenUsage
inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, costUsd. Per-agent, per-run cost attribution.
Temporal Markers
Milestone
kind: string, timestampMs: number, turnIndex: number, elapsedMs: number. first_file_read, first_test_run, first_file_edit, and more.
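A sketch of how "first_*" milestones could be derived from an event stream. The ToolEvent shape, the tool names in FIRST_KINDS, and the choice to measure elapsedMs from session start are all assumptions made for illustration.

```typescript
interface Milestone {
  kind: string;
  timestampMs: number;
  turnIndex: number;
  elapsedMs: number;
}

// Hypothetical minimal event shape; the real EnrichedEvent carries more.
interface ToolEvent {
  tool: string;
  timestampMs: number;
  turnIndex: number;
}

// Hypothetical mapping from canonical tool name to milestone kind.
const FIRST_KINDS: Record<string, string> = {
  read_file: "first_file_read",
  run_tests: "first_test_run",
  edit_file: "first_file_edit",
};

function firstMilestones(events: ToolEvent[], sessionStartMs: number): Milestone[] {
  const seen = new Set<string>();
  const out: Milestone[] = [];
  for (const e of events) {
    const kind = FIRST_KINDS[e.tool];
    if (kind === undefined || seen.has(kind)) continue; // only the first occurrence counts
    seen.add(kind);
    out.push({
      kind,
      timestampMs: e.timestampMs,
      turnIndex: e.turnIndex,
      elapsedMs: e.timestampMs - sessionStartMs, // assumed: elapsed since session start
    });
  }
  return out;
}
```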
Quality Assessment
QualityRubric
correctness, completeness, codeQuality, minimalDiff, verification. Multi-dimensional scoring by heuristic + optional LLM judge.
Prompt Forensics
SystemPromptInjection
source, content, label, turnIndex. Tracks every system prompt injection point across the agent session.
Judge Output
JudgeVerdict
result: Pass | Fail, confidence: number, critique: string, evidence: string[]. Binary verdict from one dimension’s judge.
Pairwise
PairwiseVerdict
winner: EvalAgentId | "tie", perDimension: Record, biasDetected: boolean. Position-swap reconciled comparison.
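Position-swap reconciliation can be sketched as follows: the judge sees the pair twice, once in each order, and the verdict counts only if both orderings name the same agent. Treating any disagreement as a tie with biasDetected set is an assumption about the protocol, and the Slot type and reconcile function are illustrative names.

```typescript
type EvalAgentId = "claude-code" | "codex" | "pi-agent";
type Slot = "first" | "second" | "tie"; // which presented position the judge preferred

function reconcile(
  a: EvalAgentId,
  b: EvalAgentId,
  verdictAB: Slot, // judge saw (a, b)
  verdictBA: Slot, // judge saw (b, a)
): { winner: EvalAgentId | "tie"; biasDetected: boolean } {
  // Translate a positional verdict back into an agent id.
  const toAgent = (slot: Slot, first: EvalAgentId, second: EvalAgentId): EvalAgentId | "tie" =>
    slot === "tie" ? "tie" : slot === "first" ? first : second;

  const winnerAB = toAgent(verdictAB, a, b);
  const winnerBA = toAgent(verdictBA, b, a);

  if (winnerAB === winnerBA) return { winner: winnerAB, biasDetected: false };
  // Assumed policy: inconsistent orderings are recorded as a tie and flagged.
  return { winner: "tie", biasDetected: true };
}
```

A judge that always prefers the first position yields "first" in both orderings, which maps to two different agents and is therefore flagged.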
Error Attribution
ErrorAttribution
category: 9 error types across 3 tiers (capability, resource, environmental). Heuristic + LLM-assisted classification.
Bias Correction
BiasCorrection
observedRate, correctedRate, tpr, tnr, ci95Low, ci95High. Rogan-Gladen formula output per dimension.
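The Rogan-Gladen estimator recovers a true pass rate from an observed one, given the judge's true-positive rate (tpr) and true-negative rate (tnr): corrected = (observed + tnr - 1) / (tpr + tnr - 1). The sketch below covers the point estimate only; clamping to [0, 1] and rejecting judges no better than chance are common conventions assumed here, and the confidence-interval fields are not computed.

```typescript
// Point estimate only; ci95Low / ci95High are out of scope for this sketch.
function roganGladen(observedRate: number, tpr: number, tnr: number): number {
  const denom = tpr + tnr - 1;
  if (denom <= 0) {
    // A judge at or below chance carries no usable signal to correct with.
    throw new RangeError("tpr + tnr must exceed 1");
  }
  const raw = (observedRate + tnr - 1) / denom;
  return Math.min(1, Math.max(0, raw)); // assumed convention: clamp to a valid rate
}

// Usage: a judge that passes 60% of runs, with tpr 0.9 and tnr 0.8.
const corrected = roganGladen(0.6, 0.9, 0.8); // (0.6 - 0.2) / 0.7
```

The result can be sanity-checked by running the model forward: with true rate p, the observed rate is p * tpr + (1 - p) * (1 - tnr), which should reproduce the 0.6 input.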
TypeBox Schemas
Design Decisions
Epoch-ms Timestamps
All timestamps are epoch-milliseconds (number), not Date objects. Serializable across JSON boundaries without lossy conversion.
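A quick illustration of the round-trip difference, using an arbitrary timestamp. A Date comes back from JSON as a string and needs explicit revival; a number comes back as the same number.

```typescript
const startMs = 1700000000000; // arbitrary epoch-millisecond value

// Date round trip: the type changes on the way back.
const dateRoundTrip = JSON.parse(JSON.stringify({ startedAt: new Date(startMs) }));
const dateCameBackAs = typeof dateRoundTrip.startedAt; // a string, no longer a Date

// Epoch-ms round trip: value and type are preserved.
const epochRoundTrip = JSON.parse(JSON.stringify({ startMs }));
const epochSurvived = epochRoundTrip.startMs === startMs;
```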
TypeBox, Not Zod
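TypeBox schemas are plain JSON Schema objects built with its Type.* builders; the matching TypeScript types are inferred at compile time via Static<>. Runtime validation is generally faster than Zod, particularly when schemas are compiled ahead of use.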
Extract<>, Not Union
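EvalAgentId derives from AgentType via Extract<>. Making a new agent available for eval takes two steps: add it to AgentType, then add its literal to the Extract<> list. A member renamed or removed in AgentType drops out of EvalAgentId, so stale references fail to compile.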
Sequential Worktrees
git worktree add serialized to avoid .git/config.lock race (Claude Code issue #47266). Agent execution parallelized after.
Security Model
Child processes receive only approved environment variables. The allowlist prevents API key leakage across agent boundaries.
ANTHROPIC_API_KEY is explicitly stripped from the subprocess environment so claude -p uses the Max subscription instead of burning API credits.
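The allowlist approach can be sketched as building the child environment from scratch rather than filtering the parent's. The variable names in ENV_ALLOWLIST and the buildChildEnv helper are hypothetical; the source does not list the approved variables.

```typescript
// Hypothetical allowlist; the real set of approved variables is not given in the source.
const ENV_ALLOWLIST: readonly string[] = ["PATH", "HOME", "LANG"];

function buildChildEnv(
  parent: Record<string, string | undefined>,
  allowlist: readonly string[] = ENV_ALLOWLIST,
): Record<string, string> {
  const child: Record<string, string> = {};
  // Copy only what is explicitly approved; everything else never reaches the child.
  for (const name of allowlist) {
    const value = parent[name];
    if (value !== undefined) child[name] = value;
  }
  // Stripped explicitly as well, so a later allowlist edit cannot reintroduce it.
  delete child["ANTHROPIC_API_KEY"];
  return child;
}
```

Starting from an empty object means a newly introduced secret in the parent environment is excluded by default, without anyone having to remember to block it.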