The Pipeline
Data flows inward through four concentric layers. The orchestrator parses task definitions and allocates worktrees. Adapters spawn agent processes and capture raw output. Enrichers parse JSONL streams into canonical events. The core normalizes, compares, and reports.
Orchestrator
Parses task YAML, allocates git worktrees sequentially (avoids .git/config.lock race), launches adapters in parallel. Manages run lifecycle: start, timeout, cleanup. Entry point: subq eval run
Adapters
One per agent. Each adapter wraps Bun.spawn() with agent-specific flags. Returns RawAgentOutput: stdout bytes, exit code, timestamps. Injectable via ProcessSpawner interface for testing.
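The injectable spawner can be sketched roughly as follows. The source names ProcessSpawner and the fields of RawAgentOutput but not their exact shapes, so the spawn() signature, the runAdapter helper, and the fake spawner below are assumptions for illustration. A production spawner would wrap Bun.spawn() instead of returning canned bytes.

```typescript
// Field names follow the RawAgentOutput description; the rest is hypothetical.
interface RawAgentOutput {
  stdout: Uint8Array;
  exitCode: number;
  startMs: number;
  endMs: number;
}

// Assumed shape of the injectable interface; the real ProcessSpawner may differ.
interface ProcessSpawner {
  spawn(
    argv: string[],
    env: Record<string, string>,
  ): Promise<{ stdout: Uint8Array; exitCode: number }>;
}

// Hypothetical adapter core: delegates to the spawner and stamps
// epoch-millisecond timestamps around the call.
async function runAdapter(
  spawner: ProcessSpawner,
  argv: string[],
  env: Record<string, string>,
  now: () => number = Date.now,
): Promise<RawAgentOutput> {
  const startMs = now();
  const { stdout, exitCode } = await spawner.spawn(argv, env);
  const endMs = now();
  return { stdout, exitCode, startMs, endMs };
}

// Test double: echoes the argv back as stdout without starting a process.
const fakeSpawner: ProcessSpawner = {
  async spawn(argv) {
    return { stdout: new TextEncoder().encode(argv.join(" ")), exitCode: 0 };
  },
};
```

Passing the clock in as a parameter keeps the timestamps deterministic in tests, in the same spirit as injecting the spawner.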
Enrichers
Parse RawAgentOutput into EnrichedEvent[] using Bun.JSONL.parseChunk(). Extract milestones, system prompts, and tool calls. Reuse the existing parsers/ internals from SubQ Code.
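To illustrate the chunked parsing step, here is a portable stand-in written with JSON.parse. It is not Bun.JSONL.parseChunk() and makes no claim about that function's signature; the parseJsonlChunk name and its return shape are hypothetical. The idea it shows is that complete lines are parsed while a trailing partial line is handed back to be joined with the next chunk.

```typescript
// Stand-in for a chunked JSONL parser (hypothetical helper, not the Bun API).
// Parses every complete line and returns the unfinished tail as `rest`.
function parseJsonlChunk(buffer: string): { values: unknown[]; rest: string } {
  const lastNewline = buffer.lastIndexOf("\n");
  if (lastNewline === -1) {
    // No complete line yet: keep everything for the next chunk.
    return { values: [], rest: buffer };
  }
  const complete = buffer.slice(0, lastNewline);
  const rest = buffer.slice(lastNewline + 1);
  const values: unknown[] = [];
  for (const line of complete.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines
    values.push(JSON.parse(trimmed)); // throws on malformed JSON
  }
  return { values, rest };
}
```

A caller would prepend `rest` to the next decoded chunk of stdout before calling the function again.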
Normalizer → Comparison → Reporter
Normalizes EnrichedSessions to common schema. Aligns milestones across agents. Applies quality rubric and optional LLM judge. Produces EvalReport in terminal table or --robot JSON.
The Type Lattice
Fifteen interconnected interfaces form the framework's type system. TypeBox schemas validate all YAML-loaded data. Epoch-millisecond timestamps ensure serializable types across JSON boundaries.
Core Identity
EvalAgentId
Extract<AgentType, "claude-code" | "codex" | "pi-agent">
Derived from AgentType via Extract<>, not a parallel union. Single source of truth.
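A minimal sketch of the derivation, assuming a hypothetical AgentType with one extra member (the real union is defined elsewhere and is not shown in this document). The isEvalAgentId guard and the lookup record are illustrative additions, not part of the framework.

```typescript
// Hypothetical AgentType; "some-other-agent" is invented for the example.
type AgentType = "claude-code" | "codex" | "pi-agent" | "some-other-agent";

// Keeps only the listed members that actually exist in AgentType.
type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

// Exhaustive record: the compiler rejects a missing or an extra key,
// so this table has to track EvalAgentId exactly.
const EVAL_AGENT_IDS: Record<EvalAgentId, true> = {
  "claude-code": true,
  codex: true,
  "pi-agent": true,
};

// Runtime guard for ids that arrive as plain strings (for example from YAML).
function isEvalAgentId(value: string): value is EvalAgentId {
  return Object.prototype.hasOwnProperty.call(EVAL_AGENT_IDS, value);
}
```

Note that Extract<> drops a listed literal that is absent from AgentType rather than raising an error, which is why an exhaustive Record like the one above is a useful companion check.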
Raw Output
RawAgentOutput
stdout: Uint8Array, exitCode: number, startMs: number, endMs: number. This is the adapter's entire contract; enrichment happens externally.
Canonical Event
EnrichedEvent
Agent-agnostic event with canonical tool names, normalized timestamps. Basis for all cross-agent comparison.
Session Container
EnrichedSession
Messages[], milestones[], system prompts[], token usage. The complete enriched transcript of one agent's run.
Cost Tracking
TokenUsage
inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, costUsd. Per-agent, per-run cost attribution.
Temporal Markers
Milestone
kind: string, timestampMs: number, turnIndex: number, elapsedMs: number. Example kinds include first_file_read, first_test_run, and first_file_edit.
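One way such "first occurrence" milestones could be derived is sketched below. The Milestone fields come from the description above; the ToolEvent shape, the canonical tool names, the tool-to-kind mapping, and the reading of elapsedMs as time since session start are all assumptions made for the example.

```typescript
interface Milestone {
  kind: string;
  timestampMs: number;
  turnIndex: number;
  elapsedMs: number;
}

// Hypothetical minimal event; the real EnrichedEvent carries more fields.
interface ToolEvent {
  tool: string; // canonical tool name (assumed values below)
  timestampMs: number;
  turnIndex: number;
}

// Assumed mapping from canonical tool name to milestone kind.
const FIRST_KINDS = new Map<string, string>([
  ["read_file", "first_file_read"],
  ["edit_file", "first_file_edit"],
  ["run_tests", "first_test_run"],
]);

// Emits one milestone per kind, at the first event that triggers it.
// elapsedMs is taken here as time since session start (an assumption).
function firstMilestones(events: ToolEvent[], sessionStartMs: number): Milestone[] {
  const seen = new Set<string>();
  const milestones: Milestone[] = [];
  for (const event of events) {
    const kind = FIRST_KINDS.get(event.tool);
    if (kind === undefined || seen.has(kind)) continue;
    seen.add(kind);
    milestones.push({
      kind,
      timestampMs: event.timestampMs,
      turnIndex: event.turnIndex,
      elapsedMs: event.timestampMs - sessionStartMs,
    });
  }
  return milestones;
}
```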
Quality Assessment
QualityRubric
correctness, completeness, codeQuality, minimalDiff, verification. Multi-dimensional scoring by heuristic + optional LLM judge.
Prompt Forensics
SystemPromptInjection
source, content, label, turnIndex. Tracks every system prompt injection point across the agent session.
TypeBox Schemas
Design Decisions
Epoch-ms Timestamps
All timestamps are epoch-milliseconds (number), not Date objects. Serializable across JSON boundaries without lossy conversion.
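The difference shows up in a plain JSON round trip. The field names startedAt and startMs below are illustrative only.

```typescript
// A Date field and an epoch-ms field holding the same instant.
const withDate = { startedAt: new Date(1700000000000) };
const withEpoch = { startMs: 1700000000000 };

const dateRoundTrip = JSON.parse(JSON.stringify(withDate));
const epochRoundTrip = JSON.parse(JSON.stringify(withEpoch));

// JSON.stringify turns a Date into an ISO string; JSON.parse does not revive it,
// so the field comes back with a different type.
const dateCameBackAsString = typeof dateRoundTrip.startedAt === "string";

// A number comes back as the identical number.
const epochUnchanged = epochRoundTrip.startMs === withEpoch.startMs;
```

With epoch-ms fields, the parsed object already matches its declared type and needs no reviver step.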
TypeBox, Not Zod
TypeBox schemas are plain JSON Schema objects built with its Type.* builders; the matching TypeScript types are inferred at compile time via Static<>. Its compiled validators are generally faster at runtime than Zod.
Extract<>, Not Union
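EvalAgentId derives from AgentType via Extract<>. Enabling a new agent for eval means adding it to AgentType and to the Extract<> list. An id that is renamed or removed in AgentType drops out of EvalAgentId, so EvalAgentId can never contain an id that AgentType lacks.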
Sequential Worktrees
git worktree add serialized to avoid .git/config.lock race (Claude Code issue #47266). Agent execution parallelized after.
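The two-phase shape can be sketched as below. The runAll function and both callback types are hypothetical; in the real orchestrator the first callback would run git worktree add and the second would launch an adapter.

```typescript
// Hypothetical callbacks standing in for `git worktree add` and an adapter launch.
type AddWorktree = (agentId: string) => Promise<string>; // resolves to the worktree path
type RunAgent = (agentId: string, worktreePath: string) => Promise<void>;

async function runAll(
  agentIds: string[],
  addWorktree: AddWorktree,
  runAgent: RunAgent,
): Promise<void> {
  // Phase 1: create worktrees one at a time, so at most one
  // worktree creation is in flight and the lock is never contended.
  const jobs: { id: string; path: string }[] = [];
  for (const id of agentIds) {
    const path = await addWorktree(id);
    jobs.push({ id, path });
  }

  // Phase 2: every agent starts only after all worktrees exist,
  // and all agents run concurrently.
  await Promise.all(jobs.map((job) => runAgent(job.id, job.path)));
}
```

Promise.all rejects as soon as any agent fails; a variant that needs every agent's result regardless of failures could use Promise.allSettled instead.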
Security Model
Child processes receive only approved environment variables. The allowlist prevents API key leakage across agent boundaries.
ANTHROPIC_API_KEY is explicitly stripped from the subprocess environment so claude -p uses the Max subscription instead of burning API credits.
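A minimal sketch of allowlist-based environment construction. The variable names in ENV_ALLOWLIST and the buildChildEnv helper are assumptions; the source does not list the approved variables.

```typescript
// Hypothetical allowlist; the real set of approved variables is not given here.
const ENV_ALLOWLIST: readonly string[] = ["PATH", "HOME", "LANG"];

// Builds the child environment from scratch: a variable is copied only
// if it is on the allowlist, so anything unlisted never reaches the child.
function buildChildEnv(
  parentEnv: Record<string, string | undefined>,
  allowlist: readonly string[] = ENV_ALLOWLIST,
): Record<string, string> {
  const childEnv: Record<string, string> = {};
  for (const name of allowlist) {
    const value = parentEnv[name];
    if (value !== undefined) childEnv[name] = value;
  }
  // Explicit strip: stays out even if the key is later added to the allowlist.
  delete childEnv["ANTHROPIC_API_KEY"];
  return childEnv;
}
```

Starting from an empty object rather than deleting keys from a copy of the parent environment means a newly introduced secret is excluded by default.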