Architecture · Data Flow

How It Flows

From task YAML to comparison report. Four concentric layers—orchestrator, adapters, enrichers, normalizer—plus the full type lattice that gives every stage its shape.

The Pipeline

Data flows inward through four concentric layers. The orchestrator parses task definitions and allocates worktrees. Adapters spawn agent processes and capture raw output. Enrichers parse JSONL streams into canonical events. The core normalizes, compares, and reports.

Orchestrator

Parses task YAML, allocates git worktrees sequentially (avoiding the .git/config.lock race), then launches adapters in parallel. Manages the run lifecycle: start, timeout, cleanup. Entry point: subq eval run
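Here is a minimal sketch of the start, timeout, and cleanup lifecycle for one agent run. It assumes the adapter call accepts an AbortSignal and that cleanup (for example, removing the worktree) is an async callback. runWithTimeout, RunResult, launch, and cleanup are hypothetical names, not the orchestrator's real API.

```typescript
// Hypothetical result type for one agent run.
type RunResult<T> =
  | { status: "ok"; value: T }
  | { status: "timeout" }
  | { status: "error"; error: unknown };

// Start `launch`, abort it after timeoutMs, and always run `cleanup`.
async function runWithTimeout<T>(
  launch: (signal: AbortSignal) => Promise<T>, // assumed to honor the abort signal
  timeoutMs: number,
  cleanup: () => Promise<void>, // e.g. worktree removal (assumed)
): Promise<RunResult<T>> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<RunResult<T>>((resolve) => {
    timer = setTimeout(() => {
      resolve({ status: "timeout" }); // settle first so the timeout wins the race
      controller.abort(); // then ask the adapter to kill its process
    }, timeoutMs);
  });
  try {
    return await Promise.race([
      launch(controller.signal).then((value): RunResult<T> => ({ status: "ok", value })),
      timeout,
    ]);
  } catch (error) {
    return { status: "error", error };
  } finally {
    clearTimeout(timer);
    await cleanup(); // runs on success, timeout, and error alike
  }
}
```

Resolving the timeout result before calling abort() means the timeout wins the race even if the adapter rejects immediately on abort. The finally block guarantees cleanup on every path.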

Adapters

One per agent. Each adapter wraps Bun.spawn() with agent-specific flags. Returns RawAgentOutput: stdout bytes, exit code, timestamps. Injectable via ProcessSpawner interface for testing.
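The sketch below shows one way to build the injectable seam. The RawAgentOutput fields match the contract listed in the type lattice. The ProcessSpawner method signature, SpawnResult, runAdapter, and FakeSpawner are assumptions for illustration. A production spawner would wrap Bun.spawn(), while the fake returns canned stdout so adapter logic can be tested without launching an agent.

```typescript
// Adapter contract: raw bytes plus timing; enrichment happens elsewhere.
interface RawAgentOutput {
  stdout: Uint8Array;
  exitCode: number;
  startMs: number;
  endMs: number;
}

interface SpawnResult {
  stdout: Uint8Array;
  exitCode: number;
}

// Hypothetical seam; the real implementation would wrap Bun.spawn().
interface ProcessSpawner {
  spawn(cmd: string[], opts: { cwd: string; env: Record<string, string> }): Promise<SpawnResult>;
}

// Run one agent command and stamp start/end times in epoch milliseconds.
async function runAdapter(
  spawner: ProcessSpawner,
  cmd: string[],
  cwd: string,
  env: Record<string, string>,
  now: () => number = Date.now, // injectable clock for deterministic tests
): Promise<RawAgentOutput> {
  const startMs = now();
  const { stdout, exitCode } = await spawner.spawn(cmd, { cwd, env });
  const endMs = now();
  return { stdout, exitCode, startMs, endMs };
}

// Test double: records calls and returns canned output without spawning anything.
class FakeSpawner implements ProcessSpawner {
  readonly calls: string[][] = [];
  private readonly output: string;
  private readonly exitCode: number;

  constructor(output: string, exitCode = 0) {
    this.output = output;
    this.exitCode = exitCode;
  }

  async spawn(cmd: string[]): Promise<SpawnResult> {
    this.calls.push(cmd);
    return { stdout: new TextEncoder().encode(this.output), exitCode: this.exitCode };
  }
}
```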

Enrichers

Parses RawAgentOutput into EnrichedEvent[] using Bun.JSONL.parseChunk(). Extracts milestones, system prompts, tool calls. Reuses existing parsers/ internals from SubQ Code.
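Here is a sketch of the canonicalization step. It assumes each JSONL line has already been parsed into a record with a type, an optional tool name, and a timestamp. The record shape, the EnrichedEvent fields shown, and every entry in CANONICAL_TOOLS (both the agent-side names and the canonical ones) are hypothetical placeholders. They are not SubQ Code's actual parser tables.

```typescript
type AgentId = "claude-code" | "codex" | "pi-agent";

// Hypothetical subset of the canonical event.
interface EnrichedEvent {
  agent: AgentId;
  kind: "tool_call" | "message";
  tool?: string; // canonical name used for cross-agent comparison
  rawTool?: string; // name as the agent emitted it
  timestampMs: number;
}

// Illustrative mapping only: agent-specific tool name -> canonical tool name.
const CANONICAL_TOOLS: Record<AgentId, Record<string, string>> = {
  "claude-code": { Read: "read_file", Edit: "edit_file", Bash: "run_command" },
  codex: { shell: "run_command", apply_patch: "edit_file" },
  "pi-agent": { read: "read_file", edit: "edit_file", bash: "run_command" },
};

// Convert one parsed JSONL record (assumed shape) into an EnrichedEvent.
function toEvent(
  agent: AgentId,
  record: { type: string; name?: string; ts: number },
): EnrichedEvent {
  if (record.type === "tool_use" && record.name !== undefined) {
    // Unknown tools keep their raw name instead of being dropped.
    const canonical = CANONICAL_TOOLS[agent][record.name] ?? record.name;
    return { agent, kind: "tool_call", tool: canonical, rawTool: record.name, timestampMs: record.ts };
  }
  return { agent, kind: "message", timestampMs: record.ts };
}
```

Falling back to the raw name means a tool the mapping does not yet know still shows up in reports under its own name.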

Normalizer → Comparison → Reporter

Normalizes each EnrichedSession to a common schema, aligns milestones across agents, and applies the quality rubric plus an optional LLM judge. Produces an EvalReport, rendered as a terminal table or, with --robot, as JSON.

Outer → core: task YAML → EvalReport
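Milestone alignment works like a pivot: one row per milestone kind and one column per agent. The sketch assumes the Milestone fields from the type lattice. alignMilestones and the null-for-missing convention are illustrative choices, not the framework's confirmed behavior.

```typescript
interface Milestone {
  kind: string;
  timestampMs: number;
  turnIndex: number;
  elapsedMs: number;
}

interface AlignedRow {
  kind: string;
  elapsedMs: Record<string, number | null>; // agent -> elapsed ms, null if never reached
}

// Pivot per-agent milestone lists into rows keyed by milestone kind.
function alignMilestones(byAgent: Record<string, Milestone[]>): AlignedRow[] {
  const kinds = new Set<string>();
  for (const milestones of Object.values(byAgent)) {
    for (const m of milestones) kinds.add(m.kind);
  }
  return Array.from(kinds)
    .sort()
    .map((kind) => {
      const elapsedMs: Record<string, number | null> = {};
      for (const [agent, milestones] of Object.entries(byAgent)) {
        const hit = milestones.find((m) => m.kind === kind);
        elapsedMs[agent] = hit ? hit.elapsedMs : null;
      }
      return { kind, elapsedMs };
    });
}
```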

The Type Lattice

Fifteen interconnected interfaces form the framework's type system. TypeBox schemas validate all YAML-loaded data. Epoch-millisecond timestamps ensure serializable types across JSON boundaries.

Core Identity

EvalAgentId

Extract<AgentType, "claude-code" | "codex" | "pi-agent">

Derived from AgentType via Extract<>, not a parallel union. Single source of truth.
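A minimal illustration of the Extract<> derivation. The AgentType union here is a stand-in, and "other-agent" is a hypothetical member. EVAL_AGENT_IDS and isEvalAgentId are assumed helpers for turning CLI strings into typed IDs.

```typescript
// Stand-in for SubQ Code's AgentType; "other-agent" is hypothetical.
type AgentType = "claude-code" | "codex" | "pi-agent" | "other-agent";

// Narrow the existing union rather than declaring a parallel one.
type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

// Runtime list for argument parsing; `satisfies` rejects any entry that is not an EvalAgentId.
const EVAL_AGENT_IDS = ["claude-code", "codex", "pi-agent"] as const satisfies readonly EvalAgentId[];

// Type guard for untrusted strings (e.g. CLI flags).
function isEvalAgentId(value: string): value is EvalAgentId {
  return (EVAL_AGENT_IDS as readonly string[]).includes(value);
}
```

One caveat: Extract<> silently drops a literal that is not in AgentType. A misspelled name therefore shrinks EvalAgentId instead of raising an error at the definition, which is why a `satisfies` check on the runtime list is a useful complement.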

Raw Output

RawAgentOutput

stdout: Uint8Array, exitCode: number, startMs: number, endMs: number. The adapter's contract—enrichment happens externally.

Canonical Event

EnrichedEvent

Agent-agnostic event with canonical tool names and normalized timestamps. The basis for all cross-agent comparison.

Session Container

EnrichedSession

Messages[], milestones[], system prompts[], token usage. The complete enriched transcript of one agent's run.

Cost Tracking

TokenUsage

inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, costUsd. Per-agent, per-run cost attribution.
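To produce per-run attribution, per-turn usage records can be folded into a single total. The field names come from TokenUsage above. The assumption that usage is reported per turn, and the totalUsage helper, are hypothetical.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  costUsd: number;
}

const EMPTY_USAGE: TokenUsage = {
  inputTokens: 0,
  outputTokens: 0,
  cacheReadTokens: 0,
  cacheWriteTokens: 0,
  costUsd: 0,
};

// Sum per-turn usage records into one per-run total (assumes per-turn reporting).
function totalUsage(turns: readonly TokenUsage[]): TokenUsage {
  return turns.reduce<TokenUsage>(
    (acc, u) => ({
      inputTokens: acc.inputTokens + u.inputTokens,
      outputTokens: acc.outputTokens + u.outputTokens,
      cacheReadTokens: acc.cacheReadTokens + u.cacheReadTokens,
      cacheWriteTokens: acc.cacheWriteTokens + u.cacheWriteTokens,
      costUsd: acc.costUsd + u.costUsd,
    }),
    EMPTY_USAGE,
  );
}
```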

Temporal Markers

Milestone

kind: string, timestampMs: number, turnIndex: number, elapsedMs: number. Kinds include first_file_read, first_test_run, first_file_edit, and more.
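Here is a sketch of milestone extraction over canonical tool calls. It assumes events arrive in chronological order and that elapsedMs is measured from the run's startMs. The ToolCall shape, the canonical tool names, and the classify rules are all hypothetical. In particular, the rule that treats any command containing the word "test" as a test run is an illustrative assumption.

```typescript
interface Milestone {
  kind: string;
  timestampMs: number;
  turnIndex: number;
  elapsedMs: number;
}

// Hypothetical canonical tool-call event.
interface ToolCall {
  tool: string;
  timestampMs: number;
  turnIndex: number;
  command?: string;
}

// Illustrative rules mapping a tool call to a milestone kind.
function classify(e: ToolCall): string | null {
  if (e.tool === "read_file") return "first_file_read";
  if (e.tool === "edit_file") return "first_file_edit";
  if (e.tool === "run_command" && e.command !== undefined && /\btest\b/.test(e.command)) {
    return "first_test_run";
  }
  return null;
}

// Keep only the first occurrence of each kind; events must be in time order.
function extractMilestones(events: readonly ToolCall[], runStartMs: number): Milestone[] {
  const seen = new Set<string>();
  const out: Milestone[] = [];
  for (const e of events) {
    const kind = classify(e);
    if (kind === null || seen.has(kind)) continue;
    seen.add(kind);
    out.push({ kind, timestampMs: e.timestampMs, turnIndex: e.turnIndex, elapsedMs: e.timestampMs - runStartMs });
  }
  return out;
}
```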

Quality Assessment

QualityRubric

correctness, completeness, codeQuality, minimalDiff, verification. Multi-dimensional scoring from heuristics plus an optional LLM judge.

Prompt Forensics

SystemPromptInjection

source, content, label, turnIndex. Tracks every system prompt injection point across the agent session.

TypeBox Schemas

src/eval/types.ts · EvalTaskSchema
id: Type.String() (unique task identifier)
name: Type.String() (human-readable task name)
repoPath: Type.String() (absolute path to the target repository)
baseCommit: Type.Union([Type.String({ pattern: "^[a-f0-9]{4,40}$" }), Type.Null()]) (commit SHA of 4 to 40 lowercase hex characters, or null)
prompt: Type.String({ minLength: 1 }) (the task prompt sent to agents)
verifyCommand: Type.Union([Type.String(), Type.Null()]) (post-eval verification command, or null)
timeoutSeconds: Type.Number({ minimum: 1, maximum: 3600 }) (per-agent timeout)
tags: Type.Array(Type.String()) (categorization for filtering)
src/eval/types.ts · KnobConfigSchema
systemPromptFile: Type.Optional(Type.String()) (path to a system prompt override)
toolAllowList: Type.Optional(Type.Array(Type.String())) (permitted tools)
toolDenyList: Type.Optional(Type.Array(Type.String())) (blocked tools)
skillPaths: Type.Optional(Type.Array(Type.String())) (SKILL.md file paths)
envOverrides: Type.Optional(Type.Record(Type.String(), Type.String())) (environment variable overrides)
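The field listing maps directly onto TypeBox code. This sketch assumes the @sinclair/typebox package and validates data that has already been parsed from YAML. YAML parsing itself is omitted, and parseTask is a hypothetical helper. TypeBox's `pattern` option takes a string, not a RegExp literal.

```typescript
import { Type, type Static } from "@sinclair/typebox";
import { Value } from "@sinclair/typebox/value";

const EvalTaskSchema = Type.Object({
  id: Type.String(),
  name: Type.String(),
  repoPath: Type.String(),
  baseCommit: Type.Union([Type.String({ pattern: "^[a-f0-9]{4,40}$" }), Type.Null()]),
  prompt: Type.String({ minLength: 1 }),
  verifyCommand: Type.Union([Type.String(), Type.Null()]),
  timeoutSeconds: Type.Number({ minimum: 1, maximum: 3600 }),
  tags: Type.Array(Type.String()),
});
type EvalTask = Static<typeof EvalTaskSchema>; // compile-time type inferred from the schema

const KnobConfigSchema = Type.Object({
  systemPromptFile: Type.Optional(Type.String()),
  toolAllowList: Type.Optional(Type.Array(Type.String())),
  toolDenyList: Type.Optional(Type.Array(Type.String())),
  skillPaths: Type.Optional(Type.Array(Type.String())),
  envOverrides: Type.Optional(Type.Record(Type.String(), Type.String())),
});
type KnobConfig = Static<typeof KnobConfigSchema>;

// Validate an already-parsed YAML document; Value.Check narrows `data` to EvalTask.
function parseTask(data: unknown): EvalTask {
  if (!Value.Check(EvalTaskSchema, data)) {
    const first = Value.Errors(EvalTaskSchema, data).First();
    throw new Error(`invalid task at ${first?.path ?? "?"}: ${first?.message ?? "unknown error"}`);
  }
  return data;
}
```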

Design Decisions

Epoch-ms Timestamps

All timestamps are epoch-milliseconds (number), not Date objects. Serializable across JSON boundaries without lossy conversion.

JSON.stringify safety
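A small demonstration of why epoch-ms numbers are safer than Date objects across a JSON boundary. The interfaces here are illustrative only.

```typescript
// Two candidate shapes for the same event timestamp.
interface WithDate {
  at: Date;
}
interface WithEpochMs {
  atMs: number;
}

const instant = 1700000000000; // an arbitrary epoch-ms value
const withDate: WithDate = { at: new Date(instant) };
const withMs: WithEpochMs = { atMs: instant };

// JSON has no Date type: the Date serializes via toJSON() and comes back as an ISO string.
const dateBack = JSON.parse(JSON.stringify(withDate)) as { at: unknown };
// A number survives the round trip unchanged, so arithmetic like endMs - startMs still works.
const msBack = JSON.parse(JSON.stringify(withMs)) as WithEpochMs;

console.log(typeof dateBack.at); // logs: string
console.log(msBack.atMs === withMs.atMs); // logs: true
```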

TypeBox, Not Zod

TypeBox builders produce standard JSON Schema objects, and Static<> infers the matching TypeScript type at compile time. Faster at runtime than Zod.

Performance + JSON Schema output

Extract<>, Not Union

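EvalAgentId derives from AgentType via Extract<>, so it can only name agents that AgentType already defines. A renamed or removed agent drops out of EvalAgentId automatically and surfaces type errors at its use sites. A new agent becomes available for eval once it is added to AgentType and listed in the Extract<>.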

Single source of truth

Sequential Worktrees

Calls to git worktree add are serialized to avoid the .git/config.lock race (Claude Code issue #47266); agent execution is parallelized afterward.

Race condition avoidance
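The serialize-then-parallelize ordering can be sketched with two injected callbacks. runAll, allocate, and execute are hypothetical names. In practice, allocate would shell out to git worktree add, which this sketch does not do.

```typescript
// Allocate one worktree at a time, then run every agent concurrently.
async function runAll<A, W, R>(
  agents: readonly A[],
  allocate: (agent: A) => Promise<W>, // e.g. wraps `git worktree add` (assumed)
  execute: (agent: A, worktree: W) => Promise<R>,
): Promise<R[]> {
  const prepared: Array<[A, W]> = [];
  for (const agent of agents) {
    // Awaiting inside the loop guarantees at most one allocation is in flight.
    prepared.push([agent, await allocate(agent)]);
  }
  // Execution has no shared git lock, so all agents start together.
  return Promise.all(prepared.map(([agent, worktree]) => execute(agent, worktree)));
}
```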

Security Model

Child processes receive only approved environment variables. The allowlist prevents API key leakage across agent boundaries.

Variable | Purpose | Agents
HOME | User home directory for config file resolution | SubQ, Claude, Codex
PATH | Binary resolution for git, bun, node | SubQ, Claude, Codex
SUBQ_SYSTEM_* | System prompt override for eval knob config | SubQ
ANTHROPIC_API_KEY | STRIPPED: forces Max subscription | Claude
Key insight: ANTHROPIC_API_KEY is explicitly stripped from Claude Code's subprocess environment, so claude -p uses the Max subscription instead of burning API credits.
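Here is a minimal allowlist sketch following the table. HOME and PATH pass through for every agent, SUBQ_SYSTEM_* passes through only for SubQ, and everything else is dropped, so ANTHROPIC_API_KEY never reaches Claude Code. The agent labels and buildChildEnv are hypothetical.

```typescript
// Hypothetical labels for the three agent columns in the table.
type ChildAgent = "subq" | "claude" | "codex";

// Variables every agent may receive.
const SHARED_VARS: readonly string[] = ["HOME", "PATH"];

// Build a child environment from an explicit allowlist.
function buildChildEnv(
  agent: ChildAgent,
  parent: Record<string, string | undefined>,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [key, value] of Object.entries(parent)) {
    if (value === undefined) continue;
    const shared = SHARED_VARS.includes(key);
    const subqOverride = agent === "subq" && key.startsWith("SUBQ_SYSTEM_");
    if (shared || subqOverride) env[key] = value; // anything not allowlisted is dropped
  }
  // Defensive: keep the key stripped even if the allowlist is widened later,
  // so `claude -p` uses the Max subscription rather than API credits.
  delete env["ANTHROPIC_API_KEY"];
  return env;
}
```

In use, something like `buildChildEnv("claude", process.env)` would be passed as the `env` option of Bun.spawn().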

JSONL Streaming

src/eval/enricher.ts — streaming pattern
import Bun.JSONL.parseChunk() — handles partial lines across chunks
stream proc.stdout: ReadableStream<Uint8Array> from Bun.spawn()
chunk for await (const chunk of proc.stdout) { parseChunk(chunk) }
emit Each parsed line → EnrichedEvent with canonical tool names
collect EnrichedEvent[] → EnrichedSession with milestones + token usage
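The key requirement of the chunk loop is carrying an incomplete trailing line into the next chunk. The sketch below implements that with plain TextDecoder and a string buffer, as a portable stand-in for the partial-line handling attributed to Bun.JSONL.parseChunk(). It does not call that API, and parseJsonl is a hypothetical name.

```typescript
// Yield one parsed JSON value per complete line, buffering partial lines across chunks.
async function* parseJsonl(chunks: AsyncIterable<Uint8Array>): AsyncGenerator<unknown> {
  const decoder = new TextDecoder();
  let buffer = "";
  for await (const chunk of chunks) {
    // stream: true holds back a multi-byte UTF-8 sequence split across chunks.
    buffer += decoder.decode(chunk, { stream: true });
    let nl: number;
    while ((nl = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, nl).trim();
      buffer = buffer.slice(nl + 1); // keep everything after the newline for later
      if (line !== "") yield JSON.parse(line);
    }
  }
  buffer += decoder.decode(); // flush any bytes still held by the decoder
  if (buffer.trim() !== "") yield JSON.parse(buffer); // final line without trailing newline
}
```

In Bun, `proc.stdout` from Bun.spawn() is a ReadableStream<Uint8Array> that supports `for await`, so it can be passed directly as `chunks`.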