How It Flows — SubQ Code Eval Framework

The Pipeline

Data flows inward through four concentric layers. The orchestrator parses task definitions and allocates worktrees. Adapters spawn agent processes and capture raw output. Enrichers parse JSONL streams into canonical events. The core normalizes, compares, and reports.

Orchestrator

Parses task YAML, allocates git worktrees sequentially (avoiding the .git/config.lock race), then launches adapters in parallel. Manages the run lifecycle: start, timeout, cleanup. Entry point: subq eval run.
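
The two phases described above can be sketched as follows. This is a minimal illustration, not the framework's orchestrator: the RunDeps interface and the runAgents function are hypothetical names, and the injected callbacks stand in for the real git and adapter calls.

```typescript
// Hypothetical shapes: the document names the steps, not these signatures.
type AgentId = "claude-code" | "codex" | "pi-agent";

interface RunDeps {
  // Creates a worktree for one agent and resolves to its path.
  addWorktree: (agent: AgentId) => Promise<string>;
  // Runs one agent inside its worktree and resolves to its exit code.
  runAdapter: (agent: AgentId, worktree: string) => Promise<number>;
}

async function runAgents(
  agents: AgentId[],
  deps: RunDeps,
): Promise<Map<AgentId, number>> {
  // Phase 1: allocate worktrees one at a time, so only one
  // worktree creation touches the repository's lock files at once.
  const worktrees = new Map<AgentId, string>();
  for (const agent of agents) {
    worktrees.set(agent, await deps.addWorktree(agent));
  }

  // Phase 2: agents run concurrently, each in its own worktree.
  const exitCodes = await Promise.all(
    agents.map((agent) => deps.runAdapter(agent, worktrees.get(agent)!)),
  );
  return new Map(
    agents.map((agent, i): [AgentId, number] => [agent, exitCodes[i]]),
  );
}
```

Because both callbacks are injected, the ordering guarantee can be checked in a unit test without touching git.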

Adapters

One per agent. Each adapter wraps Bun.spawn() with agent-specific flags. Returns RawAgentOutput: stdout bytes, exit code, timestamps. Injectable via ProcessSpawner interface for testing.
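
A sketch of how the injectable spawner might look in a test. The RawAgentOutput fields follow the description in this document; the ProcessSpawner method signature, SpawnResult, runClaudeAdapter, and FakeSpawner are assumptions made for illustration, and a production spawner would wrap Bun.spawn() instead of returning canned bytes.

```typescript
interface RawAgentOutput {
  stdout: Uint8Array;
  exitCode: number;
  startMs: number; // epoch milliseconds
  endMs: number; // epoch milliseconds
}

// Assumed shape of what a spawner hands back to the adapter.
interface SpawnResult {
  stdout: Uint8Array;
  exitCode: number;
}

// Assumed signature: the document names the interface but not its methods.
interface ProcessSpawner {
  spawn(cmd: string[]): Promise<SpawnResult>;
}

// Adapter for one agent: builds the agent-specific command line, delegates
// process creation to the injected spawner, and timestamps the run.
async function runClaudeAdapter(
  spawner: ProcessSpawner,
  prompt: string,
  now: () => number = Date.now,
): Promise<RawAgentOutput> {
  const startMs = now();
  const { stdout, exitCode } = await spawner.spawn(["claude", "-p", prompt]);
  return { stdout, exitCode, startMs, endMs: now() };
}

// Test double: records each command and returns canned output
// without launching a process.
class FakeSpawner implements ProcessSpawner {
  readonly calls: string[][] = [];
  private readonly canned: SpawnResult;

  constructor(canned: SpawnResult) {
    this.canned = canned;
  }

  async spawn(cmd: string[]): Promise<SpawnResult> {
    this.calls.push(cmd);
    return this.canned;
  }
}
```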

Enrichers

Parse RawAgentOutput into EnrichedEvent[] using Bun.JSONL.parseChunk(). Extract milestones, system prompts, and tool calls. Reuse the existing parsers/ internals from SubQ Code.

Judge Layer

Error analysis pipeline classifies failures into 9 categories. Five binary LLM judges (one per dimension) with anti-hallucination grounding. PoLL multi-model panel with position-swap protocol. Blends with heuristic scores.

Normalizer → Comparison → Reporter

Normalizes EnrichedSessions. Aligns milestones. Applies quality rubric. Blends heuristic + LLM judge scores via per-dimension weights. Produces EvalReport with Rogan-Gladen corrected rates.
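
One way the per-dimension blend could work is a linear interpolation between the heuristic score and the judge score. The dimension names come from QualityRubric; the linear formula, the 0-to-1 score range, the fallback to the heuristic when a judge verdict is missing, and the blendScores name are all assumptions.

```typescript
const DIMENSIONS = [
  "correctness",
  "completeness",
  "codeQuality",
  "minimalDiff",
  "verification",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Scores = Record<Dimension, number>;

// heuristic:   heuristic score per dimension, assumed to lie in [0, 1]
// judge:       LLM judge score per dimension (1 = Pass, 0 = Fail), may be absent
// judgeWeight: share of the final score given to the judge, in [0, 1]
function blendScores(
  heuristic: Scores,
  judge: Partial<Scores>,
  judgeWeight: Scores,
): Scores {
  const blended = {} as Scores;
  for (const dim of DIMENSIONS) {
    const judgeScore = judge[dim];
    // Without a judge verdict the heuristic carries the full weight.
    const w = judgeScore === undefined ? 0 : judgeWeight[dim];
    blended[dim] = w * (judgeScore ?? 0) + (1 - w) * heuristic[dim];
  }
  return blended;
}
```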

Outer → core: task YAML → EvalReport

The Type Lattice

Fifteen interconnected interfaces form the framework's type system. TypeBox schemas validate all YAML-loaded data. Epoch-millisecond timestamps ensure serializable types across JSON boundaries.

Core Identity

EvalAgentId

Extract<AgentType, "claude-code" | "codex" | "pi-agent">

Derived from AgentType via Extract<>, not a parallel union. Single source of truth.
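
A small sketch of the derivation. AgentType's full membership is not shown in this document, so the "subq" member is a placeholder; EVAL_AGENT_IDS and isEvalAgentId are hypothetical helpers.

```typescript
// Placeholder union: only the three eval IDs are confirmed by the document.
type AgentType = "subq" | "claude-code" | "codex" | "pi-agent";

// Keeps only the members of AgentType that appear in the filter.
type EvalAgentId = Extract<AgentType, "claude-code" | "codex" | "pi-agent">;

// Runtime mirror of the type. The annotation makes the compiler reject
// any entry that is not a valid EvalAgentId.
const EVAL_AGENT_IDS: readonly EvalAgentId[] = ["claude-code", "codex", "pi-agent"];

// Narrows arbitrary input (for example a CLI argument) to EvalAgentId.
function isEvalAgentId(value: string): value is EvalAgentId {
  return (EVAL_AGENT_IDS as readonly string[]).includes(value);
}
```

Note that Extract<> silently drops a filter literal that is missing from AgentType instead of reporting an error, so a renamed agent disappears from EvalAgentId rather than breaking the build at the type definition.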

Raw Output

RawAgentOutput

stdout: Uint8Array, exitCode: number, startMs: number, endMs: number. The adapter's contract—enrichment happens externally.

Canonical Event

EnrichedEvent

Agent-agnostic event with canonical tool names, normalized timestamps. Basis for all cross-agent comparison.

Session Container

EnrichedSession

Messages[], milestones[], system prompts[], token usage. The complete enriched transcript of one agent's run.

Cost Tracking

TokenUsage

inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, costUsd. Per-agent, per-run cost attribution.

Temporal Markers

Milestone

kind: string, timestampMs: number, turnIndex: number, elapsedMs: number. first_file_read, first_test_run, first_file_edit, and more.
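
A hedged sketch of how "first occurrence" milestones might be derived from an event stream. The Milestone fields match the list above; ToolEvent, the canonical tool names (read_file, edit_file, run_tests), and the reading of elapsedMs as time since session start are assumptions.

```typescript
interface Milestone {
  kind: string;
  timestampMs: number;
  turnIndex: number;
  elapsedMs: number;
}

// Hypothetical subset of EnrichedEvent, enough for this illustration.
interface ToolEvent {
  tool: string;
  timestampMs: number;
  turnIndex: number;
}

// Assumed mapping from canonical tool name to milestone kind.
const FIRST_USE_KIND = new Map<string, string>([
  ["read_file", "first_file_read"],
  ["edit_file", "first_file_edit"],
  ["run_tests", "first_test_run"],
]);

// Emits one milestone per kind, at the first event that triggers it.
function extractMilestones(events: ToolEvent[], sessionStartMs: number): Milestone[] {
  const seen = new Set<string>();
  const milestones: Milestone[] = [];
  for (const event of events) {
    const kind = FIRST_USE_KIND.get(event.tool);
    if (kind === undefined || seen.has(kind)) continue;
    seen.add(kind);
    milestones.push({
      kind,
      timestampMs: event.timestampMs,
      turnIndex: event.turnIndex,
      elapsedMs: event.timestampMs - sessionStartMs,
    });
  }
  return milestones;
}
```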

Quality Assessment

QualityRubric

correctness, completeness, codeQuality, minimalDiff, verification. Multi-dimensional scoring by heuristic + optional LLM judge.

Prompt Forensics

SystemPromptInjection

source, content, label, turnIndex. Tracks every system prompt injection point across the agent session.

Judge Output

JudgeVerdict

result: Pass | Fail, confidence: number, critique: string, evidence: string[]. Binary verdict from one dimension’s judge.

Pairwise

PairwiseVerdict

winner: EvalAgentId | "tie", perDimension: Record, biasDetected: boolean. Position-swap reconciled comparison.
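
The position-swap idea can be illustrated as follows: the judge compares the same pair twice with the presentation order reversed, and a verdict counts only if it survives the swap. This is a sketch under assumptions: PositionalResult and reconcile are hypothetical names, perDimension is omitted, and treating any disagreement between the two passes as detected bias is one possible rule, not necessarily the framework's.

```typescript
type EvalAgentId = "claude-code" | "codex" | "pi-agent";
type Winner = EvalAgentId | "tie";

// What a single judge pass reports: which presented position won.
type PositionalResult = "first" | "second" | "tie";

function toWinner(
  result: PositionalResult,
  first: EvalAgentId,
  second: EvalAgentId,
): Winner {
  if (result === "first") return first;
  if (result === "second") return second;
  return "tie";
}

// passAB: verdict with agent a shown first, agent b second.
// passBA: verdict with the order swapped.
function reconcile(
  a: EvalAgentId,
  b: EvalAgentId,
  passAB: PositionalResult,
  passBA: PositionalResult,
): { winner: Winner; biasDetected: boolean } {
  const winnerAB = toWinner(passAB, a, b);
  const winnerBA = toWinner(passBA, b, a);
  if (winnerAB === winnerBA) {
    return { winner: winnerAB, biasDetected: false };
  }
  // The verdict changed when only the order changed: report a tie.
  return { winner: "tie", biasDetected: true };
}
```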

Error Attribution

ErrorAttribution

category: 9 error types across 3 tiers (capability, resource, environmental). Heuristic + LLM-assisted classification.

Bias Correction

BiasCorrection

observedRate, correctedRate, tpr, tnr, ci95Low, ci95High. Rogan-Gladen formula output per dimension.
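
For reference, the Rogan-Gladen estimator corrects an observed pass rate for a judge's known error rates: corrected = (observed + tnr - 1) / (tpr + tnr - 1). The sketch below implements only that formula with clamping to [0, 1]; the function name is hypothetical, and the confidence interval fields are left out because this document does not say how they are computed.

```typescript
// observedRate: fraction of runs the judge marked Pass
// tpr: judge sensitivity (true passes it marks Pass)
// tnr: judge specificity (true fails it marks Fail)
function roganGladen(observedRate: number, tpr: number, tnr: number): number {
  const denominator = tpr + tnr - 1;
  if (denominator <= 0) {
    // A judge no better than chance carries no usable signal.
    throw new RangeError("tpr + tnr must exceed 1");
  }
  const corrected = (observedRate + tnr - 1) / denominator;
  // Sampling noise can push the estimate outside [0, 1]; clamp it.
  return Math.min(1, Math.max(0, corrected));
}
```

With a perfect judge (tpr = tnr = 1) the correction leaves the observed rate unchanged, which is a quick sanity check for any implementation.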

TypeBox Schemas

src/eval/types.ts — EvalTaskSchema

id Type.String() — unique task identifier

name Type.String() — human-readable task name

repoPath Type.String() — absolute path to target repository

baseCommit Type.Union([Type.String({pattern: "^[a-f0-9]{4,40}$"}), Type.Null()]) — abbreviated or full commit SHA, or null

prompt Type.String({minLength: 1}) — the task prompt sent to agents

verifyCommand Type.Union([Type.String(), Type.Null()]) — post-eval verification

timeoutSeconds Type.Number({minimum: 1, maximum: 3600}) — per-agent timeout

tags Type.Array(Type.String()) — categorization for filtering

src/eval/types.ts — KnobConfigSchema

systemPromptFile Type.Optional(Type.String()) — path to system prompt override

toolAllowList Type.Optional(Type.Array(Type.String())) — permitted tools

toolDenyList Type.Optional(Type.Array(Type.String())) — blocked tools

skillPaths Type.Optional(Type.Array(Type.String())) — SKILL.md file paths

envOverrides Type.Optional(Type.Record(Type.String(), Type.String())) — env vars

Design Decisions

Epoch-ms Timestamps

All timestamps are epoch-milliseconds (number), not Date objects. Serializable across JSON boundaries without lossy conversion.

JSON.stringify safety
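
A short demonstration of the difference: a number survives a JSON round trip with its type intact, while a Date comes back as a string. The variable names are illustrative only.

```typescript
const startMs = Date.UTC(2024, 0, 1); // any instant, as epoch milliseconds

// Round trip a plain number and a Date through JSON.
const numberTrip = JSON.parse(JSON.stringify({ startMs }));
const dateTrip = JSON.parse(JSON.stringify({ start: new Date(startMs) }));

console.log(typeof numberTrip.startMs); // "number"
console.log(typeof dateTrip.start); // "string" (ISO 8601 text, no longer a Date)
```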

TypeBox, Not Zod

TypeBox schemas are JSON Schema objects built at runtime, with static TypeScript types inferred via Static<>. Its compiled validators are generally faster at runtime than Zod.

Performance + JSON Schema output

Extract<>, Not Union

EvalAgentId derives from AgentType via Extract<>, so it can never contain an ID that AgentType lacks. A new agent added to AgentType becomes available for eval once its ID is also listed in the Extract<> filter.

Single source of truth

Sequential Worktrees

git worktree add serialized to avoid .git/config.lock race (Claude Code issue #47266). Agent execution parallelized after.

Race condition avoidance

Security Model

Child processes receive only approved environment variables. The allowlist prevents API key leakage across agent boundaries.

Variable            Purpose                                           Agents
HOME                User home directory for config file resolution    SubQ, Claude, Codex
PATH                Binary resolution for git, bun, node              SubQ, Claude, Codex
SUBQ_SYSTEM_*       System prompt override for eval knob config       SubQ
ANTHROPIC_API_KEY   STRIPPED (forces Max subscription)                Claude

Key insight: Claude Code’s ANTHROPIC_API_KEY is explicitly stripped from the subprocess environment so claude -p uses the Max subscription instead of burning API credits.
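
A minimal sketch of an allowlist filter built from the variables listed above. The per-agent structure, the agent labels, and the buildChildEnv name are assumptions, and the real allowlist may contain more entries than the ones shown here.

```typescript
// Hypothetical labels for the agents named in this section.
type AgentLabel = "subq" | "claude" | "codex";

interface AllowRule {
  exact: string[]; // variable names passed through verbatim
  prefixes: string[]; // name prefixes, such as SUBQ_SYSTEM_
}

const ALLOWLIST: Record<AgentLabel, AllowRule> = {
  subq: { exact: ["HOME", "PATH"], prefixes: ["SUBQ_SYSTEM_"] },
  claude: { exact: ["HOME", "PATH"], prefixes: [] },
  codex: { exact: ["HOME", "PATH"], prefixes: [] },
};

// Builds the child environment from scratch. Anything not on the
// allowlist, ANTHROPIC_API_KEY included, is never copied.
function buildChildEnv(
  parentEnv: Record<string, string | undefined>,
  agent: AgentLabel,
): Record<string, string> {
  const { exact, prefixes } = ALLOWLIST[agent];
  const childEnv: Record<string, string> = {};
  for (const [name, value] of Object.entries(parentEnv)) {
    if (value === undefined) continue;
    const allowed =
      exact.includes(name) || prefixes.some((prefix) => name.startsWith(prefix));
    if (allowed) childEnv[name] = value;
  }
  return childEnv;
}
```

Starting from an empty object instead of deleting keys from a copy means a newly introduced secret in the parent environment stays out of child processes by default.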

JSONL Streaming

src/eval/enricher.ts — streaming pattern

import: Bun.JSONL.parseChunk(), which handles partial lines across chunks

stream: proc.stdout, a ReadableStream<Uint8Array> from Bun.spawn()

chunk: for await (const chunk of proc.stdout) { parseChunk(chunk) }

emit: each parsed line becomes an EnrichedEvent with canonical tool names

collect: Events[] are gathered into an EnrichedSession with milestones and token usage
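
To show what "handles partial lines across chunks" involves, here is a hand-rolled stand-in that buffers an incomplete trailing line until the next chunk arrives. It deliberately does not call Bun.JSONL.parseChunk(), whose exact signature is not given in this document; JsonlChunkParser is a hypothetical class built on the standard TextDecoder.

```typescript
class JsonlChunkParser {
  // Streaming mode keeps multi-byte characters split across chunks intact.
  private decoder = new TextDecoder("utf-8");
  // Text after the last newline seen so far: an incomplete line.
  private pending = "";

  // Feeds one chunk and returns every value whose line is now complete.
  push(chunk: Uint8Array): unknown[] {
    this.pending += this.decoder.decode(chunk, { stream: true });
    const lines = this.pending.split("\n");
    // The last element is either "" or a partial line; keep it for later.
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  }

  // Call once at end of stream to parse a final line without a newline.
  flush(): unknown[] {
    const rest = (this.pending + this.decoder.decode()).trim();
    this.pending = "";
    return rest === "" ? [] : [JSON.parse(rest)];
  }
}
```

In the streaming loop above, each chunk from proc.stdout would go through push(), with flush() called after the loop ends.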