The Evaluation Hierarchy
Exhaust deterministic checks before reaching for an LLM judge. Many dimensions that seem subjective reduce to code-based checks when you understand the domain. LLM judges are expensive, non-deterministic, and biased—use them only when a code-based check cannot answer the question with >90% accuracy.
verifyCommand exit code · git diff --stat line/file count · first_test_run milestone
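A minimal sketch of that pre-check gate, assuming Node's child_process and hypothetical helper names; thresholds and the escalation rule are illustrative, not the project's actual API.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: run the deterministic checks listed above and decide
// whether an LLM judge is needed at all.
function verifyExitCode(verifyCommand: string, cwd: string): number {
  try {
    execSync(verifyCommand, { cwd, stdio: "ignore" });
    return 0;
  } catch (err: any) {
    return err.status ?? 1; // execSync throws on non-zero exit; .status carries the code
  }
}

function changedLineCount(cwd: string): number {
  // "git diff --stat" summary line: " 3 files changed, 42 insertions(+), 7 deletions(-)"
  const stat = execSync("git diff --stat", { cwd }).toString();
  const ins = Number(/(\d+) insertions?\(\+\)/.exec(stat)?.[1] ?? 0);
  const del = Number(/(\d+) deletions?\(-\)/.exec(stat)?.[1] ?? 0);
  return ins + del;
}

// Escalate to an LLM judge only when the cheap signals cannot answer the question:
// the task verified, the diff is non-trivial, and the trace shows a real test run.
function needsLlmJudge(exitCode: number, lines: number, sawFirstTestRun: boolean): boolean {
  return exitCode === 0 && lines > 0 && sawFirstTestRun;
}
```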
The Five Judges
One binary Pass/Fail judge per quality dimension, not one holistic multi-dimensional judge: this was the universal consensus from 11 research agents and the eval-audit diagnostic. Binary verdicts force a clear decision boundary; Likert scales cause rater drift and cannot be calibrated.
Typical flags: any types, @ts-ignore, console.log left in.
Chain-of-Thought Protocol
Every judge prompt forces a mandatory scratchpad with 5 numbered steps before the JSON verdict. When the model writes the verdict before reasoning, subsequent justification is post-hoc rationalization. 15–25% reliability improvement.
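A sketch of what that scaffold can look like. The wording of the five steps is an assumption; only the ordering constraint (scratchpad before the JSON verdict) and the verdict fields come from the design above.

```typescript
// Assumed wording; the real judge prompts may differ. The key property is that the
// five numbered scratchpad steps must be written BEFORE the JSON verdict.
const JUDGE_PROMPT_SCAFFOLD = `
You are judging a single quality dimension. Fill in the scratchpad before giving any verdict.

<scratchpad>
1. Restate what this dimension is checking, in one sentence.
2. List the changed lines (+ / -) from the diff that are relevant to this dimension.
3. For each relevant change, note whether it supports Pass or Fail, with a file:line citation.
4. State the strongest argument for the opposite verdict.
5. State which evidence is decisive and why.
</scratchpad>

Only after completing all 5 steps, output the verdict as JSON:
{"result": "Pass" | "Fail", "confidence": 0.0-1.0, "critique": "...", "evidence": ["file:line ..."]}
`;
```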
Anti-Hallucination Grounding
The single highest-impact prompt addition. Without grounding rules, judges invent “potential race conditions” and “possible memory leaks” that exist nowhere in the diff.
Cite or Discard
Rule 1
Every claim must reference a specific file:line from the diff. No citation = no claim.
Diff Boundary
Rule 2
Evaluate ONLY changed lines (+ and −). Context lines are not being evaluated.
No Speculation
Rule 3
Do not invent performance issues, security vulnerabilities, or future maintenance problems not evident in the diff.
Trace Is Truth
Rule 4
If tests passed (exit code 0 in trace), do not claim the code is broken.
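One way to back the prompt rules with code is a post-filter over the judge's evidence list. The citation regex, the changed-file check, and the decision to zero out an ungrounded Fail are all assumptions shown for illustration, not the project's implementation.

```typescript
// Sketch of a Rule 1 / Rule 2 post-filter: every claim must cite a file:line inside the diff.
const CITATION = /[\w./-]+:\d+/; // e.g. "src/cache.ts:42"

interface JudgeVerdict {
  result: "Pass" | "Fail";
  confidence: number;
  critique: string;
  evidence: string[];
}

function enforceCiteOrDiscard(verdict: JudgeVerdict, changedFiles: Set<string>): JudgeVerdict {
  const grounded = verdict.evidence.filter((claim) => {
    const cite = CITATION.exec(claim);
    if (!cite) return false;                 // Rule 1: no citation = no claim
    const file = cite[0].split(":")[0];
    return changedFiles.has(file);           // approximates Rule 2: citation must point inside the diff
  });
  // A Fail verdict whose every claim was discarded has nothing grounding it;
  // zeroing confidence lets downstream consensus ignore it (a policy choice, not the project's).
  const confidence = verdict.result === "Fail" && grounded.length === 0 ? 0 : verdict.confidence;
  return { ...verdict, evidence: grounded, confidence };
}
```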
Position-Swap Protocol
Mandatory for all pairwise comparisons. Position bias flip rates reach 66% on some models (GPT-5.4, Mazur benchmark 2026). Median flip rate across models: 45%.
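A sketch of the swap, assuming a hypothetical single-pass judgePair call; disagreement between the two orderings is recorded as detected bias and scored as a tie.

```typescript
// Position-swap protocol: judge (A, B), then judge (B, A) with the same prompt,
// and only accept a winner when both orderings agree.
type Winner = "A" | "B" | "tie";

async function judgeWithPositionSwap(
  judgePair: (first: string, second: string) => Promise<Winner>, // hypothetical pairwise judge call
  candidateA: string,
  candidateB: string,
): Promise<{ winner: Winner; biasDetected: boolean }> {
  const pass1 = await judgePair(candidateA, candidateB);    // A shown first
  const pass2raw = await judgePair(candidateB, candidateA); // B shown first
  // Map the second pass back into A/B terms before comparing.
  const pass2: Winner = pass2raw === "A" ? "B" : pass2raw === "B" ? "A" : "tie";
  if (pass1 === pass2) return { winner: pass1, biasDetected: false };
  // Verdict flipped with position: record the bias and fall back to a tie.
  return { winner: "tie", biasDetected: true };
}
```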
PoLL Panel: Multi-Model Consensus
A panel of smaller, disjoint-family models outperforms a single GPT-4 judge across 3 settings and 6 datasets. 7× cheaper. The panel is auto-selected based on which agents are being evaluated.
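A sketch of panel aggregation by simple majority vote; the panel composition and the voting rule are assumptions.

```typescript
// PoLL-style panel: each panelist is a smaller judge from a different model family,
// and the final verdict is a simple majority vote.
interface PanelVerdict { result: "Pass" | "Fail"; confidence: number }

async function pollVerdict(
  panel: Array<(prompt: string) => Promise<PanelVerdict>>, // disjoint-family judges
  prompt: string,
): Promise<PanelVerdict> {
  const votes = await Promise.all(panel.map((judge) => judge(prompt)));
  const passes = votes.filter((v) => v.result === "Pass").length;
  const result = passes * 2 > votes.length ? "Pass" : "Fail";
  // Confidence = share of the panel agreeing with the majority.
  const agreeing = result === "Pass" ? passes : votes.length - passes;
  return { result, confidence: agreeing / votes.length };
}
```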
cacheSystem: true on all judge calls. The rubric is identical across evaluations, so input cost drops to 10% after the first call via Anthropic prompt caching. temperature: 0.1 for scoring consistency.
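A sketch of a single judge call, assuming cacheSystem maps onto Anthropic's cache_control prompt caching on the system block; the model name and token limit are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Assumed wrapper: the rubric goes in the system block so the identical prefix is
// billed at the cached-input rate after the first call.
async function callJudge(rubric: string, evidence: string, cacheSystem = true) {
  return client.messages.create({
    model: "claude-sonnet-4-5",   // placeholder judge model
    max_tokens: 1024,
    temperature: 0.1,             // low temperature for scoring consistency
    system: [
      {
        type: "text",
        text: rubric,             // identical across evaluations → cache read after first call
        ...(cacheSystem ? { cache_control: { type: "ephemeral" as const } } : {}),
      },
    ],
    messages: [{ role: "user", content: evidence }],
  });
}
```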
Heuristic–LLM Blending
The existing scoreFromSession() heuristic scores and the LLM judges' binary verdicts are blended per dimension. When the heuristic and the LLM disagree by more than 0.5, trust the LLM: it saw the actual code.
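A sketch of the per-dimension blend, assuming equal weights; only the >0.5 override rule comes from the text above.

```typescript
// Heuristic score is on [0, 1]; the judge verdict maps Pass→1 / Fail→0.
function blendDimension(heuristic: number, judge: "Pass" | "Fail"): number {
  const judgeScore = judge === "Pass" ? 1 : 0;
  if (Math.abs(heuristic - judgeScore) > 0.5) {
    return judgeScore;                         // large disagreement: the judge saw the actual code
  }
  return 0.5 * heuristic + 0.5 * judgeScore;   // otherwise blend evenly (weights are an assumption)
}
```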
Bias Mitigation Stack
Five layered defenses against the known failure modes of LLM-as-a-judge systems. Each layer catches what the previous one missed.
Error Analysis Pipeline
Nine error categories across three tiers. Heuristic detection first, LLM attribution for ambiguous cases, aggregation for per-agent error profiles.
Tier 1 · Capability
Comprehension Fail
Agent misidentified the bug or solved a different problem entirely.
Tier 1 · Capability
Planning Fail
Right understanding, wrong approach. More than 3 backtracks detected.
Tier 1 · Capability
Implementation Fail
Right approach, wrong code. Verification fails after first edit attempt.
Tier 1 · Capability
Verification Skip
Agent never ran tests. Committed changes without any verification step.
Tier 2 · Resource
Token Exhaustion
Session ends with high token count, no verification step reached.
Tier 2 · Resource
Timeout Expiration
Wall clock ≥ timeout limit, kill signal sent to agent process.
Tier 2 · Resource
Error Loop
Same tool call pattern repeated >3 times with failures.
Tier 3 · Environment
Environment Fail
First shell command fails (dependency missing). Not the agent’s fault.
Tier 3 · Anti-Gaming
Gaming Artifact
Hardcoded answers, test mutations, verify-then-patch pattern detected.
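A sketch of the category union and tier mapping implied by the list above; the literal identifiers are assumptions, not the project's actual ErrorCategory values.

```typescript
type ErrorCategory =
  | "comprehension_fail" | "planning_fail" | "implementation_fail" | "verification_skip" // Tier 1 · Capability
  | "token_exhaustion" | "timeout_expiration" | "error_loop"                             // Tier 2 · Resource
  | "environment_fail"                                                                   // Tier 3 · Environment
  | "gaming_artifact";                                                                   // Tier 3 · Anti-Gaming

// Nine categories across three tiers.
const CATEGORY_TIER: Record<ErrorCategory, 1 | 2 | 3> = {
  comprehension_fail: 1, planning_fail: 1, implementation_fail: 1, verification_skip: 1,
  token_exhaustion: 2, timeout_expiration: 2, error_loop: 2,
  environment_fail: 3, gaming_artifact: 3,
};
```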
Anti-Gaming Detection
Three mechanisms producing AntiGamingSignal. When gaming is detected, taskResolved is overridden to false in the comparison engine.
Structural Detection
verify_then_patch: verification passes before meaningful file edits. test_mutation: edits targeting test files with reduced assertions. noop_edit: diff is whitespace/comment only.
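A sketch of the three structural detectors over a simplified session timeline; event shapes, test-path patterns, and thresholds are assumptions.

```typescript
interface SessionEvent { kind: "verify_pass" | "file_edit"; path?: string; timestamp: number }

// verify_then_patch: verification already green before any meaningful edit.
function detectVerifyThenPatch(events: SessionEvent[]): boolean {
  const firstVerifyPass = events.find((e) => e.kind === "verify_pass")?.timestamp;
  const firstEdit = events.find((e) => e.kind === "file_edit")?.timestamp;
  return firstVerifyPass !== undefined && firstEdit !== undefined && firstVerifyPass < firstEdit;
}

// test_mutation: test files edited AND the assertion count went down.
function detectTestMutation(editedPaths: string[], assertionDelta: number): boolean {
  const touchedTests = editedPaths.some(
    (p) => /(^|\/)(tests?|__tests__|specs?)\//.test(p) || /\.(test|spec)\./.test(p),
  );
  return touchedTests && assertionDelta < 0;
}

// noop_edit: every changed line is whitespace or a comment.
function detectNoopEdit(diff: string): boolean {
  const changed = diff.split("\n").filter((l) => /^[+-]/.test(l) && !/^[+-]{3}/.test(l));
  return changed.every((l) => l.slice(1).trim() === "" || /^\s*(\/\/|#|\/\*|\*)/.test(l.slice(1)));
}
```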
Semantic Detection
LLM judge receives thinking content + diff: “Does the reasoning show genuine problem-solving or pattern-matching against expected test outputs?”
Cross-Run Correlation
Identical diffs across multiple --runs on the same task = deterministic gaming (memorized answers). Signal: Levenshtein distance < 5% of diff length.
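A sketch of the correlation check; the 5% threshold comes from the text above, while normalizing by the longer diff is an assumption.

```typescript
// Standard two-row Levenshtein DP; O(n·m), acceptable for a sketch.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Flag deterministic gaming when two runs produce near-identical diffs.
function isMemorizedAnswer(diffA: string, diffB: string): boolean {
  const threshold = 0.05 * Math.max(diffA.length, diffB.length);
  return levenshtein(diffA, diffB) < threshold;
}
```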
Judge Type System
Core
JudgeVerdict
result: Pass | Fail, confidence: number, critique: string, evidence: string[]. The atomic unit of judge output.
Pairwise
PairwiseVerdict
winner: EvalAgentId | "tie", perDimension: Record<dimension, winner>, biasDetected: boolean.
Error
ErrorAttribution
category: ErrorCategory, tier: 1|2|3, confidence: number, evidence: string. From heuristic or LLM analysis.
Gaming
AntiGamingSignal
type: verify_then_patch | test_mutation | noop_edit | identical_diff, severity: number, override: boolean.
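The four types reconstructed as TypeScript interfaces from the field lists above; EvalAgentId, ErrorCategory, and the dimension key type are assumed shapes, not the project's actual definitions.

```typescript
type EvalAgentId = string;   // assumption: opaque agent identifier
type ErrorCategory = string; // placeholder; see the nine-value union sketched under Error Analysis Pipeline

interface JudgeVerdict {
  result: "Pass" | "Fail";
  confidence: number;
  critique: string;
  evidence: string[];        // file:line-grounded claims (Rule 1)
}

interface PairwiseVerdict {
  winner: EvalAgentId | "tie";
  perDimension: Record<string, EvalAgentId | "tie">; // one winner per quality dimension
  biasDetected: boolean;                              // set by the position-swap protocol
}

interface ErrorAttribution {
  category: ErrorCategory;
  tier: 1 | 2 | 3;
  confidence: number;
  evidence: string;          // from heuristic or LLM analysis
}

interface AntiGamingSignal {
  type: "verify_then_patch" | "test_mutation" | "noop_edit" | "identical_diff";
  severity: number;
  override: boolean;         // when true, taskResolved is forced to false
}
```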