Phase 5 · LLM-as-a-Judge

The Judge’s Chamber

Five binary judges, one per quality dimension. Code-based checks first, LLM judges for what heuristics cannot see. Position-swap protocol, multi-model PoLL panels, and anti-hallucination grounding—because scores without rigor are theatre.

The Evaluation Hierarchy

Exhaust deterministic checks before reaching for an LLM judge. Many dimensions that seem subjective reduce to code-based checks when you understand the domain. LLM judges are expensive, non-deterministic, and biased—use them only when a code-based check cannot answer the question with >90% accuracy.

verifyCommand → linter exit code → git diff --stat → milestone presence → heuristic scoring → LLM judge panel → blended score
(ordered from free & deterministic to expensive & non-deterministic)
| Dimension    | Code-Based Check (Free)              | LLM Judge (Expensive)                    |
|--------------|--------------------------------------|------------------------------------------|
| Correctness  | verifyCommand exit code              | “Does the diff address the root cause?”  |
| Code Quality | Linter exit code + violation count   | Subjective readability & idiomaticity    |
| Minimal Diff | git diff --stat line/file count      | “Are all changed lines necessary?”       |
| Verification | first_test_run milestone             | “Did recovery actually fix the issue?”   |
| Completeness | Test file delta & new test detection | “Are edge cases handled?”                |
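
A minimal dispatch sketch of this hierarchy, assuming hypothetical helper signatures (scoreDimension and the two callbacks are illustrative names, not the project’s actual API): each dimension tries its free code-based check first and escalates to the judge panel only when the check cannot answer.

```ts
type Source = "code" | "llm";
interface DimensionScore { dimension: string; score: number; source: Source; }

async function scoreDimension(
  dimension: string,
  codeCheck: () => Promise<number | null>, // returns null when the heuristic cannot decide
  llmJudge: () => Promise<number>,         // expensive, non-deterministic fallback
): Promise<DimensionScore> {
  // 1. Free & deterministic: exit codes, diff stats, milestones.
  const fromCode = await codeCheck();
  if (fromCode !== null) return { dimension, score: fromCode, source: "code" };
  // 2. Only now pay for the LLM judge panel.
  return { dimension, score: await llmJudge(), source: "llm" };
}
```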

The Five Judges

One binary Pass/Fail judge per quality dimension, not one holistic multi-dimensional judge. This design was the universal consensus across all 11 research agents and the eval-audit diagnostic: binary verdicts force a clear decision boundary, while Likert scales cause rater drift and cannot be calibrated.

Correctness · weight 0.35 · Root Cause Resolution
Pass: the diff fixes the root cause and is logically sound; a passing verify is strong evidence.
Fail: does not fix the stated issue; symptom masking, an overly narrow fix, or hardcoded output.
Input: task prompt + git diff + verify output

Verification · weight 0.20 · Test-After-Change
Pass: the agent ran tests/build after making changes; error recovery counts.
Fail: no verification step, tests run only before changes, or gave up after a failure.
Input: task prompt + full agent transcript

Completeness · weight 0.20 · Full Requirement Coverage
Pass: all prompt requirements handled, edge cases considered, tests added.
Fail: only the exact reported case fixed, or a multi-part task left partially done.
Input: task prompt + diff + test changes

Code Quality · weight 0.15 · Craft & Idiom
Pass: descriptive names, correct idioms, proper error handling, no debug artifacts.
Fail: cryptic names, any types, @ts-ignore, console.log left in.
Input: task prompt + diff only

Minimal Diff · weight 0.10 · Surgical Precision
Pass: only necessary changes; minor auto-formatting on modified files is acceptable.
Fail: unrelated refactors, unrequested features, reformatting of untouched files.
Input: task prompt + diff only
Binary Pass/Fail over Likert scales is the universal consensus from Hamel Husain, Autorubric, and every eval skill consulted.
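
The five judges as a data-table sketch (the dimension keys and field names are illustrative; the weights and inputs are exactly those listed above, and the weights sum to 1.0):

```ts
type JudgeDimension =
  | "correctness" | "verification" | "completeness" | "code_quality" | "minimal_diff";

interface JudgeSpec {
  dimension: JudgeDimension;
  weight: number; // contribution to the blended score
  input: Array<"task_prompt" | "diff" | "verify_output" | "transcript" | "test_changes">;
}

const JUDGES: JudgeSpec[] = [
  { dimension: "correctness",  weight: 0.35, input: ["task_prompt", "diff", "verify_output"] },
  { dimension: "verification", weight: 0.20, input: ["task_prompt", "transcript"] },
  { dimension: "completeness", weight: 0.20, input: ["task_prompt", "diff", "test_changes"] },
  { dimension: "code_quality", weight: 0.15, input: ["task_prompt", "diff"] },
  { dimension: "minimal_diff", weight: 0.10, input: ["task_prompt", "diff"] },
];
```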

Chain-of-Thought Protocol

Every judge prompt forces a mandatory scratchpad with five numbered steps before the JSON verdict. When a model writes the verdict before the reasoning, the justification that follows is post-hoc rationalization; forcing the reasoning first yields a 15–25% reliability improvement.

<scratchpad>: the mandatory reasoning chain
1. TASK DECOMPOSITION: break the task into discrete requirements.
2. DIFF AUDIT: for each requirement, cite specific diff lines; mark ADDRESSED or UNADDRESSED.
3. BEHAVIORAL EVIDENCE: check the tool trace for verification steps and exit codes.
4. FAILURE SCAN: list regressions, missing edge cases, syntax errors.
5. VERDICT REASONING: synthesize steps 1–4 into pass/fail.
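
One way to enforce the protocol is a fixed scaffold prepended to every judge prompt. The exact wording below is an assumption; the structure (five numbered steps inside <scratchpad>, JSON only afterwards) is the requirement.

```ts
// Illustrative scaffold; phrasing is a sketch, not the canonical prompt.
const SCRATCHPAD_SCAFFOLD = `
Fill in ALL five steps inside <scratchpad> BEFORE any verdict:
<scratchpad>
1. TASK DECOMPOSITION: break the task into discrete requirements.
2. DIFF AUDIT: cite specific diff lines per requirement; mark ADDRESSED or UNADDRESSED.
3. BEHAVIORAL EVIDENCE: check the tool trace for verification steps and exit codes.
4. FAILURE SCAN: list regressions, missing edge cases, syntax errors.
5. VERDICT REASONING: synthesize steps 1-4 into pass or fail.
</scratchpad>
Only after closing the scratchpad, emit the JSON verdict.
`;
```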

Anti-Hallucination Grounding

The single highest-impact prompt addition. Without grounding rules, judges invent “potential race conditions” and “possible memory leaks” that exist nowhere in the diff.

Rule 1 · Cite or Discard
Every claim must reference a specific file:line from the diff. No citation = no claim.

Rule 2 · Diff Boundary
Evaluate ONLY changed lines (+ and −); context lines are out of scope.

Rule 3 · No Speculation
Do not invent performance issues, security vulnerabilities, or future maintenance problems not evident in the diff.

Rule 4 · Trace Is Truth
If tests passed (exit code 0 in the trace), do not claim the code is broken.
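
The four rules travel well as a single system-prompt fragment appended to every judge. A sketch, with wording condensed from the rules above:

```ts
// Grounding rules as a prompt constant; phrasing is condensed, not canonical.
const GROUNDING_RULES = `
RULE 1 (CITE OR DISCARD): every claim must cite a specific file:line from the diff. No citation = no claim.
RULE 2 (DIFF BOUNDARY): evaluate ONLY changed lines (+ and -); context lines are out of scope.
RULE 3 (NO SPECULATION): do not invent performance, security, or maintenance issues not evident in the diff.
RULE 4 (TRACE IS TRUTH): if tests passed (exit code 0 in the trace), do not claim the code is broken.
`;
```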


Position-Swap Protocol

Mandatory for all pairwise comparisons. Position bias flip rates reach 66% on some models (GPT-5.4, Mazur benchmark 2026). Median flip rate across models: 45%.

pairwiseCompareWithPositionSwap(), the only public API, judges Agent A vs. Agent B twice and reconciles the two verdicts:

Pass 1: run the judge with (A=agent1, B=agent2) → verdict1.
Pass 2: swap positions and run with (A=agent2, B=agent1) → verdict2.
Remap: map verdict2 back to pass-1 coordinates (A→B, B→A, TIE→TIE).
Agree: both passes name the same winner → use it; confidence = average of both passes.
Disagree: the verdicts conflict → TIE, confidence = 0.5, bias_detected = true.
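
A sketch of the reconcile logic. The function name matches the public API above; the judge callback signature and agent-id types are assumptions.

```ts
type Winner = "A" | "B" | "TIE";
interface PositionalVerdict { winner: Winner; confidence: number; }

async function pairwiseCompareWithPositionSwap(
  judge: (a: string, b: string) => Promise<PositionalVerdict>,
  agent1: string,
  agent2: string,
): Promise<{ winner: "agent1" | "agent2" | "TIE"; confidence: number; biasDetected: boolean }> {
  const v1 = await judge(agent1, agent2); // pass 1: A=agent1, B=agent2
  const v2 = await judge(agent2, agent1); // pass 2: positions swapped
  // Remap the pass-2 winner back to pass-1 coordinates: A→B, B→A, TIE→TIE.
  const remapped: Winner = v2.winner === "A" ? "B" : v2.winner === "B" ? "A" : "TIE";

  if (v1.winner === remapped) {
    const winner = v1.winner === "A" ? "agent1" : v1.winner === "B" ? "agent2" : "TIE";
    return { winner, confidence: (v1.confidence + v2.confidence) / 2, biasDetected: false };
  }
  // Verdicts disagree across positions: position bias detected, refuse to pick.
  return { winner: "TIE", confidence: 0.5, biasDetected: true };
}
```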

PoLL Panel: Multi-Model Consensus

A panel of smaller, disjoint-family models outperforms a single GPT-4 judge across 3 settings and 6 datasets, at 7× lower cost. The panel composition is auto-selected based on which agents are being evaluated.

| When                      | Panel Composition           | Rationale                             |
|---------------------------|-----------------------------|---------------------------------------|
| Claude Code NOT evaluated | Claude Sonnet 4 + Kimi K2.5 | Two strong families, maximum coverage |
| Claude Code IS evaluated  | Kimi K2.5 + GPT-4.1-mini    | Zero Claude self-enhancement bias     |
Key insight: set cacheSystem: true on all judge calls. The rubric is identical across evaluations, so Anthropic prompt caching drops input cost to 10% after the first call. Use temperature: 0.1 for scoring consistency.
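
Panel selection and consensus as a sketch. The model identifier strings and the majority rule for a two-model panel are assumptions (here a split panel conservatively counts as Fail):

```ts
function selectPanel(evaluatedAgents: string[]): string[] {
  const claudeUnderTest = evaluatedAgents.some((a) => a.toLowerCase().includes("claude"));
  return claudeUnderTest
    ? ["kimi-k2.5", "gpt-4.1-mini"]     // zero Claude self-enhancement bias
    : ["claude-sonnet-4", "kimi-k2.5"]; // two strong families, maximum coverage
}

async function pollVerdict(
  panel: string[],
  callJudge: (model: string) => Promise<"Pass" | "Fail">, // temperature: 0.1, cacheSystem: true
): Promise<"Pass" | "Fail"> {
  const votes = await Promise.all(panel.map(callJudge));
  const passes = votes.filter((v) => v === "Pass").length;
  return passes > votes.length / 2 ? "Pass" : "Fail"; // ties count as Fail (assumption)
}
```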

Heuristic–LLM Blending

The existing scoreFromSession() heuristic scores and the LLM judges’ binary verdicts are blended per dimension. When the heuristic and the LLM disagree by more than 0.5, trust the LLM: it saw the actual code.

| Dimension (weight)  | Heuristic | LLM Judge |
|---------------------|-----------|-----------|
| Correctness (0.35)  | 0.7       | 0.3       |
| Verification (0.20) | 0.6       | 0.4       |
| Completeness (0.20) | 0.4       | 0.6       |
| Code Quality (0.15) | 0.2       | 0.8       |
| Minimal Diff (0.10) | 0.5       | 0.5       |
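
The blend as a sketch, using the weights from the table: the binary LLM verdict maps to 1 or 0, and the >0.5 disagreement rule short-circuits to the LLM. Function and key names are illustrative.

```ts
// Per-dimension blend weights (heuristic weight + LLM weight = 1.0).
const BLEND_WEIGHTS: Record<string, { heuristic: number; llm: number }> = {
  correctness:  { heuristic: 0.7, llm: 0.3 },
  verification: { heuristic: 0.6, llm: 0.4 },
  completeness: { heuristic: 0.4, llm: 0.6 },
  code_quality: { heuristic: 0.2, llm: 0.8 },
  minimal_diff: { heuristic: 0.5, llm: 0.5 },
};

// heuristicScore is scoreFromSession()'s 0..1 output; llmPass is the judge verdict.
function blendDimension(dimension: string, heuristicScore: number, llmPass: boolean): number {
  const llmScore = llmPass ? 1 : 0;
  // Disagreement > 0.5: the LLM saw the actual code, so defer to it entirely.
  if (Math.abs(heuristicScore - llmScore) > 0.5) return llmScore;
  const w = BLEND_WEIGHTS[dimension];
  return w.heuristic * heuristicScore + w.llm * llmScore;
}
```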

Bias Mitigation Stack

Five layered defenses against the known failure modes of LLM-as-a-judge systems. Each layer catches what the previous missed.

1. Position-Swap Protocol: run every pairwise comparison twice with swapped positions; disagreement → TIE.
2. Chain-of-Thought Before Verdict: a mandatory 5-step scratchpad forces genuine reasoning before the binary decision.
3. PoLL Multi-Model Panel: disjoint model families vote independently, so no single model’s bias dominates.
4. Anti-Hallucination Grounding: cite-or-discard, diff boundary, no speculation, trace is truth.
5. Validation Pipeline: TPR/TNR > 90% against human labels; Rogan-Gladen bias correction on production rates (sketched below).
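
Layer 5’s Rogan-Gladen correction recovers the true pass rate from the judge’s observed rate once TPR (sensitivity) and TNR (specificity) are measured against human labels on a validation set. A sketch:

```ts
// true rate = (observed + TNR - 1) / (TPR + TNR - 1), clamped to [0, 1]
function roganGladen(observedRate: number, tpr: number, tnr: number): number {
  const corrected = (observedRate + tnr - 1) / (tpr + tnr - 1);
  return Math.min(1, Math.max(0, corrected));
}

// e.g. an observed 0.80 pass rate with TPR 0.92 and TNR 0.94:
// (0.80 + 0.94 - 1) / (0.92 + 0.94 - 1) = 0.74 / 0.86 ≈ 0.86 true pass rate
```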

Error Analysis Pipeline

Nine error categories across three tiers. Heuristic detection first, LLM attribution for ambiguous cases, aggregation for per-agent error profiles.

Tier 1 · Capability
Comprehension Fail: agent misidentified the bug or solved a different problem entirely.
Planning Fail: right understanding, wrong approach; more than 3 backtracks detected.
Implementation Fail: right approach, wrong code; verification fails after the first edit attempt.
Verification Skip: agent never ran tests; committed changes without any verification step.

Tier 2 · Resource
Token Exhaustion: session ends with a high token count and no verification step reached.
Timeout Expiration: wall clock ≥ timeout limit; kill signal sent to the agent process.
Error Loop: same tool-call pattern repeated >3 times with failures.

Tier 3 · Environment / Anti-Gaming
Environment Fail: first shell command fails (dependency missing); not the agent’s fault.
Gaming Artifact: hardcoded answers, test mutations, or a verify-then-patch pattern detected.
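
The taxonomy as it might appear in error-analysis.ts; the snake_case identifiers are assumptions, the tiers are as listed above.

```ts
export type ErrorCategory =
  // Tier 1 · Capability
  | "comprehension_fail"
  | "planning_fail"
  | "implementation_fail"
  | "verification_skip"
  // Tier 2 · Resource
  | "token_exhaustion"
  | "timeout_expiration"
  | "error_loop"
  // Tier 3 · Environment / Anti-Gaming
  | "environment_fail"
  | "gaming_artifact";

export const CATEGORY_TIER: Record<ErrorCategory, 1 | 2 | 3> = {
  comprehension_fail: 1,
  planning_fail: 1,
  implementation_fail: 1,
  verification_skip: 1,
  token_exhaustion: 2,
  timeout_expiration: 2,
  error_loop: 2,
  environment_fail: 3,
  gaming_artifact: 3,
};
```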


Anti-Gaming Detection

Three mechanisms producing AntiGamingSignal. When gaming is detected, taskResolved is overridden to false in the comparison engine.

Structural Detection (heuristic: fast, deterministic)
verify_then_patch: verification passes before any meaningful file edits. test_mutation: edits target test files with reduced assertions. noop_edit: the diff is whitespace/comment-only.

Semantic Detection (LLM judge: expensive, nuanced)
The judge receives thinking content + diff and answers: “Does the reasoning show genuine problem-solving or pattern-matching against expected test outputs?”

Cross-Run Correlation (statistical: requires --runs ≥ 2)
Identical diffs across multiple --runs on the same task indicate deterministic gaming (memorized answers). Signal: Levenshtein distance < 5% of diff length (see the sketch below).
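
A sketch of the cross-run check: a standard single-row Levenshtein distance plus the 5%-of-diff-length threshold from above. isIdenticalDiff is an illustrative name.

```ts
function levenshtein(a: string, b: string): number {
  // Single-row dynamic programming; prev[j] holds the previous row's values.
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // D[i-1][j-1]
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j]; // D[i-1][j]
      prev[j] = Math.min(
        prev[j] + 1,                              // deletion
        prev[j - 1] + 1,                          // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

// Two runs whose diffs differ by < 5% of the longer diff's length are
// flagged as identical (memorized answers).
function isIdenticalDiff(diff1: string, diff2: string): boolean {
  const threshold = 0.05 * Math.max(diff1.length, diff2.length);
  return levenshtein(diff1, diff2) < threshold;
}
```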

Judge Type System

JudgeVerdict · core · judge-schemas.ts
result: Pass | Fail, confidence: number, critique: string, evidence: string[]. The atomic unit of judge output.

PairwiseVerdict · pairwise · judge-schemas.ts
winner: EvalAgentId | “tie”, perDimension: Record<dimension, winner>, biasDetected: boolean.

ErrorAttribution · error · error-analysis.ts
category: ErrorCategory, tier: 1|2|3, confidence: number, evidence: string. Produced by heuristic or LLM analysis.

AntiGamingSignal · gaming · error-analysis.ts
type: verify_then_patch | test_mutation | noop_edit | identical_diff, severity: number, override: boolean.