The Evaluation Hierarchy
Exhaust deterministic checks before reaching for an LLM judge. Many dimensions that seem subjective reduce to code-based checks when you understand the domain. LLM judges are expensive, non-deterministic, and biased—use them only when a code-based check cannot answer the question with >90% accuracy.
verifyCommand exit code · git diff --stat line/file count · first_test_run milestone
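A minimal sketch of that pre-check gate, assuming Node's child_process and hypothetical helper names; thresholds and the escalation rule are illustrative, not the project's actual API.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: run the deterministic checks listed above and decide
// whether an LLM judge is needed at all.
function verifyExitCode(verifyCommand: string, cwd: string): number {
  try {
    execSync(verifyCommand, { cwd, stdio: "ignore" });
    return 0;
  } catch (err: any) {
    return err.status ?? 1; // execSync throws on non-zero exit; .status carries the code
  }
}

function changedLineCount(cwd: string): number {
  // "git diff --stat" summary line: " 3 files changed, 42 insertions(+), 7 deletions(-)"
  const stat = execSync("git diff --stat", { cwd }).toString();
  const ins = Number(/(\d+) insertions?\(\+\)/.exec(stat)?.[1] ?? 0);
  const del = Number(/(\d+) deletions?\(-\)/.exec(stat)?.[1] ?? 0);
  return ins + del;
}

// Escalate to an LLM judge only when the cheap signals cannot answer the question:
// the task verified, the diff is non-trivial, and the trace shows a real test run.
function needsLlmJudge(exitCode: number, lines: number, sawFirstTestRun: boolean): boolean {
  return exitCode === 0 && lines > 0 && sawFirstTestRun;
}
```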
The Five Judges
One binary Pass/Fail judge per quality dimension, not one holistic multi-dimensional judge: this was the universal consensus from 11 research agents and the eval-audit diagnostic. Binary verdicts force a clear decision boundary; Likert scales cause rater drift and cannot be calibrated.
Typical flags: any types, @ts-ignore, console.log left in.
Chain-of-Thought Protocol
Every judge prompt forces a mandatory scratchpad with 5 numbered steps before the JSON verdict. When the model writes the verdict before reasoning, subsequent justification is post-hoc rationalization. 15–25% reliability improvement.
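A sketch of what that scaffold can look like. The wording of the five steps is an assumption; only the ordering constraint (scratchpad before the JSON verdict) and the verdict fields come from the design above.

```typescript
// Assumed wording; the real judge prompts may differ. The key property is that the
// five numbered scratchpad steps must be written BEFORE the JSON verdict.
const JUDGE_PROMPT_SCAFFOLD = `
You are judging a single quality dimension. Fill in the scratchpad before giving any verdict.

<scratchpad>
1. Restate what this dimension is checking, in one sentence.
2. List the changed lines (+ / -) from the diff that are relevant to this dimension.
3. For each relevant change, note whether it supports Pass or Fail, with a file:line citation.
4. State the strongest argument for the opposite verdict.
5. State which evidence is decisive and why.
</scratchpad>

Only after completing all 5 steps, output the verdict as JSON:
{"result": "Pass" | "Fail", "confidence": 0.0-1.0, "critique": "...", "evidence": ["file:line ..."]}
`;
```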
Anti-Hallucination Grounding
The single highest-impact prompt addition. Without grounding rules, judges invent “potential race conditions” and “possible memory leaks” that exist nowhere in the diff.
Cite or Discard
Rule 1
Every claim must reference a specific file:line from the diff. No citation = no claim.
Diff Boundary
Rule 2
Evaluate ONLY changed lines (+ and −). Context lines are not being evaluated.
No Speculation
Rule 3
Do not invent performance issues, security vulnerabilities, or future maintenance problems not evident in the diff.
Trace Is Truth
Rule 4
If tests passed (exit code 0 in trace), do not claim the code is broken.
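One way to back the prompt rules with code is a post-filter over the judge's evidence list. The citation regex, the changed-file check, and the decision to zero out an ungrounded Fail are all assumptions shown for illustration, not the project's implementation.

```typescript
// Sketch of a Rule 1 / Rule 2 post-filter: every claim must cite a file:line inside the diff.
const CITATION = /[\w./-]+:\d+/; // e.g. "src/cache.ts:42"

interface JudgeVerdict {
  result: "Pass" | "Fail";
  confidence: number;
  critique: string;
  evidence: string[];
}

function enforceCiteOrDiscard(verdict: JudgeVerdict, changedFiles: Set<string>): JudgeVerdict {
  const grounded = verdict.evidence.filter((claim) => {
    const cite = CITATION.exec(claim);
    if (!cite) return false;                 // Rule 1: no citation = no claim
    const file = cite[0].split(":")[0];
    return changedFiles.has(file);           // approximates Rule 2: citation must point inside the diff
  });
  // A Fail verdict whose every claim was discarded has nothing grounding it;
  // zeroing confidence lets downstream consensus ignore it (a policy choice, not the project's).
  const confidence = verdict.result === "Fail" && grounded.length === 0 ? 0 : verdict.confidence;
  return { ...verdict, evidence: grounded, confidence };
}
```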
Position-Swap Protocol
Mandatory for all pairwise comparisons. Position bias flip rates reach 66% on some models (GPT-5.4, Mazur benchmark 2026). Median flip rate across models: 45%.
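A sketch of the swap, assuming a hypothetical single-pass judgePair call; disagreement between the two orderings is recorded as detected bias and scored as a tie.

```typescript
// Position-swap protocol: judge (A, B), then judge (B, A) with the same prompt,
// and only accept a winner when both orderings agree.
type Winner = "A" | "B" | "tie";

async function judgeWithPositionSwap(
  judgePair: (first: string, second: string) => Promise<Winner>, // hypothetical pairwise judge call
  candidateA: string,
  candidateB: string,
): Promise<{ winner: Winner; biasDetected: boolean }> {
  const pass1 = await judgePair(candidateA, candidateB);    // A shown first
  const pass2raw = await judgePair(candidateB, candidateA); // B shown first
  // Map the second pass back into A/B terms before comparing.
  const pass2: Winner = pass2raw === "A" ? "B" : pass2raw === "B" ? "A" : "tie";
  if (pass1 === pass2) return { winner: pass1, biasDetected: false };
  // Verdict flipped with position: record the bias and fall back to a tie.
  return { winner: "tie", biasDetected: true };
}
```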
PoLL Panel: Multi-Model Consensus
A panel of smaller, disjoint-family models outperforms a single GPT-4 judge across 3 settings and 6 datasets. 7× cheaper. The panel is auto-selected based on which agents are being evaluated.
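A sketch of panel aggregation by simple majority vote; the panel composition and the voting rule are assumptions.

```typescript
// PoLL-style panel: each panelist is a smaller judge from a different model family,
// and the final verdict is a simple majority vote.
interface PanelVerdict { result: "Pass" | "Fail"; confidence: number }

async function pollVerdict(
  panel: Array<(prompt: string) => Promise<PanelVerdict>>, // disjoint-family judges
  prompt: string,
): Promise<PanelVerdict> {
  const votes = await Promise.all(panel.map((judge) => judge(prompt)));
  const passes = votes.filter((v) => v.result === "Pass").length;
  const result = passes * 2 > votes.length ? "Pass" : "Fail";
  // Confidence = share of the panel agreeing with the majority.
  const agreeing = result === "Pass" ? passes : votes.length - passes;
  return { result, confidence: agreeing / votes.length };
}
```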
cacheSystem: true on all judge calls. The rubric is identical across evaluations, so input cost drops to 10% after the first call via Anthropic prompt caching. temperature: 0.1 for scoring consistency.
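A sketch of a single judge call, assuming cacheSystem maps onto Anthropic's cache_control prompt caching on the system block; the model name and token limit are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Assumed wrapper: the rubric goes in the system block so the identical prefix is
// billed at the cached-input rate after the first call.
async function callJudge(rubric: string, evidence: string, cacheSystem = true) {
  return client.messages.create({
    model: "claude-sonnet-4-5",   // placeholder judge model
    max_tokens: 1024,
    temperature: 0.1,             // low temperature for scoring consistency
    system: [
      {
        type: "text",
        text: rubric,             // identical across evaluations → cache read after first call
        ...(cacheSystem ? { cache_control: { type: "ephemeral" as const } } : {}),
      },
    ],
    messages: [{ role: "user", content: evidence }],
  });
}
```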
Heuristic–LLM Blending
The existing scoreFromSession() heuristic scores and the LLM judges' binary verdicts are blended per dimension. When the heuristic and the LLM disagree by more than 0.5, trust the LLM: it saw the actual code.
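A sketch of the per-dimension blend, assuming equal weights; only the >0.5 override rule comes from the text above.

```typescript
// Heuristic score is on [0, 1]; the judge verdict maps Pass→1 / Fail→0.
function blendDimension(heuristic: number, judge: "Pass" | "Fail"): number {
  const judgeScore = judge === "Pass" ? 1 : 0;
  if (Math.abs(heuristic - judgeScore) > 0.5) {
    return judgeScore;                         // large disagreement: the judge saw the actual code
  }
  return 0.5 * heuristic + 0.5 * judgeScore;   // otherwise blend evenly (weights are an assumption)
}
```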
Bias Mitigation Stack
Five layered defenses against the known failure modes of LLM-as-a-judge systems. Each layer catches what the previous one missed.
Error Analysis Pipeline
Nine error categories across three tiers. Heuristic detection first, LLM attribution for ambiguous cases, aggregation for per-agent error profiles.
Tier 1 · Capability
Comprehension Fail
Agent misidentified the bug or solved a different problem entirely.
Tier 1 · Capability
Planning Fail
Right understanding, wrong approach. More than 3 backtracks detected.
Tier 1 · Capability
Implementation Fail
Right approach, wrong code. Verification fails after first edit attempt.
Tier 1 · Capability
Verification Skip
Agent never ran tests. Committed changes without any verification step.
Tier 2 · Resource
Token Exhaustion
Session ends with high token count, no verification step reached.
Tier 2 · Resource
Timeout Expiration
Wall clock ≥ timeout limit, kill signal sent to agent process.
Tier 2 · Resource
Error Loop
Same tool call pattern repeated >3 times with failures.
Tier 3 · Environment
Environment Fail
First shell command fails (dependency missing). Not the agent’s fault.
Tier 3 · Anti-Gaming
Gaming Artifact
Hardcoded answers, test mutations, verify-then-patch pattern detected.
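A sketch of the category union and tier mapping implied by the list above; the literal identifiers are assumptions, not the project's actual ErrorCategory values.

```typescript
type ErrorCategory =
  | "comprehension_fail" | "planning_fail" | "implementation_fail" | "verification_skip" // Tier 1 · Capability
  | "token_exhaustion" | "timeout_expiration" | "error_loop"                             // Tier 2 · Resource
  | "environment_fail"                                                                   // Tier 3 · Environment
  | "gaming_artifact";                                                                   // Tier 3 · Anti-Gaming

// Nine categories across three tiers.
const CATEGORY_TIER: Record<ErrorCategory, 1 | 2 | 3> = {
  comprehension_fail: 1, planning_fail: 1, implementation_fail: 1, verification_skip: 1,
  token_exhaustion: 2, timeout_expiration: 2, error_loop: 2,
  environment_fail: 3, gaming_artifact: 3,
};
```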
Anti-Gaming Detection
Three mechanisms producing AntiGamingSignal. When gaming is detected, taskResolved is overridden to false in the comparison engine.
Structural Detection
verify_then_patch: verification passes before meaningful file edits. test_mutation: edits targeting test files with reduced assertions. noop_edit: diff is whitespace/comment only.
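A sketch of the three structural detectors over a simplified session timeline; event shapes, test-path patterns, and thresholds are assumptions.

```typescript
interface SessionEvent { kind: "verify_pass" | "file_edit"; path?: string; timestamp: number }

// verify_then_patch: verification already green before any meaningful edit.
function detectVerifyThenPatch(events: SessionEvent[]): boolean {
  const firstVerifyPass = events.find((e) => e.kind === "verify_pass")?.timestamp;
  const firstEdit = events.find((e) => e.kind === "file_edit")?.timestamp;
  return firstVerifyPass !== undefined && firstEdit !== undefined && firstVerifyPass < firstEdit;
}

// test_mutation: test files edited AND the assertion count went down.
function detectTestMutation(editedPaths: string[], assertionDelta: number): boolean {
  const touchedTests = editedPaths.some(
    (p) => /(^|\/)(tests?|__tests__|specs?)\//.test(p) || /\.(test|spec)\./.test(p),
  );
  return touchedTests && assertionDelta < 0;
}

// noop_edit: every changed line is whitespace or a comment.
function detectNoopEdit(diff: string): boolean {
  const changed = diff.split("\n").filter((l) => /^[+-]/.test(l) && !/^[+-]{3}/.test(l));
  return changed.every((l) => l.slice(1).trim() === "" || /^\s*(\/\/|#|\/\*|\*)/.test(l.slice(1)));
}
```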
Semantic Detection
LLM judge receives thinking content + diff: “Does the reasoning show genuine problem-solving or pattern-matching against expected test outputs?”
Cross-Run Correlation
Identical diffs across multiple --runs on the same task = deterministic gaming (memorized answers). Signal: Levenshtein distance < 5% of diff length.
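A sketch of the correlation check; the 5% threshold comes from the text above, while normalizing by the longer diff is an assumption.

```typescript
// Standard two-row Levenshtein DP; O(n·m), acceptable for a sketch.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Flag deterministic gaming when two runs produce near-identical diffs.
function isMemorizedAnswer(diffA: string, diffB: string): boolean {
  const threshold = 0.05 * Math.max(diffA.length, diffB.length);
  return levenshtein(diffA, diffB) < threshold;
}
```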
Judge Type System
Core
JudgeVerdict
result: Pass | Fail, confidence: number, critique: string, evidence: string[]. The atomic unit of judge output.
Pairwise
PairwiseVerdict
winner: EvalAgentId | "tie", perDimension: Record<dimension, winner>, biasDetected: boolean.
Error
ErrorAttribution
category: ErrorCategory, tier: 1|2|3, confidence: number, evidence: string. From heuristic or LLM analysis.
Gaming
AntiGamingSignal
type: verify_then_patch | test_mutation | noop_edit | identical_diff, severity: number, override: boolean.
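The four types reconstructed as TypeScript interfaces from the field lists above; EvalAgentId, ErrorCategory, and the dimension key type are assumed shapes, not the project's actual definitions.

```typescript
type EvalAgentId = string;   // assumption: opaque agent identifier
type ErrorCategory = string; // placeholder; see the nine-value union sketched under Error Analysis Pipeline

interface JudgeVerdict {
  result: "Pass" | "Fail";
  confidence: number;
  critique: string;
  evidence: string[];        // file:line-grounded claims (Rule 1)
}

interface PairwiseVerdict {
  winner: EvalAgentId | "tie";
  perDimension: Record<string, EvalAgentId | "tie">; // one winner per quality dimension
  biasDetected: boolean;                              // set by the position-swap protocol
}

interface ErrorAttribution {
  category: ErrorCategory;
  tier: 1 | 2 | 3;
  confidence: number;
  evidence: string;          // from heuristic or LLM analysis
}

interface AntiGamingSignal {
  type: "verify_then_patch" | "test_mutation" | "noop_edit" | "identical_diff";
  severity: number;
  override: boolean;         // when true, taskResolved is forced to false
}
```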