Validation · Measuring the Measurer

The Assayer’s Bench

Judges are not trustworthy until measured against human labels. TPR and TNR per dimension, Rogan-Gladen bias correction, bootstrap confidence intervals, and self-enhancement bias detection—the meta-evaluation that makes the evaluation credible.

The Critical Gap

The eval-audit diagnostic identified the biggest missing piece: zero mention of validating judges against human labels. No confusion matrix. No TPR/TNR measurement. The PoLL panel and position-swap checks mitigate bias but do not tell you whether judges are correct. This page describes the validation pipeline that closes that gap.

Eval-audit finding: “The framework is beautifully engineered but the evaluation methodology is still theoretical.” No real traces exist yet. Heuristic scoring proxies are coarse. This shaped the implementation sequence: tasks + traces first, judges second.

Data Requirements

100 traces, each labeled on all 5 dimensions (~500 binary labels), plus 50 pairwise-labeled pairs. Minimum viable: 60 traces; confidence intervals widen below that.

| Artifact | Count | Effort |
|---|---|---|
| Binary-labeled traces | 100 | ~500 binary labels (100 × 5 dimensions) |
| Pairwise-labeled pairs | 50 | ~50 winner labels |
| Minimum for MVP | 60 | CIs widen below this |

Sourcing: Run subq eval run against 25–30 tasks × 3 agents × 2 runs = ~180 sessions. Curate a balanced set.

Split Strategy

15/45/40 train/dev/test, stratified by label AND agent. Each split contains traces from all 3 agents in the same proportion.


Train (15%)

Source for few-shot examples in judge prompts. These traces become the calibration anchors in the scratchpad protocol.

~15 traces

Dev (45%)

Iterate on judge prompts here. Measure TPR/TNR. Tune until all dimensions clear threshold. Can be reused freely.

~45 traces

Test (40%)

Final measurement. Run exactly once. Never iterate after seeing results. This is the number you report.

~40 traces — sacred, untouched
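
As a sketch, the stratification described above can be done by grouping traces by (agent, label) and slicing each group proportionally. The LabeledTrace shape and the function name below are illustrative assumptions, not the repo's actual types:

```typescript
// Illustrative trace record: agent and label are the two stratification keys.
interface LabeledTrace {
  id: string;
  agent: string;
  label: "pass" | "fail";
}

// 15/45/40 split, stratified by (agent, label) so every stratum appears
// in each split in roughly the same proportion.
function stratifiedSplit(
  traces: LabeledTrace[],
  fractions = { train: 0.15, dev: 0.45, test: 0.4 },
): { train: LabeledTrace[]; dev: LabeledTrace[]; test: LabeledTrace[] } {
  const out = {
    train: [] as LabeledTrace[],
    dev: [] as LabeledTrace[],
    test: [] as LabeledTrace[],
  };
  // Group traces into strata keyed by agent and label.
  const strata = new Map<string, LabeledTrace[]>();
  for (const t of traces) {
    const key = `${t.agent}:${t.label}`;
    if (!strata.has(key)) strata.set(key, []);
    strata.get(key)!.push(t);
  }
  // Slice each stratum proportionally; the remainder goes to test.
  for (const group of strata.values()) {
    const nTrain = Math.round(group.length * fractions.train);
    const nDev = Math.round(group.length * fractions.dev);
    out.train.push(...group.slice(0, nTrain));
    out.dev.push(...group.slice(nTrain, nTrain + nDev));
    out.test.push(...group.slice(nTrain + nDev));
  }
  return out;
}
```

In practice the groups should be shuffled with a fixed seed before slicing, so the split is reproducible but not order-dependent.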

Validation Metrics

TPR (true positive rate) and TNR (true negative rate) per dimension. Target: >90% on both. Minimum acceptable: 80%/80%. Pairwise: position-swap concordance >85%.

Target Gauges

| Metric | Target value | Note |
|---|---|---|
| Correctness TPR | 93% | Target: >90% |
| Correctness TNR | 91% | Target: >90% |
| Verification TPR | 95% | Milestone signals strong |
| Completeness TPR | 88% | Hardest dimension |
| Pairwise Concordance | 87% | Target: >85% |
Note: These are target values. Actual metrics will be populated after the validation pipeline runs against real human-labeled traces.

Confusion Matrices

Per-dimension confusion matrices. TP and TN should dominate. High FP means the judge is too lenient (it passes traces humans failed); high FN means it is too strict (it fails traces humans passed).

All four binary dimensions (Correctness, Verification, Completeness, Code Quality) use the same 2×2 layout:

| | Pred Pass | Pred Fail |
|---|---|---|
| Actual Pass | TP | FN |
| Actual Fail | FP | TN |
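
A minimal sketch of how these counts fold into TPR and TNR, assuming paired human and judge verdicts are available per trace. The field names echo a subset of the DimensionMetrics type described later on this page, but the function itself is illustrative:

```typescript
type Verdict = "pass" | "fail";

// Subset of the per-dimension confusion-matrix metrics.
interface ConfusionCounts {
  tp: number; fp: number; tn: number; fn: number;
  tpr: number; // TP / (TP + FN): how often real passes are recognized
  tnr: number; // TN / (TN + FP): how often real failures are recognized
}

function confusionMetrics(
  pairs: Array<{ human: Verdict; judge: Verdict }>,
): ConfusionCounts {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const { human, judge } of pairs) {
    if (human === "pass") {
      if (judge === "pass") tp++;
      else fn++;
    } else {
      if (judge === "fail") tn++;
      else fp++; // judge passed a trace humans failed: leniency
    }
  }
  return { tp, fp, tn, fn, tpr: tp / (tp + fn), tnr: tn / (tn + fp) };
}
```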

Rogan-Gladen Bias Correction

Raw judge pass rates are biased estimates of true pass rates. The Rogan-Gladen formula corrects them using the judge's known TPR and TNR, producing an unbiased estimate reported with bootstrap 95% confidence intervals.

p_obs → θ = (p_obs + TNR − 1) / (TPR + TNR − 1) → θ̂ ± CI
(observed rate → corrected estimate → confidence interval)

Worked example, src/eval/validation/metrics.ts — correctedRate():

input: observedRate = 0.72, TPR = 0.93, TNR = 0.91
formula: θ = (0.72 + 0.91 − 1) / (0.93 + 0.91 − 1) = 0.63 / 0.84
result: θ̂ = 0.750 (corrected from 0.720 observed)
bootstrap: 95% CI [0.68, 0.82] (1,000 resamples)
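
A sketch of what correctedRate() plausibly looks like (the real src/eval/validation/metrics.ts may differ). The bootstrap resamples raw judge verdicts with replacement and re-applies the correction to each resample:

```typescript
// Rogan-Gladen point estimate: θ = (p_obs + TNR − 1) / (TPR + TNR − 1),
// clamped to [0, 1] since the raw estimate can fall outside that range.
function correctedRate(observedRate: number, tpr: number, tnr: number): number {
  const theta = (observedRate + tnr - 1) / (tpr + tnr - 1);
  return Math.min(1, Math.max(0, theta));
}

// Percentile bootstrap over per-trace judge verdicts (true = judge passed).
function bootstrapCI(
  judgePasses: boolean[],
  tpr: number,
  tnr: number,
  resamples = 1000,
): [number, number] {
  const n = judgePasses.length;
  const stats: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let passes = 0;
    for (let j = 0; j < n; j++) {
      // Resample with replacement, then correct the resampled rate.
      if (judgePasses[Math.floor(Math.random() * n)]) passes++;
    }
    stats.push(correctedRate(passes / n, tpr, tnr));
  }
  stats.sort((a, b) => a - b);
  // 2.5th and 97.5th percentiles bound the 95% interval.
  return [stats[Math.floor(0.025 * resamples)], stats[Math.floor(0.975 * resamples)]];
}
```

With the worked example's inputs, correctedRate(0.72, 0.93, 0.91) yields 0.75 as shown above.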

Self-Enhancement Bias Detection

Measure per-agent TPR/TNR separately. If the Claude judge's TPR on Claude Code traces is more than 5 percentage points above its mean TPR on other agents' traces, flag selfEnhancementBias. This is why the PoLL panel auto-selects model families based on which agents are being evaluated.

| Judge Model | SubQ Traces | Claude Traces | Codex Traces |
|---|---|---|---|
| Claude Sonnet 4 | TPR 92% | TPR 98% | TPR 91% |
| Kimi K2.5 | TPR 90% | TPR 91% | TPR 89% |
| GPT-4.1-mini | TPR 88% | TPR 90% | TPR 87% |
Bias detected: Claude Sonnet 4 shows 98% TPR on Claude Code traces vs 91.5% mean on others. Δ = 6.5% > 5% threshold. Auto-excluded from panels evaluating Claude Code.
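
The check itself is small; a hedged sketch, assuming per-agent TPRs arrive as a plain record (the shape and function name are assumptions, not the repo's API):

```typescript
// Flag a judge whose TPR on its own model family's traces exceeds its
// mean TPR on the other agents' traces by more than `threshold`
// (5 percentage points by default, matching the rule above).
function hasSelfEnhancementBias(
  tprByAgent: Record<string, number>,
  ownAgent: string,
  threshold = 0.05,
): boolean {
  const others = Object.entries(tprByAgent)
    .filter(([agent]) => agent !== ownAgent)
    .map(([, tpr]) => tpr);
  const meanOthers = others.reduce((a, b) => a + b, 0) / others.length;
  return tprByAgent[ownAgent] - meanOthers > threshold;
}
```

For the table above: 0.98 − mean(0.92, 0.91) = 0.065 > 0.05, so the Claude judge is flagged on Claude Code traces.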

Synthetic Data for Bootstrapping

100 synthetic EnrichedSession objects with known ground-truth labels. Used to bootstrap judge validation before real traces are available.

| Dimension | Weight | Traces |
|---|---|---|
| Correctness | 0.35 | 21 |
| Verification | 0.20 | 18 |
| Completeness | 0.20 | 16 |
| Code Quality | 0.15 | 16 |
| Minimal Diff | 0.10 | 15 |
| Cross-dimensional | — | 14 |
| Total | 1.00 | 100 |

Cross-Dimensional Edge Cases

E1: Beautiful but Bloated

Correct fix, beautiful code, but agent reformatted the entire file. PASS on correctness, FAIL on minimal diff.

Mixed signals across dimensions

E2: Wrong but Thorough

Wrong fix, but comprehensive tests testing the wrong behavior. FAIL on correctness, borderline on completeness.

Tests don’t prove correctness

E3: Test Mutation

Agent modifies the test to match the buggy behavior instead of fixing the code. The deterministic verification_pass signal fires, but the correct label is FAIL. Adversarial case.

Anti-gaming trigger

E4: Phantom Pass

Zero diff; tests pass only because the bug is intermittent. taskResolved = true, but the correct label is FAIL. A pure false positive.

Deterministic check insufficient
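
These four cases can be encoded as adversarial fixtures with expected per-dimension verdicts. The EdgeCase shape below is illustrative, not the repo's actual fixture format:

```typescript
type Expected = "pass" | "fail" | "borderline";

// One adversarial fixture: a trace description plus the ground-truth
// verdicts a correct judge must produce, per dimension.
interface EdgeCase {
  id: string;
  summary: string;
  expected: Record<string, Expected>;
}

const edgeCases: EdgeCase[] = [
  { id: "E1", summary: "Correct fix, but the whole file was reformatted",
    expected: { correctness: "pass", minimalDiff: "fail" } },
  { id: "E2", summary: "Wrong fix with thorough tests of the wrong behavior",
    expected: { correctness: "fail", completeness: "borderline" } },
  { id: "E3", summary: "Test mutated to match the buggy behavior",
    expected: { verification: "fail" } },
  { id: "E4", summary: "Zero diff; intermittent bug makes tests pass",
    expected: { correctness: "fail" } },
];

// Dimensions a judge must mark FAIL for a given case to count as caught;
// "borderline" verdicts are excluded from the hard requirement.
function mustFail(caseId: string): string[] {
  const c = edgeCases.find((e) => e.id === caseId);
  if (!c) return [];
  return Object.keys(c.expected).filter((d) => c.expected[d] === "fail");
}
```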

Validation Type System

- Labels · HumanLabel (validation/types.ts): sessionId, dimension, result: Pass | Fail, confidence, notes. One label per dimension per trace.
- Metrics · DimensionMetrics (validation/metrics.ts): tp, fp, tn, fn, tpr, tnr, precision, recall. Per-dimension confusion-matrix metrics.
- Correction · BiasCorrection (validation/metrics.ts): observedRate, correctedRate, tpr, tnr, ci95Low, ci95High. Rogan-Gladen output per dimension.
- Report · ValidationReport (validation/runner.ts): dimensionMetrics: Record, pairwiseMetrics, selfEnhancementFlags, overallVerdict: string. The full validation result.
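
Sketched as TypeScript, the four types might look like this. Field types not stated above (confidence as a 0–1 number, the Record key of dimensionMetrics, the shape of pairwiseMetrics) are assumptions:

```typescript
type Verdict = "Pass" | "Fail";

// validation/types.ts — one label per dimension per trace
interface HumanLabel {
  sessionId: string;
  dimension: string;
  result: Verdict;
  confidence: number; // assumed: labeler confidence in [0, 1]
  notes?: string;
}

// validation/metrics.ts — per-dimension confusion-matrix metrics
interface DimensionMetrics {
  tp: number; fp: number; tn: number; fn: number;
  tpr: number; tnr: number; precision: number; recall: number;
}

// validation/metrics.ts — Rogan-Gladen output per dimension
interface BiasCorrection {
  observedRate: number;
  correctedRate: number;
  tpr: number;
  tnr: number;
  ci95Low: number;
  ci95High: number;
}

// validation/runner.ts — the full validation result
interface ValidationReport {
  dimensionMetrics: Record<string, DimensionMetrics>; // key assumed to be the dimension name
  pairwiseMetrics: { concordance: number }; // shape assumed
  selfEnhancementFlags: string[];
  overallVerdict: string;
}
```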

Implementation Sequence

LLM Judge implementation roadmap
| Week | Milestone | Phase |
|---|---|---|
| 1 | Task authoring (25–30 tasks) + trace generation (~180 sessions) | data |
| 1–2 | Human labeling: 100 binary + 50 pairwise labels in JSONL | labels |
| 2 | Validation infra: validation/{types,metrics,runner}.ts + tests | code |
| 2 | Error analysis: error-analysis.ts + heuristic pipeline | code |
| 2–3 | 5 binary judge prompts tuned on dev set → TPR/TNR > 0.9 | tuning |
| 3 | Pairwise judge + position-swap concordance > 0.85 | tuning |
| 3 | Test set measurement: final TPR/TNR + bias corrections | measure |
| 3 | Production wiring: correctedRate() in comparison.ts + --llm-judge flag | ship |