The Critical Gap
The eval-audit diagnostic identified the biggest missing piece: judges were never validated against human labels. No confusion matrix. No TPR/TNR measurement. PoLL and position-swapping mitigate bias, but they do not tell you whether the judges are correct. This page describes the validation pipeline that closes that gap.
Data Requirements
100 binary-labeled traces across all 5 dimensions, plus 50 pairwise-labeled pairs. Minimum viable: 60 traces (confidence intervals widen sharply below that).
Source the traces from a subq eval run: 25–30 tasks × 3 agents × 2 runs ≈ 150–180 sessions. Curate a balanced subset.
Split Strategy
15/45/40 train/dev/test, stratified by label AND agent. Each split contains traces from all 3 agents in the same proportion.
Train (15%)
Source for few-shot examples in judge prompts. These traces become the calibration anchors in the scratchpad protocol.
Dev (45%)
Iterate on judge prompts here. Measure TPR/TNR after each change and tune until every dimension clears the threshold. The dev split can be reused freely.
Test (40%)
Final measurement. Run exactly once. Never iterate after seeing results. This is the number you report.
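The split above can be sketched as follows. The `Trace` shape here is illustrative (the real EnrichedSession carries more fields), and the deterministic shuffle is omitted for brevity:

```typescript
// Illustrative trace shape; stand-in for the real EnrichedSession.
interface Trace {
  id: string;
  agent: string;
  label: "Pass" | "Fail";
}

// Stratify by (label, agent): split each stratum 15/45/40 so every split
// preserves the overall label and agent proportions.
function stratifiedSplit(traces: Trace[]) {
  const strata = new Map<string, Trace[]>();
  for (const t of traces) {
    const key = `${t.label}:${t.agent}`;
    if (!strata.has(key)) strata.set(key, []);
    strata.get(key)!.push(t);
  }
  const train: Trace[] = [], dev: Trace[] = [], test: Trace[] = [];
  for (const group of strata.values()) {
    // Shuffle each stratum with a fixed seed before slicing (omitted here).
    const nTrain = Math.round(group.length * 0.15);
    const nDev = Math.round(group.length * 0.45);
    train.push(...group.slice(0, nTrain));
    dev.push(...group.slice(nTrain, nTrain + nDev));
    test.push(...group.slice(nTrain + nDev));
  }
  return { train, dev, test };
}
```

Because the split runs per stratum, even the small 15% train slice ends up with few-shot examples from every agent and both labels.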
Validation Metrics
TPR (true positive rate) and TNR (true negative rate) per dimension. Target: >90% on both. Minimum acceptable: 80% on each. Pairwise: position-swap concordance >85%.
Confusion Matrices
Per-dimension confusion matrices, with Pass as the positive class. TP and TN should dominate. High FP (the judge passes traces humans failed) means the judge is too lenient; high FN (the judge fails traces humans passed) means it is too strict.
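The per-dimension metrics follow directly from the confusion-matrix counts. A minimal sketch, with Pass as the positive class and the 90%/80% thresholds from this page:

```typescript
interface ConfusionCounts { tp: number; fp: number; tn: number; fn: number; }

// TP = judge and human both Pass; FP = judge passes a human-failed trace;
// FN = judge fails a human-passed trace; TN = both Fail.
function dimensionMetrics(c: ConfusionCounts) {
  const tpr = c.tp / (c.tp + c.fn);   // recall on human-Pass traces
  const tnr = c.tn / (c.tn + c.fp);   // recall on human-Fail traces
  const precision = c.tp / (c.tp + c.fp);
  return { ...c, tpr, tnr, precision, recall: tpr };
}

// Gate against the targets above: >90% preferred, 80% is the floor.
function meetsFloor(m: { tpr: number; tnr: number }): boolean {
  return m.tpr >= 0.8 && m.tnr >= 0.8;
}
```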
Rogan-Gladen Bias Correction
Raw judge pass rates are biased estimates of the true pass rate. The Rogan-Gladen estimator corrects them using the measured TPR (sensitivity) and TNR (specificity): corrected = (observed + TNR - 1) / (TPR + TNR - 1). Report the corrected estimate with a bootstrap 95% confidence interval.
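A sketch of the correction plus a percentile bootstrap for the interval. For brevity, TPR/TNR are treated as fixed during resampling; a fuller version would resample the validation set as well to propagate their uncertainty:

```typescript
// Rogan-Gladen: corrected = (observed + TNR - 1) / (TPR + TNR - 1),
// clamped to [0, 1]. Undefined when TPR + TNR <= 1 (judge no better than chance).
function roganGladen(observed: number, tpr: number, tnr: number): number {
  const denom = tpr + tnr - 1;
  if (denom <= 0) throw new Error("judge is no better than chance");
  return Math.min(1, Math.max(0, (observed + tnr - 1) / denom));
}

// Percentile bootstrap 95% CI on the corrected rate, resampling the
// judge's binary verdicts with replacement.
function bootstrapCI(
  verdicts: boolean[], tpr: number, tnr: number, iters = 2000
): [number, number] {
  const n = verdicts.length;
  const estimates: number[] = [];
  for (let i = 0; i < iters; i++) {
    let passes = 0;
    for (let j = 0; j < n; j++) {
      if (verdicts[Math.floor(Math.random() * n)]) passes++;
    }
    estimates.push(roganGladen(passes / n, tpr, tnr));
  }
  estimates.sort((a, b) => a - b);
  return [estimates[Math.floor(0.025 * iters)], estimates[Math.floor(0.975 * iters)]];
}
```

For example, an observed 70% pass rate with 90% TPR and 90% TNR corrects to (0.7 + 0.9 - 1) / 0.8 = 75%.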
Self-Enhancement Bias Detection
Measure per-agent TPR/TNR separately. If Claude-judge TPR on Claude Code traces is more than 5 percentage points above that judge's mean TPR across all agents, flag selfEnhancementBias. This is why the PoLL panel auto-selects model families based on which agents are being evaluated.
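The check itself is a one-liner once per-agent TPRs are in hand. A sketch, with a hypothetical per-judge map from agent name to measured TPR:

```typescript
// Hypothetical shape: one judge's TPR measured separately per agent.
type PerAgentTPR = Record<string, number>;

// Flag a judge whose TPR on its own model family's traces exceeds its
// mean TPR across all agents by more than 5 percentage points.
function selfEnhancementFlag(ownFamilyAgent: string, tprByAgent: PerAgentTPR): boolean {
  const values = Object.values(tprByAgent);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const own = tprByAgent[ownFamilyAgent];
  return own !== undefined && own - mean > 0.05;
}
```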
Synthetic Data for Bootstrapping
100 synthetic EnrichedSession objects with known ground-truth labels. Used to bootstrap judge validation before real traces are available.
Cross-Dimensional Edge Cases
E1: Beautiful but Bloated
Correct fix, beautiful code, but agent reformatted the entire file. PASS on correctness, FAIL on minimal diff.
E2: Wrong but Thorough
Wrong fix, but comprehensive tests that thoroughly exercise the wrong behavior. FAIL on correctness, borderline on completeness.
E3: Test Mutation
Agent modifies the test to match the buggy behavior instead of fixing the code. Verification reports verification_pass, but humans label it FAIL. An adversarial case.
E4: Phantom Pass
Zero diff; tests pass because the bug is intermittent. taskResolved = true, but humans label it FAIL. A pure false positive.
Validation Type System
Labels
HumanLabel
sessionId, dimension, result: Pass | Fail, confidence, notes. One label per dimension per trace.
Metrics
DimensionMetrics
tp, fp, tn, fn, tpr, tnr, precision, recall. Per-dimension confusion matrix metrics.
Correction
BiasCorrection
observedRate, correctedRate, tpr, tnr, ci95Low, ci95High. Rogan-Gladen output per dimension.
Report
ValidationReport
dimensionMetrics: Record<Dimension, DimensionMetrics>, pairwiseMetrics, selfEnhancementFlags, overallVerdict: string. The full validation result.
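Put together, the types above can be sketched as TypeScript interfaces. Field names come from this page; the Dimension union's exact members and the shapes of pairwiseMetrics and selfEnhancementFlags are assumptions:

```typescript
type Dimension = string;            // exact union of the 5 dimensions not shown here
type Result = "Pass" | "Fail";

interface HumanLabel {
  sessionId: string;
  dimension: Dimension;
  result: Result;
  confidence: number;               // labeler confidence, assumed 0-1
  notes: string;
}

interface DimensionMetrics {
  tp: number; fp: number; tn: number; fn: number;
  tpr: number; tnr: number;
  precision: number; recall: number;
}

interface BiasCorrection {
  observedRate: number;
  correctedRate: number;            // Rogan-Gladen output
  tpr: number; tnr: number;
  ci95Low: number; ci95High: number;
}

interface ValidationReport {
  dimensionMetrics: Record<Dimension, DimensionMetrics>;
  pairwiseMetrics: { concordance: number };  // shape assumed
  selfEnhancementFlags: string[];            // shape assumed
  overallVerdict: string;
}
```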