Validation · Measuring the Measurer

The Assayer’s Bench

Judges are not trustworthy until measured against human labels. TPR and TNR per dimension, Rogan-Gladen bias correction, bootstrap confidence intervals, and self-enhancement bias detection—the meta-evaluation that makes the evaluation credible.

The Critical Gap

The eval-audit diagnostic identified the biggest missing piece: zero mention of validating judges against human labels. No confusion matrix. No TPR/TNR measurement. The PoLL panel and position-swap checks mitigate bias but do not tell you whether judges are correct. This page describes the validation pipeline that closes that gap.

Eval-audit finding: “The framework is beautifully engineered but the evaluation methodology is still theoretical.” No real traces exist yet. Heuristic scoring proxies are coarse. This shaped the implementation sequence: tasks + traces first, judges second.

Data Requirements

100 traces, each labeled on all 5 dimensions (~500 binary labels), plus 50 pairwise-labeled pairs. Minimum viable: 60 traces; confidence intervals widen below that.

| Artifact | Count | Effort |
|---|---|---|
| Binary-labeled traces | 100 | ~500 binary labels (100 × 5 dimensions) |
| Pairwise-labeled pairs | 50 | ~50 winner labels |
| Minimum for MVP | 60 | CIs widen below this |

Sourcing: Run subq eval run against 25–30 tasks × 3 agents × 2 runs = ~180 sessions. Curate a balanced set.

Split Strategy

15/45/40 train/dev/test, stratified by label AND agent. Each split contains traces from all 3 agents in the same proportion.


Train (15%)

Source for few-shot examples in judge prompts. These traces become the calibration anchors in the scratchpad protocol.

~15 traces

Dev (45%)

Iterate on judge prompts here. Measure TPR/TNR. Tune until all dimensions clear threshold. Can be reused freely.

~45 traces

Test (40%)

Final measurement. Run exactly once. Never iterate after seeing results. This is the number you report.

~40 traces — sacred, untouched
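
As a sketch, the stratification described above can be done by grouping traces by (agent, label) and slicing each group proportionally. The LabeledTrace shape and the function name below are illustrative assumptions, not the repo's actual types:

```typescript
// Illustrative trace record: agent and label are the two stratification keys.
interface LabeledTrace {
  id: string;
  agent: string;
  label: "pass" | "fail";
}

// 15/45/40 split, stratified by (agent, label) so every stratum appears
// in each split in roughly the same proportion.
function stratifiedSplit(
  traces: LabeledTrace[],
  fractions = { train: 0.15, dev: 0.45, test: 0.4 },
): { train: LabeledTrace[]; dev: LabeledTrace[]; test: LabeledTrace[] } {
  const out = {
    train: [] as LabeledTrace[],
    dev: [] as LabeledTrace[],
    test: [] as LabeledTrace[],
  };
  // Group traces into strata keyed by agent and label.
  const strata = new Map<string, LabeledTrace[]>();
  for (const t of traces) {
    const key = `${t.agent}:${t.label}`;
    if (!strata.has(key)) strata.set(key, []);
    strata.get(key)!.push(t);
  }
  // Slice each stratum proportionally; the remainder goes to test.
  for (const group of strata.values()) {
    const nTrain = Math.round(group.length * fractions.train);
    const nDev = Math.round(group.length * fractions.dev);
    out.train.push(...group.slice(0, nTrain));
    out.dev.push(...group.slice(nTrain, nTrain + nDev));
    out.test.push(...group.slice(nTrain + nDev));
  }
  return out;
}
```

In practice the groups should be shuffled with a fixed seed before slicing, so the split is reproducible but not order-dependent.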

Validation Metrics

TPR (true positive rate) and TNR (true negative rate) per dimension. Target: >90% on both. Minimum acceptable: 80%/80%. Pairwise: position-swap concordance >85%.

Target Gauges

| Metric | Target value | Note |
|---|---|---|
| Correctness TPR | 93% | Target: >90% |
| Correctness TNR | 91% | Target: >90% |
| Verification TPR | 95% | Milestone signals strong |
| Completeness TPR | 88% | Hardest dimension |
| Pairwise Concordance | 87% | Target: >85% |
Note: These are target values. Actual metrics will be populated after the validation pipeline runs against real human-labeled traces.

Confusion Matrices

Per-dimension confusion matrices. TP and TN should dominate. High FP means the judge is too lenient (it passes traces humans failed); high FN means it is too strict (it fails traces humans passed).

All four binary dimensions (Correctness, Verification, Completeness, Code Quality) use the same 2×2 layout:

| | Pred Pass | Pred Fail |
|---|---|---|
| Actual Pass | TP | FN |
| Actual Fail | FP | TN |
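
A minimal sketch of how these counts fold into TPR and TNR, assuming paired human and judge verdicts are available per trace. The field names echo a subset of the DimensionMetrics type described later on this page, but the function itself is illustrative:

```typescript
type Verdict = "pass" | "fail";

// Subset of the per-dimension confusion-matrix metrics.
interface ConfusionCounts {
  tp: number; fp: number; tn: number; fn: number;
  tpr: number; // TP / (TP + FN): how often real passes are recognized
  tnr: number; // TN / (TN + FP): how often real failures are recognized
}

function confusionMetrics(
  pairs: Array<{ human: Verdict; judge: Verdict }>,
): ConfusionCounts {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const { human, judge } of pairs) {
    if (human === "pass") {
      if (judge === "pass") tp++;
      else fn++;
    } else {
      if (judge === "fail") tn++;
      else fp++; // judge passed a trace humans failed: leniency
    }
  }
  return { tp, fp, tn, fn, tpr: tp / (tp + fn), tnr: tn / (tn + fp) };
}
```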

Rogan-Gladen Bias Correction

Raw judge pass rates are biased estimates of true pass rates. The Rogan-Gladen formula corrects them using the judge's known TPR and TNR, producing an unbiased estimate reported with bootstrap 95% confidence intervals.

p_obs → θ = (p_obs + TNR − 1) / (TPR + TNR − 1) → θ̂ ± CI
(observed rate → corrected estimate → confidence interval)

Worked example, src/eval/validation/metrics.ts — correctedRate():

input: observedRate = 0.72, TPR = 0.93, TNR = 0.91
formula: θ = (0.72 + 0.91 − 1) / (0.93 + 0.91 − 1) = 0.63 / 0.84
result: θ̂ = 0.750 (corrected from 0.720 observed)
bootstrap: 95% CI [0.68, 0.82] (1,000 resamples)
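
A sketch of what correctedRate() plausibly looks like (the real src/eval/validation/metrics.ts may differ). The bootstrap resamples raw judge verdicts with replacement and re-applies the correction to each resample:

```typescript
// Rogan-Gladen point estimate: θ = (p_obs + TNR − 1) / (TPR + TNR − 1),
// clamped to [0, 1] since the raw estimate can fall outside that range.
function correctedRate(observedRate: number, tpr: number, tnr: number): number {
  const theta = (observedRate + tnr - 1) / (tpr + tnr - 1);
  return Math.min(1, Math.max(0, theta));
}

// Percentile bootstrap over per-trace judge verdicts (true = judge passed).
function bootstrapCI(
  judgePasses: boolean[],
  tpr: number,
  tnr: number,
  resamples = 1000,
): [number, number] {
  const n = judgePasses.length;
  const stats: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let passes = 0;
    for (let j = 0; j < n; j++) {
      // Resample with replacement, then correct the resampled rate.
      if (judgePasses[Math.floor(Math.random() * n)]) passes++;
    }
    stats.push(correctedRate(passes / n, tpr, tnr));
  }
  stats.sort((a, b) => a - b);
  // 2.5th and 97.5th percentiles bound the 95% interval.
  return [stats[Math.floor(0.025 * resamples)], stats[Math.floor(0.975 * resamples)]];
}
```

With the worked example's inputs, correctedRate(0.72, 0.93, 0.91) yields 0.75 as shown above.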

Self-Enhancement Bias Detection

Measure per-agent TPR/TNR separately. If the Claude judge's TPR on Claude Code traces is more than 5 percentage points above its mean TPR on other agents' traces, flag selfEnhancementBias. This is why the PoLL panel auto-selects model families based on which agents are being evaluated.

| Judge Model | SubQ Traces | Claude Traces | Codex Traces |
|---|---|---|---|
| Claude Sonnet 4 | TPR 92% | TPR 98% | TPR 91% |
| Kimi K2.5 | TPR 90% | TPR 91% | TPR 89% |
| GPT-4.1-mini | TPR 88% | TPR 90% | TPR 87% |
Bias detected: Claude Sonnet 4 shows 98% TPR on Claude Code traces vs 91.5% mean on others. Δ = 6.5% > 5% threshold. Auto-excluded from panels evaluating Claude Code.
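
The check itself is small; a hedged sketch, assuming per-agent TPRs arrive as a plain record (the shape and function name are assumptions, not the repo's API):

```typescript
// Flag a judge whose TPR on its own model family's traces exceeds its
// mean TPR on the other agents' traces by more than `threshold`
// (5 percentage points by default, matching the rule above).
function hasSelfEnhancementBias(
  tprByAgent: Record<string, number>,
  ownAgent: string,
  threshold = 0.05,
): boolean {
  const others = Object.entries(tprByAgent)
    .filter(([agent]) => agent !== ownAgent)
    .map(([, tpr]) => tpr);
  const meanOthers = others.reduce((a, b) => a + b, 0) / others.length;
  return tprByAgent[ownAgent] - meanOthers > threshold;
}
```

For the table above: 0.98 − mean(0.92, 0.91) = 0.065 > 0.05, so the Claude judge is flagged on Claude Code traces.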

Synthetic Data for Bootstrapping

100 synthetic EnrichedSession objects with known ground-truth labels. Used to bootstrap judge validation before real traces are available.

| Dimension | Weight | Traces |
|---|---|---|
| Correctness | 0.35 | 21 |
| Verification | 0.20 | 18 |
| Completeness | 0.20 | 16 |
| Code Quality | 0.15 | 16 |
| Minimal Diff | 0.10 | 15 |
| Cross-dimensional | — | 14 |
| Total | 1.00 | 100 |

Cross-Dimensional Edge Cases

E1: Beautiful but Bloated

Correct fix, beautiful code, but agent reformatted the entire file. PASS on correctness, FAIL on minimal diff.

Mixed signals across dimensions

E2: Wrong but Thorough

Wrong fix, but comprehensive tests testing the wrong behavior. FAIL on correctness, borderline on completeness.

Tests don’t prove correctness

E3: Test Mutation

Agent modifies the test to match the buggy behavior instead of fixing the code. The deterministic verification_pass signal fires, but the correct label is FAIL. Adversarial case.

Anti-gaming trigger

E4: Phantom Pass

Zero diff; tests pass only because the bug is intermittent. taskResolved = true, but the correct label is FAIL. A pure false positive.

Deterministic check insufficient
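
These four cases can be encoded as adversarial fixtures with expected per-dimension verdicts. The EdgeCase shape below is illustrative, not the repo's actual fixture format:

```typescript
type Expected = "pass" | "fail" | "borderline";

// One adversarial fixture: a trace description plus the ground-truth
// verdicts a correct judge must produce, per dimension.
interface EdgeCase {
  id: string;
  summary: string;
  expected: Record<string, Expected>;
}

const edgeCases: EdgeCase[] = [
  { id: "E1", summary: "Correct fix, but the whole file was reformatted",
    expected: { correctness: "pass", minimalDiff: "fail" } },
  { id: "E2", summary: "Wrong fix with thorough tests of the wrong behavior",
    expected: { correctness: "fail", completeness: "borderline" } },
  { id: "E3", summary: "Test mutated to match the buggy behavior",
    expected: { verification: "fail" } },
  { id: "E4", summary: "Zero diff; intermittent bug makes tests pass",
    expected: { correctness: "fail" } },
];

// Dimensions a judge must mark FAIL for a given case to count as caught;
// "borderline" verdicts are excluded from the hard requirement.
function mustFail(caseId: string): string[] {
  const c = edgeCases.find((e) => e.id === caseId);
  if (!c) return [];
  return Object.keys(c.expected).filter((d) => c.expected[d] === "fail");
}
```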

Validation Type System

- Labels · HumanLabel (validation/types.ts): sessionId, dimension, result: Pass | Fail, confidence, notes. One label per dimension per trace.
- Metrics · DimensionMetrics (validation/metrics.ts): tp, fp, tn, fn, tpr, tnr, precision, recall. Per-dimension confusion-matrix metrics.
- Correction · BiasCorrection (validation/metrics.ts): observedRate, correctedRate, tpr, tnr, ci95Low, ci95High. Rogan-Gladen output per dimension.
- Report · ValidationReport (validation/runner.ts): dimensionMetrics: Record, pairwiseMetrics, selfEnhancementFlags, overallVerdict: string. The full validation result.
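
Sketched as TypeScript, the four types might look like this. Field types not stated above (confidence as a 0–1 number, the Record key of dimensionMetrics, the shape of pairwiseMetrics) are assumptions:

```typescript
type Verdict = "Pass" | "Fail";

// validation/types.ts — one label per dimension per trace
interface HumanLabel {
  sessionId: string;
  dimension: string;
  result: Verdict;
  confidence: number; // assumed: labeler confidence in [0, 1]
  notes?: string;
}

// validation/metrics.ts — per-dimension confusion-matrix metrics
interface DimensionMetrics {
  tp: number; fp: number; tn: number; fn: number;
  tpr: number; tnr: number; precision: number; recall: number;
}

// validation/metrics.ts — Rogan-Gladen output per dimension
interface BiasCorrection {
  observedRate: number;
  correctedRate: number;
  tpr: number;
  tnr: number;
  ci95Low: number;
  ci95High: number;
}

// validation/runner.ts — the full validation result
interface ValidationReport {
  dimensionMetrics: Record<string, DimensionMetrics>; // key assumed to be the dimension name
  pairwiseMetrics: { concordance: number }; // shape assumed
  selfEnhancementFlags: string[];
  overallVerdict: string;
}
```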

Implementation Sequence

LLM Judge implementation roadmap
| Week | Milestone | Phase |
|---|---|---|
| 1 | Task authoring (25–30 tasks) + trace generation (~180 sessions) | data |
| 1–2 | Human labeling: 100 binary + 50 pairwise labels in JSONL | labels |
| 2 | Validation infra: validation/{types,metrics,runner}.ts + tests | code |
| 2 | Error analysis: error-analysis.ts + heuristic pipeline | code |
| 2–3 | 5 binary judge prompts tuned on dev set → TPR/TNR > 0.9 | tuning |
| 3 | Pairwise judge + position-swap concordance > 0.85 | tuning |
| 3 | Test set measurement: final TPR/TNR + bias corrections | measure |
| 3 | Production wiring: correctedRate() in comparison.ts + --llm-judge flag | ship |