Prior Art & Lineage

The projects, papers, and patterns that inform the eval framework. Four primary influences, seven architecture references, and one core insight: harness engineering alone moved an agent 13.7 points without changing the model.

Primary Influences

LangChain Harness Engineering

The Three-Knob Model

System prompt + tools + middleware optimization alone moved an agent 13.7 points on Terminal Bench 2.0 (52.8% → 66.5%) without changing the model. This “three-knob model” is the core design axis—each knob tunable and measurable independently. The iterative trace-analyze-modify loop maps directly to our Hermetic Loop.

+13.7pt gain Terminal Bench 2.0 trace-analyze-modify

EvoSkill V1

Sentient AGI

Three-agent evolutionary loop: Executor → Proposer → Skill Builder. Frontier-based selection on git branches. Key finding: skills evolved for one agent transfer zero-shot to others. Supports Claude Code, Codex, OpenCode, OpenHands, Goose.

Apache 2.0 OfficeQA +7.3% SealQA +12.1% BrowseComp +5.3%

Hermes Self-Evolution

Nous Research

DSPy + GEPA for reflective prompt evolution. Reads execution traces to understand why things fail. Four-tier optimization: skills → tool descriptions → system prompts → code. Guardrails include test suite, size limits, caching compatibility, and semantic preservation.

MIT License ~$2–10/run No GPU required

GEPA

ICLR 2026 Oral

Genetic-Pareto prompt evolution—reflective mutations from execution traces. Works with as few as 3 examples. Outperforms RL and previous DSPy optimizers. The reflective failure analysis engine that powers Hermes’ self-evolution.

Sentient AGI / DSPy 3-example minimum Pareto-optimal

Architecture References

METR transcripts-oco

Adapter pattern per agent → common schema → per-transcript metrics. Directly inspired our ProcessSpawner interface and per-agent adapter architecture.

Adapter pattern · time-savings methodology
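The adapter idea above can be sketched in a few lines. This is an illustrative reconstruction, not code from transcripts-oco: the `CommonEvent` fields, `TranscriptAdapter` interface, and `ClaudeCodeAdapter` are hypothetical names standing in for a per-agent adapter that normalizes native transcripts into one schema.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical common schema; field names are illustrative,
# not taken from transcripts-oco itself.
@dataclass
class CommonEvent:
    agent: str
    role: str        # "user" | "assistant" | "tool"
    content: str
    timestamp: float

class TranscriptAdapter(ABC):
    """Per-agent adapter: native transcript event -> common schema."""
    @abstractmethod
    def to_common(self, raw: dict) -> CommonEvent: ...

class ClaudeCodeAdapter(TranscriptAdapter):
    def to_common(self, raw: dict) -> CommonEvent:
        return CommonEvent(
            agent="claude-code",
            role=raw["type"],
            content=raw.get("text", ""),
            timestamp=raw["ts"],
        )

# Per-transcript metrics then operate on lists of CommonEvent only,
# regardless of which agent produced the session.
events = [ClaudeCodeAdapter().to_common(
    {"type": "assistant", "text": "ok", "ts": 0.0})]
print(events[0].agent)  # claude-code
```

Metrics code never sees agent-specific formats; adding a new agent means adding one adapter.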

Agent of Empires

9-agent orchestrator using tmux + git worktrees + Docker. Serializes worktree creation to avoid .git/config.lock race conditions. 1.4k stars.

Worktree isolation · agent registry
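The serialization trick can be sketched as a lock around `git worktree add`. This is an assumption-laden illustration — Agent of Empires' actual mechanism may differ — but it captures the point: git writes `.git/config.lock` during worktree creation, so concurrent creations can collide, while the agents themselves still run in parallel afterwards.

```python
import subprocess
import threading

# Process-wide lock: only worktree *creation* is serialized.
_worktree_lock = threading.Lock()

def create_worktree(repo: str, path: str, branch: str) -> None:
    """Create an isolated worktree on a fresh branch, one at a time,
    to avoid racing on .git/config.lock."""
    with _worktree_lock:
        subprocess.run(
            ["git", "-C", repo, "worktree", "add", "-b", branch, path],
            check=True,
        )
```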

AgentLog

Vendor-neutral JSONL schema for agent sessions (specVersion 0.2.0). Informed our EnrichedEvent taxonomy and JSONL streaming architecture.

Event taxonomy · specVersion 0.2.0
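The JSONL streaming idea looks roughly like this. The event fields below are illustrative, loosely following AgentLog's vendor-neutral spirit — the exact field names in specVersion 0.2.0 may differ.

```python
import json

# Illustrative session events; one JSON object per line is the
# whole format.
events = [
    {"specVersion": "0.2.0", "type": "tool_call", "name": "grep", "ts": 1},
    {"specVersion": "0.2.0", "type": "tool_result", "ok": True, "ts": 2},
]

lines = "\n".join(json.dumps(e) for e in events)

# Streaming consumers parse line-by-line without loading the
# whole session into memory.
parsed = [json.loads(line) for line in lines.splitlines()]
print(parsed[0]["type"])  # tool_call
```

Because each line is independently parseable, enrichment can tail a live session and emit metrics before the run finishes.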

Vexp SWE-Bench

Side-by-side comparison reports: pass@1, cost, duration, tokens. Established the comparison report structure our Reporter module follows.

Comparison reports · pass@1 methodology
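For reference, pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); pass@1 reduces to the per-task pass rate. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per task, c of which pass.
    Probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # cannot draw k samples without hitting a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 is just c/n: 3 passing samples out of 10.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

A comparison report then aggregates this per task alongside cost, duration, and token counts.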

Evaluation Methodology

Augment Code Observability

Four dimensions of agent observability: execution traces, output evaluations, cost attribution, and identity tracking. Shaped our enrichment pipeline’s metric surface.

4 dimensions · observability framework

Berkeley/Penn Cheating Paper

28+ cheating instances across 9 benchmarks. The top 3 Terminal-Bench submissions cheated. The authors built Meerkat for automated trace auditing. Trace auditing is non-negotiable.

Trace auditing · anti-gaming heuristics

PoLL Pattern

Multi-judge panel: one Claude + one non-Claude judge. Prevents self-enhancement bias when evaluating Claude Code. Our dual-judge architecture uses Claude + Kimi K2.5 with position-swap protocol.

Multi-judge panel · bias mitigation
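The position-swap protocol can be sketched as follows. The `judge` stub below is a placeholder for a real model call (here it trivially prefers the longer answer, purely for illustration); the structure is what matters: each judge scores both orderings, the second verdict's sign is flipped back, and averaging cancels position bias. Pairing a Claude judge with a non-Claude judge then averages out self-enhancement bias.

```python
def judge(model: str, first: str, second: str) -> int:
    """Stub for a real judge-model call.
    Returns 1 if `first` wins, -1 if `second` wins.
    (Toy heuristic: prefer the longer answer.)"""
    return 1 if len(first) >= len(second) else -1

def position_swapped_verdict(model: str, a: str, b: str) -> float:
    v1 = judge(model, a, b)    # A shown first
    v2 = -judge(model, b, a)   # B shown first, sign flipped back
    return (v1 + v2) / 2       # 0 means the judge flip-flopped

def panel_verdict(a: str, b: str) -> float:
    # One Claude judge plus one non-Claude judge (e.g. Kimi K2.5).
    return (position_swapped_verdict("claude", a, b)
            + position_swapped_verdict("kimi-k2.5", a, b)) / 2
```

A verdict of 0 from either judge flags position sensitivity worth auditing rather than averaging away.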

Key Findings

“Git worktrees are the consensus isolation primitive.”
Agent of Empires, Claude Code, amux, SubQ side-agents
“AGENTS.md is a double-edged sword—same file boosts one task 25%, hurts another 30%. Progressive disclosure wins.”
Augment Code observability study
“Top 3 Terminal-Bench submissions cheated. Trace auditing is non-negotiable.”
Berkeley/Penn cheating paper · 28+ instances across 9 benchmarks
“Skills evolved for one agent transfer zero-shot to others.”
EvoSkill V1 · cross-agent/cross-model transferability
“GEPA works with as few as 3 examples—outperforms RL and previous DSPy optimizers.”
Hermes Self-Evolution · ICLR 2026 Oral

The Three-Knob Model

LangChain’s core insight distilled: three independent levers, each tunable and measurable. The entire eval framework exists to turn these knobs scientifically.
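The three knobs can be made concrete as a harness config where each lever is an independent field. This is a hypothetical sketch — the field names are illustrative, not LangChain's or the framework's actual API — but it shows the experimental discipline: ablations vary one knob at a time so any score delta is attributable.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessConfig:
    system_prompt: str                                     # Knob 1
    tool_allowlist: set[str] = field(default_factory=set)  # Knob 2
    skills: list[str] = field(default_factory=list)        # Knob 3: SKILL.md paths

base = HarnessConfig(system_prompt="You are a careful coding agent.")

# Vary one knob at a time; hold the others fixed.
variant = HarnessConfig(
    system_prompt=base.system_prompt,
    tool_allowlist={"grep", "read"},
    skills=["skills/verification-first/SKILL.md"],
)
print(variant.skills[0])
```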

Knob 1 — System Prompt

The agent’s instructions. GEPA-style reflective mutation from imperative extraction. Highest impact, highest risk—prompt changes can break caching, shift identity, or invalidate behavioral expectations.

Knob 2 — Tools

Allow/deny lists, tool descriptions, parameter schemas. Reversible—constrain or expand what the agent can do per task: enable grep for search-heavy tasks, deny write for analysis-only tasks.
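A per-task tool policy is small enough to sketch directly. The policy names and format below are hypothetical, not a real agent-CLI config; the point is that the knob is a reversible data change, not a code change.

```python
# Hypothetical per-task tool policies (illustrative names/format).
POLICIES = {
    "search-heavy":  {"allow": {"grep", "read", "glob"}, "deny": set()},
    "analysis-only": {"allow": {"read", "grep"}, "deny": {"write", "bash"}},
}

def tool_permitted(task_kind: str, tool: str) -> bool:
    """A tool must be allowlisted and not denied for this task kind."""
    policy = POLICIES[task_kind]
    return tool in policy["allow"] and tool not in policy["deny"]

print(tool_permitted("analysis-only", "write"))  # False
```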

Knob 3 — Middleware / Skills

Additive SKILL.md files that teach patterns: search persistence, verification-first, error recovery. Lowest risk—purely additive, no existing config modified. EvoSkill’s frontier selects winning skill combinations.

Outer → inner: decreasing risk, increasing frequency of optimization
The 13.7-point result: LangChain moved from 52.8% to 66.5% on Terminal Bench 2.0 by tuning these three knobs alone—no model change, no fine-tuning, no architecture modification. The eval framework exists to reproduce and surpass this result across SubQ Code, Claude Code, and Codex CLI.