1
Primary Influences
LangChain Harness Engineering
The Three-Knob Model
System prompt + tools + middleware optimization alone moved an agent 13.7 points on Terminal Bench 2.0 (52.8% → 66.5%) without changing the model. This “three-knob model” is the core design axis—each knob tunable and measurable independently. The iterative trace-analyze-modify loop maps directly to our Hermetic Loop.
+13.7pt gain
Terminal Bench 2.0
trace-analyze-modify
EvoSkill V1
Sentient AGI
Three-agent evolutionary loop: Executor → Proposer → Skill Builder. Frontier-based selection on git branches. Key finding: skills evolved for one agent transfer zero-shot to others. Supports Claude Code, Codex, OpenCode, OpenHands, Goose.
Apache 2.0
OfficeQA +7.3%
SealQA +12.1%
BrowseComp +5.3%
Hermes Self-Evolution
Nous Research
DSPy + GEPA for reflective prompt evolution. Reads execution traces to understand why things fail. Four-tier optimization: skills → tool descriptions → system prompts → code. Guardrails include test suite, size limits, caching compatibility, and semantic preservation.
MIT License
~$2–10/run
No GPU required
GEPA
ICLR 2026 Oral
Genetic-Pareto prompt evolution—reflective mutations from execution traces. Works with as few as 3 examples. Outperforms RL and previous DSPy optimizers. The reflective failure analysis engine that powers Hermes’ self-evolution.
Sentient AGI / DSPy
3-example minimum
Pareto-optimal
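The frontier-selection step behind genetic-Pareto evolution can be sketched in a few lines. A minimal illustration, not GEPA's actual implementation: candidates carry per-example scores, and only non-dominated ones survive to seed the next mutation round. All names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    scores: list[float]  # per-example scores, e.g. from as few as 3 held-out traces

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it scores >= b on every example and > on at least one."""
    return all(x >= y for x, y in zip(a.scores, b.scores)) and a.scores != b.scores

def pareto_frontier(pool: list[Candidate]) -> list[Candidate]:
    """Keep candidates not dominated by any other; these seed the next round."""
    return [c for c in pool if not any(dominates(o, c) for o in pool)]
```

Pareto selection (rather than a single scalar score) preserves candidates that excel on different examples, which is what lets reflective mutation explore complementary failure modes.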
2
Architecture References
METR transcripts-oco
Adapter pattern per agent → common schema → per-transcript metrics. Directly inspired our ProcessSpawner interface and per-agent adapter architecture.
Adapter pattern · time-savings methodology
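The adapter-per-agent flow can be sketched as follows. A minimal illustration with invented field names (the raw log formats shown are not the actual CLI schemas): each adapter normalizes its agent's raw log records into one common shape, and metrics are computed once over that shape.

```python
from typing import Iterable, Protocol

class TranscriptAdapter(Protocol):
    """One adapter per agent CLI; each maps raw logs to a common schema."""
    def to_common(self, raw: dict) -> dict: ...

class ClaudeCodeAdapter:
    # Field names illustrative only, not the real log format.
    def to_common(self, raw: dict) -> dict:
        return {"role": raw["type"], "text": raw["content"], "agent": "claude-code"}

class CodexAdapter:
    def to_common(self, raw: dict) -> dict:
        return {"role": raw["kind"], "text": raw["body"], "agent": "codex"}

def per_transcript_metrics(events: Iterable[dict]) -> dict:
    """Metrics computed over the common schema, regardless of source agent."""
    events = list(events)
    return {"n_events": len(events),
            "n_assistant": sum(e["role"] == "assistant" for e in events)}
```

New agents plug in by implementing `to_common`; the metric layer never changes.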
Agent of Empires
9-agent orchestrator using tmux + git worktrees + Docker. Serializes worktree creation to avoid .git/config.lock race conditions. 1.4k stars.
Worktree isolation · agent registry
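The serialization trick can be sketched as a process-wide lock around worktree creation, since `git worktree add` touches shared `.git` state. A minimal sketch, not Agent of Empires' actual code; the `dry_run` flag is added here so the command construction can be shown without a real repo.

```python
import subprocess
import threading

_worktree_lock = threading.Lock()  # one creation at a time avoids .git/config.lock races

def add_worktree(repo: str, path: str, branch: str, dry_run: bool = False) -> list[str]:
    """Create a new worktree on a fresh branch, serialized across threads."""
    cmd = ["git", "-C", repo, "worktree", "add", "-b", branch, path]
    with _worktree_lock:
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmd
```

Once created, each agent's worktree is fully independent, so only creation (not subsequent agent work) needs to be serialized.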
AgentLog
Vendor-neutral JSONL schema for agent sessions (specVersion 0.2.0). Informed our EnrichedEvent taxonomy and JSONL streaming architecture.
Event taxonomy · specVersion 0.2.0
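The append-one-JSON-object-per-line pattern behind JSONL session streaming can be sketched as below. The field names (`type`, `ts`, `payload`) are illustrative, not the actual AgentLog 0.2.0 schema; only `specVersion` is taken from the source.

```python
import json
from datetime import datetime, timezone

def emit_event(fh, event_type: str, payload: dict) -> dict:
    """Append one event per line; streaming consumers can tail the file."""
    event = {
        "specVersion": "0.2.0",
        "type": event_type,
        "ts": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    fh.write(json.dumps(event) + "\n")
    return event

def read_events(fh) -> list[dict]:
    """Each non-empty line is a standalone JSON object."""
    return [json.loads(line) for line in fh if line.strip()]
```

Because every line is self-contained, a partially written session file is still parseable up to the last complete line, which is what makes JSONL suited to live agent sessions.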
Vexp SWE-Bench
Side-by-side comparison reports: pass@1, cost, duration, tokens. Established the comparison report structure our Reporter module follows.
Comparison reports · pass@1 methodology
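A standard way to compute pass@k (the unbiased estimator popularized by the HumanEval paper; whether Vexp uses exactly this formula is an assumption) from n runs of which c passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k runs sampled
    without replacement from n total (c correct) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@1 this reduces to the simple pass rate c/n, but the general form lets the same report cover pass@k for any k without rerunning tasks.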
3
Evaluation Methodology
Augment Code Observability
Four dimensions of agent observability: execution traces, output evaluations, cost attribution, and identity tracking. Shaped our enrichment pipeline’s metric surface.
4 dimensions · observability framework
Berkeley/Penn Cheating Paper
28+ cheating instances across 9 benchmarks. Top 3 Terminal-Bench submissions cheated. Built Meerkat for automated trace auditing. Trace auditing is non-negotiable.
Trace auditing · anti-gaming heuristics
PoLL Pattern
Multi-judge panel: one Claude + one non-Claude judge. Prevents self-enhancement bias when evaluating Claude Code. Our dual-judge architecture uses Claude + Kimi K2.5 with position-swap protocol.
Multi-judge panel · bias mitigation
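The position-swap protocol can be sketched as scoring each pair twice with the outputs reversed and averaging, so a judge's preference for whichever answer appears first cancels out. A minimal sketch with stub judges; the real panel calls LLM judges, and the `judge(first, second)` signature is an assumption.

```python
def judge_pair(judge, out_a: str, out_b: str) -> float:
    """judge(first, second) returns a score in [0, 1] for the *first* output.
    Averaging both orderings cancels position bias."""
    return (judge(out_a, out_b) + (1.0 - judge(out_b, out_a))) / 2.0

def panel_verdict(judges, out_a: str, out_b: str) -> str:
    """Mean across a mixed panel (e.g. one Claude + one non-Claude judge)."""
    mean = sum(judge_pair(j, out_a, out_b) for j in judges) / len(judges)
    return "A" if mean > 0.5 else "B" if mean < 0.5 else "tie"
```

A judge that always favors the first position scores exactly 0.5 after the swap, i.e. the protocol converts pure position bias into a tie.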
4
Key Findings
“Git worktrees are the consensus isolation primitive.”
Agent of Empires, Claude Code, amux, SubQ side-agents
“AGENTS.md is a double-edged sword—same file boosts one task 25%, hurts another 30%. Progressive disclosure wins.”
Augment Code observability study
“Top 3 Terminal-Bench submissions cheated. Trace auditing is non-negotiable.”
Berkeley/Penn cheating paper · 28+ instances across 9 benchmarks
“Skills evolved for one agent transfer zero-shot to others.”
EvoSkill V1 · cross-agent/cross-model transferability
“GEPA works with as few as 3 examples—outperforms RL and previous DSPy optimizers.”
Hermes Self-Evolution · ICLR 2026 Oral
5
The Three-Knob Model
LangChain’s core insight distilled: three independent levers, each tunable and measurable. The entire eval framework exists to turn these knobs scientifically.
Knob 1 — System Prompt
The agent’s instructions, evolved via GEPA-style reflective mutation driven by imperatives extracted from execution traces. Highest impact, highest risk—prompt changes can break caching, shift identity, or invalidate behavioral expectations.
Knob 2 — Tools
Allow/deny lists, tool descriptions, parameter schemas. Reversible—constrain or expand what the agent can do per task. Enable grep for search-heavy tasks; deny write access for analysis-only tasks.
Knob 3 — Middleware / Skills
Additive SKILL.md files that teach patterns: search persistence, verification-first, error recovery. Lowest risk—purely additive, no existing config modified. EvoSkill’s frontier selects winning skill combinations.
Outer → inner: decreasing risk, increasing frequency of optimization
The 13.7-point result: LangChain moved from 52.8% to 66.5% on Terminal Bench 2.0 by tuning these three knobs alone—no model change, no fine-tuning, no architecture modification. The eval framework exists to reproduce and surpass this result across SubQ Code, Claude Code, and Codex CLI.
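Knob 2 in particular reduces to a small, reversible policy object. A hypothetical sketch (the class and field names are invented, not any agent's actual config format), following the deny-wins convention:

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    """Per-task allow/deny lists (Knob 2). Deny takes precedence over allow;
    an empty allow list means 'everything not denied'."""
    allow: set[str] = field(default_factory=set)
    deny: set[str] = field(default_factory=set)

    def permits(self, tool: str) -> bool:
        if tool in self.deny:
            return False
        return not self.allow or tool in self.allow

# Search-heavy task: enable grep and read only.
search_task = ToolPolicy(allow={"grep", "read"})
# Analysis-only task: deny mutating tools, allow everything else.
analysis_task = ToolPolicy(deny={"write", "edit"})
```

Because policies are plain data, each knob setting can be versioned alongside the eval run that measured it, keeping the trace-analyze-modify loop reproducible.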