Phase 7 · Self-Evolution

The Hermetic Loop

Solve et coagula—dissolve what failed, coagulate what works. A five-stage evolution cycle that transforms evaluation failures into agent capabilities, automatically.

The Loop

Five stages, endlessly cycling. The eval framework runs agents on tasks, analyzes their failures, proposes improvements, builds new skills or prompt patches, evaluates the result, and selects winners for the frontier. Each iteration dissolves a weakness and coagulates a strength.

The five stages:

1. Execute: run agents on tasks, capture EnrichedSessions.
2. Analyze: the Proposer reads failure traces and identifies patterns.
3. Build: the Skill Builder writes a SKILL.md or prompt patch.
4. Evaluate: re-run with the evolved config, compare scores.
5. Select: the frontier mechanism keeps the top-N configurations as git branches.
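The cycle above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in for the framework's real machinery: the Config class, the execute/analyze/build functions, and the fixed score delta are illustrative assumptions, not the actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    """One evolution candidate: a set of skills plus an eval score."""
    skills: list = field(default_factory=list)
    score: float = 0.0

def execute(config):
    # Execute: stand-in for running agents on tasks; returns failure traces.
    return ["missing-verification"] if "verification" not in config.skills else []

def analyze(traces):
    # Analyze: turn a failure pattern into a proposed skill name.
    return traces[0].split("-")[1] if traces else None

def build(config, proposal):
    # Build: a Tier 1 change is purely additive, nothing is removed.
    return Config(skills=config.skills + [proposal])

def evolve(config, iterations=3):
    frontier = [config]
    for _ in range(iterations):
        best = max(frontier, key=lambda c: c.score)
        proposal = analyze(execute(best))
        if proposal is None:
            break                              # nothing left to dissolve
        candidate = build(best, proposal)
        candidate.score = best.score + 0.09    # Evaluate: illustrative delta
        frontier.append(candidate)             # Select: winners stay on the frontier
    return max(frontier, key=lambda c: c.score)
```

Each pass dissolves one weakness (a failure pattern) and coagulates one strength (a new skill on the frontier).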

Tiered Optimization

Four tiers of optimization targets, ordered by risk. The loop starts with the safest tier and escalates only when lower tiers plateau.

Tier 1 — Skills

Lowest risk, purely additive. New SKILL.md files that teach the agent patterns: search persistence, verification-first, error recovery. No existing config modified.

Tier 2 — Tool Allow/Deny Lists

Reversible. Constrain or expand which tools the agent can use. Enable grep for search-heavy tasks, deny write for analysis-only tasks.
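One plausible way to resolve such lists, as a sketch: the function name and the precedence (allow list applied before deny list) are assumptions, not documented framework behavior.

```python
def effective_tools(available, allow=None, deny=None):
    """Apply an optional allow list first, then an optional deny list."""
    tools = set(available)
    if allow is not None:
        tools &= set(allow)   # constrain to the allow list
    if deny is not None:
        tools -= set(deny)    # then remove denied tools
    return sorted(tools)
```

For an analysis-only task, `effective_tools(["grep", "read", "write"], deny=["write"])` leaves only the read-side tools; reverting is just dropping the lists.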

Tier 3 — System Prompt Sections

Moderate risk. Modifies system prompt segments. Prompt changes invalidate cached prefixes, so each change must prove its value against that cache cost.

Tier 4 — Prompt Ordering

Highest risk. Reorder prompt sections for cache efficiency. A single wrong reordering can degrade cache hit rate and increase latency across all tasks.

From the outermost tier (skills) to the innermost (prompt ordering), risk increases and change frequency decreases.
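The escalate-on-plateau policy might look like the following sketch; the tier names, plateau threshold, and window size are illustrative assumptions.

```python
TIERS = ["skills", "tool_lists", "prompt_sections", "prompt_ordering"]  # risk-ordered

def next_tier(current, recent_deltas, plateau=0.01, window=3):
    """Escalate to the next (riskier) tier only after `window`
    consecutive iterations with sub-threshold score improvement."""
    stalled = (len(recent_deltas) >= window
               and all(d < plateau for d in recent_deltas[-window:]))
    if stalled and current != TIERS[-1]:
        return TIERS[TIERS.index(current) + 1]
    return current
```

A single good iteration resets nothing here; only a full window of flat deltas unlocks the riskier knobs, which keeps the loop on Tier 1 for as long as skills keep paying off.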

Frontier Visualization

Each evolution iteration produces a configuration. Winners are selected by the frontier mechanism and persisted as git branches. The lineage traces back through parent configurations.

Iteration 1, Baseline: original configuration, no modifications. Score: 0.62 (parent: none, knob: none).

Iteration 3, Search Persistence: gained the search-persistence skill. Score: 0.71 (+0.09) (parent: iter-1, knob: skill).

Iteration 5, Prompt Recovery: improved the error-recovery prompt. Score: 0.78 (+0.07) (parent: iter-3, knob: prompt).
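Frontier selection and branch persistence could be sketched like this. The branch naming scheme (`evo/iter-N`), the entry dict shape, and the `dry_run` flag are all invented for illustration.

```python
import subprocess

def persist_frontier(frontier, top_n=3, dry_run=True):
    """Keep the top-N configs by score; each winner becomes a git branch,
    so lineage is simply the branch's commit history back to its parent."""
    winners = sorted(frontier, key=lambda e: e["score"], reverse=True)[:top_n]
    for entry in winners:
        branch = f"evo/iter-{entry['iter']}"
        if not dry_run:
            # Force-point the branch at the winning configuration's commit.
            subprocess.run(["git", "branch", "-f", branch, entry["commit"]],
                           check=True)
    return [f"evo/iter-{e['iter']}" for e in winners]
```

Losing configurations simply never get a branch; nothing is deleted, so any past winner can still be checked out and re-entered into the pool.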

Feedback Memory

The Proposer remembers what was tried before. Circular proposals are rejected. Only novel approaches or extensions of successful ones proceed.

proposer — iteration 5 analysis
proposer Analyzing 12 failure traces from iteration 4...
proposer Pattern: agent fails to run tests before reporting completion
Iteration 2 tried: "Always run test suite" prompt patch → rejected (0% delta)
Iteration 3 tried: verification-first skill → accepted (+9% delta)
-----
proposer Proposal: Extend verification skill with pre-commit hook pattern
builder Writing eval-skills/verification-v2/SKILL.md...
guardrail Regression gate: PASS (baseline tasks: 0.63 → 0.64)
guardrail Size limit: PASS (SKILL.md: 2.1KB ≤ 15KB)
frontier New entry: iter-5-verification-v2 (score: 0.73, parent: iter-3)
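A minimal sketch of such a memory, assuming exact-match deduplication and an "extensions must build on something that helped" rule; the real Proposer presumably matches prior attempts more fuzzily than this.

```python
class ProposalMemory:
    """Remembers prior proposals and their outcomes so the Proposer
    can reject circular proposals and extend only successful ones."""

    def __init__(self):
        self.history = {}  # proposal text -> observed score delta

    def record(self, proposal, delta):
        self.history[proposal] = delta

    def is_circular(self, proposal):
        # Exact repeat of something already tried.
        return proposal in self.history

    def admissible(self, proposal, extends=None):
        if self.is_circular(proposal):
            return False
        # An extension is only allowed off a proposal that actually helped.
        if extends is not None:
            return self.history.get(extends, 0.0) > 0.0
        return True
```

In the transcript above, the rejected "always run test suite" patch (0% delta) blocks its own repeats, while the +9% verification skill remains a valid base to extend.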

Guardrails

Each check, its threshold, and how it is enforced:

Regression Gate · score on baseline tasks must not decrease by >2% · automatic reject, no override.
Size Limits · SKILL.md ≤ 15KB, prompt patch ≤ 5KB · automatic reject if exceeded.
Cache Compat · prompt prefix hash must match for Tier 1–2 changes · warning for Tier 3+; block if latency delta >20%.
Semantic Check · LLM judge confirms prompt meaning is preserved · automatic reject on semantic drift.
Trace Audit · anti-gaming heuristics on all eval transcripts · flag and quarantine suspicious results.
Human Review · all Tier 3–4 changes require manual approval · gate before deployment, no auto-merge.
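The automatic portion of these checks might be wired together as below; the `change` dict shape and the byte-denominated size limits are illustrative assumptions that mirror the thresholds above.

```python
def run_guardrails(change, baseline_score, new_score):
    """Run the automatic gates; returns (passed, reasons-for-rejection)."""
    reasons = []
    # Regression gate: baseline score must not drop by more than 2%.
    if new_score < baseline_score * 0.98:
        reasons.append("regression gate: baseline score dropped >2%")
    # Size limits: 15KB for a SKILL.md, 5KB for a prompt patch.
    limit = 15_000 if change["kind"] == "skill" else 5_000
    if change["size_bytes"] > limit:
        reasons.append(f"size limit: {change['size_bytes']}B > {limit}B")
    # Tier 3-4 changes never auto-merge; a human must approve.
    if change["tier"] >= 3 and not change.get("human_approved"):
        reasons.append("tier 3+ requires manual approval")
    return (not reasons, reasons)
```

The semantic check and trace audit are LLM- and heuristic-driven and are omitted here; a real gate would chain them after these cheap deterministic checks.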

Lineage

EvoSkill V1

Sentient AGI

Three-agent evolutionary loop: Executor, Proposer, Skill Builder. Frontier-based selection on git branches. Cross-agent skill transferability—skills evolved for one agent transfer zero-shot to others.

Apache 2.0 · OfficeQA +7.3% · SealQA +12.1% · BrowseComp +5.3%

Hermes Self-Evolution

Nous Research

DSPy + GEPA (Genetic-Pareto Prompt Evolution, ICLR 2026 Oral) for reflective prompt mutation. Reads execution traces to understand why things fail. Four-tier optimization targets.

MIT License · ~$2–10/run · No GPU required · 3-example minimum