Phase 7 · Self-Evolution

The Hermetic Loop

Solve et coagula—dissolve what failed, coagulate what works. A five-stage evolution cycle that transforms evaluation failures into agent capabilities, automatically.

The Loop

Five stages, endlessly cycling. The eval framework runs agents on tasks, analyzes their failures, proposes improvements, builds new skills or prompt patches, evaluates the result, and selects winners for the frontier. Each iteration dissolves a weakness and coagulates a strength.

The five stages:

1. Execute: run agents on tasks, capture EnrichedSessions.
2. Analyze: the Proposer reads failure traces and identifies patterns.
3. Build: the Skill Builder writes a SKILL.md or prompt patch.
4. Evaluate: re-run with the evolved config, compare scores.
5. Select: the frontier mechanism keeps the top-N configurations as git branches.
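The cycle above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in for the framework's real machinery: the Config class, the execute/analyze/build functions, and the fixed score delta are illustrative assumptions, not the actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    """One evolution candidate: a set of skills plus an eval score."""
    skills: list = field(default_factory=list)
    score: float = 0.0

def execute(config):
    # Execute: stand-in for running agents on tasks; returns failure traces.
    return ["missing-verification"] if "verification" not in config.skills else []

def analyze(traces):
    # Analyze: turn a failure pattern into a proposed skill name.
    return traces[0].split("-")[1] if traces else None

def build(config, proposal):
    # Build: a Tier 1 change is purely additive, nothing is removed.
    return Config(skills=config.skills + [proposal])

def evolve(config, iterations=3):
    frontier = [config]
    for _ in range(iterations):
        best = max(frontier, key=lambda c: c.score)
        proposal = analyze(execute(best))
        if proposal is None:
            break                              # nothing left to dissolve
        candidate = build(best, proposal)
        candidate.score = best.score + 0.09    # Evaluate: illustrative delta
        frontier.append(candidate)             # Select: winners stay on the frontier
    return max(frontier, key=lambda c: c.score)
```

Each pass dissolves one weakness (a failure pattern) and coagulates one strength (a new skill on the frontier).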

Tiered Optimization

Four tiers of optimization targets, ordered by risk. The loop starts with the safest tier and escalates only when lower tiers plateau.

Tier 1 — Skills

Lowest risk, purely additive. New SKILL.md files that teach the agent patterns: search persistence, verification-first, error recovery. No existing config modified.

Tier 2 — Tool Allow/Deny Lists

Reversible. Constrain or expand which tools the agent can use. Enable grep for search-heavy tasks, deny write for analysis-only tasks.
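One plausible way to resolve such lists, as a sketch: the function name and the precedence (allow list applied before deny list) are assumptions, not documented framework behavior.

```python
def effective_tools(available, allow=None, deny=None):
    """Apply an optional allow list first, then an optional deny list."""
    tools = set(available)
    if allow is not None:
        tools &= set(allow)   # constrain to the allow list
    if deny is not None:
        tools -= set(deny)    # then remove denied tools
    return sorted(tools)
```

For an analysis-only task, `effective_tools(["grep", "read", "write"], deny=["write"])` leaves only the read-side tools; reverting is just dropping the lists.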

Tier 3 — System Prompt Sections

Moderate risk. Modifies system prompt segments. Prompt changes invalidate cached prefixes, so each change must prove its value against that cache cost.

Tier 4 — Prompt Ordering

Highest risk. Reorder prompt sections for cache efficiency. A single wrong reordering can degrade cache hit rate and increase latency across all tasks.

From the outermost tier (skills) to the innermost (prompt ordering), risk increases and change frequency decreases.
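The escalate-on-plateau policy might look like the following sketch; the tier names, plateau threshold, and window size are illustrative assumptions.

```python
TIERS = ["skills", "tool_lists", "prompt_sections", "prompt_ordering"]  # risk-ordered

def next_tier(current, recent_deltas, plateau=0.01, window=3):
    """Escalate to the next (riskier) tier only after `window`
    consecutive iterations with sub-threshold score improvement."""
    stalled = (len(recent_deltas) >= window
               and all(d < plateau for d in recent_deltas[-window:]))
    if stalled and current != TIERS[-1]:
        return TIERS[TIERS.index(current) + 1]
    return current
```

A single good iteration resets nothing here; only a full window of flat deltas unlocks the riskier knobs, which keeps the loop on Tier 1 for as long as skills keep paying off.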

Frontier Visualization

Each evolution iteration produces a configuration. Winners are selected by the frontier mechanism and persisted as git branches. The lineage traces back through parent configurations.

Iteration 1, Baseline: original configuration, no modifications. Score: 0.62 (parent: none, knob: none).

Iteration 3, Search Persistence: gained the search-persistence skill. Score: 0.71 (+0.09) (parent: iter-1, knob: skill).

Iteration 5, Prompt Recovery: improved the error-recovery prompt. Score: 0.78 (+0.07) (parent: iter-3, knob: prompt).
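Frontier selection and branch persistence could be sketched like this. The branch naming scheme (`evo/iter-N`), the entry dict shape, and the `dry_run` flag are all invented for illustration.

```python
import subprocess

def persist_frontier(frontier, top_n=3, dry_run=True):
    """Keep the top-N configs by score; each winner becomes a git branch,
    so lineage is simply the branch's commit history back to its parent."""
    winners = sorted(frontier, key=lambda e: e["score"], reverse=True)[:top_n]
    for entry in winners:
        branch = f"evo/iter-{entry['iter']}"
        if not dry_run:
            # Force-point the branch at the winning configuration's commit.
            subprocess.run(["git", "branch", "-f", branch, entry["commit"]],
                           check=True)
    return [f"evo/iter-{e['iter']}" for e in winners]
```

Losing configurations simply never get a branch; nothing is deleted, so any past winner can still be checked out and re-entered into the pool.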

Feedback Memory

The Proposer remembers what was tried before. Circular proposals are rejected. Only novel approaches or extensions of successful ones proceed.

proposer — iteration 5 analysis
proposer Analyzing 12 failure traces from iteration 4...
proposer Pattern: agent fails to run tests before reporting completion
Iteration 2 tried: "Always run test suite" prompt patch → rejected (0% delta)
Iteration 3 tried: verification-first skill → accepted (+9% delta)
-----
proposer Proposal: Extend verification skill with pre-commit hook pattern
builder Writing eval-skills/verification-v2/SKILL.md...
guardrail Regression gate: PASS (baseline tasks: 0.63 → 0.64)
guardrail Size limit: PASS (SKILL.md: 2.1KB ≤ 15KB)
frontier New entry: iter-5-verification-v2 (score: 0.73, parent: iter-3)
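A minimal sketch of such a memory, assuming exact-match deduplication and an "extensions must build on something that helped" rule; the real Proposer presumably matches prior attempts more fuzzily than this.

```python
class ProposalMemory:
    """Remembers prior proposals and their outcomes so the Proposer
    can reject circular proposals and extend only successful ones."""

    def __init__(self):
        self.history = {}  # proposal text -> observed score delta

    def record(self, proposal, delta):
        self.history[proposal] = delta

    def is_circular(self, proposal):
        # Exact repeat of something already tried.
        return proposal in self.history

    def admissible(self, proposal, extends=None):
        if self.is_circular(proposal):
            return False
        # An extension is only allowed off a proposal that actually helped.
        if extends is not None:
            return self.history.get(extends, 0.0) > 0.0
        return True
```

In the transcript above, the rejected "always run test suite" patch (0% delta) blocks its own repeats, while the +9% verification skill remains a valid base to extend.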

Guardrails

Each check, its threshold, and how it is enforced:

Regression Gate · score on baseline tasks must not decrease by >2% · automatic reject, no override.
Size Limits · SKILL.md ≤ 15KB, prompt patch ≤ 5KB · automatic reject if exceeded.
Cache Compat · prompt prefix hash must match for Tier 1–2 changes · warning for Tier 3+; block if latency delta >20%.
Semantic Check · LLM judge confirms prompt meaning is preserved · automatic reject on semantic drift.
Trace Audit · anti-gaming heuristics on all eval transcripts · flag and quarantine suspicious results.
Human Review · all Tier 3–4 changes require manual approval · gate before deployment, no auto-merge.
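The automatic portion of these checks might be wired together as below; the `change` dict shape and the byte-denominated size limits are illustrative assumptions that mirror the thresholds above.

```python
def run_guardrails(change, baseline_score, new_score):
    """Run the automatic gates; returns (passed, reasons-for-rejection)."""
    reasons = []
    # Regression gate: baseline score must not drop by more than 2%.
    if new_score < baseline_score * 0.98:
        reasons.append("regression gate: baseline score dropped >2%")
    # Size limits: 15KB for a SKILL.md, 5KB for a prompt patch.
    limit = 15_000 if change["kind"] == "skill" else 5_000
    if change["size_bytes"] > limit:
        reasons.append(f"size limit: {change['size_bytes']}B > {limit}B")
    # Tier 3-4 changes never auto-merge; a human must approve.
    if change["tier"] >= 3 and not change.get("human_approved"):
        reasons.append("tier 3+ requires manual approval")
    return (not reasons, reasons)
```

The semantic check and trace audit are LLM- and heuristic-driven and are omitted here; a real gate would chain them after these cheap deterministic checks.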

Lineage

EvoSkill V1

Sentient AGI

Three-agent evolutionary loop: Executor, Proposer, Skill Builder. Frontier-based selection on git branches. Cross-agent skill transferability—skills evolved for one agent transfer zero-shot to others.

Apache 2.0 · OfficeQA +7.3% · SealQA +12.1% · BrowseComp +5.3%

Hermes Self-Evolution

Nous Research

DSPy + GEPA (Genetic-Pareto Prompt Evolution, ICLR 2026 Oral) for reflective prompt mutation. Reads execution traces to understand why things fail. Four-tier optimization targets.

MIT License · ~$2–10/run · No GPU required · 3-example minimum