🔬 Automatic Skill Refiners for AI Coding Agents

Research brief — July 1, 2026 · 10 tools surveyed · Comparing automated approaches to improving agent skills (SKILL.md / CLAUDE.md / cursor rules)

Executive Summary

Tooling for automated skill optimization has grown rapidly in 2026. Three tiers have emerged:

Tier 1 (ready now): SkillOpt-Sleep, Arize Prompt Learning, Anthropic Skill Creator

Tier 2 (powerful but setup-heavy): SkillOpt (full), GEPA/gskill, Darwin-Skill, EvoSkill

Tier 3 (building blocks): TextGrad, DSPy, Trace2Skill, claudemd, agentmd

🏆 Tier 1: Ready to Use Today

SkillOpt-Sleep (Microsoft)

TOP PICK · 10,330 ★ (parent repo) · June 2026

Claude Code plugin · SkillOpt repo · arXiv paper

The deployment-time companion to SkillOpt. Harvests your Claude Code/Codex sessions, mines recurring tasks, replays them offline, and consolidates learned improvements into validated skill edits. Nothing changes until you approve.

How One "Night" Works

harvest ~/.claude transcripts → mine recurring tasks → replay offline → consolidate (reflect → bounded edit → GATE on held-out tasks) → stage proposal → (you) adopt or discard

Key Features

Claude Code plugin ready — /plugin marketplace add ./skillopt-sleep-plugin
Validation gate — edits only accepted when held-out score strictly improves
Nothing live until you adopt — /skillopt-sleep adopt with backup
Free dry-run — mock backend costs nothing, try the plumbing first
Experience replay — recall past similar tasks to strengthen nightly updates
Reported result: deficient skills went from 0.00 to 1.00 on held-out tasks within 1-2 nights
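
The validation gate is the part of this loop that is easiest to picture in code. The sketch below illustrates only the rule stated above (keep an edit only when the held-out score strictly improves). It is not SkillOpt-Sleep's implementation, and `gate_edit`, `score_task`, and the keyword-based toy scorer are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence, Tuple


def gate_edit(
    current_skill: str,
    candidate_skill: str,
    held_out_tasks: Sequence[str],
    score_task: Callable[[str, str], float],  # hypothetical: (skill, task) -> score
) -> Tuple[str, bool]:
    """Return (skill_to_keep, accepted) under a strict-improvement gate."""

    def mean_score(skill: str) -> float:
        return sum(score_task(skill, task) for task in held_out_tasks) / len(held_out_tasks)

    baseline = mean_score(current_skill)
    candidate = mean_score(candidate_skill)
    accepted = candidate > baseline  # a tie is rejected: the score must strictly improve
    return (candidate_skill if accepted else current_skill), accepted


def toy_score(skill: str, task: str) -> float:
    """Toy scorer (assumption): a task 'passes' if its keyword appears in the skill."""
    return 1.0 if task in skill else 0.0


kept, accepted = gate_edit(
    "Run pytest before committing.",
    "Run pytest and ruff before committing.",
    held_out_tasks=["pytest", "ruff"],
    score_task=toy_score,
)
```

In a real setup the scorer would replay held-out tasks with each skill version, which is where the API cost of a nightly run comes from.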

Quick Start

# In a terminal:
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
# Then, inside Claude Code:
/plugin marketplace add ./skillopt-sleep-plugin
/plugin install skillopt-sleep@skillopt-sleep
/skillopt-sleep dry-run   # preview what it would learn
/skillopt-sleep run       # full cycle, stages proposal
/skillopt-sleep adopt     # apply changes (with backup)

Arize Phoenix Prompt Learning

TOP PICK · 2026

Blog post · GitHub (Claude Code) · Best practices learned

Specifically built to optimize CLAUDE.md, .clinerules, and cursor rules. Uses an iterative prompt learning loop with LLM-as-judge evaluation — the closest tool to what you're already doing manually.

How It Works

run coding agent on task splits → LLM-as-judge evaluates results → PromptLearningOptimizer generates improved rulesets → iterate for multiple loops → track in Phoenix for observability
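
The shape of that loop can be sketched generically. The code below is a minimal illustration under stated assumptions, not the Arize API: `optimize_rules`, `run_agent`, `judge`, and `propose_rules` are hypothetical callables standing in for the coding agent, the LLM-as-judge, and the optimizer step, and the toy stand-ins exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple

Feedback = Tuple[str, bool, str]  # (task, passed, critique)


def optimize_rules(
    rules: str,
    tasks: Sequence[str],
    run_agent: Callable[[str, str], str],            # hypothetical: (rules, task) -> output
    judge: Callable[[str, str], Tuple[bool, str]],   # hypothetical: (task, output) -> verdict
    propose_rules: Callable[[str, List[Feedback]], str],
    loops: int = 3,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Evaluate, collect judge feedback, rewrite the rules, repeat."""
    history: List[Tuple[str, float]] = []
    for i in range(loops):
        feedback: List[Feedback] = []
        for task in tasks:
            output = run_agent(rules, task)
            passed, critique = judge(task, output)
            feedback.append((task, passed, critique))
        accuracy = sum(1 for _, passed, _ in feedback if passed) / len(feedback)
        history.append((rules, accuracy))
        if i < loops - 1:  # do not propose rules that would never be evaluated
            rules = propose_rules(rules, feedback)
    best_rules, _ = max(history, key=lambda item: item[1])
    return best_rules, history


# Toy stand-ins (assumptions) so the sketch runs without any model calls.
def toy_agent(rules: str, task: str) -> str:
    return rules  # pretend the agent's output simply reflects the rules


def toy_judge(task: str, output: str) -> Tuple[bool, str]:
    passed = task in output
    critique = "" if passed else f"mention {task}"
    return passed, critique


def toy_propose(rules: str, feedback: List[Feedback]) -> str:
    missing = [critique for _, passed, critique in feedback if not passed]
    return "\n".join([rules, *missing]).strip()


best, history = optimize_rules("", ["tests", "types"], toy_agent, toy_judge, toy_propose)
```

Recording `(rules, accuracy)` per loop is the part an observability dashboard would visualize; the sketch just keeps it in a list.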

Results

+10-15% accuracy on SWE-bench for Cline
+10% boost for Claude Code with no architecture changes
Has specific cookbooks for Claude Code and Cline
Full observability via Phoenix dashboard

Requires setting up eval tasks — more involved than SkillOpt-Sleep but gives you more control over what "good" means.

Anthropic Skill Creator (Official)

Official plugin

Plugin page · Blog post · SKILL.md

Anthropic's official plugin with four composable agents (Executor, Grader, Comparator, Analyzer) for data-driven skill refinement. Semi-automated — human-in-the-loop for approval.

Modes

Create — extracts workflow from conversation and builds a skill
Eval — defines test prompts and checks skill behavior
Improve — analyzes eval results and suggests targeted improvements
Benchmark — runs standardized assessments with variance analysis
Blind A/B comparison between skill versions
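
A blind A/B comparison is simple to sketch: hide which skill version produced each output, randomize presentation order, and tally the winners. The code below is an assumption-laden illustration, not the plugin's Comparator agent; `blind_ab` and `prefer_first` are hypothetical names, and the length-based grader is a toy stand-in for a real judge.

```python
import random
from typing import Callable, Dict, Sequence


def blind_ab(
    outputs_old: Sequence[str],
    outputs_new: Sequence[str],
    prefer_first: Callable[[str, str], bool],  # hypothetical grader: True if the first output is better
    seed: int = 0,
) -> Dict[str, int]:
    """Tally wins per skill version without revealing labels to the grader."""
    rng = random.Random(seed)
    wins = {"old": 0, "new": 0}
    for old, new in zip(outputs_old, outputs_new):
        pair = [("old", old), ("new", new)]
        rng.shuffle(pair)  # random order removes position bias
        first_is_better = prefer_first(pair[0][1], pair[1][1])
        winner = pair[0][0] if first_is_better else pair[1][0]
        wins[winner] += 1
    return wins


def longer_is_better(first: str, second: str) -> bool:
    """Toy grader (assumption): prefer the longer output."""
    return len(first) > len(second)


tally = blind_ab(["a", "bbbb"], ["aaa", "b"], longer_is_better)
```

The grader only ever sees two anonymous texts, so it cannot favor the newer version by label.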

🔧 Tier 2: Powerful But Requires Setup

SkillOpt (Full Framework)

10,330 ★ · May 2026

github.com/microsoft/SkillOpt · arXiv paper · pip install skillopt

The full offline optimizer. Treats skill documents as trainable parameters: it runs training epochs governed by a (mini-)batch size, a learning rate, and validation gates, and produces a compact best_skill.md artifact.
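
To make the training analogy concrete, here is a toy sketch of epochs over mini-batches with a bounded step and a validation gate. It is not SkillOpt's code or API. In particular, treating the "learning rate" as a cap on how many lines one update may change is an assumption made for illustration, and `train_skill`, `propose_edit`, `score`, and `max_changed_lines` are hypothetical names.

```python
import difflib
from typing import Callable, Sequence, Tuple


def changed_lines(old: str, new: str) -> int:
    """Count lines added or removed between two skill texts."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def train_skill(
    skill: str,
    train_tasks: Sequence[str],
    val_tasks: Sequence[str],
    propose_edit: Callable[[str, Sequence[str]], str],  # hypothetical LLM-backed editor
    score: Callable[[str, str], float],                 # hypothetical task scorer
    epochs: int = 2,
    batch_size: int = 2,
    max_changed_lines: int = 4,  # assumed stand-in for a "learning rate"
) -> Tuple[str, float]:
    def val_score(candidate: str) -> float:
        return sum(score(candidate, task) for task in val_tasks) / len(val_tasks)

    best, best_val = skill, val_score(skill)
    for _ in range(epochs):
        for start in range(0, len(train_tasks), batch_size):
            batch = train_tasks[start:start + batch_size]
            candidate = propose_edit(best, batch)
            if changed_lines(best, candidate) > max_changed_lines:
                continue  # step too large: skip this update
            candidate_val = val_score(candidate)
            if candidate_val > best_val:  # validation gate: strict improvement only
                best, best_val = candidate, candidate_val
    return best, best_val


# Toy stand-ins (assumptions): a skill "solves" a task if it has a line naming it.
def toy_edit(skill: str, batch: Sequence[str]) -> str:
    lines = skill.splitlines()
    return "\n".join(lines + [task for task in batch if task not in lines])


def toy_score(skill: str, task: str) -> float:
    return 1.0 if task in skill.splitlines() else 0.0


best_skill, best_val = train_skill("base", ["a", "b", "c"], ["a", "c"], toy_edit, toy_score)
```

The returned text plays the role of the best_skill.md artifact: the best-validated version seen during training, not necessarily the last candidate proposed.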

Results

Best or tied-best on all 52 evaluated cells (model × benchmark × harness)
On GPT-5.5: +23.5 pts direct chat, +24.8 Codex, +19.1 Claude Code
Skills transfer across models and harnesses

You need to define benchmarks or tasks yourself. That is more work than the Sleep plugin, but it gives you more control.

达尔文.skill (Darwin-Skill) 2.0

4,490 ★ · Apr 2026

github.com/alchaincyf/darwin-skill · npx skills add alchaincyf/darwin-skill

Inspired by Karpathy's autoresearch. Evaluates skills on 9 quality dimensions with human-in-the-loop checkpoints. Integrated SkillOpt's validation gate in v2.0.

Key Differences

Human-in-the-loop by design — 3-layer checkpoint system
9-dimension rubric: failure encoding, actionable specificity, high-risk blacklist, etc.
Anti-pattern blacklist: prevents AI from grading its own work (LLM self-eval = 46.4% accurate)
Early stopping: auto-stops if gain < 1 point per round
Results: 80.8 → 91.65 on real skills
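
The early-stopping rule above can be sketched in a few lines. This is an illustration of the stated threshold (stop when a round gains less than 1 point), not Darwin-Skill's code. `refine_with_early_stop` and `improve_round` are hypothetical names, and discarding the sub-threshold round is an assumption; the scores in the example are made up.

```python
from typing import Callable, List


def refine_with_early_stop(
    initial_score: float,
    improve_round: Callable[[float], float],  # hypothetical: runs one round, returns the new score
    min_gain: float = 1.0,
    max_rounds: int = 10,
) -> List[float]:
    """Run improvement rounds until one gains less than `min_gain` points."""
    scores = [initial_score]
    for _ in range(max_rounds):
        new_score = improve_round(scores[-1])
        if new_score - scores[-1] < min_gain:
            break  # gain too small: stop and discard this round (assumption)
        scores.append(new_score)
    return scores


# Illustrative, made-up round scores: the third round gains only 0.5 points.
made_up_rounds = iter([85.0, 90.0, 90.5, 95.0])
trajectory = refine_with_early_stop(80.0, lambda _previous: next(made_up_rounds))
```

Here the loop stops after the third round and never runs the fourth, which is the point of the rule: it saves evaluation cost once returns diminish.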

Best for skills where you have strong aesthetic/quality opinions.

GEPA / gskill

2026

GEPA repo · gskill blog · gskill guide

Genetic-Pareto evolutionary search for skill optimization. Uses SWE-smith to auto-generate verifiable tasks from any GitHub repo. 90× cheaper than RL approaches.

Maintains a Pareto frontier of candidate skills
Skills transfer across models (train on cheap, deploy on expensive)
Improved Jinja repo bug-fixing: 55% → 82% resolve rate
Now integrated into DSPy
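
A Pareto frontier keeps every candidate that no other candidate beats across the board, instead of collapsing to one average score. The sketch below uses the standard dominance definition over per-task scores. GEPA's actual selection logic may differ in detail, and `dominates`, `pareto_frontier`, and the example scores are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Scores = Tuple[float, ...]  # one score per task, in a fixed task order


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(candidates: Dict[str, Scores]) -> Set[str]:
    """Names of the candidates that no other candidate dominates."""
    return {
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Made-up scores for three skill variants on two tasks.
frontier = pareto_frontier({
    "skill_v1": (0.9, 0.2),  # strong on task 1
    "skill_v2": (0.4, 0.8),  # strong on task 2
    "skill_v3": (0.3, 0.7),  # worse than skill_v2 on both tasks
})
```

Keeping both `skill_v1` and `skill_v2` preserves specialists that an average-based ranking might drop, which gives the evolutionary search more diverse parents to mutate.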

EvoSkill (Sentient AGI)

2026

GitHub · arXiv paper

Failure-driven evolutionary skill discovery. Operates on skill folders compatible with Claude Code, Codex, OpenCode. Model stays frozen; only skills evolve.

Skills materialized as structured folders (instructions + trigger metadata + helper scripts)
Retained only if they improve held-out validation
Maintains Pareto frontier of agent programs

🧱 Tier 3: Building Blocks & Supporting Tools

TextGrad (Stanford)

GitHub · Stanford HAI

"AutoGrad for Text" — PyTorch-like API for optimizing prompts using textual feedback. General purpose, not skill-specific. Used as baseline in SkillOpt comparisons.

DSPy (Stanford NLP)

dspy.ai · GitHub · GEPA integration

Declarative LM programs with automatic prompt optimization (MIPROv2, COPRO). You already have this as a Hermes skill, and GEPA is now integrated. It is a building block, not a ready-made skill refiner.

Trace2Skill

arXiv paper

Many-to-one trajectory consolidation. Merges all trajectory patches simultaneously (order-independent). Works with 35B+ open-source models.

claudemd + agentmd

pip install claudemd · pip install agentmd-gen

CLAUDE.md lifecycle management (generate, staleness check, update, lint). agentmd generates context files for all agent platforms. Not optimization, but useful supporting tools.

Full Comparison Matrix

Tool	Auto?	Claude Code?	Needs benchmarks?	Cost	Best for
SkillOpt-Sleep	Yes (review before adopt)	Native plugin	No (uses your sessions)	API budget for replays	Nightly improvement of daily-use skills
Arize Prompt Learning	Yes	Yes	Yes (task splits)	API + Phoenix setup	Measurable accuracy gains on coding tasks
Anthropic Skill Creator	Semi (human approval)	Native plugin	No (uses eval prompts)	Minimal	Iterative refinement with A/B testing
SkillOpt (full)	Yes	Yes	Yes (6 built-in)	Significant API budget	Maximum lift on specific benchmarks
Darwin-Skill	Semi (checkpoints)	Yes (skills.sh)	No (rubric-based)	API for evals	Quality-focused skills with aesthetic criteria
GEPA/gskill	Yes	Yes	Auto-generated from repos	90× cheaper than RL	Repo-specific coding skills
EvoSkill	Yes	Yes	No (failure-driven)	API for rollouts	Discovering new skills from failures
Your Manual Refiner	No (manual)	Hermes skill	No (uses transcripts)	Time only	Catching domain-specific preference drift

🎯 Recommendation for Your Pipeline

Start with SkillOpt-Sleep for nightly skill consolidation, then add Arize Prompt Learning for measurable accuracy tracking on your content pipeline. Keep your manual refiner for catching things automated tools miss.

Skill-Specific Strategy

📊 Content Pipeline (Infographics)

SkillOpt-Sleep — high-frequency, lots of trajectory data from Obsidian KG sessions
Arize Prompt Learning — define "good infographic" as eval criteria, track improvement over time
Your "ask for changes" feedback = perfect training signal for the nightly cycle

🎬 Video Editing (Remotion)

Darwin-Skill — human checkpoints catch subtle quality/aesthetic issues
Anthropic Skill Creator — A/B compare video outputs between skill versions
Less frequent use = fewer trajectories for SkillOpt-Sleep to learn from

Hybrid Approach (Recommended)

Install SkillOpt-Sleep as Claude Code plugin for nightly consolidation of all skills
Keep your manual refiner for high-value skills where domain expertise matters
Add Arize Phoenix when you want measurable tracking of skill quality over time
Use Darwin-Skill's rubric as a quality checklist when manually reviewing proposals

📄 Academic Papers

Paper	Key Idea	Link
SkillOpt	Text-space optimizer with deep-learning discipline	arXiv 2605.23904
SkillLens	Skill quality measurement (9-dimension rubric)	arXiv 2605.23899
EvoSkill	Failure-driven evolutionary skill discovery	arXiv 2603.02766
Trace2Skill	Many-to-one trajectory consolidation	arXiv 2603.25158
TextGrad	AutoGrad for text — gradient-style prompt optimization	GitHub
Codified Context	Tiered context architecture for large projects	arXiv 2602.20478
Beyond the Prompt	Empirical study of Cursor rules effectiveness	arXiv 2512.18925

🔗 All Links

Microsoft SkillOpt (10.3K ★) — core optimizer + Sleep plugin
SkillOpt-Sleep Claude Code plugin
SkillOpt project page
Microsoft Research blog post
VentureBeat coverage
Arize Prompt Learning (Claude Code)
Arize blog post
Arize: CLAUDE.md best practices
Darwin-Skill 2.0 (4.5K ★)
GEPA/gskill
EvoSkill (Sentient AGI)
GBrain (24.8K ★) — broader knowledge brain
GBrain Evals — standard benchmark suite
TextGrad (Stanford)
DSPy
claudemd — CLAUDE.md lifecycle management
agentmd — multi-platform context file generator