Research brief · July 1, 2026 · 10 tools surveyed · Comparing automated approaches to improving agent skills (SKILL.md / CLAUDE.md / cursor rules)
The skill optimization space has exploded in 2026. Three tiers have emerged:
Claude Code plugin · SkillOpt repo · arXiv paper
SkillOpt-Sleep is the deployment-time companion to SkillOpt. It harvests your Claude Code/Codex sessions, mines recurring tasks, replays them offline, and consolidates learned improvements into validated skill edits. Nothing changes until you approve.
    git clone https://github.com/microsoft/SkillOpt.git
    cd SkillOpt
    # In Claude Code:
    /plugin marketplace add ./skillopt-sleep-plugin
    /plugin install skillopt-sleep@skillopt-sleep
    /skillopt-sleep dry-run   # preview what it would learn
    /skillopt-sleep run       # full cycle, stages proposal
    /skillopt-sleep adopt     # apply changes (with backup)
Blog post · GitHub (Claude Code) · Best practices learned
Arize Prompt Learning is built specifically to optimize CLAUDE.md, .clinerules, and cursor rules. It uses an iterative prompt learning loop with LLM-as-judge evaluation, which makes it the closest tool to what you're already doing manually.
Requires setting up eval tasks: more involved than SkillOpt-Sleep, but it gives you more control over what "good" means.
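To make the iterative loop with LLM-as-judge evaluation concrete, here is a minimal sketch. It is not Arize's implementation or API: `judge_stub` is a hypothetical deterministic stand-in for an LLM judge, and the "rewrite" step simply appends the critiques, where a real loop would ask a model to revise the rules file.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], Tuple[float, str]]


def judge_stub(rules: str, task: str) -> Tuple[float, str]:
    """Hypothetical stand-in for an LLM judge: returns (score, critique)."""
    # Toy criterion: the rules pass a task if they mention its keyword.
    if task.lower() in rules.lower():
        return 1.0, ""
    return 0.0, f"Add guidance about {task}."


def prompt_learning_loop(
    rules: str, tasks: List[str], judge: Judge, max_rounds: int = 3
) -> Tuple[str, List[float]]:
    """Score the rules on every task, fold critiques back in, repeat."""
    history: List[float] = []
    for _ in range(max_rounds):
        results = [judge(rules, task) for task in tasks]
        history.append(sum(score for score, _ in results) / len(results))
        critiques = [critique for _, critique in results if critique]
        if not critiques:
            break  # the judge has nothing left to complain about
        # Assumption: appending critiques stands in for an LLM rewrite step.
        rules = rules + "\n" + "\n".join(f"- {c}" for c in critiques)
    return rules, history


rules, history = prompt_learning_loop(
    "Use type hints.", ["testing", "logging"], judge_stub
)
print(history)
```

The eval tasks are the part you have to supply yourself; the list passed as `tasks` is where your definition of "good" enters the loop.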
Plugin page · Blog post · SKILL.md
Skill Creator is Anthropic's official plugin, with four composable agents (Executor, Grader, Comparator, Analyzer) for data-driven skill refinement. It is semi-automated, with a human in the loop for approval.
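One way to picture how four such agents could compose is the sketch below. The role of each stage is inferred from its name only; the function signatures, the `Run` type, and the toy skills are all hypothetical and do not reflect the plugin's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Run:
    prompt: str
    output: str


def executor(skill: Callable[[str], str], prompts: List[str]) -> List[Run]:
    """Run one skill variant over every eval prompt."""
    return [Run(p, skill(p)) for p in prompts]


def grader(runs: List[Run], check: Callable[[Run], bool]) -> float:
    """Fraction of runs that pass the check."""
    return sum(check(r) for r in runs) / len(runs)


def comparator(score_a: float, score_b: float) -> str:
    """A/B verdict between two skill variants."""
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


def analyzer(winner: str, score_a: float, score_b: float) -> str:
    """Summarize the result; the final decision stays with a human."""
    return f"winner={winner} (A={score_a:.2f}, B={score_b:.2f}); awaiting human approval"


# Toy variants: B ends every answer with "!", which the check rewards.
prompts = ["hello", "world"]
ends_with_bang = lambda run: run.output.endswith("!")
score_a = grader(executor(lambda p: p.upper(), prompts), ends_with_bang)
score_b = grader(executor(lambda p: p.upper() + "!", prompts), ends_with_bang)
report = analyzer(comparator(score_a, score_b), score_a, score_b)
print(report)
```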
github.com/microsoft/SkillOpt · arXiv paper · pip install skillopt
SkillOpt is the full offline optimizer. It treats skill documents as trainable parameters and runs epochs with a (mini-)batch size, learning rate, and validation gates. It produces a compact best_skill.md artifact.
You need to define benchmarks/tasks: more work than the Sleep plugin, but more control.
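The "skill document as trainable parameters" framing can be sketched as a toy training loop. Everything here is illustrative: `score`, `propose_edits`, and `optimize` are hypothetical names, not the skillopt package's API, and reading `learning_rate` as a cap on accepted edits per step is an assumption made for this example.

```python
from typing import List, Set, Tuple


def score(skill: Set[str], tasks: List[str]) -> float:
    """Toy metric: fraction of tasks whose keyword the skill covers."""
    return sum(task in skill for task in tasks) / len(tasks)


def propose_edits(skill: Set[str], batch: List[str]) -> List[str]:
    """Hypothetical proposer: one candidate rule per failed task in the batch."""
    return [task for task in batch if task not in skill]


def optimize(
    skill: Set[str],
    train: List[str],
    val: List[str],
    epochs: int = 2,
    batch_size: int = 2,
    learning_rate: int = 1,  # assumption: max edits applied per step
) -> Tuple[Set[str], float]:
    best = set(skill)
    best_val = score(best, val)
    for _ in range(epochs):
        for start in range(0, len(train), batch_size):
            batch = train[start:start + batch_size]
            edits = propose_edits(best, batch)[:learning_rate]
            candidate = best | set(edits)
            candidate_val = score(candidate, val)
            # Validation gate: keep an edit only if held-out score improves.
            if candidate_val > best_val:
                best, best_val = candidate, candidate_val
    return best, best_val


best_skill, best_val = optimize(
    set(), train=["tests", "lint", "docs", "types"], val=["tests", "docs"]
)
print(sorted(best_skill), best_val)
```

In this toy run the gate rejects the `lint` and `types` edits because they do not move the validation score, which is one way a gated loop keeps the final artifact compact.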
github.com/alchaincyf/darwin-skill · npx skills add alchaincyf/darwin-skill
Darwin-Skill is inspired by Karpathy's autoresearch. It evaluates skills on 9 quality dimensions with human-in-the-loop checkpoints, and integrated SkillOpt's validation gate in v2.0.
Best for skills where you have strong aesthetic/quality opinions.
GEPA repo · gskill blog · gskill guide
GEPA/gskill runs Genetic-Pareto evolutionary search for skill optimization. It uses SWE-smith to auto-generate verifiable tasks from any GitHub repo. 90× cheaper than RL approaches.
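The "Pareto" half of the name can be illustrated with a small selection step: keep every candidate skill that no other candidate beats across all tasks. This is a generic Pareto-front sketch under assumed per-task scores, not GEPA's code or API; the candidate names and numbers are made up for the example.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """a dominates b if it is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-task scores for three candidate skill files.
scores = {
    "skill_v1": [1.0, 0.0, 0.5],
    "skill_v2": [0.0, 1.0, 0.5],
    "skill_v3": [0.0, 0.0, 0.5],
}
front = pareto_front(scores)
print(front)
```

Keeping both `skill_v1` and `skill_v2` preserves candidates that each win on a different task, so later mutation or merging has complementary material to work with, where a single averaged score would treat them as interchangeable.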
EvoSkill performs failure-driven evolutionary skill discovery. It operates on skill folders compatible with Claude Code, Codex, and OpenCode. The model stays frozen; only the skills evolve.
TextGrad is "AutoGrad for Text": a PyTorch-like API for optimizing prompts using textual feedback. It is general purpose, not skill-specific, and is used as a baseline in SkillOpt comparisons.
dspy.ai · GitHub · GEPA integration
DSPy provides declarative LM programs with automatic prompt optimization (MIPROv2, COPRO). You have this as a Hermes skill. GEPA is now integrated. It is a building block, not a ready-made tool.
Trace2Skill performs many-to-one trajectory consolidation. It merges all trajectory patches simultaneously (order-independent) and works with 35B+ open-source models.
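Order independence is the property worth checking in any many-to-one merge. The sketch below shows one simple way to get it: count how many trajectories support each lesson and sort the survivors, so the input order cannot affect the output. It is an illustration of the property only, not the Trace2Skill algorithm; `consolidate` and its `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import List, Sequence


def consolidate(patches: Sequence[Sequence[str]], min_support: int = 2) -> List[str]:
    """Merge all patches at once; keep lessons seen in enough trajectories."""
    # set(patch) makes each trajectory vote at most once per lesson.
    support = Counter(lesson for patch in patches for lesson in set(patch))
    # Sorting gives a canonical order, independent of how patches arrived.
    return sorted(lesson for lesson, n in support.items() if n >= min_support)


# Hypothetical lessons extracted from three trajectories.
patches = [
    ["pin deps", "run tests"],
    ["run tests"],
    ["pin deps", "use venv"],
]
merged = consolidate(patches)
print(merged)

# Every ordering of the same patches yields the same consolidated skill.
same_for_all_orders = all(
    consolidate(list(order)) == merged for order in permutations(patches)
)
print(same_for_all_orders)
```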
pip install claudemd · pip install agentmd-gen
claudemd handles CLAUDE.md lifecycle management (generate, staleness check, update, lint). agentmd generates context files for all agent platforms. Neither is an optimizer, but both are useful supporting tools.
| Tool | Auto? | Claude Code? | Needs benchmarks? | Cost | Best for |
|---|---|---|---|---|---|
| SkillOpt-Sleep | Yes (review before adopt) | Native plugin | No (uses your sessions) | API budget for replays | Nightly improvement of daily-use skills |
| Arize Prompt Learning | Yes | Yes | Yes (task splits) | API + Phoenix setup | Measurable accuracy gains on coding tasks |
| Anthropic Skill Creator | Semi (human approval) | Native plugin | No (uses eval prompts) | Minimal | Iterative refinement with A/B testing |
| SkillOpt (full) | Yes | Yes | Yes (6 built-in) | Significant API budget | Maximum lift on specific benchmarks |
| Darwin-Skill | Semi (checkpoints) | Yes (skills.sh) | No (rubric-based) | API for evals | Quality-focused skills with aesthetic criteria |
| GEPA/gskill | Yes | Yes | Auto-generated from repos | 90× cheaper than RL | Repo-specific coding skills |
| EvoSkill | Yes | Yes | No (failure-driven) | API for rollouts | Discovering new skills from failures |
| Your Manual Refiner | No (manual) | Hermes skill | No (uses transcripts) | Time only | Catching domain-specific preference drift |
| Paper | Key Idea | Link |
|---|---|---|
| SkillOpt | Text-space optimizer with deep-learning discipline | arXiv 2605.23904 |
| SkillLens | Skill quality measurement (9-dimension rubric) | arXiv 2605.23899 |
| EvoSkill | Failure-driven evolutionary skill discovery | arXiv 2603.02766 |
| Trace2Skill | Many-to-one trajectory consolidation | arXiv 2603.25158 |
| TextGrad | AutoGrad for text: gradient-style prompt optimization | GitHub |
| Codified Context | Tiered context architecture for large projects | arXiv 2602.20478 |
| Beyond the Prompt | Empirical study of Cursor rules effectiveness | arXiv 2512.18925 |