🔬 Automatic Skill Refiners for AI Coding Agents

Research brief · July 1, 2026 · 10 tools surveyed · Comparing automated approaches to improving agent skills (SKILL.md / CLAUDE.md / cursor rules)

Executive Summary

The skill optimization space has exploded in 2026. Three tiers have emerged:

Tier 1 (Ready now): SkillOpt-Sleep, Arize Prompt Learning, Anthropic Skill Creator
Tier 2 (Powerful but setup-heavy): SkillOpt (full), GEPA/gskill, Darwin-Skill, EvoSkill
Tier 3 (Building blocks): TextGrad, DSPy, Trace2Skill, claudemd, agentmd

🏆 Tier 1: Ready to Use Today

SkillOpt-Sleep (Microsoft)

TOP PICK · 10,330 ★ (parent repo) · June 2026

Claude Code plugin · SkillOpt repo · arXiv paper

The deployment-time companion to SkillOpt. Harvests your Claude Code/Codex sessions, mines recurring tasks, replays them offline, and consolidates learned improvements into validated skill edits. Nothing changes until you approve.

How One "Night" Works

harvest ~/.claude transcripts → mine recurring tasks → replay offline → consolidate (reflect → bounded edit → GATE on held-out tasks) → stage proposal → (you) adopt or discard
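The mining and gating steps in the cycle above can be sketched as follows. This is a minimal illustration, not SkillOpt-Sleep's actual code: `mine_recurring`, `gate`, `toy_score`, the `min_count` threshold, and the `margin` parameter are all hypothetical names, and the toy scorer stands in for a real offline replay.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence


def mine_recurring(tasks: Sequence[str], min_count: int = 2) -> list[str]:
    """Keep tasks seen at least `min_count` times (hypothetical heuristic)."""
    counts = Counter(t.strip().lower() for t in tasks)
    return [task for task, n in counts.items() if n >= min_count]


def gate(
    baseline_skill: str,
    candidate_skill: str,
    held_out: Sequence[str],
    score: Callable[[str, str], float],
    margin: float = 0.0,
) -> bool:
    """Accept the candidate only if it beats the baseline on held-out tasks."""
    base = mean(score(baseline_skill, t) for t in held_out)
    cand = mean(score(candidate_skill, t) for t in held_out)
    return cand > base + margin


def toy_score(skill: str, task: str) -> float:
    """Stand-in for a replay: 1.0 if the skill mentions the task's first word."""
    return 1.0 if task.split()[0] in skill else 0.0


harvested = ["Fix lint errors", "fix lint errors", "rename module", "Fix lint errors "]
recurring = mine_recurring(harvested)
staged = gate(
    "Always run tests.",
    "Always run tests. fix lint before commit.",
    held_out=["fix lint errors"],
    score=toy_score,
)
```

Scoring only on held-out tasks is what keeps an edit from being accepted merely because it memorized the sessions it was derived from.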

Key Features

Quick Start

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
# In Claude Code:
/plugin marketplace add ./skillopt-sleep-plugin
/plugin install skillopt-sleep@skillopt-sleep
/skillopt-sleep dry-run   # preview what it would learn
/skillopt-sleep run       # full cycle, stages proposal
/skillopt-sleep adopt     # apply changes (with backup)

Arize Phoenix Prompt Learning

TOP PICK 2026

Blog post · GitHub (Claude Code) · Best practices learned

Specifically built to optimize CLAUDE.md, .clinerules, and cursor rules. Uses an iterative prompt learning loop with LLM-as-judge evaluation, which makes it the closest tool to what you're already doing manually.

How It Works

run coding agent on task splits → LLM-as-judge evaluates results → PromptLearningOptimizer generates improved rulesets → iterate for multiple loops → track in Phoenix for observability
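The shape of that loop can be sketched in a few lines. This is an assumption-laden illustration rather than the Arize API: `prompt_learning_loop`, `Judge`, and `toy_judge` are hypothetical, and the "optimizer" step here simply appends judge feedback as new rules, where a real optimizer would rewrite the ruleset with an LLM.

```python
from typing import Callable

# (ruleset, task) -> (passed, feedback); hypothetical judge signature
Judge = Callable[[str, str], tuple[bool, str]]


def prompt_learning_loop(ruleset: str, train: list[str], test: list[str],
                         judge: Judge, loops: int = 3) -> tuple[str, list[float]]:
    """Iteratively fold judge feedback on failed train tasks into the ruleset."""
    history = []
    for _ in range(loops):
        feedback = [fb for ok, fb in (judge(ruleset, t) for t in train) if not ok]
        # Stand-in for the optimizer step: append each new piece of feedback as a rule.
        for fb in dict.fromkeys(feedback):
            if fb not in ruleset:
                ruleset += f"\n- {fb}"
        # Track accuracy on the untouched test split after every loop.
        passed = sum(judge(ruleset, t)[0] for t in test)
        history.append(passed / len(test))
    return ruleset, history


def toy_judge(ruleset: str, task: str) -> tuple[bool, str]:
    """Toy judge: a task passes once the ruleset has a rule naming its keyword."""
    keyword = task.split(":")[0]
    return keyword in ruleset, f"Handle {keyword} tasks explicitly"


rules, accuracy = prompt_learning_loop(
    "# Rules",
    train=["sql: add index", "css: fix grid"],
    test=["sql: tune query"],
    judge=toy_judge,
    loops=2,
)
```

Keeping the test split out of the feedback step is the design choice that makes the per-loop accuracy a meaningful number to track.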

Results

Requires setting up eval tasks. That is more involved than SkillOpt-Sleep, but it gives you more control over what "good" means.

Anthropic Skill Creator (Official)

Official plugin

Plugin page · Blog post · SKILL.md

Anthropic's official plugin with four composable agents (Executor, Grader, Comparator, Analyzer) for data-driven skill refinement. Semi-automated, with a human in the loop for approval.

Modes

🔧 Tier 2: Powerful But Requires Setup

SkillOpt (Full Framework)

10,330 ★ · May 2026

github.com/microsoft/SkillOpt · arXiv paper · pip install skillopt

The full offline optimizer. Treats skill documents as trainable parameters: it runs epochs with a (mini-)batch size, a learning rate, and validation gates. Produces a compact best_skill.md artifact.
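To make the training analogy concrete, here is a rough sketch of a mini-batch loop over a skill document. It is not the `skillopt` package's API: `train_skill`, `toy_propose`, and `toy_evaluate` are hypothetical, and reading the "learning rate" as a cap on how many lines one step may add is an assumption made for illustration.

```python
import random
from typing import Callable

Skill = list[str]  # a skill document as a list of rule lines


def train_skill(skill: Skill, train: list[str], val: list[str],
                propose: Callable[[Skill, list[str]], list[str]],
                evaluate: Callable[[Skill, list[str]], float],
                epochs: int = 2, batch_size: int = 2, lr: int = 1,
                seed: int = 0) -> Skill:
    """Mini-batch loop: add at most `lr` proposed lines per step, keep gated wins."""
    rng = random.Random(seed)
    best, best_score = list(skill), evaluate(skill, val)
    for _ in range(epochs):
        order = train[:]
        rng.shuffle(order)  # new batch composition every epoch
        for i in range(0, len(order), batch_size):
            batch = order[i:i + batch_size]
            new_lines = [ln for ln in propose(best, batch) if ln not in best][:lr]
            candidate = best + new_lines
            score = evaluate(candidate, val)
            if score > best_score:  # validation gate: reject non-improving edits
                best, best_score = candidate, score
    return best


def toy_propose(skill: Skill, batch: list[str]) -> list[str]:
    """Toy proposer: suggest one rule per task in the batch."""
    return [f"handle {task}" for task in batch]


def toy_evaluate(skill: Skill, tasks: list[str]) -> float:
    """Toy metric: fraction of tasks that have a matching rule."""
    return sum(f"handle {t}" in skill for t in tasks) / len(tasks)


best_skill = train_skill(["be concise"], train=["a", "b", "c"], val=["a", "b"],
                         propose=toy_propose, evaluate=toy_evaluate)
```

Because the gate only admits edits that raise the validation score, a rule for task "c" (absent from the validation set) is never kept, which is how the final artifact stays compact.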

Results

You need to define benchmarks/tasks. More work than the Sleep plugin, but more control.

达尔文.skill (Darwin-Skill) 2.0

4,490 ★ · Apr 2026

github.com/alchaincyf/darwin-skill · npx skills add alchaincyf/darwin-skill

Inspired by Karpathy's autoresearch. Evaluates skills on 9 quality dimensions with human-in-the-loop checkpoints. Integrated SkillOpt's validation gate in v2.0.

Key Differences

Best for skills where you have strong aesthetic/quality opinions.

GEPA / gskill

2026

GEPA repo · gskill blog · gskill guide

Genetic-Pareto evolutionary search for skill optimization. Uses SWE-smith to auto-generate verifiable tasks from any GitHub repo. 90× cheaper than RL approaches.
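The "Pareto" part of the name refers to keeping every candidate that is not beaten across the board by another one. The sketch below shows the textbook non-dominated filter on per-task scores. GEPA's own selection rule may differ in detail, and the candidate names and scores are made up for illustration.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good as `b` on every task and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, list[float]]) -> list[str]:
    """Return candidates whose score vectors no other candidate dominates."""
    return [
        name for name, vec in scores.items()
        if not any(dominates(other, vec) for o, other in scores.items() if o != name)
    ]


# Hypothetical per-task scores for three skill variants on three tasks.
candidates = {
    "skill_v1": [0.9, 0.2, 0.5],
    "skill_v2": [0.4, 0.8, 0.5],
    "skill_v3": [0.3, 0.2, 0.5],  # never better than v1 on any task
}
front = pareto_front(candidates)
```

Keeping both `skill_v1` and `skill_v2` preserves variants that are strong on different tasks, so later mutations can combine their strengths instead of collapsing onto a single best average.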

EvoSkill (Sentient AGI)

2026

GitHub · arXiv paper

Failure-driven evolutionary skill discovery. Operates on skill folders compatible with Claude Code, Codex, OpenCode. Model stays frozen; only skills evolve.

🧱 Tier 3: Building Blocks & Supporting Tools

TextGrad (Stanford)

GitHub · Stanford HAI

"AutoGrad for Text": a PyTorch-like API for optimizing prompts using textual feedback. General purpose, not skill-specific. Used as a baseline in SkillOpt comparisons.

DSPy (Stanford NLP)

dspy.ai · GitHub · GEPA integration

Declarative LM programs with automatic prompt optimization (MIPROv2, COPRO). You already have this as a Hermes skill. GEPA is now integrated. A building block, not a ready-made tool.

Trace2Skill

arXiv paper

Many-to-one trajectory consolidation. Merges all trajectory patches simultaneously (order-independent). Works with 35B+ open-source models.
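One way to see why a simultaneous merge is order-independent: if every patch is reduced to votes and the votes are summed before anything is applied, the aggregation is commutative. The sketch below illustrates that property only. It is not the paper's algorithm, and the `Patch` vote format and `merge_patches` are hypothetical.

```python
from collections import Counter
from itertools import permutations

Patch = dict[str, int]  # rule text -> +1 (add) or -1 (remove); hypothetical format


def merge_patches(skill: frozenset[str], patches: list[Patch]) -> frozenset[str]:
    """Sum every patch's votes in one pass, then apply the net result."""
    votes = Counter()
    for patch in patches:
        votes.update(patch)  # Counter.update adds counts, so order cannot matter
    added = {rule for rule, v in votes.items() if v > 0}
    removed = {rule for rule, v in votes.items() if v < 0}
    return frozenset((skill | added) - removed)


base = frozenset({"use tabs"})
patches = [
    {"run tests first": +1},
    {"use tabs": -1, "use spaces": +1},
    {"run tests first": +1, "use spaces": +1},
]
merged = merge_patches(base, patches)
# Every ordering of the patches yields the same merged skill.
order_independent = all(
    merge_patches(base, list(p)) == merged for p in permutations(patches)
)
```

A sequential apply-one-patch-at-a-time scheme would not give this guarantee, since an early patch can change what a later one sees.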

claudemd + agentmd

pip install claudemd · pip install agentmd-gen

CLAUDE.md lifecycle management (generate, staleness check, update, lint). agentmd generates context files for all agent platforms. Not optimization, but useful supporting tools.

Full Comparison Matrix

Tool | Auto? | Claude Code? | Needs benchmarks? | Cost | Best for
SkillOpt-Sleep | Yes (review before adopt) | Native plugin | No (uses your sessions) | API budget for replays | Nightly improvement of daily-use skills
Arize Prompt Learning | Yes | Yes | Yes (task splits) | API + Phoenix setup | Measurable accuracy gains on coding tasks
Anthropic Skill Creator | Semi (human approval) | Native plugin | No (uses eval prompts) | Minimal | Iterative refinement with A/B testing
SkillOpt (full) | Yes | Yes | Yes (6 built-in) | Significant API budget | Maximum lift on specific benchmarks
Darwin-Skill | Semi (checkpoints) | Yes (skills.sh) | No (rubric-based) | API for evals | Quality-focused skills with aesthetic criteria
GEPA/gskill | Yes | Yes | Auto-generated from repos | 90× cheaper than RL | Repo-specific coding skills
EvoSkill | Yes | Yes | No (failure-driven) | API for rollouts | Discovering new skills from failures
Your Manual Refiner | No (manual) | Hermes skill | No (uses transcripts) | Time only | Catching domain-specific preference drift

🎯 Recommendation for Your Pipeline

Start with SkillOpt-Sleep for nightly skill consolidation, then add Arize Prompt Learning for measurable accuracy tracking on your content pipeline. Keep your manual refiner for catching things automated tools miss.

Skill-Specific Strategy

📊 Content Pipeline (Infographics)

🎬 Video Editing (Remotion)

Hybrid Approach (Recommended)

  1. Install SkillOpt-Sleep as a Claude Code plugin for nightly consolidation of all skills
  2. Keep your manual refiner for high-value skills where domain expertise matters
  3. Add Arize Phoenix when you want measurable tracking of skill quality over time
  4. Use Darwin-Skill's rubric as a quality checklist when manually reviewing proposals

📄 Academic Papers

Paper | Key Idea | Link
SkillOpt | Text-space optimizer with deep-learning discipline | arXiv 2605.23904
SkillLens | Skill quality measurement (9-dimension rubric) | arXiv 2605.23899
EvoSkill | Failure-driven evolutionary skill discovery | arXiv 2603.02766
Trace2Skill | Many-to-one trajectory consolidation | arXiv 2603.25158
TextGrad | AutoGrad for text: gradient-style prompt optimization | GitHub
Codified Context | Tiered context architecture for large projects | arXiv 2602.20478
Beyond the Prompt | Empirical study of Cursor rules effectiveness | arXiv 2512.18925

🔗 All Links