Creates, modifies, improves, tests, and benchmarks Claude Code skills using category-aware design, gotchas-driven development, eval prompts, and performance analysis.
From skill-creator-pro. Install with:

```
npx claudepluginhub leejuoh/claude-code-zero --plugin skill-creator-pro
```

This skill uses the workspace's default tool permissions.
Bundled files:

```
agents/analyzer.md
agents/comparator.md
agents/grader.md
assets/eval_review.html
eval-viewer/generate_review.py
eval-viewer/viewer.html
references/design-patterns.md
references/eval-writing-guide.md
references/schemas.md
references/skill-categories.md
references/troubleshooting-guide.md
scripts/__init__.py
scripts/aggregate_benchmark.py
scripts/generate_report.py
scripts/improve_description.py
scripts/package_skill.py
scripts/quick_validate.py
scripts/run_eval.py
scripts/run_loop.py
scripts/utils.py
```
Create, test, measure, and iteratively improve skills using category-aware design, gotchas-driven development, and progressive disclosure coaching.
The skill creation process has five phases:

1. Capture intent and choose a category
2. Draft the SKILL.md and supporting files
3. Test and benchmark with evals
4. Improve based on feedback
5. Optimize the description for triggering
Figure out where the user is in this process and jump in. Maybe they say "I want to make a skill for X" -- start at phase 1. Maybe they already have a draft -- skip to phase 3. Be flexible.
Pay attention to context cues about the user's technical level. Terms like "evaluation" and "benchmark" are fine for most users, but explain terms like "JSON" or "assertion" briefly if you're unsure. This skill serves people across a wide range of familiarity with coding.
Most effective skills start small. At Anthropic, "most of ours began as a few lines and a single gotcha, and got better because people kept adding to them as Claude hit new edge cases." Choose the path that fits:
Path A: Extract — "Turn this into a skill" / "Make what we just did reusable"
Path B: Greenfield — "I want to make a skill for X"
Start with one concrete, challenging task. Get Claude to succeed on that single task, then extract the winning approach into a skill. Don't try to design for every scenario upfront — iterate on one task before expanding.
Success metrics reference:
| Type | Metric | How to measure |
|---|---|---|
| Quantitative | Triggers on 90%+ of relevant queries | Run 10-20 test queries, track auto vs manual trigger rate |
| Quantitative | Completes workflow in fewer tool calls | Compare tool call count with-skill vs without-skill |
| Quantitative | 0 failed API/MCP calls per workflow | Monitor MCP server logs for retry rates and error codes |
| Qualitative | Users don't need to prompt about next steps | During testing, note how often you need to redirect or clarify |
| Qualitative | Workflows complete without user correction | Can a new user accomplish the task on first try with minimal guidance? |
| Qualitative | Consistent results across sessions | Run the same request 3-5 times, compare structural consistency |
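For the first quantitative metric, the tally is simple enough to script. A minimal sketch — the `(query, auto_triggered)` result format is an assumption for illustration, not something this skill prescribes:

```python
# Hypothetical tally for the trigger-rate metric: given should-trigger test
# queries and whether each one auto-triggered the skill, compute the hit rate.
def trigger_rate(results):
    """results: list of (query, auto_triggered) pairs for should-trigger queries."""
    if not results:
        return 0.0
    return sum(auto for _, auto in results) / len(results)

results = [
    ("summarize this week's sprint board", True),
    ("draft the standup update from yesterday's commits", True),
    ("post the release notes to the channel", False),
]
print(f"auto-trigger rate: {trigger_rate(results):.0%}")  # 2 of 3 queries triggered
```

A rate below 90% on should-trigger queries points at the description rather than the skill body — see Phase 5.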
Once intent is captured, proactively dig deeper before writing anything:
Wait to write test prompts until you've got this part ironed out.
Before choosing a category, clarify the skill's orientation:
Most skills lean one direction. Knowing which helps you choose the right structure and category below.
Before drafting, identify which of 9 categories the skill fits into. This shapes design choices, testing priorities, and improvement patterns. Read ${CLAUDE_SKILL_DIR}/references/skill-categories.md for the full guide with templates and category-specific advice.
| # | Category | Signature |
|---|---|---|
| 1 | Library & API Reference | Reference snippets + gotchas list |
| 2 | Product Verification | External tool pairing + programmatic assertions |
| 3 | Data Fetching & Analysis | Credential helpers + dashboard IDs + workflows |
| 4 | Business Process Automation | Simple instructions + log-based consistency |
| 5 | Code Scaffolding & Templates | Composable scripts + natural-language requirements |
| 6 | Code Quality & Review | Deterministic scripts + hooks/CI integration |
| 7 | CI/CD & Deployment | Multi-skill composition + error-rate monitoring |
| 8 | Runbooks | Symptom-to-report investigation flows |
| 9 | Infrastructure Operations | Destructive-action guardrails + confirmation gates |
The best skills fit cleanly into one category. Skills that straddle multiple tend to confuse. If a skill spans categories, consider splitting it.
Also identify the skill type:
Before writing SKILL.md, check if the skill uses platform features where the official spec is the source of truth (frontmatter fields, hooks, `allowed-tools` restrictions like `Bash(git *)`, the plugin manifest). For these, fetch the official docs index and the relevant page:
```
WebFetch https://code.claude.com/docs/llms.txt
```
Then fetch the specific page (e.g., skills.md, hooks.md, plugins-reference.md):
```
WebFetch https://code.claude.com/docs/en/<page>
```
Key pages: skills.md (frontmatter spec), hooks.md + hooks-guide.md (hook events/syntax), plugins-reference.md (plugin.json schema), sub-agents.md (agent restrictions).
Skip this step when: writing skill body content, designing gotchas, structuring folders, or working on evals -- these don't depend on platform spec.
Based on the category and intent, write the SKILL.md. Read ${CLAUDE_SKILL_DIR}/references/design-patterns.md for detailed guidance -- it covers both implementation patterns (sequential workflow, multi-MCP coordination, iterative refinement, context-aware tool selection, domain-specific intelligence) and writing patterns (gotchas design, progressive disclosure, hooks, composability).
Core principles:
Don't state the obvious. Claude already knows how to code. If your skill just restates things Claude would do anyway, it's wasting context for zero gain. Focus on information that pushes Claude out of its default patterns -- the frontend-design skill works because it teaches aesthetic choices Claude wouldn't make on its own, not basic React patterns.
Gotchas section = highest ROI. This is the single most impactful thing you can put in a skill. Every gotcha prevents Claude from hitting a failure mode that would waste the user's time. Build it from real failure points -- start with 2-3 based on domain knowledge, then grow it as you test. A good gotcha names the problem AND the fix:
```markdown
## Gotchas

- Never use `datetime.now()` in tests -- use dependency injection for time
- The API returns `snake_case` but the SDK expects `camelCase` -- always transform
- Batch size > 100 silently drops records without error
```
Explain the why. LLMs are smart -- when you explain reasoning, they generalize beyond the specific case you wrote about. "We validate timestamps because the API silently accepts future dates but the downstream system crashes" is far more powerful than "ALWAYS validate timestamps." If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag -- reframe with reasoning.
Give flexibility. Skills get reused across situations you can't predict. If you over-constrain with rigid step sequences, the skill breaks on anything slightly different from your test cases. Give Claude the information it needs but let it adapt to context.
Key frontmatter fields:
| Field | Description |
|---|---|
| `name` | kebab-case, matches folder name |
| `description` | Trigger condition -- see Phase 5 for optimization |
| `argument-hint` | Hint shown during autocomplete (e.g., `[issue-number]`) |
| `allowed-tools` | Restrict tools (e.g., `Read, Grep, Bash(git *)`) |
| `model` | Model override when this skill is active |
| `effort` | Effort level override (`low`, `medium`, `high`) |
| `context` | `fork` to run in an isolated subagent |
| `agent` | Subagent type when `context: fork` is set (e.g., `Explore`, `Plan`) |
| `hooks` | On-demand hooks active during skill execution |
| `disable-model-invocation` | `true` = manual-only (user invokes with `/name`) |
| `paths` | YAML list of globs -- skill only triggers for matching file paths (e.g., `["src/**/*.ts"]`) |
| `skills` | List of skill names to auto-load when subagents execute this skill |
| `user-invocable` | `false` = hidden from `/` menu, Claude-only background knowledge |
| `shell` | Shell interpreter for inline shell execution blocks: `bash` (default) or `powershell` |
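Putting a few of these together, a frontmatter sketch — the skill name, description, and tool list are illustrative inventions, and the exact field set should be verified against skills.md:

```yaml
---
name: migration-review
description: Reviews database migration files for unsafe operations. Use when the user asks to check, review, or validate a migration before deploying.
argument-hint: "[migration-file]"
allowed-tools: Read, Grep, Bash(git *)
disable-model-invocation: false
---
```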
String substitutions available in SKILL.md body:
| Variable | Resolves to |
|---|---|
| `$ARGUMENTS` | Text the user typed after the slash command (e.g., `/my-skill fix the login bug` → `fix the login bug`) |
| `$ARGUMENTS[N]` | Nth individual argument (0-indexed), e.g., `/my-skill foo bar` → `$ARGUMENTS[0]` = `foo` |
| `${CLAUDE_SKILL_DIR}` | Absolute path to this skill's folder -- use to reference bundled files (`${CLAUDE_SKILL_DIR}/references/api.md`) |
| `${CLAUDE_PLUGIN_ROOT}` | Plugin root directory -- use for hook script paths |
| `${CLAUDE_PLUGIN_DATA}` | Persistent data directory that survives plugin upgrades -- use for config, logs, databases |
| `${CLAUDE_SESSION_ID}` | Current session ID -- use for per-session tracking or logging |
${CLAUDE_SKILL_DIR} is the most important for skill authors. Use it whenever your SKILL.md body tells Claude to read a bundled file — it resolves correctly regardless of where the plugin is installed.
A skill is a folder, not just a markdown file. Think of the entire file system as context engineering and progressive disclosure.
Three levels of context loading:
```
skill-name/
  SKILL.md        # Instructions and navigation (required)
  scripts/        # Executable code for deterministic tasks
  references/     # Docs loaded into context as needed
  assets/         # Templates, icons, fonts for output
  bin/            # Executables invocable as bare commands from the Bash tool
```
When to use each:
- `scripts/` -- Helper functions, validation scripts, data fetchers. If during testing all subagents independently write a similar script, bundle it here.
- `references/` -- API docs, detailed specifications. Split by variant for multi-framework support (e.g., `references/aws.md`, `references/gcp.md`).
- `assets/` -- Output templates, image files. If the output is a markdown file, include a template.
- `bin/` -- Standalone executables that the Bash tool can invoke by name without a full path. Useful for CLI wrappers, data processors, or any tool the skill needs to call repeatedly. Must have execute permission and shebang lines.

Reference files from SKILL.md using `${CLAUDE_SKILL_DIR}` with when-to-read guidance:
```markdown
## Additional Resources

- For API details: read `${CLAUDE_SKILL_DIR}/references/api.md`
- For output template: copy `${CLAUDE_SKILL_DIR}/assets/report-template.md`
```
Using ${CLAUDE_SKILL_DIR} ensures paths resolve correctly regardless of where the plugin is installed. Relative markdown links ([text](references/api.md)) also work for Read tool access, but ${CLAUDE_SKILL_DIR} is more reliable across different invocation contexts.
Some skills need user-specific context (Slack channel, API key, project name). Use lazy initialization:

1. Check `${CLAUDE_PLUGIN_DATA}/config.json` (persists across upgrades)
2. If values are missing, collect them with `AskUserQuestion` and save them back to that file

Skills can register hooks that activate only during the skill's session. Use these for opinionated guardrails you don't want always-on:

- `/careful` -- Block `rm -rf`, `DROP TABLE`, and force-push via a PreToolUse matcher
- `/freeze` -- Block Edit/Write outside a specific directory during debugging

Consider adding hooks when the skill touches production data, involves destructive operations, or needs directory boundaries.
Hook types: command (run a shell script), prompt (inject a model prompt), http (POST JSON to a URL), or agent (spawn a subagent). HTTP hooks are useful for integrations that don't need shell access.
Conditional filtering: Hooks support an if field using permission rule syntax (e.g., Bash(git *)) to narrow when they fire, reducing overhead from process spawning. Compound commands (ls && git push) and env-var-prefixed commands (FOO=bar git push) are matched correctly since v2.1.89.
Permission decisions: PreToolUse hooks can return allow, deny, or defer. The defer decision (v2.1.89+) pauses headless sessions at the tool call — useful for human-in-the-loop gates in -p pipelines, resumed with -p --resume.
Hook output limit: Hook output exceeding 50K characters is saved to disk with a file path + preview instead of being injected directly into context. Design hooks to produce concise output; if your hook generates large results (e.g., lint reports), consider writing to a file and returning just the path.
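The write-to-file pattern can be sketched as a small helper. This is hypothetical — the 2,000-character inline threshold and the message format are illustrative choices; only the 50K platform cap comes from the text above:

```python
import tempfile

# Keep hook output concise: inline small results, write large ones to disk
# and return just a file path plus a short preview.
def summarize_output(report_text: str, inline_limit: int = 2000) -> str:
    if len(report_text) <= inline_limit:
        return report_text
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", prefix="hook-report-", delete=False
    ) as f:
        f.write(report_text)
        path = f.name
    return f"Full report ({len(report_text)} chars): {path}\nPreview: {report_text[:200]}"

print(summarize_output("3 lint warnings"))     # small: passed through as-is
print(summarize_output("x" * 100_000)[:60])    # large: becomes a path + preview
```

The same idea applies to any hook that produces reports: the model can always read the file if it needs the detail.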
preventContinuation:true: For prompt-type hooks on non-Stop events, this flag stops the model from continuing after the hook fires (v2.1.92 restored semantics).
Available events: PreToolUse, PostToolUse, SessionStart, Stop, SubagentStop, StopFailure, SessionEnd, SubagentStart, UserPromptSubmit, PreCompact, PostCompact, Notification, PermissionRequest, PermissionDenied, Setup, ConfigChange, CwdChanged, FileChanged, TaskCreated, TeammateIdle, TaskCompleted, InstructionsLoaded, Elicitation, ElicitationResult, WorktreeCreate, WorktreeRemove. Verify syntax against official docs (hooks.md, hooks-guide.md) -- hook events and types evolve across releases.
Notable event: PermissionDenied (v2.1.89+) fires after auto mode classifier denials — return {retry: true} to let the model retry. Useful for skills that need graceful recovery from permission blocks.
For skills that benefit from history (standup posts, recurring reports):
Gotchas for this workflow:

- Use `${CLAUDE_PLUGIN_DATA}` for stable storage that survives upgrades.
- `/skill-test` or similar skills will conflict with this skill's eval workflow. Run evals using the steps in Phase 3 directly.
- `cp -r` the skill before making changes in Phase 4. Without a snapshot, you can't run a meaningful baseline comparison -- the "before" is gone.
- Kill the viewer with `kill $VIEWER_PID` when done; otherwise subsequent launches may fail on port conflicts or you'll accumulate zombie processes.
- Users can set `disableSkillShellExecution: true` in settings.json (v2.1.91+), which blocks all inline shell execution in skills. If your skill relies on inline shell, document it as a requirement and provide a fallback that uses the Bash tool directly.

After drafting, create 2-3 realistic test prompts -- the kind of thing a real user would actually say. Share with the user for approval before running.
Save to evals/evals.json. Don't write assertions yet -- you'll draft them while runs are in progress. See ${CLAUDE_SKILL_DIR}/references/schemas.md for the full schema.
```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's realistic task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```
This section is one continuous sequence -- don't stop partway through. Do NOT use /skill-test or any other testing skill.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Organize by iteration (iteration-1/, iteration-2/, etc.) with each test case getting a descriptive directory name.
Step 1: Spawn all runs in the same turn
For each test case, spawn two subagents simultaneously -- one with the skill, one without. Launch everything at once so it all finishes around the same time.
Point the with-skill subagent at a snapshot of the skill (made with `cp -r`). Write `eval_metadata.json` for each test case with `eval_id`, `eval_name`, `prompt`, and `assertions` (empty for now).
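A sketch of one test case's eval_metadata.json — the four field names come from the step above; the values are illustrative, and schemas.md has the authoritative schema:

```json
{
  "eval_id": 1,
  "eval_name": "summarize-sprint-board",
  "prompt": "User's realistic task prompt",
  "assertions": []
}
```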
Step 2: Draft assertions while runs are in progress
Don't just wait. Draft quantitative assertions with descriptive names. Good assertions are binary (pass/fail), objectively verifiable, and read clearly in the benchmark viewer. See ${CLAUDE_SKILL_DIR}/references/eval-writing-guide.md for how to write good assertions.
Subjective skills (writing style, design quality) are better evaluated qualitatively -- don't force assertions onto things that need human judgment.
Update eval_metadata.json and evals/evals.json with the assertions.
Step 3: Capture timing data as runs complete
When each subagent completes, immediately save total_tokens and duration_ms to timing.json. This data comes through task notifications and isn't persisted elsewhere.
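A timing.json sketch — only `total_tokens` and `duration_ms` are named above; the with/without split and the numbers are assumed for illustration:

```json
{
  "with_skill": { "total_tokens": 48210, "duration_ms": 93400 },
  "without_skill": { "total_tokens": 61050, "duration_ms": 128700 }
}
```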
Step 4: Grade, aggregate, and launch the viewer
Grade each run -- Spawn grader (read ${CLAUDE_SKILL_DIR}/agents/grader.md). Save to grading.json. The expectations array must use fields text, passed, and evidence. For programmatically checkable assertions, write and run a script rather than eyeballing it.
Aggregate -- Run from ${CLAUDE_SKILL_DIR}:
```
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
Analyst pass -- Surface patterns the aggregate stats might hide. See ${CLAUDE_SKILL_DIR}/agents/analyzer.md.
Launch viewer:
```
nohup python ${CLAUDE_SKILL_DIR}/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
```
For iteration 2+: add --previous-workspace <workspace>/iteration-<N-1>.
For headless/cowork: use --static <output_path> for standalone HTML.
Tell the user the results are in their browser and to come back when done reviewing.
Step 5: Read feedback
Read feedback.json. Empty feedback = user thought it was fine. Focus improvements on specific complaints. Kill the viewer when done: `kill $VIEWER_PID 2>/dev/null`
Read the transcripts, not just final outputs. Then:
Generalize from feedback. We're creating skills used across many prompts, but iterating on a few examples for speed. Rather than fiddly overfitty changes, understand the underlying principle and fix broadly. If a stubborn issue persists, try different metaphors or patterns -- it's cheap to experiment.
Keep the prompt lean. Remove instructions not pulling their weight. If the skill makes the model waste time on unproductive steps, cut those parts.
Explain the why. Frame instructions around reasoning, not commands. "We validate timestamps because the API silently accepts future dates but the downstream system crashes" beats "ALWAYS validate timestamps."
Detect repeated work. If all subagents independently wrote similar helper scripts, bundle that script in scripts/. Write once, reference from SKILL.md.
Consider hooks. If the model strays outside intended boundaries, add an on-demand hook. Code is deterministic; language interpretation isn't.
Category-specific improvements. Consult ${CLAUDE_SKILL_DIR}/references/skill-categories.md for improvement patterns by category.
To run another iteration:

- Re-run the full benchmark in an `iteration-<N+1>/` directory, including baselines
- Pass `--previous-workspace` pointing at the previous iteration

Keep going until the user is happy, feedback is all empty, or progress plateaus.
For rigorous A/B comparison, read ${CLAUDE_SKILL_DIR}/agents/comparator.md and ${CLAUDE_SKILL_DIR}/agents/analyzer.md. Give two outputs to an independent agent without revealing which is which. Optional -- the human review loop is usually sufficient.
If the user wants hands-off optimization instead of the manual review loop above, suggest /auto-optimize. It runs the skill dozens of times, scores outputs with binary evals, mutates the prompt, and keeps only improvements -- all autonomously. Best for skills that already work but need to go from 70% to 95%+.
The description field is the primary triggering mechanism. It's not a summary -- it's a trigger condition written for the model. Write it to be slightly "pushy" to combat undertriggering.
Display cap: The /skills listing truncates descriptions to 250 characters. The full description is still used for triggering, but front-load the most important trigger phrases so they're visible in the menu.
Step 1: Generate 20 trigger eval queries
Mix of should-trigger (8-10) and should-not-trigger (8-10). Queries must be realistic with specific details -- file paths, personal context, typos, casual speech. For should-not-trigger, focus on near-misses that share keywords but actually need something different.
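A sketch of what a few entries might look like — the field names and the migration-review scenario are hypothetical (check schemas.md for the real eval schema); note the deliberate typo and the near-miss:

```json
{
  "queries": [
    { "prompt": "can you review db/migrations/0042_add_index.sql before i deploy it", "should_trigger": true },
    { "prompt": "reveiw this migration real quick, its in the migrations folder", "should_trigger": true },
    { "prompt": "how do i write an alembic migration from scratch", "should_trigger": false }
  ]
}
```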
Step 2: Review with user using ${CLAUDE_SKILL_DIR}/assets/eval_review.html template.
Step 3: Run optimization loop from ${CLAUDE_SKILL_DIR}:
```
python -m scripts.run_loop \
  --eval-set <path-to-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```
Step 4: Apply best_description from the JSON output to the skill's SKILL.md frontmatter.
Before packaging, verify:

- No angle brackets (`< >`) in frontmatter
- No unquoted numeric-looking `name` or `description` (wrap in quotes: `name: "3000"`)
- No colons in `description` without quoting (YAML parses `description: Triggers: X, Y` incorrectly -- use quotes)
- No YAML list syntax in `argument-hint` (e.g., `[topic: foo | bar]` -- use a plain string)
- Writable state lives in `${CLAUDE_PLUGIN_DATA}`, not the skill directory
- `claude plugin validate .` passes (checks frontmatter schema and hooks.json)

For common issues during skill development (doesn't trigger, triggers too often, instructions not followed, large context, frontmatter errors), read `${CLAUDE_SKILL_DIR}/references/troubleshooting-guide.md`.
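Two of the YAML pitfalls are easy to lint mechanically. A hypothetical pre-packaging check (not one of this skill's bundled scripts), run on raw frontmatter lines before YAML parsing:

```python
# Flag numeric-looking name/description values and unquoted colons in the
# description -- both cause YAML to parse the frontmatter incorrectly.
def frontmatter_warnings(lines):
    warnings = []
    for line in lines:
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        quoted = value.startswith(('"', "'"))
        if key in ("name", "description") and value.isdigit():
            warnings.append(f"{key}: numeric-looking value -- wrap it in quotes")
        if key == "description" and ":" in value and not quoted:
            warnings.append("description: unquoted colon -- quote the whole value")
    return warnings

for w in frontmatter_warnings(["name: 3000", "description: Triggers: X, Y"]):
    print(w)
```

`claude plugin validate .` remains the authoritative check; this just catches the two quoting mistakes early.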
If `claude plugin validate .` fails or the issue isn't covered in the troubleshooting guide:

1. Fetch https://code.claude.com/docs/llms.txt to get the docs index
2. Fetch the relevant page (skills.md for frontmatter errors, hooks.md for hook failures, plugins-reference.md for manifest issues)

The bundled references in this skill cover design principles and eval methodology, but platform spec (what fields exist, what syntax is valid) lives in the official docs and may have changed since these references were written.
```
python ${CLAUDE_SKILL_DIR}/scripts/package_skill.py <path/to/skill-folder>
```
| File | Purpose |
|---|---|
| `references/skill-categories.md` | 9 categories with templates, examples, and improvement patterns |
| `references/design-patterns.md` | Gotchas patterns, progressive disclosure, hooks, setup, composability |
| `references/schemas.md` | JSON schemas for evals, grading, benchmark, comparison |
| `references/troubleshooting-guide.md` | 5 symptoms: doesn't trigger, triggers too often, instructions not followed, large context, frontmatter errors |
| `agents/grader.md` | Evaluate assertions against outputs |
| `agents/comparator.md` | Blind A/B comparison between two outputs |
| `agents/analyzer.md` | Analyze benchmark patterns and comparison results |
Official docs (external): https://code.claude.com/docs/llms.txt → index of all pages. Fetch when working with platform features (frontmatter, hooks, allowed-tools, plugin manifest). Key pages: skills.md, hooks.md, hooks-guide.md, plugins-reference.md, sub-agents.md.
Cowork / headless: Use --static <output_path> for eval viewer. Feedback downloads as feedback.json.
Claude.ai: No subagents -- run test cases inline, one at a time. Skip baselines and benchmarking. Focus on qualitative feedback. Description optimization requires claude CLI -- skip if unavailable.
Written and tested against Claude Code v2.1.92. Key platform features used: ${CLAUDE_SKILL_DIR}, ${CLAUDE_PLUGIN_DATA}, ${CLAUDE_SESSION_ID}, effort frontmatter, skills frontmatter, paths frontmatter, HTTP hooks, conditional if on hooks, context: fork, plugin bin/ executables (v2.1.91+), PermissionDenied hook event (v2.1.89+), defer hook decision (v2.1.89+), disableSkillShellExecution setting (v2.1.91+). If something breaks after a Claude Code update, fetch https://code.claude.com/docs/llms.txt and check the relevant page for spec changes.