From skill-conductor
Creates, edits, evaluates, and packages agent skills for Claude Code. Manages the full skill lifecycle from drafting through testing to distribution.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-conductor:skill-conductor
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Full lifecycle management for agent skills: **draft → test → review → improve → repeat**.
- agents/analyzer.md
- agents/bineval.md
- agents/comparator.md
- agents/grader.md
- assets/eval_review.html
- eval-viewer/generate_review.py
- eval-viewer/viewer.html
- references/bineval-method.md
- references/patterns.md
- references/quality-questions.md
- references/runtime-setup.md
- references/schemas.md
- references/sop-practices.md
- scripts/aggregate_benchmark.py
- scripts/eval_skill.py
- scripts/generate_report.py
- scripts/improve_description.py
- scripts/init_skill.py
- scripts/package_skill.py
- scripts/quick_validate.py
One skill to rule them all — from architecture to packaging. The core loop is always the same: write something, test it, see what fails, fix it, test again.
Before any mode that touches scripts (CREATE, IMPROVE, VALIDATE, OPTIMIZE, PACKAGE), run the pre-flight block → references/runtime-setup.md (checks uv, sets UV_BIN/SKILL_CONDUCTOR_DIR, verifies LLM access). If uv is absent, stop and tell the user.
Read context cues. If the user is a skill author iterating on their own work, be direct and technical. If they're new to skills, explain the why behind each step — not just what to do, but why it matters. Default to conversational, not robotic.
Detect mode from context. If ambiguous, ask.
| Mode | When | What happens |
|---|---|---|
| 1. CREATE | "build a skill", "new skill for..." | Full lifecycle: intent → architecture → scaffold → write → test |
| 2. IMPROVE | "fix this skill", "it doesn't trigger" | Diagnose → eval loop → self-update loop → iterate |
| 3. VALIDATE | "test this skill", "run evals" | Structural checks + trigger testing + BinEval scoring |
| 4. REVIEW | "review this skill", third-party assessment | 11-point quality gate, quick and focused |
| 5. OPTIMIZE | "improve triggering", "description optimization" | Automated description optimization with train/test split |
| 6. PACKAGE | "package for distribution" | Validate + bundle into .skill file |
Before writing anything, extract 2–3 concrete scenarios.
Ask:
Don't move on until you have a clear picture of what the skill does, for whom, and when. This prevents the most common failure: a skill that does something but triggers for the wrong things.
Before writing the skill, verify the agent fails without it:
If the agent already handles it perfectly, the skill is unnecessary. This sounds obvious, but it's the most skipped step and the most valuable one.
Choose a primary pattern from references/patterns.md (can combine):
| Pattern | Use when |
|---|---|
| Sequential workflow | clear step-by-step process |
| Iterative refinement | output improves with cycles |
| Context-aware selection | same goal, different tools by context |
| Domain intelligence | specialized knowledge beyond tool access |
| Multi-MCP coordination | workflow spans multiple services |
Choose degrees of freedom — this determines how much control vs. flexibility the skill gives the agent:
| Freedom | When | Example |
|---|---|---|
| Low (scripts) | fragile, error-prone, must be exact | PDF rotation, API calls |
| Medium (pseudocode) | preferred pattern exists, some variation ok | data processing |
| High (text) | multiple valid approaches, judgment needed | design decisions |
Golden rule: read references/sop-practices.md before authoring or reviewing ANY skill. It holds the canonical 9 authoring principles (universal): pre-flight, no-process-in-description, MOC (SKILL.md = map, not prose), fresh-practitioner author, TWI "why", blind-agent test, inline checklists, one-term-per-concept, cut-the-fat (env/keys OUT of SKILL.md). For procedural skills (business process with branching: request, quote, onboarding, escalation) the same file also has the deep SOP methodology — format selection, 7-step process, procedural checklist.
uv run scripts/init_skill.py <skill-name> --path <output-dir> [--resources scripts,references,assets]
Or create manually:
skill-name/
├── SKILL.md # required — the brain
├── scripts/ # deterministic operations (executed, not loaded)
├── references/ # detailed docs (loaded on demand)
└── assets/ # templates, images for output (never loaded)
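For the manual route, the layout above can be created with a few lines of Python. This is a minimal sketch under stated assumptions, not a description of what scripts/init_skill.py does: the function name `scaffold_skill` and the placeholder frontmatter text are hypothetical.

```python
from pathlib import Path

# Hypothetical helper that mirrors the directory layout shown above.
# It is NOT the implementation of scripts/init_skill.py.
RESOURCE_DIRS = ("scripts", "references", "assets")


def scaffold_skill(name: str, output_dir: str) -> Path:
    """Create <output_dir>/<name>/ with SKILL.md and the three resource dirs."""
    root = Path(output_dir) / name
    for sub in RESOURCE_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    skill_md = root / "SKILL.md"
    if not skill_md.exists():  # never overwrite an existing skill body
        skill_md.write_text(
            f"---\nname: {name}\ndescription: >\n  TODO\n---\n\n# {name}\n",
            encoding="utf-8",
        )
    return root
```

Calling `scaffold_skill("my-skill", ".")` would leave a `my-skill/` folder whose name matches the `name` field in the frontmatter.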
---
name: kebab-case-name
description: >
[Purpose in one sentence]. Use when [triggers].
Do NOT use for [negative triggers].
---
The description is the single most important line. It determines whether the skill gets triggered at all. Rules:
- name: lowercase, digits, hyphens only. No consecutive hyphens. Matches folder name. Max 64 chars
- description: max 1024 chars. No angle brackets. No process/workflow steps

# GOOD: purpose + triggers, no process
description: Analyze Figma design files for developer handoff. Use when user uploads .fig files or asks for "design specs". Do NOT use for Sketch or Adobe XD.
# BAD: process in description (agent skips body)
description: Exports Figma assets, generates specs, creates Linear tasks, posts to Slack.
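The mechanical parts of the name and description rules can be checked in code. The sketch below is illustrative only: `check_frontmatter` is a hypothetical name, and it does not reproduce scripts/eval_skill.py. Detecting process steps in a description needs judgment, so that rule is deliberately left out.

```python
import re

MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024


def check_frontmatter(name: str, description: str, folder: str) -> list[str]:
    """Return rule violations; an empty list means every mechanical rule passes."""
    problems = []
    # lowercase letters, digits, hyphens only, and no consecutive hyphens
    if not re.fullmatch(r"[a-z0-9-]+", name) or "--" in name:
        problems.append("name: lowercase, digits, single hyphens only")
    if len(name) > MAX_NAME_CHARS:
        problems.append(f"name: longer than {MAX_NAME_CHARS} chars")
    if name != folder:
        problems.append("name: does not match folder name")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(f"description: longer than {MAX_DESCRIPTION_CHARS} chars")
    if "<" in description or ">" in description:
        problems.append("description: contains angle brackets")
    if "use when" not in description.lower():
        problems.append("description: no 'Use when' trigger clause")
    return problems
```

Run against the GOOD example above with a matching folder name, the function returns an empty list. The BAD example would still pass the length and bracket checks, which is why process leaks need a separate review.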
# Skill Name
## Overview
What this enables. 1-2 sentences. Core principle.
## [Main sections]
Step-by-step with numbered sequences.
Concrete templates over prose.
Imperative voice throughout.
## Common Mistakes
What goes wrong + how to fix.
## Troubleshooting (if applicable)
Error: [message] → Cause: [why] → Fix: [how]
- /home/<user>, /Users/<user>): reference them, never inline (Principle 9a)
- references/sop-practices.md Principle 5, TWI)

This is the critical step: most failures hide here. Treat it as three sub-phases.

- evals/evals.json exists with 3–5 prompts (see references/schemas.md)
- <skill-name>-workspace/iteration-1/eval-0) and eval_metadata.json
- uv and eval-viewer/generate_review.py are reachable from current working dir

If any item fails, fix it before proceeding. A missing workspace dir mid-run loses outputs.
| What | Key move | Why |
|---|---|---|
| Spawn with-skill runs | One subagent per eval, skill active, save outputs to iteration-N/<eval-name>/with_skill/ | Parallel = same wall time as one run |
| Spawn baseline runs in the same turn | Same prompt, no skill (or old version snapshot for IMPROVE), save to without_skill/ or old_skill/ | If you wait, baselines drift in time and aren't comparable |
| Draft assertions while runs execute | Pull verifiable statements from eval prompts | Don't waste the 5–15 min of subagent time |
| Capture timing on each notification | Save total_tokens, duration_ms to timing.json immediately | Notification is the only source — process per-arrival, don't batch |
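The last row of the table can be sketched as a handler that writes timing.json the moment a notification arrives. Only the file name and the two field names come from the text above. The notification's shape (a dict) and the function name `save_timing` are assumptions for illustration.

```python
import json
from pathlib import Path


def save_timing(run_dir: str, notification: dict) -> Path:
    """Persist timing for ONE run as soon as its notification arrives."""
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": notification["duration_ms"],
    }
    out = Path(run_dir) / "timing.json"
    # Create the run directory if needed so the data is not lost mid-run.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return out
```

Writing one file per run, per arrival, avoids holding timing data in memory where a later failure could discard it.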
- timing.json files written (one per run)
- grading.json with fields text, passed, evidence (not name/met)
- benchmark.json aggregated: uv run scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>
- agents/analyzer.md for what to look for (non-discriminating assertions, high-variance evals, time/token tradeoffs)

uv run eval-viewer/generate_review.py <workspace> --skill-name <name> --benchmark <path>

- --static <output.html> and send file to user
- --previous-workspace <previous-iteration-path>

The last bullet is the trap. If you skip user review and "improve" based on your own reading of outputs, you optimize against your taste, not the user's.
If any fail → iterate. Find how the agent rationalizes around the skill, plug loopholes, re-verify.
Read the existing SKILL.md completely. Identify the problem class:
| Problem | Signal | Fix |
|---|---|---|
| Undertriggering | skill doesn't load | add keywords, trigger phrases, file types to description |
| Overtriggering | loads for unrelated queries | add negative triggers, be more specific |
| Skips body | follows description only | remove process/workflow from description |
| Inconsistent output | varies across sessions | add explicit templates, reduce freedom, add scripts |
| Too slow | large context | move detail to references/, cut body to <500 lines |
- scripts/. Saves every future invocation from reinventing the wheel
- references/sop-practices.md: the 9 canonical principles (universal) map directly to skill failure modes, namely process leaking into description, SKILL.md bloated instead of a map, env/keys inlined, silent improvisation from missing "why", missed edge cases, agents skipping end-of-doc checklists. For process skills (ticket, quote, escalation) also apply the deep SOP methodology in the same file

The improvement cycle mirrors CREATE Step 6, but focused on the broken behavior:

- agents/grader.md

uv run eval-viewer/generate_review.py <workspace>

- --static <output.html> instead of live server

Drive iteration off failing BinEval questions, not taste. Run the loop:

- references/bineval-method.md) → collect failing[]
- agents/analyzer.md as note-taker: turn the failing questions + their explanations into generalized, deduped lessons (not one-off patches for a single test case)
- failing[] (or its critical subset) is empty, or after 3 iterations. Keep the best result by gate_passed first, then overall S.

When you have two meaningfully different versions:

- agents/comparator.md: answers the SAME binary questions for outputs A and B without knowing which skill produced which
- agents/analyzer.md: unblinds results, analyzes WHY the winner won

This prevents bias. The comparator judges output quality, not skill design.
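The blinding step can be sketched as follows. This is an illustration of the idea, not the contents of agents/comparator.md: the function names and the "with_skill"/"baseline" labels are assumptions.

```python
import random


def blind_pair(with_skill: str, baseline: str, rng: random.Random):
    """Shuffle two outputs into anonymous slots A and B.

    Returns (outputs, key): `outputs` goes to the comparator,
    `key` stays hidden until the analyzer unblinds the result.
    """
    labelled = [("with_skill", with_skill), ("baseline", baseline)]
    rng.shuffle(labelled)
    outputs = {"A": labelled[0][1], "B": labelled[1][1]}
    key = {"A": labelled[0][0], "B": labelled[1][0]}
    return outputs, key


def unblind(winner: str, key: dict) -> str:
    """Map the comparator's verdict ('A' or 'B') back to the real version."""
    return key[winner]
```

Keeping `key` out of the comparator's context is the whole point: the judge sees only two outputs and the shared binary questions.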
Three stages, run in order.
uv run scripts/eval_skill.py <skill-folder>
Checks: frontmatter, naming, description quality, process leak detection, body size, structure, scripts. Target: 10/10, no warnings.
Generate 6 test prompts:
Run each in clean session. Target: 6/6 correct.
For automated trigger testing at scale, use:
uv run scripts/run_eval.py --eval-set <path> --skill-path <path> --runs-per-query 3
Evaluate with atomic binary yes/no questions across 5 dimensions — each answered 1/0 with evidence. See references/bineval-method.md for the method, references/quality-questions.md for the question bank, and agents/bineval.md for the evaluator that emits bineval.json.
The 5 dimensions: Discovery, Clarity, Structure, Robustness, Completeness.
Questions come from two sources:
- scripts/eval_skill.py --json (the sole emitter), e.g. DET-STRUCT-SKILLMD-EXISTS, DET-DISCOVERY-DESC-PRESENT. Some are flagged critical.

Aggregate: per-dimension dimension_scores S_d = mean of that dimension's answers; overall S = mean of all answers.
Display bands: S≥0.90 production-ready · 0.70–0.89 solid · 0.50–0.69 needs-work · <0.50 rewrite.
GATE = every critical question (deterministic + critical bank questions) answered 1. The GATE is the pass criterion — not the scalar S.
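The aggregation, the display bands, and the GATE can be sketched together. The record shape used here (keys `id`, `dimension`, `answer`, `critical`) and the ids starting with `Q-` are assumptions for illustration, and so is the `S` key on iteration results; the real schema is in references/schemas.md. The last function reflects the IMPROVE rule of keeping the best result by gate_passed first, then overall S.

```python
from collections import defaultdict


def aggregate(answers: list[dict]) -> dict:
    """Compute per-dimension scores S_d, overall S, failing ids, and the GATE."""
    by_dim = defaultdict(list)
    for a in answers:
        by_dim[a["dimension"]].append(a["answer"])
    return {
        "dimension_scores": {d: sum(v) / len(v) for d, v in by_dim.items()},
        "S": sum(a["answer"] for a in answers) / len(answers),
        "failing": [a["id"] for a in answers if a["answer"] == 0],
        # GATE: every critical question must be answered 1
        "gate_passed": all(a["answer"] == 1 for a in answers if a["critical"]),
    }


def band(s: float) -> str:
    """Display band only; the pass criterion is the GATE, not S."""
    if s >= 0.90:
        return "production-ready"
    if s >= 0.70:
        return "solid"
    if s >= 0.50:
        return "needs-work"
    return "rewrite"


def best_iteration(results: list[dict]) -> dict:
    """Pick the best run: gate_passed first, then overall S."""
    return max(results, key=lambda r: (r["gate_passed"], r["S"]))
```

Note that a run with S = 0.95 and a failed critical question loses to a run with S = 0.80 that passes the GATE.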
Quick quality gate for third-party skills.
[ ] SKILL.md exists, exact case
[ ] Valid YAML frontmatter (name + description)
[ ] name: kebab-case, matches folder, ≤64 chars
[ ] description: ≤1024 chars, no angle brackets
[ ] description has triggers ("Use when...")
[ ] description has NO workflow/process steps
[ ] No README.md inside skill folder
[ ] SKILL.md < 500 lines
[ ] References max 1 level deep
[ ] Scripts tested and executable
[ ] No hardcoded paths/tokens/secrets
Then run VALIDATE Stage 2 (discovery) on the description. Report score + checklist.
The deterministic subset of this checklist is emitted as binary BinEval question records by scripts/eval_skill.py --json (e.g. DET-STRUCT-SKILLMD-EXISTS, DET-DISCOVERY-DESC-PRESENT, DET-ROBUST-NO-SECRETS) — the sole emitter of those records.
The checklist exists because these are the failure modes that actually happen in practice — especially process-in-description, which causes the agent to skip the body entirely.
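Three of the structural items in the checklist lend themselves to a quick script. This is a minimal sketch, assuming the skill folder is on the local filesystem; `structural_checks` is a hypothetical name and is not the logic of scripts/eval_skill.py.

```python
from pathlib import Path


def structural_checks(skill_dir: str) -> dict[str, bool]:
    """Check SKILL.md presence (exact case), README.md absence, and body size."""
    folder = Path(skill_dir)
    # Listing the directory gives stored names, so the case check
    # also holds on case-insensitive filesystems.
    names = {p.name for p in folder.iterdir()}
    has_skill_md = "SKILL.md" in names
    line_count = 0
    if has_skill_md:
        text = (folder / "SKILL.md").read_text(encoding="utf-8")
        line_count = len(text.splitlines())
    return {
        "skill_md_exists_exact_case": has_skill_md,
        "no_readme_in_folder": "README.md" not in names,
        "skill_md_under_500_lines": has_skill_md and line_count < 500,
    }
```

The remaining items (trigger wording, process leaks, tested scripts, hardcoded secrets) need either a reader or more specific tooling.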
Automated description optimization. The description competes with other skills for Claude's attention — optimization finds the wording that triggers most accurately.
Queries must be realistic — concrete, detailed, with file paths, context, abbreviations, typos. Not "Format this data" but "my boss sent Q4 sales final FINAL v2.xlsx, add profit margin % column, revenue is col C costs col D".
Should-trigger (10): Different phrasings of the same intent — formal, casual, implicit. Include cases where user doesn't name the skill but clearly needs it. Add competing-skill edge cases.
Should-NOT-trigger (10): Near-misses that share keywords but need something different. Adjacent domains, ambiguous phrasing. "Write fibonacci" as negative for PDF skill = useless — too easy. Make negatives genuinely tricky.
Triggering mechanics: Claude only consults skills for tasks it can't handle directly. Simple queries ("read this PDF") won't trigger skills regardless of description — Claude handles them with basic tools. Eval queries must be substantive enough that consulting a skill would help.
- assets/eval_review.html

uv run scripts/run_loop.py \
--eval-set evals/eval_set.json \
--skill-path <skill-dir> \
--model claude-sonnet-4-20250514 \
--max-iterations 5 \
--holdout 0.4 \
--verbose
The loop:
| Script | Purpose |
|---|---|
| scripts/run_eval.py | Run trigger evaluation on a description |
| scripts/improve_description.py | Claude proposes improved description |
| scripts/generate_report.py | HTML visualization of optimization history |
| scripts/aggregate_benchmark.py | Statistical aggregation of benchmark runs |
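One plausible reading of `--holdout 0.4` in the command above is that 40% of the eval queries are held out as a test set. The sketch below shows a stratified split under that assumption. The query shape (a dict with a boolean `should_trigger` field), the function name, and the seed are hypothetical, and scripts/run_loop.py may split differently.

```python
import random


def split_eval_set(queries: list[dict], holdout: float = 0.4, seed: int = 42):
    """Split queries into (train, test), keeping the trigger ratio in both."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] == label]
        rng.shuffle(group)
        n_test = round(len(group) * holdout)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

With 10 should-trigger and 10 should-NOT-trigger queries, this yields 12 training and 8 test queries, 4 of each kind in the test set. Stratifying keeps a description from looking good on the holdout simply because the holdout happened to contain few near-miss negatives.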
uv run scripts/quick_validate.py <skill-folder>
uv run scripts/package_skill.py <skill-folder> [output-dir]
Creates skill-name.skill (zip with .skill extension). Verify: unzip in temp dir, check structure intact.
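Packaging and the temp-dir verification can be sketched with the standard library. This is not the code of scripts/package_skill.py. The function names are hypothetical, and placing the skill folder at the top level of the archive is an assumption.

```python
import tempfile
import zipfile
from pathlib import Path


def package_skill(skill_dir: str, output_dir: str) -> Path:
    """Zip the skill folder into <output_dir>/<skill-name>.skill."""
    src = Path(skill_dir)
    out = Path(output_dir) / f"{src.name}.skill"
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Archive paths start with the skill folder name (assumed layout).
                zf.write(path, path.relative_to(src.parent).as_posix())
    return out


def verify_package(archive: Path) -> bool:
    """Unzip into a temp dir and confirm SKILL.md survived the round trip."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(tmp)
        return (Path(tmp) / archive.stem / "SKILL.md").is_file()
```

Keep the output directory outside the skill folder, otherwise the archive would try to include itself.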
- references/sop-practices.md

| Directory | Loaded? | Purpose |
|---|---|---|
| SKILL.md | on trigger | brain — instructions |
| references/ | on demand | detailed docs, schemas |
| scripts/ | executed, not loaded | deterministic operations |
| assets/ | never loaded | templates, images |
| Level | When loaded | Budget |
|---|---|---|
| Frontmatter | always (system prompt) | ~100 words |
| SKILL.md body | on trigger | <500 lines |
| Bundled resources | on demand | unlimited |
[What it does] + Use when [triggers, file types, symptoms]. + Do NOT use for [negatives].
| Path | What's inside |
|---|---|
| agents/grader.md | Evidence-based assertion grading |
| agents/comparator.md | Blind A/B output comparison |
| agents/analyzer.md | Post-hoc analysis + benchmark notes |
| agents/bineval.md | BinEval evaluator, emits bineval.json |
| references/patterns.md | 5 architectural patterns + anti-patterns |
| references/schemas.md | JSON schemas for evals, grading, benchmark |
| references/bineval-method.md | BinEval method: dimensions, scoring, GATE |
| references/quality-questions.md | BinEval question bank (deterministic + bank) |
| references/sop-practices.md | Canon: 9 authoring principles (universal) + deep SOP methodology for procedural skills |
| references/runtime-setup.md | Pre-flight: uv/env/path checks, LLM-access options |
| eval-viewer/ | Interactive HTML viewer for eval results |
| assets/eval_review.html | Trigger eval set editor |
| scripts/eval_skill.py | Structural validation (10-point scoring) |
| scripts/init_skill.py | Skill scaffolder |
| scripts/run_eval.py | Trigger evaluation runner |
| scripts/run_loop.py | Eval + improve optimization loop |
| scripts/improve_description.py | Claude-powered description improvement |
| scripts/aggregate_benchmark.py | Benchmark statistics aggregator |
| scripts/generate_report.py | HTML report generator |
| scripts/quick_validate.py | Quick validation for packager |
| scripts/test_smoke.py | Smoke tests for all scripts (12 tests) |
| scripts/package_skill.py | Skill → .skill packager |
| scripts/utils.py | Shared utilities (parse_skill_md) |
npx claudepluginhub smixs/skill-conductor --plugin skill-conductor