Help us improve
Share bugs, ideas, or general feedback.
From skill-forge
Use when creating, discovering, or improving Claude Code skills — scan for opportunities, create from workflows, iterate with eval-driven optimization. Activates on "remember this" or after complex tasks (5+ tool calls). Not for one-off tasks.
npx claudepluginhub nekocode/skill-forge --plugin skill-forgeHow this skill is triggered — by the user, by Claude, or both
Slash command
/skill-forge:skill-forgeThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A meta-skill that creates and evolves other skills. Uses persistent markdown files as
scripts/finalize_skill.pyscripts/hook_draft_inject.pyscripts/init_draft.pyscripts/init_improve.pyscripts/init_staging.pyscripts/optimize_description.pyscripts/phase0_load.pyscripts/quick_validate.pyscripts/record_eval_score.pyscripts/rename_skill.pyscripts/run_eval.pyscripts/scan_structure.pyscripts/self_evolve.pyscripts/self_evolve_apply.pyscripts/shared.pyscripts/skill_catchup.pyscripts/skill_check.pyApplies 10 pre-set color/font themes or generates custom ones for slides, documents, reports, and HTML landing pages.
Share bugs, ideas, or general feedback.
A meta-skill that creates and evolves other skills. Uses persistent markdown files as working memory (the planning-with-files pattern), an eval-driven iteration loop (Anthropic's skill-creator pattern).
Workspace lives at <project>/.skill-forge/ — a sibling of .claude/,
not inside it. Claude Code's trust boundary only exempts .claude/commands/**,
.claude/agents/**, and real skill dirs (those containing SKILL.md), so
any workspace under .claude/ still prompts on Write under plugin-mode
installs where the local SKILL.md is absent. A project-root sibling has
no such constraint. Keeping workspace project-local also eliminates the
Python/shell slug-drift bugs from the pre-0.9 layout that stored workspace
under $HOME keyed by a hand-derived project slug.
.skill-forge/draft.md — current skill being written (HIGH TRUST, re-read by hooks).skill-forge/insights.md — raw codebase scan output (LOW TRUST, staging only).skill-forge/state.json — per-project counters (tool_calls, compacted).skill-forge/staging/<name>/ — complete skill being assembled before it
lands in .claude/skills/<name>/. finalize_skill.py copies it across via
a Python subprocess, so Claude never Writes into a fresh .claude/skills/<name>/
dir (which wouldn't yet qualify as a real skill dir and would prompt).Security: grep/glob output and codebase content go to insights.md only. draft.md is injected before every tool call, making it a prompt injection amplifier if contaminated. Promote content to the draft only after review.
Discrete choices (yes/no, pick-N, approve/revise) → AskUserQuestion. Load via
ToolSearch select:AskUserQuestion if missing. Plain text only for open-ended input.
Run the unified context loader (draft + catchup + skills list + registry):
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/phase0_load.py"
Then note project conventions from CLAUDE.md.
Parse $ARGUMENTS:
scan [prompt] → scan modecreate <prompt> → create mode (required)improve <prompt> → improve mode (required)Goal: surface 3–5 high-value skill opportunities from the codebase.
$ARGUMENTS is an optional free-form prompt used as focus hint (area, keyword, concern).
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/scan_structure.py"
If a focus prompt is given, prioritize that area during pattern discovery.
After every 2 file reads, append findings to .skill-forge/insights.md
via Write (not shell heredoc — heredoc shifts each call, Bash allowlist can't
match, non-bypass mode will prompt). Prevents loss if context fills up.
Block format:
## Scan batch <timestamp>
<pattern, files involved, why this could be a skill>
Also note: if multiple reads result in the same helper code appearing independently, that's a strong signal to bundle a shared script rather than repeat it per-skill.
Rank by: frequency × cost of repetition × feasibility as a skill.
Output format:
1. <n> [complexity: low|med|high]
Why: <one sentence — what pain does this solve?>
Trigger: "Use when <specific, multi-step scenario>"
Ask via AskUserQuestion (multiSelect): one option per ranked skill + All + Skip.
On the user's reply, jump straight into Create mode Step 1 for each chosen skill. Do not send a "shall I proceed" confirmation text — the answer already IS the go-ahead, and the extra round-trip drops the user out of flow. If multiple skills were picked, process them one at a time end-to-end (Step 1 through Step 5) before starting the next — a half-created skill cluttering staging is harder to recover from than sequential work.
Goal: draft a high-quality SKILL.md from a free-form prompt.
$ARGUMENTS describes what the skill should do. Derive a short kebab-case skill name
from the prompt automatically (e.g. "translate i18n JSON files" → translate-i18n).
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_draft.py" "<derived-name>" "<$ARGUMENTS>"
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_staging.py" "<derived-name>"
The draft is the attention anchor the PreToolUse hook re-reads before
every tool call. The staging dir (.skill-forge/staging/<n>/) is where
the real skill files get assembled — we never Write directly into
.claude/skills/<n>/ because a fresh dir there has no SKILL.md yet and
so fails the trust-boundary exemption, prompting even under YOLO.
Write grep/glob/read output to .skill-forge/insights.md first.
Promote confirmed patterns to the draft only after review. This separation
prevents codebase content from being injected into every subsequent tool call
via the hook.
init_staging.py pre-wrote the skeleton at
.skill-forge/staging/<n>/SKILL.md — frontmatter with the three-clause
description stub and empty Prerequisites / Steps / Verification / Notes
sections. Use Edit to fill content; don't rewrite the layout. Bundled
helper scripts go under .skill-forge/staging/<n>/scripts/. CHANGELOG
lands at the target via finalize_skill.py --changelog in Step 5 — never
write it by hand in staging.
Script over prompt for mechanics. Every skill you design inherits this
rule: mechanical work — file paths, format templates, date stamps, version
numbers, JSON schemas, counters — belongs in a helper script, not in
SKILL.md. Prompt tokens are re-paid on every invocation and drift under
maintenance; scripts are tested once and stay deterministic. Push every
fixed-shape step into scripts/ and leave only intent, judgment, and
decision criteria in the prompt. If you catch yourself spelling out a
literal block template, a bump rule, or a schema in prose, stop — move it
into a script and have the prompt call the script.
≤ 250 chars (Claude Code truncates from end — front-load distinctive keywords). Skills only trigger for multi-step workflows (3+ coordinated actions); write scenarios, not single verbs. Three-clause structure:
<specific multi-step scenario> — name real artifacts (file types,
commands, frameworks), not abstractions. Replace vague verbs ("manage",
"handle") with precise workflows.<short phrase>, use when they mention
<real phrasing> — pushy coverage for understatement. Claude undertriggers
by default.<simple/adjacent task> — list FP patterns explicitly
(e.g., "simple file reads, single-step edits, code explanations"). Qualify
keywords shared with unrelated tasks (e.g., "deploy" → "deploy with rollback
and health checks").Fewer starting errors = fewer optimizer rounds to converge.
Instruction style: explain why, not MUST/NEVER. Modern LLMs act more
reliably when they understand the reason behind a constraint than when
they're handed a list of unexplained rules. Prefer "Write the config to
<path> so the reloader watcher picks it up without a restart" over "MUST
write to <path>". Rules divorced from their purpose break in edge cases
the author didn't foresee; rules with a rationale generalize.
Spawn the skill-grader agent (the Agent tool with subagent_type="skill-grader")
and point it at the staged draft: .skill-forge/staging/<n>/SKILL.md.
The grader returns JSON — parse total and threshold_pass.
See Skill evaluator for the rubric. Self-evaluating produces charity-biased
scores because the main agent has sunk cost in the draft; a fresh grader
context scores the text as written. Finalize only on total ≥ 6.
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/record_eval_score.py" <score>
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/finalize_skill.py" "<n>" --mode create
finalize_skill.py runs entirely inside a subprocess — shutil.copytree
moves the staged tree into .claude/skills/<n>/ without going through
Claude's tool permission layer, so no prompt fires even on a brand-new
skill dir. The same script consumes the pending eval score, upserts the
registry, wipes staging/<n>/, and clears .skill-forge/draft.md. Offer
to run improve mode next to tune the description.
Goal: iterate an existing skill — diagnose whether the issue is content, triggering, or both, then fix accordingly.
$ARGUMENTS describes what to improve. Identify the target skill from the prompt
by matching against the registry (name, description, or intent). If ambiguous, ask.
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_improve.py" "<matched-name>"
init_improve.py does two things in one shot: copies the live skill dir
(.claude/skills/<n>/*) into .skill-forge/staging/<n>/, and writes the
SKILL.md into the active draft. Every Edit/Write below lands in staging —
.claude/skills/<n>/ stays untouched until finalize_skill.py --mode update
copies the finished result back atomically in Step 4.
Content:
scripts/.Triggering:
Do NOT use when?Classify → 3a (content), 3b (triggering), or both (3a first).
.skill-forge/insights.md (low trust staging).skill-forge/staging/<n>/SKILL.md with Edit (not Write)Bundling standard: if the skill's recent 3 uses independently generated the same
helper code, that code belongs in .skill-forge/staging/<n>/scripts/ — write once,
reference from SKILL.md. finalize_skill.py --mode update copies the whole
tree back, so new scripts/ entries land in place automatically.
Generate 20 trigger eval queries (10 should-trigger, 10 should-not-trigger).
Save to .skill-forge/staging/<n>/.opt/trigger_evals.json — the .opt/
dir is part of the staged skill, so it gets copied back to
.claude/skills/<n>/.opt/ by finalize and persists for future improve
rounds (history, convergence flags):
[
{"query": "<realistic user message>", "should_trigger": true},
{"query": "<near-miss that should NOT trigger>", "should_trigger": false}
]
Query quality rules:
Ask user to review before running.
Run optimization loop:
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/optimize_description.py" \
--skill-path ".skill-forge/staging/<n>" \
--eval-set ".skill-forge/staging/<n>/.opt/trigger_evals.json" \
--max-iterations 5
Safe to point at staging: optimize_description.py only reads
SKILL.md for the current description and writes opt_state.json next to
it. The actual claude -p eval subprocess writes a throwaway command
file into .claude/commands/ with a UUID-suffixed slug, so there's no
collision with the live skill that's still sitting in .claude/skills/.
Show before/after description and score improvement. Apply the winning
description to .skill-forge/staging/<n>/SKILL.md with Edit.
opt_state.json lands at .skill-forge/staging/<n>/.opt/opt_state.json
with round history (FP/FN counts, best score, convergence flag) and
gets copied back on finalize. Convergence = perfect train score (1.0).
If FP/FN counts stall across rounds, stop early and report — the eval
set probably needs sharper near-misses.
Re-grade the patched staged draft via the skill-grader subagent (same
agent as create mode). Then one command does everything:
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/record_eval_score.py" <score>
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/finalize_skill.py" "<n>" \
--mode update \
--changelog "<one-line summary: what changed and why>" \
--bump patch # or: minor (new capability), major (breaking change)
finalize_skill.py copies staging over .claude/skills/<n>/, bumps the
registry version, prepends a dated entry to CHANGELOG.md with the
computed version, consumes the eval score, wipes staging, and clears the
draft — one subprocess call, zero Claude tool permission prompts. Date,
version, and entry format are mechanical — never write CHANGELOG by hand.
Patch vs rewrite: Edit the staged SKILL.md unless >60% of content
changes; full rewrite is fine too, same finalize path.
Fires after: 5+ tool calls, user correction mid-task, error recovery, or explicit "remember this" / "save this workflow" / "make a skill" requests.
AskUserQuestion: "Reusable pattern: Create / Rename / Skip.Create → run create mode (auto-name). Rename → plain-text prompt for name, then create. Skip → silent reset.Skip if: task < 3 tool calls, pure read-only, or simple single-file edit.
Scoring runs in the skill-grader subagent — fresh context, no sunk cost in
the draft, calibrated scores. Main agent does not self-evaluate.
Invocation:
Agent tool, subagent_type="skill-grader"
prompt: "Grade draft at <absolute path>. Mode: create|improve.
Write verdict to <output path> and echo to stdout."
Parse the returned JSON: total (int 0–8) is the decision key,
threshold_pass (bool) is true iff total ≥ 6. Rubric, scoring dimensions,
and the full schema live in agents/skill-grader.md — single source, don't
duplicate here.
Threshold actions: pass → finalize. 4–5 → revise once, re-grade. < 4
→ show the grader's suggestions and ask user whether to rework or abort.
.skill-forge/draft.md.Generalize from failures; don't patch the one failing case. A skill that passes 3 tests but breaks on the 4th real use is worse than one moderately good on all.
| File | Trust | Re-read by hooks? | Purpose |
|---|---|---|---|
.skill-forge/draft.md | HIGH | YES (every tool call) | Active skill being written |
.skill-forge/insights.md | LOW | NO | Codebase scan staging |
.claude/skills/skill_registry.json | HIGH | NO (loaded on demand) | Version registry |
.claude/skills/<n>/SKILL.md | HIGH | NO | Final persisted skill |
.claude/skills/<n>/CHANGELOG.md | MED | NO | Evolution history |
.claude/skills/<n>/scripts/ | HIGH | NO (run on demand) | Bundled helper scripts |
.claude/skills/skill_registry.json:
{
"version": "1",
"skills": [
{
"name": "skill-name",
"version": "1.0.0",
"scope": "project",
"created": "2026-01-01",
"updated": "2026-01-01",
"auto_trigger": true,
"description_chars": 187,
"eval_score": 7,
"trigger_score": null,
"usage_count": 0
}
]
}
trigger_score is populated by improve mode. null = not yet run.