Skill

skill-craft

Creates, evaluates, and reviews Claude Code skills and agent definition files. Covers skill-vs-hook decisions, SKILL.md authoring, frontmatter optimization, baseline-first evals with MCP tooling, validation, and packaging.

developer-tools

npx claudepluginhub ryanthedev/oberskills

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/oberskills:skill-craft [create <topic> | review <path>]

User invocable

Model invocable

Inline context

Default effort

When to use

Trigger on "create a skill", "write a skill for", "review this skill", "skill evals", "benchmark the skill", "trigger eval", "optimize the description", "package the skill", "should this be a skill or a hook".

Argument hint[create <topic> | review <path>]

Tool Access

This skill is limited to the following tools:

mcp__plugin_oberskills_skill-eval__validate_skillmcp__plugin_oberskills_skill-eval__test_triggersmcp__plugin_oberskills_skill-eval__optimize_descriptionmcp__plugin_oberskills_skill-eval__run_evalmcp__plugin_oberskills_skill-eval__grade_runmcp__plugin_oberskills_skill-eval__aggregate_benchmarkmcp__plugin_oberskills_skill-eval__compare_outputs

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Create skills through baseline-first evals, or audit existing ones. Judgment work (intake, design, eval authoring, interpretation, edits) happens here; everything checkable (validation, trigger probes, run spawning, grading, verdicts, stats, packaging) runs through the `skill-eval` MCP tools — never fill in compliance verdicts by hand.

Supporting Files

agents/analyzer.mdreferences/build.mdreferences/design.mdreferences/eval.mdreferences/review-skill.md

SKILL.md

132 lines · ~2.9k tokens

Similar Skills

skill-creator

2.2k

Creates and validates Claude Code skills per AgentSkills.io 2026 spec and 100-point rubric. Use for building new skills or auditing existing ones.

20 files11 tools

skill-creator

1.5k

Creates, refines, and benchmarks Claude Code agent skills. Drafts content, generates test prompts, runs evals with Python scripts, analyzes results, and iterates on feedback.

20 files

claude-code-settings

claude-skills

Guides creation of modular Agent Skills with progressive disclosure, covering skill structure, SKILL.md format, and best practices for capability uplift and encoded preference categories.

9 files

all-skills

Stats

LanguageTypeScript

Stars39

Forks4

MaintenanceExcellent

Last CommitJun 10, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

skill-craft

Create skills through baseline-first evals, or audit existing ones. Judgment work (intake, design, eval authoring, interpretation, edits) happens here; everything checkable (validation, trigger probes, run spawning, grading, verdicts, stats, packaging) runs through the skill-eval MCP tools — never fill in compliance verdicts by hand.

Mode router

Intent	Mode	Load
Build something new ("create/write a skill for X")	CREATE	Pipeline below; references per phase
Audit an existing skill (path to a SKILL.md or skill directory)	REVIEW	`${CLAUDE_SKILL_DIR}/references/review-skill.md`
Audit the prompt body wording inside an agent definition, or a dispatch brief	Redirect	Invoke `oberskills:prompt` REVIEW (the file as an artifact — structure, frontmatter, evals — stays here)
Improve an existing skill	CREATE from phase 3 (BASELINE), with an `old_skill` snapshot as the baseline config	`${CLAUDE_SKILL_DIR}/references/eval.md`
Unclear or ambiguous request	Ask, or invoke `oberskills:clarify` for structured decomposition	—

Should this exist? (pre-gate)

Can Claude already do this well? Run the task once without a skill. No documented failure means don't build — if you didn't watch an agent fail without the skill, you don't know whether the skill teaches the right thing.
If you can't write three evals for it, don't build it.
Will it be used 5+ times? One-off → just do the task.
Always-relevant and small? → CLAUDE.md, not a skill (passive inline context beats skill retrieval for content needed on every task — Vercel eval; details in ${CLAUDE_SKILL_DIR}/references/design.md).

What to build

You need	Build	Why
Reusable procedure or domain knowledge Claude loads when relevant	Skill (default)	Commands are merged into skills; skills add supporting files and invocation control
User-controlled side-effect macro (deploy, release, publish)	Skill + `disable-model-invocation: true`	You don't want Claude deciding to deploy because the code looks ready; also removes the description from context
Background conventions Claude applies, user never invokes	Skill + `user-invocable: false`	Reference-content archetype
Work that floods the main context (logs, search results, bulk reads)	Subagent	Isolation: explores with tens of thousands of tokens, returns a distilled summary (sizing norm: the `oberskills:agent` skill)
Guaranteed enforcement on every event (block a tool call, gate a commit)	Hook	Prose can't guarantee execution; hooks run in the harness. Measured: a forced-eval hook reached near-perfect activation where description-level fixes plateaued — numbers in `${CLAUDE_SKILL_DIR}/references/design.md` (Spence)
Deterministic, repeated computation	Script bundled in a skill	Executed, not loaded — only output costs tokens
Always-relevant, small project facts	CLAUDE.md	Passive context is consistently available; no triggering decision to miss
Both reusable instructions AND isolation	Skill + `context: fork` (+ `agent:`) — task content with an actionable prompt only	Guideline-only forked skills return nothing useful

.claude/commands/*.md files are legacy: same frontmatter, no supporting files. New artifacts get a skill directory. Full decision detail: ${CLAUDE_SKILL_DIR}/references/design.md.

CREATE pipeline

#	Phase	Load	skill-eval tools	Gate — proceed only when
1	INTAKE	—	—	Problem stated; 3+ natural trigger phrasings collected; pre-gate above passed
2	DESIGN	`${CLAUDE_SKILL_DIR}/references/design.md`	—	Artifact type chosen from the table above; file structure and freedom levels planned
3	BASELINE	`${CLAUDE_SKILL_DIR}/references/eval.md`	`run_eval` (`configurations: ["without_skill"]`, one call per eval id)	`evals.json` with ≥3 evals exists; baseline runs complete; specific failures documented from grading output
4	BUILD	`${CLAUDE_SKILL_DIR}/references/build.md`	`validate_skill`	Minimal SKILL.md written that addresses the documented baseline gaps; `validate_skill` returns zero errors
5	EVAL	`${CLAUDE_SKILL_DIR}/references/eval.md`	`run_eval`, `aggregate_benchmark`, `test_triggers`, `optimize_description`, `compare_outputs`	with-skill beats baseline on the gap assertions; trigger accuracy passes in both directions; pressure gates pass (discipline skills). Otherwise iterate — max 3 iterations, then redesign
6	SHIP	—	`validate_skill` (`package: true` if a `.skill` file is wanted)	Zero errors; warnings resolved or explicitly justified to the user; user checkpoint passed

Write the skill after the baseline. The baseline failures are the spec.

Hard limits (summary — enforced by `validate_skill`; the tool's output is normative)

Limit	Value
`name`	≤64 chars, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens, must match the directory name, no "anthropic"/"claude"
`description`	1–1024 chars, third person, no XML tags
`description` + `when_to_use` in the listing	truncated at 1,536 chars combined
SKILL.md body	<500 lines hard; <5k tokens recommended; ~200 lines for the always-relevant core
References	One level deep from SKILL.md; >100 lines → table of contents at top
Evals	≥3 before ship

When this summary and the tool disagree, trust the tool.

Description quick formula

[Verb-first capabilities]. Use when [triggers/contexts]. Not for: [near-miss exclusions].

Third person, always — the description is injected into the system prompt.
Capability nouns are fine; never process steps — a workflow summary becomes a shortcut Claude follows instead of reading the body.
The exclusion clause lists near-misses (tasks that share keywords but belong elsewhere), not absurd negatives.
Measure, don't guess: test_triggers. Full doctrine and when to deviate: ${CLAUDE_SKILL_DIR}/references/build.md.

skill-eval tool quick reference

Tools are exposed as mcp__plugin_oberskills_skill-eval__<tool>; short names below.

Tool	Use at	Does
`validate_skill`	BUILD, SHIP, REVIEW	Spec lint + house WARN lints; zero errors required to ship; `package: true` also zips a `.skill`
`test_triggers`	EVAL, REVIEW	Activation rate over should/shouldn't queries via isolated probe sessions
`optimize_description`	EVAL	Train/holdout description improvement; chunked — one iteration per call, loop until `done: true`
`run_eval`	BASELINE, EVAL	Runs one eval id across configurations × runs in parallel, composes pressure blocks, grades externally
`grade_run`	EVAL	(Re)grades an existing run directory after editing assertions
`aggregate_benchmark`	EVAL	Stats, named-config delta, ship gates, notes → `benchmark.json` + `benchmark.md`
`compare_outputs`	EVAL (subjective skills)	Blind A/B judgment with shuffled sides

If these tools are missing, the server isn't running: tell the user to run /reload-plugins (dependencies install via the plugin's SessionStart hook on next session start).

Core authoring rules

Claude is already smart. Only add context Claude doesn't have; challenge every paragraph's token cost — the context window is a public good.
Standing instructions, not one-time steps. Skill content persists in context for the rest of the session.
Don't over-prompt. No CRITICAL/MUST trigger language by default — current models overtrigger under it, and skills written for prior models are often too prescriptive for current ones and can degrade output quality. Escalate force only for rules that measurably get missed, and then explain why.
Gates beat persuasion. "Proceed only when X passes" plus external or deterministic checks. Never self-assessed compliance, never anti-rationalization tables — validate_skill WARNs on these constructs.
Match freedom to fragility. Fragile or sequence-critical work gets an exact script ("run exactly this, do not modify"); open-ended work gets heuristics.
Feedback loops for quality-critical output. Run validator → fix → repeat; make validators verbose with specific error messages.

Integration

Counterpart	Hand-off
`oberskills:prompt`	Skill bodies are prompts — apply its DESIGN principles to body text; it owns example-count guidance, snippets, and the porting reference for non-Claude/weak-model targets. All prompt-wording review, including the prompt body inside an agent definition, goes to its REVIEW mode; the definition file as an artifact (structure, frontmatter, evals) stays here
`oberskills:agent`	Dispatching any subagent during development or review (analyzer, behavioral tests); owns model/effort selection and the subagent-frontmatter field table (its mechanics reference)
`oberskills:clarify`	Ambiguous intake
skill-eval MCP server	All checkable steps (table above). Server owns validation rules, pressure-block language, trigger-probe mechanics, grading schema, verdict computation, workspace layout

Analyzer agent prompt (dispatched during EVAL, see ${CLAUDE_SKILL_DIR}/references/eval.md): ${CLAUDE_SKILL_DIR}/agents/analyzer.md.

References

File	Load when
`${CLAUDE_SKILL_DIR}/references/design.md`	DESIGN phase: artifact choice, invocation control, structure, disclosure, routers
`${CLAUDE_SKILL_DIR}/references/build.md`	BUILD phase: frontmatter, naming, description doctrines, writing rules, model deltas, templates
`${CLAUDE_SKILL_DIR}/references/eval.md`	BASELINE/EVAL phases: evals.json authoring, checks, pressure evals, trigger evals, interpretation, ship gates
`${CLAUDE_SKILL_DIR}/references/review-skill.md`	REVIEW mode: audit dimensions, behavioral test, verdict

skill-craft

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

skill-craft

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

skill-craft

Mode router

Should this exist? (pre-gate)

What to build

CREATE pipeline

Hard limits (summary — enforced by validate_skill; the tool's output is normative)

Description quick formula

skill-eval tool quick reference

Core authoring rules

Integration

References

Similar Skills

Help us improve

skill-craft

Mode router

Should this exist? (pre-gate)

What to build

CREATE pipeline

Hard limits (summary — enforced by validate_skill; the tool's output is normative)

Description quick formula

skill-eval tool quick reference

Core authoring rules

Integration

References

Hard limits (summary — enforced by `validate_skill`; the tool's output is normative)

Hard limits (summary — enforced by `validate_skill`; the tool's output is normative)