From oberskills
Creates, evaluates, and reviews Claude Code skills and agent definition files. Covers skill-vs-hook decisions, SKILL.md authoring, frontmatter optimization, baseline-first evals with MCP tooling, validation, and packaging.
npx claudepluginhub ryanthedev/oberskills
How this skill is triggered: by the user, by Claude, or both
Slash command
/oberskills:skill-craft [create <topic> | review <path>]
When to use
Trigger on "create a skill", "write a skill for", "review this skill", "skill evals", "benchmark the skill", "trigger eval", "optimize the description", "package the skill", "should this be a skill or a hook".
[create <topic> | review <path>]
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Create skills through baseline-first evals, or audit existing ones. Judgment work (intake, design, eval authoring, interpretation, edits) happens here; everything checkable (validation, trigger probes, run spawning, grading, verdicts, stats, packaging) runs through the `skill-eval` MCP tools — never fill in compliance verdicts by hand.
| Intent | Mode | Load |
|---|---|---|
| Build something new ("create/write a skill for X") | CREATE | Pipeline below; references per phase |
| Audit an existing skill (path to a SKILL.md or skill directory) | REVIEW | ${CLAUDE_SKILL_DIR}/references/review-skill.md |
| Audit the prompt body wording inside an agent definition, or a dispatch brief | Redirect | Invoke oberskills:prompt REVIEW (the file as an artifact — structure, frontmatter, evals — stays here) |
| Improve an existing skill | CREATE from phase 3 (BASELINE), with an old_skill snapshot as the baseline config | ${CLAUDE_SKILL_DIR}/references/eval.md |
| Unclear or ambiguous request | Ask, or invoke oberskills:clarify for structured decomposition | — |
${CLAUDE_SKILL_DIR}/references/design.md).
| You need | Build | Why |
|---|---|---|
| Reusable procedure or domain knowledge Claude loads when relevant | Skill (default) | Commands are merged into skills; skills add supporting files and invocation control |
| User-controlled side-effect macro (deploy, release, publish) | Skill + disable-model-invocation: true | You don't want Claude deciding to deploy because the code looks ready; also removes the description from context |
| Background conventions Claude applies, user never invokes | Skill + user-invocable: false | Reference-content archetype |
| Work that floods the main context (logs, search results, bulk reads) | Subagent | Isolation: explores with tens of thousands of tokens, returns a distilled summary (sizing norm: the oberskills:agent skill) |
| Guaranteed enforcement on every event (block a tool call, gate a commit) | Hook | Prose can't guarantee execution; hooks run in the harness. Measured: a forced-eval hook reached near-perfect activation where description-level fixes plateaued — numbers in ${CLAUDE_SKILL_DIR}/references/design.md (Spence) |
| Deterministic, repeated computation | Script bundled in a skill | Executed, not loaded — only output costs tokens |
| Always-relevant, small project facts | CLAUDE.md | Passive context is consistently available; no triggering decision to miss |
| Both reusable instructions AND isolation | Skill + context: fork (+ agent:) — task content with an actionable prompt only | Guideline-only forked skills return nothing useful |
.claude/commands/*.md files are legacy: same frontmatter, no supporting files. New artifacts get a skill directory. Full decision detail: ${CLAUDE_SKILL_DIR}/references/design.md.
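The invocation-control rows in the table differ only in frontmatter. As a minimal sketch, assuming SKILL.md opens with a YAML frontmatter block delimited by `---` lines, the helper below renders those variants. `render_frontmatter` and the skill names are hypothetical illustrations; the field names `disable-model-invocation`, `user-invocable`, and `context` are the ones named in the table. The rendering is naive (no YAML quoting), so it only suits simple values.

```python
def render_frontmatter(name: str, description: str, **flags: object) -> str:
    """Render a SKILL.md frontmatter block (hypothetical helper, not part of any tool)."""
    lines = ["---", f"name: {name}", f"description: {description}"]
    for key, value in flags.items():
        # Python keyword arguments use underscores; frontmatter fields use hyphens.
        field = key.replace("_", "-")
        # YAML booleans are lowercase; other values are written as-is.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{field}: {rendered}")
    lines.append("---")
    return "\n".join(lines)


# User-controlled side-effect macro: only the user may invoke it.
deploy = render_frontmatter(
    "deploy-app",  # hypothetical skill name
    "Deploys the app to production. Use when the user asks to deploy.",
    disable_model_invocation=True,
)

# Background conventions: Claude applies them, the user never invokes them.
conventions = render_frontmatter(
    "house-style",  # hypothetical skill name
    "Applies project naming conventions. Use when writing new modules.",
    user_invocable=False,
)

# Reusable instructions that also need isolation.
forked = render_frontmatter(
    "log-triage",  # hypothetical skill name
    "Summarizes failing test logs. Use when a test run fails.",
    context="fork",
)
```

Printing `deploy` shows the rendered block; the other two variants change only the final field.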
| # | Phase | Load | skill-eval tools | Gate — proceed only when |
|---|---|---|---|---|
| 1 | INTAKE | — | — | Problem stated; 3+ natural trigger phrasings collected; pre-gate above passed |
| 2 | DESIGN | ${CLAUDE_SKILL_DIR}/references/design.md | — | Artifact type chosen from the table above; file structure and freedom levels planned |
| 3 | BASELINE | ${CLAUDE_SKILL_DIR}/references/eval.md | run_eval (configurations: ["without_skill"], one call per eval id) | evals.json with ≥3 evals exists; baseline runs complete; specific failures documented from grading output |
| 4 | BUILD | ${CLAUDE_SKILL_DIR}/references/build.md | validate_skill | Minimal SKILL.md written that addresses the documented baseline gaps; validate_skill returns zero errors |
| 5 | EVAL | ${CLAUDE_SKILL_DIR}/references/eval.md | run_eval, aggregate_benchmark, test_triggers, optimize_description, compare_outputs | with-skill beats baseline on the gap assertions; trigger accuracy passes in both directions; pressure gates pass (discipline skills). Otherwise iterate — max 3 iterations, then redesign |
| 6 | SHIP | — | validate_skill (package: true if a .skill file is wanted) | Zero errors; warnings resolved or explicitly justified to the user; user checkpoint passed |
Write the skill after the baseline. The baseline failures are the spec.
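The EVAL gate's iteration rule (at most three iterations, then redesign) can be sketched as a small loop. This illustrates the control flow only. `run_iteration` is a hypothetical callable standing in for one edit-and-re-evaluate pass, and it returns True when every gate passes. Counting the first evaluation as iteration 1 is an assumption; the table does not say.

```python
from typing import Callable

MAX_ITERATIONS = 3  # from the EVAL gate: max 3 iterations, then redesign


def eval_phase(run_iteration: Callable[[int], bool]) -> str:
    """Return "ship" once an iteration passes all gates, otherwise "redesign".

    run_iteration is a hypothetical stand-in for one pass of editing the
    skill and re-running the evals; it receives the 1-based iteration number.
    """
    for iteration in range(1, MAX_ITERATIONS + 1):
        if run_iteration(iteration):
            return "ship"
    # Three failed iterations: stop patching and go back to DESIGN.
    return "redesign"
```

For example, `eval_phase(lambda i: i == 2)` returns `"ship"` on the second pass, while a callable that never passes yields `"redesign"` after exactly three attempts.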
validate_skill; the tool's output is normative)
| Limit | Value |
|---|---|
| name | ≤64 chars, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens, must match the directory name, no "anthropic"/"claude" |
| description | 1–1024 chars, third person, no XML tags |
| description + when_to_use in the listing | truncated at 1,536 chars combined |
| SKILL.md body | <500 lines hard; <5k tokens recommended; ~200 lines for the always-relevant core |
| References | One level deep from SKILL.md; >100 lines → table of contents at top |
| Evals | ≥3 before ship |
When this summary and the tool disagree, trust the tool.
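To make the name and description limits concrete, here is a rough pre-check written from the table alone. It is not the validate_skill implementation, and `check_limits` is a hypothetical helper. The XML-tag test is a heuristic, and the third-person rule is not checked because it needs judgment. Run validate_skill for the authoritative result.

```python
import re

# Lowercase alphanumeric segments joined by single hyphens. One pattern rules
# out leading, trailing, and consecutive hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
RESERVED_WORDS = ("anthropic", "claude")
# Heuristic only: anything that looks like an opening or closing tag.
XML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def check_limits(name: str, description: str, directory_name: str) -> list[str]:
    """Return a list of limit violations (empty when none are found)."""
    errors = []
    if len(name) > 64:
        errors.append("name exceeds 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        errors.append("name must be lowercase alphanumeric with single hyphens")
    if name != directory_name:
        errors.append("name must match the directory name")
    if any(word in name for word in RESERVED_WORDS):
        errors.append('name must not contain "anthropic" or "claude"')
    if not 1 <= len(description) <= 1024:
        errors.append("description must be 1-1024 characters")
    if XML_TAG.search(description):
        errors.append("description must not contain XML tags")
    return errors
```

A call such as `check_limits("skill-craft", "Creates and reviews skills.", "skill-craft")` returns an empty list, while a name like `bad--name` is rejected for its consecutive hyphens.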
[Verb-first capabilities]. Use when [triggers/contexts]. Not for: [near-miss exclusions].
test_triggers. Full doctrine and when to deviate: ${CLAUDE_SKILL_DIR}/references/build.md.
Tools are exposed as mcp__plugin_oberskills_skill-eval__<tool>; short names below.
| Tool | Use at | Does |
|---|---|---|
| validate_skill | BUILD, SHIP, REVIEW | Spec lint + house WARN lints; zero errors required to ship; package: true also zips a .skill |
| test_triggers | EVAL, REVIEW | Activation rate over should/shouldn't queries via isolated probe sessions |
| optimize_description | EVAL | Train/holdout description improvement; chunked, one iteration per call; loop until done: true |
| run_eval | BASELINE, EVAL | Runs one eval id across configurations × runs in parallel, composes pressure blocks, grades externally |
| grade_run | EVAL | (Re)grades an existing run directory after editing assertions |
| aggregate_benchmark | EVAL | Stats, named-config delta, ship gates, notes → benchmark.json + benchmark.md |
| compare_outputs | EVAL (subjective skills) | Blind A/B judgment with shuffled sides |
If these tools are missing, the server isn't running: tell the user to run /reload-plugins (dependencies install via the plugin's SessionStart hook on next session start).
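optimize_description is chunked: each call performs one iteration, and the caller repeats until the result reports done: true. A minimal sketch of that loop follows. `call_tool` is a hypothetical stand-in for however the MCP tool is invoked, the `done` key mirrors the done: true flag in the table, and the `max_calls` safety cap is an added assumption, not documented behavior of the tool.

```python
from typing import Any, Callable


def optimize_until_done(
    call_tool: Callable[[], dict[str, Any]],
    max_calls: int = 20,  # assumed safety cap so a stuck run cannot loop forever
) -> dict[str, Any]:
    """Call the tool once per iteration until its result reports done."""
    for _ in range(max_calls):
        result = call_tool()  # one optimization iteration per call
        if result.get("done") is True:
            return result
    raise RuntimeError(f"optimization not done after {max_calls} calls")
```

The cap turns a stalled optimization into a visible error instead of an endless loop; remove or raise it if long runs are expected.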
validate_skill WARNs on these constructs.
| Counterpart | Hand-off |
|---|---|
| oberskills:prompt | Skill bodies are prompts: apply its DESIGN principles to body text; it owns example-count guidance, snippets, and the porting reference for non-Claude/weak-model targets. All prompt-wording review, including the prompt body inside an agent definition, goes to its REVIEW mode; the definition file as an artifact (structure, frontmatter, evals) stays here |
| oberskills:agent | Dispatching any subagent during development or review (analyzer, behavioral tests); owns model/effort selection and the subagent-frontmatter field table (its mechanics reference) |
| oberskills:clarify | Ambiguous intake |
| skill-eval MCP server | All checkable steps (table above). Server owns validation rules, pressure-block language, trigger-probe mechanics, grading schema, verdict computation, workspace layout |
Analyzer agent prompt (dispatched during EVAL, see ${CLAUDE_SKILL_DIR}/references/eval.md): ${CLAUDE_SKILL_DIR}/agents/analyzer.md.
| File | Load when |
|---|---|
| ${CLAUDE_SKILL_DIR}/references/design.md | DESIGN phase: artifact choice, invocation control, structure, disclosure, routers |
| ${CLAUDE_SKILL_DIR}/references/build.md | BUILD phase: frontmatter, naming, description doctrines, writing rules, model deltas, templates |
| ${CLAUDE_SKILL_DIR}/references/eval.md | BASELINE/EVAL phases: evals.json authoring, checks, pressure evals, trigger evals, interpretation, ship gates |
| ${CLAUDE_SKILL_DIR}/references/review-skill.md | REVIEW mode: audit dimensions, behavioral test, verdict |