Skill Conductor

A skill that creates, evaluates, and improves other skills. Meta-level.

Architecture-first skill lifecycle: design → build → test → evaluate → package.

Most skill tools jump straight to "write SKILL.md." Conductor makes you choose the architecture first — because rewriting a wrong pattern costs more than writing it right.

Install

# skills.sh — installs into ~/.claude/skills
npx skills add smixs/skill-conductor

# Claude Code plugin
/plugin marketplace add smixs/skill-conductor
/plugin install skill-conductor@smixs

v3.0.0 — BinEval scoring, English canon, dual-channel install

BinEval evaluation — replaces the old 5-axis 1-10 scoring with atomic binary yes/no questions across 5 dimensions (Discovery, Clarity, Structure, Robustness, Completeness). Each answer carries grounding evidence; the pass criterion is a gate on critical questions, not an opaque number. Adapted from "Ask, Don't Judge" (arXiv 2606.27226).
Deterministic + LLM split — eval_skill.py --json emits structural checks as binary question records; an evaluator agent answers the judgment questions with evidence and a self-update loop feeds failing questions back into edits.
9 authoring principles — a universal canon (pre-flight, no-process-in-description, MOC, fresh-practitioner author, TWI "why", blind-agent test, inline checklists, one-term-per-concept, cut-the-fat) in references/sop-practices.md, applied to every skill.
Dual-channel install — one repo, one source of truth, installable via skills.sh and the Claude Code plugin marketplace.
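
The gate idea behind BinEval can be sketched in a few lines of Python. This is a minimal illustration, not the actual eval_skill.py schema: the dimension names come from the notes above, while the field names, the `critical` flag, and the sample questions are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Dimension names are taken from the release notes; the record layout below
# is an assumption, not the real output format of eval_skill.py.
DIMENSIONS = ("Discovery", "Clarity", "Structure", "Robustness", "Completeness")


@dataclass(frozen=True)
class Answer:
    dimension: str          # one of DIMENSIONS
    question: str           # atomic yes/no question
    passed: bool            # the binary answer
    evidence: str           # grounding evidence for the answer
    critical: bool = False  # hypothetical flag: does this question gate the result?


def gate(answers: list[Answer]) -> tuple[bool, list[Answer]]:
    """Pass only if no critical question failed.

    Also returns every failing answer so it can drive the next round of edits.
    """
    failing = [a for a in answers if not a.passed]
    verdict = not any(a.critical for a in failing)
    return verdict, failing


answers = [
    Answer("Discovery", "Does the description say when to use the skill?", True,
           "description opens with 'Use when reviewing a pull request'", critical=True),
    Answer("Clarity", "Is each concept named by exactly one term?", False,
           "'runner' and 'executor' refer to the same script"),
    Answer("Robustness", "Is there a documented failure path?", False,
           "no section covers errors", critical=True),
]

verdict, failing = gate(answers)
print(verdict)                        # False: a critical question failed
print([a.question for a in failing])  # the questions to feed back into edits
```

A failed non-critical question does not block the gate in this sketch, but it still appears in the failing list, which mirrors the idea of a gate on critical questions rather than a single aggregate number.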

v3: SOP practices + smoke tests

references/sop-practices.md — 80 years of Standard Operating Procedure wisdom applied to skill authoring. Inline checklists at risk-points, pre-flight checks, programmatic validation, exception handling patterns. Use for procedural skills (client intake, onboarding, reporting, escalation)
scripts/test_smoke.py — fast safety net for skill-conductor scripts themselves. Verifies critical scripts execute on known-good skills, fail on known-bad, produce expected output shapes. Run: uv run scripts/test_smoke.py
Updated eval agents (grader, comparator, analyzer) with refined rubrics
Improved package_skill.py, eval_skill.py, and schema validation
Updated patterns.md and schemas.md with tighter definitions
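
The smoke-test idea (pass on a known-good skill, fail on a known-bad one, return a predictable output shape) might look like the sketch below. It does not reproduce scripts/test_smoke.py: `check_skill` is a hypothetical stand-in validator, and its rule (SKILL.md exists and opens with a frontmatter marker) is an assumption chosen only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def check_skill(skill_dir: Path) -> bool:
    """Hypothetical validator: SKILL.md exists and starts with '---'."""
    skill_md = skill_dir / "SKILL.md"
    return skill_md.is_file() and skill_md.read_text(encoding="utf-8").startswith("---")


def run_smoke() -> dict:
    """Build one good and one bad fixture, then check both outcomes."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good"
        bad = Path(tmp) / "bad"
        good.mkdir()
        bad.mkdir()
        (good / "SKILL.md").write_text("---\nname: demo\n---\nBody\n", encoding="utf-8")
        # The bad fixture deliberately has no SKILL.md.
        return {
            "good_passes": check_skill(good),
            "bad_fails": not check_skill(bad),
        }


results = run_smoke()
print(results)
```

Checking the failing case matters as much as the passing one: a validator that accepts everything would still pass a test suite that only uses good fixtures.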

v2: Anthropic's eval engine meets architecture-first design

Anthropic updated their skill-creator with serious eval infrastructure. We took the best of it:

From Anthropic's skill-creator:

3 specialized agents: grader (assertion checking + claim extraction), comparator (blind A/B testing), analyzer (post-hoc root cause analysis)
Parallel eval execution with isolated contexts (no cross-contamination)
Automated description optimization with train/test split (60/40)
Benchmark tracking: pass rate, tokens, time with variance analysis
HTML eval viewer with qualitative + quantitative tabs
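
The 60/40 train/test split can be illustrated as follows. This is a sketch under assumptions: it treats the items being split as example trigger queries, and the function name, seed, and rounding rule are invented for the example rather than taken from the eval engine.

```python
import random


def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle with a fixed seed, then cut into train and test sets."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


queries = [f"query-{i}" for i in range(10)]  # placeholder queries
train, test = split_queries(queries)
print(len(train), len(test))  # 6 4
```

The description would be tuned against the train set only and then scored on the held-out test set, so an improvement that merely memorizes the tuning queries shows up as a gap between the two scores.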

What Conductor adds on top:

Architecture before code. 5 patterns (Sequential, Iterative, Context-Aware, Domain Intelligence, Multi-MCP) with selection criteria. Pick wrong = rewrite everything later
Degrees of freedom. Low (deterministic scripts) → Medium (pseudocode) → High (free text). Match freedom to risk tolerance
TDD RED before writing. Verify the agent fails WITHOUT the skill first. If it already handles the task — you don't need a skill
Quality scoring with a gate (now BinEval; see v3.0.0). Numbers and evidence, not a "vibe check"
Skill categorization. Capability uplift (teaching something new) vs Encoded preference (sequencing known abilities). Different skills need different testing strategies
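
The TDD RED step above reduces to a simple decision rule, sketched here. The runner is a hypothetical stand-in: `fake_baseline` simulates an agent run without the skill loaded, and the task names are invented for illustration.

```python
from typing import Callable, Iterable


def needs_skill(tasks: Iterable[str], run_baseline: Callable[[str], bool]) -> bool:
    """RED check: True if the agent fails at least one task WITHOUT the skill."""
    return any(not run_baseline(task) for task in tasks)


def fake_baseline(task: str) -> bool:
    """Stand-in for a real agent run with no skill installed (hypothetical)."""
    return task != "generate quarterly report"


tasks = ["summarize a file", "generate quarterly report"]
print(needs_skill(tasks, fake_baseline))                 # True: the baseline fails a task
print(needs_skill(["summarize a file"], fake_baseline))  # False: no skill needed
```

If the check returns False, the agent already handles every task, so writing a skill would add maintenance cost without changing behavior.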
