Skill Conductor
A skill that creates, evaluates, and improves other skills. Meta-level.

Architecture-first skill lifecycle: design → build → test → evaluate → package.
Most skill tools jump straight to "write SKILL.md." Conductor makes you choose the architecture first — because rewriting a wrong pattern costs more than writing it right.
Install
# skills.sh — installs into ~/.claude/skills
npx skills add smixs/skill-conductor
# Claude Code plugin
/plugin marketplace add smixs/skill-conductor
/plugin install skill-conductor@smixs
v3.0.0 — BinEval scoring, English canon, dual-channel install
- BinEval evaluation — replaces the old 5-axis 1-10 scoring with atomic binary yes/no questions across 5 dimensions (Discovery, Clarity, Structure, Robustness, Completeness). Each answer carries grounding evidence; the pass criterion is a gate on critical questions, not an opaque number. Adapted from "Ask, Don't Judge" (arXiv 2606.27226).
- Deterministic + LLM split — eval_skill.py --json emits structural checks as binary question records; an evaluator agent answers the judgment questions with evidence and a self-update loop feeds failing questions back into edits.
- 9 authoring principles — a universal canon (pre-flight, no-process-in-description, MOC, fresh-practitioner author, TWI "why", blind-agent test, inline checklists, one-term-per-concept, cut-the-fat) in references/sop-practices.md, applied to every skill.
- Dual-channel install — one repo, one source of truth, installable via skills.sh and the Claude Code plugin marketplace.
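The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in eval_skill.py: the `Answer` record, its field names, the `critical` flag, and the sample questions are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """One binary question record (hypothetical shape, not the repo's schema)."""
    dimension: str   # e.g. Discovery, Clarity, Structure, Robustness, Completeness
    question: str
    passed: bool     # binary yes/no answer
    evidence: str    # grounding evidence for the answer
    critical: bool   # assumed flag: does this question count toward the gate?


def gate(answers):
    """Pass only if no critical question failed; also return the failures."""
    failing = [a for a in answers if a.critical and not a.passed]
    return len(failing) == 0, failing


def tally(answers):
    """Per-dimension (passed, total) counts, for reporting alongside the gate."""
    counts = defaultdict(lambda: [0, 0])
    for a in answers:
        counts[a.dimension][1] += 1
        if a.passed:
            counts[a.dimension][0] += 1
    return {dim: tuple(pair) for dim, pair in counts.items()}


# Illustrative answers only; the questions and evidence are invented.
ANSWERS = [
    Answer("Discovery", "Does the description say when to use the skill?",
           True, "description, second sentence", critical=True),
    Answer("Clarity", "Is each concept named by exactly one term?",
           False, "'task' and 'job' are used interchangeably", critical=False),
    Answer("Robustness", "Is a failure path documented?",
           False, "no error-handling section found", critical=True),
]

if __name__ == "__main__":
    ok, failing = gate(ANSWERS)
    print("gate passed:", ok)
    for a in failing:
        print(f"fix next: [{a.dimension}] {a.question} ({a.evidence})")
    print(tally(ANSWERS))
```

In this sketch the failing critical questions are what a self-update loop would feed back into the next edit, while a non-critical failure is reported but does not block the gate.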
v3: SOP practices + smoke tests
- references/sop-practices.md — 80 years of Standard Operating Procedure wisdom applied to skill authoring. Inline checklists at risk-points, pre-flight checks, programmatic validation, exception handling patterns. Use for procedural skills (client intake, onboarding, reporting, escalation).
- scripts/test_smoke.py — fast safety net for skill-conductor scripts themselves. Verifies critical scripts execute on known-good skills, fail on known-bad, produce expected output shapes. Run: uv run scripts/test_smoke.py
- Updated eval agents (grader, comparator, analyzer) with refined rubrics
- Improved package_skill.py, eval_skill.py, and schema validation
- Updated patterns.md and schemas.md with tighter definitions
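The three smoke checks named above (accept known-good, reject known-bad, keep the output shape) might look something like this in miniature. Everything here is a stand-in: `validate_skill` is a toy structural check written for the example, not the logic of eval_skill.py or test_smoke.py, and the two fixtures are invented.

```python
def validate_skill(text: str) -> dict:
    """Toy structural check: frontmatter fence plus name/description fields."""
    errors = []
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        errors.append("missing frontmatter")
    keys = {line.split(":", 1)[0].strip() for line in lines if ":" in line}
    for required in ("name", "description"):
        if required not in keys:
            errors.append(f"missing field: {required}")
    return {"ok": not errors, "errors": errors}


# Hypothetical fixtures: one skill that should pass, one that should fail.
KNOWN_GOOD = "---\nname: demo\ndescription: Use when testing.\n---\n# Demo\n"
KNOWN_BAD = "# Demo\nNo frontmatter here.\n"


def run_smoke() -> list:
    """Return a list of smoke failures; an empty list means all checks held."""
    failures = []
    good = validate_skill(KNOWN_GOOD)
    bad = validate_skill(KNOWN_BAD)
    if not good["ok"]:
        failures.append("known-good skill was rejected")
    if bad["ok"]:
        failures.append("known-bad skill was accepted")
    for result in (good, bad):
        if set(result) != {"ok", "errors"}:
            failures.append("unexpected output shape")
    return failures


if __name__ == "__main__":
    problems = run_smoke()
    print("smoke OK" if not problems else "\n".join(problems))
```

The point of the pattern is that the checker itself is under test: a validator that silently accepts everything would pass the first check and be caught by the second.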
v2: Anthropic's eval engine meets architecture-first design
Anthropic updated their skill-creator with serious eval infrastructure. We took the best of it:
From Anthropic's skill-creator:
- 3 specialized agents: grader (assertion checking + claim extraction), comparator (blind A/B testing), analyzer (post-hoc root cause analysis)
- Parallel eval execution with isolated contexts (no cross-contamination)
- Automated description optimization with train/test split (60/40)
- Benchmark tracking: pass rate, tokens, time with variance analysis
- HTML eval viewer with qualitative + quantitative tabs
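As a rough illustration of two items in this list, the sketch below shows a 60/40 train/test split and a pass-rate summary with variance. It is not Anthropic's implementation; the function names, the fixed seed, and the run-record shape (`passed`, `total`) are assumptions for the example.

```python
import random
import statistics


def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle deterministically, then cut into train and held-out test sets."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def summarize(runs):
    """Mean pass rate and its spread across repeated benchmark runs."""
    rates = [run["passed"] / run["total"] for run in runs]
    return {
        "mean_pass_rate": statistics.mean(rates),
        "variance": statistics.pvariance(rates),
        "stdev": statistics.pstdev(rates),
    }


if __name__ == "__main__":
    queries = [f"query-{i}" for i in range(10)]
    train, test = split_queries(queries)
    print(len(train), "train /", len(test), "test")
    print(summarize([{"passed": 8, "total": 10}, {"passed": 6, "total": 10}]))
```

Optimizing a description against the train portion and scoring it on the held-out portion is what keeps the reported trigger accuracy from being an artifact of overfitting to the tuning queries.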
What Conductor adds on top:
- Architecture before code. 5 patterns (Sequential, Iterative, Context-Aware, Domain Intelligence, Multi-MCP) with selection criteria. Pick wrong = rewrite everything later
- Degrees of freedom. Low (deterministic scripts) → Medium (pseudocode) → High (free text). Match freedom to risk tolerance
- TDD RED before writing. Verify the agent fails WITHOUT the skill first. If it already handles the task — you don't need a skill
- Quality scoring with a gate (now BinEval — see v3.0.0). Numbers and evidence, not a "vibe check"
- Skill categorization. Capability uplift (teaching something new) vs Encoded preference (sequencing known abilities). Different skills need different testing strategies
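The TDD RED step in the list above reduces to a simple decision, sketched here under stated assumptions: `baseline` maps hypothetical assertion names to whether the agent satisfied them without the skill loaded. How those results are collected is outside the sketch, and `red_phase` is an illustrative name, not a function from the repo.

```python
def red_phase(baseline):
    """Decide whether a skill is warranted from a no-skill baseline run.

    baseline: dict mapping assertion name -> True if the agent passed it
    WITHOUT the skill. If nothing fails, there is no RED state to fix.
    """
    failing = sorted(name for name, ok in baseline.items() if not ok)
    return {"skill_needed": bool(failing), "target_assertions": failing}


if __name__ == "__main__":
    # Invented baseline results for illustration.
    without_skill = {
        "writes_changelog_entry": True,
        "bumps_version_in_manifest": False,
        "tags_release": False,
    }
    print(red_phase(without_skill))
    print(red_phase({"writes_changelog_entry": True}))
```

The failing assertions double as the acceptance criteria for the skill: once it is written, the same assertions should flip to passing.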
Synthesized from