# skill-optimizer

Audits and optimizes Agent Skills (SKILL.md files) across 8 dimensions using session transcripts and static analysis, prioritizing P0-P2 fixes for better triggering.

Install: `npx claudepluginhub hqhq1025/skill-optimizer`. This skill uses the workspace's default tool permissions.
- **Read-only**: never modify skill files; only output the report.
Analyze skills using historical session data + static quality checks, output a diagnostic report with P0/P1/P2 prioritized fixes. Scores each skill on a 5-point composite scale across 8 dimensions.
CSO (Claude/Agent Search Optimization) = writing skill descriptions so agents select the right skill at the right time. This skill checks for CSO violations.
- `/optimize-skill` → scan all skills
- `/optimize-skill my-skill` → single skill
- `/optimize-skill skill-a skill-b` → multiple specified skills

Auto-detect the current agent platform and scan the corresponding paths:
| Source | Claude Code | Codex | Shared |
|---|---|---|---|
| Session transcripts | `~/.claude/projects/**/*.jsonl` | `~/.codex/sessions/**/*.jsonl` | — |
| Skill files | `~/.claude/skills/*/SKILL.md` | `~/.codex/skills/*/SKILL.md` | `~/.agents/skills/*/SKILL.md` |
Platform detection: Check which directories exist. Scan all available sources — a user may have both Claude Code and Codex installed.
Identify target skills
↓
Collect session data (python3 scripts scan JSONL transcripts)
↓
Run 8 analysis dimensions
↓
Compute composite scores
↓
Output report with P0/P1/P2
Scan skill directories in order: `~/.claude/skills/`, `~/.codex/skills/`, `~/.agents/skills/`. Deduplicate by skill name (the same name in multiple locations = the same skill). For each skill, read SKILL.md and extract its frontmatter (name, description) and body content.
If the user specified skill names, filter to only those.
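The discovery and dedup steps above can be sketched as follows. A minimal illustration: `discover_skills` is this sketch's own name, and the directory list is taken directly from the paths table above — adjust if your layout differs.

```python
from pathlib import Path

# Scan locations in priority order; on a name collision the first hit wins.
SKILL_DIRS = [
    Path.home() / ".claude" / "skills",
    Path.home() / ".codex" / "skills",
    Path.home() / ".agents" / "skills",
]

def discover_skills(requested=None):
    """Return {skill_name: path to SKILL.md}, deduplicated by name."""
    skills = {}
    for root in SKILL_DIRS:
        if not root.is_dir():
            continue  # platform not installed -- skip
        for skill_md in sorted(root.glob("*/SKILL.md")):
            skills.setdefault(skill_md.parent.name, skill_md)  # keep first
    if requested:  # user specified skill names -> filter to those
        skills = {n: p for n, p in skills.items() if n in requested}
    return skills
```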
Use python3 scripts via Bash to scan session JSONL files. Extract:

Claude Code sessions (`~/.claude/projects/**/*.jsonl`):

- `Skill` tool_use calls (which skills were invoked)

Codex sessions (`~/.codex/sessions/**/*.jsonl`):

- `session_meta` events → extract `base_instructions` for skill loading evidence
- `response_item` events → assistant outputs (workflow tracking)
- `event_msg` events → tool execution and skill-related events
- `turn_context` events (for reaction analysis)

Note: Codex injects skills via context rather than explicit Skill tool calls. Skill loading (presence in `base_instructions`) does NOT equal active invocation. To detect actual use, search for skill-specific workflow markers (step headers, output formats) in `response_item` content within that session. A skill is "invoked" only if the agent produced output following the skill's defined workflow.
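The loading-vs-invocation distinction can be sketched like this. The event type names come from the source; the exact payload shape is an assumption, so marker matching is done on the serialized event rather than a specific field:

```python
import json

def codex_skill_invoked(session_path, markers):
    """True only if the agent's output follows the skill's workflow
    markers (step headers, output formats) -- loading alone is not use."""
    loaded = invoked = False
    with open(session_path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            blob = json.dumps(event)
            if event.get("type") == "session_meta" and any(m in blob for m in markers):
                loaded = True   # skill text present in base_instructions
            elif event.get("type") == "response_item" and any(m in blob for m in markers):
                invoked = True  # agent actually produced workflow output
    return invoked  # loaded-but-not-invoked deliberately counts as False
```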
Aggregated:
You MUST run ALL 8 dimensions. The baseline behavior without this skill is to skip dimensions 4.2, 4.3, 4.5b, and 4.8. These are the most valuable dimensions — do not skip them.
Count how many times each skill was actually invoked vs how many times its trigger keywords appeared in user messages.
Claude Code: count Skill tool_use calls in transcripts.
Codex: count sessions where the agent produced output following the skill's workflow markers (not merely loaded in context).
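For Claude Code, the invoked-vs-mentioned comparison can be sketched as below. The JSONL record shape (`{"type": "assistant"|"user", "message": {...}}` with `tool_use` content blocks) is an assumption about the transcript format; adjust the field access if yours differs.

```python
import json
from pathlib import Path

def trigger_stats(transcript_dir, skill_name, keywords):
    """Count actual Skill tool_use invocations vs. user messages
    mentioning the skill's trigger keywords."""
    invocations = keyword_hits = 0
    for path in Path(transcript_dir).glob("**/*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if rec.get("type") == "assistant":
                for block in rec.get("message", {}).get("content", []):
                    if (isinstance(block, dict)
                            and block.get("type") == "tool_use"
                            and block.get("name") == "Skill"
                            and skill_name in json.dumps(block.get("input", {}))):
                        invocations += 1
            elif rec.get("type") == "user":
                text = json.dumps(rec.get("message", "")).lower()
                if any(k.lower() in text for k in keywords):
                    keyword_hits += 1
    return invocations, keyword_hits
```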
Diagnose:
This dimension is critical and easy to skip. Do not skip it.
After a skill is invoked in a session, read the user's next 3 messages. Classify:
Report per-skill satisfaction rate.
This dimension is critical and easy to skip. Do not skip it.
For each skill invocation found in session data:
Report: `{skill-name} (N steps): avg completed Step X/N (Y%)`
If a specific step is frequently where execution stops, flag it.
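The report line and the stall flag can be computed as below. A sketch under assumptions: the inputs (the last step reached per invocation) and the 50% stall threshold are this example's choices, not fixed by the source.

```python
from collections import Counter

def completion_report(skill_name, total_steps, completed_per_session):
    """Format the per-skill line; completed_per_session holds the last
    step reached in each observed invocation."""
    if not completed_per_session:
        return f"{skill_name}: no invocations (N/A)"
    avg = sum(completed_per_session) / len(completed_per_session)
    pct = round(100 * avg / total_steps)
    line = (f"{skill_name} ({total_steps} steps): "
            f"avg completed Step {avg:.1f}/{total_steps} ({pct}%)")
    # Flag a specific step where at least half the runs stopped short.
    stalls = Counter(s for s in completed_per_session if s < total_steps)
    if stalls:
        step, count = stalls.most_common(1)[0]
        if count / len(completed_per_session) >= 0.5:
            line += f" -- frequent stall after Step {step}"
    return line
```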
Check each SKILL.md against these 14 rules:
| Check | Pass Criteria |
|---|---|
| Frontmatter format | Only name + description, total < 1024 chars |
| Name format | Letters, numbers, hyphens only |
| Description trigger | Starts with "Use when..." or has explicit trigger conditions |
| Description workflow leak | Description does NOT summarize the skill's workflow steps (CSO violation) |
| Description pushiness | Description actively claims the scenarios where the skill should be used, rather than passively listing capabilities |
| Overview section | Present |
| Rules section | Present |
| MUST/NEVER density | Count ALL-CAPS directive words; >5 per 100 words = flag. Note: Meincke et al. (2025) found persuasion directives have inconsistent effects across models. Suggest converting to concrete bright-line rules with rationale, not mere emphasis. |
| Word count | < 500 words (flag if over) |
| Narrative anti-pattern | No "In session X, we found..." storytelling — skills should be instructions, not post-hoc reports |
| YAML quoting safety | A `description` containing `:` must be wrapped in double quotes; otherwise a YAML parse failure makes the skill invisible |
| Critical info position | Core trigger conditions and primary actions must be in the first 20% of SKILL.md, not buried in the middle (Lost in the Middle, Liu et al. TACL 2024: U-shaped attention curve) |
| Description 250-char check | Primary trigger keywords must appear within the first 250 characters of description (skill listing truncation point in most agents) |
| Trigger condition count | ≤ 2 trigger conditions in description is ideal; consistent with IFEval (Zhou et al. 2023) finding that LLMs struggle with multi-constraint prompts |
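A few of the 14 checks can be expressed as simple predicates. This is an illustrative subset, not the full rule set; the frontmatter is assumed to be already parsed into a dict:

```python
import re

def static_checks(frontmatter, body, trigger_keywords):
    """Run a handful of the static checks; returns {check_name: passed}."""
    name = frontmatter.get("name", "")
    desc = frontmatter.get("description", "")
    results = {
        # Only name + description, total under 1024 chars.
        "frontmatter_format": set(frontmatter) <= {"name", "description"}
                              and len(name) + len(desc) < 1024,
        # Letters, numbers, hyphens only.
        "name_format": bool(re.fullmatch(r"[A-Za-z0-9-]+", name)),
        "word_count": len(body.split()) < 500,
        # Primary trigger keywords must land in the first 250 chars.
        "desc_250_char": all(k.lower() in desc[:250].lower()
                             for k in trigger_keywords),
    }
    # MUST/NEVER density: >5 all-caps directives per 100 words is flagged.
    words = body.split()
    caps = sum(1 for w in words if w.strip(".,:;") in {"MUST", "NEVER", "ALWAYS"})
    results["directive_density"] = not words or caps / len(words) * 100 <= 5
    return results
```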
Overtrigger: the skill was invoked but the user immediately rejected or ignored it.
This is the highest-value dimension. Memento-Skills (arXiv:2603.18743) demonstrates that skills stored as structured files require accurate retrieval/routing to be effective — skills that are never retrieved cannot improve through their read-write learning loop, making undertriggering a compounding problem.
For each skill, extract its capability keywords (not just trigger keywords — what the skill CAN do). Then scan user messages for tasks that match those capabilities but where the skill was NOT invoked.
Example: user says "run these tasks in parallel" but parallel-runner was not triggered → undertrigger.
Report: which user messages SHOULD have triggered the skill but didn't, and suggest description improvements.
Compounding Risk Assessment: For skills with chronic undertriggering (0 triggers across 5+ sessions where relevant tasks appeared), flag as "compounding risk" — undertriggered skills cannot self-improve through usage feedback, causing the gap to widen over time. Recommend immediate description rewrite as P0.
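The undertrigger scan and the compounding-risk flag above can be sketched as below. The session tuple shape and both function names are this sketch's own; the 5-session threshold comes from the source.

```python
def find_undertriggers(skill_capabilities, sessions):
    """sessions: list of (user_messages, skill_was_invoked) tuples.
    Return user messages that match the skill's capability keywords
    in sessions where the skill was never invoked."""
    misses = []
    for messages, invoked in sessions:
        if invoked:
            continue  # skill fired -- not an undertrigger
        for msg in messages:
            if any(cap.lower() in msg.lower() for cap in skill_capabilities):
                misses.append(msg)
    return misses

def compounding_risk(misses, sessions_with_relevant_tasks):
    """P0 flag: chronic undertriggering across 5+ relevant sessions."""
    return len(misses) > 0 and sessions_with_relevant_tasks >= 5
```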
Compare all skill pairs:
For each skill, extract referenced file paths and shell commands, then verify them (paths with `test -e`, commands with `which`). Flag any broken references.
This dimension is critical and easy to skip. Do not skip it.
For each skill:
Progressive Disclosure Tier Check: Evaluate each skill against the 3-tier loading model (Agent Skills spec): metadata (name + description) is always in context; the SKILL.md body loads when the skill triggers; bundled reference files load only on demand.
Flag skills that put 500+ words in SKILL.md without using reference files as "poor progressive disclosure".
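The flag above reduces to one predicate. A sketch, assuming any non-SKILL.md file in the skill directory counts as a reference/resource file:

```python
from pathlib import Path

def progressive_disclosure_ok(skill_dir):
    """False for skills packing 500+ words into SKILL.md with no
    reference files to defer detail into."""
    skill_dir = Path(skill_dir)
    body = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    word_count = len(body.split())
    # Any sibling file besides SKILL.md counts as a reference/resource.
    has_refs = any(p.name != "SKILL.md"
                   for p in skill_dir.rglob("*") if p.is_file())
    return word_count < 500 or has_refs
```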
Rate each skill on a 5-point scale:
| Score | Meaning |
|---|---|
| 5 | Healthy: high trigger rate, positive reactions, complete workflows, clean static |
| 4 | Good: minor issues in 1-2 dimensions |
| 3 | Needs attention: significant gap in 1 dimension or minor gaps in 3+ |
| 2 | Problematic: never triggered, or negative user reactions, or major static issues |
| 1 | Broken: doesn't work, references missing, or fundamentally misaligned |
Scored dimensions (weighted average):
Qualitative dimensions (reported but not scored — no reliable numeric metric):
(If a scored dimension has no data — e.g., skill was never invoked so no user reaction — mark as "N/A" and redistribute weight.)
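The N/A redistribution rule amounts to a weighted average that drops missing dimensions and renormalizes. The dimension names and weights in the example are illustrative; the source does not fix them.

```python
def composite_score(dimension_scores, weights):
    """Weighted average on the 5-point scale. Dimensions with no data
    (None) are dropped and their weight redistributed proportionally."""
    scored = {d: s for d, s in dimension_scores.items() if s is not None}
    if not scored:
        return None  # nothing to score
    total_w = sum(weights[d] for d in scored)
    return round(sum(s * weights[d] for d, s in scored.items()) / total_w, 1)
```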
# Skill Optimization Report
**Date**: {date}
**Scope**: {all / specified skills}
**Session data**: {N} sessions, {date range}
## Overview
| Skill | Triggers | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|----------|----------|------------|--------|--------------|-------|-------|
| example-skill | 2 | 100% | 86% | B+ | 1 miss | 486w | 4/5 |
## P0 Fixes (blocking usage)
1. ...
## P1 Improvements (better experience)
1. ...
## P2 Optional Optimizations
1. ...
## Per-Skill Diagnostics
### {skill-name}
#### 4.1 Trigger Rate
...
#### 4.2 User Reaction
...
(all 8 dimensions)
The analysis dimensions in this report are grounded in the following research: