Analyzes the drupal-workflow plugin's autopilot effectiveness from session data and proposes tunings to policies, thresholds, and classifiers. Use when interventions are too noisy, too quiet, ignored, or misclassified.
npx claudepluginhub gkastanis/drupal-workflow --plugin drupal-workflow

This skill uses the workspace's default tool permissions.
Closes the feedback loop on the drupal-workflow plugin. Reads real session data, identifies what's working and what isn't, and proposes targeted changes to policies, thresholds, classifier keywords, and escalation behavior.
The plugin generates data every session (intervention logs, outcome archives, state snapshots). This skill turns that data into actionable improvements instead of letting it sit unread.
Run the diagnostic script to gather all session data into a single report:
python3 "${CLAUDE_PLUGIN_ROOT}/skills/autopilot-tuner/scripts/diagnose.py" --json
This outputs a structured JSON report with four sections: summary, acceptance, outcomes, and proposals. The full schema appears at the end of this document.
If the data is sparse (< 5 sessions), say so. Recommendations from small samples are directional, not definitive.
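To make that check mechanical, gate on the sample size before reading the rest of the report. A minimal sketch, assuming jq is available and diagnose.py emits the schema shown at the end of this document:

# Save the report, then fall back to directional-only mode on small samples
report="$(python3 "${CLAUDE_PLUGIN_ROOT}/skills/autopilot-tuner/scripts/diagnose.py" --json)"
sessions="$(echo "$report" | jq '.summary.sessions_analyzed')"
if [ "$sessions" -lt 5 ]; then
  echo "Only $sessions sessions analyzed: treat all findings as directional."
fi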
Also run the compare mode if Phase 2 is deployed:
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/session-analysis/analyze-replays.py" --compare --json
The diagnostic script gives you numbers. Numbers alone miss structural issues. Before diagnosing, skim these three files to understand what the data means:
# The monitor — where interventions fire, what conditions trigger them, what tool types are checked
Read ${CLAUDE_PLUGIN_ROOT}/scripts/autopilot-monitor.sh
# The classifier — keyword priority order, what falls through to default
Read ${CLAUDE_PLUGIN_ROOT}/scripts/task-classifier.sh
# Check for competing hooks that might duplicate or conflict with the monitor
grep -l "skill\|nudge\|workflow" ${CLAUDE_PLUGIN_ROOT}/scripts/*.sh
Look for things the data can't tell you: rules that can never fire, keywords that shadow later ones in the priority order, defaults that swallow everything unmatched, and hooks that duplicate or contradict the monitor.
For broad health checks, also scan hooks/hooks.json and the policy files under scripts/policies/.
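To make keyword shadowing and default fall-through concrete, here is a hypothetical sketch of the shape task-classifier.sh might take. The real script will differ; the failure modes are the same:

# Hypothetical classifier shape: first match wins, so order is priority.
# A prompt like "fix the build by implementing X" matches *fix* before
# *implement* and is shadowed into debugging; anything unmatched falls
# through to the default branch.
classify_task() {
  local prompt="$1"
  case "$prompt" in
    *fix*|*debug*|*error*)     echo "debugging" ;;
    *implement*|*build*|*add*) echo "implementation" ;;
    *update*|*bump*|*rename*)  echo "maintenance" ;;
    *)                         echo "maintenance" ;;  # default: is this the right catch-all?
  esac
}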
Read the diagnostic output and your code investigation together. Present the highest-impact findings to the user in priority order — focus on what matters most, not an exhaustive dump.
Five questions to answer:
Are interventions being followed? Look at acceptance rates. Below 30% means the intervention is noise — agents have learned to ignore it. Above 70% means it's working well (a jq sketch for this check follows these questions).
Does escalation help? Compare acceptance at level 1 (hint) vs level 2 (command). If level 2 doesn't improve compliance, the problem isn't volume — it's the message content or the rule itself.
Are tasks classified correctly? If "maintenance" sessions routinely exceed 10 edits or need delegation, they were probably implementation tasks that got misclassified. If "implementation" sessions finish in 3 edits, they were probably maintenance.
Do plans actually help? Compare outcomes of sessions with plans vs without. If planned sessions don't end better (lower drift, more verification), then requiring plans is adding friction without value.
Are thresholds calibrated? If plan_missing fires at edit 3 but agents only comply at edit 8+, the threshold is too eager. Move it closer to where compliance actually happens.
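For the acceptance question, triage can start from a jq pass over the report. A sketch, assuming the report was saved to report.json in the schema shown at the end of this document; the 0.30 cutoff mirrors the noise line above:

# Flag rules whose acceptance rate marks them as noise (< 30%)
jq -r '.acceptance | to_entries[]
  | select(.value.rate < 0.30)
  | "\(.key): \(.value.accepted)/\(.value.fires) accepted, avg escalation level \(.value.avg_level)"' report.json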
Number every proposal as [P1], [P2], etc. for easy reference. For each one, state the change, the evidence behind it, the expected effect, and the risk.
Types of changes this skill can propose:
| Category | Files affected | Example |
|---|---|---|
| Policy thresholds | scripts/policies/*.json | Lower min_skills from 2 to 1 (sketch below) |
| Classifier keywords | scripts/task-classifier.sh | Add "remove" to debugging keywords |
| Escalation behavior | scripts/autopilot-monitor.sh | Change edit threshold from 3 to 5 |
| Intervention text | scripts/autopilot-monitor.sh | Reword hint to be more specific |
| Phase budgets | scripts/policies/*.json | Increase implementing max_turns |
| Structural | scripts/*.sh, hooks/hooks.json | Remove competing hook, consolidate logic |
| Classifier logic | scripts/task-classifier.sh | Add post-classification reclassification heuristic |
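For the first row, the mechanical edit is small enough to sketch. This assumes min_skills is a top-level field of the policy file, as in the [P1] example in the schema below:

# Lower min_skills from 2 to 1 in the implementation policy, writing via a temp file
policy="${CLAUDE_PLUGIN_ROOT}/scripts/policies/implementation.json"
jq '.min_skills = 1' "$policy" > "$policy.tmp" && mv "$policy.tmp" "$policy"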
Think laterally, not just parametrically. Threshold tuning is the obvious lever, but it's not always the right one. Consider rewording the intervention text, adding a reclassification heuristic, consolidating competing hooks, or removing a rule outright.
The best proposals change the structure, not just the numbers.
Present each proposal to the user. For each one, show the exact change, the supporting evidence, and the risk, then wait for an explicit accept or reject before editing anything.
Never auto-apply. The user knows their workflow better than the data does.
After applying changes, remind the user to run analyze-replays.py --compare after a few more sessions to see the before/after effect. Then append a summary to the plugin's changelog:
echo "$(date -Iseconds) | autopilot-tuner | <summary of changes>" >> "${CLAUDE_PLUGIN_ROOT}/CHANGELOG.log"
This creates an audit trail so future tuning runs can see what was tried before.
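Future runs can read that trail before proposing something that was already tried:

# What did previous tuning runs change?
grep "autopilot-tuner" "${CLAUDE_PLUGIN_ROOT}/CHANGELOG.log" | tail -5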
The diagnose.py script outputs JSON with these sections:
{
"summary": {
"sessions_analyzed": 12,
"date_range": ["2026-04-10", "2026-04-14"],
"total_interventions": 34,
"total_outcomes": 10
},
"acceptance": {
"plan_missing": {"fires": 8, "accepted": 3, "rate": 0.375, "avg_level": 1.5},
...
},
"outcomes": {
"avg_drift": 0.35,
"pct_verified": 0.40,
"pct_planned": 0.60,
"by_task_type": { "implementation": {...}, "maintenance": {...} }
},
"proposals": [
{
"id": "P1",
"category": "threshold",
"target": "scripts/policies/implementation.json",
"field": "min_skills",
"old_value": 2,
"new_value": 1,
"reason": "skill_suggest has 0% acceptance across 11 fires — agents consistently ignore this. Reducing to 1 means the first skill invocation clears the requirement.",
"confidence": "high",
"risk": "Agents may consult fewer skills, reducing code quality."
}
]
}
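To turn that proposals array into the numbered list shown to the user, a one-liner like this works, again assuming the report was saved to report.json:

# Render each proposal as a one-line summary for review
jq -r '.proposals[]
  | "[\(.id)] \(.target) \(.field): \(.old_value) -> \(.new_value) (confidence: \(.confidence))"' report.json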
Some things should stay fixed regardless of data. Chief among them is the approval step: never propose auto-applying changes, no matter how strong the numbers look.
With fewer than 5 sessions, frame every finding as directional rather than definitive, and hold structural proposals until more data accumulates.