Evaluates MCP tool schemas via static analysis, tests Claude's tool selection accuracy against realistic user intents, and iteratively optimizes descriptions to measure and improve MCP server tool usage.
```
npx claudepluginhub joshuarweaver/cascade-code-general-misc-1 --plugin pproenca-dot-skills-1
```

This skill uses the workspace's default tool permissions.
Tool descriptions are prompt engineering — they land directly in Claude's context window and determine whether Claude picks the right tool with the right arguments. This skill makes tool quality **measurable and improvable** instead of guesswork.
Three levels of testing, each building on the last. Use this skill after build-mcp-server when the user wants to validate quality.

```
Phase 1: Connect → Phase 2: Static Analysis → Phase 3: Selection Testing → Phase 4: Optimize
   ↑_________________________________________________________________________________|
```

Phase 4 loops back: apply rewrites → refetch schemas → retest → compare accuracy.
Connect to the user's MCP server and fetch the tool schemas. This requires npx (for the MCP Inspector CLI, which issues tools/list); use build-mcp-server/scripts/test-server.sh to verify connectivity first.
Ask the user how to reach their server:
- A URL for HTTP transport (e.g., http://localhost:3000/mcp)
- A launch command for stdio transport (e.g., node dist/server.js)

Then fetch the schemas:

```bash
bash scripts/fetch-tools.sh <url-or-command> <transport> <workspace>/tools.json
```
This calls tools/list via the MCP Inspector CLI and saves the schemas.
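Under the hood this is a plain Inspector CLI call; a minimal sketch for a stdio server (server command and output path are illustrative):

```bash
# Roughly what fetch-tools.sh wraps for a stdio server:
npx @modelcontextprotocol/inspector --cli node dist/server.js \
  --method tools/list > my-server-eval/tools.json
```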
Show a summary table:
| # | Tool | Description (preview) | Params | Annotations |
|---|------|-----------------------|--------|-------------|
| 1 | search_issues | Search issues by keyword... | 3 | readOnlyHint |
| 2 | create_issue | Create a new issue... | 4 | — |
Flag tool count: 1-15 optimal, 15-30 warning, 30+ excessive (consider search+execute pattern).
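A quick sanity check on the saved file, assuming tools.json keeps the raw tools/list response shape with a top-level tools array:

```bash
# Count tools in the fetched schemas (assumes a top-level "tools" array):
jq '.tools | length' my-server-eval/tools.json
```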
Create workspace at {server-name}-eval/ adjacent to the skill directory or in the user's project:

```
{server-name}-eval/
├── tools.json
├── evals/
│   └── evals.json
└── iteration-N/
```
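Scaffolding it takes two directories (server name illustrative):

```bash
# Create the eval workspace for the first iteration:
mkdir -p my-server-eval/evals my-server-eval/iteration-1
```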
Run deterministic quality checks — no Claude calls needed. This gives immediate feedback during development.
```bash
bash scripts/analyze-schemas.sh <workspace>/tools.json <workspace>/iteration-N/static-analysis.json
```
Show per-tool quality scores. Read references/quality-checklist.md for the criteria being checked.
| Tool | Desc | Params | Schema | Annotations | Overall | Issues |
|------|------|--------|--------|-------------|---------|--------|
| search_issues | 3/3 | 3/3 | 2/3 | 2/3 | 2.5 | No negation |
| create_issue | 1/3 | 1/3 | 0/3 | 0/3 | 0.5 | 4 issues |
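To pull the worst offenders out of the raw report you can filter it with jq; note the field names below (.tools, .overall, .issues) are assumptions about static-analysis.json, not a documented schema:

```bash
# List tools scoring below 2/3 overall (field names are assumed):
jq -r '.tools[] | select(.overall < 2) | "\(.name): \(.issues | join("; "))"' \
  my-server-eval/iteration-1/static-analysis.json
```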
If the analysis found tools with high description overlap, highlight them as confusion risks:
### Sibling Pairs (confusion risk)
| Tool A | Tool B | Overlap | Risk |
|--------|--------|---------|------|
| search_issues | list_issues | 52% | HIGH |
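One cheap way to approximate that overlap figure is shared unique words relative to the smaller description's word count; the analyzer's actual metric may differ:

```bash
# Token-overlap sketch: count unique words shared by two descriptions.
overlap() {
  comm -12 <(tr -s ' ' '\n' <<<"$1" | sort -u) \
           <(tr -s ' ' '\n' <<<"$2" | sort -u) | wc -l
}
overlap "Search issues by keyword" "List issues in the project"  # prints 1 ("issues")
# Divide by the smaller description's unique-word count to get a percentage.
```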
If critical issues exist (missing descriptions, zero annotations), recommend fixing them before Phase 3. Static issues create noise in selection testing — fix the obvious problems first, then measure the subtle ones.
If all tools score well, proceed to Phase 3.
Test whether Claude picks the right tool for each user intent. This is the core eval.
Read references/eval-patterns.md for intent generation patterns.
For each tool, generate:

- should_trigger intents: explicit phrasings that name the action, plus implicit ones that only describe the goal
- should_not_trigger intents: plausible requests this tool must not claim (out of scope, or belonging to a sibling)

For each sibling pair flagged in Phase 2:

- discriminator intents that clearly belong to one sibling and not the other
Present all intents to the user for review. Ask if any should be added, removed, or modified.
Save to {workspace}/evals/evals.json:
```json
{
  "server_name": "my-server",
  "generated_from": "tools.json",
  "intents": [
    {
      "id": 1,
      "intent": "Are there any open bugs related to checkout?",
      "expected_tool": "search_issues",
      "type": "should_trigger",
      "target_tool": "search_issues",
      "notes": "Implicit intent — doesn't name the action"
    }
  ]
}
```
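Negative cases drive the false accept and false reject counts in Phase 3, so include entries like this should_not_trigger sketch (values illustrative, mirroring the schema above):

```json
{
  "id": 2,
  "intent": "What's the weather like in Lisbon today?",
  "expected_tool": null,
  "type": "should_not_trigger",
  "target_tool": "search_issues",
  "notes": "Out of scope: the correct answer is no tool at all"
}
```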
For each intent, spawn a subagent that receives only the tool schemas and the intent text, never the expected answer.

The subagent prompt:

```
You have access to the following MCP tools:

{tool schemas as JSON}

A user sends this message:

"{intent text}"

Which tool would you call? Respond with JSON:
{
  "selected_tool": "tool_name" or null,
  "arguments": { ... } or {},
  "reasoning": "One sentence explaining your choice"
}

If no tool fits the user's request, set selected_tool to null.
Select exactly ONE tool. Do not suggest calling multiple tools.
```
Save each result to {workspace}/iteration-N/selection/intent-{ID}/result.json.
Launch all selection tests in parallel for efficiency.
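For intent 1 above, a correct result.json might look like this (the argument names depend on the tool's real input schema):

```json
{
  "selected_tool": "search_issues",
  "arguments": { "query": "checkout" },
  "reasoning": "The user is looking for existing bug reports matching a keyword."
}
```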
```bash
bash scripts/grade-selection.sh \
  <workspace>/iteration-N/selection \
  <workspace>/evals/evals.json \
  <workspace>/iteration-N/benchmark.json
```
## Selection Results — Iteration N
**Accuracy:** 82% (41/50 correct)
| Metric | Count |
|--------|-------|
| Correct | 41 |
| Wrong tool | 5 |
| False accept | 2 |
| False reject | 2 |
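The four buckets come from comparing each result's selected_tool to the eval's expected_tool. A sketch of that classification in shell (grade-selection.sh's actual internals may differ):

```bash
# Classify one intent's outcome.
expected="search_issues"   # from evals.json; "null" for should_not_trigger intents
selected=$(jq -r '.selected_tool // "null"' intent-1/result.json)

if   [ "$selected" = "$expected" ]; then echo "correct"
elif [ "$expected" = "null" ];     then echo "false accept"   # picked a tool, none expected
elif [ "$selected" = "null" ];     then echo "false reject"   # picked nothing, tool expected
else                                    echo "wrong tool"
fi
```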
### Per-Tool Accuracy
| Tool | Precision | Recall |
|------|-----------|--------|
| search_issues | 0.90 | 0.85 |
| create_issue | 1.00 | 1.00 |
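Precision here is correct selections divided by all times the tool was selected; recall is correct selections divided by all intents that expected it. An over-eager description keeps recall high while precision drops.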
### Worst Confusions
| Expected | Selected Instead | Times |
|----------|-----------------|-------|
| list_issues | search_issues | 3 |
| get_user | find_user_by_email | 2 |
Analyze confusion patterns and suggest description improvements. Read references/optimization.md for rewrite patterns.
For each confused pair (from worst_confusions), draft a before/after description rewrite and present it like this:
## Suggested Improvements
### search_issues ↔ list_issues (confused 3 times)
**search_issues — Before:**
> Search issues by keyword.
**search_issues — After:**
> Search issues by keyword across title and body. Returns up to `limit` results ranked by relevance. Does NOT filter by status, assignee, or date — use list_issues for structured filtering.
**Reason:** Adding scope boundary and cross-reference to disambiguate from list_issues.
Save to {workspace}/iteration-N/suggestions.json (format defined in optimization.md).
After the user applies the rewrites to their server code:
Refetch the schemas and rerun the selection tests into iteration-N+1 using the same evals.json, then compare:

## Iteration Comparison
| Metric | Iteration 1 | Iteration 2 | Delta |
|--------|------------|------------|-------|
| Accuracy | 82% | 94% | +12% |
| search↔list confusion | 3 | 0 | -3 |
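If benchmark.json exposes a top-level accuracy field (an assumption here), the delta can be computed directly:

```bash
# Accuracy delta between two iterations (assumes benchmark.json has .accuracy):
jq -n --slurpfile a iteration-1/benchmark.json --slurpfile b iteration-2/benchmark.json \
  '{before: $a[0].accuracy, after: $b[0].accuracy, delta: ($b[0].accuracy - $a[0].accuracy)}'
```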
Read these when you reach the relevant phase — not upfront:
- references/quality-checklist.md — Testable quality criteria for tool schemas (Phase 2)
- references/eval-patterns.md — How to write tool selection test intents (Phase 3)
- references/optimization.md — How to improve descriptions from eval results (Phase 4)

Related skills:

- build-mcp-server — Design and scaffold MCP servers (run this first, then eval-mcp to validate)
- build-mcp-app — MCP servers with interactive UI widgets