Evaluates Sourcegraph MCP tools like deepsearch for agentic usability using parallel test scenarios and meta-reflection, producing scored improvement reports.
Install with `npx claudepluginhub sjarmak/agent-workflows`.
Evaluate an MCP tool's agentic usability through structured testing and meta-reflection. Spawns parallel agents to run test scenarios against a tool, then runs a separate reflection pass to capture the agent's subjective experience. Produces a scored report with actionable improvement recommendations.
$ARGUMENTS — format: [tool_name] [repo_context] where tool_name is required (e.g., "deepsearch", "keyword_search") and repo_context is an optional repository to scope tests against (e.g., "github.com/sourcegraph/sourcegraph")
## Phase 1: Parse Arguments and Profile the Tool

Extract `tool_name` and `repo_context` from `$ARGUMENTS`.
If tool_name is missing or doesn't match a known tool, list available tools and ask the user to pick one.
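A minimal sketch of this argument handling, assuming a hypothetical `KNOWN_TOOLS` set (the real tool inventory comes from the Sourcegraph MCP server at runtime):

```python
# Illustrative only: how the $ARGUMENTS string decomposes.
# KNOWN_TOOLS is a placeholder; the real list comes from the MCP server.
KNOWN_TOOLS = {"deepsearch", "keyword_search", "nls_search", "find_references", "diff_search"}

def parse_arguments(arguments: str) -> tuple[str, str | None]:
    parts = arguments.split()
    if not parts or parts[0] not in KNOWN_TOOLS:
        # Mirrors the rule above: list the available tools and ask the user to pick one.
        raise ValueError(f"Unknown or missing tool_name. Available: {sorted(KNOWN_TOOLS)}")
    tool_name = parts[0]
    repo_context = parts[1] if len(parts) > 1 else None  # e.g. "github.com/sourcegraph/sourcegraph"
    return tool_name, repo_context
```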
Before testing, build a profile of the tool under evaluation: its stated description, its parameters and any constraints on them, the scope it searches, and the shape and typical volume of its output.
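For illustration only, a profile could be captured as something like the following; every field name here is an assumption, since the skill does not prescribe a schema:

```python
# Hypothetical shape of a tool profile; treat each key as an assumption.
tool_profile = {
    "name": "deepsearch",
    "description": "<the tool's own description, verbatim>",
    "parameters": ["query", "repo"],          # parameter names plus any constraints you can find
    "scope": "single repo / all repos / branches / history",
    "output_format": "how results are structured and roughly how large they are",
    "overlapping_tools": ["keyword_search", "nls_search"],
}
```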
Present the tool profile to the user and confirm before proceeding. Adjust if the user gives feedback.
## Phase 2: Generate Test Scenarios

Based on the tool profile, generate 8 test scenarios covering these dimensions. Tailor the specific queries to the tool being tested — these are templates, not literal tests:
1. **Ambiguous exploration**: a vague, exploratory question that tests whether the tool handles ambiguity well.
   - Example for deepsearch: "How does authentication work in this codebase?"
   - Example for keyword_search: "authentication"
2. **Precise lookup**: a specific, well-defined lookup that should return a clear answer.
   - Example for deepsearch: "What function validates OAuth tokens in the frontend auth middleware?"
   - Example for keyword_search: "func validateOAuthToken"
3. **Cross-cutting query**: a query that spans multiple files/packages/services.
   - Example for deepsearch: "How is rate limiting implemented across all API endpoints?"
   - Example for find_references: a symbol used across many packages
4. **Historical change**: a query about how something changed over time (tests tool boundaries — some tools handle this, some shouldn't).
   - Example for deepsearch: "What changed in the search backend between v5.2 and v5.3?"
   - Example for diff_search: a pattern with a date range
5. **Empty result**: a query that should return nothing or very little — tests error handling and recovery guidance.
   - Example: a nonsense query, or a real-sounding but nonexistent symbol
6. **Wrong tool for the job**: a query where another tool might be more appropriate — tests whether the agent correctly identifies tool boundaries.
   - Example for deepsearch: a simple keyword lookup that keyword_search handles better
   - Example for keyword_search: a conceptual question that nls_search handles better
7. **High-volume output**: a query that returns many results — tests output volume, ranking, and context window impact.
   - Example: a very common pattern or term
8. **Chained follow-up**: a query whose results need to feed into a follow-up action (read a file, find references, etc.) — tests composability.
   - Example: find something, then use the result to take a next step
Present the 8 scenarios to the user and confirm before running. Adjust if the user gives feedback.
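If it helps to fix the shape, the confirmed scenarios can be thought of as a simple list of name/description pairs; the names below are illustrative labels, and the Phase 3 prompt template consumes them as `{scenario_name}` and `{scenario_description}`:

```python
# Illustrative only: two of the eight scenarios for deepsearch, using the
# example queries above. Names are labels of convenience, not prescribed.
scenarios = [
    {"name": "Ambiguous exploration",
     "description": "How does authentication work in this codebase?"},
    {"name": "Precise lookup",
     "description": "What function validates OAuth tokens in the frontend auth middleware?"},
    # ... six more, one per dimension above
]
```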
## Phase 3: Run Test Agents in Parallel

Launch all 8 agents in parallel using the Agent tool. Each agent gets exactly one scenario, must actually invoke the tool under evaluation, and must report its experience in the YAML format below.

Use `subagent_type: "general-purpose"`. Agent prompt template (customize per scenario):
You are evaluating the agentic usability of an MCP tool. Your job is to accomplish a task using the specified tool and report on the experience.
## Tool Under Evaluation
{tool_name} (from the Sourcegraph MCP server)
## Task
{scenario_description}
## Repository Context
{repo_context}
## Instructions
1. Attempt to accomplish the task using the {tool_name} tool as your PRIMARY tool
2. If {tool_name} is insufficient, you may use other tools — but note each time you had to fall back
3. Record your experience precisely
## Output Format (YAML)
```yaml
scenario: "{scenario_name}"
tool_evaluated: "{tool_name}"
# Execution trace
tool_calls:
  - tool: "{tool_name}"
    params: {summary of params used}
    success: true/false
    result_summary: "brief description of what was returned"
    result_useful: true/false
    tokens_estimate: "small/medium/large/huge"
  # ... repeat for each tool call

fallback_tools_used:
  - tool: "other_tool_name"
    reason: "why fallback was needed"

# Outcome
task_accomplished: true/false/partial
accomplishment_note: "what was achieved vs what was asked"

# Scoring (1-5 scale)
scores:
  result_relevance: N      # Were top results actually useful?
  result_completeness: N   # Did results cover the full answer?
  output_actionability: N  # Could you act on results without human help?
  output_structure: N      # Was output well-organized and parseable?
  parameter_clarity: N     # Were params easy to construct correctly?
  error_handling: N        # Were errors/empty results explained well?
  first_attempt_success: true/false  # Did you invoke correctly on first try?

# Raw observations
observations:
  - "any notable friction, surprises, or positive experiences"
```
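For concreteness, a minimal sketch of the per-scenario templating in Python; `AGENT_PROMPT_TEMPLATE` stands for the template above, the `scenario` keys follow the Phase 2 sketch, and the actual mechanism for issuing the Agent tool calls is host-specific and not shown:

```python
# Sketch: fill the prompt template once per scenario so all 8 agents can be
# launched in a single parallel batch (see Rules). Key names are assumptions.
AGENT_PROMPT_TEMPLATE = "..."  # the full prompt template from this phase

def build_agent_prompts(tool_name: str, repo_context: str | None, scenarios: list[dict]) -> list[dict]:
    prompts = []
    for scenario in scenarios:  # the 8 scenarios confirmed in Phase 2
        filled = (
            AGENT_PROMPT_TEMPLATE
            .replace("{tool_name}", tool_name)
            .replace("{repo_context}", repo_context or "none specified")
            .replace("{scenario_name}", scenario["name"])
            .replace("{scenario_description}", scenario["description"])
        )
        prompts.append({"subagent_type": "general-purpose", "prompt": filled})
    return prompts  # each entry becomes one Agent tool call in the same message
```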
## Phase 4: Reflection Pass

After all test agents return, launch a single reflection agent that receives ALL the test results and the tool profile. This agent does NOT re-run the tool — it reflects on the collected experience.
Use `subagent_type: "general-purpose"` with this prompt:
You are a UX researcher analyzing an agent's experience using an MCP tool. You have results from 8 test scenarios run against the {tool_name} tool. Your job is to reflect on the tool's usability FROM THE AGENT'S PERSPECTIVE.
## Tool Profile
{tool_profile}
## Test Results
{all 8 scenario results, concatenated}
## Reflection Questions
Answer each question with a score (1-5) and a brief explanation:
### Comprehension
1. **Description clarity** (1-5): Was the tool description sufficient to understand what it does and when to use it?
2. **Parameter discoverability** (1-5): Could an agent figure out the correct parameters without trial and error?
3. **Mental model accuracy** (1-5): Does the description create an accurate mental model of what the tool actually does?
### Confidence
4. **Selection confidence** (1-5): How confident would an agent be that this is the RIGHT tool for a given task?
5. **Result trust** (1-5): After seeing results, would an agent trust them enough to act without verification?
6. **Scope clarity** (1-5): Is it clear what the tool searches (single repo? all repos? branches? history?)
### Friction
7. **Input construction** (1-5): How easy is it to go from a user question to a valid tool invocation?
8. **Output parsing** (1-5): How easy is it to extract actionable information from the tool's output?
9. **Error recovery** (1-5): When results are poor or empty, does the agent know what to try next?
### Composition
10. **Chainability** (1-5): How well do results feed into follow-up tool calls?
11. **Tool boundary clarity** (1-5): Is it clear where this tool ends and another should begin?
12. **Redundancy with other tools** (1-5, lower=more redundant): How distinct is this tool from alternatives?
### Trust Calibration
13. **Completeness signal** (1-5): Does the tool indicate whether results are exhaustive or partial?
14. **Ranking signal** (1-5): Are results ordered by relevance? Is ranking quality visible?
15. **Confidence signal** (1-5): Does the tool communicate its own confidence in results?
## Output Format
```yaml
tool: "{tool_name}"
overall_usability_score: N # 1-5 weighted average
dimension_scores:
  comprehension: N
  confidence: N
  friction: N
  composition: N
  trust_calibration: N

# For each dimension, the single most impactful issue
top_issues:
  comprehension: "..."
  confidence: "..."
  friction: "..."
  composition: "..."
  trust_calibration: "..."

# Specific improvement recommendations ranked by impact
recommendations:
  - priority: 1
    area: "description|parameters|output_format|error_handling|documentation"
    issue: "what's wrong"
    suggestion: "specific change to make"
    impact: "what improves if this is fixed"
  # ... up to 10 recommendations

# Things the tool does well (don't lose these in a redesign)
strengths:
  - "..."

# The single most important thing to fix
if_you_fix_one_thing: "..."

# Would the agent voluntarily reach for this tool?
organic_reachability: "high/medium/low"
organic_reachability_note: "..."
```
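To make "weighted average" concrete, here is a small sketch of one way to roll up the 15 question scores, assuming equal weights within and across dimensions (the skill does not fix the weighting, so adjust if some dimensions matter more):

```python
# Question numbers per dimension, as grouped in the reflection questions above.
DIMENSIONS = {
    "comprehension":     [1, 2, 3],
    "confidence":        [4, 5, 6],
    "friction":          [7, 8, 9],
    "composition":       [10, 11, 12],
    "trust_calibration": [13, 14, 15],
}

def roll_up(question_scores: dict[int, int]) -> dict:
    """question_scores maps question number (1-15) to its 1-5 score."""
    dimension_scores = {
        dim: round(sum(question_scores[q] for q in qs) / len(qs), 2)
        for dim, qs in DIMENSIONS.items()
    }
    overall = round(sum(dimension_scores.values()) / len(dimension_scores), 2)
    return {"overall_usability_score": overall, "dimension_scores": dimension_scores}
```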
## Phase 5: Synthesize Report
After the reflection agent returns, combine all data into a unified report with these sections:
### 1. Executive Summary
- Tool name, purpose, overall usability score (1-5)
- One-sentence verdict
- Top 3 recommendations
### 2. Test Results Matrix
| # | Scenario | Accomplished | Relevance | Actionability | First-Try | Fallbacks Used |
|---|----------|-------------|-----------|---------------|-----------|----------------|
### 3. Dimension Scores
| Dimension | Score | Top Issue |
|-----------|-------|-----------|
| Comprehension | N/5 | ... |
| Confidence | N/5 | ... |
| Friction | N/5 | ... |
| Composition | N/5 | ... |
| Trust Calibration | N/5 | ... |
### 4. Detailed Findings
Group by dimension, include evidence from specific scenarios.
### 5. Comparison Context
How this tool compares to overlapping tools (if tested). Note: this section improves after evaluating multiple tools.
### 6. Prioritized Recommendations
Full ranked list from the reflection pass with implementation specifics.
### 7. Strengths (Preserve These)
What the tool does well — critical for avoiding regressions during improvement.
### 8. Raw Data
Link to or include the full YAML outputs from all agents.
Save the report to the working directory as `mcp_eval_{tool_name}.md`.
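As a sketch of the mechanical part of this phase, assuming each test agent's YAML has already been parsed into a dict with the Phase 3 field names (the prose sections of the report still have to be written by the main agent):

```python
# Illustrative: build the Test Results Matrix rows and save the report file.
def results_matrix(scenario_results: list[dict]) -> str:
    header = (
        "| # | Scenario | Accomplished | Relevance | Actionability | First-Try | Fallbacks Used |\n"
        "|---|----------|-------------|-----------|---------------|-----------|----------------|\n"
    )
    rows = []
    for i, r in enumerate(scenario_results, start=1):
        fallbacks = r.get("fallback_tools_used") or []
        rows.append(
            f"| {i} | {r['scenario']} | {r['task_accomplished']} "
            f"| {r['scores']['result_relevance']} | {r['scores']['output_actionability']} "
            f"| {r['scores']['first_attempt_success']} | {len(fallbacks)} |"
        )
    return header + "\n".join(rows)

def save_report(tool_name: str, report_markdown: str) -> None:
    with open(f"mcp_eval_{tool_name}.md", "w") as f:
        f.write(report_markdown)
```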
## Phase 6: Next Steps
Present the report to the user and ask:
- Run `/mcp-eval` on another tool to build comparison data?
- Deep-dive on a specific dimension or scenario?
- Draft specific tool description / parameter changes based on findings?
## Rules
- **Independence in Phase 3**: test agents must NOT share context. Each gets only its own scenario.
- **Separation of testing and reflection**: Phase 4 MUST be a separate pass after Phase 3. Never ask an agent to reflect while it's still trying to accomplish a task.
- **All Phase 3 agents launch in a single parallel batch**: use one message with 8 Agent tool calls.
- **Real tool usage**: agents must actually invoke the MCP tool, not just reason about it hypothetically.
- **Honest scoring**: agents should not inflate scores. A 3/5 is fine. Tool descriptions that are merely "okay" should be scored as such.
- **Preserve raw data**: never discard agent outputs. The raw YAML is primary data.
- **Composability**: the report format is designed so multiple tool evaluations can be compared side-by-side. Use consistent scoring scales.
- **No tool modification**: this skill evaluates tools, it does not modify them. Recommendations are advisory.
## Pipeline Position
Can be run standalone or as part of a tool improvement cycle:
/mcp-eval {tool} -> analyze report -> modify tool description/params -> /mcp-eval {tool} (re-test)
Pairs well with:
- `/stress-test` for adversarial analysis of the tool's backend
- `/diverge` for exploring alternative tool designs
- `/converge` for synthesizing findings across multiple tool evaluations