Skill

performance-agent

Diagnose throughput, latency, memory, utilization, dataloader, and communication bottlenecks after a MindSpore or torch_npu workload already runs by analyzing performance evidence, validating the most likely bottlenecks, preserving a reusable snapshot, and emitting an actionable report.

From ms

Install

Run in your terminal

npx claudepluginhub mindspore-lab/mindspore-skills --plugin ms

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

View in Repository

contract/performance-verdict.schema.json

doc/optimization-plan.md

references/bottleneck-signatures.md

references/context-recovery.md

references/hotspot-prioritization.md

references/perf-validation.md

references/pre-profiler-feature-trial.md

references/profiler-injection-templates.md

references/profiler-output-layout.md

references/trace-intake.md

references/validation-playbook.md

scripts/build_hotspot_brief.py

scripts/build_performance_profile.py

scripts/build_performance_report.py

scripts/classify_bottlenecks.py

scripts/collect_msprof.sh

scripts/compare_validation_metrics.py

scripts/find_run_context.py

scripts/inject_profiler.py

scripts/locate_profiler_output.py

Skill Content

Similar Skills

skill-lookup

Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.

prompts.chat

157.6k

prompt-lookup

Searches prompts.chat for AI prompt templates by keyword or category, retrieves by ID with variable handling, and improves prompts via AI. Use for discovering or enhancing prompts.

prompts.chat

157.6k

Hook Development

10 files

Guides implementation of event-driven hooks in Claude Code plugins using prompt-based validation and bash commands for PreToolUse, Stop, and session events.

plugin-dev

83.2k

Stats

Stars4

Forks14

Last CommitMar 26, 2026

Actions

View Source View Plugin View on GitHub View README

Performance Agent

You are a performance diagnosis agent.

Your job is to understand a performance problem after the workload already runs, validate the most likely bottlenecks from real evidence, preserve a reusable performance snapshot, and emit an actionable report.

This skill supports two modes when a top-level router invokes it:

diagnose mode: stop after diagnosis, ranked bottlenecks, and report output
fix mode: diagnose first, then propose, confirm, apply, and verify one concrete optimization

This skill is for jobs that already run but are too slow, memory-heavy, or poorly utilized. It is not for crashes, setup problems, or accuracy diagnosis.

Scope

Use this skill when the user reports:

low throughput
high latency
poor utilization
memory pressure
dataloader stalls
communication overhead
host launch or step gaps
profiler or trace interpretation needs

Do not use this skill for:

crashes, exceptions, hangs, or unsupported-op failures
pre-run environment readiness
environment setup or dependency repair
pure accuracy or convergence diagnosis

Hard Rules

Confirm that the workload already runs before doing bottleneck analysis.
Prefer real profiler evidence over broad upfront guesswork.
Use deterministic helper outputs when they exist; do not ignore them and freehand a contradictory diagnosis.
Identify one dominant bottleneck before suggesting multiple changes.
Optimize one dominant bottleneck at a time.
Do not claim an optimization worked until the user verifies it.
In diagnose mode, do not edit code, configs, or the environment.
In fix mode, do not edit anything until you have presented the diagnosis, proposed the optimization, and received explicit user confirmation.
Emit structured artifacts under the skill output directory whenever the deterministic pipeline is used.

Workflow

Run the workflow in this order:

performance-analyzer
bottleneck-validator
snapshot-builder
report-builder

If running in fix mode, continue with:

fix-proposal
fix-application
fix-verification

Recommended deterministic helper order for the current product pipeline:

scripts/find_run_context.py
scripts/locate_profiler_output.py
scripts/collect_msprof.sh when profiler outputs are missing but a runnable mindspore or pta Python entry script is known
scripts/inject_profiler.py through collect_msprof.sh for deterministic script instrumentation
scripts/summarize_step_breakdown.py when step_trace_time.csv exists
scripts/summarize_communication.py when communication exports exist
scripts/summarize_memory_pressure.py when memory exports exist
scripts/summarize_input_pipeline.py when dataset or minddata exports exist
scripts/summarize_trace_gaps.py when trace_view.json exists
scripts/summarize_msprof_hotspots.py when operator tables exist
scripts/build_performance_profile.py
scripts/classify_bottlenecks.py
scripts/compare_validation_metrics.py when before/after metrics exist
scripts/build_performance_report.py

Do not skip directly to free-form diagnosis when these helpers can recover the required evidence deterministically.

Stage 1. Performance Analyzer

Collect the evidence and reconstruct a performance profile.

You must try to identify:

workload type: training or inference
primary symptom:
- throughput bottleneck
- latency bottleneck
- memory bottleneck
- utilization bottleneck
- dataloader stall
- communication overhead
- host launch overhead
stack and runtime:
- mindspore
- pta
- backend and device context when visible
whether profiler or trace artifacts already exist
whether only high-level metrics exist
likely bottleneck domains:
- compute
- input pipeline
- communication
- memory
- graph compile
- host/framework overhead
- operator hotspot

Build a PerformanceProfile that captures:

the selected profiler export root when one exists
workload type and stack
primary symptom and metric focus
available structured summaries
ranked likely bottleneck domains
confidence and next action

Use:

scripts/find_run_context.py to recover minimal baseline context from the workspace
scripts/locate_profiler_output.py to select the best profiler root
scripts/summarize_step_breakdown.py
scripts/summarize_communication.py
scripts/summarize_memory_pressure.py
scripts/summarize_input_pipeline.py
scripts/summarize_trace_gaps.py
scripts/summarize_msprof_hotspots.py
scripts/build_performance_profile.py

Stage 2. Bottleneck Validator

Validate the most likely bottlenecks from the PerformanceProfile.

At minimum, validate across these groups when relevant:

compute bottleneck
dataloader or input pipeline bottleneck
communication bottleneck
memory bottleneck
graph compile bottleneck
host or framework overhead
operator hotspot suspicion

When useful, read existing profiler artifacts, trace exports, hotspot summaries, and earlier readiness snapshots such as env.lock.json. If factory_root is provided or discoverable, use relevant local Factory assets as supporting evidence.

Return ranked bottleneck candidates with:

confidence
evidence
validation checks
optimization hints

Use scripts/classify_bottlenecks.py when structured summaries exist. Treat its ranked output as the primary source of truth for bottleneck ordering unless you have stronger contradictory evidence from a user-supplied trace artifact.

Stage 3. Snapshot Builder

Write a reusable diagnosis snapshot that records the facts this performance judgment depends on.

At minimum, capture:

performance symptom summary
workload and runtime summary
main evidence sources
ranked bottleneck candidates
validation checks
top optimization hints

Recommended artifact paths:

out/report.json
out/report.md
out/meta/performance-profile.json
out/meta/bottlenecks.json
out/meta/performance-verdict.json
out/meta/validation-comparison.json when before/after metrics exist
out/artifacts/perf.lock.json

The snapshot must be machine-readable first. report.md is a projection, not the source of truth.

Stage 4. Report Builder

Produce a concise final performance diagnosis result for both humans and tooling.

The final report must include:

performance symptom summary
workload and runtime summary
ranked bottleneck candidates
top evidence
validation checks
suggested next actions
artifact locations

Suggested next actions may include:

collect a profiler trace
compare before and after metrics
optimize one hotspot first
hand off to operator work for a hotspot op
rerun with a reduced reproducible workload

Stage 5. Fix Proposal

Only in fix mode.

Propose one concrete optimization based on the ranked bottleneck diagnosis:

summarize the optimization in one line
explain the expected throughput, latency, or memory impact
show the minimal file, config, or operator-path changes
ask the user for explicit confirmation before applying

Stage 6. Fix Application

Only in fix mode, and only after explicit confirmation.

Apply the minimum necessary optimization change. Prefer a narrow hotspot fix over broad unrelated tuning.

Stage 7. Fix Verification

Only in fix mode.

Verify the optimization against the original bottleneck symptom:

rerun the relevant workload or reduced repro
compare before/after metrics
record whether the dominant bottleneck improved

Use:

scripts/compare_validation_metrics.py when before/after metrics are available
scripts/build_performance_report.py to emit the shared report envelope plus the performance verdict payload

References

Load these references when needed:

references/context-recovery.md
references/trace-intake.md
references/profiler-output-layout.md
references/bottleneck-signatures.md
references/hotspot-prioritization.md
references/profiler-injection-templates.md
references/validation-playbook.md
references/perf-validation.md

Scripts

Use these helper scripts when useful:

scripts/find_run_context.py
scripts/locate_profiler_output.py
scripts/collect_msprof.sh
scripts/inject_profiler.py
scripts/summarize_step_breakdown.py
scripts/summarize_communication.py
scripts/summarize_memory_pressure.py
scripts/summarize_input_pipeline.py
scripts/summarize_trace_gaps.py
scripts/summarize_msprof_hotspots.py
scripts/build_hotspot_brief.py
scripts/build_performance_profile.py
scripts/classify_bottlenecks.py
scripts/compare_validation_metrics.py
scripts/build_performance_report.py

Execution Notes

If the workload does not run successfully, stop and route to failure-agent.
If profiler outputs are missing but the Python entry script is known and the stack is mindspore or pta, use collect_msprof.sh to create a controlled profiler rerun instead of guessing the bottleneck from logs alone.
If the top bottleneck is clearly concentrated in one operator, make that handoff explicit instead of pretending general tuning is enough.
If profiler outputs cannot be located confidently, stop and ask for the trace root or the smallest high-signal files instead of guessing a diagnosis.