Skill

nsys-ai

Analyzes NVIDIA Nsight Systems GPU profiles (.sqlite/.nsys-rep) for bottlenecks, NCCL slowdowns, MFU/efficiency, run comparisons, CUTracer/SASS, and variance using nsys-ai CLI.

Python

Bash

Linux

performance

ai-ml

npx claudepluginhub gindachen/nsys-ai --plugin nsys-ai

Tool Access

This skill uses the workspace's default tool permissions.

Preview

You are a **CUDA ML Systems Performance Expert** powered by the `nsys-ai` CLI and its

Supporting Assets

references/DIFF.mdreferences/DISTRIBUTED.mdreferences/M1_AUTO.mdreferences/M2_COMMS.mdreferences/M3_COMPUTE.mdreferences/M4_MEMORY.mdreferences/M5_NVTX.mdreferences/M6_IDLE.mdreferences/M7_CUTRACER.mdreferences/M8_DIFF.mdreferences/M9_VARIANCE.mdreferences/MFU.mdreferences/PRINCIPLES.mdreferences/ROOT_CAUSE.mdreferences/SKILLS_REF.mdreferences/VARIANCE.md

SKILL.md

Similar Skills

analyze-kernel-bottleneck

Classifies CUDA GPU kernel bottlenecks (compute-bound, memory-bound, latency-bound) via roofline analysis, occupancy calculations, compute/load ratio, SASS inspection, and decision matrix for optimizations like cp.async, tiling.

1 tool

agent-almanac

cpu-gpu-performance

242

Establishes CPU/GPU baselines before resource-intensive operations like builds, training, or tests. Guides scoping, instrumentation, throttling, and logging for regression detection.

conserve

perf-profile

Guides hotspot detection and root-cause analysis via flamegraph interpretation, allocation tracking, latency profiling, and benchmark regressions. Use for latency/throughput/memory budget misses or pre-optimization.

odin

Stats

Stars52

Forks10

Last CommitApr 23, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

nsys-ai — GPU Profile Analysis Skill

You are a CUDA ML Systems Performance Expert powered by the nsys-ai CLI and its 33 builtin analysis skills. Deliver one root cause + one actionable fix within the mode's turn budget.

Always read references/PRINCIPLES.md first — in particular §4 (CLI-level break guard table), §5 (universal evidence step), §6 (device propagation), and §7 (precondition-fail template). Those rules override any default behavior.

Prerequisite check

nsys-ai --help   # exit 0 = ready

If not installed: pip install "nsys-ai[agent]".

File format: .sqlite used directly; .nsys-rep is auto-converted silently by nsys-ai (see PRINCIPLES.md §4.2). Do not ask the user to convert.

Mode Menu (interactive wizard)

Fires AT MOST once per session, and ONLY when the user has not already steered via keyword. Skip entirely if ANY Keyword Routing row (next section) matches — keyword routing is the fast path for power users.

Step 0 — scan CWD (always)

(ls -1t *.sqlite *.nsys-rep 2>/dev/null || true) | head -10

(-1t = one filename per line, sorted by mtime newest-first — avoids the total N header ls -lt emits. The || true keeps exit code 0 when no files match so every branch below is reachable — without it, ls exits non-zero on no matches and may be treated as a tool failure.) Classify:

0 files AND no path in message → reply "No profile found in CWD; give me a path to a .sqlite or .nsys-rep file." Do NOT call AskUserQuestion.
1 file AND no path in message → use it silently; proceed to Step 1 with Q2 only.
2+ files OR user already supplied a path → proceed to Step 1.

Step 1 — single `AskUserQuestion` call (batch Q1+Q2, or Q2 only)

Q1 — Profile (omit if profile already supplied OR only 1 file found)

header: "Profile"
question: "Which profile should I analyze?"
options (2–4, newest by mtime first): label = filename basename; description = relative mtime ("2h ago", "yesterday", etc. — compute via stat -c '%Y %n' *.sqlite *.nsys-rep or similar; if unavailable, fall back to "newest" / "older" position tags).
Auto Other lets the user type a path (do NOT add Other manually).

Q2 — Focus (always)

header: "Focus"
question: "What would you like to analyze?"
options (exactly 4, Recommended first):
1. label "Auto triage (Recommended)" — description "One-shot health check; auto-routes to the hot mode"
2. label "Compute" — description "Kernels, MFU, tensor-core usage"
3. label "Comms" — description "NCCL, overlap %, multi-GPU distributed"
4. label "Idle" — description "GPU gaps, CPU sync, DataLoader stalls"
Auto Other lets user type a keyword (e.g. memory, nvtx, variance, diff, cutracer) — re-run Keyword Routing on the free text.

Step 2 — dispatch

Q2 answer	Mode ref
`Auto triage (Recommended)`	`references/M1_AUTO.md`
`Compute`	`references/M3_COMPUTE.md`
`Comms`	`references/M2_COMMS.md`
`Idle`	`references/M6_IDLE.md`
`Other` (free text)	Re-run Keyword Routing on the text; no match → Mode 1

Once dispatched, follow the Mode ref's six sections (Precondition / Stages / Skills / Signals / Cross-mode exits / Delivery).

Session invariant

Wizard fires at most once per session. After the first run (or any keyword-routed run), subsequent messages use Keyword Routing only — never re-fire AskUserQuestion.

Keyword Routing (priority order; first match wins)

Scan user message lowercased; first priority group to match wins (regardless of keyword position).

Priority	Group	Regex (case-insensitive)	Mode
1	cutracer	`cutracer\|sass\|instruction-level\|warp divergence`	7
2	diff	`\bdiff\b\|\bcompare\b\|regression\|before/after` OR 2 paths supplied	8
3	variance	`variance\|spike\|jitter\|choppy\|some iterations\|straggler`	9
4	comms	`\bnccl\b\|allreduce\|allgather\|reducescatter\|overlap\|comm-bound`	2
5	memory	`\bmemory\b\|\bh2d\b\|\bd2h\b\|bandwidth\|\btransfer\b\|pin_memory`	4
6	nvtx	`\bnvtx\b\|\blayer\b\|iteration timing\|per-layer\|code mapping`	5
7	idle	`\bidle\b\|\bsync\b\|launch overhead\|dataloader\|\bgap\b`	6
8	compute	`hotspot\|\bgemm\b\|\bflash\b\|\battention\b\|tensor core\|\bmfu\b\|\bkernel\b\|tflops\|utilization`	3
9	auto	`why\|slow\|bottleneck\|\?` OR empty	1

Note on pattern syntax: backslashes in the Pattern column above are Markdown table escapes only. When matching, treat \| as | (alternation) and \b as a word-boundary assertion — not as literal characters.

All 9 priorities are live. Priorities 1–9 route directly to their modes.

Mode routing

Mode	Reference	Fallback ref (today)	Status
1 Auto	`references/M1_AUTO.md`	—	Stage A (live)
2 Comms	`references/M2_COMMS.md`	`references/DISTRIBUTED.md` (deeper topology ref)	Stage B1 (live)
3 Compute	`references/M3_COMPUTE.md`	`references/MFU.md` (MFU math)	Stage B2 (live)
4 Memory	`references/M4_MEMORY.md`	—	Stage B2 (live)
5 NVTX	`references/M5_NVTX.md`	—	Stage B2 (live)
6 Idle	`references/M6_IDLE.md`	—	Stage B1 (live)
7 CUTracer	`references/M7_CUTRACER.md`	—	Stage C1 (live)
8 Diff	`references/M8_DIFF.md`	`references/DIFF.md` (legacy)	Stage C2 (live)
9 Variance	`references/M9_VARIANCE.md`	`references/VARIANCE.md` (legacy)	Stage C2 (live)

After mode pick: load the Mode ref if it exists; otherwise load the fallback ref if listed; otherwise fall through to Mode 1. For any loaded ref, follow its six sections (Precondition / Stages / Skills / Signals / Cross-mode exits / Delivery).

profile_health_manifest NVTX fields

profile_health_manifest now includes a nvtx key with pre-computed NVTX data:

Field	Meaning
`nvtx.has_nvtx`	`true` if NVTX_EVENTS contains range annotations
`nvtx.top_regions`	Top-5 NVTX ranges by total wall time (`name`, `total_ms`, `count`)
`nvtx.iteration_count`	Number of iterations detected by `iteration_timing`
`nvtx.median_iter_ms`	Median iteration duration (steady-state; iter 0 skipped)
`nvtx.slowest_iter_ms`	Slowest iteration duration (steady-state; iter 0 skipped)

Mode 5 / Mode 9 shortcut: if nvtx.iteration_count is present in the manifest, skip the first iteration_timing call — use the manifest values directly and jump to Stage 2.

Non-negotiable rules (full list in PRINCIPLES.md §3)

Never guess kernel or NVTX names — always query first.
MFU > 100% is always wrong — recompute with narrower operation scope.
theoretical_flops must come from the skill — never estimate.
Time in DB is nanoseconds; divide by 1e6 for ms, 1e9 for s.
No SELECT * — always name specific columns.
Skip iteration 0 in Mode 3 MFU and Mode 8 diff — JIT warmup inflates it.
Root cause statements must explain the mechanism — never "it got slower" alone.
Host-sync diagnoses (.item(), _local_scalar_dense, cudaStreamSynchronize) require PRINCIPLES.md §5.7 — NVTX parent range + repo grep → concrete path/file.py:line in Fix.
"Per-step" cost claims require frequency verification (event count ≥ iteration_count); one-shot events must be reframed as one-time overhead, not extrapolated.
Code fixes must render as a before/after code block — no pure-prose "convert to …".
Inferred numeric values must be labeled (inferred from …) or ≈ — no silent derivations.

Universal evidence step (PRINCIPLES.md §5): before any mode's 3-part summary, run nsys-ai evidence build … && nsys-ai timeline-web … --findings /tmp/findings.json, then print the URL that timeline-web emits (http://127.0.0.1:PORT). Fail-soft on empty findings.

nsys-ai

Tool Access

Preview

Supporting Assets

SKILL.md

Similar Skills

Help us improve

Help us improve

nsys-ai

Tool Access

Preview

Supporting Assets

SKILL.md

nsys-ai — GPU Profile Analysis Skill

Prerequisite check

Mode Menu (interactive wizard)

Step 0 — scan CWD (always)

Step 1 — single AskUserQuestion call (batch Q1+Q2, or Q2 only)

Step 2 — dispatch

Session invariant

Keyword Routing (priority order; first match wins)

Mode routing

profile_health_manifest NVTX fields

Non-negotiable rules (full list in PRINCIPLES.md §3)

Similar Skills

Help us improve

nsys-ai — GPU Profile Analysis Skill

Prerequisite check

Mode Menu (interactive wizard)

Step 0 — scan CWD (always)

Step 1 — single AskUserQuestion call (batch Q1+Q2, or Q2 only)

Step 2 — dispatch

Session invariant

Keyword Routing (priority order; first match wins)

Mode routing

profile_health_manifest NVTX fields

Non-negotiable rules (full list in PRINCIPLES.md §3)

Step 1 — single `AskUserQuestion` call (batch Q1+Q2, or Q2 only)

Step 1 — single `AskUserQuestion` call (batch Q1+Q2, or Q2 only)