Expert skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. Configure profiling sessions, analyze kernel reports, interpret occupancy metrics, roofline model data, memory bandwidth bottlenecks, and warp execution efficiency.
Analyzes GPU performance using NVIDIA Nsight profilers to configure sessions and interpret kernel metrics for optimization.
npx claudepluginhub a5c-ai/babysitterThis skill is limited to using the following tools:
README.mdYou are nsight-profiler - a specialized skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. This skill provides expert capabilities for performance analysis and optimization of GPU applications.
This skill enables AI-powered GPU profiling operations including:
System-wide performance analysis:
# Basic system profile
nsys profile -o report ./cuda_program
# Profile with CUDA API tracing
nsys profile -t cuda,nvtx,osrt -o report ./cuda_program
# Capture GPU metrics
nsys profile --gpu-metrics-device=all -o report ./cuda_program
# Profile specific duration
nsys profile -d 10 -o report ./cuda_program
# Export to multiple formats
nsys export -t sqlite,json report.nsys-rep
# Generate summary statistics
nsys stats report.nsys-rep
Detailed kernel analysis:
# Profile all kernels
ncu -o profile ./cuda_program
# Profile specific kernel
ncu --kernel-name myKernel -o profile ./cuda_program
# Full metric collection
ncu --set full -o profile ./cuda_program
# Roofline analysis
ncu --set roofline -o profile ./cuda_program
# Memory analysis
ncu --section MemoryWorkloadAnalysis -o profile ./cuda_program
# Compare two runs
ncu --import baseline.ncu-rep --diff ./cuda_program
Analyze and optimize occupancy:
# Collect occupancy metrics
ncu --section Occupancy -o occupancy ./cuda_program
# Key metrics to analyze:
# - Achieved Occupancy
# - Theoretical Occupancy
# - Block Limit (registers, shared memory, warps)
# - Occupancy Limiter
// Query occupancy in code
int numBlocks;
int blockSize = 256;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks, myKernel, blockSize, sharedMemSize);
float occupancy = (numBlocks * blockSize) /
(float)deviceProp.maxThreadsPerMultiProcessor;
printf("Theoretical Occupancy: %.2f%%\n", occupancy * 100);
Performance bound analysis:
# Generate roofline data
ncu --set roofline -o roofline ./cuda_program
# Key metrics:
# - Achieved FLOP/s
# - Achieved Memory Bandwidth
# - Arithmetic Intensity (FLOP/byte)
# - Ridge Point
Interpretation guide:
Identify memory bottlenecks:
# Memory analysis sections
ncu --section MemoryWorkloadAnalysis \
--section MemoryWorkloadAnalysis_Chart \
--section MemoryWorkloadAnalysis_Tables \
-o memory ./cuda_program
Key metrics:
Analyze warp efficiency:
# Warp state analysis
ncu --section WarpStateStatistics -o warp ./cuda_program
# Scheduler statistics
ncu --section SchedulerStatistics -o scheduler ./cuda_program
Key metrics:
Compare kernel variants:
# Baseline capture
ncu -o baseline ./program_v1
# Compare with new version
ncu --import baseline.ncu-rep --diff ./program_v2
# Generate comparison report
ncu --import baseline.ncu-rep \
--import optimized.ncu-rep \
--page diff --csv > comparison.csv
Automated analysis:
# Get optimization recommendations
ncu --section SpeedOfLight \
--section SpeedOfLight_RooflineChart \
-o speedoflight ./cuda_program
# Export with recommendations
ncu --import profile.ncu-rep --page details --csv > details.csv
# Step 1: System overview
nsys profile -t cuda -o system_overview ./program
nsys stats system_overview.nsys-rep
# Step 2: Identify hot kernels
ncu --launch-skip 10 --launch-count 5 -o hot_kernels ./program
# Step 3: Deep dive on bottleneck kernel
ncu --kernel-name hotKernel --set full -o detailed ./program
# Analyze memory access patterns
ncu --section SourceCounters \
--section MemoryWorkloadAnalysis \
--kernel-name targetKernel \
-o memory_analysis ./program
# Check for coalescing issues
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
-o coalescing ./program
# Profile with occupancy focus
ncu --section Occupancy \
--section LaunchStatistics \
-o occupancy ./program
# Test different block sizes
for bs in 64 128 256 512 1024; do
ncu --section Occupancy -o occ_$bs ./program --block-size $bs
done
This skill integrates with the following processes:
performance-profiling-analysis.js - Performance analysis workflowoccupancy-optimization.js - Occupancy optimizationwarp-efficiency-optimization.js - Warp efficiencygpu-memory-optimization.js - Memory optimizationWhen executing operations, provide structured output:
{
"operation": "kernel-profile",
"tool": "nsight-compute",
"kernel": "matrixMultiply",
"metrics": {
"duration_us": 125.4,
"achieved_occupancy": 0.78,
"theoretical_occupancy": 1.0,
"compute_throughput_pct": 65.2,
"memory_throughput_pct": 89.3,
"roofline": {
"arithmetic_intensity": 12.5,
"achieved_gflops": 4500,
"peak_gflops": 8000,
"bound": "compute"
}
},
"recommendations": [
"Increase block size to improve occupancy",
"Consider loop unrolling to reduce instruction overhead"
],
"artifacts": ["profile.ncu-rep", "summary.csv"]
}
Activates when the user asks about AI prompts, needs prompt templates, wants to search for prompts, or mentions prompts.chat. Use for discovering, retrieving, and improving prompts.
Search, retrieve, and install Agent Skills from the prompts.chat registry using MCP tools. Use when the user asks to find skills, browse skill catalogs, install a skill for Claude, or extend Claude's capabilities with reusable AI agent components.
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.