By fmh66
Accelerate GPU kernel development with an integrated workflow: query a knowledge base of CUDA, Triton, and CUTLASS patterns, benchmark custom kernels against PyTorch baselines, profile with Nsight Compute, and run iterative optimization loops with correctness checks.
npx claudepluginhub fmh66/kernel-opt-agent --plugin kernel-opt-agent
Corpus-backed GPU kernel knowledge base for CUDA, Triton, CuTe, CUTLASS, and Ampere/Hopper/Blackwell kernel research. Use when the user needs to search merged kernel PR pages, inspect PR diff/provenance artifacts, find KernelWiki synthesis pages, query blog/doc/contest notes, or retrieve evidence-backed implementation patterns by hardware feature, technique, repo, language, or kernel type. Do not use for environment checks, correctness checks, Nsight Compute profiling, benchmarking, or iterative optimization bookkeeping.
Standalone kernel benchmarking skill for cuda-cpp, cutlass, cute-dsl, and triton implementations. Use when the user wants to compare a custom CUDA/CUTLASS .cu kernel or CuTe DSL/Triton .py kernel against selectable PyTorch eager, torch.compile, or FlashInfer baselines, validate correctness, measure execution time with KernelBench-style CUDA event timing, or generate benchmark.md for kernel optimization results.
Iterative GPU kernel optimization orchestrator for CUDA/CUTLASS/CuTe DSL/Triton kernels. Use for measured, one-change-at-a-time optimization loops with correctness, NCU profiling, KBS evidence, hypothesis discipline, hard iteration gates, final benchmarking, and a traceable report.
Standalone kernel profiling skill for cuda-cpp, cute-dsl, cutlass, and triton implementations. Checks CUDA/PyTorch/Triton/CuTe DSL/CUTLASS/NCU/nsight-python readiness, optionally locks GPU clocks, validates correctness, collects Nsight Compute metrics with nsight-python, produces env_check.md, correctness.md, ncu_summary.md and ncu_details.md, and classifies GPU bottlenecks from NCU evidence. Use when the user wants to profile a CUDA/CUTLASS .cu kernel or CuTe DSL/Triton .py kernel, compare against a Python reference, inspect occupancy, memory, compute, scheduler, stall, or branch metrics, or diagnose Memory-Bound, Compute-Bound, Latency-Bound, Occupancy-Bound, or Mixed behavior.
This repository provides four Claude Code / Codex skills for GPU kernel work:
- kernel-KBS: a read/query knowledge base for CUDA, Triton, CuTe DSL, CUTLASS, and Ampere/Hopper/Blackwell kernel research.
- kernel-benchmark: a standalone benchmark workflow for comparing CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer references.
- kernel-profile: a local profiling workflow for environment checks, correctness validation, Nsight Compute metrics, and bottleneck diagnosis.
- kernel-loop: an iterative optimization orchestrator that chains profiling, KBS-guided hypotheses, one-change kernel iterations, final benchmarking, and reports.

| Skill | Purpose | Main entry points |
|---|---|---|
| kernel-KBS | Search evidence-backed kernel knowledge from PRs, docs, blogs, contests, curated wiki pages, code artifacts, and provenance records. | skills/kernel-KBS/scripts/kbs.py |
| kernel-benchmark | Compare custom CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer baselines for correctness and latency. | skills/kernel-benchmark/scripts/benchmark.py |
| kernel-profile | Validate and profile concrete CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels, then classify bottlenecks from NCU evidence. | skills/kernel-profile/env/scripts/env_check.py, skills/kernel-profile/env/scripts/enc_config.py, skills/kernel-profile/scripts/correctness_check.py, skills/kernel-profile/scripts/ncu_profile.py |
| kernel-loop | Run a fixed-iteration optimization loop with environment checks, correctness, NCU profiling, bottleneck classification, KBS evidence, one-variable hypotheses, final benchmark, and report. | skills/kernel-loop/SKILL.md, skills/kernel-loop/references/hypothesis.md, skills/kernel-loop/references/report_template.md |
kernel-KBS is read-only by default and should be used for retrieval and source-backed implementation ideas. It does not run kernels or collect performance data.
kernel-benchmark runs standalone benchmarks and writes benchmark.md, including correctness results, timing statistics, and speedups versus selected baselines. It uses KernelBench-style CUDA event timing by default.
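The timing statistics and speedup figures mentioned above can be sketched roughly as follows. This is an illustration only, not the code in benchmark.py: the names `time_with_cuda_events`, `summarize`, and `speedup`, the warmup and trial counts, and the mean-based speedup definition are all assumptions. The timing helper needs PyTorch and a CUDA device; the statistics helpers are plain Python.

```python
import statistics


def time_with_cuda_events(fn, warmup=3, trials=10):
    """Return per-trial latencies in milliseconds, measured with CUDA events.

    Hypothetical helper; requires PyTorch and a CUDA-capable GPU.
    """
    import torch  # imported lazily so the helpers below work without PyTorch

    for _ in range(warmup):
        fn()  # warm up caches, JIT compilation, and clocks
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait until both events have completed
        times_ms.append(start.elapsed_time(end))
    return times_ms


def summarize(times_ms):
    """Basic latency statistics over a list of per-trial timings."""
    return {
        "mean": statistics.fmean(times_ms),
        "std": statistics.pstdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }


def speedup(baseline_ms, custom_ms):
    """Speedup of the custom kernel over a baseline (assumed: ratio of means)."""
    return statistics.fmean(baseline_ms) / statistics.fmean(custom_ms)
```

A value above 1.0 from `speedup` means the custom kernel is faster than the chosen baseline under this assumed definition.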
kernel-profile runs local checks and profiling. It produces artifacts such as env_check.md, correctness.md, ncu_summary.md, and ncu_details.md.
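To make the bottleneck diagnosis step more concrete, here is a toy classifier that maps three summary percentages to the labels the skill reports (Memory-Bound, Compute-Bound, Latency-Bound, Occupancy-Bound, Mixed). Everything about it is an assumption for illustration: the dictionary keys, the thresholds, and the decision order are hypothetical and are not taken from ncu_profile.py or from Nsight Compute metric names.

```python
def classify_bottleneck(metrics, high=60.0, low_occupancy=30.0):
    """Classify a kernel from coarse utilization percentages (0-100).

    Hypothetical keys: memory_pct, compute_pct, occupancy_pct.
    Hypothetical thresholds: `high` and `low_occupancy`.
    """
    mem = metrics["memory_pct"]
    comp = metrics["compute_pct"]
    occ = metrics["occupancy_pct"]

    if mem >= high and comp >= high:
        return "Mixed"            # both subsystems are heavily used
    if mem >= high:
        return "Memory-Bound"
    if comp >= high:
        return "Compute-Bound"
    if occ < low_occupancy:
        return "Occupancy-Bound"  # too few resident warps to hide latency
    return "Latency-Bound"        # nothing saturated, occupancy is adequate


example = {"memory_pct": 82.0, "compute_pct": 25.0, "occupancy_pct": 70.0}
label = classify_bottleneck(example)
```

A real classification would draw on many more signals (scheduler, stall, and branch metrics), so treat this as a reading aid rather than a diagnostic tool.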
kernel-loop orchestrates the other skills for an end-to-end optimization loop. It preserves every version, requires one hypothesis before each code change, keeps dimensions and measurement settings fixed, and writes final_report.md from measured artifacts.
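The loop discipline described above (one hypothesis per change, every version preserved, acceptance based on measurements) can be modeled with a small bookkeeping structure. This is a sketch under assumptions: the class names, fields, and the acceptance rule (correct and faster than the best version so far) are hypothetical and are not taken from the kernel-loop skill files.

```python
from dataclasses import dataclass, field


@dataclass
class Iteration:
    version: str       # label for this kernel version
    hypothesis: str    # the single change being tested
    correct: bool      # result of the correctness check
    latency_ms: float  # measured latency under fixed settings


@dataclass
class OptimizationLog:
    baseline: Iteration  # assumed to be a correct starting kernel
    history: list = field(default_factory=list)

    def best(self):
        """Fastest correct version seen so far, including the baseline."""
        candidates = [self.baseline] + [it for it in self.history if it.correct]
        return min(candidates, key=lambda it: it.latency_ms)

    def record(self, iteration):
        """Store an iteration and report whether it beats the current best."""
        if not iteration.hypothesis.strip():
            raise ValueError("each code change needs a stated hypothesis")
        accepted = iteration.correct and iteration.latency_ms < self.best().latency_ms
        self.history.append(iteration)  # every version is kept, accepted or not
        return accepted
```

Keeping rejected iterations in `history` is what allows a final report to be written from measured artifacts rather than from memory.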
skills/
├── kernel-KBS/
│ ├── SKILL.md
│ ├── requirements.txt
│ ├── references/
│ ├── scripts/
│ └── store/
├── kernel-benchmark/
│ ├── SKILL.md
│ ├── requirements.txt
│ └── scripts/
├── kernel-loop/
│ ├── SKILL.md
│ └── references/
└── kernel-profile/
  ├── SKILL.md
  ├── requirements.txt
  ├── env/
  ├── reference/
  └── scripts/
Claude Code plugin:
/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66
Codex plugin marketplace:
/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66
This repository includes both .claude-plugin/plugin.json and .codex-plugin/plugin.json, so the same GitHub marketplace URL works for both tools.
Directly from this repository:
python3 install_skills.py --target all # Claude Code + Codex
python3 install_skills.py --target claude # Claude Code only
python3 install_skills.py --target codex # Codex only
By default, the installer installs kernel-KBS, kernel-benchmark, kernel-profile, and kernel-loop into user-level skill directories. Use --scope project to install into this repository's .claude/skills or .codex/skills directories instead.
Useful installer options:
python3 install_skills.py --dry-run
python3 install_skills.py --force
python3 install_skills.py --mode symlink
python3 install_skills.py --target codex --skill kernel-benchmark
python3 install_skills.py --target all --all-skills
Each skill owns its Python dependencies in its own requirements.txt:
python3 -m pip install -r skills/kernel-KBS/requirements.txt
python3 -m pip install -r skills/kernel-benchmark/requirements.txt
python3 -m pip install -r skills/kernel-profile/requirements.txt