kernel-opt-agent

Name: kernel-opt-agent
Author: fmh66

Benchmark, profile, and iteratively optimize custom GPU kernels (CUDA, Triton, CUTLASS) with evidence-backed knowledge from a curated corpus, correctness checks, and Nsight Compute bottleneck analysis.

Publisher marketplace: kernel-opt-agent@fmh66 (the marketplace and the plugin share one repository, fmh66/kernel-opt-agent)

npx claudepluginhub fmh66/kernel-opt-agent --plugin kernel-opt-agent

Popularity

Stars: top 25% (median 0, average 527)

Copy clicks: median 0, average 1

What's Inside

Skills (4)

kernel-KBS

/kernel-KBS

Corpus-backed GPU kernel knowledge base for CUDA, Triton, CuTe, CUTLASS, and Ampere/Hopper/Blackwell kernel research. Use when the user needs to search merged kernel PR pages, inspect PR diff/provenance artifacts, find KernelWiki synthesis pages, query blog/doc/contest notes, or retrieve evidence-backed implementation patterns by hardware feature, technique, repo, language, or kernel type. Do not use for environment checks, correctness checks, Nsight Compute profiling, benchmarking, or iterative optimization bookkeeping.

kernel-benchmark

/kernel-benchmark

Standalone kernel benchmarking skill for cuda-cpp, cutlass, cute-dsl, and triton implementations. Use when the user wants to compare a custom CUDA/CUTLASS .cu kernel or CuTe DSL/Triton .py kernel against selectable PyTorch eager, torch.compile, or FlashInfer baselines, validate correctness, measure execution time with KernelBench-style CUDA event timing, or generate benchmark.md for kernel optimization results.

kernel-loop

/kernel-loop

Iterative GPU kernel optimization orchestrator for CUDA/CUTLASS/CuTe DSL/Triton kernels. Use for measured, one-change-at-a-time optimization loops with correctness, NCU profiling, KBS evidence, hypothesis discipline, hard iteration gates, final benchmarking, and a traceable report.

kernel-profile

/kernel-profile

Standalone kernel profiling skill for cuda-cpp, cute-dsl, cutlass, and triton implementations. Checks CUDA/PyTorch/Triton/CuTe DSL/CUTLASS/NCU/nsight-python readiness, optionally locks GPU clocks, validates correctness, collects Nsight Compute metrics with nsight-python, produces env_check.md, correctness.md, ncu_summary.md and ncu_details.md, and classifies GPU bottlenecks from NCU evidence. Use when the user wants to profile a CUDA/CUTLASS .cu kernel or CuTe DSL/Triton .py kernel, compare against a Python reference, inspect occupancy, memory, compute, scheduler, stall, or branch metrics, or diagnose Memory-Bound, Compute-Bound, Latency-Bound, Occupancy-Bound, or Mixed behavior.

Stats

Version: 0.1.0

Language: Python

Stars: 12

Maintenance: Excellent

License: MIT

Last commit: May 27, 2026

Added: May 24, 2026

README

Kernel Opt Agent

This repository provides four Claude Code / Codex skills for GPU kernel work:

kernel-KBS: a read/query knowledge base for CUDA, Triton, CuTe DSL, CUTLASS, and Ampere/Hopper/Blackwell kernel research.
kernel-benchmark: a standalone benchmark workflow for comparing CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer references.
kernel-profile: a local profiling workflow for environment checks, correctness validation, Nsight Compute metrics, and bottleneck diagnosis.
kernel-loop: an iterative optimization orchestrator that chains profiling, KBS-guided hypotheses, one-change kernel iterations, final benchmarking, and reports.

Skills

| Skill | Purpose | Main entry points |
| --- | --- | --- |
| `kernel-KBS` | Search evidence-backed kernel knowledge from PRs, docs, blogs, contests, curated wiki pages, code artifacts, and provenance records. | `skills/kernel-KBS/scripts/kbs.py` |
| `kernel-benchmark` | Compare custom CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, `torch.compile`, and FlashInfer baselines for correctness and latency. | `skills/kernel-benchmark/scripts/benchmark.py` |
| `kernel-profile` | Validate and profile concrete CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels, then classify bottlenecks from NCU evidence. | `skills/kernel-profile/env/scripts/env_check.py`, `skills/kernel-profile/env/scripts/enc_config.py`, `skills/kernel-profile/scripts/correctness_check.py`, `skills/kernel-profile/scripts/ncu_profile.py` |
| `kernel-loop` | Run a fixed-iteration optimization loop with environment checks, correctness, NCU profiling, bottleneck classification, KBS evidence, one-variable hypotheses, final benchmark, and report. | `skills/kernel-loop/SKILL.md`, `skills/kernel-loop/references/hypothesis.md`, `skills/kernel-loop/references/report_template.md` |

kernel-KBS is read-only by default and should be used for retrieval and source-backed implementation ideas. It does not run kernels or collect performance data.
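To make the idea of read-only, metadata-driven retrieval concrete, here is a minimal sketch of filtering knowledge records by fields such as language, architecture, or technique. Everything in it is an assumption for illustration: the `KnowledgeRecord` shape, the `search` helper, and the two placeholder records are hypothetical and do not describe the actual `kbs.py` interface or store schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KnowledgeRecord:
    """Hypothetical record shape; the real KBS store schema may differ."""
    title: str
    source: str          # e.g. "pr", "doc", "blog", "contest", "wiki"
    language: str        # e.g. "cuda", "triton", "cute-dsl"
    architecture: str    # e.g. "ampere", "hopper", "blackwell"
    techniques: frozenset = field(default_factory=frozenset)


def search(records, **filters):
    """Return the records that match every filter; never mutates the store."""
    def matches(rec):
        for key, wanted in filters.items():
            value = getattr(rec, key)
            if isinstance(value, frozenset):
                # Set-valued fields match when they contain the wanted tag.
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    return [rec for rec in records if matches(rec)]


# Placeholder records, invented purely for this example.
RECORDS = [
    KnowledgeRecord("Example GEMM page", "pr", "cuda", "hopper",
                    frozenset({"tma", "warp-specialization"})),
    KnowledgeRecord("Example attention note", "blog", "triton", "ampere",
                    frozenset({"software-pipelining"})),
]

hits = search(RECORDS, architecture="hopper", techniques="tma")
print([rec.title for rec in hits])
```

Because the helper only reads and filters, it mirrors the retrieval-only role described above: it returns candidate sources and leaves running or measuring kernels to the other skills.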

kernel-benchmark runs standalone benchmarks and writes benchmark.md, including correctness results, timing statistics, and speedups versus selected baselines. It uses KernelBench-style CUDA event timing by default.
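The timing statistics and speedups mentioned above can be sketched as follows. This is a minimal illustration, not the skill's own code: the function names `summarize`, `speedup`, and `report_rows` are hypothetical, the choice of statistics is an assumption, and the latencies are placeholder numbers rather than measurements.

```python
import statistics


def summarize(times_ms):
    """Summarize per-run latencies (milliseconds) for one implementation."""
    return {
        "mean": statistics.fmean(times_ms),
        "median": statistics.median(times_ms),
        "std": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
        "min": min(times_ms),
        "max": max(times_ms),
        "runs": len(times_ms),
    }


def speedup(baseline_ms, candidate_ms):
    """Baseline time divided by candidate time; above 1.0 means the candidate is faster."""
    return baseline_ms / candidate_ms


def report_rows(candidate_times, baselines):
    """Build one (baseline, baseline mean, candidate mean, speedup) row per baseline."""
    cand = summarize(candidate_times)
    rows = []
    for name, times in baselines.items():
        base = summarize(times)
        rows.append((name, base["mean"], cand["mean"],
                     speedup(base["mean"], cand["mean"])))
    return rows


# Placeholder latencies, invented for the example.
rows = report_rows(
    candidate_times=[1.0, 1.0, 1.0],
    baselines={"eager": [2.0, 2.0, 2.0], "torch.compile": [1.5, 1.5, 1.5]},
)
for name, base_mean, cand_mean, ratio in rows:
    print(f"{name}: baseline {base_mean:.3f} ms, custom {cand_mean:.3f} ms, speedup {ratio:.2f}x")
```

Comparing the candidate against each selected baseline separately keeps every speedup tied to a named reference, which is what a per-baseline table in a benchmark report needs.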

kernel-profile runs local checks and profiling. It produces artifacts such as env_check.md, correctness.md, ncu_summary.md, and ncu_details.md.
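As a rough illustration of what classifying a bottleneck from profiler evidence can look like, the sketch below maps three utilization percentages to the five labels the skill reports (Memory-Bound, Compute-Bound, Latency-Bound, Occupancy-Bound, Mixed). The function name, the inputs, the thresholds, and the decision order are all assumptions made for this example; the skill's actual rules are not described on this page and may weigh many more metrics.

```python
def classify_bottleneck(memory_pct, compute_pct, occupancy_pct,
                        high=60.0, low_occupancy=40.0):
    """Map three utilization percentages (0-100) to one bottleneck label.

    The thresholds are illustrative assumptions, not values taken from
    kernel-profile or from Nsight Compute.
    """
    memory_high = memory_pct >= high
    compute_high = compute_pct >= high

    if memory_high and compute_high:
        return "Mixed"
    if memory_high:
        return "Memory-Bound"
    if compute_high:
        return "Compute-Bound"
    # Neither pipeline is saturated: blame low occupancy if it is low,
    # otherwise treat the kernel as limited by latency.
    if occupancy_pct < low_occupancy:
        return "Occupancy-Bound"
    return "Latency-Bound"


print(classify_bottleneck(memory_pct=85.0, compute_pct=20.0, occupancy_pct=70.0))
```

Checking the saturated pipelines first and falling back to occupancy and latency only when neither is busy is one simple ordering; a real classifier would likely also consult stall reasons and scheduler statistics.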

kernel-loop orchestrates the other skills for an end-to-end optimization loop. It preserves every version, requires one hypothesis before each code change, keeps dimensions and measurement settings fixed, and writes final_report.md from measured artifacts.
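The bookkeeping rules described above (keep every version, require a hypothesis before each change, pick results only from measured data) can be sketched with a small history structure. This is a hypothetical illustration: the `Iteration` fields, the `record` and `best` helpers, and the sample entries are assumptions and are not taken from `SKILL.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Iteration:
    version: str      # e.g. "v1"; every version is kept, never overwritten
    hypothesis: str   # the single change this version tests
    correct: bool     # outcome of the correctness check
    time_ms: float    # latency measured under the fixed settings


def record(history, version, hypothesis, correct, time_ms):
    """Append one iteration; reject a change that has no stated hypothesis."""
    if not hypothesis.strip():
        raise ValueError("each code change needs one stated hypothesis")
    if any(it.version == version for it in history):
        raise ValueError(f"version {version!r} already recorded")
    history.append(Iteration(version, hypothesis, correct, time_ms))


def best(history):
    """Fastest version that passed correctness, or None if none passed."""
    passing = [it for it in history if it.correct]
    return min(passing, key=lambda it: it.time_ms) if passing else None


# Placeholder entries, invented for the example.
history = []
record(history, "v0", "baseline, no change", True, 2.0)
record(history, "v1", "vectorize global loads", True, 1.6)
record(history, "v2", "raise tile size", False, 1.2)  # faster but incorrect

winner = best(history)
print(winner.version, winner.time_ms)
```

Keeping failed or incorrect versions in the history, rather than discarding them, is what lets a final report trace each decision back to a recorded measurement.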

Layout

skills/
├── kernel-KBS/
│   ├── SKILL.md
│   ├── requirements.txt
│   ├── references/
│   ├── scripts/
│   └── store/
├── kernel-benchmark/
│   ├── SKILL.md
│   ├── requirements.txt
│   └── scripts/
├── kernel-loop/
│   ├── SKILL.md
│   └── references/
└── kernel-profile/
    ├── SKILL.md
    ├── requirements.txt
    ├── env/
    ├── reference/
    └── scripts/

Install

Claude Code plugin:

/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66

Codex plugin marketplace:

/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66

This repository includes both .claude-plugin/plugin.json and .codex-plugin/plugin.json, so the same GitHub marketplace URL works for both tools.

Directly from this repository:

python3 install_skills.py --target all        # Claude Code + Codex
python3 install_skills.py --target claude     # Claude Code only
python3 install_skills.py --target codex      # Codex only

By default, the installer installs kernel-KBS, kernel-benchmark, kernel-profile, and kernel-loop into user-level skill directories. Use --scope project to install into this repository's .claude/skills or .codex/skills directories instead.

Useful installer options:

python3 install_skills.py --dry-run
python3 install_skills.py --force
python3 install_skills.py --mode symlink
python3 install_skills.py --target codex --skill kernel-benchmark
python3 install_skills.py --target all --all-skills

Dependencies

Each skill owns its Python dependencies in its own requirements.txt:

python3 -m pip install -r skills/kernel-KBS/requirements.txt
python3 -m pip install -r skills/kernel-benchmark/requirements.txt
python3 -m pip install -r skills/kernel-profile/requirements.txt
