Help us improve
Share bugs, ideas, or general feedback.
From kernel-opt-agent
Orchestrates iterative GPU kernel optimization loops with correctness checks, NCU profiling, KBS evidence, hypothesis discipline, iteration gates, and final benchmarking.
npx claudepluginhub fmh66/kernel-opt-agent --plugin kernel-opt-agentHow this skill is triggered — by the user, by Claude, or both
Slash command
/kernel-opt-agent:kernel-loopThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run a measured kernel optimization loop. The loop is valid only when every version is correct, profiled, evidence-backed, and accepted by the gate script before the next version is created.
Autonomous GPU kernel optimization loop: plans, writes, benchmarks, and tunes CUDA kernels with correctness checks, profiling, and shape-aware dispatch.
Profiles CUDA/CUTLASS/CuTe DSL/Triton GPU kernels: checks environment, validates correctness, collects Nsight Compute metrics, and classifies bottlenecks (memory/compute/latency/occupancy/mixed bound).
Auto-detects repository type (FlagGems, vLLM, or general Python/Triton) and dispatches to sub-skills for GPU kernel operator generation, iterative optimization, platform specialization, and feedback.
Share bugs, ideas, or general feedback.
Run a measured kernel optimization loop. The loop is valid only when every version is correct, profiled, evidence-backed, and accepted by the gate script before the next version is created.
This is an orchestration skill. It does not own profiling, KBS search, or benchmarking scripts. It owns only the loop rules, output layout, local report templates, and the iteration gate.
Use the existing skills below for their own responsibilities. Paths in this table are relative to that skill's root directory. Do not assume a fixed install location.
| Skill | Role |
|---|---|
kernel-loop | Orchestrates version order, artifact layout, hypothesis discipline, final report shape, and the gate script at scripts/check_iteration_gate.sh. |
kernel-profile | Owns environment readiness, optional GPU clock config, correctness checks, NCU profiling, bottleneck evidence, and generated artifacts such as env_check.md, correctness.md, ncu_summary.md, and ncu_details.md. See its SKILL.md, scripts/scripts.md, and reference/NCU.md. |
kernel-KBS | Owns read-only evidence search for optimization patterns, hardware features, prior PRs, and implementation examples. See its SKILL.md and scripts/scripts.md. |
kernel-benchmark | Owns final correctness/timing comparison against PyTorch eager, torch.compile, or FlashInfer baselines, and writes benchmark.md. See its SKILL.md and scripts/scripts.md. |
The file names under <output_dir> are artifacts produced or written during the loop. They are not skill source paths.
Required: <kernel>, <ref.py>, <implementation>, <dims>.
Defaults: <gpu>=0, <N>=3.
Keep these fixed for the whole run: implementation, dimensions, seed, tolerances, GPU, pointer sizing, warmups, and timing trials.
v0 is the baseline. v1..vN are optimization attempts. There are N transitions:
v0 -> v1 -> ... -> vN
Each version directory must contain measured artifacts:
<output_dir>/
+-- ref.py
+-- env_check.md
+-- current_iteration.txt (managed only by the gate)
+-- v0/
| +-- correctness.md
| +-- ncu_summary.md
| +-- ncu_details.md
| +-- kbs_evidence.md
| +-- hypothesis.txt (required only when creating v1)
| +-- kernel.py | kernel.cu | <single .py/.cu kernel>
+-- v1/
+-- ...
+-- vN/ (final version: no next-change hypothesis required)
+-- benchmark.md
+-- final_report.md
Do not pre-populate future version directories. Empty placeholders are allowed but unnecessary. A future vK+1 kernel, correctness file, NCU file, KBS file, or hypothesis before vK passes the gate invalidates the loop.
<output_dir>/current_iteration.txt. If the file does not exist, the first valid gate run must be v0.kbs_evidence.md after NCU profiling and before hypothesis.txt.hypothesis.txt before creating the next kernel version. It belongs to the source version, for example v1/hypothesis.txt explains v1 -> v2.vK+1 artifacts while current_iteration.txt still says vK.current_iteration.txt and continue only from that version. If later version artifacts already exist, stop and treat the loop state as drifted.For v0:
kernel-profile to run environment check and write <output_dir>/env_check.md; copy ref.py.v0.kernel-profile correctness tooling to write <output_dir>/v0/correctness.md.kernel-profile NCU tooling to write <output_dir>/v0/ncu_summary.md and <output_dir>/v0/ncu_details.md.kernel-KBS to query measured NCU symptoms. Minimum: one kernel-specific query and one bottleneck-pattern query.kbs_evidence.md, then hypothesis.txt.v0. Only after it passes, create v1.For each vK where 0 < K < N, repeat steps 3-7. Preserve regressions; never overwrite an older version.
For final vN, use kernel-profile for correctness and NCU artifacts, use kernel-KBS for evidence, then run the final gate. Do not write a next-change hypothesis unless the user extends N.
Transition gate, before creating vK+1:
bash <kernel-loop-skill>/scripts/check_iteration_gate.sh <output_dir>/v<K> --verbose
Final pre-completion audit, before benchmarking and reporting:
bash <kernel-loop-skill>/scripts/check_iteration_gate.sh <output_dir>/v<N> --final --verbose
<kernel-loop-skill> means the resolved root directory of this skill, not a hard-coded repository path.
The gate enforces:
K > 0current_iteration.txt matches the checked versionUse references/kbs_evidence.md from this skill as a suggested output format. Query evidence through kernel-KBS; do not treat this template as a KBS retrieval script. The gate checks only that <output_dir>/vK/kbs_evidence.md exists, is non-empty, and has the expected H1/H2 title order.
Use references/hypothesis.md from this skill as a suggested output format. The hypothesis should describe the next one-variable change. The gate checks only that <output_dir>/vK/hypothesis.txt exists, is non-empty, and has the expected top-level section order before the next version is created.
Use kernel-benchmark only after the final gate passes. Write final_report.md from measured artifacts only, using references/report_template.md from this skill.
The report must compare every version, summarize every transition hypothesis, list KBS evidence actually used, identify the best correct version, and cite benchmark.md for final speedups. Use N/A for uncollected metrics and unknown only when an expected value cannot be located in artifacts.