Help us improve
Share bugs, ideas, or general feedback.
From humanize
Autonomous GPU kernel optimization loop: plans, writes, benchmarks, and tunes CUDA kernels with correctness checks, profiling, and shape-aware dispatch.
npx claudepluginhub bbuf/kernel-pilot --plugin humanizeHow this skill is triggered — by the user, by Claude, or both
Slash command
/humanize:humanize-kernel-agent-loopThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this flow when the user wants an autonomous kernel optimization loop, not a
Orchestrates iterative GPU kernel optimization loops with correctness checks, NCU profiling, KBS evidence, hypothesis discipline, iteration gates, and final benchmarking.
Auto-detects repository type (FlagGems, vLLM, or general Python/Triton) and dispatches to sub-skills for GPU kernel operator generation, iterative optimization, platform specialization, and feedback.
Profiles GPU kernels with Nsight Compute, exports metrics/source/PM-sampling reports, compares baseline vs candidate, classifies stalls, and produces one actionable kernel edit.
Share bugs, ideas, or general feedback.
Use this flow when the user wants an autonomous kernel optimization loop, not a general software feature loop. This is a kernel-specialized wrapper around Humanize RLCR.
This skill follows the paper-style Humanize kernel loop architecture in a KernelPilot standalone-repo variant. The input contract is:
Require: kernel definition K, correctness reference R, workload distribution W
Ensure: kernel implementation K_hat that passes correctness checks on W and
minimizes latency, plus dispatcher and tuning decisions when W contains
multiple regimes.
The loop has three stages:
kernel-knowledge evidence.
Extract implementation recipes, profile hypotheses, and benchmark priorities.ncu-report/Nsight Compute when useful, query
kernel-knowledge when prior art or hardware evidence can guide the next
edit, and record every lineage decision.Planning decomposes the work into task-acceptance pairs P = {(t_i, ac_i)}.
Each pair is executed until the review gate accepts the evidence:
for each (t_i, ac_i) in P:
done = false
while not done:
writer executes t_i in the standalone repo
# inspect/edit code, compile, test, benchmark, profile, query KernelWiki
verdict, feedback = reviewer checks evidence against ac_i
if verdict == pass:
done = true
else:
writer refines the kernel using feedback
# if blocked, consult reviewer/tool evidence; ask human only if still unresolved
return final kernels, dispatcher, and tuning decisions
In diagrams or papers that say "Claude executes, Codex reviews", read "Claude" as "the current writer agent". In a Claude Code session that may be Claude; in a Codex session that may be Codex. The review model is controlled by Humanize configuration or CLI flags, not by this skill.
The installer hydrates these paths:
Humanize runtime: {{HUMANIZE_RUNTIME_ROOT}}
KernelPilot root: {{KERNELPILOT_ROOT}}
If {{KERNELPILOT_ROOT}} was not hydrated, locate a repository containing
knowledge/SKILL.md and knowledge/evidence/pull-bundles/.
Run the Humanize steps inside this skill. Do not ask the user to run separate
gen-plan, refine-plan, or humanize-rlcr commands.
Given a kernel task and target, you must recover or define K, R, and W,
then:
W and benchmark/correctness contract.Ask the user only if the kernel definition K, correctness reference R,
workload distribution W, target GPU, comparison target, or hard scope
constraint is missing and cannot be inferred safely.
After those inputs exist, do not ask the user to approve each implementation
strategy, knowledge query, profiling run, benchmark expansion, or lineage reset.
The loop owns those tactical decisions and records the evidence behind them.
W as a workload distribution, not just a smoke test. If the user gives
explicit benchmark cases, those cases define W. If the user gives a model,
trace, or serving scenario, derive representative cases and record how they
were sampled. If only one focused shape exists, record that W is a single
regime and make the dispatcher/tuning decision trivial.kernel-knowledge and ncu-report/Nsight Compute as autonomous tools
for deciding what to try next. The user should not need to tell the loop when
to research, profile, or switch implementation strategy.ncu-report digest for a representative case
after baseline correctness/benchmark succeeds. Skip it when Nsight Compute is
unavailable or when a cheaper measurement is sufficient for the current round,
and record the reason.ncu-report again whenever profiler evidence is the best available way
to explain a regression, plateau, surprising win, bottleneck shift, or next
implementation edit.Create these before starting RLCR, then keep them updated during the loop:
.gitignore
.humanize/kernel-agent/refined-plan.md
README.md
workloads/
src/<task_name>/
bindings/
tests/
benchmarks/
dispatch/
ledgers/attempt-ledger.md
ledgers/optimization-ledger.md
ledgers/lineage.jsonl
ledgers/research-digest.md
ledgers/tuning-decisions.md
benchmarks/performance-map.json
profile-artifacts/README.md
The plan file may stay gitignored under .humanize/ so RLCR can start without
tracking local loop state.
KernelPilot provides a PR-driven evidence corpus. Use it whenever it helps the current plan, implementation choice, benchmark result, profile digest, plateau/regression explanation, reviewer question, or next kernel edit. PR diffs and materialized source snapshots are the primary evidence, while wiki syntheses, docs, blogs, contest notes, and query indices are supporting knowledge that Humanize may use whenever they clarify hardware behavior, techniques, profile interpretation, or implementation choices.
Useful entry points from {{KERNELPILOT_ROOT}}/knowledge:
cd {{KERNELPILOT_ROOT}}/knowledge
python3 scripts/query.py "tcgen05" --architecture B200 --limit 10
python3 scripts/query.py --repo pytorch/pytorch --compact
python3 scripts/get_page.py pr-pytorch-157241 --follow-sources
Useful wiki/doc/blog entry points:
cd {{KERNELPILOT_ROOT}}/knowledge
# Synthesized wiki pages: hardware, techniques, patterns, languages, kernels.
python3 scripts/query.py "Blackwell memory hierarchy" --type hardware --limit 10
python3 scripts/query.py --type technique --tag pipeline-stages --compact
python3 scripts/query.py "tail effect persistent scheduling" --type pattern --compact
python3 scripts/query.py "PTX cache policy" --type language --compact
# Source docs and blogs. Source pages use source_category values as --type.
python3 scripts/query.py "tcgen05 tmem tuning guide" --type official-doc --limit 10
python3 scripts/query.py "Blackwell microbenchmark tensor memory" --type benchmark-blog --limit 10
python3 scripts/query.py "CuTe DSL TMA swizzle" --type community-note --limit 10
python3 scripts/get_page.py doc-nvidia-tuning-guide --body-only
python3 scripts/get_page.py blog-blackwell-microbenchmarking --body-only
# Regex search: wiki-only for synthesized pages, sources-only for docs/blogs.
python3 scripts/grep_wiki.py "tcgen05\\.fence" --only wiki
python3 scripts/grep_wiki.py "long scoreboard" "prefetch" --only sources --any
Prefer materialized bundles under:
knowledge/evidence/pull-bundles/<repo-id>/gh-<number>/
Each bundle should contain review.diff, ORIGIN.yaml, upstream.json,
and key changed source/test/benchmark files under source-snapshot/.
Typical query flow:
scripts/query.py for broad routing by architecture, repo, tag,
operator, bottleneck, or exact instruction/feature term.scripts/get_page.py --follow-sources to expand a promising wiki or PR
page into its cited evidence.review.diff, ORIGIN.yaml, upstream.json, and
source-snapshot/ files for the PR before borrowing any idea.scripts/grep_wiki.py for exact terms such as instruction mnemonics,
CuTe atoms, profiler counters, dtype names, and memory/cache policy names.A separate reading ledger is unnecessary just to prove that pages were opened. When a source directly affects code, record the actionable provenance in the relevant attempt row, lineage entry, or profile digest.
Write .humanize/kernel-agent/refined-plan.md in the standalone repo. It must
use the Humanize gen-plan schema and include these acceptance criteria:
K, R, and W are explicit: the kernel semantics, correctness oracle,
target workload distribution, target GPU, comparison baseline, and hard scope
exclusions are recorded before the first candidate is selected.W, representative edge cases, dtype/layout/mode
boundaries, and baseline/reference parity.ncu-report digest or explain why the loop is
using cheaper evidence first.ncu-report profiles are captured autonomously when they are the best
tool for regressions, plateaus, surprising wins, bottleneck shifts,
profile-driven edits, or reviewer-requested evidence, then converted into NCU
Report Digests.W, tests selected variants
or configurations across all required regimes, and emits dispatcher/tuning
decisions. If W is a single focused case, record why a trivial dispatcher is
sufficient.ncu-report names a specific low-risk edit with plausible
order-of-magnitude impact.Invoke the ncu-report skill autonomously when profiler evidence is the best
next source of truth. These are heuristics, not user-facing gates:
Persist .ncu-rep, raw CSV, source export, PM-sampling export when available,
and comparison paths in the digest. Each digest must end with one concrete next
kernel edit.
After writing and committing the standalone repo scaffolding, start the loop from inside the standalone repo:
"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/kernel-agent/refined-plan.md --yolo
If setup exits non-zero, stop and report the error. Do not bypass the gate.
The loop uses Humanize's configured review model and default max iteration
limit unless the caller explicitly passes overrides such as --max or a model
flag. The current default max iteration limit is 84 rounds.
After setup succeeds:
.humanize/rlcr/<timestamp>/round-0-prompt.md.If the hook blocks exit, follow the generated next-round prompt exactly.