From ai-infra-auto-driven-skills
Autonomously optimizes SGLang serving performance for a given LLM model by benchmarking against vLLM/TensorRT-LLM, then iteratively profiling bottlenecks, patching SGLang code, and revalidating until SGLang matches or beats the best framework under the same workload.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills
How this skill is triggered (by the user, by Claude, or both):
Slash command
/ai-infra-auto-driven-skills:sglang-sota-humanize-loop
Use this skill when the user names a model and wants the SGLang serving path to autonomously keep improving until it matches or beats the best reproducible vLLM or TensorRT-LLM result in the same target environment.
This workflow has two durable parts:
Do not split the campaign into a pre-loop profiling phase plus a later patch
loop. After the fixed benchmark exists, Phase 2 gap decisions, Phase 3 profiling,
llm-pipeline-analysis, kernel evidence, and code changes all belong inside the
same model-level RLCR loop.
This skill can run from Claude Code, Codex, or another compatible skill runtime. Resolve companion roots in this order:
~/.claude/skills when running in Claude Code.
${CODEX_HOME:-~/.codex}/skills when running in Codex.
Example local paths:
Humanize runtime: ${CODEX_HOME:-~/.codex}/skills/humanize
ncu-report-skill: ${CODEX_HOME:-~/.codex}/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: <repo>/model-pr-optimization-history
For Claude Code installs, the equivalent defaults are typically:
Humanize runtime: ~/.claude/skills/humanize
ncu-report-skill: ~/.claude/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: ~/.claude/skills/model-pr-history-knowledge
If the Humanize runtime is missing, locate a plugin or skill directory
containing scripts/setup-rlcr-loop.sh. If ncu-report-skill is unavailable,
kernel edits may still proceed from torch-profiler/source evidence, but record
the missing NCU evidence path as a blocker when a kernel change would normally
need Nsight Compute diagnostics.
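The lookup described above can be sketched in Python. This is a minimal sketch, not the skill's own resolver: the function names `candidate_skill_roots` and `find_humanize_runtime` are hypothetical, and it assumes the two skill roots listed above are tried in the order shown, with any directory containing scripts/setup-rlcr-loop.sh accepted as the Humanize runtime.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional

# Marker file named in the text above.
SETUP_SCRIPT = Path("scripts") / "setup-rlcr-loop.sh"


def candidate_skill_roots(env: Optional[Mapping[str, str]] = None) -> list[Path]:
    """Return the Claude Code root, then the Codex root (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path(env.get("HOME") or Path.home())
    # Same semantics as ${CODEX_HOME:-~/.codex}: fall back when unset or empty.
    codex_home = Path(env.get("CODEX_HOME") or home / ".codex")
    return [home / ".claude" / "skills", codex_home / "skills"]


def find_humanize_runtime(roots: Iterable[Path]) -> Optional[Path]:
    """Return the first skill directory that contains the setup script."""
    for root in roots:
        if not root.is_dir():
            continue
        # Try the conventional 'humanize' directory first, then every sibling.
        preferred = root / "humanize"
        siblings = sorted(p for p in root.iterdir() if p.is_dir())
        for candidate in [preferred, *siblings]:
            if (candidate / SETUP_SCRIPT).is_file():
                return candidate
    return None
```

A caller would export the result as HUMANIZE_RUNTIME_ROOT, and treat a `None` result as a blocker to report rather than something to work around.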
Read these before a real run:
../llm-serving-auto-benchmark/SKILL.md
../llm-torch-profiler-analysis/SKILL.md
../llm-pipeline-analysis/SKILL.md
../../model-pr-optimization-history/SKILL.md
Read ncu-report-skill/SKILL.md only when the active RLCR round is writing or evaluating a CUDA, Triton, CuTe, CUTLASS, TileLang, or torch.compile kernel path and Nsight Compute evidence is needed.
Given a model-level SGLang SOTA request, do not ask the user to run separate benchmark, profiler, gen-plan, refine-plan, or Humanize setup commands. Do the setup yourself.
Ask the user only if the model, target GPU environment, or precision/quantization policy is missing and cannot be inferred from local configs or the active host skill.
Keep only the fixed benchmark phase outside the RLCR patch loop. Once the fixed cross-framework benchmark and model PR history notes exist, start Humanize. The RLCR loop itself must decide whether a gap still exists, collect current profiler evidence, run layer pipeline analysis, patch SGLang, and revalidate.
Treat the model optimization campaign as the durable unit, not one terminal session. The campaign is recoverable from the run artifact root, checkpoint files, benchmark/profile artifacts, NCU digests, and ledgers.
Collect or infer:
Create one run directory:
runs/YYYYMMDD_<model_slug>_sota_humanize/
  manifest.md
  help/
  benchmark/
  profiles/
  analysis/
    root-cause.md
    layer-pipeline.md
  history/
    model-pr-history-notes.md
  kernel/
    ncu-digests/
  patches/
  humanize/
    model-loop-checkpoint.md
  final_report.md
Never save Hugging Face tokens or other secrets in artifacts.
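The run directory above could be created with a short script. This is a sketch under assumptions: the slug rule (lowercase, non-alphanumeric runs collapsed to "-") is not specified by the skill, `model_slug` and `create_run_dir` are hypothetical names, and the nesting follows the paths used elsewhere in this document (analysis/root-cause.md, history/model-pr-history-notes.md, kernel/ncu-digests/, humanize/model-loop-checkpoint.md).

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

SUBDIRS = [
    "help", "benchmark", "profiles", "analysis", "history",
    "kernel/ncu-digests", "patches", "humanize",
]
FILES = [
    "manifest.md",
    "analysis/root-cause.md",
    "analysis/layer-pipeline.md",
    "history/model-pr-history-notes.md",
    "humanize/model-loop-checkpoint.md",
    "final_report.md",
]


def model_slug(model_id: str) -> str:
    """Assumed slug rule: lowercase, collapse non-alphanumeric runs to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def create_run_dir(base: Path, model_id: str, today: Optional[date] = None) -> Path:
    """Create runs/YYYYMMDD_<model_slug>_sota_humanize/ with empty placeholders."""
    today = today or date.today()
    run = base / "runs" / f"{today:%Y%m%d}_{model_slug(model_id)}_sota_humanize"
    for sub in SUBDIRS:
        (run / sub).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() leaves existing files untouched, so reruns are safe.
        (run / name).touch(exist_ok=True)
    return run
```

Creating the placeholders up front keeps the artifact root recoverable even when a campaign is interrupted before the first report is written.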
Before the fixed benchmark and before any patch planning, query and read
model-pr-optimization-history for the target model family.
Rules:
Run scripts/query.py "<model id or family>" from the knowledge root and choose the closest model-family history.
Write history/model-pr-history-notes.md with the paths read, PR numbers, source files, symbols, validation risks, and the concrete decision each item influences.
If the knowledge root is unavailable, record the blocker in the same notes file and continue with benchmark/profile evidence.
This phase is mandatory and happens exactly once before Humanize starts.
Use llm-serving-auto-benchmark as the source of truth for candidate generation,
result schema, workload, and comparison.
Hard requirements:
Use the default workload from llm-serving-auto-benchmark unless the user explicitly provides a production workload:
random, num_prompts: 80
chat: random input 1000, output 1000
summarization: random input 8000, output 1000
trtllm-serve serve --backend pytorch; reject non-PyTorch TensorRT-LLM server backends for this skill.
Write:
benchmark/candidates.jsonl
benchmark/summary.md
benchmark/winning-commands.md
help/
Do not choose a code patch outside RLCR. The fixed winner table is the baseline input to the model loop.
Create a Humanize plan inside the SGLang checkout that will be patched:
.humanize/sglang-sota-agent/refined-plan.md
Use references/refined-plan-template.md as the skeleton and fill it with the actual model, workload, benchmark winners, artifact root, model PR history notes, and target SGLang checkout.
The plan must require every RLCR round to:
read history/model-pr-history-notes.md before choosing model-specific SGLang source paths
run llm-torch-profiler-analysis inside the loop when SGLang is behind or when the previous patch changed the profiled path
run llm-pipeline-analysis inside the loop after profiler triage and before choosing a source path, representative layer, or kernel target
use ncu-report-skill inside the same loop when a kernel edit needs Nsight Compute evidence
Before starting Humanize from the SGLang checkout:
Confirm .humanize* is gitignored so RLCR state, round summaries, and local checkpoints cannot be staged accidentally.
Pass --base-branch <branch> if Humanize's auto-detection would be ambiguous.
Check whether any .humanize/rlcr/*/state.md is active in the SGLang checkout. Resume, finish, or cancel the old model loop first.
From the SGLang checkout, run:
"$HUMANIZE_RUNTIME_ROOT/scripts/setup-rlcr-loop.sh" \
.humanize/sglang-sota-agent/refined-plan.md --yolo --strict-success
If HUMANIZE_RUNTIME_ROOT is not already set by the client/plugin environment,
resolve it to the installed Humanize runtime first. In Codex, this is often
${CODEX_HOME:-~/.codex}/skills/humanize; in Claude Code it is often
~/.claude/skills/humanize or a plugin-provided Humanize runtime. If setup
exits non-zero, stop and report the error. Do not bypass the gate.
After setup succeeds:
Run find .humanize/rlcr -maxdepth 2 -name state.md -print to locate the active state file.
Confirm it contains strict_success: true.
Read .humanize/rlcr/<timestamp>/round-0-prompt.md.
If no active state file exists, or if strict_success: true is missing, stop and report that RLCR did not start correctly. Do not continue into SGLang patch work outside the Humanize loop. If the hook blocks exit, follow the generated next-round prompt exactly.
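The post-setup checks above can be scripted. The sketch below is illustrative only: `find_active_state` and `rlcr_started_correctly` are hypothetical names. It assumes that state.md contains a line reading `strict_success: true`, and that timestamped run directories sort lexicographically with the newest last. Treating a missing round-0-prompt.md as a failure is also an assumption. The glob approximates the `find ... -maxdepth 2` command by looking exactly one directory below .humanize/rlcr.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed format of the flag inside state.md.
STRICT_RE = re.compile(r"^\s*strict_success:\s*true\s*$", re.MULTILINE)


def find_active_state(checkout: Path) -> Optional[Path]:
    """Return the newest .humanize/rlcr/<timestamp>/state.md, if any."""
    rlcr = checkout / ".humanize" / "rlcr"
    if not rlcr.is_dir():
        return None
    states = sorted(rlcr.glob("*/state.md"))
    return states[-1] if states else None


def rlcr_started_correctly(checkout: Path) -> tuple[bool, str]:
    """Return (ok, detail); detail is the round-0 prompt path or the reason to stop."""
    state = find_active_state(checkout)
    if state is None:
        return False, "no active state.md under .humanize/rlcr"
    if not STRICT_RE.search(state.read_text(encoding="utf-8")):
        return False, f"{state} does not contain strict_success: true"
    prompt = state.parent / "round-0-prompt.md"
    if not prompt.is_file():
        # Assumption: a missing prompt also means the loop did not start cleanly.
        return False, f"missing {prompt}"
    return True, str(prompt)
```

On a `False` result the caller should stop and report the detail string, not continue into patch work.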
At the start of every round, compute SGLang's current gap against the best SLA-passing framework for each fixed scenario.
Use 1% as the default stable noise threshold. If the current result is within
+/-1%, rerun the winning commands enough times to decide whether the gap is
stable before choosing a patch.
Patch only when SGLang is slower than the best framework by more than 1%,
fails SLA while another framework passes, or has a profiled bottleneck that
explains the remaining gap under the fixed workload.
If SGLang is already best or tied within the stable threshold, write the final report and stop under the normal Humanize review path.
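The gap rules above reduce to a small decision function. This is a sketch under stated assumptions: the metric is a throughput where higher is better, `Result` and `gap_decision` are hypothetical names, and the gap is measured relative to the best SLA-passing competitor. The third patch trigger (a profiled bottleneck that explains the remaining gap) needs profiler evidence and is not modeled here.

```python
from dataclasses import dataclass

NOISE_THRESHOLD = 0.01  # the 1% default stable noise threshold


@dataclass(frozen=True)
class Result:
    framework: str
    throughput: float  # assumption: higher is better
    sla_pass: bool


def gap_decision(sglang: Result, competitors: list[Result],
                 repeated: bool = False) -> tuple[str, float]:
    """Return (action, gap); a positive gap means SGLang is behind."""
    passing = [c for c in competitors if c.sla_pass]
    if not passing:
        # The rules above do not cover this case, so the sketch refuses to guess.
        raise ValueError("no SLA-passing competitor to compare against")
    best = max(passing, key=lambda c: c.throughput)
    gap = (best.throughput - sglang.throughput) / best.throughput
    if not sglang.sla_pass:
        return "patch", gap   # fails SLA while another framework passes
    if abs(gap) <= NOISE_THRESHOLD and not repeated:
        return "rerun", gap   # within +/-1%: repeat before deciding
    if gap > NOISE_THRESHOLD:
        return "patch", gap   # slower than the best framework by more than 1%
    return "stop", gap        # best, or tied within the stable threshold
```

Pass `repeated=True` once the winning commands have been rerun enough times to call the result stable. A within-threshold gap then resolves to "stop" instead of another "rerun".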
When SGLang is behind, profile the current best SGLang command and the leading
competitor command with llm-torch-profiler-analysis.
Rules:
1% ahead of SGLang in a stable result, profile both.
1 output token
1 input token -> slow output length
For every profiled framework, save the same three tables:
Then write or update analysis/root-cause.md with the current cross-framework
comparison: which stage is slower, which table rows explain it, and which SGLang
source paths or kernel families are plausible patch targets.
Do not patch SGLang until this report exists for the current gap.
Run llm-pipeline-analysis inside every RLCR round after profiler triage and
before choosing a patch target.
The report must identify:
compress_ratios
Use the profiled SGLang trace and the served model config. Write
analysis/layer-pipeline.md with the chosen forward pass, layer-type timing
table, representative layers, top hot kernels, and any Perfetto ranges used for
inspection. Do not choose a SGLang patch before this report exists for the
current round.
Use ncu-report-skill only when the active RLCR round is writing a concrete
kernel or small kernel-family patch and torch-profiler evidence is not enough to
choose or validate the next edit.
Eligibility gate:
SGLang is more than 1% behind the best framework for the fixed benchmark scenario after the required repeat/profiler checks.
The target kernel or kernel family has at least 1% cumulative GPU-time share. Do not spend kernel-specialist effort on a lone kernel below 1% share unless a shared implementation affects an aggregated family above 1%.
llm-pipeline-analysis has identified the representative layer/forward pass and top hot kernels for the current round.
For each eligible kernel target:
Read ncu-report-skill/SKILL.md and follow its Nsight Compute workflow for harness construction, ncu collection, report parsing, stall diagnosis, and evidence-backed next-edit selection.
Save digests under kernel/ncu-digests/<version>/ or the host's equivalent artifact root. Each digest must compare baseline vs candidate and end with exactly one concrete next edit.
If no focused harness exists, build the smallest harness that preserves the model-derived shapes/dtypes/layouts. If NCU cannot run on the host, record the blocker in the digest path and keep the next edit grounded in the available torch-profiler, layer-pipeline, and source evidence.
Do not start any standalone .humanize/rlcr session for a kernel target. Kernel
work stays inside the active model RLCR loop.
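The GPU-time share part of the eligibility gate can be checked by aggregating kernel times per family before applying the 1% cutoff. The sketch below is an assumption-laden illustration: `eligible_families` is a hypothetical helper, the kernel-to-family mapping has to come from the profiler tables, and treating exactly 1% as eligible is a choice the text does not settle.

```python
from collections import defaultdict

MIN_SHARE = 0.01  # 1% cumulative GPU-time share


def eligible_families(kernel_times_us: dict[str, float],
                      family_of: dict[str, str]) -> dict[str, float]:
    """Map each kernel family to its GPU-time share, keeping shares >= MIN_SHARE.

    Kernels missing from family_of are treated as their own one-kernel family.
    """
    total = sum(kernel_times_us.values())
    if total <= 0:
        return {}
    by_family: dict[str, float] = defaultdict(float)
    for kernel, time_us in kernel_times_us.items():
        by_family[family_of.get(kernel, kernel)] += time_us
    return {
        family: time_us / total
        for family, time_us in by_family.items()
        if time_us / total >= MIN_SHARE
    }
```

With this grouping, two kernels at 0.6% each pass the gate together when they share an implementation, while a lone 0.8% kernel does not.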
After every accepted round, update humanize/model-loop-checkpoint.md with:
This checkpoint is for campaign recovery inside the same model-level workflow. It records enough context to resume the campaign without losing benchmark/profile lineage.
Keep these files under the run artifact root or the SGLang checkout, depending on the host convention:
humanize/attempt-ledger.md
humanize/optimization-ledger.md
humanize/source-idea-ledger.md
humanize/lineage.jsonl
humanize/profile-digests/
Every patch attempt gets an attempt row. Only correct patches with measured improvement get optimization rows. Source ideas must include profiler rows, layer-pipeline evidence, NCU report paths when used, and code provenance so later rounds can avoid re-reading the same source. Model PR history evidence should be recorded beside SGLang, vLLM, TensorRT-LLM, and NCU source ideas when it influenced the patch.
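One way to keep humanize/lineage.jsonl append-only is a pair of small helpers. This is a sketch, not the skill's schema: the record fields (`correct`, `improvement_pct`) and the function names are hypothetical, chosen only to show the rule that every attempt is recorded while only correct, measured improvements count as optimizations.

```python
import json
from pathlib import Path


def append_lineage(path: Path, record: dict) -> None:
    """Append one JSON object per line; never rewrite earlier rows."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_lineage(path: Path) -> list[dict]:
    """Load all rows, skipping blank lines."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def optimization_rows(records: list[dict]) -> list[dict]:
    """Keep only correct patches with a measured improvement (hypothetical fields)."""
    return [
        r for r in records
        if r.get("correct") and r.get("improvement_pct", 0.0) > 0.0
    ]
```

Failed or regressing attempts stay in the lineage file, which is what lets later rounds avoid retrying the same idea.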
After two consecutive rounds with less than 1% geomean improvement over the
prior best SGLang result, expand code-first research before editing again.
Prefer code and PR evidence from SGLang, vLLM, TensorRT-LLM, and relevant kernel
source guides before prose-only articles.
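The plateau rule can be expressed as a geomean over per-scenario speedups. The sketch assumes a throughput-style metric (higher is better) and interprets "geomean improvement" as the geometric mean of current/prior-best ratios minus one. The function names are hypothetical.

```python
import math


def geomean(values: list[float]) -> float:
    """Geometric mean of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def round_improvement(prior_best: dict[str, float],
                      current: dict[str, float]) -> float:
    """Geomean speedup over the prior best result, minus 1 (0.02 means +2%)."""
    ratios = [current[name] / prior_best[name] for name in prior_best]
    return geomean(ratios) - 1.0


def should_expand_research(improvements: list[float],
                           threshold: float = 0.01, window: int = 2) -> bool:
    """True when the last `window` rounds each improved by less than `threshold`."""
    recent = improvements[-window:]
    return len(recent) == window and all(i < threshold for i in recent)
```

For example, a 4% gain in one of two scenarios with the other unchanged gives a geomean improvement of about 2%, which resets the plateau counter.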
Stop only when one of these is true:
SGLang is best or tied within the 1% threshold after repeat runs.
The final report must include the fixed benchmark table, post-patch benchmark table, all winner commands, model PR history paths, profile paths, layer-pipeline paths when used, NCU digest paths when used, SGLang changed files, tests, and whether SGLang reached target-environment SOTA.