By BBuf
Automates analyzing, profiling, and optimizing LLM serving infrastructure, enabling benchmark comparisons, capacity planning, performance tuning, and code review for SGLang/vLLM/TensorRT-LLM deployments.
`npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills`
Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.
Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.
Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.
Return public original model architecture diagrams for user-specified LLM, VLM, MoE, diffusion, OCR, and SGLang/sgl-cookbook model families. Use when the user asks for a model structure chart, architecture diagram, or rendered image link for a specific model such as DeepSeek, GLM, Qwen, Kimi, MiniMax, Step, Hunyuan, or Qwen3-VL.
Agent-ready playbooks for LLM serving benchmarks, capacity planning, torch-profiler triage, pipeline analysis, compute simulation, SGLang/vLLM optimization, human code review, production incidents, and model PR intelligence.
This repository is built for AI infrastructure engineers who want agents to do real work, not recite generic prompts.
It gives an agent the operational memory needed to benchmark SGLang, vLLM, and TensorRT-LLM fairly; explain serving capacity from startup logs; split prefill and decode profiler evidence; inspect traces at layer and kernel level; estimate operator FLOPs and MFU; review SGLang patches against real maintainer discussion patterns; run Humanize-governed SGLang and vLLM SOTA loops; triage SGLang production incidents from a replay; and keep model-family optimization history close to the code that actually changed.
For standalone kernel campaigns and kernel evidence tools, see the sibling project KernelPilot.
If this saves you one stale model-support assumption, one misleading profiler trace, or one late-night benchmark loop, a star helps more AI-infra engineers find it.
| Skill | Use it when |
|---|---|
| `llm-serving-auto-benchmark` | You need a fair, bounded serving benchmark search for SGLang, vLLM, TensorRT-LLM, or another OpenAI-compatible stack. |
| `llm-serving-capacity-planner` | You need to explain SGLang or vLLM startup memory, KV cache budget, request capacity, or OOM pressure from logs. |
| `llm-torch-profiler-analysis` | You need a three-table profiler report that keeps extend/prefill and decode evidence separate. |
| `llm-pipeline-analysis` | You need forward-pass, layer, and kernel-level timing from a torch profiler trace, including anchor boundaries and Perfetto ranges. |
| `model-compute-simulation` | You need operator shapes, FLOPs, MFU estimates, kernel-to-op mapping, or parallelism what-if analysis for an LLM serving shape. |
| `sglang-humanize-review` | You need SGLang code-review findings grounded in 2024-2025 human review threads, including inline code context, comments, and discussions. |
| `sglang-sota-humanize-loop` | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, SGLang patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| `vllm-sota-humanize-loop` | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, vLLM patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| `sglang-prod-incident-triage` | You need to turn queue growth, timeouts, wrong outputs, crashes, or distributed stalls into a replay and next debug step. |
| `model-architecture-diagram` | You need original public architecture diagrams for popular LLM, VLM, MoE, OCR, and diffusion model families. |
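
To make the capacity-planning row concrete, here is a minimal sketch of the kind of arithmetic behind a KV cache budget and a max-concurrency estimate. It is not the skill's implementation. The function names, the example model shape, and the 40 GiB budget are all hypothetical, and it assumes a standard attention layout in which every layer stores one K and one V vector per KV head per token.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (assumed layout: K and V per layer)."""
    # Factor 2 covers the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            num_layers: int, num_kv_heads: int, head_dim: int,
                            dtype_bytes: int = 2) -> int:
    """Rough upper bound on requests that fit in the KV cache at once."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    token_capacity = kv_budget_bytes // per_token
    return token_capacity // tokens_per_request


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 16-bit KV cache.
budget = 40 * 1024**3  # assumed 40 GiB left for KV cache after weights
print(kv_bytes_per_token(32, 8, 128))                        # 131072
print(max_concurrent_requests(budget, 4096, 32, 8, 128))     # 80
```

In practice the KV budget comes from the server's startup log rather than a guess, and real servers reserve extra memory for activations and CUDA graphs, so treat the result as an upper bound.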
The model optimization layer is now one knowledge base:
`model-pr-optimization-history`. It contains
58 PR-driven history dossiers and a small query helper. These are not
per-model runbook skills; they preserve diff-backed model evolution records for
SGLang and vLLM so SOTA loops can read prior source and PR evidence before
patching.