Help us improve
Share bugs, ideas, or general feedback.
From ai-infra-auto-driven-skills
Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skillsHow this skill is triggered — by the user, by Claude, or both
Slash command
/ai-infra-auto-driven-skills:llm-serving-capacity-plannerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this when a serving log has enough memory lines to explain where GPU HBM
Analyzes torch.profiler traces for sglang, vLLM, and TensorRT-LLM. Inspects existing traces or drives live profiling, returning kernel, overlap, and fuse-pattern tables.
Provides patterns for LLM inference infrastructure with serving frameworks like vLLM, TGI, TensorRT-LLM; quantization, batching strategies, KV cache, and streaming responses. Use for optimizing latency and scaling deployments.
Estimates GPU VRAM requirements for training configurations (SFT, GRPO, LoRA, QLoRA) and checks fit against available GPUs. Runs detection then estimation scripts.
Share bugs, ideas, or general feedback.
Use this when a serving log has enough memory lines to explain where GPU HBM went. The analyzer reads SGLang/vLLM startup logs, extracts weight load, KV pool, CUDA graph, framework overhead, and token-capacity lines, then estimates concurrent requests for common token lengths.
Before running analysis, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | — (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt | — (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | — (optional, log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |
The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:
Load weight begin. avail mem=XX GBMemory profiling: available_gpu_memory=XX GB, ... (newer sglang)SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)Memory pool end. avail mem=XX GBCapture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GBserver_args=ServerArgs(...) (for serving parameters)If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.
For per-rank memory comparison:
docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
--log-file /path/to/sglang.log \
--nvidia-smi-file /path/to/smi.txt \
--gpu h200 \
--config-json /path/to/config.json
For JSON output (automation):
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
--log-file /path/to/sglang.log \
--format json
The analyzer prints:
--mem-fraction-static values and their impact on KV pool capacityControls what fraction of available GPU memory after weight loading is reserved for the KV cache pool. Higher values give more KV capacity but less headroom for CUDA graph and other runtime buffers.
0.88 (default): aggressive — 88% of post-weight memory goes to KV pool0.60: conservative — more free memory left for runtime, but significantly less KV capacityWhen num_key_value_heads < tp_size, KV cache is replicated across all TP ranks rather than split. For example, models with kv_heads=1, tp=8 means each of the 8 cards stores a full copy of the KV cache — 8x the per-card KV memory compared to a split scenario.
Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA
(Hierarchical Context Attention) with sliding windows. This reduces per-token
KV cache bytes compared to the theoretical full-attention calculation. The
bytes_per_full_token reported in the log already accounts for this
compression.
Include:
| Limitation | Detail | Workaround |
|---|---|---|
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention — the framework's internal SWA window parameters are needed | Use bytes_per_full_token from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify --target-rank for each PP stage |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |
references/log-patterns.md: log line patterns and their semantics for memory analysis.references/gpu-specs.json: GPU HBM specifications for h20, h100, h200, and b200 aliases.scripts/capacity_analyzer.py: the core analysis script.