Skill

llm-serving-capacity-planner

Parses SGLang/vLLM startup logs to decompose GPU HBM usage (weights, KV cache, CUDA graphs, framework overhead) and estimates concurrent request capacity for given token lengths.

Python

performance


Invocation

Slash command: /ai-infra-auto-driven-skills:llm-serving-capacity-planner

User invocable, model invocable


Supporting Files

references/gpu-specs.json
references/log-patterns.md
scripts/capacity_analyzer.py

SKILL.md

134 lines · ~1.7k tokens

Stats

Language: Python
Stars: 682
Forks: 58
Maintenance: Excellent
Last Commit: Jul 14, 2026


LLM Serving Capacity Planner

Overview

Use this when a serving log has enough memory lines to explain where GPU HBM went. The analyzer reads SGLang startup logs (vLLM patterns are only partially supported; see Known Limitations), extracts the weight-load, KV-pool, CUDA-graph, framework-overhead, and token-capacity lines, and then estimates concurrent requests for common token lengths.

Confirmation Required

Before running analysis, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips |
| --- | --- | --- | --- |
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | None (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with `nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt` | None (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | None (optional, log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:

Load weight begin. avail mem=XX GB
Memory profiling: available_gpu_memory=XX GB, ... (newer sglang)
SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)
Memory pool end. avail mem=XX GB
Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.
max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB
server_args=ServerArgs(...) (for serving parameters)

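These lines are printed only once at startup, so redirect stdout/stderr to a file when launching the server. For an instance that is already running in a container, `docker logs <container>` may still hold them.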
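To make the list of key log lines concrete, here is a minimal sketch of how such lines could be pulled out with regular expressions. It is not the analyzer's actual implementation: the pattern strings are assumptions derived from the lines listed above, the field names are hypothetical, and the sample log values are invented for illustration.

```python
import re

# Patterns mirror the log lines listed above. Exact wording can differ between
# SGLang versions, so treat every pattern here as an assumption.
PATTERNS = {
    "avail_before_weights_gb": re.compile(r"Load weight begin\.\s*avail mem=([\d.]+)\s*GB"),
    "avail_after_pool_gb": re.compile(r"Memory pool end\.\s*avail mem=([\d.]+)\s*GB"),
    "cuda_graph_usage_gb": re.compile(r"Capture cuda graph end\..*?mem usage=([\d.]+)\s*GB"),
    "max_total_num_tokens": re.compile(r"max_total_num_tokens=(\d+)"),
    "max_running_requests": re.compile(r"max_running_requests=(\d+)"),
}


def extract_memory_fields(log_text):
    """Return the first match of each pattern; fields not found are omitted."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            raw = match.group(1)
            # GB values are floats, token and request counts are integers.
            found[name] = float(raw) if name.endswith("_gb") else int(raw)
    return found


# Invented sample lines, shaped like the patterns above.
SAMPLE_LOG = """\
[rank0] Load weight begin. avail mem=139.20 GB
[rank0] Memory pool end. avail mem=18.40 GB
[rank0] Capture cuda graph end. Time elapsed: 12.3 s. mem usage=1.50 GB. avail mem=16.90 GB.
[rank0] max_total_num_tokens=524288, chunked_prefill_size=8192, max_running_requests=256, available_gpu_mem=16.90 GB
"""

fields = extract_memory_fields(SAMPLE_LOG)
```

A missing key in `fields` is a quick signal that the log was truncated or that the serving version prints a different line format.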

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:

docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
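The file written by that command has one row per GPU in the form `index, used MiB, free MiB`. As a rough sketch of how it could be read for the per-rank comparison, the helper names below are hypothetical and the sample values are invented; this is not the analyzer's own parser.

```python
import csv
import io


def parse_smi_csv(text):
    """Parse `index, memory.used, memory.free` rows into dicts of MiB integers."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, used, free = (field.strip() for field in record)
        rows.append({
            "gpu": int(index),
            # Values look like "131072 MiB"; keep only the number.
            "used_mib": int(used.split()[0]),
            "free_mib": int(free.split()[0]),
        })
    return rows


def used_spread_mib(rows):
    """Gap between the most and least loaded GPU, a quick imbalance check."""
    used = [row["used_mib"] for row in rows]
    return max(used) - min(used)


# Invented sample in the csv,noheader layout.
SAMPLE_SMI = "0, 131072 MiB, 12288 MiB\n1, 130560 MiB, 12800 MiB\n\n"

smi_rows = parse_smi_csv(SAMPLE_SMI)
spread = used_spread_mib(smi_rows)
```

A large spread across TP ranks usually deserves a closer look before trusting a single-rank breakdown.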

Step 3: Run the analyzer

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json

For JSON output (automation):

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json
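For automation, the two invocations above can be wrapped in a small helper. The sketch below only uses the flags shown in this step; the function names are hypothetical, and it assumes that `--format json` writes a single JSON document to stdout. The output schema is not described here, so no keys are assumed.

```python
import json
import subprocess

SCRIPT = "skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py"


def build_command(log_file, smi_file=None, gpu=None, config_json=None, as_json=False):
    """Assemble the analyzer argv from the flags shown above."""
    cmd = ["python3", SCRIPT, "--log-file", log_file]
    optional = [
        ("--nvidia-smi-file", smi_file),
        ("--gpu", gpu),
        ("--config-json", config_json),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    if as_json:
        cmd += ["--format", "json"]
    return cmd


def run_analyzer_json(log_file, **options):
    """Run the analyzer and decode its output.

    Assumption: with --format json the script prints one JSON document to
    stdout. check=True raises CalledProcessError on a non-zero exit code.
    """
    cmd = build_command(log_file, as_json=True, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with log paths that contain spaces.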

Step 4: Review and interpret results

The analyzer prints:

Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
Per-rank comparison: nvidia-smi data across all TP ranks
KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
Concurrency estimate: max concurrent requests for different token lengths
Tuning notes: configuration changes that may increase capacity
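The concurrency estimate can be reasoned about with simple arithmetic. The sketch below assumes that every request occupies its full token length in the KV pool, that the token limit is `max_total_num_tokens` divided by that length, and that the final figure is capped by `max_running_requests`. The function name and the sample numbers are illustrative, and the analyzer's own calculation may differ.

```python
def estimate_concurrency(max_total_num_tokens, max_running_requests,
                         token_lengths=(4096, 6144, 8192)):
    """Return one row per request token length.

    Assumption: each request holds `token_length` tokens of KV cache for its
    whole lifetime, so the pool fits max_total_num_tokens // token_length
    requests, further capped by the scheduler's max_running_requests.
    """
    rows = []
    for length in token_lengths:
        token_limit = max_total_num_tokens // length
        rows.append({
            "token_length": length,
            "token_limit": token_limit,
            "request_limit": max_running_requests,
            "max_concurrent": min(token_limit, max_running_requests),
        })
    return rows


# Invented inputs: a 524288-token pool and a scheduler cap of 96 requests.
estimate = estimate_concurrency(524288, 96)
```

With these inputs the 4096-token case is bound by the request limit, while the longer lengths are bound by the token pool, which shows why both limits belong in the report.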

When To Use It

After launching an LLM serving instance, to understand how GPU memory is distributed
When comparing different --mem-fraction-static values and their impact on KV pool capacity
When planning deployment capacity: how many concurrent requests can a given GPU configuration support
When investigating OOM issues: identifying which memory category is consuming the most
When evaluating whether fp8 KV cache or EP can improve concurrency

Key Concepts

mem-fraction-static

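Sets the fraction of GPU memory that SGLang may use for static allocations, meaning model weights plus the KV cache pool. The KV pool is therefore what remains of that budget once the weights are loaded. Higher values give more KV capacity but leave less headroom for CUDA graphs, activations, and other runtime buffers.

0.88 (a typical single-GPU default; SGLang picks the default from the parallelism and GPU configuration when the flag is unset): aggressive, leaving about 12% of GPU memory for runtime buffers
0.60: conservative, with more free memory left for runtime but significantly less KV capacity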
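The exact accounting depends on the framework version. As a hedged sketch of the trade-off, the following assumes SGLang-style accounting in which a share `1 - mem_fraction_static` of GPU memory is held back for runtime use and the KV pool receives whatever post-weight free memory is left after that reservation. The function name and all numbers are invented for illustration.

```python
def kv_pool_budget_gb(total_gb, avail_after_weights_gb, mem_fraction_static):
    """Estimate the KV pool size in GB.

    Assumption: runtime headroom is total_gb * (1 - fraction), and the KV pool
    is the post-weight free memory minus that headroom (never below zero).
    """
    reserved_for_runtime = total_gb * (1.0 - mem_fraction_static)
    return max(0.0, avail_after_weights_gb - reserved_for_runtime)


# Hypothetical GPU: 140 GB usable, 100 GB still free after loading weights.
aggressive = kv_pool_budget_gb(140.0, 100.0, 0.88)
conservative = kv_pool_budget_gb(140.0, 100.0, 0.60)
```

Under these made-up numbers the pool shrinks from roughly 83 GB to 44 GB when the fraction drops from 0.88 to 0.60, which is the capacity cost of the extra runtime headroom.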

KV Head Replication

When num_key_value_heads < tp_size, the KV heads cannot be split across all TP ranks, so each head is replicated on tp_size / num_key_value_heads ranks. For example, with kv_heads=1 and tp=8, each of the 8 GPUs stores a full copy of the KV cache, so per-GPU KV memory is 8x what it would be if the cache were split across the ranks.
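The effect on per-token KV bytes can be worked through for a standard multi-head or grouped-query attention layout. This sketch assumes one K and one V vector per layer per KV head, with no MLA or sliding-window compression; the function name is hypothetical, and the inputs correspond to values commonly found in a model's config.json.

```python
def kv_bytes_per_token_per_rank(num_layers, num_kv_heads, head_dim, tp_size,
                                dtype_bytes=2):
    """Return (bytes per token on one rank, replication factor).

    Assumption: plain MHA/GQA layout. Each rank holds at least one KV head,
    so when num_kv_heads < tp_size the heads are replicated, not split.
    """
    heads_per_rank = max(1, num_kv_heads // tp_size)
    replication = max(1, tp_size // num_kv_heads)
    # Factor 2 covers the K and V tensors.
    bytes_per_token = 2 * num_layers * heads_per_rank * head_dim * dtype_bytes
    return bytes_per_token, replication


# Illustrative shapes: 32 layers, head_dim 128, 16-bit KV cache.
split_case = kv_bytes_per_token_per_rank(32, 8, 128, tp_size=2)
even_case = kv_bytes_per_token_per_rank(32, 8, 128, tp_size=8)
replicated_case = kv_bytes_per_token_per_rank(32, 1, 128, tp_size=8)
fp8_case = kv_bytes_per_token_per_rank(32, 1, 128, tp_size=8, dtype_bytes=1)
```

In the replicated case each rank stores the same bytes per token as in the even case although the model has only one KV head, and an fp8 KV cache halves the per-token cost in either layout.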

SWA (Sliding Window Attention) Compression

Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA (Hierarchical Context Attention) with sliding windows. This reduces per-token KV cache bytes compared to the theoretical full-attention calculation. The bytes_per_full_token reported in the log already accounts for this compression.

Reporting Checklist

Include:

Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
Memory breakdown table: category / GiB / MiB / percentage / derivation source
Per-rank nvidia-smi comparison: used and free memory per TP rank
KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
Tuning notes based on free memory and configuration

Known Limitations

| Limitation | Detail | Workaround |
| --- | --- | --- |
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention; the framework's internal SWA window parameters are needed | Use `bytes_per_full_token` from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify `--target-rank` for each PP stage |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |

References

references/log-patterns.md: log line patterns and their semantics for memory analysis.
references/gpu-specs.json: GPU HBM specifications for h20, h100, h200, and b200 aliases.
scripts/capacity_analyzer.py: the core analysis script.
