From vllm-skills
Benchmarks vLLM or OpenAI-compatible LLM serving endpoints using vllm bench serve CLI. Measures throughput, latency (TTFT/TPOT), goodput with datasets like ShareGPT/HF and request-rate control. Use for inference performance testing.
Install: npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills
Slash command: /vllm-skills:vllm-bench-serve
Benchmark vLLM or any OpenAI-compatible serving endpoint using the `vllm bench serve` CLI. Measures throughput, latency (TTFT, TPOT), and goodput against configurable request load.
Reference: vLLM Bench Serve Documentation
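The latency metrics named above can be made concrete with a small sketch. It assumes the common definitions: TTFT is the delay until the first token, TPOT is the average time per token after the first, ITL is the gap between consecutive tokens, and E2EL is the end-to-end latency. The function name `latency_metrics` and its inputs are hypothetical and are not part of the `vllm bench serve` CLI.

```python
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, object]:
    """Derive per-request latency metrics from token arrival timestamps (seconds).

    Hypothetical helper for illustration; assumes one timestamp per generated token.
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_start
    # End-to-end latency: delay until the last token arrives.
    e2el = token_times[-1] - request_start
    # Inter-token latency: gaps between consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Time per output token: decode time averaged over tokens after the first.
    decode_tokens = len(token_times) - 1
    tpot = (e2el - ttft) / decode_tokens if decode_tokens > 0 else 0.0

    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}
```

For a request sent at t=0 whose tokens arrive at 0.5, 0.75, 1.0 and 1.25 seconds, this gives a TTFT of 0.5 s and a TPOT of 0.25 s.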
Basic benchmark against local vLLM server (default random dataset, 1000 prompts):
vllm bench serve \
--backend openai-chat \
--host 127.0.0.1 \
--port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions
Save results to JSON:
vllm bench serve \
--backend openai-chat \
--host 127.0.0.1 \
--port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--save-result \
--result-dir ./bench-results \
--metadata "version=0.6.0" "tp=1"
Note: When using `--backend openai-chat`, you must specify `--endpoint /v1/chat/completions` (the default is `/v1/completions`).
| Argument | Default | Description |
|---|---|---|
| `--backend` | `openai` | Backend type: `openai`, `openai-chat`, `openai-embeddings`, `vllm`, `vllm-pooling`, `vllm-rerank`, etc. |
| `--host` | `127.0.0.1` | Server host |
| `--port` | `8000` | Server port |
| `--base-url` | - | Alternative: full base URL instead of host:port |
| `--endpoint` | `/v1/completions` | API endpoint; use `/v1/chat/completions` for `openai-chat` |
| `--model` | (from `/v1/models`) | Model name |
| `--num-prompts` | `1000` | Number of prompts to process |
| `--request-rate` | `inf` | Requests per second; `inf` = burst all at once |
| `--max-concurrency` | - | Max concurrent requests (caps parallelism) |
| `--num-warmups` | `0` | Warmup requests before measuring |
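How the connection arguments combine into a request URL can be sketched as follows. This is a minimal illustration that assumes the client concatenates the base (either `--base-url` or `http://host:port`) with `--endpoint`; the helper `target_url` is hypothetical and only mirrors the defaults listed in the table.

```python
from typing import Optional


def target_url(host: str = "127.0.0.1",
               port: int = 8000,
               endpoint: str = "/v1/completions",
               base_url: Optional[str] = None) -> str:
    """Compose the URL a benchmark request would be sent to (illustrative only)."""
    if base_url is not None:
        # Assumption: a base URL replaces host:port and the endpoint is appended as-is.
        return f"{base_url}{endpoint}"
    return f"http://{host}:{port}{endpoint}"
```

Under this assumption, a base URL that already ends in `/v1` combined with an endpoint that starts with `/v1` would produce a doubled path segment, so it is worth checking the two values together.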
| `--dataset-name` | Use Case |
|---|---|
| `random` | Synthetic random prompts (default) |
| `sharegpt` | ShareGPT conversation format; requires `--dataset-path` |
| `sonnet` | Sonnet-style prompts |
| `hf` | HuggingFace dataset; requires `--dataset-path` (dataset ID) |
| `custom` / `custom_mm` | Custom dataset; requires `--dataset-path` |
| `prefix_repetition` | Prefix repetition benchmark |
| `random-mm` | Random multimodal (images/videos) |
| `spec_bench` | Spec bench dataset |
Dataset-specific options (examples):
# Random: control input/output length
--dataset-name random --random-input-len 1024 --random-output-len 128
# Sonnet defaults: input 550, output 150, prefix 200
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150
# HuggingFace dataset
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test
# General overrides (map to dataset-specific args)
--input-len 512 --output-len 256
# Fixed request rate (Poisson process)
--request-rate 10
# More bursty arrivals (gamma distribution, burstiness < 1)
--request-rate 10 --burstiness 0.5
# Ramp-up from low to high RPS
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50
# Limit concurrency (useful for rate-limited APIs)
--max-concurrency 32
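The effect of `--request-rate` and `--burstiness` on arrival timing can be illustrated with a short simulation. It assumes inter-arrival gaps are drawn from a gamma distribution whose shape equals the burstiness and whose mean is `1 / request_rate`, so a burstiness of 1 reduces to the exponential gaps of a Poisson process. The function `arrival_intervals` is a hypothetical sketch, not the CLI's own code.

```python
import random
from typing import List


def arrival_intervals(request_rate: float,
                      burstiness: float = 1.0,
                      n: int = 1000,
                      seed: int = 0) -> List[float]:
    """Sample n gaps (seconds) between consecutive requests (illustrative only)."""
    if request_rate == float("inf"):
        # Burst mode: every request is sent immediately.
        return [0.0] * n

    rng = random.Random(seed)
    # Scale chosen so the mean gap (shape * scale) is 1 / request_rate.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]
```

With the same request rate, a burstiness below 1 keeps the average gap unchanged but increases its variance, which shows up as clusters of requests separated by longer pauses.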
| Argument | Description |
|---|---|
| `--save-result` | Save benchmark results to JSON |
| `--save-detailed` | Include per-request TTFT, TPOT, errors in JSON |
| `--append-result` | Append to existing result file |
| `--result-dir` | Directory for result files |
| `--result-filename` | Custom filename (default: `{label}-{request_rate}qps-{model}-{timestamp}.json`) |
| `--percentile-metrics` | Metrics for percentiles: `ttft`, `tpot`, `itl`, `e2el` (default: `ttft,tpot,itl`) |
| `--metric-percentiles` | Percentile values, e.g. `25,50,99` (default: `99`) |
| `--goodput` | SLOs for goodput as `metric:milliseconds` pairs, e.g. `ttft:500 tpot:50` |
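Goodput can be read as the rate of requests that meet every stated SLO. The sketch below assumes that definition and the `metric:milliseconds` pair format shown for `--goodput`; `parse_slos` and `goodput` are hypothetical helpers, and the per-request dictionaries are an assumed shape rather than the tool's output format.

```python
from typing import Dict, List


def parse_slos(pairs: List[str]) -> Dict[str, float]:
    """Turn ["ttft:500", "tpot:50"] into {"ttft": 500.0, "tpot": 50.0} (milliseconds)."""
    slos = {}
    for pair in pairs:
        name, _, limit = pair.partition(":")
        slos[name] = float(limit)
    return slos


def goodput(requests: List[Dict[str, float]],
            slos_ms: Dict[str, float],
            duration_s: float) -> float:
    """Requests per second that satisfy all SLOs (illustrative definition)."""
    good = sum(
        1 for req in requests
        if all(req[name] <= limit for name, limit in slos_ms.items())
    )
    return good / duration_s
```

A request counts toward goodput only if it meets all listed limits, so a fast TTFT does not compensate for a slow TPOT.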
--temperature 0.7 --top-p 0.95 --top-k 50
--frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0
1. Throughput test with random dataset (burst):
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 500 --random-input-len 512 --random-output-len 128
2. Latency test with fixed QPS:
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--request-rate 5 --num-prompts 200 \
--save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99
3. Benchmark against remote API (base-url):
vllm bench serve --backend openai-chat \
--base-url "https://api.example.com" \
--endpoint /v1/chat/completions \
--model my-model \
--header "Authorization=Bearer $API_KEY"
4. Run inside Docker (when vLLM client not on host):
docker exec <container-name> vllm bench serve \
--backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random --num-prompts 100
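Runs saved with `--save-result` can be compared afterwards by loading the JSON files from the result directory. The sketch below makes no claim about the exact schema: it assumes only that each file holds one JSON object and that latency statistics are numeric fields whose names end in `_ms`. Both helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_results(result_dir: str) -> List[Dict]:
    """Load every JSON object found in result_dir, tagged with its file name."""
    results = []
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open() as fh:
            data = json.load(fh)
        if isinstance(data, dict):
            data["_file"] = path.name
            results.append(data)
    return results


def latency_fields(result: Dict) -> Dict[str, float]:
    """Keep numeric fields whose key ends in "_ms" (assumed naming convention)."""
    return {
        key: value
        for key, value in result.items()
        if key.endswith("_ms") and isinstance(value, (int, float))
    }
```

Calling `latency_fields(run)` for each entry of `load_results("./bench-results")` gives a quick side-by-side view of runs at different request rates.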
- Check that `--host`/`--port` or `--base-url` are correct.
- Pass `--model` explicitly or ensure `/v1/models` returns the model.
- Use `--endpoint /v1/chat/completions` when `--backend openai-chat`.
- Adjust `--request-rate` or `--max-concurrency` to control the load on the server.
- Use `--ready-check-timeout-sec 60` to wait for the endpoint before benchmarking.
- Use `--insecure` for self-signed certificates.
- For embedding, pooling, or rerank endpoints, use `--backend openai-embeddings`, `vllm-pooling`, or `vllm-rerank`.
- `--profile` requires `--profiler-config` on the server for vLLM profiling.