Benchmarks vLLM or OpenAI-compatible LLM serving endpoints using the `vllm bench serve` CLI. Measures throughput, latency (TTFT/TPOT), and goodput, with datasets such as ShareGPT and HuggingFace and request-rate control. Use for inference performance testing.
```bash
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills
```
Benchmark vLLM or any OpenAI-compatible serving endpoint using the `vllm bench serve` CLI. Measures throughput, latency (TTFT, TPOT), and goodput against configurable request load.
Reference: vLLM Bench Serve Documentation
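The commands below assume an OpenAI-compatible server is already running. A minimal sketch of starting one with vLLM (the model and port match the examples below; adjust to your setup):

```bash
# Serve the model used in the examples on the default port 8000
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
```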
Basic benchmark against local vLLM server (default random dataset, 1000 prompts):
```bash
vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions
```
Save results to JSON:
```bash
vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --save-result \
  --result-dir ./bench-results \
  --metadata "version=0.6.0" "tp=1"
```
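The saved JSON can then be inspected from the shell; a rough sketch using jq (field names such as `request_throughput` and `mean_ttft_ms` reflect typical `vllm bench serve` output and are assumptions that may vary by version):

```bash
# Print headline metrics from the most recent result file (field names assumed)
jq '{request_throughput, output_throughput, mean_ttft_ms, mean_tpot_ms}' \
  "$(ls -t ./bench-results/*.json | head -n 1)"
```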
Note: when using `--backend openai-chat`, you must specify `--endpoint /v1/chat/completions` (the default is `/v1/completions`).
| Argument | Default | Description |
|---|---|---|
| `--backend` | `openai` | Backend type: `openai`, `openai-chat`, `openai-embeddings`, `vllm`, `vllm-pooling`, `vllm-rerank`, etc. |
| `--host` | `127.0.0.1` | Server host |
| `--port` | `8000` | Server port |
| `--base-url` | - | Alternative: full base URL instead of host:port |
| `--endpoint` | `/v1/completions` | API endpoint; use `/v1/chat/completions` for `openai-chat` |
| `--model` | (from `/v1/models`) | Model name |
| `--num-prompts` | `1000` | Number of prompts to process |
| `--request-rate` | `inf` | Requests per second; `inf` = burst all at once |
| `--max-concurrency` | - | Max concurrent requests (caps parallelism) |
| `--num-warmups` | `0` | Warmup requests before measuring |
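A sketch combining a few of these arguments (the values are illustrative, not recommendations):

```bash
# Warm up first, then issue requests at 10 RPS with at most 32 in flight
vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --num-warmups 10 \
  --request-rate 10 \
  --max-concurrency 32 \
  --num-prompts 500
```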
| `--dataset-name` | Use Case |
|---|---|
| `random` | Synthetic random prompts (default) |
| `sharegpt` | ShareGPT conversation format; requires `--dataset-path` |
| `sonnet` | Sonnet-style prompts |
| `hf` | HuggingFace dataset; requires `--dataset-path` (dataset ID) |
| `custom` / `custom_mm` | Custom dataset; requires `--dataset-path` |
| `prefix_repetition` | Prefix repetition benchmark |
| `random-mm` | Random multimodal (images/videos) |
| `spec_bench` | Spec bench dataset |
Dataset-specific options (examples):
```bash
# Random: control input/output length
--dataset-name random --random-input-len 1024 --random-output-len 128

# Sonnet defaults: input 550, output 150, prefix 200
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150

# HuggingFace dataset
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test

# General overrides (map to dataset-specific args)
--input-len 512 --output-len 256
```
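ShareGPT benchmarks point `--dataset-path` at a local JSON file; a sketch (the file path is a placeholder for your own copy of the dataset):

```bash
# ShareGPT conversations from a local JSON file (path is a placeholder)
--dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
```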
Request-rate and load control (examples):

```bash
# Fixed request rate (Poisson process)
--request-rate 10

# More bursty arrivals (gamma distribution, burstiness < 1)
--request-rate 10 --burstiness 0.5

# Ramp-up from low to high RPS
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50

# Limit concurrency (useful for rate-limited APIs)
--max-concurrency 32
```
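A common pattern is sweeping request rates and appending every run to one result file; a rough sketch using the flags above (loop values are illustrative):

```bash
# Sweep request rates and append each run to the same result file (sketch)
for rps in 1 2 5 10 20; do
  vllm bench serve \
    --backend openai-chat --host 127.0.0.1 --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --num-prompts 200 --request-rate "$rps" \
    --save-result --append-result \
    --result-dir ./bench-results --result-filename sweep.json
done
```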
| Argument | Description |
|---|---|
| `--save-result` | Save benchmark results to JSON |
| `--save-detailed` | Include per-request TTFT, TPOT, errors in JSON |
| `--append-result` | Append to existing result file |
| `--result-dir` | Directory for result files |
| `--result-filename` | Custom filename (default: `{label}-{request_rate}qps-{model}-{timestamp}.json`) |
| `--percentile-metrics` | Metrics for percentiles: `ttft`, `tpot`, `itl`, `e2el` (default: `ttft,tpot,itl`) |
| `--metric-percentiles` | Percentile values, e.g. `25,50,99` (default: `99`) |
| `--goodput` | SLO for goodput, e.g. `ttft:500 tpot:50` (ms) |
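A sketch measuring goodput against a latency SLO, building on the basic command (the SLO values are illustrative):

```bash
# Count only requests meeting TTFT <= 500 ms and TPOT <= 50 ms toward goodput
vllm bench serve \
  --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --request-rate 10 --num-prompts 500 \
  --goodput ttft:500 tpot:50 \
  --save-result --result-dir ./bench-results
```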
Sampling parameters can also be overridden:

```bash
--temperature 0.7 --top-p 0.95 --top-k 50
--frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0
```
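These flags can be appended to any of the benchmark commands; a brief sketch (values are illustrative):

```bash
# Benchmark with non-default sampling settings
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct --endpoint /v1/chat/completions \
  --num-prompts 200 --temperature 0.7 --top-p 0.95
```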
1. Throughput test with random dataset (burst):
```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 500 --random-input-len 512 --random-output-len 128
```
2. Latency test with fixed QPS:
```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --request-rate 5 --num-prompts 200 \
  --save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99
```
3. Benchmark against remote API (base-url):
```bash
vllm bench serve --backend openai-chat \
  --base-url "https://api.example.com/v1" \
  --model my-model \
  --header "Authorization=Bearer $API_KEY"
```
4. Run inside Docker (when vLLM client not on host):
```bash
docker exec <container-name> vllm bench serve \
  --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random --num-prompts 100
```
Troubleshooting tips:

- Verify `--host`/`--port` or `--base-url` are correct.
- Set `--model` explicitly or ensure `/v1/models` returns the model.
- Use `--endpoint /v1/chat/completions` when `--backend openai-chat`.
- If the server is overloaded, lower `--request-rate` or `--max-concurrency`.
- Use `--ready-check-timeout-sec 60` to wait for the endpoint before benchmarking.
- Use `--insecure` for self-signed certificates.
- For embedding, pooling, or rerank models, use `--backend openai-embeddings`, `vllm-pooling`, or `vllm-rerank`.
- `--profile` requires `--profiler-config` on the server for vLLM profiling.