From vllm-skills
Runs vLLM serving benchmarks with synthetic random data to measure throughput, TTFT, TPOT, inter-token latency. Quick tests without external datasets.
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills
How this skill is triggered: by the user, by Claude, or both
Slash command
/vllm-skills:vllm-bench-random-synthetic
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
Prerequisite: vLLM installed (`pip install vllm`).

The simplest way to run the benchmark:
```bash
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10
```
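Because `vllm serve` needs time to load the model, it can help to wait until the server answers before starting the benchmark. The following is a minimal sketch, assuming the server exposes `/health` on `localhost:8000` (the same check used in the troubleshooting steps) and that `curl` is installed. The function name `wait_for_server`, its arguments, and the retry settings are hypothetical, not part of vLLM.

```bash
# Hypothetical helper: poll a health URL until it responds or attempts run out.
# Usage: wait_for_server [url] [max_attempts]
wait_for_server() {
  url="${1:-http://localhost:8000/health}"
  max_attempts="${2:-60}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -f: fail on HTTP errors, --max-time: cap each probe at 2 s
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      echo "Server is ready after ${attempt} attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the last one
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep 2
    fi
    attempt=$((attempt + 1))
  done
  echo "Server did not respond at ${url}" >&2
  return 1
}
```

One possible use is `wait_for_server && vllm bench serve ...`, so the benchmark only starts once the health check succeeds.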
Note: Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.

| Parameter | Description | Default |
|---|---|---|
| `--backend` | Backend type: `vllm`, `openai`, `openai-chat` | `vllm` |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | `/v1/completions` or `/v1/chat/completions` |
| `--dataset-name` | Dataset to use | `random` (synthetic) |
| `--num-prompts` | Number of requests to send | 10 |
| `--port` | Server port | 8000 |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | `./` |
When successful, you will see output like:
```text
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
```
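To compare runs, it can be handy to pull single values out of this report and to sanity-check how the throughput figures relate to the duration. The sketch below assumes the report has the `Label: value` shape shown above and has been captured to a file or pipe. The helper names `metric` and `throughput` are hypothetical and not part of vLLM.

```bash
# Hypothetical helper: print the numeric value that follows a metric label.
# Usage: metric "Mean TTFT (ms)" < bench.log
metric() {
  awk -v label="$1" '
    # Match lines that begin with "<label>:" and print the last field
    index($0, label ":") == 1 { print $NF; exit }
  '
}

# Recompute a throughput as count / duration, rounded to two decimals.
# Usage: throughput <count> <duration_seconds>
throughput() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}
```

With the sample report, `throughput 10 5.78` reproduces the reported request throughput of 1.73 req/s. Token throughputs recomputed this way can differ slightly from the printed values, because the displayed duration is rounded.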
Run with more prompts:

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100
```

Save results to a directory:

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results/
```

Use a different model, a custom port, and limited concurrency:

```bash
vllm bench serve \
  --backend openai-chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100 \
  --port 8001 \
  --max-concurrency 4
```
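The flags above can also be combined to sweep several concurrency levels and save each run. This is only a sketch: the helper names `bench_cmd` and `sweep` and the levels 1, 2, 4, and 8 are hypothetical choices, and it uses only flags listed in the parameter table. The helpers print the commands instead of running them, so they can be reviewed first.

```bash
# Hypothetical helper: print the benchmark command for one concurrency level.
# Usage: bench_cmd <model> <max_concurrency>
bench_cmd() {
  model="$1"
  concurrency="$2"
  printf 'vllm bench serve --backend openai-chat --model %s --endpoint /v1/chat/completions --dataset-name random --num-prompts 100 --max-concurrency %s --save-result --result-dir ./benchmark-results/\n' \
    "$model" "$concurrency"
}

# Print one command per concurrency level (levels are illustrative).
# Usage: sweep <model>
sweep() {
  for c in 1 2 4 8; do
    bench_cmd "$1" "$c"
  done
}
```

For example, `sweep Qwen/Qwen2.5-1.5B-Instruct` prints four commands; piping that output to `sh` would run them one after another against a server that is already up.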
For quick testing (small models, fast):
- `Qwen/Qwen2.5-1.5B-Instruct` (recommended for quick tests)
- `facebook/opt-125m`
- `facebook/opt-350m`

For realistic benchmarks (medium models):
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`

Workflow:
1. Run `vllm --version` to verify that vLLM is installed.
2. Run `curl http://localhost:8000/health` to check whether a server is already running.
3. Start the server with `vllm serve <model-name>` (wait for "Application startup complete").
4. Run `vllm bench serve` with appropriate parameters.
5. Stop the server with `kill <PID>`.

Server not responding:
- Check the server with `curl http://localhost:8000/health`
- Use the `--port` flag if the server is on a different port

Model not found:
- Set `export HF_TOKEN=<your_token>` if needed

Out of memory:
- Reduce `--num-prompts` or `--max-concurrency`

Connection refused:

Notes:
- The `random` dataset generates synthetic prompts automatically
- `--num-prompts` sets the number of requests to send