Runs vLLM serving benchmarks with synthetic random data to measure request throughput, token throughput, TTFT, TPOT, and inter-token latency. Useful for quick tests that need no external datasets.
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills

This skill uses the workspace's default tool permissions.
Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
Install vLLM first (`pip install vllm`). The simplest way to run the benchmark:
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct
# Run benchmark with random synthetic data
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 10
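Before running the benchmark, you can confirm the server is up with `curl http://localhost:8000/health`.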
Note: use `--backend openai-chat` with the `/v1/chat/completions` endpoint for online benchmarks.

| Parameter | Description | Default |
|---|---|---|
| `--backend` | Backend type: `vllm`, `openai`, or `openai-chat` | `vllm` |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | `/v1/completions` or `/v1/chat/completions` |
| `--dataset-name` | Dataset to use | `random` (synthetic) |
| `--num-prompts` | Number of requests to send | 10 |
| `--port` | Server port | 8000 |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | `./` |
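As an aside, the default `vllm` backend targets the completions endpoint rather than the chat endpoint. A minimal sketch of the equivalent run against `/v1/completions`, assuming the same server as in the quick start:

```bash
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/completions \
  --dataset-name random \
  --num-prompts 10
```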
When successful, you will see output like:
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
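To put more load on the server, increase `--num-prompts`: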
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 100
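To save the results to disk for later comparison: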
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 50 \
--save-result \
--result-dir ./benchmark-results/
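The results are written as JSON files in the chosen directory (the filenames are generated by the tool). A minimal sketch for inspecting them:

```bash
# Pretty-print each saved result (filenames are generated by the benchmark tool)
for f in ./benchmark-results/*.json; do
  echo "== $f =="
  python3 -m json.tool "$f"
done
```

To benchmark a different model on a non-default port with a concurrency cap: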
vllm bench serve \
--backend openai-chat \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 100 \
--port 8001 \
--max-concurrency 4
For quick testing (small models, fast):
- Qwen/Qwen2.5-1.5B-Instruct (recommended for quick tests)
- facebook/opt-125m
- facebook/opt-350m

For realistic benchmarks (medium models):

- Qwen/Qwen2.5-7B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3

Typical workflow:

1. Verify vLLM is installed: `vllm --version`
2. Check whether a server is already running: `curl http://localhost:8000/health`
3. Start the server: `vllm serve <model-name>` (wait for "Application startup complete")
4. Run `vllm bench serve` with appropriate parameters (see the sketch below)
5. When finished, stop the server: `kill <PID>`
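The steps above can be scripted. A minimal sketch, assuming vLLM is installed and port 8000 is free:

```bash
# Start the server in the background, wait for the health check, benchmark, then stop it.
vllm serve Qwen/Qwen2.5-1.5B-Instruct &
SERVER_PID=$!

# Poll the health endpoint until the server reports ready
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10

kill "$SERVER_PID"
```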
Troubleshooting:

Server not responding:
- Check the server with `curl http://localhost:8000/health`
- Pass the `--port` flag if the server is listening on a different port

Model not found:
- Set `export HF_TOKEN=<your_token>` if the model requires authentication

Out of memory:
- Reduce `--num-prompts` or `--max-concurrency`

Connection refused:
- Make sure the vLLM server has been started and has finished loading before running the benchmark

Notes:
- The `random` dataset generates synthetic prompts automatically, so no external dataset is needed.
- Adjust `--num-prompts` to control how many requests are sent.