Benchmarks vLLM automatic prefix caching efficiency using fixed prompts, ShareGPT dataset, or synthetic prefix/suffix patterns. Compares throughput and latency with/without caching for repeated prompts.
Install with `npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills`. This skill uses the workspace's default tool permissions.
Benchmark the efficiency of vLLM's automatic prefix caching (APC) feature. The offline script `benchmarks/benchmark_prefix_caching.py` runs directly against the vLLM engine (no server required). For online/serving tests, use `vllm bench serve` with the `prefix_repetition` dataset.
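What the benchmark exercises can be reproduced in a few lines with vLLM's offline API. The sketch below is illustrative only: it assumes a vLLM version where `LLM(..., enable_prefix_caching=True)` is available, and the prompt text is a made-up placeholder.

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching (APC) on the offline engine.
llm = LLM(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=32)

# A long, repeated prompt: the second call can reuse the KV-cache blocks
# that the first call populated for the shared prefix.
prompt = "Summarize the following log:\n" + "INFO step ok\n" * 300
llm.generate([prompt], params)  # first pass fills the cache
llm.generate([prompt], params)  # identical prefix hits the cache
```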
Option 1: fixed prompt (synthetic). Runs a synthetic benchmark with a fixed prompt repeated multiple times to directly measure cache hit efficiency; prefix caching is enabled with `--enable-prefix-caching`. No dataset download required.
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
To compare against the baseline without caching:
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--no-enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
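The two runs print timing results that can be compared directly. As a rough illustration, the numbers below are hypothetical placeholders, not measurements; substitute the values reported by your own runs.

```python
# Placeholder timings: replace with the results printed by the two runs above.
time_no_apc = 40.0  # seconds, from the --no-enable-prefix-caching run (hypothetical)
time_apc = 12.0     # seconds, from the --enable-prefix-caching run (hypothetical)
print(f"speedup from prefix caching: {time_no_apc / time_apc:.2f}x")
```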
Option 2: ShareGPT dataset. Uses real-world conversational data from ShareGPT to evaluate prefix caching with naturally occurring prompt sharing.
First, download the dataset:
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
Then run the benchmark:
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--enable-prefix-caching \
--num-prompts 20 \
--repeat-count 5 \
--input-length-range 128:256
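To verify the download before a long run, the file can be inspected quickly. This sketch assumes the usual ShareGPT layout: a JSON list of records, each holding a `conversations` list of `{"from", "value"}` turns.

```python
import json

# Quick sanity check of the downloaded ShareGPT file.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations loaded")
turn = data[0]["conversations"][0]
print(turn["from"], "->", turn["value"][:80])
```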
Option 3: online serving. Uses `vllm bench serve` with the synthetic `prefix_repetition` dataset to test caching via the serving API. This requires a running vLLM server.
First, start the server:
vllm serve Qwen/Qwen3-8B
Then run the benchmark:
vllm bench serve \
--backend openai \
--model Qwen/Qwen3-8B \
--dataset-name prefix_repetition \
--num-prompts 100 \
--prefix-repetition-prefix-len 512 \
--prefix-repetition-suffix-len 128 \
--prefix-repetition-num-prefixes 5 \
--prefix-repetition-output-len 128
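Before a long serving run, it can help to confirm the endpoint answers repeated-prefix requests. This minimal sketch assumes the server started above is listening on the default port 8000 and uses the `openai` Python client; the prompt text is a placeholder.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; any non-empty key is accepted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_prefix = "You are reviewing the same long contract. " * 40
for suffix in ("Summarize clause 1.", "Summarize clause 2."):
    out = client.completions.create(
        model="Qwen/Qwen3-8B",
        prompt=shared_prefix + suffix,  # same prefix, different suffix
        max_tokens=64,
    )
    print(out.choices[0].text[:80])
```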
Key parameters for `prefix_repetition`:
| Parameter | Description |
|---|---|
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
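As a back-of-the-envelope reading of the example command above (arithmetic only, not a measured result): with 100 prompts spread across 5 prefixes, each 512-token prefix is reused by about 20 requests, and roughly 80% of each prompt's input tokens are eligible for cache reuse.

```python
# Values taken from the example command above; the arithmetic is illustrative.
num_prompts, num_prefixes = 100, 5
prefix_len, suffix_len = 512, 128

requests_per_prefix = num_prompts / num_prefixes             # 20 requests share each prefix
cacheable_fraction = prefix_len / (prefix_len + suffix_len)  # 0.8 of prompt tokens are reusable
print(requests_per_prefix, f"{cacheable_fraction:.0%}")
```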
Usage notes:
- Run all commands from the root of a local vLLM checkout (`cd vllm`).
- Keep the default model (`Qwen/Qwen3-8B`) unless the user specifies a different one or the model is unavailable; change only `--model`.
- `--repeat-count` in Options 1 and 2 controls how many times each sampled prompt is replayed; higher values increase the cache hit rate.
- `--input-length-range` accepts a `min:max` token range, e.g. `128:256`.
- For multi-GPU models, add `--tensor-parallel-size <N>`.
- For a faster hash algorithm, use `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).

Arguments for `benchmark_prefix_caching.py`:

| Argument | Required | Description |
|---|---|---|
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON). Omit for synthetic fixed-prompt mode |
| `--prefix-len` | No | Fixed prefix token length to prepend to every prompt |
| `--output-len` | No | Number of output tokens to generate per request |
| `--sort` | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching) |
| `--prefix-caching-hash-algo` | No | Hash algorithm: `sha256`, `sha256_cbor`, `xxhash`, `xxhash_cbor` |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| `--disable-detokenize` | No | Skip detokenization to reduce overhead |
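When the cached and uncached runs need to be repeated often, the two invocations can be wrapped in a small driver. The sketch below is illustrative, not a script shipped with vLLM, and uses only the flags documented in the table above.

```python
import subprocess

# Shared arguments for both runs; adjust the model or values as needed.
common = [
    "python3", "benchmarks/benchmark_prefix_caching.py",
    "--model", "Qwen/Qwen3-8B",
    "--num-prompts", "1",
    "--repeat-count", "100",
    "--input-length-range", "128:256",
]

# Run once with APC enabled and once with it disabled.
for toggle in ("--enable-prefix-caching", "--no-enable-prefix-caching"):
    print(f"=== {toggle} ===")
    subprocess.run(common + [toggle], check=True)
```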
If `python3 benchmarks/*.py` reports file not found, locate your local vLLM repository first and run the command from that repo root. If you do not have a checkout yet, clone one:
git clone https://github.com/vllm-project/vllm
cd vllm
If the model requires Hugging Face authentication, `export HF_TOKEN=<your_token>` or pass `--hf-token <your_token>`. If `xxhash` or `cbor2` is not installed and you use those hash algorithms, install them first: `pip install xxhash cbor2`.
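A quick pre-flight check for the optional pieces mentioned above can look like this (illustrative, not part of the vLLM repo):

```python
import importlib.util
import os

# Check the optional hash-algorithm dependencies and the Hugging Face token.
for mod in ("xxhash", "cbor2"):
    ok = importlib.util.find_spec(mod) is not None
    status = "installed" if ok else f"missing (pip install {mod})"
    print(f"{mod}: {status}")
print("HF_TOKEN set:", "HF_TOKEN" in os.environ)
```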