From vllm-skills
Deploys vLLM inference server using Docker (pre-built images or build-from-source) with NVIDIA GPU support and OpenAI-compatible API.
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills

This skill uses the workspace's default tool permissions.
A Claude skill describing how to deploy vLLM with Docker, either using the official pre-built images or building the image from source, with support for NVIDIA GPUs and CUDA. Instructions cover NVIDIA CUDA setup, an example `docker run` command and a minimal `docker-compose` snippet, recommended flags, and troubleshooting notes. For AMD, Intel, or other accelerators, please refer to the vLLM documentation for alternative deployment methods.
This skill covers `--ipc=host`, shared cache mounts, and `HF_TOKEN` handling, and uses `curl` for API tests. A valid `HF_TOKEN` is needed for gated models.

Run a vLLM OpenAI-compatible server with GPU access, mounting the HF cache and forwarding port 8000:
docker run --rm --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-1.5B-Instruct
- `--gpus all` exposes all GPUs to the container. Adjust if you need specific GPUs.
- `--ipc=host` or an appropriately large `--shm-size` is recommended so PyTorch and vLLM can share host shared memory.
- Mounting `~/.cache/huggingface` avoids re-downloading models inside the container.

Note: vLLM and this skill recommend using the latest Docker image (`vllm/vllm-openai:latest`). For legacy version images, you may refer to the Docker Hub image tags.
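The skill description also mentions a minimal `docker-compose` snippet. The following is one possible sketch, not an official file: the service name `vllm` is illustrative, and it assumes Docker Compose v2 with the NVIDIA Container Toolkit installed so that the GPU reservation syntax works:

services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host                          # same effect as --ipc=host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # persist model downloads
    environment:
      - HF_TOKEN=${HF_TOKEN}           # forwarded from the host environment
    command: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]  # args passed to the server entrypoint
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all               # or device_ids: ["0"] to pin a specific GPU
              capabilities: [gpu]

Start it with `docker compose up` from the directory containing the compose file.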
You can build and run vLLM from source by using the provided docker/Dockerfile. First, check the hardware of the host machine and ensure you have the necessary dependencies installed (e.g., NVIDIA drivers, CUDA toolkit, Docker with BuildKit support). For ARM64/aarch64 builds, refer to the "Building for ARM64/aarch64" section.
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile
The --target vllm-openai specifies that you are building the OpenAI-compatible server image. The DOCKER_BUILDKIT=1 environment variable enables BuildKit, which provides better caching and faster builds.
- `--build-arg max_jobs=<N>` — sets the number of parallel compilation jobs for building CUDA kernels. Useful for speeding up builds on multi-core systems.
- `--build-arg nvcc_threads=<N>` — controls CUDA compiler threads. Recommended to use a smaller value than `max_jobs` to avoid excessive memory usage.
- `--build-arg torch_cuda_arch_list=""` — if set to an empty string, vLLM will detect and build only for the current GPU's compute capability. By default, vLLM builds for all GPU types for wider distribution.

If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time:
- Add `--build-arg VLLM_USE_PRECOMPILED="1"` to your build command.
- By default, the precompiled wheels are taken from the `main` branch.
- To pin the wheels to a specific commit, add `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash>`.

Example with precompiled wheels and options for fast compilation:
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile \
--build-arg max_jobs=8 \
--build-arg nvcc_threads=2 \
--build-arg VLLM_USE_PRECOMPILED="1"
vLLM does not include optional dependencies (e.g., audio processing) in the pre-built image to avoid licensing issues. If you need optional dependencies, create a custom Dockerfile that extends the base image:
Example: adding audio optional dependencies
# NOTE: MAKE SURE the version of vLLM matches the base image!
FROM vllm/vllm-openai:0.11.0
# Install audio optional dependencies
RUN uv pip install --system vllm[audio]==0.11.0
Example: using a development version of Transformers:
FROM vllm/vllm-openai:latest
# Install development version of Transformers from source
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
Build this custom Dockerfile with:
docker build -t my-vllm-custom:latest -f Dockerfile .
Then use it like any other vLLM image:
docker run --rm --gpus all \
-p 8000:8000 \
--ipc=host \
my-vllm-custom:latest \
--model Qwen/Qwen2.5-1.5B-Instruct
A Docker image can be built for ARM64 systems (e.g., NVIDIA Grace-Hopper and Grace-Blackwell). Use the flag `--platform "linux/arm64"`:
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile \
--platform "linux/arm64"
Note: Multiple modules must be compiled, so this process can take longer. Use build arguments like --build-arg max_jobs=8 --build-arg nvcc_threads=2 to speed up the process (ensure max_jobs is substantially larger than nvcc_threads). Monitor memory usage, as parallel jobs can require significant RAM.
For cross-compilation (building ARM64 on an x86_64 host), register QEMU user-static handlers first:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
Then use the --platform "linux/arm64" flag in your build command.
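Putting the pieces together, a cross-compile invocation might look like the following sketch (the `:arm64` tag and the specific `max_jobs`/`nvcc_threads` values are illustrative; tune them for your host's cores and RAM):

# Cross-build the ARM64 OpenAI-compatible image on an x86_64 host (after registering QEMU)
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai:arm64 \
--file docker/Dockerfile \
--platform "linux/arm64" \
--build-arg max_jobs=8 \
--build-arg nvcc_threads=2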
After building, run your image just like the pre-built image:
docker run --rm --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai \
--model Qwen/Qwen2.5-1.5B-Instruct
Replace vllm/vllm-openai with the tag you specified during the build (e.g., my-vllm-custom:latest).
Note: `--runtime nvidia` is deprecated for most environments. Prefer `--gpus ...` with the NVIDIA Container Toolkit. Use `--runtime nvidia` only for legacy Docker configurations.
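To confirm the NVIDIA Container Toolkit is working before launching vLLM, a quick sanity check is to run `nvidia-smi` inside a CUDA base container (the image tag below is only an example; any recent CUDA base tag works):

# GPUs should be listed if the toolkit and drivers are set up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi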
Common server flags:

- `--model <MODEL_ID>` — model to load (HF ID or local path)
- `--port <PORT>` — server port (default 8000 for the OpenAI-compatible server)
- `--log-level` — adjust verbosity
- Other engine args go after the image tag; see the vLLM docs for tuning options.

After the container starts, make a quick test request against the OpenAI-compatible endpoint:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Who are you?"}],"max_tokens":128}'
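You can also check which models the server is serving via the OpenAI-compatible `/v1/models` endpoint:

# Lists the models currently served on port 8000
curl -s http://localhost:8000/v1/models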
Security and troubleshooting notes:

- Keep the `HF_TOKEN` secret; prefer passing it via environment variables or a secret manager.
- GPUs not visible in the container: ensure `nvidia-container-toolkit` is installed and restart Docker.
- Model download failures: check `HF_TOKEN` and network access; mount the cache directory to persist downloads.
- Shared-memory errors: use `--ipc=host` or increase `--shm-size`.
- NCCL issues: set `VLLM_NCCL_SO_PATH` per upstream guidance.
- Docker permission errors: ensure the current user is in the `docker` group, or add them manually as follows:

# 1. Create docker group if it doesn't exist (may already exist on some systems)
sudo groupadd docker
# 2. Add current user to the docker group (replace $USER with your username if needed)
sudo usermod -aG docker $USER
# 3. Apply the new group membership (you may need to log out and log back in for this to take effect)
newgrp docker
# 4. Verify that the user is in the docker group (output should include docker)
groups $USER
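After the group change takes effect, you can verify that Docker works without sudo using the standard `hello-world` test image:

# Should pull and run the test image without permission errors
docker run --rm hello-world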
If authentication or download errors persist, verify that `HF_TOKEN` is passed to the container and is valid. Check whether `HTTP_PROXY` and `HTTPS_PROXY` are passed to the container if the host is behind a proxy. Also verify that the model ID is correct and that the model is public or accessible with the provided token.
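For hosts behind a proxy, one sketch of forwarding the proxy settings into the container (assuming `HTTP_PROXY` and `HTTPS_PROXY` are already exported on the host; the `NO_PROXY` value is illustrative):

docker run --rm --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--env "HTTP_PROXY=$HTTP_PROXY" \
--env "HTTPS_PROXY=$HTTPS_PROXY" \
--env "NO_PROXY=localhost,127.0.0.1" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-1.5B-Instruct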