From vllm-skills
Deploys vLLM server on detected hardware (CUDA/ROCm/TPU/CPU), installs in virtual env with uv/pip, starts LLM serving, tests OpenAI-compatible API.
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills
How this skill is triggered (by the user, by Claude, or both):
Slash command
/vllm-skills:vllm-deploy-simple
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
This skill provides a streamlined workflow to install vLLM for the detected hardware, start the server, and test the OpenAI-compatible API.
If the user did not specify a venv path, or asked to deploy in the current environment, create a venv in the current folder using uv with Python 3.12. If uv is not found, create a folder in this path and use python to create the virtual environment.
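The venv rule above can be sketched as a small shell function. This is an illustrative sketch only: the function name `create_venv` and its argument are hypothetical and are not taken from `quickstart.sh`.

```shell
# Sketch: create a virtual environment, preferring uv and falling back to python3.
# create_venv is a hypothetical name; quickstart.sh may be organized differently.
create_venv() {
    local venv_path="$1"
    if command -v uv >/dev/null 2>&1; then
        # uv can select (and, if needed, download) the requested Python version.
        uv venv --python 3.12 "$venv_path"
    else
        # Fallback: make the folder and use the standard library venv module.
        mkdir -p "$venv_path"
        python3 -m venv "$venv_path"
    fi
}
```

For example, `create_venv ./my-venv` would create `./my-venv` with uv when it is installed, and with `python3 -m venv` otherwise.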
If the user did not specify the venv path, model, or port, use the default options:
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
Or with custom options:
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv
# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000
# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6
# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
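This runs the complete workflow (the default all command): install vLLM, start the server, and test the API. Each step can also be run on its own: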
Install vLLM:
scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
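As a rough illustration of what the install step might do, the sketch below installs a package with uv when it is available and with pip otherwise. It assumes the virtual environment is already activated; the function name `install_vllm` is hypothetical, and any backend-specific index or wheel selection is not shown.

```shell
# Sketch: install vLLM with uv if present, otherwise with pip.
# install_vllm is a hypothetical helper; it assumes an activated venv.
install_vllm() {
    local package="${1:-vllm}"
    if command -v uv >/dev/null 2>&1; then
        uv pip install "$package"
    else
        python3 -m pip install "$package"
    fi
}
```

Calling `install_vllm` installs the standard `vllm` package; `install_vllm vllm-tpu` would install the TPU variant instead.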
Start the server:
scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
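How `quickstart.sh` launches the server is not shown here, so the following is only a sketch under assumptions: it runs `vllm serve` in the background, redirects output to a log file, and records the PID. The function name `start_server` and its argument order are hypothetical; the file locations follow the `$VENV_PATH/tmp/` layout this document uses for the log and PID files.

```shell
# Sketch: start the server in the background and remember its PID.
# start_server is a hypothetical helper; it assumes vllm is on PATH.
start_server() {
    local venv="$1" model="$2" port="$3" gpu_util="$4"
    mkdir -p "$venv/tmp"
    # vLLM's own flag is spelled with hyphens: --gpu-memory-utilization.
    vllm serve "$model" --port "$port" --gpu-memory-utilization "$gpu_util" \
        > "$venv/tmp/vllm-server.log" 2>&1 &
    # $! is the PID of the background job just started.
    echo $! > "$venv/tmp/vllm-server.pid"
}
```

A call such as `start_server . "Qwen/Qwen2.5-1.5B-Instruct" 8000 0.8` mirrors the default options listed in this document.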
Test the API:
scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000
Stop the server:
scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
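Stopping a background server typically comes down to signalling the recorded PID. The sketch below assumes the PID was saved to `tmp/vllm-server.pid` under the venv path; `stop_server` is a hypothetical name, not the script's actual function.

```shell
# Sketch: stop the server using its PID file.
# stop_server is a hypothetical helper; $1 is the venv path.
stop_server() {
    local pid_file="$1/tmp/vllm-server.pid"
    if [ ! -f "$pid_file" ]; then
        echo "No PID file at $pid_file; server is probably not running."
        return 0
    fi
    local pid
    pid="$(cat "$pid_file")"
    # SIGTERM asks the process to shut down; ignore the error if it already exited.
    kill "$pid" 2>/dev/null || true
    rm -f "$pid_file"
}
```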
Check server status:
scripts/quickstart.sh status
Restart the server:
scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
The script supports the following command-line options:
scripts/quickstart.sh [command] [OPTIONS]
Commands:
  install    Install vLLM and dependencies
  start      Start the vLLM server
  stop       Stop the vLLM server
  test       Test the OpenAI-compatible API
  status     Show server status
  restart    Restart the server
  all        Run complete workflow (default)
Options:
  --model MODEL                    Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                      Port to run server on (default: 8000)
  --venv VENV_PATH                 Virtual environment path (default: .)
  --gpu_memory_utilization VRAM    GPU memory utilization (default: 0.8)
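The usage above can be parsed with a single loop that accepts the command and the options in any order. This is a minimal sketch with the documented defaults; the function name `parse_args` and the variable names are illustrative assumptions, not the script's actual internals.

```shell
# Sketch: parse "[command] [OPTIONS]" with the documented defaults.
# parse_args and the variable names are hypothetical.
parse_args() {
    COMMAND="all"
    MODEL="Qwen/Qwen2.5-1.5B-Instruct"
    PORT="8000"
    VENV_PATH="."
    GPU_MEMORY_UTILIZATION="0.8"
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model|--port|--venv|--gpu_memory_utilization)
                # Every option takes exactly one value.
                if [ "$#" -lt 2 ]; then
                    echo "Missing value for $1" >&2
                    return 1
                fi
                case "$1" in
                    --model) MODEL="$2" ;;
                    --port) PORT="$2" ;;
                    --venv) VENV_PATH="$2" ;;
                    --gpu_memory_utilization) GPU_MEMORY_UTILIZATION="$2" ;;
                esac
                shift 2
                ;;
            install|start|stop|test|status|restart|all)
                COMMAND="$1"
                shift
                ;;
            *)
                echo "Unknown argument: $1" >&2
                return 1
                ;;
        esac
    done
}
```

With this sketch, `parse_args --port 8080 start --venv /path/to/venv` sets `COMMAND=start`, `PORT=8080`, and `VENV_PATH=/path/to/venv`, leaving the other values at their defaults.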
The script automatically detects your hardware and installs the appropriate vLLM version:
- NVIDIA CUDA: nvidia-smi command
- AMD ROCm: /dev/kfd and /dev/dri devices
- Google TPU: TPU_NAME environment variable or gcloud command
For Google TPU, the script installs vllm-tpu instead of the standard vllm package.
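The detection rules above could be expressed roughly as follows. The check order, the function names `detect_backend` and `package_for_backend`, the printed labels, and the CPU fallback are all assumptions made for illustration.

```shell
# Sketch: pick a backend from the signals described above.
# Function names, labels, check order, and the CPU fallback are assumptions.
detect_backend() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "cuda"
    elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
        echo "rocm"
    elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
        echo "tpu"
    else
        echo "cpu"
    fi
}

# Map the backend to the package name: vllm-tpu on TPU, vllm otherwise.
package_for_backend() {
    case "$1" in
        tpu) echo "vllm-tpu" ;;
        *) echo "vllm" ;;
    esac
}
```

A caller could then combine the two, for example `package_for_backend "$(detect_backend)"`.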
The test script sends a simple chat completion request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
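A request like the one above fails if it is sent before the model has finished loading, so it can help to wait for the server first. The sketch below polls the OpenAI-compatible `/v1/models` endpoint until it answers or a timeout expires. The function name `wait_for_server`, the 2-second interval, and the 120-second default timeout are assumptions; whether `quickstart.sh` waits this way is not stated in this document.

```shell
# Sketch: wait until the server responds on /v1/models.
# wait_for_server, the polling interval, and the timeout are hypothetical.
wait_for_server() {
    local port="${1:-8000}" timeout="${2:-120}" waited=0
    # curl -f makes HTTP errors count as failures; -s silences progress output.
    until curl -sf "http://localhost:${port}/v1/models" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 2
        waited=$((waited + 2))
    done
    return 0
}
```

For example, `wait_for_server 8000 300 && scripts/quickstart.sh test --port 8000` would run the test only once the server is reachable.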
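Virtual environment not found:
- Check that the path given to --venv exists and is a valid virtual environment
- It should contain an activation script (bin/activate on Linux/macOS or Scripts/activate on Windows)
- Create one with uv venv /path/to/venv (suggested), or with Python's venv module: python3 -m venv /path/to/venv
Server won't start:
- Check whether the port is already in use: lsof -i :8000
- Check GPU availability: nvidia-smi (for NVIDIA) or rocm-smi (for AMD)
- Verify the vLLM installation: python -c "import vllm; print(vllm.__version__)"
- Check the server log: $VENV_PATH/tmp/vllm-server.log
API returns errors:
- Inspect the server log: cat $VENV_PATH/tmp/vllm-server.log
- Check the server status: scripts/quickstart.sh status
Out of memory:
- Adjust the --gpu-memory-utilization parameter (passed to this script as --gpu_memory_utilization)
Wrong backend detected:
- NVIDIA: make sure nvidia-smi is in your PATH
- Google TPU: set the TPU_NAME environment variable or install gcloud
Notes:
- Server logs are written to $VENV_PATH/tmp/vllm-server.log
- The server PID is stored in $VENV_PATH/tmp/vllm-server.pid for easy management
- Installation uses uv if available, otherwise falls back to pip
- Options can be placed before or after the command (e.g. scripts/quickstart.sh --port 8080 start --venv /path/to/venv)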
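The PID file at `$VENV_PATH/tmp/vllm-server.pid` makes a simple status check possible. The following is a sketch of one way to do it, not the script's actual implementation; `server_status` and its output strings are hypothetical.

```shell
# Sketch: report whether the process recorded in the PID file is alive.
# server_status and its output strings are hypothetical; $1 is the venv path.
server_status() {
    local pid_file="$1/tmp/vllm-server.pid"
    # kill -0 sends no signal; it only tests that the process exists.
    if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
        echo "running"
    else
        echo "stopped"
    fi
}
```

A PID file left behind by a crashed server is reported as stopped, because the `kill -0` check fails for a PID that no longer exists.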