From external-gitcode-ascend-skills
Deploys vLLM inference services on Ascend NPU servers with automatic model detection, quantization handling, tensor parallelism configuration, and service health verification. Supports local/remote deployment on bare metal, containers, or Docker images.
Install:
npx claudepluginhub ascend-ai-coding/awesome-ascend-skills --plugin migration-ascend-torchnpu-skills
Slash command:
/external-gitcode-ascend-skills:vllm-ascend-server
This skill deploys vLLM inference services on Ascend NPU servers with automatic model detection, quantization handling, and performance optimization.
Reference files:
- references/environment-variables.md
- references/features.md
- references/graph-mode.md
- references/launch-templates/docker.md
- references/launch-templates/health-check.md
- references/launch-templates/offline-inference.md
- references/launch-templates/online-serving.md
- references/launch-templates/speculative-decoding.md
- references/launch_templates.md
- references/model_configs/deepseek-v3.yaml
- references/model_configs/glm-4.x.yaml
- references/model_configs/qwen2.5-vl.yaml
- references/model_configs/qwen3-235b-a22b.yaml
- references/model_configs/qwen3-30b.yaml
- references/model_configs/qwen3-8b.yaml
- references/model_configs/qwen3-embedding.yaml
- references/model_configs/qwen3-reranker.yaml
- references/model_configs/qwen3-vl.yaml
- references/models.md
- references/parameters.md
Key Features:
- Automatic quantization detection (via quant_model_description.json)
Phase 0: Platform (Local/Remote, Bare metal/Container)
↓
Phase 1: Environment Check (NPU, vLLM, Memory)
↓
Phase 2: Model Discovery (Find models, detect quantization)
↓
Phase 3: Gather Requirements (Port, TP size, mode selection)
↓
Phase 4: Generate Config (Env vars, vLLM command)
↓
Phase 5: Execute (Deploy and start service)
↓
Phase 6: Verify (Health check, test inference)
Detailed workflow: workflow-guide.md
1. Local - Deploy on this machine
2. Remote - Deploy via SSH (→ remote-server-guide skill)
1. Bare metal - Virtual environment on host
2. Existing container - Connect to running container
3. Docker image - Create with npu-docker-launcher
Docker image defaults:
- Model mount: -v <model-path>:/model
- Network: host (default) or bridge with port mapping
# NPU check
npu-smi info
# vLLM check
pip show vllm vllm-ascend
# Memory check
npu-smi info | grep -A 5 "Memory-Usage"
Before deployment, verify selected NPU cards are not occupied:
# Check NPU usage status
npu-smi info
# Check for running processes on specific cards
fuser -v /dev/davinci0 2>/dev/null && echo "Card 0 in use" || echo "Card 0 available"
fuser -v /dev/davinci1 2>/dev/null && echo "Card 1 in use" || echo "Card 1 available"
# Alternative: Check memory usage (high usage = occupied)
npu-smi info -t board | grep -E "NPU|Memory-Usage"
If selected cards are occupied:
## NPU Card Status
Card 0: ❌ In use (Memory: 28GB/32GB, PID: 12345)
Card 1: ✅ Available (Memory: 0GB/32GB)
Card 2: ✅ Available (Memory: 0GB/32GB)
Card 3: ❌ In use (Memory: 30GB/32GB, PID: 67890)
Selected cards [0,1] have conflicts:
- Card 0 is occupied by process 12345
Options:
1. Select different cards
2. Kill occupying process (with user confirmation)
3. Wait and retry
How to proceed? [1/2/3]
Kill process (with confirmation):
# Show what's using the card
ps -p <PID> -o pid,user,etime,cmd
# Confirm before killing
"Kill process <PID> (<process-name>)? [yes/no]"
# Kill if confirmed
kill -9 <PID>
Detailed NPU check: workflow-guide.md
Default search paths: /home/weights, /home/weight, /home/data*, /data*
Each path is searched recursively for config.json; a directory that contains one is treated as a model.
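The recursive search described above could be sketched as a small shell function. This is only an illustration: `find_models` is a hypothetical helper name, and the assumption that every directory holding a `config.json` is a model comes from the search rule stated here, not from any vLLM tooling.

```shell
# find_models DIR...: print each directory under the given roots that
# contains a config.json (used here as the marker of a model directory).
# Hypothetical helper, not part of vLLM or vllm-ascend.
find_models() {
    for root in "$@"; do
        # Skip roots that do not exist (e.g. an unmatched /data* glob).
        [ -d "$root" ] || continue
        find "$root" -type f -name config.json 2>/dev/null
    done | while IFS= read -r cfg; do
        dirname "$cfg"
    done | sort -u
}
```

A typical call would be `find_models /home/weights /home/weight /home/data* /data*`, which prints one candidate model directory per line.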
# Quantized model
[ -f "<model>/quant_model_description.json" ] → --quantization ascend
# Non-quantized model
[ ! -f "<model>/quant_model_description.json" ] → No param
Critical: Never add --quantization ascend for non-quantized models!
See quantization.md for details.
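As a minimal sketch of the detection rule above, the check can be wrapped in a function that prints the flag only when the description file is present. The name `quant_param` is hypothetical and chosen to mirror the QUANT_PARAM placeholder used in the launch command.

```shell
# quant_param MODEL_DIR: print the quantization flag for vllm serve, or
# print nothing when MODEL_DIR has no quant_model_description.json.
# Hypothetical helper name.
quant_param() {
    if [ -f "$1/quant_model_description.json" ]; then
        printf '%s\n' '--quantization ascend'
    fi
}
```

Because the function prints nothing for a non-quantized model, its output can be spliced into the command line without risk of adding the flag by mistake.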
| Parameter | Default | Notes |
|---|---|---|
| Mode | online | online/offline |
| Port | 8000 | Default for vLLM |
| NPU cards | 0 | 0,1 for TP2 |
| TP size | Auto | Based on model |
| Scenario | Mode | Config |
|---|---|---|
| Production | Graph | --no-enforce-eager |
| Development | Eager | --enforce-eager |
| Debugging | Eager | --enforce-eager |
See graph-mode.md for details.
| Model Size | TP | Cards |
|---|---|---|
| ≤14B | 1 | 1 |
| 14B-70B | 2-4 | 2-4 |
| >70B | 4-8 | 4-8 |
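One way to turn the table into a default is sketched below. The function name `suggest_tp` is hypothetical, and picking the low end of each range is an assumption made for illustration; the table itself leaves the choice within a range open.

```shell
# suggest_tp PARAMS_B: print a starting tensor-parallel size for a model
# with PARAMS_B billion parameters (integer). Returns the low end of the
# matching range in the table; this choice is an assumption.
suggest_tp() {
    if [ "$1" -le 14 ]; then
        echo 1
    elif [ "$1" -le 70 ]; then
        echo 2
    else
        echo 4
    fi
}
```

For example, `suggest_tp 30` selects the 14B-70B row and prints its lower bound.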
Single card:
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export ASCEND_RT_VISIBLE_DEVICES=0
Multi-card:
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export HCCL_BUFFSIZE=1024
export HCCL_CONNECT_TIMEOUT=600
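The two variable sets above can be chosen from the card list alone. The sketch below assumes that a comma in the list means a multi-card deployment; `npu_env` is a hypothetical helper name, and the values are copied from the single-card and multi-card settings listed here.

```shell
# npu_env CARDS: print the export lines for a card list such as "0" or "0,1".
# A comma in CARDS is taken to mean a multi-card deployment (assumption).
npu_env() {
    cards=$1
    case "$cards" in
        *,*)
            # Multi-card settings
            echo "export VLLM_USE_V1=1"
            echo "export TASK_QUEUE_ENABLE=1"
            echo "export ASCEND_RT_VISIBLE_DEVICES=$cards"
            echo "export HCCL_BUFFSIZE=1024"
            echo "export HCCL_CONNECT_TIMEOUT=600"
            ;;
        *)
            # Single-card settings
            echo "export TASK_QUEUE_ENABLE=1"
            echo "export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1"
            echo "export ASCEND_RT_VISIBLE_DEVICES=$cards"
            ;;
    esac
}
```

Running `eval "$(npu_env 0,1)"` applies the multi-card settings to the current shell; printing the lines first makes it easy to show them to the user before anything is exported.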
vllm serve /model/<model-name> \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--max-num-seqs 256 \
--max-model-len 32768 \
--tensor-parallel-size <tp> \
[QUANT_PARAM] \
--gpu-memory-utilization 0.9 \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
QUANT_PARAM:
- Quantized model: --quantization ascend
- Non-quantized model: omit the parameter
Display generated config, confirm with user, then execute.
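The generate, display, confirm, execute sequence might look like the sketch below. Both function names are hypothetical, only a subset of the flags from the command template is assembled, and `eval` is used for brevity on the assumption that the command string was built locally from trusted values.

```shell
# build_vllm_cmd MODEL PORT TP [QUANT]: print a vllm serve command line.
# QUANT is either empty or "--quantization ascend". Hypothetical helper.
build_vllm_cmd() {
    model=$1 port=$2 tp=$3 quant=${4:-}
    cmd="vllm serve $model --host 0.0.0.0 --port $port --trust-remote-code"
    cmd="$cmd --tensor-parallel-size $tp"
    if [ -n "$quant" ]; then
        cmd="$cmd $quant"
    fi
    printf '%s\n' "$cmd --gpu-memory-utilization 0.9"
}

# confirm_and_run CMD: show CMD, read an answer from stdin, and run CMD
# only when the answer is exactly "yes". Returns 1 when declined.
confirm_and_run() {
    printf '%s\nRun this command? [yes/no] ' "$1"
    read -r answer
    if [ "$answer" = "yes" ]; then
        eval "$1"
    else
        echo "Aborted."
        return 1
    fi
}
```

A possible call is `confirm_and_run "$(build_vllm_cmd /model/Qwen3-8B 8000 1)"`, which prints the full command and waits for an explicit `yes` before starting the server.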
Persistent session (tmux): If you connected via tmux and are already inside the target environment (remote host / container / both), execute commands directly — same as bare metal.
Stateless (SSH key / sshpass / paramiko / fabric):
| Platform | Method |
|---|---|
| Bare metal | Execute directly in shell |
| Existing container | docker exec to run command |
| Remote | SSH → run command |
| Remote container | SSH → docker exec -d for background |
For containers, start vLLM in background with logging:
# Inside container (bare metal / tmux already in container)
nohup vllm serve /model ... > /tmp/vllm.log 2>&1 &
# From host (stateless, background in container)
docker exec -d <container> bash -c 'cd /workspace && vllm serve /model ... 2>&1 | tee /tmp/vllm.log'
# Remote via SSH (stateless)
ssh user@host "docker exec -d <container> bash -c 'vllm serve /model ... 2>&1 | tee /tmp/vllm.log'"
Inside container (bare metal / tmux already in container):
ps aux | grep vllm
tail -50 /tmp/vllm.log
tail -f /tmp/vllm.log
From host or remote (stateless, each command needs docker exec):
docker exec <container> ps aux | grep vllm
docker exec <container> tail -50 /tmp/vllm.log
docker exec <container> tail -f /tmp/vllm.log
# Health check
curl http://localhost:8000/health
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "<name>", "messages": [{"role": "user", "content": "Hello"}]}'
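Model loading can take a while, so a single health request right after launch may fail even though the service is starting normally. A generic retry loop is one way to handle this; `wait_until` is a hypothetical helper, and the timeout and interval in the usage note are illustrative values, not recommendations from the skill.

```shell
# wait_until TIMEOUT_S INTERVAL_S CMD...: rerun CMD until it succeeds or
# until the accumulated sleep time would exceed TIMEOUT_S seconds.
# Returns 0 on success and 1 on timeout. Time spent inside CMD itself is
# not counted. Hypothetical helper.
wait_until() {
    timeout_s=$1 interval_s=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        elapsed=$((elapsed + interval_s))
        if [ "$elapsed" -gt "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
    done
    return 0
}
```

For example, `wait_until 300 5 curl -sf http://localhost:8000/health && echo "service is up"` polls the health endpoint every 5 seconds and gives up after about 5 minutes.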
# Example: single card, quantized model
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export ASCEND_RT_VISIBLE_DEVICES=0
vllm serve /model/Qwen3-8B-mxfp8 \
--port 8000 \
--trust-remote-code \
--quantization ascend \
--gpu-memory-utilization 0.9 \
--async-scheduling
# Example: single card, non-quantized model (no --quantization param)
vllm serve /model/Qwen3-8B \
--port 8000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--async-scheduling
# Example: two cards (TP2), quantized model
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export HCCL_BUFFSIZE=1024
vllm serve /model/Qwen3-30B-mxfp8 \
--port 8000 \
--tensor-parallel-size 2 \
--quantization ascend \
--gpu-memory-utilization 0.9
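# Example: Docker image, bridge network with port mapping
docker run -it -d \
--name vllm-server \
--network bridge \
-p 8000:8000 \
-v /home/weights/Qwen3-8B:/model \
-e ASCEND_RT_VISIBLE_DEVICES=0 \
vllm-ascend:latest
# Inside container (add --quantization ascend only if /model/quant_model_description.json exists)
vllm serve /model ...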
| Detection | Action |
|---|---|
| quant_model_description.json exists | Add --quantization ascend |
| File not found | No quantization param |
| Use Case | Mode |
|---|---|
| Production | Graph (AclGraph) |
| First deployment | Eager → test → Graph |
| Errors in graph | Fall back to Eager |
| Debugging | Eager |
| Network | Port Config |
|---|---|
| host | No mapping needed |
| bridge | Ask user for host port |
| Error | Solution |
|---|---|
| OOM | Reduce --max-num-seqs or --max-model-len |
| Graph capture failed | Use --enforce-eager |
| Quantization error | Check if model is actually quantized |
| Port in use | Change port or kill process |
| HCCL timeout | Increase HCCL_CONNECT_TIMEOUT |
| Connection reset/timeout | Network issue, retry SSH connection |
| Container exits immediately | Check docker logs, verify mounts exist |