From external-gitcode-ascend-skills
Analyzes and adapts upstream vLLM tests for Ascend NPU compatibility. Debugs failures, ports test cases, and validates CI readiness without modifying upstream vLLM code.
Slash command
/external-gitcode-ascend-skills:vllm-tests-failure-analysis
This skill covers two tightly linked workflows for the vllm-ascend project: analyzing why upstream vLLM tests fail on Ascend NPU, and adapting upstream vLLM tests so they run inside vllm-ascend.
Both workflows share the same environment setup and root-cause methodology. The key constraint throughout is: never modify upstream vLLM code — only the test code (once copied into vllm-ascend) and vllm-ascend plugin code may be changed.
Before analyzing any test, check whether prior work already covers it — this avoids duplication and ensures consistency.
Decision tree:
1. Read references/ASCEND_ALL_128_TEST_ANALYSIS.md, the consolidated summary table covering 128 tests with root cause, CI verdict, and "should Ascend pass" classification. If the target test appears here and the vllm/vllm-ascend versions haven't changed significantly since the analysis date (2026-03-27), use the existing conclusion. Only re-analyze if the user explicitly requests a fresh run or versions have changed.
2. If the test appears in references/TEST_FILES_NEED_ANALYSIS.md but NOT in the summary table, it is a known target that has not yet been analyzed. Proceed to full analysis.
3. If the test does not appear in either file, it is a new target. Proceed to full analysis.
For deeper details on tests 31–61 (exact failure traces, mitigations tried), read references/ASCEND_31_61_TEST_ANALYSIS.md.
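The decision tree above can be sketched as a small function. This is an illustration only: the function name, its arguments, and the idea of passing the two reference files as already-parsed sets of test paths are assumptions, not part of the skill.

```python
def decide_prior_coverage(test_path, summarized, pending,
                          versions_changed=False, fresh_run_requested=False):
    """Return (action, reason) for a target test.

    summarized: test paths found in ASCEND_ALL_128_TEST_ANALYSIS.md (hypothetical pre-parsed set)
    pending:    test paths found in TEST_FILES_NEED_ANALYSIS.md (hypothetical pre-parsed set)
    """
    if test_path in summarized:
        # An existing conclusion is reused unless it is stale or the user wants a rerun.
        if versions_changed or fresh_run_requested:
            return ("analyze", "summary exists but versions changed or a fresh run was requested")
        return ("reuse", "covered by the consolidated summary table")
    if test_path in pending:
        return ("analyze", "known target, not yet analyzed")
    return ("analyze", "new target")
```

For example, a test listed in the summary table is only re-analyzed when `versions_changed` or `fresh_run_requested` is true.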
Complete these steps before running any tests. Each step matters — skipping one often produces misleading failures that waste debugging time.
1. Discover workspace paths. Check in order: paths provided by the user → environment variables $VLLM_WORKSPACE / $VLLM_ASCEND_WORKSPACE → common locations /vllm-workspace/vllm and /vllm-workspace/vllm-ascend → search the user's home directory. Confirm both repo paths before proceeding.
2. Select idle NPU cards. Run npu-smi info to find cards with no running processes, then set export ASCEND_RT_VISIBLE_DEVICES=<idle_card_ids> (e.g., 6,7). Using cards with existing workloads causes OOM and misleading failures.
3. Source the Ascend toolkit. source /usr/local/Ascend/ascend-toolkit/set_env.sh
4. Configure model downloads. Try export HF_ENDPOINT=https://hf-mirror.com first. If the HF mirror fails, fall back to export VLLM_USE_MODELSCOPE=True. If ModelScope also lacks the model, fall back to Hugging Face with proxy.
5. Set proxy if needed. For China mainland environments, configure http_proxy/https_proxy/all_proxy. Always set no_proxy=localhost,127.0.0.1; without this, localhost requests route through the proxy and cause server-startup timeouts.
6. Preserve PYTHONPATH. Always extend the existing value rather than overwriting it: export PYTHONPATH=/path/to/vllm:$PYTHONPATH. Overwriting loses Ascend toolkit paths and causes silent import failures.
Record the environment for every analysis: vllm commit/version, vllm-ascend commit/version, Python, torch, torch-npu, CANN versions, and whether external network is available.
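When tests are launched from a Python driver rather than an interactive shell, the same settings can be assembled into an environment mapping. The sketch below is a minimal illustration: `build_test_env` and its parameters are hypothetical names, and it assumes the Ascend toolkit script has already been sourced in the parent shell, since that step cannot be reproduced by setting variables alone.

```python
import os


def build_test_env(vllm_path, idle_cards, base_env=None):
    """Return a copy of the environment with the setup steps applied.

    vllm_path:  path to the upstream vLLM checkout (assumed to be discovered already)
    idle_cards: NPU card ids chosen from `npu-smi info`, e.g. [6, 7]
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(card) for card in idle_cards)
    # Prefer the HF mirror unless the caller already chose an endpoint.
    env.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    # Keep localhost traffic off the proxy to avoid server-startup timeouts.
    env["no_proxy"] = "localhost,127.0.0.1"
    # Extend PYTHONPATH instead of overwriting it, so toolkit paths survive.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = vllm_path + (os.pathsep + existing if existing else "")
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking pytest.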
This section covers how to take an upstream vLLM test and adapt it for vllm-ascend. The fundamental rule: upstream vLLM code is read-only — all changes go into the test copy (inside vllm-ascend) or the vllm-ascend plugin code.
The test's destination in vllm-ascend depends on what it tests:
| Test type | Destination directory | Reason |
|---|---|---|
| Pure unit test (no NPU hardware needed) | tests/ut/<subdomain>/ | Runs in mocked env, no real hardware |
| End-to-end with single NPU | tests/e2e/singlecard/ | Standard singlecard e2e |
| End-to-end with multiple NPUs | tests/e2e/multicard/2-cards/ or 4-cards/ | Matches TP/PP requirements |
| Nightly / heavy benchmark | tests/e2e/nightly/single_node/ | Too slow for presubmit |
| Upstream interface verification | tests/e2e/vllm_interface/ | Tests that exercise vLLM interfaces against Ascend |
Most upstream tests from vllm/tests/ map to tests/e2e/singlecard/ unless they specifically need multi-card or are pure-logic unit tests.
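The placement table can be read as a simple lookup. The following sketch is illustrative only: `pick_destination`, its flags, the order in which the conditions are checked, and the choice to place 3-card tests under 4-cards/ are all assumptions layered on top of the table.

```python
def pick_destination(needs_npu, num_cards=1, nightly=False,
                     interface_check=False, subdomain="<subdomain>"):
    """Map a test's characteristics to a vllm-ascend test directory."""
    if not needs_npu:
        return f"tests/ut/{subdomain}/"          # pure unit test, mocked env
    if nightly:
        return "tests/e2e/nightly/single_node/"  # too slow for presubmit
    if interface_check:
        return "tests/e2e/vllm_interface/"       # upstream interface verification
    if num_cards <= 1:
        return "tests/e2e/singlecard/"
    if num_cards == 2:
        return "tests/e2e/multicard/2-cards/"
    if num_cards <= 4:
        return "tests/e2e/multicard/4-cards/"
    raise ValueError("the table lists no destination for more than 4 cards")
```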
Copy the test file into the chosen destination directory. If the test imports helpers from sibling files (e.g., conftest.py, utils.py in the same upstream directory), check whether vllm-ascend's tests/e2e/conftest.py already provides equivalents before copying upstream helpers.
Add the required license header at the top of every new file:
#
# Copyright (c) 2026 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/blob/main/tests/<original/path>
#
Fix imports. Key replacements:
- Replace from tests.conftest import ... with from tests.e2e.conftest import ... (the vllm-ascend e2e conftest already provides VllmRunner, HfRunner, RemoteOpenAIServer, cleanup_dist_env_and_memory, etc.)
- Use from tests.e2e.model_utils import check_outputs_equal, ... for output comparison utilities
- Remove import triton / CUDA-specific imports that aren't needed
- Don't call adapt_patch() in individual test files; it is already called in conftest.py
Replace CUDA-specific constructs:
- torch.cuda.device_count() → torch.npu.device_count() or use current_platform from vllm
- DeviceConfig("cuda") → DeviceConfig("npu")
- @pytest.mark.skipif(not torch.cuda.is_available(), ...) → replace with NPU availability checks, or remove if the test should always run on NPU
- cuda in device strings → npu
Apply vllm-ascend e2e conventions:
- Use VllmRunner as a context manager (its __exit__ auto-cleans memory)
- Set os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" at module level for tests that load models
- Set gpu_memory_utilization=0.7 (or appropriate value) explicitly to avoid OOM
- Set cudagraph_capture_sizes=[1, 2, 4, 8] when needed, or enforce_eager=True to disable graph capture
- Use @pytest.fixture(autouse=True) with monkeypatch for test-scoped env vars
Handle model references: Use HuggingFace Hub IDs (e.g., "Qwen/Qwen3-0.6B") or vllm-ascend specific IDs (e.g., "vllm-ascend/Qwen3-0.6B-W8A8"). Never use local filesystem paths. If the original test uses a model unavailable on the HF mirror or ModelScope, note this as a precondition issue.
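The purely mechanical import and CUDA-construct replacements described above could be applied as a first-pass text rewrite before manual review. This is a rough sketch under stated assumptions: `rewrite_cuda_constructs` and the `REWRITES` table are hypothetical, only three of the listed replacements are covered, and skipif decorators and free-form device strings still need a human decision.

```python
import re

# (pattern, replacement) pairs taken from the replacement list; the table itself is illustrative.
REWRITES = [
    (re.compile(r"\bfrom tests\.conftest import\b"), "from tests.e2e.conftest import"),
    (re.compile(r"\btorch\.cuda\.device_count\(\)"), "torch.npu.device_count()"),
    (re.compile(r'DeviceConfig\("cuda"\)'), 'DeviceConfig("npu")'),
]


def rewrite_cuda_constructs(source):
    """Apply each textual replacement to the copied test's source code."""
    for pattern, replacement in REWRITES:
        source = pattern.sub(replacement, source)
    return source
```

Run it only on the copy inside vllm-ascend, never on the upstream file, and diff the result before committing.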
After copying and adapting, run the test and iterate:
while test fails:
    1. Run: pytest -sv tests/e2e/singlecard/test_<name>.py 2>&1 | tee /tmp/test_output.log
    2. Capture the FIRST real failure (ignore cascading errors)
    3. Classify: is this environmental noise or a real issue?
       - Environmental → fix (install dep, set proxy, select different card) → rerun
       - Real failure → diagnose root cause (see §4)
    4. Fix: modify the test copy or vllm-ascend plugin code (never upstream vLLM)
    5. Rerun and verify
Keep a log of every change made and every error encountered. This record is essential for the final report and for reproducing results.
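Steps 2 and 3 of the loop can be partly automated against the saved log. The sketch below relies on pytest prefixing traceback error lines with "E"; the helper names and the list of environmental markers are assumptions chosen for illustration, and with `-s` enabled a test's own output could also start with "E ", so the result should be checked by hand.

```python
# Illustrative substrings that usually indicate environmental noise, not a real failure.
ENV_NOISE_MARKERS = ("ConnectionError", "ProxyError", "ModuleNotFoundError", "out of memory")


def first_failure(log_text):
    """Return the first pytest error line (those starting with 'E '), or None."""
    for line in log_text.splitlines():
        if line.startswith("E "):
            return line[1:].strip()
    return None


def looks_environmental(error_line):
    """Heuristic: does the error line match a known environmental marker?"""
    lowered = error_line.lower()
    return any(marker.lower() in lowered for marker in ENV_NOISE_MARKERS)
```

A line flagged as environmental is fixed and the test rerun; anything else goes to root-cause diagnosis.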
If after thorough debugging you conclude the test cannot pass on Ascend NPU, produce a clear root-cause report:
The goal is to find the true root cause, not stop at the first error message. Many surface failures (network timeouts, missing deps) mask deeper issues.
vllm and vllm-ascend), inspect both repositories.
| Category | Meaning | Example |
|---|---|---|
| vllm-ascend adaptation gap | Plugin missed upstream branching/dispatch logic | LoRA wrapper selection missing packed_modules_list check |
| upstream test hardcoded CUDA | Test assumes CUDA device, API, or kernel | DeviceConfig("cuda"), torch.cuda.device_count() |
| runtime feature gap | NPU doesn't support required op or quantization | fp8 quantization not supported on NPU |
| test precondition | Missing dependency, model, or resource (resolvable) | runai-model-streamer not installed |
| compiler/runtime compatibility | torch.compile / ACL graph / dynamo issues | torch._dynamo.exc.InternalTorchDynamoError |
| environment/resource | OOM, proxy, network, card contention | Free memory below gpu_memory_utilization threshold |
vllm-ascend plugin, the test copy, upstream vllm (documenting as "cannot fix in plugin alone"), or the runtime stack?
A test is a strong CI candidate if it:
Exclude tests that:
When summarizing a batch of tests:
| # | Test File | Evidence | Root Cause | Category | Should Ascend Pass | CI Verdict | Fixable in vllm-ascend |
|---|-----------|----------|------------|----------|-------------------|------------|----------------------|
| 1 | `tests/lora/test_add_lora.py` | Dynamic | LoRA wrapper selection gap | adaptation gap | Yes | nightly | Yes |
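To keep rows consistent across a batch, each finding could be held in a small record and rendered into the column order used above. This is a sketch only: the `Finding` dataclass and its field names are hypothetical and simply mirror the table header.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One analyzed test, with fields in the same order as the summary table columns."""
    test_file: str
    evidence: str
    root_cause: str
    category: str
    should_ascend_pass: str
    ci_verdict: str
    fixable_in_plugin: str

    def to_row(self, index):
        """Render the finding as one Markdown table row."""
        cells = [str(index), f"`{self.test_file}`", self.evidence, self.root_cause,
                 self.category, self.should_ascend_pass, self.ci_verdict,
                 self.fixable_in_plugin]
        return "| " + " | ".join(cells) + " |"
```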
After the table, include:
When adapting a specific test, the final report should include:
Consult case studies when a failure appears in upstream code but only manifests after installing vllm-ascend. This pattern — "works without plugin, breaks with plugin" — almost always points to an adaptation boundary issue in the plugin.
references/CASE_LORA_WRAPPER_SELECTION_GAP.md
vllm-ascend incomplete migration of upstream LoRA wrapper selection logic causes IndexError during set_lora().
packed_modules_list, output_sizes) before suspecting upstream itself.
MergedColumnParallelLinear + packed_modules_list + output_sizes + set_lora + IndexError → suspect wrapper selection mismatch.
For vllm-ascend, upstream tests should be selected primarily from behavior-contract layers rather than CUDA-specific implementation layers. The goal is to ensure vllm-ascend preserves upstream behavioral contracts, not to reproduce backend-specific implementations.
High priority: hardware plugin loading, platform dispatch, CustomOp fallback, config normalization, API/request validation, scheduler and cache semantics, LoRA adapter loading and module mapping, lightweight multimodal input handling, pooling task contracts, OpenAI-compatible API path correctness.
Lower priority: CUDA/ROCm/TPU backend-specific, heavily network-dependent, benchmark-oriented, or operationally flaky tests.
CI strategy: Presubmit CI should contain the smallest stable subset with high regression signal. Heavier multimodal, runtime-compatibility, and distributed-setup cases should be deferred to nightly CI.
npx claudepluginhub ascend-ai-coding/awesome-ascend-skills --plugin remote-npu-test