From flagos-skills
Detects GPU vendor (NVIDIA, AMD/ROCm, Ascend, Metax, Iluvatar), selects PyTorch container image from vendor hubs or search, launches Docker container with data mounts, and validates GPU access. Useful for PyTorch GPU setup.
npx claudepluginhub flagos-ai/skills --plugin flagos-skills
This skill automates multi-vendor GPU container setup for PyTorch workloads.
| Vendor | PyTorch Backend | Detection |
|---|---|---|
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi, /opt/rocm |
| Ascend | torch_npu | npu-smi, /usr/local/Ascend |
| Metax | torch_musa | mx-smi, /opt/metax |
| Iluvatar | torch_corex | ixsmi, /opt/iluvatar |
When invoked, follow these steps:
Check if user provided:
- --vendor <name> - Force specific vendor (skip detection)
- --image <image> - Force specific container image
- --data <path> - Force specific data mount path
- --name <name> - Container name (default: pytorch-gpu)

Run the detection script:
python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py
Expected output:
{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}
If detection fails and no --vendor flag provided, ask user which vendor to use.
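The probing logic implied by the vendor table can be sketched as follows. This is a minimal illustration, not the contents of detect_gpu.py: the `VENDOR_PROBES` mapping and the `detect_vendor` function are hypothetical names, and the probe order is an assumption. Only the tool names and install paths come from the table.

```python
import shutil
from pathlib import Path

# Hypothetical mapping built from the vendor table:
# vendor -> (CLI tools to look for on PATH, install directories to look for).
VENDOR_PROBES = {
    "nvidia": (["nvidia-smi"], []),
    "amd": (["rocm-smi"], ["/opt/rocm"]),
    "ascend": (["npu-smi"], ["/usr/local/Ascend"]),
    "metax": (["mx-smi"], ["/opt/metax"]),
    "iluvatar": (["ixsmi"], ["/opt/iluvatar"]),
}


def detect_vendor(forced=None, which=shutil.which,
                  exists=lambda p: Path(p).exists()):
    """Return a vendor key, or None when nothing is detected.

    `forced` plays the role of the --vendor flag and skips probing.
    `which` and `exists` are injectable so the logic can be exercised
    without real hardware.
    """
    if forced:
        vendor = forced.lower()
        if vendor not in VENDOR_PROBES:
            raise ValueError(f"unknown vendor: {forced}")
        return vendor
    for vendor, (tools, dirs) in VENDOR_PROBES.items():
        if any(which(t) for t in tools) or any(exists(d) for d in dirs):
            return vendor
    return None  # caller should ask the user which vendor to use


if __name__ == "__main__":
    print(detect_vendor())
```

A `None` result corresponds to the case where the user has to be asked for the vendor.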
Run the data disk detection:
python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py
Expected output:
{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}
If no suitable disk found, ask user for data mount path.
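One way such a data disk lookup could work is sketched below. It is not the actual find_data_disk.py: the candidate mount points, the 100 GiB free-space threshold, and the function names are all assumptions made for illustration. Only the shape of the returned dictionary follows the expected output above.

```python
import shutil


def format_size(n_bytes):
    """Render a byte count in binary units, e.g. 2 TiB -> '2.0T'."""
    units = ["B", "K", "M", "G", "T", "P"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= 1024


def find_data_disk(candidates=("/mnt/data", "/data", "/mnt"),
                   min_free_bytes=100 * 1024**3,
                   disk_usage=shutil.disk_usage):
    """Pick the candidate path with the most free space above a threshold.

    The candidate list and threshold are hypothetical defaults.
    """
    best = None
    for path in candidates:
        try:
            usage = disk_usage(path)
        except OSError:
            continue  # path missing or not readable
        if usage.free >= min_free_bytes and (
                best is None or usage.free > best[1].free):
            best = (path, usage)
    if best is None:
        return {"data_disk": None, "found": False}
    path, usage = best
    return {
        "data_disk": path,
        "found": True,
        "size": format_size(usage.total),
        "available": format_size(usage.free),
    }


if __name__ == "__main__":
    print(find_data_disk())
```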
Follow strict priority order (only proceed to next if current fails):
1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User
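The strict priority order amounts to a first-success fallback chain. A minimal sketch, assuming each source is wrapped in a hypothetical finder callable that returns an image reference or `None`, and that a separate `verify` callable checks the image:

```python
def select_image(sources, verify):
    """Return (source_name, image) from the first source that yields a
    verified image, or (None, None) if every source fails.

    `sources` is an ordered list of (name, finder) pairs; `finder` takes
    no arguments and returns an image reference or None. Both names are
    illustrative, not part of the skill.
    """
    for name, finder in sources:
        try:
            image = finder()
        except Exception:
            image = None  # an unreachable registry counts as a failure
        if image and verify(image):
            return name, image
    return None, None  # last resort: ask the user for an image


if __name__ == "__main__":
    demo_sources = [
        ("vendor_hub", lambda: None),                   # simulated failure
        ("baai_harbor", lambda: "example/pytorch:tag"),  # placeholder image
    ]
    print(select_image(demo_sources, verify=lambda image: True))
```

Later sources are never queried once an earlier one succeeds, which matches the "only proceed to next if current fails" rule.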
| Vendor | Registry | API/Query |
|---|---|---|
| NVIDIA | nvcr.io | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | ascendhub.huawei.com | Portal: https://ascendhub.huawei.com |
| Metax | registry.metax-tech.com | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | hub.iluvatar.com | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | docker.io (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |
# Example: Query NGC for latest NVIDIA PyTorch
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
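The tag filtering in the shell pipeline above (keep only `YY.MM-py3` tags, then take the newest) can also be expressed in Python. This sketch covers only the selection step and assumes the tag names have already been fetched; `latest_release_tag` is a hypothetical helper name.

```python
import re

# Same pattern as the grep in the shell example: two digits, a dot,
# two digits, then the literal suffix "-py3".
TAG_PATTERN = re.compile(r"^(\d{2})\.(\d{2})-py3$")


def latest_release_tag(tag_names):
    """Return the newest YY.MM-py3 tag, or None if no tag matches."""
    candidates = []
    for name in tag_names:
        match = TAG_PATTERN.match(name)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            candidates.append(((year, month), name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    tags = ["23.12-py3", "24.01-py3", "24.01-py3-igpu", "latest"]
    print(latest_release_tag(tags))
```

Comparing `(year, month)` tuples plays the role of `sort -rV | head -1` for this tag format.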
Only if Step 4.1 (the primary vendor hub) fails: the registry is unreachable, no matching image exists, or the pull fails.
# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-<vendor>"
Only if Steps 4.1 and 4.2 fail. Search the web for "<vendor> pytorch docker official".
Only if Steps 4.1-4.3 fail. Check for local images with docker images | grep pytorch.
docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"
If test fails, try next source. If all fail, ask user for image.
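The pull-then-import check can be wrapped in a small helper so that a caller can move on to the next source when it returns `False`. A sketch under the assumption that `docker` is on the PATH; `verify_image` is a hypothetical name, and the `run` parameter exists only so the logic can be exercised without Docker.

```python
import subprocess


def verify_image(image, run=subprocess.run):
    """Pull `image` and check that PyTorch imports inside it.

    Returns True only if both docker commands exit with status 0.
    """
    steps = [
        ["docker", "pull", image],
        ["docker", "run", "--rm", image,
         "python", "-c", "import torch; print(torch.__version__)"],
    ]
    for argv in steps:
        result = run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            return False  # stop at the first failing step
    return True
```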
IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source, so that future lookups are faster.
# After successful web search discovery:
# 1. Verify image works (pull + pytorch test + GPU test)
# 2. Extract registry URL pattern
# 3. Update references/image-sources.md Step 1 section with new vendor hub
Refer to references/mount-requirements.md for vendor-specific requirements.
NVIDIA:
docker run -d --gpus all \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
AMD/ROCm:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Ascend:
docker run -d \
--device=/dev/davinci0 --device=/dev/davinci1 ... \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend:/usr/local/Ascend:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Metax:
docker run -d \
--device=/dev/mx0 --device=/dev/mx1 ... \
-v /opt/metax:/opt/metax:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Iluvatar:
docker run -d \
--device=/dev/bi0 --device=/dev/bi1 ... \
-v /opt/iluvatar:/opt/iluvatar:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
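The per-vendor commands above share a common tail (name, shared memory size, data mount, `sleep infinity`) and differ only in device flags and driver mounts. The sketch below assembles them as argument lists. The function name, the glob patterns used to enumerate numbered device nodes, and the `devices` override are assumptions; references/mount-requirements.md remains the authority on what each vendor needs.

```python
import glob

# Assumed glob patterns for the numbered device nodes shown above.
NUMBERED_DEVICE_GLOBS = {
    "ascend": "/dev/davinci[0-9]*",
    "metax": "/dev/mx[0-9]*",
    "iluvatar": "/dev/bi[0-9]*",
}

# Extra flags copied from the per-vendor commands above.
EXTRA_ARGS = {
    "ascend": [
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
        "-v", "/usr/local/Ascend:/usr/local/Ascend:ro",
        "-v", "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro",
    ],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}


def build_run_command(vendor, image, data_disk,
                      name="pytorch-gpu", devices=None):
    """Build the `docker run` argv for one vendor (hypothetical helper)."""
    cmd = ["docker", "run", "-d"]
    if vendor == "nvidia":
        cmd += ["--gpus", "all"]
    elif vendor == "amd":
        cmd += ["--device=/dev/kfd", "--device=/dev/dri",
                "--group-add", "video", "--group-add", "render"]
    elif vendor in NUMBERED_DEVICE_GLOBS:
        if devices is None:
            # Enumerate device nodes present on this host.
            devices = sorted(glob.glob(NUMBERED_DEVICE_GLOBS[vendor]))
        cmd += [f"--device={dev}" for dev in devices]
        cmd += EXTRA_ARGS[vendor]
    else:
        raise ValueError(f"unsupported vendor: {vendor}")
    cmd += ["--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]
    return cmd
```

Returning a list rather than a shell string avoids quoting problems when the result is passed to `subprocess.run`.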
Execute the docker run command. If a container with the same name already exists:
Copy and run validation script inside container:
docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py
Expected output:
{
"status": "PASS",
"backend": "npu",
"device_count": 8,
"device_names": ["Ascend 910B", ...],
"tests": {
"device_detection": true,
"tensor_creation": true,
"matrix_multiply": true,
"gpu_to_cpu_transfer": true
}
}
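A report of this shape can be checked mechanically before summarizing it. The sketch below assumes the script prints valid JSON with the `status` and `tests` keys shown above; `failed_tests` is a hypothetical helper, not part of validate_pytorch.py.

```python
import json


def failed_tests(report_text):
    """Return the names of failing checks in a validation report.

    An empty list means the report passed. A non-PASS status with no
    individual failing test is reported as the pseudo-test "status".
    """
    report = json.loads(report_text)
    failures = [name for name, ok in report.get("tests", {}).items()
                if not ok]
    if report.get("status") != "PASS" and not failures:
        failures.append("status")
    return failures


if __name__ == "__main__":
    sample = json.dumps({
        "status": "FAIL",
        "backend": "npu",
        "device_count": 8,
        "tests": {"device_detection": True, "matrix_multiply": False},
    })
    print(failed_tests(sample))
```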
Summarize to user:
docker exec -it pytorch-gpu bash

| Error | Action |
|---|---|
| No GPU detected | Ask user for vendor or check drivers |
| Image pull fails | Try alternative registry or web search |
| Container start fails | Check device permissions, show error |
| Validation fails | Show detailed error, suggest fixes |
- references/gpu-detection.md - Detection methods by vendor
- references/image-sources.md - Image discovery guide (registry APIs, priority order, selection criteria)
- references/mount-requirements.md - Vendor mount specifications

User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601