Checks and compares software component versions across SageMaker HyperPod cluster nodes including NVIDIA drivers, CUDA, cuDNN, NCCL, EFA, OFI NCCL, GDRCopy, MPI, Neuron SDK, Python, and PyTorch. Useful for verifying compatibility, detecting mismatches, and planning upgrades.
Slash command: `/sagemaker-ai:hyperpod-version-checker`
Upload `hyperpod_check_versions.sh` to the cluster nodes via the `hyperpod-ssm` skill, then execute it.
# Text report to console + file
bash hyperpod_check_versions.sh
# JSON only to stdout (text report still saved to file) — best for piping/parsing
bash hyperpod_check_versions.sh --json
# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt
# No color (for logging)
bash hyperpod_check_versions.sh --no-color
Default output file: `component_versions_<hostname>_<timestamp>.txt`
| Component | Detection Method | Applicable When |
|---|---|---|
| NVIDIA Driver | nvidia-smi | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | nvcc, /usr/local/cuda symlink | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | /opt/amazon/efa_installed_packages, fi_info | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | efa_installed_packages, library search | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | mpirun, /opt/amazon/openmpi | Distributed training |
| Neuron SDK | neuronx-cc, neuron-ls, packages | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | python3, torch import | ML workloads |
| Container runtime | docker, containerd, kubectl, nvidia-ctk | EKS clusters |
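Before running the full check, it can help to see which of the probe commands from the table exist on a node at all, since that indicates which rows apply. The following is a minimal sketch and not the skill's own implementation; `probe_tools` is a hypothetical helper name, and it only tests whether each command is on `PATH`.

```bash
# Report, for each command name given, whether it is available on PATH.
# probe_tools is a hypothetical helper, not part of hyperpod_check_versions.sh.
probe_tools() (
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: found\n' "$tool"
        else
            printf '%s: missing\n' "$tool"
        fi
    done
)

# Example call using probe commands named in the table above:
# probe_tools nvidia-smi nvcc fi_info mpirun neuron-ls python3 docker
```

A missing `nvidia-smi` on a Trainium node, for example, is expected and means the GPU rows simply do not apply there.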
Run the script on each node individually via the `hyperpod-ssm` skill. With `--json`, stdout is clean JSON for easy diffing.
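Once the JSON from two nodes has been saved locally, a diff along these lines can surface mismatches. This is a sketch under assumptions: `diff_node_reports` is a hypothetical helper, the file names are placeholders, and `python3` is assumed to be available for normalizing key order with the standard `json.tool` module. The JSON schema is not assumed; any per-node fields in the report (such as a hostname or timestamp, if present) will show up as differences.

```bash
# Diff two --json reports after normalizing key order and indentation.
# Exit status: 0 = identical, 1 = different, 2 = error (unreadable or invalid JSON).
# diff_node_reports is a hypothetical helper, not part of the skill.
diff_node_reports() (
    tmp=$(mktemp -d) || exit 2
    python3 -m json.tool --sort-keys "$1" > "$tmp/a.json" &&
    python3 -m json.tool --sort-keys "$2" > "$tmp/b.json" ||
        { rm -rf "$tmp"; exit 2; }
    status=0
    diff -u "$tmp/a.json" "$tmp/b.json" || status=$?
    rm -rf "$tmp"
    exit "$status"
)

# Example (placeholder file names):
# diff_node_reports node1.json node2.json
```

Sorting keys first keeps the diff focused on value changes rather than on ordering.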
The script automatically analyzes CUDA/driver compatibility. For reference:
| Driver Series | Supported CUDA |
|---|---|
| 580+ | 13.x, 12.x, 11.x |
| 570+ | 12.8+ (Blackwell), 12.x, 11.x |
| 545+ | 12.3-12.7, 11.x |
| 525-535 | 12.0-12.2, 11.x |
| 450+ | 11.x only |
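The reference table can be encoded as a small lookup for quick manual checks. This is a sketch that mirrors the table row by row and is not the script's actual analysis logic; `supported_cuda_for_driver` is a hypothetical name, and treating every driver major version from 525 up to 544 as the "525-535" row is an assumption.

```bash
# Print the CUDA versions the reference table lists for a driver version.
# Accepts a full version string such as "535.104.05"; only the major part is used.
# supported_cuda_for_driver is a hypothetical helper, not part of the skill.
supported_cuda_for_driver() (
    major=${1%%.*}
    case $major in
        ''|*[!0-9]*) echo "unknown"; exit 1 ;;
    esac
    if   [ "$major" -ge 580 ]; then echo "13.x, 12.x, 11.x"
    elif [ "$major" -ge 570 ]; then echo "12.8+ (Blackwell), 12.x, 11.x"
    elif [ "$major" -ge 545 ]; then echo "12.3-12.7, 11.x"
    elif [ "$major" -ge 525 ]; then echo "12.0-12.2, 11.x"   # assumption: covers 525 through 544
    elif [ "$major" -ge 450 ]; then echo "11.x only"
    else echo "unknown"; exit 1
    fi
)

# Example, reading the driver version from the first GPU:
# supported_cuda_for_driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```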
NCCL: Use 2.18+ for CUDA 12.x and 2.12+ for CUDA 11.x. The NCCL version must be consistent across all nodes.
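The NCCL guidance can be checked mechanically with a version comparison. The sketch below assumes `sort -V` (version sort, as provided by GNU coreutils) is available; `version_ge` and `nccl_ok_for_cuda` are hypothetical helper names, and the minimums are taken directly from the sentence above.

```bash
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V`; hypothetical helper, not part of the skill.
version_ge() (
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
)

# Succeed if the NCCL version ($1) meets the minimum for the CUDA version ($2).
# Exit status 2 means the CUDA major version has no guidance here.
nccl_ok_for_cuda() (
    cuda_major=${2%%.*}
    case $cuda_major in
        12) min=2.18 ;;
        11) min=2.12 ;;
        *)  echo "no NCCL guidance for CUDA $2" >&2; exit 2 ;;
    esac
    version_ge "$1" "$min"
)

# Example:
# nccl_ok_for_cuda 2.18.3 12.2 && echo "NCCL version OK"
```

Version sort is used instead of a plain string comparison so that, for instance, 2.9 is correctly ordered before 2.12.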
| EFA Installer | AWS OFI NCCL |
|---|---|
| 1.29+ | v1.7.3+ (recommended) |
| 1.26-1.28 | v1.7.0-v1.7.2 |
| 1.20-1.25 | v1.6.0+ |
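The EFA pairing table can be expressed the same way. This sketch mirrors the table only; `ofi_nccl_for_efa` is a hypothetical helper name, and installer versions outside the listed 1.20+ range are reported as unknown rather than guessed.

```bash
# Print the AWS OFI NCCL range the table pairs with an EFA installer version.
# Accepts versions such as "1.29.1"; only major and minor are used.
# ofi_nccl_for_efa is a hypothetical helper, not part of the skill.
ofi_nccl_for_efa() (
    ver=$1
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    case "$major$minor" in
        ''|*[!0-9]*) echo "unknown"; exit 1 ;;
    esac
    if [ "$major" -ne 1 ]; then
        echo "unknown"; exit 1          # table only covers the 1.x installer line
    elif [ "$minor" -ge 29 ]; then echo "v1.7.3+"
    elif [ "$minor" -ge 26 ]; then echo "v1.7.0-v1.7.2"
    elif [ "$minor" -ge 20 ]; then echo "v1.6.0+"
    else echo "unknown"; exit 1
    fi
)

# Example:
# ofi_nccl_for_efa 1.27.0
```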
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai