From tao-skill-bank
Runs TAO SDK jobs as Docker containers on a local or remote Docker daemon with NVIDIA GPU runtime. Use for development, debugging, or submitting jobs to a remote GPU box via DOCKER_HOST.
Slash command: /tao-skill-bank:tao-run-on-local-docker
Single-node execution platform that runs TAO jobs as named Docker containers on
a Docker daemon. The daemon can be local to the agent host or remote through
DOCKER_HOST=ssh://user@host / a Docker context. It is useful for development,
debugging, small runs, and workflows where a local coding agent submits jobs to
a remote GPU box.
Use local Docker when the data is local to the Docker host or accessible through mounted volumes/cloud credentials. Do not use it for remote cluster scheduling, multi-node training, or jobs that need SLURM queueing.
Use remote Docker when the agent is running on a workstation or laptop but the Docker daemon and GPUs are on another single GPU server. In remote Docker mode, all local filesystem paths in specs are interpreted on the remote Docker host, not on the agent machine.
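As a rough illustration of the local versus remote distinction above, the sketch below classifies the daemon from DOCKER_HOST alone. The function names are hypothetical and not part of the TAO SDK; the sketch does not inspect Docker contexts, and it treats any tcp:// or ssh:// endpoint as remote, which is an assumption (a tcp:// endpoint can still point at the same machine).

```python
import os


def docker_target(env=None):
    """Guess whether the Docker daemon is local or remote from DOCKER_HOST.

    Hypothetical helper: Docker contexts are not inspected here.
    """
    env = os.environ if env is None else env
    host = env.get("DOCKER_HOST", "").strip()
    # Unset, or a local socket / named pipe: the daemon shares the agent's filesystem.
    if not host or host.startswith(("unix://", "npipe://")):
        return "local-docker"
    # Assumption: ssh:// and tcp:// endpoints refer to another machine.
    return "remote-docker"


def path_owner(env=None):
    """Name the machine on which filesystem paths in a spec are resolved."""
    if docker_target(env) == "local-docker":
        return "agent host"
    return "remote Docker host"


if __name__ == "__main__":
    print(docker_target({"DOCKER_HOST": "ssh://user@host"}), "->",
          path_owner({"DOCKER_HOST": "ssh://user@host"}))
```

A check like this is mainly useful for wording the launch review: it tells the user which machine a path such as /data/train will be looked up on.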
The workflow must verify the host GPU runtime before starting Docker jobs. If the check fails, prompt the user to approve the install, run the printed install command, and rerun the preflight.
# Host GPU runtime: NVIDIA driver 580, CUDA 13.0, NVIDIA Container Toolkit 1.19.0.
TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
bash "$SETUP_SCRIPT" --backend docker --check-only || {
    echo "MISSING: TAO GPU host runtime is not ready."
    echo "After user approval, run:"
    echo " bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
    exit 1
}
# Mode 1 — direct docker (no Python). All you need is docker + the GPU runtime.
docker info >/dev/null 2>&1 || { echo "MISSING: docker daemon not reachable. Start Docker."; exit 1; }
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi >/dev/null 2>&1 || {
    echo "MISSING: NVIDIA Container Toolkit not installed/configured. See:"
    echo " bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
    exit 1
}
# Mode 2 — TAO SDK wrapper. Adds Job handles, S3 I/O wrapping, ActionWorkflow.
# Skip this block if Mode 1 is sufficient for the user's request.
# When Mode 2 is in scope, read `tao-skill-bank:tao-run-platform` for the DockerSDK
# kwarg contract, build_entrypoint, and monitoring patterns.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_docker).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_docker)
python -c "import tao_sdk" 2>/dev/null || python -m pip install "$PIN"
python -c "import docker" 2>/dev/null || python -m pip install "$PIN"
python -c "import tao_sdk, docker"
# DockerSDK attaches every job container to ${DOCKER_NETWORK:-tao_default}.
# Create the network if it is missing; the operation is local and idempotent.
DOCKER_NETWORK_NAME="${DOCKER_NETWORK:-tao_default}"
docker network inspect "$DOCKER_NETWORK_NAME" >/dev/null 2>&1 || \
    docker network create "$DOCKER_NETWORK_NAME" >/dev/null
If a check fails, the agent prompts the user to authorize the install/fix via Bash before proceeding. Pip-installable Python requirements and Docker network creation above are exceptions: install/create them automatically, then rerun preflight.
There are no platform credentials required beyond access to the Docker daemon.
Optional environment:
- DOCKER_HOST: address of a remote Docker daemon, used with the remote-docker platform option.
- DOCKER_NETWORK: network that job containers attach to; defaults to tao_default.
- NGC_KEY: key used with the $oauthtoken user for NGC when pulling private images from nvcr.io.
Before generating scripts or starting containers:
- For a remote daemon, check GPUs with docker run ... nvidia-smi against the remote daemon; do not use local nvidia-smi from the agent machine.
- For s3:// datasets/results, verify ACCESS_KEY and SECRET_KEY are set and the exact paths are readable with aws s3 ls. If aws is missing, report the missing dependency and ask before installing it; rerun preflight after installation.
- If the job needs Hugging Face access, verify HF_TOKEN before launch.
- Check GPU occupancy with nvidia-smi and avoid GPUs already used by other running jobs when the user requested that constraint. Show the selected GPU ids in the launch review.
Use the packaged helper for these checks when possible:
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/check_tao_launch_preflight.py \
--platform local-docker \
--container-image "<selected-image>" \
--path train_annotation=/abs/path/to/annotations.json \
--path train_media=/abs/path/to/media
For a remote Docker daemon, use the remote-docker platform and pass or export
DOCKER_HOST. The helper verifies remote GPU/runtime readiness and checks
remote-host dataset paths through read-only bind mounts:
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/check_tao_launch_preflight.py \
--platform remote-docker \
--docker-host ssh://user@gpu-host \
--container-image "<selected-image>" \
--gpu-smoke-image ubuntu:22.04 \
--path train_annotation=/remote/data/train/annotations.json \
--path train_media=/remote/data/train
The --path values above must exist on the remote Docker host. Do not pass
paths that exist only on the local laptop or Codex host.
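For agents that assemble the helper invocation programmatically, the following sketch builds the argument list from a mapping of path names to host paths. It uses only the flags shown in the two commands above; the function name `preflight_argv` and the absolute-path check are assumptions made for illustration, not part of the skill bank.

```python
import os
import posixpath


def preflight_argv(platform, image, paths, docker_host=None, bank_path=None):
    """Build argv for check_tao_launch_preflight.py (hypothetical wrapper)."""
    bank_path = (
        bank_path
        or os.environ.get("TAO_SKILL_BANK_PATH")
        or os.path.expanduser("~/tao-skills-external")
    )
    argv = [
        posixpath.join(bank_path, "scripts", "check_tao_launch_preflight.py"),
        "--platform", platform,
        "--container-image", image,
    ]
    # --docker-host is optional because DOCKER_HOST may be exported instead.
    if docker_host:
        argv += ["--docker-host", docker_host]
    for name, path in paths.items():
        # Assumption: paths are resolved on the (Linux) Docker host, so require absolute POSIX paths.
        if not posixpath.isabs(path):
            raise ValueError(f"{name}: expected an absolute path on the Docker host, got {path!r}")
        argv += ["--path", f"{name}={path}"]
    return argv


if __name__ == "__main__":
    print(preflight_argv(
        "remote-docker",
        "<selected-image>",
        {"train_media": "/remote/data/train"},
        docker_host="ssh://user@gpu-host",
    ))
```

The returned list can be passed to `subprocess.run` without a shell, which avoids quoting problems in path values.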
Multi-node is not supported on local Docker. One job runs on the local Docker daemon's host with no cross-host coordination.
Multi-GPU on the local host is supported via the NVIDIA Container Toolkit's --gpus flag (--gpus all or --gpus '"device=0,1,2,3"'). DockerSDK.create_job(gpu_count=N) plumbs through to --gpus. Single-host distributed init uses localhost; torchrun --nproc-per-node=N or PyTorch DDP work as usual.
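The two --gpus forms mentioned above can be produced from either a device list or a count. This is a minimal sketch with a hypothetical function name; it is not how DockerSDK builds its request. Note that when the arguments are passed as a list (no shell), the double quotes around the device list must be part of the argument itself so that Docker reads the comma-separated ids as one field.

```python
def gpus_args(device_ids=None, count=None):
    """Return docker run arguments for GPU selection (illustrative helper)."""
    if device_ids:
        ids = ",".join(str(i) for i in device_ids)
        # Literal double quotes keep "device=0,1" together as a single field.
        return ["--gpus", f'"device={ids}"']
    if count is None:
        # No explicit request: expose every GPU on the host.
        return ["--gpus", "all"]
    if count <= 0:
        # CPU-only container: omit the flag entirely.
        return []
    return ["--gpus", str(count)]


if __name__ == "__main__":
    cmd = ["docker", "run", "--rm", *gpus_args(device_ids=[0, 1]), "ubuntu", "nvidia-smi"]
    print(cmd)
```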
Use the SDK backend value local-docker. The local backend schema has no extra
backend details, so most routing is controlled by environment and job
parameters:
{
  "backend_type": "local-docker",
  "num_gpu": 1
}
Following the Brev SDK design, platform/control-plane values stay in SDK
state and Docker labels. The SDK does not inject BACKEND, HOST_PLATFORM,
MONGOSECRET, DOCKER_HOST, or DOCKER_NETWORK into the training container.
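To make the separation concrete, the sketch below filters the control-plane names listed above out of an environment mapping before it would be handed to a container. This only illustrates the rule; the SDK's actual mechanism is not shown in this document, and the function name is hypothetical.

```python
# Names taken from the paragraph above; they stay in SDK state and Docker labels.
CONTROL_PLANE_KEYS = frozenset(
    {"BACKEND", "HOST_PLATFORM", "MONGOSECRET", "DOCKER_HOST", "DOCKER_NETWORK"}
)


def container_env(env):
    """Return a copy of env without control-plane variables (illustrative)."""
    return {key: value for key, value in env.items() if key not in CONTROL_PLANE_KEYS}


if __name__ == "__main__":
    print(container_env({"DOCKER_HOST": "ssh://user@host", "HF_TOKEN": "example"}))
```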
The TAO SDK local Docker handler starts containers through the Docker Python client:
- Container names follow the tao-job-<job_id> form used by SDK handlers.
- The job command is wrapped as ["/bin/bash", "-c", "<job command>"].
- Containers are removed automatically when DOCKER_AUTO_REMOVE=true.
- /dev/shm is mounted as tmpfs.
For GPU access, the handler auto-detects the host type and takes one of two paths:
- runtime="nvidia" plus NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES=all.
- device_requests with GPU capabilities.
If num_gpus is 0, no GPUs are assigned. If num_gpus is -1, all visible
GPUs are requested. Prefer explicit GPU counts for shared development machines.
When explicit device ids are available, prefer them over count-only selection
on shared machines so the launch does not steal GPUs occupied by other tasks.
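One way to pick explicit device ids is to parse the output of `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` and keep the GPUs with little memory in use. The sketch below is an assumption-laden heuristic, not the SDK's selection logic: the function names and the 100 MiB idle threshold are hypothetical, and memory use is only a proxy for "occupied by another job".

```python
def idle_gpu_ids(csv_text, max_used_mib=100):
    """Return indices of GPUs whose used memory is at or below the threshold.

    csv_text is expected to hold lines like "0, 40213" (index, MiB used).
    """
    idle = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        index, used = fields
        if not (index.isdigit() and used.isdigit()):
            continue  # e.g. "[N/A]": treat as unknown, not idle
        if int(used) <= max_used_mib:
            idle.append(int(index))
    return idle


def pick_gpus(csv_text, wanted, max_used_mib=100):
    """Pick the first `wanted` idle GPUs, or fail loudly if too few are idle."""
    idle = idle_gpu_ids(csv_text, max_used_mib)
    if len(idle) < wanted:
        raise RuntimeError(f"requested {wanted} GPUs but only {len(idle)} look idle: {idle}")
    return idle[:wanted]


if __name__ == "__main__":
    sample = "0, 40213\n1, 3\n2, 15\n3, 22110\n"
    print(pick_gpus(sample, 2))
```

For a remote daemon the query has to run on the remote host (for example inside a container started against that daemon), in line with the preflight guidance earlier in this document.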
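Local Docker accepts local and file:// paths because the container runs on the
same Docker host. Make sure every path in the spec either exists on the Docker host and is mounted into the job container, or is reachable from inside the container through cloud credentials.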
For remote/shared filesystems, prefer the platform that owns that filesystem.
For example, use SLURM plus lustre:///... for Lustre paths on a cluster.
Logs are read from the named container (docker logs tao-job-<job_id>). If the container has exited, died, is being removed, or cannot be found, status reconciliation treats the backend process as terminated.
Cancellation stops the named container. GPU ownership is managed by Docker / the NVIDIA runtime, not by TAO Core's local GPU manager.
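The reconciliation rule above can be expressed as a small predicate over the container state string reported by Docker (created, running, paused, restarting, removing, exited, dead). This is a sketch of the rule as described here, with a hypothetical function name; `None` stands for a container that cannot be found.

```python
# Docker container states that the rule above maps to "terminated".
TERMINAL_STATES = frozenset({"exited", "dead", "removing"})


def backend_terminated(state):
    """Return True when the backend process should be treated as terminated.

    state is the Docker container state string, or None if the container
    was not found.
    """
    if state is None:
        return True
    return state.lower() in TERMINAL_STATES


if __name__ == "__main__":
    for state in ("running", "exited", None):
        print(state, backend_terminated(state))
```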
If you want Job handles, S3 I/O wrapping via the SDK's script_runner, or
durability across sessions:
from tao_sdk.platforms.docker import DockerSDK

sdk = DockerSDK()  # reads DOCKER_HOST, NGC_KEY, S3 creds from env
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
status = sdk.get_job_status(job.id)
logs = sdk.get_job_logs(job.id, tail=200)
This wraps the same docker run invocation under a Job handle and routes
the entrypoint through script_runner so inputs/outputs get downloaded
from / uploaded to S3 automatically. If you don't need those, just use
docker run directly — no SDK install required.
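Before handing an `inputs` mapping like the one in the example to the SDK, it can help to sanity-check its shape. The sketch below assumes the shape shown in the example (absolute container path mapped to an s3:// URI); the validation rules and function names are hypothetical and are not taken from the SDK or script_runner.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def check_inputs(inputs):
    """Validate a {container_path: s3_uri} mapping and return (path, bucket, key) tuples."""
    checked = []
    for container_path, uri in inputs.items():
        # Assumption: paths inside the container are absolute POSIX paths.
        if not posixpath.isabs(container_path):
            raise ValueError(f"container path must be absolute: {container_path!r}")
        bucket, key = split_s3_uri(uri)
        checked.append((container_path, bucket, key))
    return checked


if __name__ == "__main__":
    print(check_inputs({"/data/train.json": "s3://bucket/coco/train.json"}))
```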
Docker client not initialized: Verify the Docker Python package is installed,
set DOCKER_HOST if you are not using the default local socket, and confirm the
process can talk to the daemon.
GPU assignment failed: Requested GPUs are unavailable, the NVIDIA Container
Toolkit is not configured, or the Docker daemon cannot create GPU device
requests. Use fewer GPUs, wait for another job to finish, or verify
docker run --gpus ... works on the host.
Image pull auth failed: Set a valid NGC_KEY for private nvcr.io images
or run docker login nvcr.io -u '$oauthtoken' on the Docker host.
Container exited unexpectedly: Check docker logs tao-job-<job_id>, the
configured DOCKER_NETWORK, and the command produced by the SDK action runner.
Path missing inside container: A local path on the host is not necessarily mounted into the job container. Use a path convention supported by the action runner or configure an explicit volume through the surrounding service.
npx claudepluginhub nvidia-tao/tao-skills-bank --plugin tao-skills