Provisions on-demand/reserved GPU clusters (H100/H200/B200) on Together AI with Kubernetes/Slurm orchestration, shared storage, and scaling for ML/HPC multi-node jobs.
npx claudepluginhub togethercomputer/skills
This skill uses the workspace's default tool permissions.
Use Together AI GPU clusters when the user needs infrastructure control instead of a managed inference product.
Typical fits: multi-node ML/HPC jobs that need Kubernetes or Slurm orchestration, shared storage, and direct control of on-demand or reserved H100/H200/B200 clusters.

Prefer a managed alternative when it covers the need:
- together-dedicated-endpoints for managed single-model hosting
- together-dedicated-containers for containerized inference without owning the full cluster
- together-sandboxes for short-lived remote Python execution
- together-fine-tuning for managed training jobs instead of raw cluster operations

Operational notes:
- Cluster operations require the Together Python SDK v2 (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
- Prefer shared_volume over creating a volume separately and attaching it via volume_id. Separately created volumes may land in a different datacenter partition than the cluster, causing a "does not exist in the datacenter" error even when the volume shows as available (see the first sketch below).
- Call list_regions() first and be prepared to try multiple regions.
- Provide cuda_version and nvidia_driver_version as separate fields in addition to the combined driver_version string. Pass them via extra_body in the Python SDK (see the second sketch below).
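A minimal sketch of the region and shared-volume guidance, using the Together Python SDK. Only Together(), list_regions(), shared_volume, and volume_id come from the notes above; the clusters namespace, the create() call, and every other field name are assumptions about the API shape, so verify them against the SDK reference before relying on them.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Assumption: a clusters namespace exposing list_regions() and create().
# The real SDK's namespace and argument names may differ.
regions = client.clusters.list_regions()

cluster = None
for region in regions:
    try:
        cluster = client.clusters.create(
            name="example-training-cluster",   # hypothetical name
            gpu_type="h100",                   # hypothetical field/value
            num_nodes=2,                       # hypothetical field
            region=region,
            # Prefer shared_volume so storage is provisioned together with the
            # cluster, in the same datacenter partition, instead of attaching a
            # separately created volume via volume_id.
            shared_volume={"size_gb": 1024},   # hypothetical shape
        )
        break
    except Exception:
        # Capacity varies by region; fall back to the next one. In real code,
        # catch the SDK's specific capacity error rather than bare Exception.
        continue

if cluster is None:
    raise RuntimeError("No region had capacity for the requested cluster.")
```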
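A similar sketch for pinning CUDA and NVIDIA driver versions. The extra_body pass-through and the field names cuda_version and nvidia_driver_version are the ones named above; the create() call, the other fields, and the example version strings are assumptions for illustration only.

```python
from together import Together

client = Together()

# Sketch only: pass the separate CUDA / NVIDIA driver fields through
# extra_body, in addition to whatever combined driver_version string the
# API accepts; extra_body forwards fields the typed client does not model.
cluster = client.clusters.create(          # hypothetical method, as above
    name="pinned-driver-cluster",          # hypothetical name
    gpu_type="h200",                       # hypothetical field/value
    num_nodes=1,                           # hypothetical field
    extra_body={
        "cuda_version": "12.4",            # example values, not verified defaults
        "nvidia_driver_version": "550.90.07",
    },
)
```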