From vastai-pack
Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".
npx claudepluginhub flight505/skill-forge --plugin vastai-packThis skill is limited to using the following tools:
Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.
Guides Next.js Cache Components and Partial Prerendering (PPR): 'use cache' directives, cacheLife(), cacheTag(), revalidateTag() for caching, invalidation, static/dynamic optimization. Auto-activates on cacheComponents: true.
Guides building MCP servers enabling LLMs to interact with external services via tools. Covers best practices, TypeScript/Node (MCP SDK), Python (FastMCP).
Share bugs, ideas, or general feedback.
Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.
>= 0.98 for production jobsinet_down >= 200 for data transferdph_total set in search queries#!/bin/bash
set -euo pipefail
echo "Vast.ai Production Readiness Check"
# 1. Auth
vastai show user --raw | python3 -c "
import sys, json; u=json.load(sys.stdin)
balance = u.get('balance', 0)
print(f' Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'
" && echo " Balance: PASS" || echo " Balance: FAIL"
# 2. Offer availability
COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")
echo " Offers available: $COUNT+ | PASS"
# 3. Docker image pullable
docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 && echo " Docker image: PASS" || echo " Docker image: FAIL"
echo "Pre-flight checks complete."
| Error | Cause | Solution |
|---|---|---|
| Insufficient balance | Credits depleted mid-job | Set up auto-top-up or balance alerts |
| Instance preempted during final epoch | Spot instance reclaimed | Use on-demand for final training stage |
| Checkpoint corrupted | Interrupted mid-save | Implement atomic checkpoint writes (save to temp, rename) |
| GPU utilization drops to 0% | Data pipeline bottleneck | Profile data loading; increase disk I/O |
For version upgrades, see vastai-upgrade-migration.
Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.
Budget-safe launch: Set max_dph=2.00, auto-destroy timeout of 12 hours, and daily spend alert at $50 to prevent cost overruns.