From vastai-pack
Executes Vast.ai GPU incident response: triages failures/outages, handles spot preemptions/training crashes, provisions replacements using CLI bash scripts and SSH.
Install: `npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin vastai-pack`
Rapid incident response procedures for Vast.ai GPU instance failures. Covers triage, mitigation, recovery, and postmortem for common incident types: spot preemption, instance crashes, GPU failures, and billing issues.
## Triage

```bash
#!/bin/bash
set -euo pipefail

echo "=== INCIDENT TRIAGE ==="
echo "Time: $(date -u)"

# 1. Check all instances
echo -e "\n--- Instance Status ---"
vastai show instances --raw | python3 -c "
import sys, json
for inst in json.load(sys.stdin):
    status = inst.get('actual_status', '?')
    flag = 'ALERT' if status in ('error', 'exited', 'offline') else 'OK'
    print(f' [{flag}] ID:{inst[\"id\"]} Status:{status} '
          f'GPU:{inst.get(\"gpu_name\",\"?\")} \${inst.get(\"dph_total\",0):.3f}/hr')
"

# 2. Check if affected instance has recent logs
echo -e "\n--- Recent Logs (last 20 lines) ---"
vastai logs ${INSTANCE_ID:-0} --tail 20 2>/dev/null || echo "No logs available"

# 3. Check account balance
echo -e "\n--- Account ---"
vastai show user --raw | python3 -c "import sys,json; u=json.load(sys.stdin); print(f'Balance: \${u.get(\"balance\",0):.2f}')"
```
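Save the script as `triage.sh` and point it at the affected instance via the `INSTANCE_ID` environment variable (the `0` default simply falls through to the "No logs available" branch):

```bash
INSTANCE_ID=1234567 bash triage.sh   # 1234567 is a placeholder instance ID
```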
## Spot Preemption

Symptoms: Instance status changes from running to exited or offline without user action.

```bash
# 1. Verify preemption (not user error)
vastai show instance $ID --raw | python3 -c "
import sys, json; i=json.load(sys.stdin)
print(f'Status: {i.get(\"actual_status\")}')
print(f'Status msg: {i.get(\"status_msg\", \"none\")}')
"

# 2. Check if checkpoint was saved
# (depends on your checkpoint storage: S3, GCS, etc.)
aws s3 ls s3://bucket/checkpoints/ --recursive | tail -5

# 3. Provision replacement instance
vastai search offers "gpu_name=${GPU_NAME} reliability>0.98 rentable=true" \
  --order dph_total --limit 3

# 4. Create replacement and resume from checkpoint (see sketch below)
vastai create instance $NEW_OFFER_ID --image $IMAGE --disk 50
```
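A minimal replacement-and-resume sketch, combining steps 3 and 4. Everything here is an assumption rather than a fixed recipe: checkpoints sync to the hypothetical `s3://bucket/checkpoints/` path above, `train.py --resume-from latest` matches your training code, and your vastai CLI version supports `--onstart-cmd` for a boot-time command:

```bash
# Hedged sketch: provision a replacement that resumes training at boot.
# Assumptions: s3://bucket/checkpoints/ (hypothetical path), train.py with a
# --resume-from flag, and CLI support for --onstart-cmd.
vastai create instance $NEW_OFFER_ID --image $IMAGE --disk 50 \
  --onstart-cmd 'cd /workspace \
    && aws s3 sync s3://bucket/checkpoints/ checkpoints/ \
    && nohup python train.py --resume-from latest > train.log 2>&1 &'
```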
## Training Crash

Symptoms: Instance running but training process exited with error.

```bash
# 1. SSH in and check logs
ssh -p $PORT root@$HOST "tail -100 /workspace/train.log 2>/dev/null || echo 'No log file'"

# 2. Common causes
ssh -p $PORT root@$HOST << 'CHECK'
# GPU memory issue? (CUDA OOM shows up in the training log and dmesg, not nvidia-smi)
grep -i "out of memory" /workspace/train.log 2>/dev/null && echo "OOM detected"
dmesg 2>/dev/null | grep -iE "xid|out of memory" | tail -5
# Disk full?
df -h /workspace | tail -1
# Process still running?
ps aux | grep python | grep -v grep
CHECK

# 3. Restart training from checkpoint
ssh -p $PORT root@$HOST "cd /workspace && python train.py --resume-from latest"
```
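When the same crash recurs (flaky data loading, transient errors), a crude supervisor loop keeps the job moving between manual diagnoses. A minimal sketch, assuming the same hypothetical `--resume-from latest` interface as above:

```bash
# Hedged sketch: restart-on-crash wrapper, run inside the instance.
# Assumes train.py checkpoints periodically and --resume-from latest
# (hypothetical flag) picks up the newest checkpoint.
while true; do
  python train.py --resume-from latest >> train.log 2>&1
  code=$?
  [ "$code" -eq 0 ] && break                      # clean exit: training done
  echo "train.py exited $code; restarting in 30s" >> train.log
  sleep 30                                        # back off before retrying
done
```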
## GPU Failure

Symptoms: nvidia-smi fails, CUDA errors, or ECC memory errors.

```bash
# 1. Check GPU health
ssh -p $PORT root@$HOST "nvidia-smi" || echo "GPU not responding"

# 2. This is a host-level failure: you cannot fix it.
# Destroy the instance and provision on a different host
# (record the host details first; see the sketch below).
vastai destroy instance $ID

# 3. Report the host to Vast.ai support
echo "Report host ID to Vast.ai support for investigation"
```
## Billing Emergency (Kill Switch)

```bash
# Stop all billing immediately
echo "EMERGENCY: Destroying all instances"
vastai show instances --raw | python3 -c "
import sys, json, subprocess
for inst in json.load(sys.stdin):
    if inst.get('actual_status') in ('running', 'loading'):
        subprocess.run(['vastai', 'destroy', 'instance', str(inst['id'])])
        print(f'Destroyed instance {inst[\"id\"]}')
"
```
## Incident Report
- **Date**: YYYY-MM-DD
- **Duration**: X hours
- **Impact**: N instances affected, $X cost
- **Root cause**: [spot preemption / OOM / disk full / GPU failure]
- **Resolution**: [replaced instance / increased VRAM / expanded disk]
- **Prevention**: [higher reliability filter / checkpoints / auto-recovery]
## MTTR Targets

| Incident | MTTR Target | Recovery |
|---|---|---|
| Spot preemption | < 10 min | Auto-provision replacement, resume from checkpoint |
| Training crash | < 5 min | SSH in, diagnose, restart from checkpoint |
| GPU failure | < 15 min | Destroy instance, provision on different host |
| Billing emergency | < 1 min | Destroy all instances immediately |
For data handling and security, see vastai-data-handling.
Auto-recovery script: Run the event poller from vastai-webhooks-events with an auto-recovery handler that provisions a replacement within 5 minutes of preemption; a minimal poller sketch follows.
Kill switch: `vastai destroy instance` takes a single ID, so keep the kill-switch loop above (or the `vast_killswitch` function) aliased for emergency billing stops.
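A minimal poller sketch for the auto-recovery path; `GPU_NAME` and `IMAGE` are assumed to be exported as in the preemption runbook, and checkpoint resume is left to the instance's onstart command (the full event poller lives in vastai-webhooks-events):

```bash
# Hedged sketch: poll for dead instances and provision replacements.
# Assumes exported GPU_NAME and IMAGE, as in the runbook above.
while true; do
  vastai show instances --raw | python3 -c "
import sys, json
for i in json.load(sys.stdin):
    if i.get('actual_status') in ('exited', 'offline'):
        print(i['id'])
" | while read -r dead_id; do
        echo "Instance $dead_id down; provisioning replacement"
        offer=$(vastai search offers "gpu_name=${GPU_NAME} reliability>0.98 rentable=true" \
          --order dph_total --raw | python3 -c "import sys, json; print(json.load(sys.stdin)[0]['id'])")
        vastai create instance "$offer" --image "$IMAGE" --disk 50
        vastai destroy instance "$dead_id"       # stop storage billing on the dead instance
      done
  sleep 60   # poll once a minute; preemption MTTR target is < 10 min
done
```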