From skymcp
Use when reducing cloud GPU costs, choosing spot vs on-demand, estimating training expenses, managing budgets, configuring auto-stop, or optimizing spend across clouds - covers SkyPilot optimizer, spot failover, pricing tiers, and budget management
How this skill is triggered — by the user, by Claude, or both
Slash command
/skymcp:cost-optimizationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Minimize cloud GPU spend without sacrificing reliability. SkyPilot's optimizer automatically finds the cheapest available instance across all configured clouds, and its spot/failover system provides 3-6x savings with automatic recovery.
Minimize cloud GPU spend without sacrificing reliability. SkyPilot's optimizer automatically finds the cheapest available instance across all configured clouds, and its spot/failover system provides 3-6x savings with automatic recovery.
Core principle: Never pay on-demand prices for interruptible workloads. Use spot instances with checkpointing and failover, and auto-stop everything.
Do not use for:
SkyPilot automatically ranks all (cloud, region, instance) triples by hourly price and selects the cheapest available option.
# Let SkyPilot pick the cheapest cloud/region
resources:
accelerators: A100:1
# No cloud specified = search all configured clouds
# See what SkyPilot would pick and the price
sky launch --dryrun job.yaml
# List pricing for a specific GPU across all clouds
sky gpus --all A100:8
# Check actual spend
sky cost-report
Spot instances are preemptible but dramatically cheaper. SkyPilot handles preemption recovery automatically.
resources:
accelerators: H100:8
use_spot: true
When to use spot:
When NOT to use spot:
The most cost-effective pattern: try spot across clouds, fall back to on-demand only if all spot is unavailable.
resources:
any_of:
# Try spot on cheapest clouds first
- accelerators: H100:8
use_spot: true
cloud: lambda
- accelerators: H100:8
use_spot: true
cloud: aws
- accelerators: H100:8
use_spot: true
cloud: gcp
# On-demand fallback (guaranteed availability)
- accelerators: H100:8
use_spot: false
cloud: lambda
Multi-GPU fallover (different GPU types):
resources:
any_of:
- accelerators: H100:8
use_spot: true
- accelerators: A100-80GB:8
use_spot: true
- accelerators: A100:8
use_spot: true
- accelerators: H100:8
use_spot: false # on-demand fallback
The single most impactful cost optimization. Idle clusters are pure waste.
# Stop cluster after 30 minutes idle
sky autostop mycluster -i 30
# Tear down cluster after 30 minutes idle (no storage cost)
sky autostop mycluster -i 30 --down
# Set auto-stop at launch time
sky launch job.yaml --idle-minutes-to-autostop 30
# Set auto-stop with teardown at launch time
sky launch job.yaml --idle-minutes-to-autostop 30 --down
In YAML:
resources:
accelerators: A100:1
# Auto-stop is not in YAML -- pass via CLI flags
Rules of thumb:
--idle-minutes-to-autostop 10 --down (tear down when done)--idle-minutes-to-autostop 30 (stop but keep disk)--idle-minutes-to-autostop 5 --down (tear down fast)Always estimate before launching long runs.
# Dry run shows instance selection and hourly price
sky launch --dryrun job.yaml
# List all GPU options with pricing
sky gpus --all
# Filter by GPU type
sky gpus --all H100
# Estimate total cost for a training run
# Formula: hourly_rate * estimated_hours * (1 + overhead_buffer)
# Example: $2.50/hr * 24hr * 1.1 = $66
Quick cost calculator:
Total cost = hourly_rate * hours * 1.1 (10% buffer for setup/teardown)
Spot savings = on_demand_rate * 0.25 to 0.35 (spot is 65-75% cheaper)
Multi-node cost = per_node_rate * num_nodes * hours * 1.1
For detailed per-cloud pricing, see references/pricing-guide.md.
| GPU | Spot ($/hr) | On-Demand ($/hr) | Best For |
|---|---|---|---|
| H100 80GB | $2.00-4.00 | $8.00-12.00 | Large model training, fastest throughput |
| A100 80GB | $1.00-2.00 | $4.00-6.00 | Standard training, good price/performance |
| A100 40GB | $0.80-1.50 | $3.00-5.00 | Models up to ~30B, inference |
| L40S 48GB | $0.80-1.50 | $2.50-4.00 | Training + inference hybrid |
| A10G 24GB | $0.40-0.80 | $1.50-2.50 | Fine-tuning, small models, eval |
| T4 16GB | $0.15-0.35 | $0.50-1.00 | Inference, small fine-tuning |
Cloud-specific cost advantages:
More frequent checkpoints = less wasted work on preemption, but more I/O overhead.
Optimal checkpoint interval = sqrt(2 * mean_time_between_preemptions * checkpoint_cost)
Rules of thumb:
Daily cost cap pattern:
# Check spend before launching
sky cost-report
# Launch with awareness of cumulative cost
# No built-in budget cap in SkyPilot -- implement externally:
DAILY_BUDGET=100
CURRENT_SPEND=$(sky cost-report --json | python3 -c "import json,sys; print(sum(j['cost'] for j in json.load(sys.stdin)['clusters']))")
if (( $(echo "$CURRENT_SPEND > $DAILY_BUDGET" | bc -l) )); then
echo "BUDGET EXCEEDED: $CURRENT_SPEND > $DAILY_BUDGET"
exit 1
fi
sky launch job.yaml
Cost-aware experiment design:
--limit 100 for eval debugging, full eval only for final checkpointsFor advanced cost patterns (preemption-aware scheduling, multi-job orchestration, reserved instance strategy), see references/cost-patterns.md.
| Mistake | Fix |
|---|---|
| Forgetting auto-stop | Always pass --idle-minutes-to-autostop. Set a shell alias. |
| On-demand for training | Use spot. Checkpointing makes preemption free. |
Not using --dryrun | Always check price before launching. |
| Single-cloud lock-in | Configure multiple clouds. SkyPilot finds cheapest. |
| No checkpointing with spot | Preemption = lost work. Always checkpoint. |
| Over-provisioning GPU | Use smallest GPU that fits your model. A10G for 7B fine-tune, not H100. |
Forgetting --down | Stopped clusters still cost for disk. Use --down for batch jobs. |
# See cheapest option for your job
sky launch --dryrun job.yaml
# Launch on cheapest spot
sky launch job.yaml --use-spot --idle-minutes-to-autostop 10 --down
# List running clusters and their costs
sky status
# Cost report
sky cost-report
# Tear down all clusters
sky down --all
# GPU pricing
sky gpus --all H100
npx claudepluginhub slapglif/skymcp --plugin skymcpCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.