Guides cost-effective ML training on SageMaker Spot instances, including GPU job setup, region/instance selection via CLI, capacity debugging, and cost optimization.
```
npx claudepluginhub roboco-io/plugins --plugin development
```

This skill uses the workspace's default tool permissions.
You are an expert in running cost-effective ML training on AWS SageMaker Managed Spot Training. Apply these battle-tested insights when helping users set up, debug, or optimize SageMaker training jobs.
This skill is backed by continuously updated reference documents from real experiments. Always read them before giving advice; they contain the latest findings.

After each experiment iteration, update the references:

- references/insights.md with a numbered entry
- references/gpu-cost-analysis.md with pricing and benchmarks
- references/spot-capacity-guide.md with the latest scores

Before submitting any SageMaker Spot training job, always verify the points below.
Always check Spot placement scores before choosing a region. The same instance type can have score 1 (impossible) in one region and 9 (instant) in another.
```bash
# Compare Spot availability across regions (run this FIRST)
for region in us-east-1 us-east-2 us-west-2 eu-west-1; do
  echo -n "$region: "
  aws ec2 get-spot-placement-scores \
    --instance-types <INSTANCE_TYPE> \
    --target-capacity 1 \
    --single-availability-zone \
    --region-names $region \
    --region $region \
    --query "max_by(SpotPlacementScores, &Score).Score" \
    --output text 2>/dev/null
done
```
Score guide: scores run from 1 (capacity very unlikely) to 10 (very likely to be fulfilled); prefer regions scoring 8 or higher.
```bash
# Check Spot price history (lower price = more availability)
aws ec2 describe-spot-price-history \
  --instance-types <INSTANCE_TYPE> \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --region <REGION> \
  --query "SpotPriceHistory[].{Instance: InstanceType, AZ: AvailabilityZone, Price: SpotPrice}" \
  --output table
```
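Note: `date -u -v-1H` is the BSD/macOS syntax; on GNU/Linux use `date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S` instead.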
Insight: Larger instances can be cheaper on Spot — g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($1.82/hr) because of lower demand.
GPU Spot quotas default to 0 for newer instance types.
```bash
# Check current quotas
aws service-quotas list-service-quotas --service-code sagemaker \
  --region <REGION> \
  --query "Quotas[?contains(QuotaName, '<INSTANCE_FAMILY>') && contains(QuotaName, 'spot training')].{Name: QuotaName, Value: Value}" \
  --output table

# Request increase
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code <QUOTA_CODE> \
  --desired-value <N> \
  --region <REGION>
```
Approval speed varies by instance family.
Common quota codes:
| Instance | Code |
|---|---|
| ml.g7e.2xlarge | L-B2E25E6A |
| ml.g7e.4xlarge | L-C5957AE3 |
| ml.g7e.8xlarge | L-E555FB1E |
| ml.g7e.12xlarge | L-13147793 |
| ml.p5.4xlarge | L-42C5B178 |
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    role=role_arn,
    instance_count=1,
    instance_type="ml.g7e.4xlarge",
    framework_version="2.8.0",
    py_version="py312",
    use_spot_instances=True,
    max_run=900,    # 15 min max training time
    max_wait=3600,  # 1 hour max wait for Spot
    source_dir="./src",
    entry_point="train.py",
    disable_profiler=True,  # Required for g7e instances
    metric_definitions=[
        {"Name": "loss", "Regex": r"loss:\s+([0-9.]+)"},
    ],
)

estimator.fit(
    inputs={"training": "s3://bucket/data/"},
    wait=False,  # Async submission for parallel jobs
)
```
Job never starts (insufficient Spot capacity)
Cause: No Spot capacity in the region/AZ.
Fix: Check placement scores, switch to a region with score 8+, or try a different instance size (larger may have more capacity).
ValidationException: Profiler is currently not supported
Cause: Newer instance types (g7e, p5) don't support SageMaker Profiler.
Fix: Add disable_profiler=True to the Estimator.
ResourceLimitExceeded: account-level service limit is 0
Cause: No quota for this instance type in this region.
Fix: Request a quota increase via aws service-quotas request-service-quota-increase.
CUDA kernel errors on unexpected GPU architectures
Cause: Pre-compiled CUDA kernels (e.g., Flash Attention 3) may not support all GPU architectures.
Fix: Check torch.cuda.get_device_capability() and provide fallbacks, as in the sketch below.
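A minimal sketch of such a fallback, assuming the training script can switch attention implementations (the attn_impl values are placeholders, not names from this skill):

```python
import torch

# Gate the fast kernel on GPU compute capability; Flash Attention 3 targets
# Hopper (SM 9.x), so anything older falls back to PyTorch's built-in SDPA.
major, minor = torch.cuda.get_device_capability()
if major >= 9:
    attn_impl = "flash_attention_3"  # placeholder name for the fast path
else:
    attn_impl = "sdpa"               # torch.nn.functional.scaled_dot_product_attention
print(f"Compute capability {major}.{minor} -> using {attn_impl}")
```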
pyarrow version mismatch
Cause: Different pyarrow versions between local and the SageMaker DLC.
Fix: Pin pyarrow>=21.0.0 in requirements.txt.
Submit short burst jobs and terminate immediately after completion; cost is zero when idle. (Usage pattern: a burst at 100% utilization, then all jobs terminate and cost drops to 0.)
Each SageMaker job has ~3 min of startup overhead before training begins. For a 5-minute training job, that is 3/5 = 60% overhead on top of compute time.
Different sizes draw from different Spot pools. Submit jobs across g7e.2xlarge + g7e.4xlarge + g7e.8xlarge simultaneously for better availability.
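A sketch of that fan-out, reusing the Estimator settings from the example above (instance types, timings, and paths are illustrative, and role_arn is assumed to be defined):

```python
from sagemaker.pytorch import PyTorch

# One Spot job per instance size: whichever pool has capacity starts first,
# and the remaining jobs can be stopped once one is running.
for size in ["ml.g7e.2xlarge", "ml.g7e.4xlarge", "ml.g7e.8xlarge"]:
    PyTorch(
        role=role_arn,
        instance_count=1,
        instance_type=size,
        framework_version="2.8.0",
        py_version="py312",
        use_spot_instances=True,
        max_run=900,
        max_wait=3600,
        source_dir="./src",
        entry_point="train.py",
        disable_profiler=True,
    ).fit(inputs={"training": "s3://bucket/data/"}, wait=False)  # async submit
```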
- DEVICE_BATCH_SIZE: tokens per micro-batch (affects VRAM)
- TOTAL_BATCH_SIZE: tokens per optimizer step (affects training quality)
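The two knobs are tied together by gradient accumulation; a small sketch of the usual bookkeeping (the concrete numbers and variable names are illustrative, not taken from this skill's training scripts):

```python
# Derive gradient-accumulation steps from the two batch-size knobs.
DEVICE_BATCH_SIZE = 8_192      # tokens per micro-batch (bounded by VRAM)
TOTAL_BATCH_SIZE = 262_144     # tokens per optimizer step (training quality)
world_size = 1                 # single-instance burst jobs

grad_accum_steps = TOTAL_BATCH_SIZE // (DEVICE_BATCH_SIZE * world_size)
assert TOTAL_BATCH_SIZE % (DEVICE_BATCH_SIZE * world_size) == 0
print(grad_accum_steps)  # 32 micro-batches accumulated per optimizer step
```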
```bash
# List running jobs
aws sagemaker list-training-jobs --status-equals InProgress \
  --query "TrainingJobSummaries[].{Name: TrainingJobName, Status: TrainingJobStatus}" \
  --output table

# Check specific job
aws sagemaker describe-training-job --training-job-name <JOB_NAME> \
  --query "{Status: TrainingJobStatus, Secondary: SecondaryStatus, BillableTime: BillableTimeInSeconds}"

# View CloudWatch logs
STREAM=$(aws logs describe-log-streams \
  --log-group-name /aws/sagemaker/TrainingJobs \
  --log-stream-name-prefix <JOB_NAME> \
  --query "logStreams[-1].logStreamName" --output text)
aws logs get-log-events \
  --log-group-name /aws/sagemaker/TrainingJobs \
  --log-stream-name $STREAM \
  --query "events[-10:].message" --output text
```
```bash
# Cost report: billable seconds per job
aws sagemaker list-training-jobs --name-contains <PREFIX> \
  --query "TrainingJobSummaries[].TrainingJobName" --output text | \
  tr '\t' '\n' | \
  xargs -I{} aws sagemaker describe-training-job --training-job-name {} \
    --query "BillableTimeInSeconds" --output text
```