Analyzes GPU workloads and recommends the most cost-effective configuration: get cost estimates, compare GPU options, and find the cheapest GPU that meets your requirements. Use when users ask "which GPU should I use?", need cost estimates, or want to optimize their gpu.jsonc to avoid over-provisioning.
/plugin marketplace add gpu-cli/gpu
/plugin install gpu-cli@gpu-cli

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Spend less, compute more.
This skill analyzes workloads and recommends the most cost-effective GPU configuration. Stop guessing and overpaying.
| Request Pattern | This Skill Handles |
|---|---|
| "Which GPU should I use?" | GPU recommendation |
| "How much will this cost?" | Cost estimation |
| "Optimize my gpu.jsonc" | Configuration review |
| "Compare GPUs for my task" | GPU comparison |
| "I'm getting OOM errors" | Right-sizing help |
| "Reduce my GPU costs" | Cost optimization |
**Consumer GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| RTX 3090 | 24GB | $0.22 | Budget inference, small training |
| RTX 4090 | 24GB | $0.44 | Best value for 24GB tasks |
| RTX 4080 | 16GB | $0.35 | Light inference |
| RTX 4070 Ti | 12GB | $0.25 | Small models |
**Professional GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| RTX A4000 | 16GB | $0.30 | Development |
| RTX A5000 | 24GB | $0.40 | Training |
| RTX A6000 | 48GB | $0.80 | Large models |
| L4 | 24GB | $0.35 | Inference |
| L40 | 48GB | $0.90 | Training/inference |
| L40S | 48GB | $0.95 | Fast training |
**Data Center GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| A100 40GB | 40GB | $1.29 | Training |
| A100 80GB PCIe | 80GB | $1.79 | Large models |
| A100 80GB SXM | 80GB | $2.09 | Multi-GPU training |
| H100 PCIe | 80GB | $2.49 | Fast training |
| H100 SXM | 80GB | $3.19 | Fastest training |
| H100 NVL | 94GB | $3.99 | Very large models |
| H200 | 141GB | $4.99 | Largest models |
**LLM Inference VRAM**

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| 7B | 14GB | 8GB | 5GB |
| 13B | 26GB | 14GB | 8GB |
| 34B | 68GB | 36GB | 20GB |
| 70B | 140GB | 75GB | 40GB |
| 405B | 810GB | 420GB | 220GB |
Formula: VRAM (GB) = Parameters (B) × Precision (bytes) × 1.2 (overhead)
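The formula is easy to script. A minimal sketch (the function name and example values are illustrative; the table above uses rounded figures, so small differences are expected):

```python
def estimate_inference_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """VRAM (GB) = parameters (B) × precision (bytes) × 1.2 overhead."""
    return params_billions * bytes_per_param * 1.2

# FP16 = 2 bytes/param, INT8 = 1, INT4 = 0.5
print(estimate_inference_vram_gb(70, 0.5))  # ~42 GB for a 70B model in INT4
```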
**LLM Fine-Tuning VRAM**

| Method | 7B VRAM | 13B VRAM | 70B VRAM |
|---|---|---|---|
| Full Fine-tune | 56GB | 104GB | 560GB |
| LoRA | 24GB | 40GB | 160GB |
| QLoRA | 12GB | 20GB | 48GB |
Formula: Full fine-tune VRAM (GB) = Parameters (B) × 8 (weights + gradients + optimizer states)
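The same rule of thumb as code, a sketch only (the ×8 factor is the formula above; LoRA and QLoRA needs vary too much for a single multiplier, so use the table for those):

```python
def estimate_full_finetune_vram_gb(params_billions: float) -> float:
    """Full fine-tune VRAM (GB) = parameters (B) × 8 (weights + gradients + optimizer states)."""
    return params_billions * 8

print(estimate_full_finetune_vram_gb(7))   # 56 GB, matching the table
print(estimate_full_finetune_vram_gb(70))  # 560 GB, hence 8× A100 80GB
```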
**Image Generation VRAM**

| Model | Inference VRAM | Training VRAM |
|---|---|---|
| SD 1.5 | 6GB | 12GB |
| SDXL | 10GB | 24GB |
| FLUX.1-schnell | 20GB | 40GB |
| FLUX.1-dev | 24GB | 48GB |
**Video Generation VRAM**

| Model | VRAM Required |
|---|---|
| CogVideoX | 40GB |
| Mochi-1 | 40GB |
| Hunyuan Video | 80GB |
LLM inference: what model size?
├── 7-8B (Llama 3.1 8B, Mistral 7B)
│ └── INT4 quantization?
│ ├── Yes → RTX 4070 Ti (12GB) @ $0.25/hr
│ └── No → RTX 4090 (24GB) @ $0.44/hr ✓ Best value
│
├── 13B
│ └── INT4 quantization?
│ ├── Yes → RTX 4090 (24GB) @ $0.44/hr
│ └── No → RTX A6000 (48GB) @ $0.80/hr
│
├── 34B (CodeLlama 34B, etc.)
│ └── RTX A6000 or L40 (48GB) @ $0.80-0.90/hr
│
├── 70B (Llama 3.1 70B)
│ └── Quantization?
│ ├── INT4 → A100 40GB @ $1.29/hr
│ └── FP16 → 2× A100 80GB @ $3.58/hr
│
└── 405B (Llama 3.1 405B)
└── INT4 → 4× A100 80GB or 2× H200 @ $7-10/hr
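For quick checks, the inference tree can be approximated as a lookup helper. This is a rough sketch whose thresholds and picks mirror the tree above; context length and batch size still need judgment:

```python
def pick_inference_gpu(model_size_b: float, int4: bool = False) -> str:
    """Map model size (billions of parameters) to the cheapest workable GPU per the tree above."""
    if model_size_b <= 8:
        return "RTX 4070 Ti (12GB)" if int4 else "RTX 4090 (24GB)"
    if model_size_b <= 13:
        return "RTX 4090 (24GB)" if int4 else "RTX A6000 (48GB)"
    if model_size_b <= 34:
        return "RTX A6000 or L40 (48GB)"
    if model_size_b <= 70:
        return "A100 40GB" if int4 else "2× A100 80GB"
    return "4× A100 80GB or 2× H200 (INT4)"

print(pick_inference_gpu(70, int4=True))  # A100 40GB
```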
What are you training?
├── Image LoRA (SDXL)
│ └── RTX 4090 (24GB) @ $0.44/hr ✓
│
├── Image LoRA (FLUX)
│ └── A100 80GB @ $1.79/hr
│
├── LLM Fine-tune (7B)
│ └── Method?
│ ├── QLoRA → RTX 4090 @ $0.44/hr ✓
│ ├── LoRA → A100 40GB @ $1.29/hr
│ └── Full FT → A100 80GB @ $1.79/hr
│
├── LLM Fine-tune (70B)
│ └── Method?
│ ├── QLoRA → A100 80GB @ $1.79/hr
│ ├── LoRA → 2× A100 80GB @ $3.58/hr
│ └── Full FT → 8× A100 80GB @ $14.32/hr
│
└── Custom PyTorch
└── Profile first, then select
Bad: Requesting A100 80GB for a task that fits on RTX 4090
// Don't do this
{"gpu_type": "A100 PCIe 80GB"} // $1.79/hr for SD 1.5? Wasteful!
Good: Match GPU to workload
// Do this
{"gpu_type": "RTX 4090", "min_vram": 12} // $0.44/hr, plenty for SD 1.5
INT4 quantization reduces VRAM by 4× with minimal quality loss for inference.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Instead of loading full precision...
model = AutoModelForCausalLM.from_pretrained("model")

# ...use 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "model",
    quantization_config=quantization_config,
)
Savings: Llama 70B on 1× A100 ($1.79/hr) instead of 2× A100 ($3.58/hr)
# Gradient checkpointing (training)
model.gradient_checkpointing_enable()
# CPU offloading (inference)
pipe.enable_model_cpu_offload()
# VAE tiling (image generation)
pipe.vae.enable_tiling()
# Attention slicing
pipe.enable_attention_slicing()
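A minimal sketch of how these switches combine in one diffusers pipeline (the model ID and prompt are illustrative, and `enable_model_cpu_offload()` requires `accelerate` to be installed):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

# Trade some speed for VRAM headroom
pipe.enable_model_cpu_offload()  # keep idle submodules on the CPU
pipe.enable_attention_slicing()  # compute attention in slices
pipe.vae.enable_tiling()         # decode large images tile by tile

image = pipe("a watercolor of a data center at dawn").images[0]
image.save("out.png")
```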
Process multiple items per GPU session instead of one-at-a-time:
# Bad: Start pod for each image
for image in images:
gpu_run(f"process {image}") # Pod startup overhead each time
# Good: Batch process
gpu_run("process_batch images/") # One pod startup, many images
// For interactive work (many short commands)
{"cooldown_minutes": 10} // Pod stays warm
// For batch jobs
{"cooldown_minutes": 5} // Stop quickly when done
// For API servers
{"cooldown_minutes": 30} // Keep serving longer
Test with a small batch first:
# Test with subset
gpu run python train.py --max-steps 100
# Check VRAM usage in the output
nvidia-smi
# Then decide GPU size
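If the workload is PyTorch, peak usage can also be read programmatically instead of eyeballing nvidia-smi. A small snippet (assumes CUDA device 0) to print at the end of the test run:

```python
import torch

# Report peak VRAM actually allocated during the test steps
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB of {total_gb:.1f} GB")
```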
When reviewing a gpu.jsonc, check:
| Check | Question | Optimization |
|---|---|---|
| GPU Size | Is VRAM matched to model? | Downsize if <70% utilization |
| GPU Count | Do you need multi-GPU? | Single GPU if model fits |
| Cooldown | Matches usage pattern? | Reduce for batch, increase for interactive |
| Downloads | Are models pre-cached? | Use download spec to avoid re-download |
| Precision | Can you quantize? | INT8/INT4 for inference |
Cost = (GPU $/hr) × (Hours Active) + (GPU $/hr) × (Download Hours)
Examples:
- RTX 4090, 2 hours: $0.44 × 2 = $0.88
- A100 80GB, 4 hours: $1.79 × 4 = $7.16
- 2× H100, 8 hours: $6.38 × 8 = $51.04
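The same arithmetic as a small helper (the function name is illustrative; download time is billed at the same hourly rate):

```python
def run_cost(gpu_usd_per_hr: float, active_hours: float, download_hours: float = 0.0) -> float:
    """Cost = GPU $/hr × (active hours + download hours)."""
    return gpu_usd_per_hr * (active_hours + download_hours)

print(run_cost(0.44, 2))  # RTX 4090, 2 hours: $0.88
print(run_cost(1.79, 4))  # A100 80GB, 4 hours: $7.16
```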
Hours = (Samples × Epochs × Seconds per Sample) / 3600
Examples:
- QLoRA 7B, 10K samples, 3 epochs: ~3 hours on RTX 4090 = $1.32
- Full FT 7B, 10K samples, 3 epochs: ~24 hours on A100 = $42.96
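Seconds per sample is the number you measure in the small test run; this sketch plugs in an assumed ~0.36 s/sample to reproduce the QLoRA estimate above:

```python
def training_hours(num_samples: int, epochs: int, seconds_per_sample: float) -> float:
    """Hours = (samples × epochs × seconds per sample) / 3600."""
    return num_samples * epochs * seconds_per_sample / 3600

hours = training_hours(10_000, 3, 0.36)  # ~3.0 hours (assumed throughput)
print(hours, round(hours * 0.44, 2))     # ~$1.32 on an RTX 4090
```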
Cost per 1K tokens = (GPU $/hr) / (Tokens/hr / 1000)
Examples (vLLM):
- Llama 8B on RTX 4090: $0.44/hr ÷ 360K/hr = $0.0012/1K tokens
- Llama 70B on 2× A100: $3.58/hr ÷ 108K/hr = $0.033/1K tokens
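And per-token serving cost, using the throughput figures from the examples above (360K and 108K tokens/hr are those assumed vLLM throughputs):

```python
def cost_per_1k_tokens(gpu_usd_per_hr: float, tokens_per_hour: float) -> float:
    """$ per 1K tokens = GPU $/hr ÷ (tokens per hour / 1000)."""
    return gpu_usd_per_hr / (tokens_per_hour / 1000)

print(cost_per_1k_tokens(0.44, 360_000))  # Llama 8B on RTX 4090: ~$0.0012
print(cost_per_1k_tokens(3.58, 108_000))  # Llama 70B on 2× A100: ~$0.033
```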
**Over-provisioning (paying too much)**

| Mistake | Cost | Fix | Savings |
|---|---|---|---|
| A100 for SD 1.5 | $1.79/hr | RTX 4090 | 75% |
| H100 for 7B inference | $2.49/hr | RTX 4090 | 82% |
| 2× GPU for model that fits on 1 | 2× cost | Single GPU | 50% |
**Under-provisioning (jobs fail)**

| Mistake | Symptom | Fix |
|---|---|---|
| RTX 4090 for 70B | OOM error | Use A100 or quantize |
| 24GB for FLUX training | OOM during backprop | Use A100 80GB |
| Single GPU for 405B | Won't load | Use 4× A100 or 2× H200 |
Task: Run Llama 3.1 70B for inference
| Option | GPU Setup | $/hr | Tokens/sec | $/1M tokens |
|---|---|---|---|---|
| A | 2× A100 80GB (FP16) | $3.58 | 30 | $33.15 |
| B | 1× A100 80GB (INT4) | $1.79 | 25 | $19.89 |
| C | 1× H100 80GB (INT4) | $2.49 | 40 | $17.29 |
Recommendation: Option B for cost, Option C for speed
When analyzing GPU requirements, respond using this structure:
## GPU Analysis for [Task]
### Requirements
- **Model size**: [X GB]
- **VRAM needed**: [X GB] ([precision])
- **Task type**: [inference/training]
### Recommended Configuration
```jsonc
{
  "gpu_type": "[GPU]",
  "min_vram": [X],
  // ... rest of config
}
```
### Cost Estimate

| Scenario | Duration | Cost |
|---|---|---|
| [Scenario 1] | [X hours] | $[X] |
| [Scenario 2] | [X hours] | $[X] |
### Alternatives

| GPU | $/hr | Pro | Con |
|---|---|---|---|
| [GPU 1] | $X | [pro] | [con] |
| [GPU 2] | $X | [pro] | [con] |