Analyzes GPU workloads and recommends the most cost-effective configuration: get cost estimates, compare GPU options, and find the cheapest GPU that meets your requirements. Use when users ask "which GPU should I use?", need cost estimates, or want to optimize their gpu.jsonc to avoid over-provisioning.
/plugin marketplace add gpu-cli/gpu
/plugin install gpu-cli@gpu-cli

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Spend less, compute more.
This skill analyzes workloads and recommends the most cost-effective GPU configuration. Stop guessing and overpaying.
| Request Pattern | This Skill Handles |
|---|---|
| "Which GPU should I use?" | GPU recommendation |
| "How much will this cost?" | Cost estimation |
| "Optimize my gpu.jsonc" | Configuration review |
| "Compare GPUs for my task" | GPU comparison |
| "I'm getting OOM errors" | Right-sizing help |
| "Reduce my GPU costs" | Cost optimization |
**Consumer GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| RTX 3090 | 24GB | $0.22 | Budget inference, small training |
| RTX 4090 | 24GB | $0.44 | Best value for 24GB tasks |
| RTX 4080 | 16GB | $0.35 | Light inference |
| RTX 4070 Ti | 12GB | $0.25 | Small models |
**Professional GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| RTX A4000 | 16GB | $0.30 | Development |
| RTX A5000 | 24GB | $0.40 | Training |
| RTX A6000 | 48GB | $0.80 | Large models |
| L4 | 24GB | $0.35 | Inference |
| L40 | 48GB | $0.90 | Training/inference |
| L40S | 48GB | $0.95 | Fast training |
**Data Center GPUs**

| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| A100 40GB | 40GB | $1.29 | Training |
| A100 80GB PCIe | 80GB | $1.79 | Large models |
| A100 80GB SXM | 80GB | $2.09 | Multi-GPU training |
| H100 PCIe | 80GB | $2.49 | Fast training |
| H100 SXM | 80GB | $3.19 | Fastest training |
| H100 NVL | 94GB | $3.99 | Very large models |
| H200 | 141GB | $4.99 | Largest models |
**LLM Inference VRAM**

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| 7B | 14GB | 8GB | 5GB |
| 13B | 26GB | 14GB | 8GB |
| 34B | 68GB | 36GB | 20GB |
| 70B | 140GB | 75GB | 40GB |
| 405B | 810GB | 420GB | 220GB |
Formula: VRAM (GB) = Parameters (B) × Precision (bytes) × 1.2 (overhead)
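The formula is easy to script. A minimal sketch (the function name and example values are illustrative; the table above uses rounded figures, so small differences are expected):

```python
def estimate_inference_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """VRAM (GB) = parameters (B) × precision (bytes) × 1.2 overhead."""
    return params_billions * bytes_per_param * 1.2

# FP16 = 2 bytes/param, INT8 = 1, INT4 = 0.5
print(estimate_inference_vram_gb(70, 0.5))  # ~42 GB for a 70B model in INT4
```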
**LLM Fine-Tuning VRAM**

| Method | 7B VRAM | 13B VRAM | 70B VRAM |
|---|---|---|---|
| Full Fine-tune | 56GB | 104GB | 560GB |
| LoRA | 24GB | 40GB | 160GB |
| QLoRA | 12GB | 20GB | 48GB |
Formula: Full fine-tune VRAM (GB) = Parameters (B) × 8 (weights + gradients + optimizer states)
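The same rule of thumb as code, a sketch only (the ×8 factor is the formula above; LoRA and QLoRA needs vary too much for a single multiplier, so use the table for those):

```python
def estimate_full_finetune_vram_gb(params_billions: float) -> float:
    """Full fine-tune VRAM (GB) = parameters (B) × 8 (weights + gradients + optimizer states)."""
    return params_billions * 8

print(estimate_full_finetune_vram_gb(7))   # 56 GB, matching the table
print(estimate_full_finetune_vram_gb(70))  # 560 GB, hence 8× A100 80GB
```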
**Image Generation VRAM**

| Model | Inference VRAM | Training VRAM |
|---|---|---|
| SD 1.5 | 6GB | 12GB |
| SDXL | 10GB | 24GB |
| FLUX.1-schnell | 20GB | 40GB |
| FLUX.1-dev | 24GB | 48GB |
**Video Generation VRAM**

| Model | VRAM Required |
|---|---|
| CogVideoX | 40GB |
| Mochi-1 | 40GB |
| Hunyuan Video | 80GB |
LLM inference: what model size?
├── 7-8B (Llama 3.1 8B, Mistral 7B)
│ └── INT4 quantization?
│ ├── Yes → RTX 4070 Ti (12GB) @ $0.25/hr
│ └── No → RTX 4090 (24GB) @ $0.44/hr ✓ Best value
│
├── 13B
│ └── INT4 quantization?
│ ├── Yes → RTX 4090 (24GB) @ $0.44/hr
│ └── No → RTX A6000 (48GB) @ $0.80/hr
│
├── 34B (CodeLlama 34B, etc.)
│ └── RTX A6000 or L40 (48GB) @ $0.80-0.90/hr
│
├── 70B (Llama 3.1 70B)
│ └── Quantization?
│ ├── INT4 → A100 40GB @ $1.29/hr
│ └── FP16 → 2× A100 80GB @ $3.58/hr
│
└── 405B (Llama 3.1 405B)
└── INT4 → 4× A100 80GB or 2× H200 @ $7-10/hr
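For quick checks, the inference tree can be approximated as a lookup helper. This is a rough sketch whose thresholds and picks mirror the tree above; context length and batch size still need judgment:

```python
def pick_inference_gpu(model_size_b: float, int4: bool = False) -> str:
    """Map model size (billions of parameters) to the cheapest workable GPU per the tree above."""
    if model_size_b <= 8:
        return "RTX 4070 Ti (12GB)" if int4 else "RTX 4090 (24GB)"
    if model_size_b <= 13:
        return "RTX 4090 (24GB)" if int4 else "RTX A6000 (48GB)"
    if model_size_b <= 34:
        return "RTX A6000 or L40 (48GB)"
    if model_size_b <= 70:
        return "A100 40GB" if int4 else "2× A100 80GB"
    return "4× A100 80GB or 2× H200 (INT4)"

print(pick_inference_gpu(70, int4=True))  # A100 40GB
```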
What are you training?
├── Image LoRA (SDXL)
│ └── RTX 4090 (24GB) @ $0.44/hr ✓
│
├── Image LoRA (FLUX)
│ └── A100 80GB @ $1.79/hr
│
├── LLM Fine-tune (7B)
│ └── Method?
│ ├── QLoRA → RTX 4090 @ $0.44/hr ✓
│ ├── LoRA → A100 40GB @ $1.29/hr
│ └── Full FT → A100 80GB @ $1.79/hr
│
├── LLM Fine-tune (70B)
│ └── Method?
│ ├── QLoRA → A100 80GB @ $1.79/hr
│ ├── LoRA → 2× A100 80GB @ $3.58/hr
│ └── Full FT → 8× A100 80GB @ $14.32/hr
│
└── Custom PyTorch
└── Profile first, then select
Bad: Requesting A100 80GB for a task that fits on RTX 4090
// Don't do this
{"gpu_type": "A100 PCIe 80GB"} // $1.79/hr for SD 1.5? Wasteful!
Good: Match GPU to workload
// Do this
{"gpu_type": "RTX 4090", "min_vram": 12} // $0.44/hr, plenty for SD 1.5
INT4 quantization reduces VRAM by 4× with minimal quality loss for inference.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Instead of loading full precision...
model = AutoModelForCausalLM.from_pretrained("model")

# ...use 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "model",
    quantization_config=quantization_config,
)
Savings: Llama 70B on 1× A100 ($1.79/hr) instead of 2× A100 ($3.58/hr)
# Gradient checkpointing (training)
model.gradient_checkpointing_enable()
# CPU offloading (inference)
pipe.enable_model_cpu_offload()
# VAE tiling (image generation)
pipe.vae.enable_tiling()
# Attention slicing
pipe.enable_attention_slicing()
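A minimal sketch of how these switches combine in one diffusers pipeline (the model ID and prompt are illustrative, and `enable_model_cpu_offload()` requires `accelerate` to be installed):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

# Trade some speed for VRAM headroom
pipe.enable_model_cpu_offload()  # keep idle submodules on the CPU
pipe.enable_attention_slicing()  # compute attention in slices
pipe.vae.enable_tiling()         # decode large images tile by tile

image = pipe("a watercolor of a data center at dawn").images[0]
image.save("out.png")
```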
Process multiple items per GPU session instead of one-at-a-time:
# Bad: Start pod for each image
for image in images:
gpu_run(f"process {image}") # Pod startup overhead each time
# Good: Batch process
gpu_run("process_batch images/") # One pod startup, many images
// For interactive work (many short commands)
{"cooldown_minutes": 10} // Pod stays warm
// For batch jobs
{"cooldown_minutes": 5} // Stop quickly when done
// For API servers
{"cooldown_minutes": 30} // Keep serving longer
Test with a small batch first:
# Test with subset
gpu run python train.py --max-steps 100
# Check VRAM usage in the output
nvidia-smi
# Then decide GPU size
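If the workload is PyTorch, peak usage can also be read programmatically instead of eyeballing nvidia-smi. A small snippet (assumes CUDA device 0) to print at the end of the test run:

```python
import torch

# Report peak VRAM actually allocated during the test steps
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB of {total_gb:.1f} GB")
```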
When reviewing a gpu.jsonc, check:
| Check | Question | Optimization |
|---|---|---|
| GPU Size | Is VRAM matched to model? | Downsize if <70% utilization |
| GPU Count | Do you need multi-GPU? | Single GPU if model fits |
| Cooldown | Matches usage pattern? | Reduce for batch, increase for interactive |
| Downloads | Are models pre-cached? | Use download spec to avoid re-download |
| Precision | Can you quantize? | INT8/INT4 for inference |
Cost = (GPU $/hr) × (Hours Active) + (GPU $/hr) × (Download Hours)
Examples:
- RTX 4090, 2 hours: $0.44 × 2 = $0.88
- A100 80GB, 4 hours: $1.79 × 4 = $7.16
- 2× H100, 8 hours: $6.38 × 8 = $51.04
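The same arithmetic as a small helper (the function name is illustrative; download time is billed at the same hourly rate):

```python
def run_cost(gpu_usd_per_hr: float, active_hours: float, download_hours: float = 0.0) -> float:
    """Cost = GPU $/hr × (active hours + download hours)."""
    return gpu_usd_per_hr * (active_hours + download_hours)

print(run_cost(0.44, 2))  # RTX 4090, 2 hours: $0.88
print(run_cost(1.79, 4))  # A100 80GB, 4 hours: $7.16
```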
Hours = (Samples × Epochs × Seconds per Sample) / 3600
Examples:
- QLoRA 7B, 10K samples, 3 epochs: ~3 hours on RTX 4090 = $1.32
- Full FT 7B, 10K samples, 3 epochs: ~24 hours on A100 = $42.96
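Seconds per sample is the number you measure in the small test run; this sketch plugs in an assumed ~0.36 s/sample to reproduce the QLoRA estimate above:

```python
def training_hours(num_samples: int, epochs: int, seconds_per_sample: float) -> float:
    """Hours = (samples × epochs × seconds per sample) / 3600."""
    return num_samples * epochs * seconds_per_sample / 3600

hours = training_hours(10_000, 3, 0.36)  # ~3.0 hours (assumed throughput)
print(hours, round(hours * 0.44, 2))     # ~$1.32 on an RTX 4090
```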
Cost per 1K tokens = (GPU $/hr) / (Tokens/hr / 1000)
Examples (vLLM):
- Llama 8B on RTX 4090: $0.44/hr ÷ 360K/hr = $0.0012/1K tokens
- Llama 70B on 2× A100: $3.58/hr ÷ 108K/hr = $0.033/1K tokens
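And per-token serving cost, using the throughput figures from the examples above (360K and 108K tokens/hr are those assumed vLLM throughputs):

```python
def cost_per_1k_tokens(gpu_usd_per_hr: float, tokens_per_hour: float) -> float:
    """$ per 1K tokens = GPU $/hr ÷ (tokens per hour / 1000)."""
    return gpu_usd_per_hr / (tokens_per_hour / 1000)

print(cost_per_1k_tokens(0.44, 360_000))  # Llama 8B on RTX 4090: ~$0.0012
print(cost_per_1k_tokens(3.58, 108_000))  # Llama 70B on 2× A100: ~$0.033
```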
**Over-provisioning (paying too much)**

| Mistake | Cost | Fix | Savings |
|---|---|---|---|
| A100 for SD 1.5 | $1.79/hr | RTX 4090 | 75% |
| H100 for 7B inference | $2.49/hr | RTX 4090 | 82% |
| 2× GPU for model that fits on 1 | 2× cost | Single GPU | 50% |
**Under-provisioning (jobs fail)**

| Mistake | Symptom | Fix |
|---|---|---|
| RTX 4090 for 70B | OOM error | Use A100 or quantize |
| 24GB for FLUX training | OOM during backprop | Use A100 80GB |
| Single GPU for 405B | Won't load | Use 4× A100 or 2× H200 |
Task: Run Llama 3.1 70B for inference
| Option | GPU Setup | $/hr | Tokens/sec | $/1M tokens |
|---|---|---|---|---|
| A | 2× A100 80GB (FP16) | $3.58 | 30 | $33.15 |
| B | 1× A100 80GB (INT4) | $1.79 | 25 | $19.89 |
| C | 1× H100 80GB (INT4) | $2.49 | 40 | $17.29 |
Recommendation: Option B for cost, Option C for speed
When analyzing GPU requirements, respond using this structure:
## GPU Analysis for [Task]
### Requirements
- **Model size**: [X GB]
- **VRAM needed**: [X GB] ([precision])
- **Task type**: [inference/training]
### Recommended Configuration
```jsonc
{
  "gpu_type": "[GPU]",
  "min_vram": [X],
  // ... rest of config
}
```
### Cost Estimate

| Scenario | Duration | Cost |
|---|---|---|
| [Scenario 1] | [X hours] | $[X] |
| [Scenario 2] | [X hours] | $[X] |
### Alternatives

| GPU | $/hr | Pro | Con |
|---|---|---|---|
| [GPU 1] | $X | [pro] | [con] |
| [GPU 2] | $X | [pro] | [con] |