From coreweave-pack
Optimizes CoreWeave GPU costs with right-sizing, Knative scale-to-zero, quantization, and instance recommendations for ML inference workloads.
Install:

```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
```
Related skills:

- Optimizes CoreWeave GPU inference latency and throughput using workload-specific GPU picks, vLLM batching, and Kubernetes HPA autoscaling.
- Optimizes Vast.ai GPU rental costs using cost-per-TFLOP selection, spot instance analysis, Python auto-destroy timers, and Bash idle detection.
- Optimizes GPU resources for ML deployment tasks like model serving, MLOps pipelines, monitoring, and production inference. Generates code, configs, and best-practice guidance. Auto-activates on the phrases 'gpu resource optimizer' or 'gpu optimizer'.
CoreWeave GPU pricing (approximate):

| GPU | Per GPU/hour | Best For |
|---|---|---|
| A100 40GB PCIe | ~$1.50 | Development, smaller models |
| A100 80GB PCIe | ~$2.21 | Production inference |
| H100 80GB PCIe | ~$4.76 | High-throughput inference |
| H100 SXM5 (8x) | ~$6.15/GPU | Training, multi-GPU |
| L40 | ~$1.10 | Image generation, light inference |
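To turn hourly rates into a budget, multiply the rate by GPU count and hours. A minimal sketch using the approximate rates from the table above (the dict keys are illustrative names, not CoreWeave identifiers; 730 is the usual hours-per-month convention):

```python
RATES_PER_GPU_HOUR = {  # approximate rates from the table above (USD)
    "A100_40GB_PCIE": 1.50,
    "A100_80GB_PCIE": 2.21,
    "H100_80GB_PCIE": 4.76,
    "H100_SXM5": 6.15,
    "L40": 1.10,
}

def monthly_cost(gpu: str, count: int = 1, hours: float = 730) -> float:
    """Estimated monthly cost for `count` GPUs running `hours` per month."""
    return RATES_PER_GPU_HOUR[gpu] * count * hours

print(f"${monthly_cost('A100_80GB_PCIE'):,.0f}/mo")      # one A100-80GB, always on
print(f"${monthly_cost('H100_SXM5', count=8):,.0f}/mo")  # 8x H100 SXM5 node
```

Always-on numbers like these are the baseline that scale-to-zero (below) is meant to cut.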
Enable Knative scale-to-zero so idle services release their GPUs:

```yaml
autoscaling.knative.dev/minScale: "0"        # allow scaling to zero replicas when idle
autoscaling.knative.dev/scaleDownDelay: "5m" # keep pods for 5 minutes after traffic stops
```
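These annotations belong on the Knative Service's revision template. A minimal sketch of where they go, assuming a hypothetical service name and vLLM container image (neither is from the original):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference                # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/scaleDownDelay: "5m"
    spec:
      containers:
        - image: vllm/vllm-openai:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"         # one GPU per replica
```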
A simple right-sizing heuristic maps model size (in billions of parameters) to a GPU recommendation:

```python
def recommend_gpu(model_size_b: float, inference_only: bool = True) -> str:
    """Recommend a CoreWeave GPU for a model with model_size_b billion parameters."""
    if model_size_b <= 7:
        # Small models run inference on an L40; training needs more memory.
        return "L40" if inference_only else "A100_PCIE_80GB"
    elif model_size_b <= 13:
        return "A100_PCIE_80GB"
    elif model_size_b <= 70:
        # 70B-class weights must be sharded across multiple GPUs.
        return "A100_PCIE_80GB (4x tensor parallel)"
    else:
        return "H100_SXM5 (8x tensor parallel)"
```
Use AWQ or GPTQ quantization to fit larger models on smaller GPUs:

```bash
# A 70B model at 4-bit fits on a single A100-80GB instead of 4x
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ --quantization awq
```
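The arithmetic behind that claim, as a quick sketch (weights only; KV cache and activations add overhead on top, which is why fp16 deployments of 70B models typically use 4 GPUs):

```python
def weight_memory_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters * bytes per parameter."""
    return params_b * 1e9 * (bits / 8) / 1e9

print(weight_memory_gb(70, 16))  # ~140 GB at fp16: exceeds any single 80GB GPU
print(weight_memory_gb(70, 4))   # ~35 GB at 4-bit: fits one A100-80GB
```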
For architecture patterns, see coreweave-reference-architecture.