From coreweave-pack
Optimizes CoreWeave GPU costs with right-sizing, Knative scale-to-zero, quantization, and instance recommendations for ML inference workloads.
Install:

```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
```
Related skills:

- Optimizes CoreWeave GPU inference latency and throughput using workload-specific GPU picks, vLLM batching, and Kubernetes HPA autoscaling.
- Optimizes Vast.ai GPU rental costs using cost-per-TFLOP selection, spot instance analysis, Python auto-destroy timers, and Bash idle detection.
- Optimizes GPU resources for ML deployment tasks like model serving, MLOps pipelines, monitoring, and production inference. Generates code, configs, and best-practice guidance. Auto-activates on the phrases 'gpu resource optimizer' or 'gpu optimizer'.
CoreWeave GPU pricing (approximate):

| GPU | Per GPU/hour | Best For |
|---|---|---|
| A100 40GB PCIe | ~$1.50 | Development, smaller models |
| A100 80GB PCIe | ~$2.21 | Production inference |
| H100 80GB PCIe | ~$4.76 | High-throughput inference |
| H100 SXM5 (8x) | ~$6.15/GPU | Training, multi-GPU |
| L40 | ~$1.10 | Image generation, light inference |
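To turn hourly rates into a budget, multiply the rate by GPU count and hours. A minimal sketch using the approximate rates from the table above (the dict keys are illustrative names, not CoreWeave identifiers; 730 is the usual hours-per-month convention):

```python
RATES_PER_GPU_HOUR = {  # approximate rates from the table above (USD)
    "A100_40GB_PCIE": 1.50,
    "A100_80GB_PCIE": 2.21,
    "H100_80GB_PCIE": 4.76,
    "H100_SXM5": 6.15,
    "L40": 1.10,
}

def monthly_cost(gpu: str, count: int = 1, hours: float = 730) -> float:
    """Estimated monthly cost for `count` GPUs running `hours` per month."""
    return RATES_PER_GPU_HOUR[gpu] * count * hours

print(f"${monthly_cost('A100_80GB_PCIE'):,.0f}/mo")      # one A100-80GB, always on
print(f"${monthly_cost('H100_SXM5', count=8):,.0f}/mo")  # 8x H100 SXM5 node
```

Always-on numbers like these are the baseline that scale-to-zero (below) is meant to cut.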
Enable Knative scale-to-zero so idle services release their GPUs:

```yaml
autoscaling.knative.dev/minScale: "0"        # allow scaling to zero replicas when idle
autoscaling.knative.dev/scaleDownDelay: "5m" # keep pods for 5 minutes after traffic stops
```
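These annotations belong on the Knative Service's revision template. A minimal sketch of where they go, assuming a hypothetical service name and vLLM container image (neither is from the original):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference                # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/scaleDownDelay: "5m"
    spec:
      containers:
        - image: vllm/vllm-openai:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"         # one GPU per replica
```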
A simple right-sizing heuristic maps model size (in billions of parameters) to a GPU recommendation:

```python
def recommend_gpu(model_size_b: float, inference_only: bool = True) -> str:
    """Recommend a CoreWeave GPU for a model with model_size_b billion parameters."""
    if model_size_b <= 7:
        # Small models run inference on an L40; training needs more memory.
        return "L40" if inference_only else "A100_PCIE_80GB"
    elif model_size_b <= 13:
        return "A100_PCIE_80GB"
    elif model_size_b <= 70:
        # 70B-class weights must be sharded across multiple GPUs.
        return "A100_PCIE_80GB (4x tensor parallel)"
    else:
        return "H100_SXM5 (8x tensor parallel)"
```
Use AWQ or GPTQ quantization to fit larger models on smaller GPUs:

```bash
# A 70B model at 4-bit fits on a single A100-80GB instead of 4x
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ --quantization awq
```
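The arithmetic behind that claim, as a quick sketch (weights only; KV cache and activations add overhead on top, which is why fp16 deployments of 70B models typically use 4 GPUs):

```python
def weight_memory_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters * bytes per parameter."""
    return params_b * 1e9 * (bits / 8) / 1e9

print(weight_memory_gb(70, 16))  # ~140 GB at fp16: exceeds any single 80GB GPU
print(weight_memory_gb(70, 4))   # ~35 GB at 4-bit: fits one A100-80GB
```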
For architecture patterns, see coreweave-reference-architecture.