From coreweave-pack: optimizes CoreWeave GPU inference latency and throughput using workload-specific GPU picks, vLLM batching, and Kubernetes HPA autoscaling.

Install:

```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
```
Related skills in the pack:

- Optimizes CoreWeave GPU costs with right-sizing, Knative scale-to-zero, quantization, and instance recommendations for ML inference workloads.
- Provides LLM serving optimization recommendations for latency, inference cost, and throughput: scans configs, detects stacks such as vLLM and TGI, and suggests quantization, batching, KV-cache, and framework changes.
- Optimizes GPU resources for ML deployment tasks such as model serving, MLOps pipelines, monitoring, and production inference; generates code, configs, and best-practices guidance. Auto-activates on the phrases 'gpu resource optimizer' or 'gpu optimizer'.
GPU selection by workload:

| Workload | Recommended GPU | Why |
|---|---|---|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8xH100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8xH100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
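To pin a workload to one of these GPU classes on CoreWeave, a node selector plus a GPU resource request is usually enough. A minimal sketch, assuming CoreWeave's `gpu.nvidia.com/class` node label and the standard `nvidia.com/gpu` device-plugin resource; the label key and class values vary by CoreWeave environment, so check your cluster's node labels:

```yaml
# Pod spec fragment: pin inference to an A100 80GB node and claim one GPU.
# The gpu.nvidia.com/class value is an assumption; verify with
# `kubectl get nodes --show-labels` on your cluster.
nodeSelector:
  gpu.nvidia.com/class: A100_PCIE_80GB
containers:
  - name: inference-server
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```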
```yaml
# Continuous batching with vLLM (pod spec fragment)
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"   # cap on tokens processed per batch step
      - "--max-num-seqs=256"              # max concurrent sequences per batch
      - "--gpu-memory-utilization=0.90"   # fraction of VRAM vLLM may claim
      - "--enable-prefix-caching"         # reuse KV cache across shared prompt prefixes
      - "--dtype=float16"
```
```yaml
# HPA targeting average GPU utilization. Requires the DCGM exporter gauge
# DCGM_FI_DEV_GPU_UTIL to be exposed as a per-pod custom metric, e.g. via
# prometheus-adapter (see the sketch after this block).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"
```
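For the Pods metric above to resolve, the DCGM exporter's gauge has to be re-exposed through the custom metrics API. A minimal prometheus-adapter Helm values sketch; the label names (`namespace`, `pod`) depend on how your DCGM exporter is scraped, so treat them as assumptions:

```yaml
# prometheus-adapter values fragment: expose DCGM_FI_DEV_GPU_UTIL as a
# per-pod custom metric, averaged across a pod's GPUs.
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "DCGM_FI_DEV_GPU_UTIL"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```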
Approximate throughput and cold-start figures for reference:

| Metric | A100-80GB | H100-80GB |
|---|---|---|
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec | ~200 (4-GPU tensor parallel) | ~500 (4-GPU tensor parallel) |
| Cold start (vLLM) | 30-60s | 20-40s |
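The 70B rows assume the model is sharded across four GPUs, which in vLLM is the `--tensor-parallel-size` flag. A sketch of the arg changes relative to the 8B config above; the GPU count must match the pod's `nvidia.com/gpu` limit:

```yaml
# Args fragment for serving Llama-70B across 4 NVLink-connected GPUs.
args:
  - "--model=meta-llama/Llama-3.1-70B-Instruct"
  - "--tensor-parallel-size=4"        # shard weights and KV cache across 4 GPUs
  - "--gpu-memory-utilization=0.90"
```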
For cost optimization, see coreweave-cost-tuning.