Optimize CoreWeave GPU inference latency and throughput. Use when reducing inference latency, maximizing GPU utilization, or tuning batch sizes and concurrency. Trigger with phrases like "coreweave performance", "coreweave latency", "coreweave throughput", "optimize coreweave inference".
Install with `npx claudepluginhub flight505/skill-forge --plugin coreweave-pack`.
| Workload | Recommended GPU | Why |
|---|---|---|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8xH100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8xH100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
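To land an inference Deployment on a specific GPU class, target CoreWeave nodes with node affinity. A minimal sketch, assuming the cluster exposes a GPU class label such as `gpu.nvidia.com/class` with a value like `A100_PCIE_80GB` (the exact key and values are assumptions here; check your node labels before using them):

```yaml
# Pin pods to A100 80GB nodes; label key and values are illustrative.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu.nvidia.com/class
          operator: In
          values:
          - A100_PCIE_80GB
```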
# Continuous batching with vLLM
containers:
- name: vllm
  image: vllm/vllm-openai:latest  # pin a specific release tag in production
  args:
  - "--model=meta-llama/Llama-3.1-8B-Instruct"
  - "--max-num-batched-tokens=8192"
  - "--max-num-seqs=256"
  - "--gpu-memory-utilization=0.90"
  - "--enable-prefix-caching"
  - "--dtype=float16"
  resources:
    limits:
      nvidia.com/gpu: 1
# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
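The Pods metric above only works if `DCGM_FI_DEV_GPU_UTIL` is published through the custom metrics API, typically by scraping dcgm-exporter with Prometheus and exposing the series via prometheus-adapter. A minimal adapter rule sketch follows; it assumes the scraped series carry `exported_namespace` and `exported_pod` labels, which depends on your scrape configuration and may need adjusting:

```yaml
# prometheus-adapter rule mapping DCGM GPU utilization onto pods.
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
  resources:
    overrides:
      exported_namespace: {resource: "namespace"}
      exported_pod: {resource: "pod"}
  name:
    matches: "DCGM_FI_DEV_GPU_UTIL"
    as: "DCGM_FI_DEV_GPU_UTIL"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```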
| Metric | A100-80GB | H100-80GB |
|---|---|---|
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec (4x GPU) | ~200 | ~500 |
| Cold start (vLLM) | 30-60s | 20-40s |
For cost optimization, see coreweave-cost-tuning.