Use when "LLM inference", "serving LLM", "vLLM", "llama.cpp", "GGUF", "text generation", "model serving", "inference optimization", "KV cache", "continuous batching", "speculative decoding", "local LLM", "CPU inference"
Compares LLM inference engines and recommends optimal setups for vLLM, llama.cpp, TGI, Ollama, and TensorRT-LLM.
Install with `/plugin marketplace add eyadsibai/ltk`, then `/plugin install ltk@ltk-marketplace`. This skill inherits all available tools. When active, it can use any tool Claude has access to.
High-performance inference engines for serving large language models.

| Engine | Best For | Hardware | Throughput | Setup |
|---|---|---|---|---|
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |

| Scenario | Recommendation |
|---|---|
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |
Production-grade serving with PagedAttention for optimal GPU memory usage.

| Feature | What It Does |
|---|---|
| PagedAttention | Non-contiguous KV cache, better memory utilization |
| Continuous batching | Dynamic request grouping for throughput |
| Speculative decoding | Small model drafts, large model verifies |
Strengths: Highest throughput, OpenAI-compatible API, multi-GPU support.
Limitations: GPU required, more complex setup.
Key concept: Serves OpenAI-compatible endpoints—drop-in replacement for OpenAI API.
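
For example, a minimal client sketch against a locally running vLLM server (the model name is a placeholder; vLLM serves on port 8000 by default, with the server started via something like `vllm serve <model>`):

```python
# Minimal sketch: query a vLLM server through its OpenAI-compatible endpoint.
# Assumes a server is already running locally, e.g. started with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused unless the server enforces one

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```
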
C++ inference for running models anywhere—laptops, phones, Raspberry Pi.

| Format | Size (7B) | Quality | Use Case |
|---|---|---|---|
| Q8_0 | ~7 GB | Highest | When you have RAM |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |
Recommendation: Q4_K_M for best quality/size balance.

| Model Size | Q4_K_M File Size | RAM Needed |
|---|---|---|
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
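
These file sizes follow roughly from parameter count times bits per weight; a back-of-the-envelope sketch (the bits-per-weight figures are approximations for illustration, and actual RAM use adds KV cache and runtime overhead on top):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
# Bits-per-weight values are approximate, for illustration only.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q2_K": 2.7}

def estimate_gguf_gb(n_params_billion: float, quant: str) -> float:
    """Approximate on-disk size of a quantized model in GB."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8

for size in (7, 13, 30, 70):
    print(f"{size}B @ Q4_K_M ≈ {estimate_gguf_gb(size, 'Q4_K_M'):.1f} GB on disk")
```
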
| Platform | Key Setting |
|---|---|
| Apple Silicon | n_gpu_layers=-1 (Metal offload) |
| CUDA GPU | n_gpu_layers=-1 + offload_kqv=True |
| CPU only | n_gpu_layers=0 + set n_threads to core count |
Strengths: Runs anywhere, GGUF format, Metal/CUDA support.
Limitations: Lower throughput than vLLM, single-user focused.
Key concept: GGUF format + quantization = run large models on consumer hardware.
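
A loading sketch with the llama-cpp-python bindings, applying the platform settings from the table above (the model path is a placeholder; generation-time parameters are shown in a separate sketch after the configuration table further below):

```python
# Minimal sketch: load a local GGUF with llama-cpp-python.
# The model path is a placeholder; pick the settings matching your platform.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # Apple Silicon (Metal) / CUDA: offload all layers; use 0 for CPU-only
    offload_kqv=True,  # CUDA: keep the KV cache on the GPU
    n_threads=8,       # CPU-only: set to physical core count
    verbose=False,
)
print("Model loaded; ready for completions.")
```
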
| Technique | What It Does | When to Use |
|---|---|---|
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16→int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |
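
Several of these techniques are single constructor arguments in vLLM's offline Python API; a sketch assuming two GPUs and an AWQ-quantized checkpoint (the model name is a placeholder). Continuous batching and the PagedAttention KV cache need no flags; vLLM applies them automatically.

```python
# Sketch: combining quantization and tensor parallelism in vLLM's offline API.
# Model name and GPU count are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",   # placeholder AWQ-quantized checkpoint
    quantization="awq",                  # run with reduced-precision weights
    tensor_parallel_size=2,              # split the model across 2 GPUs
    gpu_memory_utilization=0.90,         # VRAM fraction reserved for weights + KV cache
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding briefly."], params)
print(outputs[0].outputs[0].text)
```
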
| Parameter | Purpose | Typical Value |
|---|---|---|
| n_ctx | Context window size | 2048-8192 |
| n_gpu_layers | Layers to offload | -1 (all) or 0 (none) |
| temperature | Randomness | 0.0-1.0 |
| max_tokens | Output limit | 100-2000 |
| n_threads | CPU threads | Match core count |
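
In llama-cpp-python, n_ctx, n_gpu_layers, and n_threads are load-time arguments, while temperature and max_tokens are passed per request; a sketch of the request side using the chat API (model path again a placeholder):

```python
# Sketch: per-request parameters with llama-cpp-python's chat API.
# Load-time settings (n_ctx, n_gpu_layers, n_threads) are shown in the earlier sketch.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give three tips for reducing LLM inference latency."}],
    temperature=0.7,   # randomness: lower = more deterministic
    max_tokens=300,    # cap on generated tokens
)
print(reply["choices"][0]["message"]["content"])
```
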
| Issue | Solution |
|---|---|
| Out of memory | Reduce n_ctx, use smaller quant |
| Slow inference | Enable GPU offload, use faster quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall with -DLLAMA_METAL=on |
| Poor quality | Use higher quant (Q5_K_M, Q6_K) |