Retrieval-Augmented Generation (RAG) system design patterns, chunking strategies, embedding models, retrieval techniques, and context assembly. Use when designing RAG pipelines, improving retrieval quality, or building knowledge-grounded LLM applications.
Design and optimize RAG pipelines, from chunking strategies and embedding model selection to hybrid retrieval and multi-stage reranking. Use when building knowledge-grounded LLM applications that need to retrieve and synthesize information from large document collections.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Use this skill when the task involves any of the following:
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
┌──────────────────────────────────────────────────────────────────────┐
│                             RAG Pipeline                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐  │
│  │  Ingestion   │     │   Indexing   │     │     Vector Store     │  │
│  │   Pipeline   │────▶│   Pipeline   │────▶│     (Embeddings)     │  │
│  └──────────────┘     └──────────────┘     └──────────────────────┘  │
│          │                    │                        │             │
│      Documents          Chunks +                  Indexed            │
│                          Embeddings               Vectors            │
│                                                                      │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐  │
│  │    Query     │     │  Retrieval   │     │  Context Assembly    │  │
│  │  Processing  │────▶│    Engine    │────▶│    + Generation      │  │
│  └──────────────┘     └──────────────┘     └──────────────────────┘  │
│          │                    │                        │             │
│     User Query          Top-K Chunks              LLM Response       │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
 Raw Documents
       │
       ▼
┌─────────────┐
│   Extract   │  ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
       │
       ▼
┌─────────────┐
│   Clean &   │  ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
       │
       ▼
┌─────────────┐
│    Chunk    │  ← Split into retrievable units
│  Documents  │
└─────────────┘
       │
       ▼
┌─────────────┐
│  Generate   │  ← Create vector representations
│ Embeddings  │
└─────────────┘
       │
       ▼
┌─────────────┐
│    Store    │  ← Persist vectors + metadata
│  in Index   │
└─────────────┘
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
What type of content?
├── Code
│ └── AST-based or function-level chunking
├── Tables/Structured
│ └── Keep tables intact, chunk surrounding text
├── Long narrative
│ └── Semantic or recursive chunking
├── Short documents (<1 page)
│ └── Whole document as chunk
└── Mixed content
└── Recursive with type-specific handlers
Without Overlap:
  [Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                              ↑
                 Information lost at boundary

With Overlap (20%):
  [Chunk 1: "The quick brown fox"]
            [Chunk 2: "brown fox jumps over"]
                       ↑
          Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
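A minimal sketch of fixed-size chunking with overlap; the sizes and the whitespace tokenizer are illustrative stand-ins (a production pipeline would count real model tokens, e.g. with tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` tokens with their neighbor.

    Whitespace tokens stand in for model tokens here; swap in the embedding
    model's real tokenizer for accurate sizing.
    """
    tokens = text.split()
    step = chunk_size - overlap            # window advance per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                          # the last window already reached the end
    return chunks

# 256-token chunks with 32-token (~12%) overlap
chunks = chunk_text("some long document text ... " * 200, chunk_size=256, overlap=32)
```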
Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Higher retrieval recall            └── Higher retrieval precision
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
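A hedged sketch of instruction-tuned retrieval with an open-source model, assuming the sentence-transformers package and the E5 convention of prefixing queries and passages (other models use different prefixes or none at all):

```python
from sentence_transformers import SentenceTransformer

# E5-style models expect task prefixes; check each model's card for its convention.
model = SentenceTransformer("intfloat/e5-large-v2")

passages = ["passage: Kubernetes Deployments manage a replicated set of Pods ..."]
query = "query: how to deploy containers"

# Normalize so that a dot product equals cosine similarity.
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)

scores = passage_vecs @ query_vec.T   # cosine similarities, shape (n_passages, 1)
```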
Query: "How to deploy containers"
│
▼
┌─────────┐
│ Embed │
│ Query │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ Vector Similarity Search │
│ (Cosine, Dot Product, L2) │
└─────────────────────────────────┘
│
▼
Top-K semantically similar chunks
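A minimal dense-retrieval sketch over precomputed, L2-normalized chunk embeddings (NumPy only; exact search rather than an ANN index):

```python
import numpy as np

def dense_search(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the top-k chunks by cosine similarity.

    Assumes query_vec and every row of chunk_vecs are already L2-normalized,
    so a dot product equals cosine similarity.
    """
    scores = chunk_vecs @ query_vec          # shape: (n_chunks,)
    return np.argsort(-scores)[:k].tolist()  # highest score first

# Toy example: random unit vectors stand in for real embeddings.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(1000, 384)).astype("float32")
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = chunks[42] + 0.1 * rng.normal(size=384).astype("float32")
query /= np.linalg.norm(query)

print(dense_search(query, chunks, k=5))      # chunk 42 should rank near the top
```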
Query: "Kubernetes pod deployment YAML"
│
▼
┌─────────┐
│Tokenize │
│ + Score │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ BM25 Ranking │
│ (Term frequency × IDF) │
└─────────────────────────────────┘
│
▼
Top-K lexically matching chunks
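A sparse-retrieval sketch assuming the rank_bm25 package; the corpus and the whitespace tokenizer are placeholders:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Kubernetes pod deployment YAML reference",
    "Docker container networking basics",
    "Deploying containers with Helm charts",
]
tokenized = [doc.lower().split() for doc in corpus]   # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized)
query = "kubernetes pod deployment yaml".split()

scores = bm25.get_scores(query)                # one BM25 score per document
top_docs = bm25.get_top_n(query, corpus, n=2)  # highest-scoring documents
```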
Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │
        └──▶ Sparse Search ─┘

Fusion Methods:
  • RRF (Reciprocal Rank Fusion)
  • Linear combination
  • Learned reranking
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
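The fusion step follows directly from the formula above; a small sketch whose toy rankings mirror the worked example:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several ranked doc-ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A is dense rank 1 / sparse rank 5, B is dense rank 3 / sparse rank 1,
# so B wins on combined relevance, as in the example above.
dense_ranking  = ["A", "C", "B", "E", "D"]
sparse_ranking = ["B", "D", "E", "C", "A"]
print(rrf_fuse([dense_ranking, sparse_ranking])[:2])
```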
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall) │
│ • ANN search (HNSW, IVF) │
│ • Retrieve top-100 candidates │
│ • Latency: 10-50ms │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision) │
│ • Cross-encoder or LLM reranking │
│ • Score top-100 → return top-10 │
│ • Latency: 100-500ms │
└─────────────────────────────────────────────────────────┘
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
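A stage-2 reranking sketch assuming the sentence-transformers CrossEncoder API; the model name and candidate chunks are illustrative:

```python
from sentence_transformers import CrossEncoder

# Stage 2: score the candidates that stage 1 (ANN search) returned.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to deploy containers"
candidates = [
    "Kubernetes Deployments manage a replicated set of Pods ...",
    "A Dockerfile describes how to build a container image ...",
    "Quarterly revenue grew 12% year over year ...",
]

scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_chunks = [chunk for chunk, _ in reranked[:10]]   # keep the best few for the LLM
```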
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
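One way to enforce the retrieved-context budget is greedy packing of relevance-ordered chunks; a sketch with an approximate token count (the words-to-tokens ratio is an assumption, not the model's real tokenizer):

```python
def pack_context(chunks_by_relevance: list[str], budget_tokens: int = 8_000) -> list[str]:
    """Greedily add relevance-ordered chunks until the retrieval budget is spent.

    Token counts are estimated from word counts (~1.3 tokens per word); use the
    target model's tokenizer for exact accounting.
    """
    packed, used = [], 0
    for chunk in chunks_by_relevance:
        est = int(len(chunk.split()) * 1.3)
        if used + est > budget_tokens:
            break            # stop rather than truncate a chunk mid-thought
        packed.append(chunk)
        used += est
    return packed
```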
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│   Beginning            Middle                End        │
│    ████                 ░░░░                 ████       │
│  High attention     Low attention      High attention   │
└─────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
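Mitigation 1 can be implemented by interleaving the relevance-ordered chunks toward the two ends of the context; a sketch:

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Interleave chunks so the most relevant land at the start and end of the context.

    Input is ordered most→least relevant; the output alternates front/back so the
    middle of the context holds the least relevant material.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance order 1,2,3,4,5 → context order 1,3,5,4,2
print(reorder_for_attention(["1", "2", "3", "4", "5"]))
```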
Original Query: "Tell me about the project"
                    │
     ┌──────────────┼───────────────┐
     ▼              ▼               ▼
┌─────────┐    ┌──────────┐    ┌──────────┐
│  HyDE   │    │  Query   │    │ Sub-query│
│  (Hypo  │    │ Expansion│    │  Decomp. │
│   Doc)  │    │          │    │          │
└─────────┘    └──────────┘    └──────────┘
     │              │               │
     ▼              ▼               ▼
Hypothetical    "project,       "What is the
answer to        goals,          project scope?"
embed            timeline,      "What are the
                 deliverables"   deliverables?"
Query: "How does photosynthesis work?"
│
▼
┌───────────────┐
│ LLM generates │
│ hypothetical │
│ answer │
└───────────────┘
│
▼
"Photosynthesis is the process by which
plants convert sunlight into energy..."
│
▼
┌───────────────┐
│ Embed hypo │
│ document │
└───────────────┘
│
▼
Search with hypothetical embedding
(Better matches actual documents)
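A HyDE sketch assuming the OpenAI Python SDK (v1+); the model names are illustrative, and any LLM/embedding pair would work the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_query_vector(question: str) -> list[float]:
    """Embed a hypothetical answer instead of the raw question (HyDE)."""
    # 1. Ask the LLM to write a plausible answer, without any retrieval.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical passage and use that vector for the search.
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=hypo,
    ).data[0].embedding

query_vec = hyde_query_vector("How does photosynthesis work?")
# query_vec now goes into the vector store's similarity search
```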
┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response │
│ 2. Decide: Need more retrieval? (critique token) │
│ ├── Yes → Retrieve more, regenerate │
│ └── No → Check factuality (isRel, isSup tokens) │
│ 3. Verify claims against sources │
│ 4. Regenerate if needed │
│ 5. Return verified response │
└─────────────────────────────────────────────────────────┘
Query: "Compare Q3 revenue across regions"
│
▼
┌───────────────┐
│ Query Agent │
│ (Plan steps) │
└───────────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA │ │ APAC │ │ AMER │
│ docs │ │ docs │ │ docs │
└───────┘ └───────┘ └───────┘
│ │ │
└───────────┼───────────┘
▼
┌───────────────┐
│ Synthesize │
│ Comparison │
└───────────────┘
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
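A sketch of per-query Recall@K and MRR, to be averaged over an evaluation query set:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k retrieved results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# One query's results; average these metrics over the whole query set.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))  # 0.5, 0.333...
```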
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
┌─────────────────────────────────────────────────────────┐
│ RAG Evaluation Pipeline │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions │
│ 2. Ground Truth: Expected answers + source docs │
│ 3. Metrics: │
│ • Retrieval: Recall@K, MRR, NDCG │
│ • Generation: Correctness, Faithfulness │
│ 4. A/B Testing: Compare configurations │
│ 5. Error Analysis: Identify failure patterns │
└─────────────────────────────────────────────────────────┘
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
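At the 1-10M document scale, an HNSW index avoids exhaustive search; a sketch assuming the faiss package (dimensions, corpus size, and graph parameters are illustrative):

```python
import faiss
import numpy as np

dim = 1024
vectors = np.random.rand(100_000, dim).astype("float32")   # stand-in embeddings

# HNSW graph index: approximate nearest-neighbor search, no training step required.
index = faiss.IndexHNSWFlat(dim, 32)    # 32 = graph neighbors per node (M)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)   # top-10 approximate neighbors
```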
Typical RAG Pipeline Latency:
Query embedding: 10-50ms
Vector search: 20-100ms
Reranking: 100-300ms
LLM generation: 500-2000ms
────────────────────────────
Total: 630-2450ms
Target p95: <3 seconds for interactive use
llm-serving-patterns - LLM inference infrastructure
vector-databases - Vector store selection and optimization
ml-system-design - End-to-end ML pipeline design
estimation-techniques - Capacity planning for RAG systems

Date: 2025-12-26