Covers RAG architecture including design patterns, chunking strategies, embedding models, retrieval techniques, hybrid search, and context assembly for LLM pipelines.
Install: npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design
Use this skill when building RAG systems for LLM apps with vector databases, embeddings, and retrieval strategies: document Q&A, grounded chatbots, and semantic search. It designs production-grade, end-to-end RAG architectures for use cases such as customer support chatbots, documentation Q&A, legal search, and code assistance, covering ingestion pipelines, chunking, embedding generation, vector store configuration, hybrid search, reranking, retrieval evaluation, and scaling.
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
┌─────────────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Ingestion │ │ Indexing │ │ Vector Store │ │
│ │ Pipeline │───▶│ Pipeline │───▶│ (Embeddings) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │ │ │
│ Documents Chunks + Indexed │
│ Embeddings Vectors │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Query │ │ Retrieval │ │ Context Assembly │ │
│ │ Processing │───▶│ Engine │───▶│ + Generation │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │ │ │
│ User Query Top-K Chunks LLM Response │
│ │
└─────────────────────────────────────────────────────────────────────┘
Raw Documents
│
▼
┌─────────────┐
│ Extract │ ← PDF, HTML, DOCX, Markdown
│ Content │
└─────────────┘
│
▼
┌─────────────┐
│ Clean & │ ← Remove boilerplate, normalize
│ Normalize │
└─────────────┘
│
▼
┌─────────────┐
│ Chunk │ ← Split into retrievable units
│ Documents │
└─────────────┘
│
▼
┌─────────────┐
│ Generate │ ← Create vector representations
│ Embeddings │
└─────────────┘
│
▼
┌─────────────┐
│ Store │ ← Persist vectors + metadata
│ in Index │
└─────────────┘
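A minimal sketch of the five stages, with the extractor, cleaner, splitter, embedding function, and vector store all injected as parameters; every name here is illustrative rather than a fixed API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: list[float]

def ingest(doc_id: str, raw: bytes, extract, clean, split, embed, store) -> None:
    text = clean(extract(raw))                                  # extract + normalize
    chunks = [Chunk(doc_id, p, embed(p)) for p in split(text)]  # chunk + embed
    store.upsert(chunks)                                        # persist vectors + metadata
```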
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
What type of content?
├── Code
│ └── AST-based or function-level chunking
├── Tables/Structured
│ └── Keep tables intact, chunk surrounding text
├── Long narrative
│ └── Semantic or recursive chunking
├── Short documents (<1 page)
│ └── Whole document as chunk
└── Mixed content
└── Recursive with type-specific handlers
Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
↑
Information lost at boundary
With Overlap:
[Chunk 1: "The quick brown fox"]
[Chunk 2: "brown fox jumps over"]
↑
Context preserved across boundaries
Recommended overlap: 10-20% of chunk size
Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Higher retrieval precision         └── Higher context recall
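As a concrete baseline, a minimal token-based fixed-size chunker with overlap, assuming the tiktoken tokenizer package is available (overlap here is roughly 20% of the chunk size):

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 256, overlap: int = 51) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap                 # stride between chunk starts
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), step)]
```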
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
└── Need self-hosted/open source?
├── Yes → BGE-large or E5-large-v2
└── No
└── Need multilingual?
├── Yes → Cohere embed-v3
└── No → OpenAI text-embedding-3-small
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
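A sketch of two of these techniques with sentence-transformers: E5-style instruction prefixes ("query: " / "passage: ") and Matryoshka-style truncation. Note the truncation shown here only preserves quality for models trained with a Matryoshka objective; with e5-large-v2 it merely demonstrates the mechanics:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

def embed(text: str, role: str = "passage", dims: int | None = None) -> np.ndarray:
    vec = model.encode(f"{role}: {text}", normalize_embeddings=True)
    if dims is not None:                 # Matryoshka-style truncation
        vec = vec[:dims]
        vec = vec / np.linalg.norm(vec)  # re-normalize after slicing
    return vec
```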
Query: "How to deploy containers"
│
▼
┌─────────┐
│ Embed │
│ Query │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ Vector Similarity Search │
│ (Cosine, Dot Product, L2) │
└─────────────────────────────────┘
│
▼
Top-K semantically similar chunks
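As a reference point, brute-force dense retrieval in numpy; with L2-normalized embeddings, cosine similarity reduces to a dot product:

```python
import numpy as np

def dense_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> list[int]:
    # doc_matrix: (num_chunks, dim) with L2-normalized rows; query_vec normalized too
    scores = doc_matrix @ query_vec          # cosine similarity via dot product
    return np.argsort(-scores)[:k].tolist()  # indices of the top-K chunks
```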
Query: "Kubernetes pod deployment YAML"
│
▼
┌─────────┐
│Tokenize │
│ + Score │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ BM25 Ranking │
│ (Term frequency × IDF) │
└─────────────────────────────────┘
│
▼
Top-K lexically matching chunks
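A sparse retrieval sketch using the rank_bm25 package; whitespace tokenization stands in for a real analyzer (stemming, stopwords):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "kubernetes pod deployment yaml example",
    "docker compose networking guide",
    "terraform module structure",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "kubernetes pod deployment yaml".split()
scores = bm25.get_scores(query)  # one BM25 score per document
best = max(range(len(corpus)), key=lambda i: scores[i])
```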
Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │
        └──▶ Sparse Search ─┘

Fusion Methods:
• RRF (Reciprocal Rank Fusion)
• Linear combination
• Learned reranking
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
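The same fusion as a small function; the assert reproduces the Doc A / Doc B example above (X, Y, Z are filler documents):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["A", "X", "B", "Y", "Z"]   # Doc A: dense rank 1, Doc B: dense rank 3
sparse = ["B", "X", "Y", "Z", "A"]   # Doc B: sparse rank 1, Doc A: sparse rank 5
assert rrf([dense, sparse])[0] == "B"
```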
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall) │
│ • ANN search (HNSW, IVF) │
│ • Retrieve top-100 candidates │
│ • Latency: 10-50ms │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision) │
│ • Cross-encoder or LLM reranking │
│ • Score top-100 → return top-10 │
│ • Latency: 100-500ms │
└─────────────────────────────────────────────────────────┘
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
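A Stage 2 sketch using a local cross-encoder from sentence-transformers; the checkpoint name is one common choice, not a requirement:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # Score every (query, candidate) pair, keep the best top_n
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [doc for doc, _ in ranked[:top_n]]
```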
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
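A relevance-ordered assembly sketch under a fixed token budget; whitespace word counts approximate tokens here and should be swapped for a real tokenizer:

```python
def assemble_context(chunks: list[tuple[str, float]], budget: int = 8000) -> str:
    # chunks: (text, relevance_score); most relevant chunks are placed first
    ordered = sorted(chunks, key=lambda c: -c[1])
    picked, used = [], 0
    for text, _ in ordered:
        cost = len(text.split())        # crude token estimate
        if used + cost > budget:
            continue                    # skip chunks that would exceed the budget
        picked.append(text)
        used += cost
    return "\n\n---\n\n".join(picked)
```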
LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning Middle End │
│ ████ ░░░░ ████ │
│ High attention Low attention High attention │
└─────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
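Mitigation 1 as code: alternate the ranked chunks between the front and the back of the context so the weakest chunks land in the middle:

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    # Input is sorted best-first; output puts rank 1 at the start, rank 2 at the end
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks [1, 2, 3, 4, 5] (1 = most relevant) become [1, 3, 5, 4, 2]
```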
Original Query: "Tell me about the project"
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ HyDE │ │ Query │ │ Sub-query│
│ (Hypo │ │ Expansion│ │ Decomp. │
│ Doc) │ │ │ │ │
└─────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
Hypothetical "project, "What is the
answer to goals, project scope?"
embed timeline, "What are the
deliverables" deliverables?"
Query: "How does photosynthesis work?"
│
▼
┌───────────────┐
│ LLM generates │
│ hypothetical │
│ answer │
└───────────────┘
│
▼
"Photosynthesis is the process by which
plants convert sunlight into energy..."
│
▼
┌───────────────┐
│ Embed hypo │
│ document │
└───────────────┘
│
▼
Search with hypothetical embedding
(Better matches actual documents)
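A HyDE sketch using the OpenAI Python client; the model names are placeholders, and index.search is a hypothetical vector-index interface:

```python
from openai import OpenAI

client = OpenAI()

def hyde_search(question: str, index, k: int = 5):
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=draft
    ).data[0].embedding
    return index.search(vec, k)  # hypothetical vector-index call
```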
┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response │
│ 2. Decide: Need more retrieval? (critique token) │
│ ├── Yes → Retrieve more, regenerate │
│ └── No → Check factuality (isRel, isSup tokens) │
│ 3. Verify claims against sources │
│ 4. Regenerate if needed │
│ 5. Return verified response │
└─────────────────────────────────────────────────────────┘
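A control-flow sketch only: needs_more_context and is_grounded are hypothetical stand-ins for the reflection tokens a Self-RAG-style fine-tuned model would emit; retrieve and generate are injected:

```python
def self_rag(query: str, retrieve, generate, needs_more_context, is_grounded,
             max_rounds: int = 3) -> str:
    context = retrieve(query)
    answer = generate(query, context)
    for _ in range(max_rounds):
        if needs_more_context(answer):            # step 2: critique
            context = context + retrieve(answer)  # retrieve more, then regenerate
            answer = generate(query, context)
        elif not is_grounded(answer, context):    # step 3: verify against sources
            answer = generate(query, context)     # step 4: regenerate if needed
        else:
            return answer                         # step 5: verified response
    return answer
```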
Query: "Compare Q3 revenue across regions"
│
▼
┌───────────────┐
│ Query Agent │
│ (Plan steps) │
└───────────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA │ │ APAC │ │ AMER │
│ docs │ │ docs │ │ docs │
└───────┘ └───────┘ └───────┘
│ │ │
└───────────┼───────────┘
▼
┌───────────────┐
│ Synthesize │
│ Comparison │
└───────────────┘
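A plan/act/synthesize sketch for the comparison query above; plan, search, and synthesize are injected, and in practice they wrap an LLM plus per-region indexes:

```python
def agentic_answer(query: str, plan, search, synthesize) -> str:
    sub_queries = plan(query)                  # e.g. one sub-query per region
    results = {q: search(q) for q in sub_queries}
    return synthesize(query, results)          # merge partial answers into a comparison
```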
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
┌─────────────────────────────────────────────────────────┐
│ RAG Evaluation Pipeline │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions │
│ 2. Ground Truth: Expected answers + source docs │
│ 3. Metrics: │
│ • Retrieval: Recall@K, MRR, NDCG │
│ • Generation: Correctness, Faithfulness │
│ 4. A/B Testing: Compare configurations │
│ 5. Error Analysis: Identify failure patterns │
└─────────────────────────────────────────────────────────┘
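Recall@K and MRR as self-contained functions; retrieved holds ranked doc IDs per query, relevant the ground-truth sets:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    hits = [len(set(r[:k]) & rel) / max(len(rel), 1)
            for r, rel in zip(retrieved, relevant)]
    return sum(hits) / len(hits)

def mrr(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    total = 0.0
    for ranked, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    return total / len(retrieved)
```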
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Typical RAG Pipeline Latency:
Query embedding: 10-50ms
Vector search: 20-100ms
Reranking: 100-300ms
LLM generation: 500-2000ms
────────────────────────────
Total: 630-2450ms
Target p95: <3 seconds for interactive use
llm-serving-patterns - LLM inference infrastructure
vector-databases - Vector store selection and optimization
ml-system-design - End-to-end ML pipeline design
estimation-techniques - Capacity planning for RAG systems

Date: 2025-12-26