Retrieval-Augmented Generation (RAG) system design patterns, chunking strategies, embedding models, retrieval techniques, and context assembly. Use when designing RAG pipelines, improving retrieval quality, or building knowledge-grounded LLM applications.
Design and optimize RAG pipelines, from chunking strategies and embedding model selection to hybrid retrieval and multi-stage reranking. Use when building knowledge-grounded LLM applications that need to retrieve and synthesize information from large document collections.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Use this skill when the task involves any of the following:
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
┌──────────────────────────────────────────────────────────────────────┐
│                             RAG Pipeline                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐  │
│  │  Ingestion   │     │   Indexing   │     │     Vector Store     │  │
│  │   Pipeline   │────▶│   Pipeline   │────▶│     (Embeddings)     │  │
│  └──────────────┘     └──────────────┘     └──────────────────────┘  │
│          │                    │                        │             │
│      Documents          Chunks +                  Indexed            │
│                          Embeddings               Vectors            │
│                                                                      │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐  │
│  │    Query     │     │  Retrieval   │     │  Context Assembly    │  │
│  │  Processing  │────▶│    Engine    │────▶│    + Generation      │  │
│  └──────────────┘     └──────────────┘     └──────────────────────┘  │
│          │                    │                        │             │
│     User Query          Top-K Chunks              LLM Response       │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
 Raw Documents
       │
       ▼
┌─────────────┐
│   Extract   │  ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
       │
       ▼
┌─────────────┐
│   Clean &   │  ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
       │
       ▼
┌─────────────┐
│    Chunk    │  ← Split into retrievable units
│  Documents  │
└─────────────┘
       │
       ▼
┌─────────────┐
│  Generate   │  ← Create vector representations
│ Embeddings  │
└─────────────┘
       │
       ▼
┌─────────────┐
│    Store    │  ← Persist vectors + metadata
│  in Index   │
└─────────────┘
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
What type of content?
├── Code
│ └── AST-based or function-level chunking
├── Tables/Structured
│ └── Keep tables intact, chunk surrounding text
├── Long narrative
│ └── Semantic or recursive chunking
├── Short documents (<1 page)
│ └── Whole document as chunk
└── Mixed content
└── Recursive with type-specific handlers
Without Overlap:
  [Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                              ↑
                 Information lost at boundary

With Overlap (20%):
  [Chunk 1: "The quick brown fox"]
            [Chunk 2: "brown fox jumps over"]
                       ↑
          Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
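A minimal sketch of fixed-size chunking with overlap; the sizes and the whitespace tokenizer are illustrative stand-ins (a production pipeline would count real model tokens, e.g. with tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` tokens with their neighbor.

    Whitespace tokens stand in for model tokens here; swap in the embedding
    model's real tokenizer for accurate sizing.
    """
    tokens = text.split()
    step = chunk_size - overlap            # window advance per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                          # the last window already reached the end
    return chunks

# 256-token chunks with 32-token (~12%) overlap
chunks = chunk_text("some long document text ... " * 200, chunk_size=256, overlap=32)
```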
Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Higher retrieval recall            └── Higher retrieval precision
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
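A hedged sketch of instruction-tuned retrieval with an open-source model, assuming the sentence-transformers package and the E5 convention of prefixing queries and passages (other models use different prefixes or none at all):

```python
from sentence_transformers import SentenceTransformer

# E5-style models expect task prefixes; check each model's card for its convention.
model = SentenceTransformer("intfloat/e5-large-v2")

passages = ["passage: Kubernetes Deployments manage a replicated set of Pods ..."]
query = "query: how to deploy containers"

# Normalize so that a dot product equals cosine similarity.
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)

scores = passage_vecs @ query_vec.T   # cosine similarities, shape (n_passages, 1)
```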
Query: "How to deploy containers"
│
▼
┌─────────┐
│ Embed │
│ Query │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ Vector Similarity Search │
│ (Cosine, Dot Product, L2) │
└─────────────────────────────────┘
│
▼
Top-K semantically similar chunks
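A minimal dense-retrieval sketch over precomputed, L2-normalized chunk embeddings (NumPy only; exact search rather than an ANN index):

```python
import numpy as np

def dense_search(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the top-k chunks by cosine similarity.

    Assumes query_vec and every row of chunk_vecs are already L2-normalized,
    so a dot product equals cosine similarity.
    """
    scores = chunk_vecs @ query_vec          # shape: (n_chunks,)
    return np.argsort(-scores)[:k].tolist()  # highest score first

# Toy example: random unit vectors stand in for real embeddings.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(1000, 384)).astype("float32")
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = chunks[42] + 0.1 * rng.normal(size=384).astype("float32")
query /= np.linalg.norm(query)

print(dense_search(query, chunks, k=5))      # chunk 42 should rank near the top
```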
Query: "Kubernetes pod deployment YAML"
│
▼
┌─────────┐
│Tokenize │
│ + Score │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ BM25 Ranking │
│ (Term frequency × IDF) │
└─────────────────────────────────┘
│
▼
Top-K lexically matching chunks
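A sparse-retrieval sketch assuming the rank_bm25 package; the corpus and the whitespace tokenizer are placeholders:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Kubernetes pod deployment YAML reference",
    "Docker container networking basics",
    "Deploying containers with Helm charts",
]
tokenized = [doc.lower().split() for doc in corpus]   # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized)
query = "kubernetes pod deployment yaml".split()

scores = bm25.get_scores(query)                # one BM25 score per document
top_docs = bm25.get_top_n(query, corpus, n=2)  # highest-scoring documents
```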
Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │
        └──▶ Sparse Search ─┘

Fusion Methods:
  • RRF (Reciprocal Rank Fusion)
  • Linear combination
  • Learned reranking
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
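The fusion step follows directly from the formula above; a small sketch whose toy rankings mirror the worked example:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several ranked doc-ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A is dense rank 1 / sparse rank 5, B is dense rank 3 / sparse rank 1,
# so B wins on combined relevance, as in the example above.
dense_ranking  = ["A", "C", "B", "E", "D"]
sparse_ranking = ["B", "D", "E", "C", "A"]
print(rrf_fuse([dense_ranking, sparse_ranking])[:2])
```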
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall) │
│ • ANN search (HNSW, IVF) │
│ • Retrieve top-100 candidates │
│ • Latency: 10-50ms │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision) │
│ • Cross-encoder or LLM reranking │
│ • Score top-100 → return top-10 │
│ • Latency: 100-500ms │
└─────────────────────────────────────────────────────────┘
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
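A stage-2 reranking sketch assuming the sentence-transformers CrossEncoder API; the model name and candidate chunks are illustrative:

```python
from sentence_transformers import CrossEncoder

# Stage 2: score the candidates that stage 1 (ANN search) returned.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to deploy containers"
candidates = [
    "Kubernetes Deployments manage a replicated set of Pods ...",
    "A Dockerfile describes how to build a container image ...",
    "Quarterly revenue grew 12% year over year ...",
]

scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_chunks = [chunk for chunk, _ in reranked[:10]]   # keep the best few for the LLM
```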
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
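One way to enforce the retrieved-context budget is greedy packing of relevance-ordered chunks; a sketch with an approximate token count (the words-to-tokens ratio is an assumption, not the model's real tokenizer):

```python
def pack_context(chunks_by_relevance: list[str], budget_tokens: int = 8_000) -> list[str]:
    """Greedily add relevance-ordered chunks until the retrieval budget is spent.

    Token counts are estimated from word counts (~1.3 tokens per word); use the
    target model's tokenizer for exact accounting.
    """
    packed, used = [], 0
    for chunk in chunks_by_relevance:
        est = int(len(chunk.split()) * 1.3)
        if used + est > budget_tokens:
            break            # stop rather than truncate a chunk mid-thought
        packed.append(chunk)
        used += est
    return packed
```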
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│   Beginning            Middle                End        │
│    ████                 ░░░░                 ████       │
│  High attention     Low attention      High attention   │
└─────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
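Mitigation 1 can be implemented by interleaving the relevance-ordered chunks toward the two ends of the context; a sketch:

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Interleave chunks so the most relevant land at the start and end of the context.

    Input is ordered most→least relevant; the output alternates front/back so the
    middle of the context holds the least relevant material.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance order 1,2,3,4,5 → context order 1,3,5,4,2
print(reorder_for_attention(["1", "2", "3", "4", "5"]))
```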
Original Query: "Tell me about the project"
                    │
     ┌──────────────┼───────────────┐
     ▼              ▼               ▼
┌─────────┐    ┌──────────┐    ┌──────────┐
│  HyDE   │    │  Query   │    │ Sub-query│
│  (Hypo  │    │ Expansion│    │  Decomp. │
│   Doc)  │    │          │    │          │
└─────────┘    └──────────┘    └──────────┘
     │              │               │
     ▼              ▼               ▼
Hypothetical    "project,       "What is the
answer to        goals,          project scope?"
embed            timeline,      "What are the
                 deliverables"   deliverables?"
Query: "How does photosynthesis work?"
│
▼
┌───────────────┐
│ LLM generates │
│ hypothetical │
│ answer │
└───────────────┘
│
▼
"Photosynthesis is the process by which
plants convert sunlight into energy..."
│
▼
┌───────────────┐
│ Embed hypo │
│ document │
└───────────────┘
│
▼
Search with hypothetical embedding
(Better matches actual documents)
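A HyDE sketch assuming the OpenAI Python SDK (v1+); the model names are illustrative, and any LLM/embedding pair would work the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_query_vector(question: str) -> list[float]:
    """Embed a hypothetical answer instead of the raw question (HyDE)."""
    # 1. Ask the LLM to write a plausible answer, without any retrieval.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical passage and use that vector for the search.
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=hypo,
    ).data[0].embedding

query_vec = hyde_query_vector("How does photosynthesis work?")
# query_vec now goes into the vector store's similarity search
```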
┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response │
│ 2. Decide: Need more retrieval? (critique token) │
│ ├── Yes → Retrieve more, regenerate │
│ └── No → Check factuality (isRel, isSup tokens) │
│ 3. Verify claims against sources │
│ 4. Regenerate if needed │
│ 5. Return verified response │
└─────────────────────────────────────────────────────────┘
Query: "Compare Q3 revenue across regions"
│
▼
┌───────────────┐
│ Query Agent │
│ (Plan steps) │
└───────────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA │ │ APAC │ │ AMER │
│ docs │ │ docs │ │ docs │
└───────┘ └───────┘ └───────┘
│ │ │
└───────────┼───────────┘
▼
┌───────────────┐
│ Synthesize │
│ Comparison │
└───────────────┘
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
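A sketch of per-query Recall@K and MRR, to be averaged over an evaluation query set:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k retrieved results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# One query's results; average these metrics over the whole query set.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))  # 0.5, 0.333...
```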
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
┌─────────────────────────────────────────────────────────┐
│ RAG Evaluation Pipeline │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions │
│ 2. Ground Truth: Expected answers + source docs │
│ 3. Metrics: │
│ • Retrieval: Recall@K, MRR, NDCG │
│ • Generation: Correctness, Faithfulness │
│ 4. A/B Testing: Compare configurations │
│ 5. Error Analysis: Identify failure patterns │
└─────────────────────────────────────────────────────────┘
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
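At the 1-10M document scale, an HNSW index avoids exhaustive search; a sketch assuming the faiss package (dimensions, corpus size, and graph parameters are illustrative):

```python
import faiss
import numpy as np

dim = 1024
vectors = np.random.rand(100_000, dim).astype("float32")   # stand-in embeddings

# HNSW graph index: approximate nearest-neighbor search, no training step required.
index = faiss.IndexHNSWFlat(dim, 32)    # 32 = graph neighbors per node (M)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)   # top-10 approximate neighbors
```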
Typical RAG Pipeline Latency:
Query embedding: 10-50ms
Vector search: 20-100ms
Reranking: 100-300ms
LLM generation: 500-2000ms
────────────────────────────
Total: 630-2450ms
Target p95: <3 seconds for interactive use
llm-serving-patterns - LLM inference infrastructure
vector-databases - Vector store selection and optimization
ml-system-design - End-to-end ML pipeline design
estimation-techniques - Capacity planning for RAG systems

Date: 2025-12-26