From grimoire
Guides designing RAG systems that ground LLM responses in retrieved documents to reduce hallucination and enable knowledge updates without retraining.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
How this skill is triggered (by the user, by Claude, or both):
Slash command: /grimoire:design-rag-system
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Build a Retrieval-Augmented Generation system that grounds LLM responses in retrieved documents to reduce hallucination and enable knowledge updates without retraining.
Designs a retrieval-augmented generation pipeline with ingestion, chunking, embedding, vector DB, hybrid search, re-ranking, and prompt construction to ground LLM outputs in external knowledge.
Adopted by: Microsoft (Azure AI Search + OpenAI), Anthropic (Claude with document retrieval), Google (Vertex AI Search), Salesforce Einstein GPT (all major enterprise AI platforms use RAG).
Impact: RAG reduces hallucination rates by 38-68% compared to LLM-only generation (RAGAS benchmarks); enables knowledge updates in hours vs. the months required for fine-tuning; reduces LLM costs by limiting context size.
Why best: LLMs cannot know proprietary or post-training data; fine-tuning is expensive and doesn't generalize; RAG provides fresh, attributable, updatable knowledge.
Sources: Lewis et al. NeurIPS 2020; Gao et al. "Retrieval-Augmented Generation for Large Language Models: A Survey" (2023); LlamaIndex documentation
Define the knowledge corpus — Identify the documents the system must retrieve from: internal wikis, PDFs, databases, code repositories, API documentation. Define update frequency (hourly, daily, real-time). Update frequency determines whether you need batch indexing or streaming indexing.
Design the document ingestion pipeline: Build an automated pipeline: fetch documents → parse (PDF, HTML, DOCX) → extract text → chunk → embed → index. Handle incremental updates: new documents, modified documents, and deletions. Use checksums to detect changes and avoid re-indexing unchanged documents.
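The checksum idea in the ingestion step can be sketched as follows. This is a minimal illustration, not a full pipeline: the names content_checksum and plan_index_updates, and the two dictionaries (document ID to text, document ID to previously stored checksum), are assumptions made for the example. SHA-256 is one reasonable hash choice; the source does not name one.

```python
from __future__ import annotations

import hashlib


def content_checksum(text: str) -> str:
    """Return a SHA-256 hex digest of the extracted document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_updates(
    current_docs: dict[str, str],
    indexed_checksums: dict[str, str],
) -> dict[str, list[str]]:
    """Classify documents as new, modified, deleted, or unchanged.

    current_docs maps doc_id -> text fetched in this run (hypothetical shape).
    indexed_checksums maps doc_id -> checksum saved at the last indexing run.
    """
    plan: dict[str, list[str]] = {
        "new": [], "modified": [], "deleted": [], "unchanged": [],
    }
    for doc_id, text in current_docs.items():
        checksum = content_checksum(text)
        if doc_id not in indexed_checksums:
            plan["new"].append(doc_id)
        elif indexed_checksums[doc_id] != checksum:
            plan["modified"].append(doc_id)
        else:
            plan["unchanged"].append(doc_id)
    # Anything indexed earlier but missing from this run was deleted upstream.
    plan["deleted"] = [d for d in indexed_checksums if d not in current_docs]
    return plan
```

Only the "new" and "modified" documents would then go through chunking and embedding, and the "deleted" ones would be removed from the index.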
Choose a chunking strategy: Chunking is often one of the most impactful RAG design decisions. Fixed-size chunks (512-1024 tokens with 10-20% overlap) work for uniform documents. Semantic chunking (splitting at paragraph or section boundaries) works better for structured documents. Smaller chunks improve retrieval precision; larger chunks preserve more surrounding context. Test both empirically on your own corpus.
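A minimal sketch of fixed-size chunking with overlap is shown here. It assumes the text has already been split into tokens; the function name chunk_tokens and the defaults (512 tokens, 64 overlap, which is 12.5%) are illustrative choices, and a real pipeline would use the tokenizer that matches its embedding model rather than whitespace splitting.

```python
from __future__ import annotations


def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "RAG grounds answers in retrieved documents to reduce errors".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(" ".join(chunk))
```

Each chunk begins with the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.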
Select an embedding model: Choose based on domain (general vs. domain-specific), multilingual requirements, embedding dimension (for example 768 vs. 1536), and retrieval benchmarks (MTEB leaderboard). OpenAI text-embedding-3-small is cost-effective for general use. Domain-specific embeddings often outperform general embeddings on technical corpora.
Build the vector index: Use a vector database (Pinecone, Weaviate, pgvector, Qdrant, ChromaDB) or a managed service (Azure AI Search, Vertex AI Vector Search, formerly Matching Engine). Configure the distance metric (cosine similarity for normalized embeddings), the index type (HNSW for approximate nearest neighbor search), and metadata filtering support.
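To make the distance-metric choice concrete, the sketch below shows why cosine similarity pairs naturally with normalized embeddings: once vectors have unit length, cosine similarity reduces to a dot product. The top_k function is an exact brute-force search written for illustration only; an HNSW index approximates the same ranking at much larger scale. All function names and the toy vectors are assumptions for the example.

```python
from __future__ import annotations

import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(
    query: list[float],
    vectors: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    """Exact nearest-neighbor search by cosine similarity (brute force)."""
    q = normalize(query)
    # For unit vectors the dot product equals the cosine similarity.
    scored = [(doc_id, dot(q, normalize(v))) for doc_id, v in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice the stored vectors would be normalized once at indexing time, so each query costs only dot products.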
Implement hybrid retrieval: Combine dense (vector) retrieval with sparse (BM25/keyword) retrieval. Dense retrieval handles semantic similarity; sparse retrieval handles exact term matching. After merging the two result sets, re-rank them with a cross-encoder (Cohere Rerank, BGE Reranker). Hybrid retrieval typically outperforms either method alone.
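The text does not say how to merge the dense and sparse result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as this guide's prescribed method. It needs only the rank positions from each retriever, so the two score scales never have to be reconciled. The constant k = 60 is the value conventionally used with RRF; the function name and document IDs are hypothetical.

```python
from __future__ import annotations


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # e.g. from the vector index
    sparse_hits = ["b", "d", "a"]  # e.g. from BM25
    for doc_id, score in reciprocal_rank_fusion([dense_hits, sparse_hits]):
        print(doc_id, round(score, 5))
```

The fused top results would then be passed to the cross-encoder re-ranker, which is more expensive per document and so is usually applied to a short candidate list.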
Design the retrieval query: Don't use the raw user question as the retrieval query. Expand queries with hypothetical document embeddings (HyDE), query rewriting, or multi-query generation. For conversational systems, include conversation history in query construction to handle coreference ("what about the second option?").
Construct the generation prompt: Format retrieved context with clear delimiters: <document source="..." chunk_id="...">{{text}}</document>. Instruct the model to answer only from the provided documents, cite sources, and state when the answer is not in the documents. Place context before the question in the prompt.
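A prompt builder following the layout described in this step might look like the sketch below. The RetrievedChunk dataclass, the function names, and the exact instruction wording are assumptions for illustration. Escaping the chunk text is an added design choice, not something the source requires: it keeps a retrieved document from closing the delimiter tag early.

```python
from __future__ import annotations

from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class RetrievedChunk:
    """Hypothetical container for one retrieved chunk."""
    source: str
    chunk_id: str
    text: str


INSTRUCTIONS = (
    "Answer using only the documents below. Cite the source of each claim. "
    "If the answer is not in the documents, say so."
)


def format_chunk(chunk: RetrievedChunk) -> str:
    """Wrap one chunk in <document> delimiters with escaped content."""
    return (
        f"<document source={quoteattr(chunk.source)} "
        f"chunk_id={quoteattr(chunk.chunk_id)}>"
        f"{escape(chunk.text)}</document>"
    )


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Place instructions and context first, then the user question."""
    documents = "\n".join(format_chunk(c) for c in chunks)
    return f"{INSTRUCTIONS}\n\n{documents}\n\nQuestion: {question}"


if __name__ == "__main__":
    chunks = [RetrievedChunk("wiki/vpn.md", "3", "Use port 443 & TLS.")]
    print(build_prompt("Which port does the VPN use?", chunks))
```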
Implement faithfulness validation: For high-stakes applications, add a faithfulness check that verifies the generated answer is grounded in the retrieved documents (NLI classifier or LLM-as-judge). Reject or flag responses that make claims not supported by the retrieved context.
Evaluate with RAG-specific metrics — Use RAGAS or TruLens: retrieval precision (are retrieved docs relevant?), retrieval recall (are all relevant docs retrieved?), answer faithfulness (is the answer grounded in retrieved docs?), and answer relevance (does the answer address the question?). Establish baselines and track metrics on every pipeline change.
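The two retrieval metrics can be computed directly when a labeled set of relevant document IDs exists for each test question. The sketch below uses the plain set-based definitions of precision and recall at k; it is not the RAGAS or TruLens implementation, which score context with an LLM. The function name and the example IDs are hypothetical, and retrieved IDs are assumed to be unique.

```python
from __future__ import annotations


def precision_recall_at_k(
    retrieved: list[str],
    relevant: set[str],
    k: int,
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query.

    retrieved: doc IDs in ranked order (assumed unique).
    relevant:  doc IDs a human labeled as relevant to the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4)
    print(f"precision@4={p:.2f} recall@4={r:.2f}")
```

Averaging these values over a fixed question set gives a baseline that can be re-run after every change to chunking, embeddings, or retrieval.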