RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

When to Use This Skill

Building Q&A systems over proprietary documents
Creating chatbots with current, factual information
Implementing semantic search with natural language queries
Reducing hallucinations with grounded responses
Enabling LLMs to access domain-specific knowledge
Building documentation assistants
Creating research tools with source citation

Core Components

1. Vector Databases

Purpose: Store and retrieve document embeddings efficiently

Options:

Pinecone: Managed, scalable, serverless
Weaviate: Open-source, hybrid search, GraphQL
Milvus: High performance, on-premise
Chroma: Lightweight, easy to use, local development
Qdrant: Fast, filtered search, Rust-based
pgvector: PostgreSQL extension, SQL integration

2. Embeddings

Purpose: Convert text to numerical vectors for similarity search

Models (2026):

Model	Dimensions	Best For
voyage-3-large	1024	Claude apps (Anthropic recommended)
voyage-code-3	1024	Code search
text-embedding-3-large	3072	OpenAI apps, high accuracy
text-embedding-3-small	1536	OpenAI apps, cost-effective
bge-large-en-v1.5	1024	Open source, local deployment
multilingual-e5-large	1024	Multi-language support

3. Retrieval Strategies

Approaches:

Dense Retrieval: Semantic similarity via embeddings
Sparse Retrieval: Keyword matching (BM25, TF-IDF)
Hybrid Search: Combine dense + sparse with weighted fusion
Multi-Query: Generate multiple query variations
HyDE: Generate hypothetical documents for better retrieval

4. Reranking

Purpose: Improve retrieval quality by reordering results

Methods:

Cross-Encoders: BERT-based reranking (ms-marco-MiniLM)
Cohere Rerank: API-based reranking
Maximal Marginal Relevance (MMR): Diversity + relevance
LLM-based: Use LLM to score relevance

Quick Start with LangGraph

from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

    Context:
    {context}

    Question: {question}

    Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])

Advanced RAG Patterns

Pattern 1: Hybrid Search with RRF

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)

Pattern 2: Multi-Query Retrieval

from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")

Pattern 3: Contextual Compression

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")

Pattern 4: Parent Document Retriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")

Pattern 5: HyDE (Hypothetical Document Embeddings)

from langchain_core.prompts import ChatPromptTemplate

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

    Question: {question}

    Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

hyde_rag = builder.compile()

Document Chunking Strategies

Recursive Character Text Splitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)

chunks = splitter.split_documents(documents)

Token-Based Splitting

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

Markdown Header Splitter

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)

Vector Store Configurations

Pinecone (Serverless)

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)

Weaviate

import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)

Chroma (Local Development)

from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

pgvector (PostgreSQL)

from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)

Retrieval Optimization

1. Metadata Filtering

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)

2. Maximal Marginal Relevance (MMR)

# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)

3. Reranking with Cross-Encoder

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]

4. Cohere Rerank

from langchain.retrievers import CohereRerank
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

Prompt Engineering for RAG

Contextual Prompt with Citations

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.

    If you cannot answer based on the context, say "I don't have enough information."

    Context:
    {context}

    Question: {question}

    Instructions:
    1. Use only information from the context
    2. Cite sources with [1], [2] format
    3. If uncertain, express uncertainty

    Answer (with citations):"""
)

Structured Output for RAG

from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)

Evaluation Metrics

from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])

        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}

Resources

Best Practices

Chunk Size: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
Overlap: Use 10-20% overlap to preserve context at boundaries
Metadata: Include source, page, timestamp for filtering and debugging
Hybrid Search: Combine semantic and keyword search for best recall
Reranking: Use cross-encoder reranking for precision-critical applications
Citations: Always return source documents for transparency
Evaluation: Continuously test retrieval quality and answer accuracy
Monitoring: Track retrieval metrics and latency in production

Common Issues

Poor Retrieval: Check embedding quality, chunk size, query formulation
Irrelevant Results: Add metadata filtering, use hybrid search, rerank
Missing Information: Ensure documents are properly indexed, check chunking
Slow Queries: Optimize vector store, use caching, reduce k
Hallucinations: Improve grounding prompt, add verification step
Context Too Long: Use compression or parent document retriever

rag-implementation