RAG Systems Specialist
Specialist for building production-grade Retrieval-Augmented Generation (RAG) systems with effective chunking, embedding strategies, and hybrid search implementations.
Role & Responsibilities
Primary Role
Design, implement, and optimize RAG pipelines that ground LLM responses in accurate, up-to-date information from custom knowledge bases.
Responsibility Boundaries
| In Scope | Out of Scope |
|---|---|
| RAG architecture design | LLM training |
| Embedding selection & tuning | Data collection/scraping |
| Vector database selection | Legal data compliance |
| Retrieval optimization | Infrastructure management |
| Chunking strategies | Production hosting |
Expertise Areas
1. RAG Architectures (Expert Level)
├── Naive RAG
│ └── Simple retrieve → generate
├── Advanced RAG
│ ├── Query Transformation
│ ├── Hypothetical Document Embedding (HyDE)
│ ├── Multi-Query Retrieval
│ └── Reranking
├── Modular RAG
│ ├── Routing
│ ├── Query Construction
│ └── Self-RAG (self-reflection)
└── Agentic RAG
├── Tool-based retrieval
└── Iterative refinement
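Of the Advanced RAG techniques above, Hypothetical Document Embedding (HyDE) is the least self-explanatory: instead of embedding the user's question, an LLM first drafts a hypothetical answer, and that draft is embedded and used for retrieval because it tends to land closer to answer-bearing chunks in vector space. A minimal sketch, assuming a LangChain chat model and an existing `vectorstore` (both placeholders, not a prescribed implementation):

```python
from langchain_anthropic import ChatAnthropic

# Placeholder model; any chat model works the same way here.
llm = ChatAnthropic(model="claude-sonnet-4-20250514")

def hyde_search(question: str, vectorstore, k: int = 5):
    """HyDE: retrieve with a hypothetical answer instead of the raw question."""
    # 1. Draft a hypothetical answer passage with the LLM.
    draft = llm.invoke(
        f"Write a short passage that plausibly answers this question:\n{question}"
    ).content
    # 2. Search with the draft, which usually sits closer to answer-bearing
    #    chunks in embedding space than the question itself does.
    return vectorstore.similarity_search(draft, k=k)
```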
2. Embedding Models (Expert Level)
- OpenAI text-embedding-3 series
- Cohere embed-v3
- Voyage AI embeddings
- Open source: BGE, E5, GTE
- Multimodal: CLIP, Jina-CLIP
3. Vector Databases (Advanced Level)
- Pinecone (managed)
- Weaviate (hybrid search)
- Qdrant (high performance)
- Chroma (local dev)
- pgvector (PostgreSQL)
4. Retrieval Optimization (Expert Level)
- Hybrid search (dense + sparse)
- Reranking with cross-encoders
- Metadata filtering
- MMR (Maximum Marginal Relevance)
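MMR deserves a concrete example: instead of returning the k nearest chunks, which are often near-duplicates, it greedily selects results that balance relevance to the query against dissimilarity to what has already been picked. A minimal sketch using the MMR search that LangChain vector stores expose (the `vectorstore` object and the query string are placeholders):

```python
# Fetch a wide candidate pool, then diversify down to k results.
docs = vectorstore.max_marginal_relevance_search(
    "How do I rotate API keys?",  # hypothetical query
    k=5,              # chunks returned to the LLM
    fetch_k=20,       # candidate pool considered before diversification
    lambda_mult=0.5,  # 1.0 = pure relevance, 0.0 = maximum diversity
)
```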
Input Schema
```typescript
interface RAGSystemRequest {
  task_type: "design" | "implement" | "optimize" | "debug";
  data_characteristics: {
    document_types: string[];
    total_documents: number;
    avg_document_length: string;
    update_frequency: string;
  };
  requirements: {
    accuracy_priority: number;
    latency_target_ms: number;
    query_types: string[];
  };
  constraints?: {
    budget?: string;
    self_hosted?: boolean;
  };
}
```
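For concreteness, here is a hypothetical request matching this schema, written as a Python dict since the implementation patterns below are in Python (all values are illustrative only):

```python
# Hypothetical RAGSystemRequest payload; every value is illustrative.
example_request = {
    "task_type": "design",
    "data_characteristics": {
        "document_types": ["markdown", "pdf"],
        "total_documents": 12000,
        "avg_document_length": "3-5 pages",
        "update_frequency": "daily",
    },
    "requirements": {
        "accuracy_priority": 0.8,
        "latency_target_ms": 1500,
        "query_types": ["factual", "how-to"],
    },
    "constraints": {"budget": "moderate", "self_hosted": False},
}
```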
Output Schema
```typescript
interface RAGSystemResponse {
  architecture: {
    type: "naive" | "advanced" | "modular" | "agentic";
    diagram: string;
    components: Component[];
  };
  implementation: {
    ingestion_pipeline: string;
    retrieval_pipeline: string;
    generation_pipeline: string;
  };
  configuration: {
    embedding_model: string;
    vector_db: string;
    chunk_size: number;
    chunk_overlap: number;
    top_k: number;
  };
  evaluation: {
    metrics: string[];
    baseline_scores: object;
  };
}
```
Capabilities Matrix
| Capability | Level | Description |
|---|---|---|
| RAG Architecture Design | Expert | Design optimal retrieval pipelines |
| Chunking Strategies | Expert | Semantic, recursive, agentic chunking |
| Embedding Selection | Expert | Choose optimal embedding models |
| Vector DB Implementation | Advanced | Set up and optimize vector stores |
| Hybrid Search | Expert | Combine dense + sparse retrieval |
| Evaluation & Metrics | Advanced | RAGAS, custom evaluation |
Implementation Patterns
Pattern 1: Production RAG Pipeline
```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_anthropic import ChatAnthropic
from langchain_cohere import CohereRerank


class ProductionRAGPipeline:
    """Production-ready RAG with reranking and grounded prompting."""

    def __init__(self):
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large",
            dimensions=1536  # truncate from 3072 to cut storage and latency
        )
        self.vectorstore = PineconeVectorStore(
            index_name="production-index",
            embedding=self.embeddings
        )
        self.llm = ChatAnthropic(model="claude-sonnet-4-20250514")
        self.reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

    async def query(self, question: str, k: int = 10) -> str:
        # 1. Retrieve initial candidates
        docs = await self.vectorstore.asimilarity_search(
            question, k=k, filter={"status": "active"}
        )
        # 2. Rerank and keep the top_n most relevant documents
        reranked_docs = self.reranker.compress_documents(docs, question)
        # 3. Generate a response grounded in the retrieved context
        context = self._format_context(reranked_docs)
        response = await self.llm.ainvoke(
            self._build_prompt(question, context)
        )
        return response.content

    def _format_context(self, docs) -> str:
        # Number the chunks so the model can cite its sources
        return "\n\n".join(
            f"[{i + 1}] {doc.page_content}" for i, doc in enumerate(docs)
        )

    def _build_prompt(self, question: str, context: str) -> str:
        return (
            "Answer using only the provided context. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
```
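Usage is a single awaited call; a hypothetical invocation (it assumes the relevant Pinecone, OpenAI, Anthropic, and Cohere credentials are configured and the question is a placeholder):

```python
import asyncio

pipeline = ProductionRAGPipeline()
answer = asyncio.run(pipeline.query("What is our refund policy?"))
print(answer)
```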
Pattern 2: Advanced Chunking
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker


class AdaptiveChunker:
    """Choose a chunking strategy based on content type."""

    def __init__(self, embeddings):
        # Semantic chunking: split where embedding similarity drops sharply
        self.semantic_chunker = SemanticChunker(
            embeddings,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=95
        )
        # Recursive chunking: fall back through separators until chunks fit
        self.recursive_chunker = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ". ", " "]
        )

    def chunk(self, text: str, doc_type: str) -> list:
        if doc_type in ["research_paper", "legal_document"]:
            return self.semantic_chunker.split_text(text)
        return self.recursive_chunker.split_text(text)
```
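A hypothetical usage, pairing the chunker with the embeddings object from Pattern 1 (the file path and `doc_type` label are placeholders):

```python
from langchain_openai import OpenAIEmbeddings

chunker = AdaptiveChunker(OpenAIEmbeddings(model="text-embedding-3-large"))
chunks = chunker.chunk(open("contract.txt").read(), doc_type="legal_document")
```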
Pattern 3: Hybrid Search
```python
from pinecone_text.sparse import BM25Encoder

class HybridSearchRetriever:
    """Combine dense (embedding) and sparse (BM25) retrieval in one query."""

    def __init__(self, index, embeddings):
        self.index = index
        self.embeddings = embeddings
        self.bm25 = BM25Encoder.default()  # pre-fitted BM25 weights

    async def search(self, query: str, k: int = 10, alpha: float = 0.5):
        dense_vector = await self.embeddings.aembed_query(query)
        sparse_vector = self.bm25.encode_queries([query])[0]
        # Pinecone has no alpha parameter: apply the dense/sparse trade-off
        # client-side by scaling the two vectors before querying.
        dense_scaled = [v * alpha for v in dense_vector]
        sparse_scaled = {
            "indices": sparse_vector["indices"],
            "values": [v * (1 - alpha) for v in sparse_vector["values"]],
        }
        results = self.index.query(
            vector=dense_scaled,
            sparse_vector=sparse_scaled,
            top_k=k,
            include_metadata=True
        )
        return self._to_documents(results)

    def _to_documents(self, results) -> list:
        # Flatten Pinecone matches into plain dicts
        return [
            {"id": m.id, "score": m.score, "metadata": m.metadata or {}}
            for m in results.matches
        ]
```
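Note that the dense/sparse trade-off is applied on the client before the query: `alpha=1.0` reduces to pure dense retrieval and `alpha=0.0` to pure BM25, so sweeping alpha over a small labeled query set is a cheap way to tune the hybrid balance for a given corpus.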
Chunking Strategy Guide
Chunk Size Selection
| Content Type | Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 500-800 | 100 | Preserve code blocks |
| Legal documents | 1000-1500 | 200 | Keep clauses together |
| Q&A/FAQ | 200-400 | 50 | Atomic answers |
| Narrative text | 800-1200 | 150 | Preserve context |
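The table can be encoded directly as a lookup that parameterizes the recursive splitter from Pattern 2; a minimal sketch using the upper end of each range (the content-type keys are hypothetical labels):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# (chunk_size, chunk_overlap) per content type, mirroring the table above
CHUNK_PARAMS = {
    "technical_docs": (800, 100),
    "legal_document": (1500, 200),
    "faq": (400, 50),
    "narrative": (1200, 150),
}

def splitter_for(content_type: str) -> RecursiveCharacterTextSplitter:
    size, overlap = CHUNK_PARAMS.get(content_type, (1000, 200))  # default from Pattern 2
    return RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
```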
Chunking Decision Tree
Is content structured? (headers, sections)
├── Yes → Use document-aware splitter
│ ├── Has code? → Preserve code blocks
│ └── Has tables? → Keep tables intact
└── No → Use recursive splitter
├── Short content (<2000) → Single chunk
└── Long content → Chunk with overlap
Error Handling Patterns
Common Errors & Recovery
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| EmbeddingTimeout | Model latency | Retry with backoff, batch smaller |
| VectorDBConnectionError | DB unavailable | Failover to replica, cache |
| NoRelevantDocsFound | Poor retrieval | Fallback to web search |
| ChunkTooLarge | Exceeds token limit | Re-chunk with smaller size |
| ContextOverflow | Too many docs | Rerank and truncate |
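The first row's retry-with-backoff strategy is worth spelling out, since embedding calls are the most common transient failure. A minimal sketch with plain asyncio, assuming a LangChain-style embeddings object (the retry count and delays are illustrative):

```python
import asyncio
import random

async def embed_with_backoff(texts: list[str], embeddings, max_retries: int = 4):
    """Retry transient embedding failures with exponential backoff plus jitter."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            # Async batch-embedding call on LangChain embeddings objects
            return await embeddings.aembed_documents(texts)
        except Exception:  # in practice, narrow to the provider's timeout/rate-limit errors
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay + random.random())  # jitter avoids thundering herds
            delay *= 2
```

Splitting `texts` into smaller batches before calling covers the "batch smaller" half of the recovery strategy.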
Retrieval Fallback Chain
```python
import logging

logger = logging.getLogger(__name__)

async def retrieve_with_fallback(query: str) -> list:
    """Try progressively broader retrieval strategies until one returns documents."""
    strategies = [
        # Each entry returns an awaitable so the loop can treat them uniformly
        ("dense", lambda q: vectorstore.asimilarity_search(q, k=5)),
        ("hybrid", lambda q: hybrid_retriever.search(q, k=10)),
        ("hyde", lambda q: hyde_retriever.search(q)),
        ("web", lambda q: web_search_retriever.search(q)),
    ]
    for name, strategy in strategies:
        try:
            docs = await strategy(query)
            if docs:
                return docs
        except Exception as e:
            logger.warning(f"{name} retrieval failed: {e}")
            continue
    return []
```
Evaluation Metrics
RAGAS Metrics
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy


async def evaluate_rag_pipeline(questions, ground_truths) -> dict:
    results = []
    for q, gt in zip(questions, ground_truths):
        response = await rag_pipeline.query(q)
        contexts = await rag_pipeline.retrieve(q)  # assumes the pipeline exposes raw retrieval
        results.append({
            "question": q,
            "answer": response,
            "contexts": [c.page_content for c in contexts],
            "ground_truth": gt
        })
    return evaluate(Dataset.from_list(results), metrics=[faithfulness, answer_relevancy])
```
Performance Benchmarks
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Faithfulness | <0.5 | 0.5-0.7 | 0.7-0.85 | >0.85 |
| Context Precision | <0.4 | 0.4-0.6 | 0.6-0.8 | >0.8 |
| Latency (p99) | >5s | 2-5s | 1-2s | <1s |
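These thresholds can double as a CI quality gate. A minimal sketch, taking the floors from the "Acceptable" column (the metric names assume a RAGAS-style score dict, and the example scores are hypothetical):

```python
# Minimum acceptable scores, taken from the benchmark table above
THRESHOLDS = {"faithfulness": 0.5, "context_precision": 0.4}

def failing_metrics(scores: dict) -> list[str]:
    """Return the metrics that fall below their minimum acceptable score."""
    return [name for name, floor in THRESHOLDS.items() if scores.get(name, 0.0) < floor]

failures = failing_metrics({"faithfulness": 0.82, "context_precision": 0.61})
assert not failures, f"RAG evaluation below threshold: {failures}"
```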
Troubleshooting Guide
Decision Tree
Poor retrieval quality?
├── Check embedding model quality
├── Verify chunk size appropriate for content
├── Test with different k values
└── Enable hybrid search (BM25 + dense)
Wrong documents retrieved?
├── Add metadata filters
├── Implement query expansion
├── Use HyDE for better matching
└── Add reranking step
Hallucinations in output?
├── Increase context window
├── Add "only use provided context" instruction
├── Implement citation/source tracking
└── Lower temperature
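The "only use provided context" instruction from the hallucination branch usually takes the form of a grounded prompt template; a minimal sketch (the exact wording and the example chunk are illustrative, not prescriptive):

```python
GROUNDED_PROMPT = """Answer the question using only the context below.
If the context does not contain the answer, say "I don't know" rather than guessing.
Cite the chunk numbers you used, e.g. [1][3].

Context:
{context}

Question: {question}"""

prompt = GROUNDED_PROMPT.format(
    context="[1] Refunds are issued within 14 days of purchase.",  # hypothetical chunk
    question="What is the refund window?",
)
```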
Debug Checklist
- Print the retrieved chunks for a handful of known queries and confirm they actually contain the answer
- Verify chunk size and overlap are appropriate for the content type (see the chunking guide above)
- Check that metadata filters are not silently excluding relevant documents
- Compare dense-only, hybrid, and reranked retrieval on the same query set
- Confirm answers cite retrieved sources rather than relying on model memory
- Measure p50/p99 latency against the latency target
Token/Cost Optimization
Embedding Costs
| Model | Dimensions | Cost per 1M tokens |
|---|---|---|
| text-embedding-3-small | 512-1536 | $0.02 |
| text-embedding-3-large | 256-3072 | $0.13 |
| Cohere embed-v3 | 1024 | $0.10 |
| Voyage-3 | 1024 | $0.06 |
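Embedding spend is simply tokens times price, which makes budgeting straightforward; a minimal sketch using the prices from the table (the 50M-token corpus is a hypothetical example):

```python
# USD per 1M tokens, from the table above
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "cohere-embed-v3": 0.10,
    "voyage-3": 0.06,
}

def embedding_cost_usd(total_tokens: int, model: str) -> float:
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# e.g. embedding a 50M-token corpus with text-embedding-3-large ≈ $6.50
print(f"${embedding_cost_usd(50_000_000, 'text-embedding-3-large'):.2f}")
```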
Best Practices (2024-2025)
From LangChain
- Use a parent-document retriever so small chunks are matched but the surrounding parent document is returned as context
- Use multi-query retrieval to generate several query variants and merge their results for better recall
- Use self-query retrieval to translate natural-language constraints into metadata filters
From Production Deployments
- Always include source attribution
- Implement confidence scoring
- Cache embeddings for repeated content
- Use async for parallel processing
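The "cache embeddings for repeated content" practice can be as simple as keying on a content hash so unchanged chunks are never re-embedded; a minimal in-memory sketch (a production version would back this with Redis or a database table):

```python
import hashlib

class EmbeddingCache:
    """Return cached vectors for previously seen text, embedding only new content."""

    def __init__(self, embeddings):
        self.embeddings = embeddings
        self._cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embeddings.embed_query(text)
        return self._cache[key]
```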