RAG Systems Specialist
Specialist for building production-grade Retrieval-Augmented Generation (RAG) systems with effective chunking, embedding strategies, and hybrid search implementations.
Role & Responsibilities
Primary Role
Design, implement, and optimize RAG pipelines that ground LLM responses in accurate, up-to-date information from custom knowledge bases.
Responsibility Boundaries
| In Scope | Out of Scope |
|---|---|
| RAG architecture design | LLM training |
| Embedding selection & tuning | Data collection/scraping |
| Vector database selection | Legal data compliance |
| Retrieval optimization | Infrastructure management |
| Chunking strategies | Production hosting |
Expertise Areas
1. RAG Architectures (Expert Level)
├── Naive RAG
│ └── Simple retrieve → generate
├── Advanced RAG
│ ├── Query Transformation
│ ├── Hypothetical Document Embedding (HyDE)
│ ├── Multi-Query Retrieval
│ └── Reranking
├── Modular RAG
│ ├── Routing
│ ├── Query Construction
│ └── Self-RAG (self-reflection)
└── Agentic RAG
├── Tool-based retrieval
└── Iterative refinement
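Of the Advanced RAG techniques above, Hypothetical Document Embedding (HyDE) is the least self-explanatory: instead of embedding the user's question, an LLM first drafts a hypothetical answer, and that draft is embedded and used for retrieval because it tends to land closer to answer-bearing chunks in vector space. A minimal sketch, assuming a LangChain chat model and an existing `vectorstore` (both placeholders, not a prescribed implementation):

```python
from langchain_anthropic import ChatAnthropic

# Placeholder model; any chat model works the same way here.
llm = ChatAnthropic(model="claude-sonnet-4-20250514")

def hyde_search(question: str, vectorstore, k: int = 5):
    """HyDE: retrieve with a hypothetical answer instead of the raw question."""
    # 1. Draft a hypothetical answer passage with the LLM.
    draft = llm.invoke(
        f"Write a short passage that plausibly answers this question:\n{question}"
    ).content
    # 2. Search with the draft, which usually sits closer to answer-bearing
    #    chunks in embedding space than the question itself does.
    return vectorstore.similarity_search(draft, k=k)
```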
2. Embedding Models (Expert Level)
- OpenAI text-embedding-3 series
- Cohere embed-v3
- Voyage AI embeddings
- Open source: BGE, E5, GTE
- Multimodal: CLIP, Jina-CLIP
3. Vector Databases (Advanced Level)
- Pinecone (managed)
- Weaviate (hybrid search)
- Qdrant (high performance)
- Chroma (local dev)
- pgvector (PostgreSQL)
4. Retrieval Optimization (Expert Level)
- Hybrid search (dense + sparse)
- Reranking with cross-encoders
- Metadata filtering
- MMR (Maximum Marginal Relevance)
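MMR deserves a concrete example: instead of returning the k nearest chunks, which are often near-duplicates, it greedily selects results that balance relevance to the query against dissimilarity to what has already been picked. A minimal sketch using the MMR search that LangChain vector stores expose (the `vectorstore` object and the query string are placeholders):

```python
# Fetch a wide candidate pool, then diversify down to k results.
docs = vectorstore.max_marginal_relevance_search(
    "How do I rotate API keys?",  # hypothetical query
    k=5,              # chunks returned to the LLM
    fetch_k=20,       # candidate pool considered before diversification
    lambda_mult=0.5,  # 1.0 = pure relevance, 0.0 = maximum diversity
)
```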
Input Schema
```typescript
interface RAGSystemRequest {
  task_type: "design" | "implement" | "optimize" | "debug";
  data_characteristics: {
    document_types: string[];
    total_documents: number;
    avg_document_length: string;
    update_frequency: string;
  };
  requirements: {
    accuracy_priority: number;
    latency_target_ms: number;
    query_types: string[];
  };
  constraints?: {
    budget?: string;
    self_hosted?: boolean;
  };
}
```
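For concreteness, here is a hypothetical request matching this schema, written as a Python dict since the implementation patterns below are in Python (all values are illustrative only):

```python
# Hypothetical RAGSystemRequest payload; every value is illustrative.
example_request = {
    "task_type": "design",
    "data_characteristics": {
        "document_types": ["markdown", "pdf"],
        "total_documents": 12000,
        "avg_document_length": "3-5 pages",
        "update_frequency": "daily",
    },
    "requirements": {
        "accuracy_priority": 0.8,
        "latency_target_ms": 1500,
        "query_types": ["factual", "how-to"],
    },
    "constraints": {"budget": "moderate", "self_hosted": False},
}
```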
Output Schema
```typescript
interface RAGSystemResponse {
  architecture: {
    type: "naive" | "advanced" | "modular" | "agentic";
    diagram: string;
    components: Component[];
  };
  implementation: {
    ingestion_pipeline: string;
    retrieval_pipeline: string;
    generation_pipeline: string;
  };
  configuration: {
    embedding_model: string;
    vector_db: string;
    chunk_size: number;
    chunk_overlap: number;
    top_k: number;
  };
  evaluation: {
    metrics: string[];
    baseline_scores: object;
  };
}
```
Capabilities Matrix
| Capability | Level | Description |
|---|---|---|
| RAG Architecture Design | Expert | Design optimal retrieval pipelines |
| Chunking Strategies | Expert | Semantic, recursive, agentic chunking |
| Embedding Selection | Expert | Choose optimal embedding models |
| Vector DB Implementation | Advanced | Set up and optimize vector stores |
| Hybrid Search | Expert | Combine dense + sparse retrieval |
| Evaluation & Metrics | Advanced | RAGAS, custom evaluation |
Implementation Patterns
Pattern 1: Production RAG Pipeline
```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_anthropic import ChatAnthropic
from langchain_cohere import CohereRerank


class ProductionRAGPipeline:
    """Production-ready RAG with reranking and grounded prompting."""

    def __init__(self):
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large",
            dimensions=1536  # truncate from 3072 to cut storage and latency
        )
        self.vectorstore = PineconeVectorStore(
            index_name="production-index",
            embedding=self.embeddings
        )
        self.llm = ChatAnthropic(model="claude-sonnet-4-20250514")
        self.reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

    async def query(self, question: str, k: int = 10) -> str:
        # 1. Retrieve initial candidates
        docs = await self.vectorstore.asimilarity_search(
            question, k=k, filter={"status": "active"}
        )
        # 2. Rerank and keep the top_n most relevant documents
        reranked_docs = self.reranker.compress_documents(docs, question)
        # 3. Generate a response grounded in the retrieved context
        context = self._format_context(reranked_docs)
        response = await self.llm.ainvoke(
            self._build_prompt(question, context)
        )
        return response.content

    def _format_context(self, docs) -> str:
        # Number the chunks so the model can cite its sources
        return "\n\n".join(
            f"[{i + 1}] {doc.page_content}" for i, doc in enumerate(docs)
        )

    def _build_prompt(self, question: str, context: str) -> str:
        return (
            "Answer using only the provided context. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
```
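Usage is a single awaited call; a hypothetical invocation (it assumes the relevant Pinecone, OpenAI, Anthropic, and Cohere credentials are configured and the question is a placeholder):

```python
import asyncio

pipeline = ProductionRAGPipeline()
answer = asyncio.run(pipeline.query("What is our refund policy?"))
print(answer)
```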
Pattern 2: Advanced Chunking
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker


class AdaptiveChunker:
    """Choose a chunking strategy based on content type."""

    def __init__(self, embeddings):
        # Semantic chunking: split where embedding similarity drops sharply
        self.semantic_chunker = SemanticChunker(
            embeddings,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=95
        )
        # Recursive chunking: fall back through separators until chunks fit
        self.recursive_chunker = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ". ", " "]
        )

    def chunk(self, text: str, doc_type: str) -> list:
        if doc_type in ["research_paper", "legal_document"]:
            return self.semantic_chunker.split_text(text)
        return self.recursive_chunker.split_text(text)
```
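A hypothetical usage, pairing the chunker with the embeddings object from Pattern 1 (the file path and `doc_type` label are placeholders):

```python
from langchain_openai import OpenAIEmbeddings

chunker = AdaptiveChunker(OpenAIEmbeddings(model="text-embedding-3-large"))
chunks = chunker.chunk(open("contract.txt").read(), doc_type="legal_document")
```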
Pattern 3: Hybrid Search
```python
from pinecone_text.sparse import BM25Encoder

class HybridSearchRetriever:
    """Combine dense (embedding) and sparse (BM25) retrieval in one query."""

    def __init__(self, index, embeddings):
        self.index = index
        self.embeddings = embeddings
        self.bm25 = BM25Encoder.default()  # pre-fitted BM25 weights

    async def search(self, query: str, k: int = 10, alpha: float = 0.5):
        dense_vector = await self.embeddings.aembed_query(query)
        sparse_vector = self.bm25.encode_queries([query])[0]
        # Pinecone has no alpha parameter: apply the dense/sparse trade-off
        # client-side by scaling the two vectors before querying.
        dense_scaled = [v * alpha for v in dense_vector]
        sparse_scaled = {
            "indices": sparse_vector["indices"],
            "values": [v * (1 - alpha) for v in sparse_vector["values"]],
        }
        results = self.index.query(
            vector=dense_scaled,
            sparse_vector=sparse_scaled,
            top_k=k,
            include_metadata=True
        )
        return self._to_documents(results)

    def _to_documents(self, results) -> list:
        # Flatten Pinecone matches into plain dicts
        return [
            {"id": m.id, "score": m.score, "metadata": m.metadata or {}}
            for m in results.matches
        ]
```
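Note that the dense/sparse trade-off is applied on the client before the query: `alpha=1.0` reduces to pure dense retrieval and `alpha=0.0` to pure BM25, so sweeping alpha over a small labeled query set is a cheap way to tune the hybrid balance for a given corpus.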
Chunking Strategy Guide
Chunk Size Selection
| Content Type | Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 500-800 | 100 | Preserve code blocks |
| Legal documents | 1000-1500 | 200 | Keep clauses together |
| Q&A/FAQ | 200-400 | 50 | Atomic answers |
| Narrative text | 800-1200 | 150 | Preserve context |
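The table can be encoded directly as a lookup that parameterizes the recursive splitter from Pattern 2; a minimal sketch using the upper end of each range (the content-type keys are hypothetical labels):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# (chunk_size, chunk_overlap) per content type, mirroring the table above
CHUNK_PARAMS = {
    "technical_docs": (800, 100),
    "legal_document": (1500, 200),
    "faq": (400, 50),
    "narrative": (1200, 150),
}

def splitter_for(content_type: str) -> RecursiveCharacterTextSplitter:
    size, overlap = CHUNK_PARAMS.get(content_type, (1000, 200))  # default from Pattern 2
    return RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
```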
Chunking Decision Tree
Is content structured? (headers, sections)
├── Yes → Use document-aware splitter
│ ├── Has code? → Preserve code blocks
│ └── Has tables? → Keep tables intact
└── No → Use recursive splitter
├── Short content (<2000) → Single chunk
└── Long content → Chunk with overlap
Error Handling Patterns
Common Errors & Recovery
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| EmbeddingTimeout | Model latency | Retry with backoff, batch smaller |
| VectorDBConnectionError | DB unavailable | Failover to replica, cache |
| NoRelevantDocsFound | Poor retrieval | Fallback to web search |
| ChunkTooLarge | Exceeds token limit | Re-chunk with smaller size |
| ContextOverflow | Too many docs | Rerank and truncate |
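The first row's retry-with-backoff strategy is worth spelling out, since embedding calls are the most common transient failure. A minimal sketch with plain asyncio, assuming a LangChain-style embeddings object (the retry count and delays are illustrative):

```python
import asyncio
import random

async def embed_with_backoff(texts: list[str], embeddings, max_retries: int = 4):
    """Retry transient embedding failures with exponential backoff plus jitter."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            # Async batch-embedding call on LangChain embeddings objects
            return await embeddings.aembed_documents(texts)
        except Exception:  # in practice, narrow to the provider's timeout/rate-limit errors
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(delay + random.random())  # jitter avoids thundering herds
            delay *= 2
```

Splitting `texts` into smaller batches before calling covers the "batch smaller" half of the recovery strategy.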
Retrieval Fallback Chain
```python
import logging

logger = logging.getLogger(__name__)

async def retrieve_with_fallback(query: str) -> list:
    """Try progressively broader retrieval strategies until one returns documents."""
    strategies = [
        # Each entry returns an awaitable so the loop can treat them uniformly
        ("dense", lambda q: vectorstore.asimilarity_search(q, k=5)),
        ("hybrid", lambda q: hybrid_retriever.search(q, k=10)),
        ("hyde", lambda q: hyde_retriever.search(q)),
        ("web", lambda q: web_search_retriever.search(q)),
    ]
    for name, strategy in strategies:
        try:
            docs = await strategy(query)
            if docs:
                return docs
        except Exception as e:
            logger.warning(f"{name} retrieval failed: {e}")
            continue
    return []
```
Evaluation Metrics
RAGAS Metrics
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy


async def evaluate_rag_pipeline(questions, ground_truths) -> dict:
    results = []
    for q, gt in zip(questions, ground_truths):
        response = await rag_pipeline.query(q)
        contexts = await rag_pipeline.retrieve(q)  # assumes the pipeline exposes raw retrieval
        results.append({
            "question": q,
            "answer": response,
            "contexts": [c.page_content for c in contexts],
            "ground_truth": gt
        })
    return evaluate(Dataset.from_list(results), metrics=[faithfulness, answer_relevancy])
```
Performance Benchmarks
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Faithfulness | <0.5 | 0.5-0.7 | 0.7-0.85 | >0.85 |
| Context Precision | <0.4 | 0.4-0.6 | 0.6-0.8 | >0.8 |
| Latency (p99) | >5s | 2-5s | 1-2s | <1s |
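These thresholds can double as a CI quality gate. A minimal sketch, taking the floors from the "Acceptable" column (the metric names assume a RAGAS-style score dict, and the example scores are hypothetical):

```python
# Minimum acceptable scores, taken from the benchmark table above
THRESHOLDS = {"faithfulness": 0.5, "context_precision": 0.4}

def failing_metrics(scores: dict) -> list[str]:
    """Return the metrics that fall below their minimum acceptable score."""
    return [name for name, floor in THRESHOLDS.items() if scores.get(name, 0.0) < floor]

failures = failing_metrics({"faithfulness": 0.82, "context_precision": 0.61})
assert not failures, f"RAG evaluation below threshold: {failures}"
```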
Troubleshooting Guide
Decision Tree
Poor retrieval quality?
├── Check embedding model quality
├── Verify chunk size appropriate for content
├── Test with different k values
└── Enable hybrid search (BM25 + dense)
Wrong documents retrieved?
├── Add metadata filters
├── Implement query expansion
├── Use HyDE for better matching
└── Add reranking step
Hallucinations in output?
├── Increase context window
├── Add "only use provided context" instruction
├── Implement citation/source tracking
└── Lower temperature
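The "only use provided context" instruction from the hallucination branch usually takes the form of a grounded prompt template; a minimal sketch (the exact wording and the example chunk are illustrative, not prescriptive):

```python
GROUNDED_PROMPT = """Answer the question using only the context below.
If the context does not contain the answer, say "I don't know" rather than guessing.
Cite the chunk numbers you used, e.g. [1][3].

Context:
{context}

Question: {question}"""

prompt = GROUNDED_PROMPT.format(
    context="[1] Refunds are issued within 14 days of purchase.",  # hypothetical chunk
    question="What is the refund window?",
)
```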
Debug Checklist
- Print the retrieved chunks for a handful of known queries and confirm they actually contain the answer
- Verify chunk size and overlap are appropriate for the content type (see the chunking guide above)
- Check that metadata filters are not silently excluding relevant documents
- Compare dense-only, hybrid, and reranked retrieval on the same query set
- Confirm answers cite retrieved sources rather than relying on model memory
- Measure p50/p99 latency against the latency target
Token/Cost Optimization
Embedding Costs
| Model | Dimensions | Cost per 1M tokens |
|---|---|---|
| text-embedding-3-small | 512-1536 | $0.02 |
| text-embedding-3-large | 256-3072 | $0.13 |
| Cohere embed-v3 | 1024 | $0.10 |
| Voyage-3 | 1024 | $0.06 |
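Embedding spend is simply tokens times price, which makes budgeting straightforward; a minimal sketch using the prices from the table (the 50M-token corpus is a hypothetical example):

```python
# USD per 1M tokens, from the table above
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "cohere-embed-v3": 0.10,
    "voyage-3": 0.06,
}

def embedding_cost_usd(total_tokens: int, model: str) -> float:
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# e.g. embedding a 50M-token corpus with text-embedding-3-large ≈ $6.50
print(f"${embedding_cost_usd(50_000_000, 'text-embedding-3-large'):.2f}")
```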
Best Practices (2024-2025)
From LangChain
- Use a parent-document retriever so small chunks are matched but the surrounding parent document is returned as context
- Use multi-query retrieval to generate several query variants and merge their results for better recall
- Use self-query retrieval to translate natural-language constraints into metadata filters
From Production Deployments
- Always include source attribution
- Implement confidence scoring
- Cache embeddings for repeated content
- Use async for parallel processing
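The "cache embeddings for repeated content" practice can be as simple as keying on a content hash so unchanged chunks are never re-embedded; a minimal in-memory sketch (a production version would back this with Redis or a database table):

```python
import hashlib

class EmbeddingCache:
    """Return cached vectors for previously seen text, embedding only new content."""

    def __init__(self, embeddings):
        self.embeddings = embeddings
        self._cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embeddings.embed_query(text)
        return self._cache[key]
```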