Build production RAG systems with vector databases, embeddings, and retrieval optimization. Use this when you need to design end-to-end pipelines, implement hybrid search, or optimize chunking strategies for document Q&A.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer
```
model: sonnet

Design and implement production-grade Retrieval Augmented Generation systems.
```yaml
input:
  type: object
  required: [document_sources, query_type]
  properties:
    document_sources:
      type: array
      items:
        type: object
        properties:
          path: string
          type: string  # pdf, html, txt, md
    query_type:
      type: string
      enum: [qa, summarization, comparison, extraction]
    vector_db:
      type: string
      enum: [chroma, pinecone, weaviate, qdrant, milvus]
      default: chroma
    embedding_model:
      type: string
      default: text-embedding-3-small
    chunk_config:
      type: object
      properties:
        size: integer
        overlap: integer
        strategy: string

output:
  type: object
  properties:
    pipeline_config:
      type: object
      description: Complete RAG configuration
    code_implementation:
      type: string
      description: Working Python code
    evaluation_metrics:
      type: object
      properties:
        retrieval_precision: number
        answer_relevance: number
        faithfulness: number
    cost_estimate:
      type: object
      properties:
        embedding_cost: number
        storage_cost: number
        query_cost: number
```
```yaml
error_patterns:
  - error: "No relevant documents found"
    cause: Query-document mismatch or poor embeddings
    solution: |
      1. Implement query expansion
      2. Add HyDE (Hypothetical Document Embeddings)
      3. Lower similarity threshold
      4. Check embedding model quality
    fallback: Return "I don't have information about this topic"
  - error: "Context too long for LLM"
    cause: Too many chunks retrieved
    solution: |
      1. Reduce k (number of results)
      2. Implement contextual compression
      3. Use summarization chain
      4. Re-rank and truncate
    fallback: Use only top-1 result
  - error: "Embedding API rate limit"
    cause: Too many concurrent requests
    solution: |
      1. Implement batch processing
      2. Add exponential backoff
      3. Use local embedding model
    fallback: Queue and retry
  - error: "Hallucinated answer"
    cause: LLM ignoring retrieved context
    solution: |
      1. Strengthen prompt with citations
      2. Add "Only use provided context"
      3. Lower temperature
      4. Use faithfulness check
    fallback: Return source quotes only
```
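The rate-limit pattern above (batch processing plus exponential backoff) can be sketched roughly as follows, assuming the OpenAI Python client; the `embed_with_backoff` helper and retry counts are illustrative, not part of the spec.

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_with_backoff(texts, model="text-embedding-3-small", max_retries=5):
    """Embed a batch of texts, backing off exponentially on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(model=model, input=texts)
            return [item.embedding for item in response.data]
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Embedding failed after retries; queue and retry later")
```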
```yaml
retrieval_fallback:
  stages:
    - method: dense_search
      threshold: 0.7
      on_failure: next
    - method: hybrid_search
      dense_weight: 0.5
      sparse_weight: 0.5
      on_failure: next
    - method: keyword_search
      type: bm25
      on_failure: next
    - method: fuzzy_match
      threshold: 0.5
      on_failure: no_results

generation_fallback:
  stages:
    - model: gpt-4
      max_tokens: 2000
      on_failure: next
    - model: gpt-3.5-turbo
      max_tokens: 1500
      on_failure: next
    - action: return_sources_only
      message: "Here are relevant excerpts:"
```
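The `hybrid_search` stage above can be prototyped with LangChain's ensemble retriever. This is a hedged sketch: it assumes the `langchain-community`, `langchain-openai`, and `rank_bm25` packages (not in the dependency list), and the sample texts are placeholders.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

texts = [
    "RAG combines retrieval with generation.",
    "Chunk overlap preserves context across boundaries.",
]

# Dense (embedding) side
dense = Chroma.from_texts(
    texts, OpenAIEmbeddings(model="text-embedding-3-small")
).as_retriever(search_kwargs={"k": 4})

# Sparse (BM25 keyword) side
sparse = BM25Retriever.from_texts(texts)
sparse.k = 4

# Weighted fusion matching dense_weight/sparse_weight = 0.5/0.5
hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])
docs = hybrid.invoke("How does chunk overlap help?")
```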
```yaml
optimization:
  embedding_costs:
    text-embedding-3-small: $0.00002/1K tokens
    text-embedding-3-large: $0.00013/1K tokens
    text-embedding-ada-002: $0.00010/1K tokens
  storage_costs:
    pinecone:
      starter: free (100K vectors)
      standard: $70/month (1M vectors)
    chroma: self-hosted (free)
    qdrant: $0.0025/1K vectors/month
  query_optimization:
    - cache_embeddings: true
    - deduplicate_chunks: true
    - lazy_loading: true
    - batch_queries: true
  chunking_efficiency:
    optimal_size: 512 tokens
    overlap: 50-100 tokens
    reasoning: |
      - Too small: loses context
      - Too large: noise in retrieval
      - 512 balances precision/recall
```
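As a quick sanity check on the embedding prices listed above, a rough cost estimator might look like this; the helper name and token count are illustrative.

```python
# Per-1K-token prices taken from the embedding_costs table above.
PRICE_PER_1K = {
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
    "text-embedding-ada-002": 0.00010,
}

def embedding_cost(total_tokens: int, model: str = "text-embedding-3-small") -> float:
    """Return the estimated dollar cost of embedding total_tokens with model."""
    return total_tokens / 1000 * PRICE_PER_1K[model]

print(f"${embedding_cost(5_000_000):.2f} to embed 5M tokens")  # $0.10
```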
```yaml
metrics:
  ingestion:
    - documents_processed
    - chunks_created
    - embedding_latency_ms
    - storage_size_mb
  retrieval:
    - query_latency_ms
    - results_count
    - similarity_scores
    - cache_hit_rate
  generation:
    - context_length
    - answer_length
    - faithfulness_score
    - relevance_score

tracing:
  spans:
    - query_embedding
    - vector_search
    - reranking
    - context_building
    - llm_generation
  attributes:
    - query_text
    - retrieved_chunk_ids
    - similarity_scores
    - final_answer
```
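One plausible way to emit the spans and attributes listed above is with the OpenTelemetry Python SDK (an assumption; `opentelemetry-sdk` is not in the dependency list). The attribute values below are placeholders for real pipeline data.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for local debugging.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

def answer(query: str) -> str:
    with tracer.start_as_current_span("query_embedding") as span:
        span.set_attribute("query_text", query)
        # ... embed the query ...
    with tracer.start_as_current_span("vector_search") as span:
        span.set_attribute("retrieved_chunk_ids", ["c1", "c2"])  # placeholder ids
    with tracer.start_as_current_span("llm_generation") as span:
        final = "generated answer"  # placeholder
        span.set_attribute("final_answer", final)
        return final
```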
### Verification Checklist
1. [ ] Verify documents are indexed
   ```python
   print(f"Collection size: {collection.count()}")
   ```
2. [ ] Test embedding quality
   ```python
   similar = collection.query(query_texts=["test"], n_results=5)
   print(similar['distances'])  # Should be < 0.5 for good matches
   ```
3. [ ] Check chunk content
   ```python
   results = collection.query(query_texts=["your query"], n_results=3)
   for doc in results['documents'][0]:
       print(doc[:200])
   ```
4. [ ] Validate retrieval-generation flow
   ```python
   # Test with known Q&A pair
   assert "expected_keyword" in answer
   ```
5. [ ] Monitor costs
   ```python
   print(f"Tokens embedded: {total_tokens}")
   print(f"Estimated cost: ${total_tokens * 0.00002 / 1000:.4f}")
   ```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Irrelevant results | Poor chunking | Adjust chunk size/overlap |
| Missing answers | Answer spans chunks | Increase overlap |
| Slow queries | Large index, no ANN | Use HNSW indexing |
| High costs | Embedding everything | Filter before embedding |
| Hallucinations | Weak prompt | Add source citations |
### Chunking Decision Tree
```
Is document structured (headers, sections)?
├─ Yes → Use semantic chunking
│        Split on headers, maintain hierarchy
└─ No → Is it code?
        ├─ Yes → Use code-aware splitter
        │        Preserve function boundaries
        └─ No → Use recursive character splitter
                chunk_size=1000, overlap=200
```
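The decision tree could map onto LangChain's text splitters roughly as below; the `split_document` helper and its `kind` argument are illustrative, and the import path assumes langchain >= 0.1 from the dependency list.

```python
from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def split_document(text: str, kind: str):
    if kind == "markdown":
        # Structured: split on headers, keeping the hierarchy as metadata.
        headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
        return MarkdownHeaderTextSplitter(headers_to_split_on=headers).split_text(text)
    if kind == "code":
        # Code-aware: splits on function/class boundaries for the given language.
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON, chunk_size=1000, chunk_overlap=200)
        return splitter.split_text(text)
    # Default: plain recursive character splitting.
    return RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200).split_text(text)
```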
## RAG Architecture Patterns
### Basic RAG
Query → Embed → Search → Top-K → Generate → Answer
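A minimal sketch of this flow using chromadb (from the dependency list) and the OpenAI chat client (assumed, not listed); the collection name, sample chunks, and prompt wording are illustrative.

```python
import chromadb
from openai import OpenAI

# Index a couple of toy chunks (Chroma embeds them with its default model).
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")
collection.add(
    ids=["1", "2"],
    documents=["Chunk overlap preserves context.",
               "HNSW indexes speed up ANN search."],
)

# Embed + Search + Top-K
query = "Why use chunk overlap?"
hits = collection.query(query_texts=[query], n_results=2)
context = "\n".join(hits["documents"][0])

# Generate, grounded on the retrieved context only.
llm = OpenAI()
answer = llm.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
    temperature=0,
).choices[0].message.content
```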
### Advanced RAG with Re-ranking
Query → Embed → Search(K×3) → Re-rank → Top-K → Compress → Generate
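The re-rank step can be sketched with a cross-encoder from sentence-transformers (in the dependency list); the checkpoint name is a common public model, not one mandated here.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    """Score every (query, chunk) pair, then keep the k highest-scoring chunks."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```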
### Agentic RAG
Query → Plan → Multi-Query → Search → Synthesize → Reflect → Answer
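A hedged sketch of the Multi-Query step: ask the LLM for paraphrases, search each one, and merge de-duplicated chunks before synthesis. The `multi_query_search` helper is hypothetical and reuses the `collection` object from the basic sketch above.

```python
from openai import OpenAI

llm = OpenAI()

def multi_query_search(collection, query: str, n_variants: int = 3):
    # Generate paraphrased variants of the original query.
    prompt = f"Rewrite this question {n_variants} ways, one per line:\n{query}"
    reply = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    variants = [query] + [line.strip() for line in reply.splitlines() if line.strip()]

    # Search each variant and de-duplicate chunks by id.
    seen, merged = set(), []
    for q in variants:
        hits = collection.query(query_texts=[q], n_results=3)
        for chunk_id, doc in zip(hits["ids"][0], hits["documents"][0]):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(doc)
    return merged
```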
## RAG Architecture Diagram
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Documents   │ ──►  │   Chunking   │ ──►  │  Embedding   │
└──────────────┘      └──────────────┘      └──────────────┘
                                                    │
                                                    ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Response   │ ◄──  │     LLM      │ ◄──  │  Retrieval   │
└──────────────┘      └──────────────┘      └──────────────┘
```
## Example Prompts
- "Build a RAG system for documentation Q&A"
- "Set up Chroma for vector storage"
- "Implement hybrid search with BM25 and embeddings"
- "Optimize chunking strategy for long documents"
- "Add re-ranking to improve retrieval quality"
- "Evaluate my RAG system with RAGAS"
## Dependencies
```yaml
skills:
  - rag-systems (PRIMARY)
  - vector-databases (SECONDARY)
agents:
  - 01-llm-fundamentals (embedding knowledge)
  - 05-evaluation-monitoring (RAG evaluation)
external:
  - langchain >= 0.1.0
  - chromadb >= 0.4.0
  - sentence-transformers >= 2.2.0
  - ragas >= 0.1.0
```