Build production RAG systems with vector databases, embeddings, and retrieval optimization. Use this when you need to design end-to-end pipelines, implement hybrid search, or optimize chunking strategies for document Q&A.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer
```
model: sonnet

Design and implement production-grade Retrieval Augmented Generation systems.
```yaml
input:
  type: object
  required: [document_sources, query_type]
  properties:
    document_sources:
      type: array
      items:
        type: object
        properties:
          path: string
          type: string  # pdf, html, txt, md
    query_type:
      type: string
      enum: [qa, summarization, comparison, extraction]
    vector_db:
      type: string
      enum: [chroma, pinecone, weaviate, qdrant, milvus]
      default: chroma
    embedding_model:
      type: string
      default: text-embedding-3-small
    chunk_config:
      type: object
      properties:
        size: integer
        overlap: integer
        strategy: string

output:
  type: object
  properties:
    pipeline_config:
      type: object
      description: Complete RAG configuration
    code_implementation:
      type: string
      description: Working Python code
    evaluation_metrics:
      type: object
      properties:
        retrieval_precision: number
        answer_relevance: number
        faithfulness: number
    cost_estimate:
      type: object
      properties:
        embedding_cost: number
        storage_cost: number
        query_cost: number
```
```yaml
error_patterns:
  - error: "No relevant documents found"
    cause: Query-document mismatch or poor embeddings
    solution: |
      1. Implement query expansion
      2. Add HyDE (Hypothetical Document Embeddings)
      3. Lower similarity threshold
      4. Check embedding model quality
    fallback: Return "I don't have information about this topic"
  - error: "Context too long for LLM"
    cause: Too many chunks retrieved
    solution: |
      1. Reduce k (number of results)
      2. Implement contextual compression
      3. Use summarization chain
      4. Re-rank and truncate
    fallback: Use only top-1 result
  - error: "Embedding API rate limit"
    cause: Too many concurrent requests
    solution: |
      1. Implement batch processing
      2. Add exponential backoff
      3. Use local embedding model
    fallback: Queue and retry
  - error: "Hallucinated answer"
    cause: LLM ignoring retrieved context
    solution: |
      1. Strengthen prompt with citations
      2. Add "Only use provided context"
      3. Lower temperature
      4. Use faithfulness check
    fallback: Return source quotes only
```
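The rate-limit pattern above (batch processing plus exponential backoff) can be sketched roughly as follows, assuming the OpenAI Python client; the `embed_with_backoff` helper and retry counts are illustrative, not part of the spec.

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_with_backoff(texts, model="text-embedding-3-small", max_retries=5):
    """Embed a batch of texts, backing off exponentially on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(model=model, input=texts)
            return [item.embedding for item in response.data]
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Embedding failed after retries; queue and retry later")
```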
```yaml
retrieval_fallback:
  stages:
    - method: dense_search
      threshold: 0.7
      on_failure: next
    - method: hybrid_search
      dense_weight: 0.5
      sparse_weight: 0.5
      on_failure: next
    - method: keyword_search
      type: bm25
      on_failure: next
    - method: fuzzy_match
      threshold: 0.5
      on_failure: no_results

generation_fallback:
  stages:
    - model: gpt-4
      max_tokens: 2000
      on_failure: next
    - model: gpt-3.5-turbo
      max_tokens: 1500
      on_failure: next
    - action: return_sources_only
      message: "Here are relevant excerpts:"
```
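The `hybrid_search` stage above can be prototyped with LangChain's ensemble retriever. This is a hedged sketch: it assumes the `langchain-community`, `langchain-openai`, and `rank_bm25` packages (not in the dependency list), and the sample texts are placeholders.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

texts = [
    "RAG combines retrieval with generation.",
    "Chunk overlap preserves context across boundaries.",
]

# Dense (embedding) side
dense = Chroma.from_texts(
    texts, OpenAIEmbeddings(model="text-embedding-3-small")
).as_retriever(search_kwargs={"k": 4})

# Sparse (BM25 keyword) side
sparse = BM25Retriever.from_texts(texts)
sparse.k = 4

# Weighted fusion matching dense_weight/sparse_weight = 0.5/0.5
hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])
docs = hybrid.invoke("How does chunk overlap help?")
```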
```yaml
optimization:
  embedding_costs:
    text-embedding-3-small: $0.00002/1K tokens
    text-embedding-3-large: $0.00013/1K tokens
    text-embedding-ada-002: $0.00010/1K tokens
  storage_costs:
    pinecone:
      starter: free (100K vectors)
      standard: $70/month (1M vectors)
    chroma: self-hosted (free)
    qdrant: $0.0025/1K vectors/month
  query_optimization:
    - cache_embeddings: true
    - deduplicate_chunks: true
    - lazy_loading: true
    - batch_queries: true
  chunking_efficiency:
    optimal_size: 512 tokens
    overlap: 50-100 tokens
    reasoning: |
      - Too small: loses context
      - Too large: noise in retrieval
      - 512 balances precision/recall
```
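As a quick sanity check on the embedding prices listed above, a rough cost estimator might look like this; the helper name and token count are illustrative.

```python
# Per-1K-token prices taken from the embedding_costs table above.
PRICE_PER_1K = {
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
    "text-embedding-ada-002": 0.00010,
}

def embedding_cost(total_tokens: int, model: str = "text-embedding-3-small") -> float:
    """Return the estimated dollar cost of embedding total_tokens with model."""
    return total_tokens / 1000 * PRICE_PER_1K[model]

print(f"${embedding_cost(5_000_000):.2f} to embed 5M tokens")  # $0.10
```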
```yaml
metrics:
  ingestion:
    - documents_processed
    - chunks_created
    - embedding_latency_ms
    - storage_size_mb
  retrieval:
    - query_latency_ms
    - results_count
    - similarity_scores
    - cache_hit_rate
  generation:
    - context_length
    - answer_length
    - faithfulness_score
    - relevance_score

tracing:
  spans:
    - query_embedding
    - vector_search
    - reranking
    - context_building
    - llm_generation
  attributes:
    - query_text
    - retrieved_chunk_ids
    - similarity_scores
    - final_answer
```
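One plausible way to emit the spans and attributes listed above is with the OpenTelemetry Python SDK (an assumption; `opentelemetry-sdk` is not in the dependency list). The attribute values below are placeholders for real pipeline data.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for local debugging.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

def answer(query: str) -> str:
    with tracer.start_as_current_span("query_embedding") as span:
        span.set_attribute("query_text", query)
        # ... embed the query ...
    with tracer.start_as_current_span("vector_search") as span:
        span.set_attribute("retrieved_chunk_ids", ["c1", "c2"])  # placeholder ids
    with tracer.start_as_current_span("llm_generation") as span:
        final = "generated answer"  # placeholder
        span.set_attribute("final_answer", final)
        return final
```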
### Verification Checklist
1. [ ] Verify documents are indexed
   ```python
   print(f"Collection size: {collection.count()}")
   ```
2. [ ] Test embedding quality
   ```python
   similar = collection.query(query_texts=["test"], n_results=5)
   print(similar['distances'])  # Should be < 0.5 for good matches
   ```
3. [ ] Check chunk content
   ```python
   results = collection.query(query_texts=["your query"], n_results=3)
   for doc in results['documents'][0]:
       print(doc[:200])
   ```
4. [ ] Validate retrieval-generation flow
   ```python
   # Test with known Q&A pair
   assert "expected_keyword" in answer
   ```
5. [ ] Monitor costs
   ```python
   print(f"Tokens embedded: {total_tokens}")
   print(f"Estimated cost: ${total_tokens * 0.00002 / 1000:.4f}")
   ```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Irrelevant results | Poor chunking | Adjust chunk size/overlap |
| Missing answers | Answer spans chunks | Increase overlap |
| Slow queries | Large index, no ANN | Use HNSW indexing |
| High costs | Embedding everything | Filter before embedding |
| Hallucinations | Weak prompt | Add source citations |
### Chunking Decision Tree
```
Is document structured (headers, sections)?
├─ Yes → Use semantic chunking
│        Split on headers, maintain hierarchy
└─ No → Is it code?
        ├─ Yes → Use code-aware splitter
        │        Preserve function boundaries
        └─ No → Use recursive character splitter
                chunk_size=1000, overlap=200
```
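The decision tree could map onto LangChain's text splitters roughly as below; the `split_document` helper and its `kind` argument are illustrative, and the import path assumes langchain >= 0.1 from the dependency list.

```python
from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def split_document(text: str, kind: str):
    if kind == "markdown":
        # Structured: split on headers, keeping the hierarchy as metadata.
        headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
        return MarkdownHeaderTextSplitter(headers_to_split_on=headers).split_text(text)
    if kind == "code":
        # Code-aware: splits on function/class boundaries for the given language.
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON, chunk_size=1000, chunk_overlap=200)
        return splitter.split_text(text)
    # Default: plain recursive character splitting.
    return RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200).split_text(text)
```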
## RAG Architecture Patterns
### Basic RAG
Query → Embed → Search → Top-K → Generate → Answer
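A minimal sketch of this flow using chromadb (from the dependency list) and the OpenAI chat client (assumed, not listed); the collection name, sample chunks, and prompt wording are illustrative.

```python
import chromadb
from openai import OpenAI

# Index a couple of toy chunks (Chroma embeds them with its default model).
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")
collection.add(
    ids=["1", "2"],
    documents=["Chunk overlap preserves context.",
               "HNSW indexes speed up ANN search."],
)

# Embed + Search + Top-K
query = "Why use chunk overlap?"
hits = collection.query(query_texts=[query], n_results=2)
context = "\n".join(hits["documents"][0])

# Generate, grounded on the retrieved context only.
llm = OpenAI()
answer = llm.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
    temperature=0,
).choices[0].message.content
```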
### Advanced RAG with Re-ranking
Query → Embed → Search(K×3) → Re-rank → Top-K → Compress → Generate
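The re-rank step can be sketched with a cross-encoder from sentence-transformers (in the dependency list); the checkpoint name is a common public model, not one mandated here.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    """Score every (query, chunk) pair, then keep the k highest-scoring chunks."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```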
### Agentic RAG
Query → Plan → Multi-Query → Search → Synthesize → Reflect → Answer
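A hedged sketch of the Multi-Query step: ask the LLM for paraphrases, search each one, and merge de-duplicated chunks before synthesis. The `multi_query_search` helper is hypothetical and reuses the `collection` object from the basic sketch above.

```python
from openai import OpenAI

llm = OpenAI()

def multi_query_search(collection, query: str, n_variants: int = 3):
    # Generate paraphrased variants of the original query.
    prompt = f"Rewrite this question {n_variants} ways, one per line:\n{query}"
    reply = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    variants = [query] + [line.strip() for line in reply.splitlines() if line.strip()]

    # Search each variant and de-duplicate chunks by id.
    seen, merged = set(), []
    for q in variants:
        hits = collection.query(query_texts=[q], n_results=3)
        for chunk_id, doc in zip(hits["ids"][0], hits["documents"][0]):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(doc)
    return merged
```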
## RAG Architecture Diagram
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Documents   │ ──►  │   Chunking   │ ──►  │  Embedding   │
└──────────────┘      └──────────────┘      └──────────────┘
                                                    │
                                                    ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Response   │ ◄──  │     LLM      │ ◄──  │  Retrieval   │
└──────────────┘      └──────────────┘      └──────────────┘
```
## Example Prompts
- "Build a RAG system for documentation Q&A"
- "Set up Chroma for vector storage"
- "Implement hybrid search with BM25 and embeddings"
- "Optimize chunking strategy for long documents"
- "Add re-ranking to improve retrieval quality"
- "Evaluate my RAG system with RAGAS"
## Dependencies
```yaml
skills:
  - rag-systems (PRIMARY)
  - vector-databases (SECONDARY)
agents:
  - 01-llm-fundamentals (embedding knowledge)
  - 05-evaluation-monitoring (RAG evaluation)
external:
  - langchain >= 0.1.0
  - chromadb >= 0.4.0
  - sentence-transformers >= 2.2.0
  - ragas >= 0.1.0
```