Build Retrieval-Augmented Generation (RAG) systems for AI applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
Installation:
/plugin marketplace add giuseppe-trisciuoglio/developer-kit
/plugin install developer-kit@giuseppe.trisciuoglio

Bundled resources:
assets/retriever-pipeline.java
assets/vector-store-config.yaml
references/document-chunking.md
references/embedding-models.md
references/langchain4j-rag-guide.md
references/retrieval-strategies.md
references/vector-databases.md

Build Retrieval-Augmented Generation systems that extend AI capabilities with external knowledge sources.
RAG (Retrieval-Augmented Generation) enhances AI applications by retrieving relevant information from knowledge bases and incorporating it into AI responses, reducing hallucinations and providing accurate, grounded answers.
Use this skill when:
- Implementing knowledge-grounded AI applications
- Building document Q&A systems
- Integrating LLMs with external knowledge bases
Store and efficiently retrieve document embeddings for semantic search.
Key Options:
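Whatever store is chosen, the storage and lookup pattern is the same. Below is a minimal sketch using LangChain4j's in-memory store; embeddingModel is assumed to be any EmbeddingModel configured elsewhere, and the text and thresholds are illustrative.

// Store a segment with its embedding, then search the store by vector similarity
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

TextSegment segment = TextSegment.from("RAG grounds answers in retrieved documents.");
store.add(embeddingModel.embed(segment).content(), segment);

// Embed the query and ask the store for the most similar segments
Embedding queryEmbedding = embeddingModel.embed("How does RAG reduce hallucinations?").content();
EmbeddingSearchResult<TextSegment> result = store.search(EmbeddingSearchRequest.builder()
        .queryEmbedding(queryEmbedding)
        .maxResults(3)
        .minScore(0.6)
        .build());
result.matches().forEach(match ->
        System.out.println(match.score() + " -> " + match.embedded().text()));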
Convert text to numerical vectors for similarity search.
Popular Models:
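Regardless of the model chosen, the workflow is the same: embed text, then compare vectors. A small sketch assuming LangChain4j's OpenAI integration (the API key and sample texts are placeholders):

// Embed two texts and compare them with cosine similarity
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
        .apiKey("your-api-key")
        .build();

Embedding first = embeddingModel.embed("How do I reset my password?").content();
Embedding second = embeddingModel.embed("Steps to recover account access").content();

// Values close to 1.0 indicate semantically similar text
double similarity = CosineSimilarity.between(first, second);
System.out.println("Cosine similarity: " + similarity);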
Find relevant content based on user queries.
Approaches:
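Whichever approach is used, LangChain4j exposes retrieval through the ContentRetriever abstraction. A sketch, assuming the embeddingStore and embeddingModel from the sections above and an illustrative query:

// Wrap the embedding store in a retriever and run a semantic query
ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
        .embeddingStore(embeddingStore)
        .embeddingModel(embeddingModel)
        .maxResults(5)
        .minScore(0.7)
        .build();

// Each Content wraps a retrieved text segment relevant to the query
List<Content> results = retriever.retrieve(Query.from("How do I rotate API keys?"));
results.forEach(content -> System.out.println(content.textSegment().text()));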
// Load documents from file system
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
// Create embedding store
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
// Ingest documents into the store
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
// Create AI service with RAG capability
Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
        .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
        .build();
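Assistant in the snippet above is not a framework class but an interface the application defines; AiServices generates the implementation. A minimal sketch, with chatModel being any configured chat model:

// The service interface proxied by AiServices; the method shape is up to the application
interface Assistant {
    String chat(String userMessage);
}

// Relevant segments are retrieved and injected into the prompt automatically
String answer = assistant.chat("What does the refund policy say about digital goods?");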
// Split documents into chunks using LangChain4j's recursive splitter
DocumentSplitter splitter = DocumentSplitters.recursive(
        500,  // max chunk size in characters
        100   // overlap between chunks in characters
);
// Create embedding model
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey("your-api-key")
.build();
// Create embedding store backed by PostgreSQL with the pgvector extension
EmbeddingStore<TextSegment> embeddingStore = PgVectorEmbeddingStore.builder()
        .host("localhost")
        .port(5432)
        .database("postgres")
        .user("postgres")
        .password("password")
        .table("embeddings")
        .dimension(1536)  // must match the embedding model's output dimension
        .build();
// Process and store documents
for (Document document : documents) {
    List<TextSegment> segments = splitter.split(document);
    for (TextSegment segment : segments) {
        Embedding embedding = embeddingModel.embed(segment).content();
        embeddingStore.add(embedding, segment);
    }
}
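The manual loop above can also be expressed with LangChain4j's EmbeddingStoreIngestor, which wires the splitter, embedding model, and store together in one pass:

// Equivalent ingestion: split, embed, and store every document
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(splitter)
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build();

ingestor.ingest(documents);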
Create a basic Q&A system over your documents.
public interface DocumentAssistant {
    String answer(String question);
}
DocumentAssistant assistant = AiServices.builder(DocumentAssistant.class)
        .chatModel(chatModel)
        .contentRetriever(retriever)
        .build();
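Here retriever is any ContentRetriever, such as the filtered EmbeddingStoreContentRetriever built in the next section. Calling the proxied interface then looks like an ordinary method call (the question is illustrative):

// The retriever injects matching segments into the prompt before the model answers
String answer = assistant.answer("What warranty applies to refurbished devices?");
System.out.println(answer);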
Filter results based on document metadata.
// Add metadata during document loading
Metadata metadata = new Metadata();
metadata.put("source", "technical-manual.pdf");
metadata.put("category", "technical");
metadata.put("date", "2024-01-15");
Document document = Document.from("Content here", metadata);
// Filter during retrieval (metadataKey(...) is statically imported from MetadataFilterBuilder)
EmbeddingStoreContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
        .embeddingStore(embeddingStore)
        .embeddingModel(embeddingModel)
        .maxResults(5)
        .minScore(0.7)
        .filter(metadataKey("category").isEqualTo("technical"))
        .build();
Combine results from multiple knowledge sources.
ContentRetriever webRetriever = EmbeddingStoreContentRetriever.from(webStore);
ContentRetriever documentRetriever = EmbeddingStoreContentRetriever.from(documentStore);
ContentRetriever databaseRetriever = EmbeddingStoreContentRetriever.from(databaseStore);
// Retrieve from every source for the same query
Query query = Query.from(userQuestion);  // userQuestion: the raw question text
List<Content> allResults = new ArrayList<>();
allResults.addAll(webRetriever.retrieve(query));
allResults.addAll(documentRetriever.retrieve(query));
allResults.addAll(databaseRetriever.retrieve(query));

// Rerank the combined results; `reranker` is a placeholder for whatever reranking
// component is used (e.g. one backed by a scoring model), not a concrete LangChain4j class
List<Content> rerankedResults = reranker.reorder(query, allResults);
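The same fan-out can also be handled inside the AI service with a retrieval augmentor. A sketch using LangChain4j's DefaultRetrievalAugmentor and DefaultQueryRouter, assuming chatModel and the three retrievers are configured as above:

// Route every query to all three retrievers and let the augmentor merge the results
RetrievalAugmentor retrievalAugmentor = DefaultRetrievalAugmentor.builder()
        .queryRouter(new DefaultQueryRouter(webRetriever, documentRetriever, databaseRetriever))
        .build();

Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .retrievalAugmentor(retrievalAugmentor)
        .build();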
Problem: Retrieved documents don't match user queries. Solutions:
Problem: Retrieved documents contain relevant information but are not specific enough. Solutions:
Problem: Slow response times during retrieval. Solutions:
Problem: AI generates information not present in retrieved documents. Solutions:
assets/vector-store-config.yaml - Configuration templates for different vector stores
assets/retriever-pipeline.java - Complete RAG pipeline implementation
assets/evaluation-metrics.java - Evaluation framework code