From ai-engineer
Design a RAG pipeline — chunking strategy, embedding selection, retrieval configuration, and end-to-end evaluation.
npx claudepluginhub hpsgd/turtlestack --plugin ai-engineer
Design a RAG pipeline for $ARGUMENTS.
Understand the content before designing the pipeline. The corpus dictates every downstream decision.
| Property | Question | Impact on pipeline |
|---|---|---|
| Document types | What formats? (PDF, HTML, Markdown, plain text, code) | Determines parsing and extraction strategy |
| Volume | How many documents? Total size? | Storage, indexing time, cost projections |
| Average length | Typical document size in tokens? | Chunk size and overlap decisions |
| Update frequency | Static, daily, hourly, real-time? | Index rebuild strategy, freshness requirements |
| Content structure | Headings, sections, tables, code blocks? | Semantic chunking boundaries |
| Language | Single or multilingual? | Embedding model selection |
| Quality | Clean text or noisy (OCR, scraped HTML)? | Preprocessing pipeline requirements |
Examine a representative sample of 10-20 documents. Do not design a pipeline for content you have not read.
Chunking is the highest-impact decision in the pipeline. Badly sized chunks degrade retrieval no matter how well everything downstream is tuned.
| Strategy | Chunk size | Overlap | When to use |
|---|---|---|---|
| Fixed-size | 256 tokens | 10% | Short, uniform documents. Simple but lossy at boundaries |
| Fixed-size | 512 tokens | 10-20% | General purpose default. Start here |
| Fixed-size | 1024 tokens | 20% | Long-form content where context within a chunk matters |
| Semantic | Variable | None | Structured documents with clear headings and sections |
| Paragraph-based | Variable | None | Well-formatted text with meaningful paragraph breaks |
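The fixed-size strategies in the table can be sketched as a single function. This is a minimal illustration operating on a pre-tokenized list; in practice the token list would come from your tokenizer, and the function name and defaults here are assumptions, not a prescribed API.

```python
def chunk_fixed(tokens, chunk_size=512, overlap_pct=0.15):
    """Split a token list into fixed-size chunks with fractional overlap.

    The stride between chunk starts is chunk_size * (1 - overlap_pct),
    so consecutive chunks share roughly overlap_pct of their tokens.
    """
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks
```

With 512-token chunks at 15% overlap, a 1,000-token document yields three chunks, the last one shorter than the rest.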
Chunking rules:
- Never split mid-sentence; align boundaries with sentence, paragraph, or section breaks.
- Keep tables and code blocks intact within a single chunk wherever possible.
- Carry the nearest section heading into each chunk's metadata.
Chunk metadata schema (MANDATORY):
| Field | Type | Purpose |
|---|---|---|
| source_id | string | Reference to the original document |
| source_name | string | Human-readable document name |
| chunk_index | integer | Position within the source document |
| section_heading | string | Nearest heading above this chunk |
| created_at | datetime | When the chunk was created |
| content_hash | string | Hash of chunk content — detect changes on re-index |
Metadata enables filtered retrieval — searching within a category, date range, or document type rather than the full corpus.
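A sketch of the mandatory schema as a small dataclass, assuming SHA-256 for the content hash and UTC timestamps; the helper name and hash choice are illustrative assumptions, not requirements.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ChunkMetadata:
    source_id: str         # reference to the original document
    source_name: str       # human-readable document name
    chunk_index: int       # position within the source document
    section_heading: str   # nearest heading above this chunk
    created_at: str        # ISO-8601 timestamp of chunk creation
    content_hash: str      # hash of chunk content, for change detection

def make_metadata(source_id, source_name, chunk_index, section_heading, text):
    """Build metadata for one chunk; hash the text so re-indexing
    can detect unchanged chunks and skip re-embedding them."""
    return ChunkMetadata(
        source_id=source_id,
        source_name=source_name,
        chunk_index=chunk_index,
        section_heading=section_heading,
        created_at=datetime.now(timezone.utc).isoformat(),
        content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )
```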
Define the metadata schema based on corpus analysis:
| Field | Type | Values | Enables |
|---|---|---|---|
| category | string | [domain-specific categories] | Filtered search within a topic |
| document_type | string | policy, guide, reference, FAQ | Type-specific retrieval |
| date | date | Document publication or effective date | Recency-weighted retrieval |
| audience | string | internal, customer, developer | Audience-appropriate results |
| [custom] | [type] | [domain-specific values] | [domain-specific filtering] |
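Filtered retrieval can be as simple as a metadata predicate applied before (or alongside) vector search. A minimal in-memory sketch, assuming chunks are dicts with a `"metadata"` key; real vector stores expose equivalent filter arguments.

```python
def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every given key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]
```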
Rules:
Evaluate embedding models on YOUR data. Benchmark rankings do not predict performance on your domain.
Evaluation method:
- Assemble 20 representative queries with known relevant chunks from your corpus.
- Embed a corpus sample with each candidate model and run all 20 queries against each index.
- Compare retrieval precision across models and select based on your data, not benchmark rank.
| Criterion | Question | Trade-off |
|---|---|---|
| Domain fit | Does the model understand your domain vocabulary? | General models miss domain-specific semantics |
| Dimensionality | 384 / 768 / 1024 / 1536 dimensions? | Higher = more nuance, more storage, slower search |
| Max input tokens | Does it handle your chunk sizes? | Chunks exceeding the limit are silently truncated |
| Cost | Per-token embedding cost × corpus size × rebuild frequency | Adds up fast on large, frequently updated corpora |
| Speed | Embedding throughput (tokens/second) | Matters for initial indexing and real-time updates |
Do not select an embedding model without running the 20-query evaluation. An embedding model that scores well on academic benchmarks but misses your domain terminology will return irrelevant results.
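One simple score for the 20-query comparison is hit rate: the fraction of queries for which at least one known-relevant chunk lands in the top-K. A sketch assuming you have a `retrieve(query)` callable per candidate model returning ranked chunk IDs; the function name and signature are illustrative.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=5):
    """Fraction of queries with at least one relevant chunk in the top-k.

    labeled_queries: list of (query, set_of_relevant_chunk_ids) pairs.
    retrieve: callable mapping a query string to ranked chunk ids.
    """
    hits = 0
    for query, relevant_ids in labeled_queries:
        top = retrieve(query)[:k]
        if any(cid in relevant_ids for cid in top):
            hits += 1
    return hits / len(labeled_queries)
```

Run this once per candidate embedding model over the same 20 labeled queries and compare the scores directly.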
| Strategy | How it works | When to use |
|---|---|---|
| Similarity search | Cosine similarity between query and chunk embeddings | Default starting point. Works when queries match document language |
| Hybrid search | Similarity + keyword (BM25) combined with weighted scoring | When users search with exact terms (product names, codes, error messages) |
| Re-ranking | Retrieve top-N candidates, re-rank with a cross-encoder model | When initial retrieval precision is insufficient. Adds latency |
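Hybrid search needs the two score scales reconciled before they are combined. A minimal sketch using min-max normalization and a weighted sum; the `alpha` weight and dict-based interface are assumptions for illustration, and production systems often use reciprocal rank fusion instead.

```python
def hybrid_scores(dense, sparse, alpha=0.7):
    """Combine similarity (dense) and BM25 (sparse) scores per chunk id.

    Each input maps chunk_id -> raw score. Scores are min-max normalized
    first so the cosine and BM25 scales are comparable, then merged as
    alpha * dense + (1 - alpha) * sparse.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {k: (v - lo) / span for k, v in scores.items()}

    d, s = norm(dense), norm(sparse)
    ids = set(d) | set(s)  # a chunk may appear in only one result list
    return {
        cid: alpha * d.get(cid, 0.0) + (1 - alpha) * s.get(cid, 0.0)
        for cid in ids
    }
```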
Top-K selection:
| Top-K | Trade-off |
|---|---|
| 3 | Minimal context, lowest cost. Use when documents are highly relevant and focused |
| 5 | Default. Balances context breadth with cost and noise |
| 10 | Maximum context. Use when relevance is uncertain or topic requires synthesis across documents |
Rules:
How retrieved context is injected into the generation prompt determines output quality and faithfulness.
Template structure:
[System instruction — role, task, constraints]
You are a [role] that answers questions using ONLY the provided context documents.
If the answer is not in the context, say "I don't have enough information to answer this."
Always cite the source document for each claim in your answer.
[Retrieved context — clearly delimited]
<context>
[Source: {source_name_1}]
{chunk_text_1}
[Source: {source_name_2}]
{chunk_text_2}
</context>
[User query]
Question: {user_query}
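The template above can be rendered by a small assembly function. A sketch assuming retrieved chunks arrive as dicts with `"source_name"` and `"text"` keys; the default role string is a placeholder to be replaced per deployment.

```python
def build_prompt(chunks, user_query, role="support assistant"):
    """Assemble the generation prompt: system instruction,
    delimited context with per-chunk source labels, then the query."""
    context = "\n".join(
        f"[Source: {c['source_name']}]\n{c['text']}" for c in chunks
    )
    system = (
        f"You are a {role} that answers questions using ONLY the provided "
        "context documents.\n"
        "If the answer is not in the context, say \"I don't have enough "
        "information to answer this.\"\n"
        "Always cite the source document for each claim in your answer."
    )
    return f"{system}\n\n<context>\n{context}\n</context>\n\nQuestion: {user_query}"
```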
Citation requirements (MANDATORY):
- Every factual claim in the answer names the source document it came from.
- A citation must point to the chunk that actually supplied the information, not merely a plausible source.
Context window management:
- Total injected context is roughly top-K × chunk size; keep that, plus the system instruction and headroom for the answer, within the model's context window.
Evaluate retrieval and generation separately. If retrieval returns wrong documents, better prompts will not help.
Retrieval evaluation:
| Metric | Definition | Target |
|---|---|---|
| Precision@K | Of the K retrieved chunks, how many are relevant? | >= 80% |
| Recall@K | Of all relevant chunks in the corpus, how many were retrieved? | >= 70% |
| MRR | Mean Reciprocal Rank — is the best result near the top? | >= 0.8 |
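The three retrieval metrics have direct implementations. A minimal sketch over ranked chunk-ID lists and a set of relevant IDs; averaging each function over your labeled query set gives the table's aggregate numbers.

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k retrieved chunk ids, what fraction is relevant?"""
    top = retrieved[:k]
    return sum(1 for cid in top if cid in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant chunk ids, what fraction appears in the top-k?"""
    top = retrieved[:k]
    return sum(1 for cid in top if cid in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result; 0 if none retrieved.
    The mean of this over all queries is MRR."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0
```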
Generation evaluation:
| Metric | Definition | Target |
|---|---|---|
| Answer accuracy | Is the generated answer factually correct? | >= 90% |
| Citation accuracy | Do citations match the actual source of the information? | >= 95% |
| Faithfulness | Does the answer contain ONLY information from the context? | 100% — any hallucination is a failure |
| Completeness | Does the answer address the full question? | >= 85% |
Use RAGAS as the industry-standard RAG evaluation framework. The standard RAGAS metrics are: faithfulness, answer_relevancy, context_precision, and context_recall.
Evaluation process:
A RAG pipeline that returns stale information is worse than no pipeline — it gives users false confidence.
| Property | Decision |
|---|---|
| Rebuild frequency | Full re-index: [daily / weekly / monthly]. Based on corpus update frequency |
| Incremental updates | New/modified documents indexed within [timeframe]. Requires change detection |
| Change detection | Content hash comparison — re-embed only changed chunks |
| Stale content handling | Documents older than [threshold] flagged in retrieval results |
| Monitoring | Track: index size, query latency, retrieval scores, generation quality over time |
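Hash-based change detection can be sketched by diffing stored hashes against the current corpus. This assumes SHA-256 over full document text; production systems would typically hash at chunk granularity so only changed chunks are re-embedded.

```python
import hashlib

def content_hash(text):
    """SHA-256 of the text, stored alongside the index at last build."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(old_hashes, documents):
    """Diff stored hashes against the current corpus.

    old_hashes: {source_id: hash} from the previous index build.
    documents:  {source_id: text} as the corpus stands now.
    Returns (added, modified, removed) sets of source ids; only the
    added and modified sets need re-embedding.
    """
    new_hashes = {sid: content_hash(text) for sid, text in documents.items()}
    added = set(new_hashes) - set(old_hashes)
    removed = set(old_hashes) - set(new_hashes)
    modified = {
        sid for sid in set(new_hashes) & set(old_hashes)
        if new_hashes[sid] != old_hashes[sid]
    }
    return added, modified, removed
```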
Rules:
- Re-embed only chunks whose content hash changed; reserve full rebuilds for the scheduled cadence.
- Surface document age in retrieval results so stale content is visible rather than silently served.
# RAG Pipeline Design: [corpus/knowledge base name]
## Corpus Profile
| Property | Value |
|---|---|
| Document types | [formats] |
| Volume | [count, total size] |
| Update frequency | [cadence] |
| Language | [languages] |
## Chunking Configuration
- **Strategy:** [fixed / semantic / paragraph]
- **Chunk size:** [tokens]
- **Overlap:** [percentage]
- **Boundary rules:** [sentence / paragraph / section]
## Metadata Schema
| Field | Type | Values |
|---|---|---|
## Embedding
- **Model:** [name]
- **Dimensions:** [N]
- **Evaluation results:** [precision/recall on 20-query test]
## Retrieval
- **Strategy:** [similarity / hybrid / re-ranking]
- **Top-K:** [N]
- **Filters:** [metadata filters applied]
## Prompt Template
[Full generation prompt with context injection]
## Evaluation Results
### Retrieval
| Metric | Result | Target | Pass? |
|---|---|---|---|
### Generation
| Metric | Result | Target | Pass? |
|---|---|---|---|
## Freshness Strategy
- **Rebuild frequency:** [cadence]
- **Change detection:** [method]
- **Staleness threshold:** [age limit]
## Monitoring
- [Metrics tracked]
- [Alert conditions]
## Open Questions
- [Decisions pending data or stakeholder input]