From atum-ai-ml
Corrective RAG (CRAG) pattern library: an implementation of the Corrective Retrieval Augmented Generation paradigm by Yan et al. 2024 ("Corrective Retrieval Augmented Generation", ICLR 2024), which improves classical RAG by adding a retrieval evaluator that grades the relevance of retrieved documents and triggers fallback mechanisms when retrieval is judged insufficient. Covers:
- The core CRAG flow: retrieve documents from a vector store; grade each document via a lightweight T5 evaluator or LLM-as-judge into Correct/Incorrect/Ambiguous; when Correct, use as-is; when Ambiguous, combine knowledge refinement with web search; when Incorrect, discard and rely on web search.
- The knowledge refinement step: decompose retrieved documents into strips, filter strips by relevance, re-compose into clean context.
- The web search fallback: typically Google Search API, Brave Search, Tavily, Firecrawl, or Exa, to fetch fresh sources when the internal knowledge base fails.
- Benchmark gains reported in the paper: PopQA +20%, Biography +25%, PubHealth +10% over standard RAG.
- Comparison with alternative RAG variants: HyDE (hypothetical document embeddings), Self-RAG (self-reflection tokens), Adaptive RAG (decides when to retrieve).
- Implementation strategies: lightweight evaluator vs LLM-as-judge trade-off, web fallback cost management, hybrid local+web context fusion.
- Production considerations: latency added by the evaluator step, web API costs, hallucination risk reduction, compliance for web fetching.
- Use cases where CRAG dominates: open-domain QA with a risk of stale knowledge bases, fact-checking applications, customer support with both a static KB and dynamic web sources.
- Limitations: evaluator overhead, dependency on web search quality, added complexity vs simple RAG.

Use when standard RAG hallucinates due to poor retrieval, when knowledge base coverage is incomplete and web augmentation is acceptable, or when you need a robust fallback mechanism.
Differentiates from generic RAG by deep focus on retrieval quality grading and fallback orchestration.
npx claudepluginhub arnwaldn/atum-plugins-collection --plugin atum-ai-ml

This skill uses the workspace's default tool permissions.
Pattern published by **Yan et al. 2024** (University of Science and Technology of China, ICLR 2024). "Corrective Retrieval Augmented Generation" solves the main problem of classical RAG: **what to do when the retrieved documents are bad?**
Question → [Retrieve top-k docs] → [LLM Generate] → Answer
If the retrieved documents are irrelevant or incorrect, the LLM will hallucinate or confidently generate wrong answers.

CRAG solves this by evaluating the quality of the retrieval and adapting the strategy accordingly.
```
[QUESTION]
     │
     ▼
┌──────────────────┐
│     RETRIEVE     │ ← Vector store / BM25 / hybrid
└────────┬─────────┘
         │ top-k documents
         ▼
┌──────────────────┐
│    GRADE DOCS    │ ← Lightweight evaluator (T5) or LLM
│  (Correct/Inc.   │
│   /Ambiguous)    │
└────────┬─────────┘
         │
    ┌────┴────┬──────────┐
    │         │          │
    ▼         ▼          ▼
 CORRECT  AMBIGUOUS  INCORRECT
    │         │          │
    │         ▼          ▼
    │  ┌──────────┐ ┌──────────┐
    │  │ KNOWLEDGE│ │   WEB    │
    │  │ REFINE + │ │  SEARCH  │
    │  │WEB SEARCH│ │ FALLBACK │
    │  └────┬─────┘ └────┬─────┘
    │       │            │
    └───────┴────────────┘
            │
            ▼
       [GENERATE]
            │
            ▼
        [ANSWER]
```
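The flow above can be sketched as plain Python control flow. All helpers (`retrieve`, `grade_doc`, `refine`, `web_search`, `generate`) are hypothetical and injected as parameters; this is a sketch of the routing logic only, not a definitive implementation.

```python
def crag_answer(question, retrieve, grade_doc, refine, web_search, generate):
    """Route between local KB, refinement, and web fallback per the CRAG flow."""
    docs = retrieve(question)                       # top-k from the vector store
    grades = [grade_doc(question, d) for d in docs]

    if docs and all(g == "incorrect" for g in grades):
        # Internal KB failed entirely: rely on web search only
        context = web_search(question)
    elif any(g in ("ambiguous", "incorrect") for g in grades):
        # Mixed signal: refine what is usable and augment with the web
        kept = [d for d, g in zip(docs, grades) if g != "incorrect"]
        context = refine("\n\n".join(kept), question) + "\n\n" + web_search(question)
    else:
        # Every document judged correct: use the local context directly
        context = "\n\n".join(docs)

    return generate(question, context)
```

Keeping the helpers injectable makes each branch testable with stubs before wiring in a real vector store, evaluator, and search API.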
Grade each retrieved document as Correct / Ambiguous / Incorrect.
Option A: lightweight T5 evaluator (from the paper).

Option B: LLM-as-judge, with a prompt like:
```
Question: {question}
Document: {document}
Rate this document on a scale of 0 to 10 for its relevance
to answering the question. Return JSON:
{"score": 0-10, "verdict": "correct"|"ambiguous"|"incorrect", "reason": "..."}
```
Typical thresholds: for example, score ≥ 7 → correct, 3-6 → ambiguous, < 3 → incorrect (tune on your own corpus).
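A minimal sketch of the LLM-as-judge option. `call_llm` is an assumed callable that sends a prompt to whatever model you use and returns its text completion; the threshold fallback values are illustrative, not prescribed by the paper.

```python
import json

JUDGE_PROMPT = """Question: {question}
Document: {document}
Rate this document on a scale of 0 to 10 for its relevance
to answering the question. Return JSON:
{{"score": 0-10, "verdict": "correct"|"ambiguous"|"incorrect", "reason": "..."}}"""

def grade_document(question, document, call_llm):
    """Return 'correct', 'ambiguous', or 'incorrect' for one document."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, document=document))
    result = json.loads(raw)
    # If the model omitted the verdict, derive it from the numeric score
    if "verdict" not in result:
        s = result["score"]
        result["verdict"] = "correct" if s >= 7 else "ambiguous" if s >= 3 else "incorrect"
    return result["verdict"]
```

In production you would also guard against malformed JSON (retry or default to "ambiguous"), since the judge output is untrusted model text.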
The raw document is decomposed into strips (paragraphs or sentences), each strip is filtered individually, then the surviving strips are recomposed.
```python
def refine_knowledge(document, question):
    strips = split_into_strips(document)  # split by paragraph or sentence
    filtered = []
    for strip in strips:
        score = evaluator(question, strip)  # same grader as the doc-level step
        if score >= threshold:
            filtered.append(strip)
    return "\n\n".join(filtered)
```
This eliminates noise in long documents where only part of the content is relevant.
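A toy, self-contained demonstration of strip-level filtering: a simple word-overlap score stands in for the real T5/LLM evaluator, and the names (`overlap_score`, `refine`, the 0.3 threshold) are illustrative choices, not part of CRAG itself.

```python
def overlap_score(question, strip):
    """Fraction of question words that appear in the strip (toy evaluator)."""
    q = set(question.lower().split())
    s = set(strip.lower().split())
    return len(q & s) / max(len(q), 1)

def refine(document, question, threshold=0.3):
    """Keep only paragraphs whose overlap with the question clears the threshold."""
    strips = [p for p in document.split("\n\n") if p.strip()]
    return "\n\n".join(s for s in strips if overlap_score(question, s) >= threshold)

doc = "Vietnam GDP reached 476 billion in 2024.\n\nPho is a popular noodle soup."
print(refine(doc, "What is Vietnam GDP in 2024"))
# → Vietnam GDP reached 476 billion in 2024.
```

The off-topic paragraph is dropped; in a real deployment the lexical score would be replaced by the trained evaluator, but the split-filter-rejoin shape stays the same.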
If the local KB fails, the context is enriched via web search.

Common APIs:
| API | When to use | Cost |
|---|---|---|
| Tavily | LLM-friendly, filtered by default | $0.04 / search |
| Brave Search | Privacy-friendly, fallback | $0.005 / query |
| Google Custom Search | Google stack | $5 / 1k queries |
| Bing Search | Microsoft stack | $7 / 1k queries |
| Firecrawl | Scraping specific pages | $0.001 / page |
| Exa.ai | LLM-native semantic search | $0.005 / search |
| SerpAPI | Versatile, multi-engine | $50 / month |
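For cost management, the providers above can be chained so a failure (quota, outage) falls through to the next option. A sketch under the assumption that each provider is wrapped in a uniform `search_fn(question, max_results)` callable; the wrapper names are hypothetical, not vendor SDK calls.

```python
def search_with_fallback(question, providers, max_results=5):
    """Try each (name, search_fn) pair in order; return the first usable result."""
    errors = {}
    for name, search_fn in providers:
        try:
            results = search_fn(question, max_results)
            if results:
                return name, results
        except Exception as exc:
            errors[name] = exc  # record and fall through to the next provider
    raise RuntimeError(f"All search providers failed: {errors}")
```

Ordering the list by cost (cheapest first) or by quality (best first, cheap fallback) is the main tuning knob.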
```python
def web_search_fallback(question):
    results = tavily.search(question, max_results=5, include_raw_content=True)
    docs = [r["content"] for r in results["results"]]
    return refine_knowledge("\n\n".join(docs), question)
```
A final LLM generates the answer from the refined context (local + web).
```
Question: "What is Vietnam's GDP in 2025?"

Step 1: Retrieve from internal KB
  Doc1: "In 2020, Vietnam had a GDP of 271 billion USD..."
  Doc2: "The Vietnamese economy is mainly..."

Step 2: Grade docs
  Doc1: ambiguous (data from 2020, not 2025)
  Doc2: incorrect (no figures)

Step 3: Global verdict = "ambiguous + incorrect"
  → Trigger web search fallback

Step 4: Web search "Vietnam GDP 2025"
  Result: "Vietnam's GDP reached $476 billion in 2024, with the World Bank
  projecting $510 billion for 2025..."

Step 5: Knowledge refinement
  Filtered: "Vietnam's GDP reached $510 billion projected for 2025"

Step 6: Generate final answer with refined context
  "According to World Bank projections, Vietnam's GDP in 2025 is
  estimated at around $510 billion."
```
| Benchmark | Standard RAG | Self-RAG | CRAG |
|---|---|---|---|
| PopQA (open-domain QA) | 38.7% | 54.9% | 59.8% (+21pts vs RAG) |
| Biography (long-form) | 64.0% | 81.2% | 86.0% (+22pts) |
| PubHealth (fact-checking) | 65.0% | 75.6% | 80.6% (+15pts) |
Typical benefit: ~95% retrieval relevance (vs ~70% for classic RAG).
| Pattern | Difference |
|---|---|
| Classic RAG | No evaluation, no fallback |
| HyDE (Hypothetical Document Embeddings) | Generates a hypothetical document to improve the query, but no fallback |
| Self-RAG (Asai 2023) | The LLM learns special tokens to decide when to retrieve; requires fine-tuning |
| Adaptive RAG (Jeong 2024) | Chooses between no-retrieval / single-step / multi-step based on query complexity |
| CRAG | Plug-and-play, no fine-tuning, automatic web fallback |
```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class CRAGState(TypedDict):
    question: str
    documents: list
    grade: str
    web_search_done: bool
    final_answer: str

workflow = StateGraph(CRAGState)
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("grade", grade_docs)
workflow.add_node("refine", refine_knowledge)
workflow.add_node("web_search", web_search_fallback)
workflow.add_node("generate", generate_answer)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", route_by_grade, {
    "correct": "refine",
    "ambiguous": "web_search",
    "incorrect": "web_search",
})
workflow.add_edge("web_search", "refine")
workflow.add_edge("refine", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
```
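The `route_by_grade` function used in the conditional edges is not shown above; a minimal sketch, assuming the grading node stores a per-document verdict list under a `grades` key in the state (a convention chosen here, not a LangGraph requirement):

```python
def route_by_grade(state):
    """Collapse per-document verdicts into one branch label for the graph."""
    grades = state.get("grades", [])
    if grades and all(g == "correct" for g in grades):
        return "correct"      # all docs usable: refine locally, skip the web
    if any(g in ("correct", "ambiguous") for g in grades):
        return "ambiguous"    # mixed signal: web search, then refine
    return "incorrect"        # nothing usable: web search fallback only
```

The aggregation policy (strict `all` for the happy path, pessimistic default) is a design choice; a looser policy trades fewer web calls for more hallucination risk.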
| Scenario | Recommendation |
|---|---|
| Stable KB, closed domain, hallucination tolerated | Simple RAG is enough |
| Partial KB, web augmentation acceptable | CRAG |
| Volatile KB, frequently changing information | CRAG |
| Compliance forbids web search | Simple RAG, knowledge refinement only |
| Tight budget (latency + cost) | Simple RAG or Adaptive RAG |
| Fact-checking, medical, legal | CRAG (robustness needed) |
- rag-architect (this plugin)
- pinecone-patterns / weaviate-patterns / qdrant-patterns (this plugin)
- react-pattern (this plugin)
- reflexion-pattern (this plugin)
- graphrag-pattern (this plugin)