From thinking-frameworks-skills

Evaluates GraphRAG systems on KG completeness, retrieval relevance, answer correctness, reasoning depth, and hallucination prevention. Guides metric selection, test protocols, and reporting.

Install: `npx claudepluginhub lyndonkl/claude --plugin thinking-frameworks-skills`

This skill uses the workspace's default tool permissions.
## Workflow

Copy this checklist and work through each step:
**Step 1: Scope the evaluation.** Define what aspects of your GraphRAG system you need to evaluate and why. Determine whether you are evaluating the full pipeline or specific components (KG construction, retrieval, generation). Clarify the use case context: domain, query complexity, expected reasoning depth. See methodology.md for the full evaluation dimensions framework.
**Step 2: Select metrics.** Choose metrics appropriate to your evaluation scope. Not every evaluation requires every metric. Match metrics to your system's maturity and the questions you need answered. See the Metric Selection Guide below and methodology.md for detailed metric definitions.
**Step 3: Build test sets.** Cover all of your evaluation dimensions. Include single-hop factual queries, multi-hop reasoning queries, constraint satisfaction queries, temporal reasoning queries, comparative queries, and negative queries (questions the system should not answer). See methodology.md for baseline comparison approaches and statistical significance testing.
**Step 4: Assess reasoning depth.** Evaluate how well your system handles multi-step reasoning. Verify that each reasoning step is grounded in retrieved KG evidence. Check for error propagation, where an incorrect intermediate step leads to wrong conclusions. See reasoning-patterns.md for chain validation, pattern matching, hypothesis verification, and causal reasoning evaluation.
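Once a reasoning chain has been extracted from an answer, stepwise grounding and error propagation reduce to a set-containment pass. The `verify_chain` helper and all IDs below are illustrative, assuming claim/evidence pairs are extracted upstream:

```python
def verify_chain(steps, retrieved_evidence):
    """Check each reasoning step against retrieved KG evidence.

    steps: list of (claim, evidence_ids) pairs extracted from the answer.
    retrieved_evidence: set of entity/relation IDs actually retrieved.
    Returns per-step grounding flags and the index of the first
    ungrounded step (from which errors may propagate), or -1 if none.
    """
    grounded = [bool(ids) and set(ids) <= retrieved_evidence
                for _, ids in steps]
    first_failure = next((i for i, ok in enumerate(grounded) if not ok), -1)
    return grounded, first_failure

# Example: step 2 cites a relation that was never retrieved
chain = [("Jane Doe founded Startup X", ["ent:jane_doe", "rel:founded"]),
         ("Startup X was acquired by Acme", ["rel:acquired_by"])]
flags, fail_at = verify_chain(chain, {"ent:jane_doe", "rel:founded"})
# flags == [True, False], fail_at == 1
```

Counting answers whose conclusion is wrong *and* whose `fail_at` is non-negative gives a direct estimate of the error propagation rate.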
**Step 5: Measure hallucination.** Quantify both intrinsic hallucination (claims that contradict retrieved evidence) and extrinsic hallucination (claims not supported by any retrieved source). Measure the KG grounding rate: the percentage of generated claims traceable to knowledge graph entities and relations. See methodology.md for hallucination detection approaches and comparison protocols.
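Assuming claims and their cited KG IDs have already been extracted from the answer (e.g. by an LLM judge), the KG grounding rate reduces to a containment check. A hypothetical sketch:

```python
def kg_grounding_rate(claims, kg_ids):
    """Fraction of generated claims traceable to the knowledge graph.

    claims: list of (claim_text, cited_ids) pairs; extraction of these
    pairs is assumed to happen upstream.
    kg_ids: set of all entity/relation IDs in the KG.
    """
    if not claims:
        return 0.0
    grounded = sum(1 for _, ids in claims if ids and set(ids) <= kg_ids)
    return grounded / len(claims)

claims = [("Acme acquired Startup X", ["ent:acme", "rel:acquired_by"]),
          ("Acme was founded in 1999", [])]  # no supporting IDs: extrinsic
kg = {"ent:acme", "rel:acquired_by", "ent:startup_x"}
rate = kg_grounding_rate(claims, kg)  # 0.5
```

Separating intrinsic from extrinsic cases still requires comparing each ungrounded claim against the retrieved evidence, which this sketch deliberately leaves upstream.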
**Step 6: Compare against baselines.** Run identical test sets against baseline systems: pure vector RAG, LLM-only (no retrieval), and alternative graph configurations. Use controlled ablation studies to isolate the contribution of each component. See methodology.md for baseline comparison and ablation study design.
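For the statistical comparison, one common option is a paired bootstrap over per-query scores from the same test set. A stdlib-only sketch; the function name and defaults are illustrative:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap for the mean score difference between two systems
    evaluated on the same test set (e.g. GraphRAG vs. pure vector RAG).

    Returns the observed mean difference and the fraction of bootstrap
    resamples in which system A fails to beat system B (a one-sided
    p-value estimate).
    """
    assert len(scores_a) == len(scores_b) > 0
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return observed, not_better / n_resamples

# Per-query correctness (1 = correct) for two systems on five queries
obs, p = paired_bootstrap([1, 1, 1, 1, 0], [0, 0, 1, 0, 0])
```

Because the resampling is paired per query, query-difficulty variance cancels out, which matters on the small test sets typical of KG evaluations.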
**Step 7: Compile the report.** Compile findings into the structured output template below. Include metric values, baseline comparisons, identified weaknesses, and prioritized recommendations. See rubric_evaluation.json for the scoring rubric (minimum passing score: 3.0).
## Evaluation Dimensions

| Dimension | What It Measures | Key Metrics | Priority |
|---|---|---|---|
| KG Quality | Completeness and accuracy of the knowledge graph | Entity coverage, relation completeness, schema consistency | High |
| Retrieval Quality | Effectiveness of graph-based retrieval | Context recall (C-Rec), context precision, multi-hop coverage | High |
| Answer Correctness | Accuracy and completeness of generated answers | Factual accuracy, answer completeness, citation accuracy | Critical |
| Hallucination Rate | Frequency of unsupported or contradicted claims | Intrinsic hallucination rate, extrinsic hallucination rate, KG grounding rate | Critical |
| Reasoning Depth | Ability to perform multi-step reasoning correctly | Multi-hop accuracy, stepwise verification score, error propagation rate | Medium-High |
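The retrieval metrics in the table admit a simple set-based formulation when gold evidence IDs are annotated per query; many frameworks use LLM-judged variants instead, so treat this as a minimal sketch:

```python
def context_metrics(retrieved, gold):
    """Set-based context precision and recall for one query.

    retrieved: KG evidence IDs the retriever returned.
    gold: IDs annotated as necessary to answer the query.
    """
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 1.0  # nothing required: vacuously met
    return precision, recall

# Retriever returned four items; only two were actually needed
p, r = context_metrics({"e1", "e2", "e3", "e4"}, {"e1", "e2"})
# p == 0.5 (half the retrieved context was relevant), r == 1.0
```

Averaging recall over only the multi-hop test cases gives the multi-hop coverage figure the table asks for.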
## Metric Selection Guide

Choose metrics based on your evaluation goals:

- **Quick Health Check** (minimal effort): factual accuracy, intrinsic and extrinsic hallucination rates, and KG grounding rate (the Critical-priority metrics from the table above).
- **Standard Evaluation** (recommended): the Quick Health Check metrics plus the High-priority dimensions: entity coverage, relation completeness, schema consistency, context recall (C-Rec), context precision, and multi-hop coverage.
- **Comprehensive Benchmark** (production readiness): all five dimensions, adding multi-hop accuracy, stepwise verification, error propagation rate, retrieval latency percentiles, full baseline comparisons, and statistical significance testing.
# GraphRAG Evaluation Report
## 1. System Under Evaluation
- System name and version:
- Domain:
- KG size (entities/relations):
- Evaluation date:
## 2. Evaluation Scope
- Dimensions evaluated:
- Test set size and composition:
- Baseline systems:
## 3. KG Quality Results
- Entity coverage: ____%
- Relation completeness: ____%
- Schema consistency score: ____
- Notable gaps:
## 4. Retrieval Quality Results
- Context recall (C-Rec): ____
- Context precision: ____
- Multi-hop coverage: ____%
- Latency (p50/p95/p99): ____
## 5. Answer Correctness Results
- Factual accuracy: ____%
- Answer completeness: ____%
- Citation accuracy: ____%
## 6. Hallucination Analysis
- Intrinsic hallucination rate: ____%
- Extrinsic hallucination rate: ____%
- KG grounding rate: ____%
- Comparison with/without graph augmentation:
## 7. Reasoning Depth Results
- Single-hop accuracy: ____%
- Multi-hop accuracy: ____%
- Stepwise reasoning correctness: ____%
- Error propagation incidents: ____
## 8. Baseline Comparison
| Metric | GraphRAG | Pure Vector RAG | LLM Only |
|--------|----------|-----------------|----------|
| Answer correctness | | | |
| Hallucination rate | | | |
| Multi-hop accuracy | | | |
## 9. Statistical Significance
- Test used:
- Confidence level:
- Significant improvements:
- Non-significant differences:
## 10. Identified Weaknesses
1.
2.
3.
## 11. Recommendations
| Priority | Recommendation | Expected Impact | Effort |
|----------|---------------|-----------------|--------|
| | | | |
## 12. Rubric Score
- Metric Coverage: __ / 5
- Measurement Rigor: __ / 5
- Baseline Comparison: __ / 5
- Reasoning Depth: __ / 5
- Actionable Recommendations: __ / 5
- **Weighted Total: __ / 5.0** (minimum passing: 3.0)
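The weighted total is a weighted average of the five criterion scores. The actual weights live in rubric_evaluation.json, so the equal weights below are only an assumption for illustration:

```python
def weighted_total(scores, weights=None):
    """Combine rubric criterion scores (each 0-5) into a weighted total.

    Equal weights are assumed here; substitute the weights from
    rubric_evaluation.json for real scoring.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    return round(total, 2)

# Metric Coverage, Measurement Rigor, Baseline Comparison,
# Reasoning Depth, Actionable Recommendations
score = weighted_total([4, 3, 3, 4, 5])  # -> 3.8
passed = score >= 3.0
```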