From grimoire
Evaluates LLM outputs for factual accuracy, relevance, safety, and alignment using RAGAS, TruLens, and HELM frameworks. Useful for auditing RAG pipelines and detecting hallucinations.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
How this skill is triggered: by the user, by Claude, or both
Slash command
/grimoire:audit-llm-output
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Systematically evaluate LLM outputs for factual accuracy, relevance, safety, and alignment using structured frameworks and automated metrics.
Adopted by: OpenAI (evals framework), Anthropic (Constitutional AI red-teaming), Stanford (HELM benchmark), Google DeepMind
Impact: RAGAS studies show that naive RAG pipelines have faithfulness scores of 0.6-0.7 out of 1.0, meaning 30-40% of LLM statements are unsupported by the retrieved context; systematic auditing identifies and resolves these gaps.
LLM outputs are probabilistic: the same prompt can produce responses of varying quality. Without structured auditing, quality regressions go undetected when models are updated, prompts change, or knowledge bases evolve. Automated metrics provide continuous quality signals; human evaluation sets the ground truth.
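RAGAS evaluation:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# golden_dataset: a curated evaluation set, assumed to be defined elsewhere
# in the column layout that the installed RAGAS version expects.
results = evaluate(
    dataset=golden_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
# Target: faithfulness > 0.85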
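The faithfulness target above can be turned into a pass/fail gate, for example in CI. The following is a minimal sketch and not part of RAGAS: `check_quality_gate` and `THRESHOLDS` are hypothetical names, only the 0.85 faithfulness target comes from the text, and it assumes the mean score per metric has already been extracted into a plain dict.

```python
from typing import Dict, List

# Hypothetical minimum scores. Only the faithfulness target (0.85) comes from
# the text above; the answer_relevancy value is a placeholder to tune.
THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
}


def check_quality_gate(
    scores: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing score")
        elif value <= minimum:
            # The target is a strict "greater than", so equality also fails.
            failures.append(f"{metric}: {value:.2f} is not above {minimum:.2f}")
    return failures


if __name__ == "__main__":
    # Illustrative scores only, not real measurements.
    example_scores = {"faithfulness": 0.65, "answer_relevancy": 0.90}
    for failure in check_quality_gate(example_scores):
        print("FAIL", failure)
```

Returning the list of failures rather than a bare boolean lets a CI job report every metric that regressed in one run.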
Hallucination detection prompt: "Given the context: [context]. Does the following claim appear in the context, or can it be directly inferred from it? Claim: [claim]. Answer: YES / NO / PARTIAL."
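One way this prompt might be applied per claim and rolled up into a single score is sketched below. The template is adapted from the prompt above. `build_prompt`, `parse_verdict`, `faithfulness_score`, and the `judge` callable are hypothetical names; `judge` stands in for whatever LLM client is in use. Counting PARTIAL as half-supported and treating unparseable replies as NO are assumptions, not something RAGAS prescribes.

```python
from typing import Callable, List

# Template adapted from the hallucination detection prompt above.
PROMPT_TEMPLATE = (
    "Given the context: {context}. Does the following claim appear in the "
    "context, or can it be directly inferred from it? Claim: {claim}. "
    "Answer: YES / NO / PARTIAL."
)

# Assumed weights: a PARTIAL verdict counts as half-supported.
VERDICT_WEIGHTS = {"YES": 1.0, "PARTIAL": 0.5, "NO": 0.0}


def build_prompt(context: str, claim: str) -> str:
    """Fill the template for a single claim."""
    return PROMPT_TEMPLATE.format(context=context, claim=claim)


def parse_verdict(reply: str) -> str:
    """Map a free-text judge reply to YES, NO, or PARTIAL.

    Only the first word is inspected; anything unrecognised is treated as NO
    (a conservative assumption).
    """
    words = reply.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    return first if first in VERDICT_WEIGHTS else "NO"


def faithfulness_score(
    context: str,
    claims: List[str],
    judge: Callable[[str], str],
) -> float:
    """Average verdict weight over all claims; 0.0 when there are no claims."""
    if not claims:
        return 0.0
    total = sum(
        VERDICT_WEIGHTS[parse_verdict(judge(build_prompt(context, claim)))]
        for claim in claims
    )
    return total / len(claims)


if __name__ == "__main__":
    # Stand-in judge for demonstration; a real one would call an LLM.
    def fake_judge(prompt: str) -> str:
        claim_part = prompt.rsplit("Claim: ", 1)[1]
        return "YES." if "capital of France" in claim_part else "No, unsupported."

    context = "Paris is the capital of France."
    claims = ["Paris is the capital of France", "Paris has 10 million residents"]
    print(faithfulness_score(context, claims, fake_judge))
```

Passing the judge in as a callable keeps the scoring logic testable without network access, and makes it easy to swap judge models when checking for evaluator bias.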