From vanguard-frontier-agentic
Reviews evaluation setup for LLM/AI pipelines—metric selection, golden datasets, threshold governance, adversarial coverage, regression gating—to prevent unsafe model outputs from shipping.
npx claudepluginhub raishin/vanguard-frontier-agentic --plugin vanguard-frontier-agentic
How this skill is triggered: by the user, by Claude, or both
Slash command
/vanguard-frontier-agentic:llm-ai-pipeline-test-review
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
This skill reviews how an LLM or AI pipeline is evaluated — not the model itself, but the evaluation setup that decides whether a model change is safe to ship. An evaluation suite only protects users if it measures the right things, gates on meaningful thresholds, covers adversarial inputs, and detects drift across model versions. The review catches missing hallucination and factuality metrics,...
Audits LLM eval pipelines for issues like missing error analysis, unvalidated judges, and vanity metrics. Produces prioritized findings with fixes when inheriting systems or verifying trustworthiness.
Evaluates LLM outputs for factual accuracy, relevance, safety, and alignment using RAGAS, TruLens, and HELM frameworks. Useful for auditing RAG pipelines and detecting hallucinations.
Implements evaluation strategies and quality gates for LLM outputs: structural validation, semantic checks, LLM-as-judge with bias mitigations, prompt testing, and guardrails. Use for evals, CI gates, quality measurement, regressions.
This skill reviews how an LLM or AI pipeline is evaluated — not the model itself, but the evaluation setup that decides whether a model change is safe to ship. An evaluation suite only protects users if it measures the right things, gates on meaningful thresholds, covers adversarial inputs, and detects drift across model versions. The review catches missing hallucination and factuality metrics, absent answer-relevancy and faithfulness checks for RAG pipelines, unguarded bias and toxicity, no adversarial or red-team coverage, agent evals that ignore tool correctness and task completion, thresholds that are undefined or set to zero, single-shot evals on non-deterministic outputs, and no regression baseline to detect metric drift.
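The threshold, single-shot, and regression-baseline checks described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the skill's own implementation: `MetricGate`, `review_gate`, and the `max_drop` tolerance of 0.02 are hypothetical names and values chosen for the example, and scores are assumed to lie between 0 and 1.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class MetricGate:
    """Hypothetical gate configuration for one eval metric."""
    name: str
    threshold: Optional[float]  # minimum acceptable mean score
    baseline: Optional[float]   # mean score of the last released version
    max_drop: float = 0.02      # assumed tolerance for drift below baseline


def review_gate(gate: MetricGate, run_scores: list[float]) -> list[str]:
    """Return a list of problems for one metric; an empty list means pass."""
    problems: list[str] = []

    # A missing or zero threshold means the gate can never fail.
    if gate.threshold is None or gate.threshold <= 0:
        problems.append(f"{gate.name}: threshold undefined or zero")

    # Non-deterministic outputs need repeated runs, not a single sample.
    if len(run_scores) < 2:
        problems.append(f"{gate.name}: single-shot eval on non-deterministic output")

    # Without a baseline, drift across model versions is invisible.
    if gate.baseline is None:
        problems.append(f"{gate.name}: no regression baseline")

    if not run_scores:
        return problems

    avg = mean(run_scores)
    if gate.threshold and avg < gate.threshold:
        problems.append(f"{gate.name}: mean {avg:.3f} is below threshold {gate.threshold}")
    if gate.baseline is not None and gate.baseline - avg > gate.max_drop:
        problems.append(f"{gate.name}: regressed from baseline {gate.baseline} to {avg:.3f}")
    return problems
```

For example, `review_gate(MetricGate("faithfulness", 0.8, 0.90), [0.82, 0.84, 0.83])` passes the threshold but still reports a regression, because the mean has dropped more than the assumed tolerance below the baseline.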
Flag no HallucinationMetric or no GEval with factuality criteria against source documents as HIGH: the pipeline can fabricate facts and ship them.
Flag a missing AnswerRelevancyMetric as MEDIUM: responses may be fluent but off-topic, and no eval catches it.
Flag a missing FaithfulnessMetric as HIGH: the model can ignore retrieved context and hallucinate; faithfulness is the primary RAG correctness signal.
Flag a missing ContextualPrecisionMetric or ContextualRecallMetric in a RAG pipeline as MEDIUM: retrieval quality is unmeasured, so noisy or incomplete retrieval is invisible to the eval.
Flag a missing BiasMetric or ToxicityMetric as HIGH if the system is user-facing: unsafe outputs can reach users without detection. Treat it as CRITICAL if the audience is vulnerable (children, medical patients, crisis users).
Flag a missing ToolCorrectnessMetric as HIGH: the agent can call wrong tools silently and the eval still passes.
Flag a missing TaskCompletionMetric as HIGH: end-to-end success is unmeasured even if individual steps look fine.
Load these only when needed:
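To make the severity rules listed above concrete, here is a rough sketch that maps a pipeline description to findings. `PipelineProfile` and `review_metrics` are hypothetical names, not part of this skill or of any eval library, and metric names are matched as plain strings. Two readings are assumptions: the answer-relevancy rule is applied to every pipeline, and a paired rule fires when either metric of the pair is absent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PipelineProfile:
    """Hypothetical description of the pipeline under review."""
    metrics: set[str] = field(default_factory=set)
    geval_checks_factuality: bool = False  # GEval with factuality criteria present
    is_rag: bool = False
    is_agent: bool = False
    user_facing: bool = False
    vulnerable_audience: bool = False


def review_metrics(p: PipelineProfile) -> list[tuple[str, str]]:
    """Return (severity, message) findings for missing metrics."""
    findings: list[tuple[str, str]] = []

    def missing(*names: str) -> list[str]:
        return [n for n in names if n not in p.metrics]

    # Assumption: the rule fires only when neither check exists.
    if missing("HallucinationMetric") and not p.geval_checks_factuality:
        findings.append(("HIGH", "no hallucination or factuality check"))

    # Assumption: applied to every pipeline, not only RAG.
    if missing("AnswerRelevancyMetric"):
        findings.append(("MEDIUM", "AnswerRelevancyMetric missing"))

    if p.is_rag:
        if missing("FaithfulnessMetric"):
            findings.append(("HIGH", "FaithfulnessMetric missing"))
        gaps = missing("ContextualPrecisionMetric", "ContextualRecallMetric")
        if gaps:
            findings.append(("MEDIUM", "retrieval unmeasured: " + ", ".join(gaps)))

    if p.user_facing:
        gaps = missing("BiasMetric", "ToxicityMetric")
        if gaps:
            # Escalate when the audience is vulnerable.
            severity = "CRITICAL" if p.vulnerable_audience else "HIGH"
            findings.append((severity, "safety unguarded: " + ", ".join(gaps)))

    if p.is_agent:
        for name in missing("ToolCorrectnessMetric", "TaskCompletionMetric"):
            findings.append(("HIGH", f"{name} missing"))

    return findings
```

A user-facing RAG pipeline for a vulnerable audience that only runs AnswerRelevancyMetric and FaithfulnessMetric would yield one HIGH, one MEDIUM, and one CRITICAL finding under these assumptions.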
Return, at minimum: