Designs and runs LLM evaluation with Langfuse — the strategy and workflow layer for scoring quality, building datasets, and running experiments. Use whenever the user is evaluating LLM output quality with Langfuse: "evaluate my LLM app", "which eval method should I use", "set up LLM-as-a-judge", "create a dataset / run an experiment", "score my traces", "offline vs online evaluation", "test prompt changes before deploying", "build a regression test set", or interpreting experiment results. Owns eval STRATEGY and the datasets/experiments/scores workflow; defers judge calibration and CI/CD experiment code to the vendored `langfuse` skill, and exact SDK code to live docs.
Slash command: `/claude-langfuse-plugin:langfuse-evaluation`
This skill carries the durable *judgment* of evaluating LLM applications with Langfuse: how the
evaluation loop works, which method to use when, and how scores, datasets, and experiments fit
together. It does not embed SDK code — Langfuse updates frequently, so fetch current code from live
docs and hand calibration/CI specifics to the vendored langfuse skill.

- For current code, append .md to the page URL (e.g.
  https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk.md) or use the SDK
  references (python.reference.langfuse.com, js.reference.langfuse.com). Never write eval code
  from memory.
- Defer to the vendored langfuse skill for: judge calibration / reliability
  (skills/langfuse/references/judge-calibration.md), CI/CD experiment gates
  (skills/langfuse/references/ci-cd.md), systematic error analysis
  (skills/langfuse/references/error-analysis.md), and capturing user feedback as scores
  (skills/langfuse/references/user-feedback.md). Don't duplicate those here.

First decide offline or online, and what the score attaches to:
- … references/datasets-experiments.md).
- … references/methods-overview.md.

Start from references/methods-overview.md to pick among the five methods by what's being judged
(deterministic vs subjective) and scale; methods compose. Then go deep in the method reference:

- references/llm-as-a-judge.md: subjective judgment at scale (setup, where to point it;
  calibration deferred to the vendored skill).
- references/code-evaluators.md: deterministic checks (JSON/schema/match/business rules).
- references/human-annotation.md: annotation queues + human-in-the-loop scoring from your own
  tool (ground truth).

Every method emits a score. Get the level (trace/observation/session/dataset-run) and data type
(NUMERIC/CATEGORICAL/BOOLEAN/TEXT) right, and use a ScoreConfig for any score multiple
people/pipelines produce. See references/scores.md — note TEXT scores can't be aggregated.
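
To make the level and data-type choices concrete, here is a minimal, SDK-free sketch of a score record checked against a shared config. The `Score` and `ScoreConfig` classes, `validate`, and `aggregate_numeric` are hypothetical stand-ins written for illustration, not the Langfuse SDK types; take the real schema from the live docs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence, Union

LEVELS = {"trace", "observation", "session", "dataset-run"}
DATA_TYPES = {"NUMERIC", "CATEGORICAL", "BOOLEAN", "TEXT"}


@dataclass(frozen=True)
class ScoreConfig:
    """Hypothetical shared definition so every producer emits the same shape."""
    name: str
    data_type: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: Sequence[str] = ()


@dataclass(frozen=True)
class Score:
    """Hypothetical score record: what it attaches to, plus a typed value."""
    name: str
    level: str
    data_type: str
    value: Union[float, str, bool]


def validate(score: Score, config: ScoreConfig) -> None:
    """Raise ValueError if the score does not match the shared config."""
    if config.data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {config.data_type}")
    if score.level not in LEVELS:
        raise ValueError(f"unknown level: {score.level}")
    if score.name != config.name or score.data_type != config.data_type:
        raise ValueError("score does not match config name/data type")
    value = score.value
    if config.data_type == "NUMERIC":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("NUMERIC score needs a number")
        if config.min_value is not None and value < config.min_value:
            raise ValueError("value below configured minimum")
        if config.max_value is not None and value > config.max_value:
            raise ValueError("value above configured maximum")
    elif config.data_type == "CATEGORICAL":
        if value not in config.categories:
            raise ValueError(f"unknown category: {value}")
    elif config.data_type == "BOOLEAN":
        if not isinstance(value, bool):
            raise ValueError("BOOLEAN score needs True or False")
    elif not isinstance(value, str):  # TEXT
        raise ValueError("TEXT score needs a string")


def aggregate_numeric(scores: Sequence[Score]) -> float:
    """Average NUMERIC scores; TEXT scores have no meaningful aggregate."""
    if any(s.data_type != "NUMERIC" for s in scores):
        raise ValueError("only NUMERIC scores can be averaged")
    return mean(float(s.value) for s in scores)
```

A shared config of this kind is what keeps scores from different people or pipelines comparable: a judge emitting 0-10 and an annotator emitting 0-1 under the same name would fail validation instead of silently skewing the average.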
Follow references/datasets-experiments.md: assemble dataset items (seed from production traces),
define task + evaluator functions, run via UI (quick) or SDK (full control), then interpret
results top-down (aggregate metrics → item-level diff vs baseline → trace debugging → annotate
regressions) — that section is the high-judgment payoff. Manage datasets in Langfuse for comparison
views. Fetch live docs for the actual SDK code.
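
The first two steps of that top-down reading order can be sketched without any SDK. This assumes each run has already been exported as a mapping from dataset item id to a numeric score; the function names, the `tolerance` parameter, and the sample data are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Assumed export shape: dataset item id -> numeric score for one run.
RunScores = Dict[str, float]


def aggregate(run: RunScores) -> float:
    """Step 1: one headline number per run."""
    return mean(run.values())


def regressions(baseline: RunScores, candidate: RunScores,
                tolerance: float = 0.0) -> List[Tuple[str, float]]:
    """Step 2: items that scored worse than baseline, worst first."""
    shared = baseline.keys() & candidate.keys()
    deltas = [(item, candidate[item] - baseline[item]) for item in shared]
    worse = [(item, delta) for item, delta in deltas if delta < -tolerance]
    return sorted(worse, key=lambda pair: pair[1])


# Made-up scores, for illustration only.
baseline = {"item-1": 1.0, "item-2": 0.5, "item-3": 1.0}
candidate = {"item-1": 1.0, "item-2": 1.0, "item-3": 0.25}

print(aggregate(baseline), aggregate(candidate))
print(regressions(baseline, candidate))
```

In this made-up data the aggregate drops even though item-2 improved; the item-level diff isolates item-3 as the one to open for trace debugging and annotation.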
Configure evaluators (LLM-judge / code / annotation) to score production traces automatically; feed surprising cases back into the dataset so offline experiments catch them next time.
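
A deterministic code evaluator of the kind listed above can be very small. This sketch checks that an output is valid JSON containing a set of required keys and returns a BOOLEAN-style result; the function name, the returned dict shape, and the example keys are hypothetical, and wiring it into Langfuse should follow the live docs.

```python
import json
from typing import Any, Dict, Iterable


def json_keys_check(output: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a boolean result plus a comment explaining any failure."""
    required = list(required_keys)
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": "valid_json", "value": False,
                "comment": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"name": "valid_json", "value": False,
                "comment": "top-level value is not an object"}
    missing = [key for key in required if key not in parsed]
    if missing:
        return {"name": "valid_json", "value": False,
                "comment": "missing keys: " + ", ".join(missing)}
    return {"name": "valid_json", "value": True, "comment": "ok"}
```

Outputs that fail a check like this in production are good candidates to copy into the dataset, so the next offline experiment covers them.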
For these, start from the matching reference (each builds on steps 1–5, not a replacement):
- references/rag-evals.md (evaluate retrieval and generation as separate components).
- references/agent-evals.md (trajectory + per-step + final-answer; 3 phases).
- references/multi-turn-evals.md (session-level; real N+1 vs simulated).
- references/external-pipelines.md.

- references/methods-overview.md: the evaluation loop and which method when (offline/online,
  the five methods, how to choose, how they compose).
- references/scores.md: the universal score object (attachment levels, the four data types
  and when each, source, ScoreConfig schema enforcement, scores-vs-tags).
- references/datasets-experiments.md: datasets, dataset items, tasks, evaluators, experiment
  runs; data relationships; UI vs SDK; the local-dataset caveat; interpreting results (the
  top-down funnel: aggregate → item diff → trace debug → annotate).
- references/llm-as-a-judge.md: designing/setting up a judge and where to point it
  (observations/traces/experiments); calibration deferred to the vendored skill.
- references/code-evaluators.md: deterministic Python/TS checks; when vs a judge; where they run.
- references/human-annotation.md: annotation queues + human-in-the-loop custom-tool scoring;
  building ground truth.
- references/rag-evals.md: for RAG, evaluate retrieval vs generation independently; chunking;
  faithfulness/relevancy/context metrics; Ragas (reference-free).
- references/agent-evals.md: for agents, trajectory + per-step + final-answer; 3 failure modes;
  3 phases; black/glass/white-box strategies.
- references/multi-turn-evals.md: for conversational apps, session-level
  memory/coherence/resolution; real N+1 evaluation vs simulated conversations.
- references/external-pipelines.md: when/how to evaluate outside Langfuse (scheduled/webhook),
  fetch→score→ingest architecture.

| Need | Where |
|---|---|
| Eval strategy, method choice, scores model, datasets/experiments workflow | this skill |
| Judge calibration, CI/CD experiment gates, error analysis, user-feedback scoring | vendored langfuse skill |
| Exact SDK/API code for scores & experiments | live docs (.md-append) + SDK references |
| Monitoring scores on dashboards / alerting | langfuse-monitoring skill (Phase 3) |

`npx claudepluginhub jbaham2/claude-langfuse-plugin --plugin claude-langfuse-plugin`