Skill

langfuse-evaluation

Designs and runs LLM evaluation with Langfuse — the strategy and workflow layer for scoring quality, building datasets, and running experiments. Use whenever the user is evaluating LLM output quality with Langfuse: "evaluate my LLM app", "which eval method should I use", "set up LLM-as-a-judge", "create a dataset / run an experiment", "score my traces", "offline vs online evaluation", "test prompt changes before deploying", "build a regression test set", or interpreting experiment results. Owns eval STRATEGY and the datasets/experiments/scores workflow; defers judge calibration and CI/CD experiment code to the vendored `langfuse` skill, and exact SDK code to live docs.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/claude-langfuse-plugin:langfuse-evaluation

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

WebFetch(domain:langfuse.com)
Bash(curl *langfuse.com/*)

Supporting Files

references/agent-evals.md
references/code-evaluators.md
references/datasets-experiments.md
references/external-pipelines.md
references/human-annotation.md
references/llm-as-a-judge.md
references/methods-overview.md
references/multi-turn-evals.md
references/rag-evals.md
references/scores.md

SKILL.md

109 lines · ~1.7k tokens

Stats

Language: Python
Stars: 0
Maintenance: Good
Last Commit: Jun 19, 2026

Langfuse Evaluation

This skill carries the durable judgment of evaluating LLM applications with Langfuse: how the evaluation loop works, which method to use when, and how scores, datasets, and experiments fit together. It does not embed SDK code — Langfuse updates frequently, so fetch current code from live docs and hand calibration/CI specifics to the vendored langfuse skill.

Operating principles

Distill judgment, fetch facts. This skill owns the decisions and workflow. For exact code (creating scores, defining tasks/evaluators, running experiments), fetch the live doc by appending .md to the page URL (e.g. https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk.md) or use the SDK references (python.reference.langfuse.com, js.reference.langfuse.com). Never write eval code from memory.
Defer to the vendored langfuse skill for: judge calibration / reliability (skills/langfuse/references/judge-calibration.md), CI/CD experiment gates (skills/langfuse/references/ci-cd.md), systematic error analysis (skills/langfuse/references/error-analysis.md), and capturing user feedback as scores (skills/langfuse/references/user-feedback.md). Don't duplicate those here.
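
The `.md`-append convention described above can be sketched as a small URL helper. This is a minimal sketch in plain Python, not part of any Langfuse SDK: the function name `to_markdown_url`, the host check, and the choice to drop query strings and fragments are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_HOST = "langfuse.com"  # mirrors this skill's tool-access restriction


def to_markdown_url(page_url: str) -> str:
    """Return the Markdown variant of a langfuse.com docs page URL.

    Hypothetical helper: it only rewrites the URL, it does not fetch anything.
    """
    parts = urlsplit(page_url)
    if parts.hostname != ALLOWED_HOST:
        raise ValueError(f"not a {ALLOWED_HOST} URL: {page_url}")
    path = parts.path.rstrip("/")
    if not path:
        raise ValueError("URL has no page path to append .md to")
    if not path.endswith(".md"):
        path += ".md"
    # Assumption: query and fragment address the rendered page, so drop them.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

The returned URL is what would be passed to WebFetch or `curl`; the SDK reference sites are separate hosts and are deliberately rejected by this helper.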

Workflow

1. Frame the evaluation

First decide offline or online, and what the score attaches to:

Offline = gate a change before deploy → datasets + experiments (references/datasets-experiments.md).
Online = score live production traces → automated evaluators on traces.
Most "how do I evaluate X" questions resolve to these two choices: offline vs online, and the level the score attaches to. See references/methods-overview.md.

2. Choose the method

Start from references/methods-overview.md to pick among the five methods by what's being judged (deterministic vs subjective) and scale; methods compose. Then go deep in the method reference:

references/llm-as-a-judge.md — subjective judgment at scale (setup, where to point it; calibration deferred to the vendored skill).
references/code-evaluators.md — deterministic checks (JSON/schema/match/business rules).
references/human-annotation.md — annotation queues + human-in-the-loop scoring from your own tool (ground truth).
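
To make the "deterministic check" category concrete, here is a minimal sketch of a JSON/schema check in plain Python. It is library-agnostic on purpose: the function name, the `REQUIRED_KEYS` schema, and the shape of the returned dict are hypothetical, and the evaluator signature Langfuse actually expects should be taken from the live docs.

```python
import json
from typing import Any

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical schema, for illustration only
CHECK_NAME = "valid_json_schema"       # hypothetical score name


def evaluate_json_output(output: str) -> dict[str, Any]:
    """Deterministic check: is the output a JSON object with the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": CHECK_NAME, "value": False, "comment": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"name": CHECK_NAME, "value": False, "comment": "top-level value is not an object"}
    missing = sorted(REQUIRED_KEYS - set(parsed))
    if missing:
        return {"name": CHECK_NAME, "value": False, "comment": f"missing keys: {', '.join(missing)}"}
    return {"name": CHECK_NAME, "value": True, "comment": "ok"}
```

The same input always produces the same result, which is what separates this kind of check from a judge; anything that needs interpretation belongs in one of the other methods.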

3. Model the scores

Every method emits a score. Get the level (trace/observation/session/dataset-run) and data type (NUMERIC/CATEGORICAL/BOOLEAN/TEXT) right, and use a ScoreConfig for any score that more than one person or pipeline produces, so every producer writes it against the same schema. See references/scores.md, and note that TEXT scores can't be aggregated.
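
The level and data-type choices can be modelled locally before any SDK call is written. The sketch below is an assumption-laden illustration in plain Python: the `Score`, `Level`, and `DataType` classes are hypothetical and do not reproduce Langfuse's own score object or how it encodes each value.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Level(Enum):
    TRACE = "trace"
    OBSERVATION = "observation"
    SESSION = "session"
    DATASET_RUN = "dataset-run"


class DataType(Enum):
    NUMERIC = "NUMERIC"
    CATEGORICAL = "CATEGORICAL"
    BOOLEAN = "BOOLEAN"
    TEXT = "TEXT"


@dataclass(frozen=True)
class Score:
    """Hypothetical local model of a score, used only to validate choices."""
    name: str
    level: Level
    data_type: DataType
    value: Union[float, str, bool]

    def __post_init__(self) -> None:
        # bool is a subclass of int, so rule it out before the numeric case.
        is_bool = isinstance(self.value, bool)
        is_number = isinstance(self.value, (int, float)) and not is_bool
        is_text = isinstance(self.value, str)
        valid = {
            DataType.NUMERIC: is_number,
            DataType.BOOLEAN: is_bool,
            DataType.CATEGORICAL: is_text,
            DataType.TEXT: is_text,
        }[self.data_type]
        if not valid:
            raise TypeError(f"{self.value!r} does not fit data type {self.data_type.value}")

    @property
    def aggregatable(self) -> bool:
        # TEXT scores carry free-form content and cannot be aggregated.
        return self.data_type is not DataType.TEXT
```

Rejecting a mismatched value at construction time is the local analogue of what a shared ScoreConfig is for: every producer of a score agrees on its type before anything is written.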

4. Build datasets & run experiments (offline)

Follow references/datasets-experiments.md: assemble dataset items (seeding them from production traces), define the task and evaluator functions, and run the experiment via the UI (quick) or the SDK (full control). Then interpret results top-down: aggregate metrics → item-level diff vs baseline → trace debugging → annotate regressions. Interpretation is where most of the judgment is needed, so spend the time there. Manage datasets in Langfuse so the comparison views are available (see the local-dataset caveat in the reference). Fetch live docs for the actual SDK code.
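
The first two stages of the top-down funnel (aggregate metrics, then item-level diff vs baseline) can be sketched in plain Python. This is a sketch under stated assumptions: each run is represented as a hypothetical mapping from dataset item ID to a numeric score, and the function names and `tolerance` parameter are illustrative, not Langfuse APIs.

```python
from statistics import mean


def aggregate(scores: dict[str, float]) -> float:
    """Stage 1: one number per run, to see whether anything moved at all."""
    return mean(scores.values())


def item_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[tuple[str, float]]:
    """Stage 2: items that scored worse than baseline, worst first.

    Only items present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    deltas = [(item_id, candidate[item_id] - baseline[item_id]) for item_id in shared]
    regressions = [(item_id, delta) for item_id, delta in deltas if delta < -tolerance]
    return sorted(regressions, key=lambda pair: pair[1])
```

The item IDs this returns are the starting points for the remaining two stages, trace debugging and annotation, which happen in the Langfuse UI rather than in code.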

5. Set up online evaluation

Configure evaluators (LLM-judge / code / annotation) to score production traces automatically; feed surprising cases back into the dataset so offline experiments catch them next time.
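
The feedback loop from online scores back into the offline dataset can be sketched as a selection step. This is a minimal plain-Python illustration, not Langfuse SDK code: the `ScoredTrace` type, the dict shape of the candidates, and the definition of "surprising" as "scored below a threshold" are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredTrace:
    """Hypothetical record of a production trace and its online score."""
    trace_id: str
    input: str
    output: str
    score: float  # assumed to be normalized to the range 0..1


def select_for_dataset(
    traces: Iterable[ScoredTrace],
    threshold: float = 0.5,
    known_inputs: Iterable[str] = (),
) -> list[dict[str, str]]:
    """Pick low-scoring traces whose input is not already covered by the dataset."""
    seen = set(known_inputs)
    picked = []
    for trace in traces:
        if trace.score < threshold and trace.input not in seen:
            seen.add(trace.input)  # avoid adding the same input twice
            picked.append({
                "input": trace.input,
                "observed_output": trace.output,
                "source_trace_id": trace.trace_id,
            })
    return picked
```

Each candidate still needs a human or a trusted process to decide what the expected output should be before it becomes a dataset item.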

6. Specialized application types

For these, start from the matching reference. Each builds on steps 1–5 rather than replacing them:

RAG → references/rag-evals.md (evaluate retrieval and generation as separate components).
Agents → references/agent-evals.md (trajectory + per-step + final-answer; 3 phases).
Multi-turn / conversational → references/multi-turn-evals.md (session-level; real N+1 vs simulated).
Custom / scheduled / external-framework evaluation → references/external-pipelines.md.

Bundled resources (more references being added per the roadmap)

references/methods-overview.md — the evaluation loop and which method when (offline/online, the five methods, how to choose, how they compose).
references/scores.md — the universal score object: attachment levels, the four data types and when each, source, ScoreConfig schema enforcement, scores-vs-tags.
references/datasets-experiments.md — datasets, dataset items, tasks, evaluators, experiment runs; data relationships; UI vs SDK; the local-dataset caveat; interpreting results (the top-down funnel: aggregate → item diff → trace debug → annotate).
references/llm-as-a-judge.md — designing/setting up a judge and where to point it (observations/traces/experiments); calibration deferred to the vendored skill.
references/code-evaluators.md — deterministic Python/TS checks; when vs a judge; where they run.
references/human-annotation.md — annotation queues + human-in-the-loop custom-tool scoring; building ground truth.
references/rag-evals.md — RAG: evaluate retrieval vs generation independently; chunking; faithfulness/relevancy/context metrics; Ragas (reference-free).
references/agent-evals.md — agents: trajectory + per-step + final-answer; 3 failure modes; 3 phases; black/glass/white-box strategies.
references/multi-turn-evals.md — conversational apps: session-level memory/coherence/resolution; real N+1 evaluation vs simulated conversations.
references/external-pipelines.md — when/how to evaluate outside Langfuse (scheduled/webhook), fetch→score→ingest architecture.
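
The fetch→score→ingest shape named for external pipelines can be sketched as a skeleton with the three stages injected as callables. This is a sketch in plain Python under stated assumptions: `run_pipeline`, the dict-based trace and score records, and `batch_size` are hypothetical, and the real fetch and ingest calls should come from the live docs.

```python
from typing import Callable, Iterable


def run_pipeline(
    fetch: Callable[[], Iterable[dict]],
    score: Callable[[dict], dict],
    ingest: Callable[[list[dict]], None],
    batch_size: int = 50,
) -> int:
    """Fetch traces, score each one, and ingest the scores in batches.

    Returns the number of scores ingested.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[dict] = []
    total = 0
    for trace in fetch():
        batch.append(score(trace))
        if len(batch) >= batch_size:
            ingest(batch)
            total += len(batch)
            batch = []  # start a fresh list so the ingested batch is not mutated
    if batch:  # flush the final partial batch
        ingest(batch)
        total += len(batch)
    return total
```

Keeping the stages as plain callables means the same skeleton runs from a scheduler or a webhook handler, and each stage can be replaced with a fake when testing.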

Hand-off map

| Need | Where |
| --- | --- |
| Eval strategy, method choice, scores model, datasets/experiments workflow | this skill |
| Judge calibration, CI/CD experiment gates, error analysis, user-feedback scoring | vendored `langfuse` skill |
| Exact SDK/API code for scores & experiments | live docs (`.md`-append) + SDK references |
| Monitoring scores on dashboards / alerting | `langfuse-monitoring` skill (Phase 3) |
