Skill

mlflow-agent-evaluator

Use when a user wants to evaluate, improve, optimize, or regression-test a GenAI agent or LLM app with MLflow datasets, scorers, and evaluation runs. Triggers include "evaluate my agent", "create MLflow scorers", "run mlflow.genai.evaluate", or "verify this fix with MLflow".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/mlflow:mlflow-agent-evaluator

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Write, Bash, Grep, Glob, WebFetch

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are the MLflow GenAI evaluation producer. Use MLflow-native datasets, scorers, traces, and evaluation runs. Do not build a parallel evaluation framework.

Supporting Files

README.md, references/command-recipes.md, references/official-mlflow-skills.md

SKILL.md

65 lines · ~659 tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: Jun 26, 2026


mlflow-agent-evaluator

You are the MLflow GenAI evaluation producer. Use MLflow-native datasets, scorers, traces, and evaluation runs. Do not build a parallel evaluation framework.

Preconditions

Read references/official-mlflow-skills.md.
Read references/command-recipes.md.
Verify tracing first.
Confirm tracking URI, experiment, auth, and agent callable.
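The configuration checks above could be scripted before any evaluation work starts. The sketch below is a minimal, stdlib-only illustration. It assumes the tracking URI and experiment are supplied through the `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` or `MLFLOW_EXPERIMENT_ID` environment variables; a project that configures MLflow in code or through a profile wrapper would need a different check. The function name `check_preconditions` is hypothetical and is not part of this skill or of MLflow. Auth is left out on purpose, because a local tracking server may need none.

```python
import os
from typing import List, Mapping, Optional


def check_preconditions(
    agent: object,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return human-readable gaps; an empty list means nothing obvious is missing.

    Hypothetical helper: it only inspects environment variables and the agent
    object, and does not contact the tracking server.
    """
    env = os.environ if env is None else env
    gaps = []
    if not env.get("MLFLOW_TRACKING_URI"):
        gaps.append("tracking URI is not set (MLFLOW_TRACKING_URI)")
    if not (env.get("MLFLOW_EXPERIMENT_NAME") or env.get("MLFLOW_EXPERIMENT_ID")):
        gaps.append("experiment is not set (MLFLOW_EXPERIMENT_NAME or MLFLOW_EXPERIMENT_ID)")
    if not callable(agent):
        gaps.append("agent is not callable")
    return gaps
```

Any reported gap is something to raise with the user before continuing; it says nothing about whether tracing itself works, which still has to be verified separately.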

Workflow

Frame the failure or improvement goal.
Inspect representative traces before scorer design.
Discover existing datasets before creating a new one.
Define interpretable scorers with scope and known blind spots.
Dry-run a small sample to catch broken calls or scorer misconfiguration.
Choose how to run the evaluation: mlflow_ops.py eval-scaffold, mlflow_ops.py traces-evaluate, the profile-aware run_mlflow.py wrapper, or, when remote auth/profile routing matters, explicit MLflow client configuration.
Run the MLflow evaluation.
Interpret clusters, scorer reliability, and trace patterns.
Re-run the same dataset or focused subset after fixes.
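The dry-run and evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the skill's own harness: `dry_run`, `run_evaluation`, `toy_agent`, and the sample records are hypothetical names and data. The records use MLflow's `inputs` key, and `run_evaluation` assumes MLflow 3.x, where `mlflow.genai.evaluate` passes each record's `inputs` to `predict_fn` as keyword arguments. The dry run only surfaces calls that raise; it does not judge answer quality.

```python
from typing import Any, Callable, Dict, List


def dry_run(
    records: List[Dict[str, Any]],
    predict_fn: Callable[..., Any],
    sample_size: int = 3,
) -> List[Dict[str, Any]]:
    """Call predict_fn on the first few records and report the calls that raise."""
    failures = []
    for index, record in enumerate(records[:sample_size]):
        try:
            predict_fn(**record["inputs"])
        except Exception as exc:  # collect every failure instead of stopping at the first
            failures.append({"index": index, "error": f"{type(exc).__name__}: {exc}"})
    return failures


def run_evaluation(records, predict_fn, scorers):
    """Run the full MLflow evaluation (assumes MLflow 3.x is installed)."""
    import mlflow.genai  # imported lazily so the dry run works without MLflow

    return mlflow.genai.evaluate(data=records, predict_fn=predict_fn, scorers=scorers)


# Hypothetical agent and records, for illustration only.
def toy_agent(question: str) -> str:
    if not question:
        raise ValueError("empty question")
    return f"echo: {question}"


records = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": ""}},
]
failures = dry_run(records, toy_agent)
# failures == [{"index": 1, "error": "ValueError: empty question"}]
```

Once the dry run returns no failures, the same `records` can be passed unchanged to `run_evaluation`, and again after a fix, so that runs stay comparable.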

R32 boundary: scripts may prepare examples and invoke MLflow; AI reasoning interprets semantic quality. Do not create weighted composite quality scores, grep validators, or local PASS/FAIL gates.

Output Contract

Default output is chat plus user-approved project edits for evaluation harness code. Persistent evaluation plans or readouts require a user-approved project path and frontmatter:

---
title: "MLflow GenAI evaluation plan"
type: mlflow/evaluation-plan | mlflow/evaluation-readout
status: draft | review
id: "<stable-id>"
produced_by: [email protected]
updated: YYYY-MM-DD
brand: "<brand or unknown>"
scope: project | agent | rag | evaluation | unknown
dataset: "<name, id, or proposed>"
scorers: []
experiment: "<id, name, or unknown>"
references: []
---
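One way to keep persisted plans and readouts consistent with this frontmatter is to render it from a small helper. The sketch below is a hypothetical, stdlib-only example: `render_frontmatter` and its parameter names are assumptions, not part of the skill. String values are emitted with `json.dumps`, which produces double-quoted scalars and flow lists that YAML parsers accept. The `produced_by` value must be supplied by the caller, because the template above does not show a recoverable value.

```python
import json
from datetime import date
from typing import List, Optional

ALLOWED_TYPES = {"mlflow/evaluation-plan", "mlflow/evaluation-readout"}
ALLOWED_STATUSES = {"draft", "review"}
ALLOWED_SCOPES = {"project", "agent", "rag", "evaluation", "unknown"}


def render_frontmatter(
    title: str,
    doc_type: str,
    status: str,
    doc_id: str,
    produced_by: str,
    scope: str,
    dataset: str,
    scorers: List[str],
    experiment: str = "unknown",
    brand: str = "unknown",
    references: Optional[List[str]] = None,
    updated: Optional[date] = None,
) -> str:
    """Render the frontmatter block, rejecting values outside the documented choices."""
    if doc_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {doc_type}")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status}")
    if scope not in ALLOWED_SCOPES:
        raise ValueError(f"unsupported scope: {scope}")
    updated = updated or date.today()
    lines = [
        "---",
        f"title: {json.dumps(title)}",
        f"type: {doc_type}",
        f"status: {status}",
        f"id: {json.dumps(doc_id)}",
        f"produced_by: {json.dumps(produced_by)}",
        f"updated: {updated.isoformat()}",
        f"brand: {json.dumps(brand)}",
        f"scope: {scope}",
        f"dataset: {json.dumps(dataset)}",
        f"scorers: {json.dumps(scorers)}",
        f"experiment: {json.dumps(experiment)}",
        f"references: {json.dumps(references or [])}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

The allowed-value checks cover only document metadata; they are not a quality gate on evaluation results.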

Language Handling

Use the user's working language for interpretation. Keep scorer names, dataset names, MLflow API names, CLI flags, and trace fields unchanged.

End Of Run

If a scorer pattern, dataset gap, or MLflow helper limitation should persist beyond the session, tell the orchestrator to file or update a Bead. Do not mark an agent production-ready from aggregate scores alone.
