Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
Bundled files:
- assets/evaluation_report_template.md
- references/dataset-preparation.md
- references/scorers-constraints.md
- references/scorers.md
- references/setup-guide.md
- references/throughput-guide.md
- references/troubleshooting.md
- scripts/analyze_results.py
- scripts/create_dataset_template.py
- scripts/list_datasets.py
- scripts/run_evaluation_template.py
- scripts/setup_mlflow.py
- scripts/utils/__init__.py
- scripts/utils/env_validation.py
- scripts/validate_auth.py
- scripts/validate_environment.py
- scripts/validate_tracing_runtime.py
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:
- `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- `mlflow.genai.evaluate()` - NOT custom evaluation loops
- `scripts/` directory templates - NOT custom `evaluation/` directories

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
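For orientation, here is a minimal sketch of how those native pieces fit together. The dataset name, judge instructions, and stub agent are illustrative only; each piece is covered in detail in the steps below.

import mlflow
from mlflow.genai.datasets import create_dataset
from mlflow.genai.judges import make_judge

def my_agent_wrapper(query: str) -> str:
    # stub standing in for your real agent entry point
    return f"echo: {query}"

# Dataset lives in the MLflow experiment, not in a custom test-case file
dataset = create_dataset(name="demo-evals")
dataset.merge_records([
    {"inputs": {"query": "What does this agent do?"}, "expected_response": "It answers product questions."},
])

# Scorer is an MLflow judge, not a hand-rolled function
correctness = make_judge(
    name="correctness",
    instructions=(
        "Compare {{ outputs }} against {{ expectations }} for the request {{ inputs }}. "
        "Answer 'yes' if the response is correct, otherwise 'no'."
    ),
)

# Evaluation runs through mlflow.genai.evaluate(), not a custom loop
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_agent_wrapper,
    scorers=[correctness],
)
print(results.metrics)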
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 5 steps (each uses MLflow APIs):
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
All MLflow documentation must be accessed through llms.txt:
https://mlflow.org/docs/latest/llms.txt

This applies to all steps, especially:
Each project has unique structure. Use dynamic exploration instead of assumptions:
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -r "def (run|stream|handle|process)" . --include="*.py"
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -r "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Before doing ANY setup, check if MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID are already set:
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
If BOTH are already set, skip Steps 1-2 entirely. The environment is pre-configured. Do NOT run setup_mlflow.py, do NOT create a .env file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
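The same pre-flight check as a small Python sketch:

import os

# If both variables are present, the environment is pre-configured:
# skip Steps 1-2 and go straight to Step 3 (tracing integration).
if all(os.getenv(v) for v in ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")):
    print("MLflow environment pre-configured - skipping setup.")
else:
    print("MLflow environment not configured - run Steps 1-2.")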
- `references/setup-guide.md` Steps 1-2
- `instrumenting-with-mlflow-tracing` skill for tracing setup
- `scripts/validate_tracing_runtime.py` after implementing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
Before doing anything else, ask the user these questions. Do NOT proceed until you have answers.
Required:
Use answers to:
- `agent_description` parameter for `generate_evals_df`

If running in automated mode: Read agent purpose from the codebase (SKILL.md, README, or main entry point docstring). Still surface what you found and confirm before proceeding.
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
IMPORTANT: If the experiment already has registered scorers, they must be used for evaluation.
See references/scorers.md for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.
If needed, create additional scorers using the make_judge() API. See references/scorers.md on how to create custom scorers and references/scorers-constraints.md for best practices.
⚠️ CRITICAL — Scorer Return Values: Scorers MUST instruct the LLM judge to return `"yes"` or `"no"` (or booleans/numerics). Return values of `"pass"` or `"fail"` are silently cast to `None` by `_cast_assessment_value_to_float` and excluded from `results.metrics` with no error or warning — results simply disappear. See `references/scorers-constraints.md` Constraint 2 for the full list of safe vs. broken return values.
REQUIRED: Register new scorers before evaluation using Python API:
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import BuiltinScorerName  # placeholder: substitute a real built-in scorer class
scorer = make_judge(...)  # Or: scorer = BuiltinScorerName()
scorer.register()
IMPORTANT: See references/scorers.md → "Model Selection for Scorers" to configure the model parameter of scorers before registration.
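For example, a judge that follows both constraints above. The judge name, instructions, and model URI are illustrative; pick the model per references/scorers.md.

from mlflow.genai.judges import make_judge

groundedness = make_judge(
    name="groundedness",
    instructions=(
        "Given the request {{ inputs }} and the agent response {{ outputs }}, "
        "decide whether every claim in the response is supported by the retrieved context. "
        "Answer strictly 'yes' or 'no' (never 'pass' or 'fail')."
    ),
    model="openai:/gpt-4o",  # illustrative; see references/scorers.md -> "Model Selection for Scorers"
)
groundedness.register()  # registration is required before evaluation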
⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in mlflow scorers list and won't be reusable.
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID # Should show your scorers
ALWAYS discover existing datasets first to prevent duplicate work:
Run dataset discovery (mandatory):
uv run python scripts/list_datasets.py # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json # Machine-readable output
uv run python scripts/list_datasets.py --help # All options
Present findings to user:
Ask user about existing datasets:
If creating a new dataset, use the two-phase approach below.
Create a minimal 5-question dataset manually from the Step 1 interview answers. The goal is to confirm the pipeline works end-to-end before investing in large-scale generation.
import mlflow
from mlflow.genai.datasets import create_dataset
# Derive 5 representative questions directly from the agent's stated purpose
# and known failure modes identified in Step 1
sanity_records = [
{"inputs": {"query": "<question 1 from interview>"}, "expected_response": "<expected answer>"},
{"inputs": {"query": "<question 2 from interview>"}, "expected_response": "<expected answer>"},
# ... 5 total
]
sanity_dataset = create_dataset(name="sanity-check-5q")
sanity_dataset.merge_records(sanity_records)  # records are added via merge_records, not at creation
Run evaluation on this dataset (see Step 4), then present results to the user with this framing:
"This is a sanity check — 5 questions confirm the pipeline works but aren't statistically meaningful. Proceeding to Phase B to generate a proper evaluation set."
Only proceed to Phase B once Phase A completes without errors.
Generate questions from the agent's actual corpus rather than inventing them from scratch. The approach depends on whether the project uses Databricks or OSS MLflow.
On Databricks — use generate_evals_df to synthesize questions from the agent's document corpus:
from databricks.agents.evals import generate_evals_df, estimate_synthetic_num_evals
import mlflow
# agent_description comes from Step 1 interview answers
agent_description = "<agent purpose from interview>"
# docs_df: a Spark or pandas DataFrame with a "content" column containing
# the documents/chunks the agent retrieves from (e.g., your Vector Search index)
evals = generate_evals_df(
docs=docs_df,
num_evals=100,
agent_description=agent_description,
)
# Merge into MLflow dataset — don't create a separate dataset
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(evals)
To estimate the right num_evals before generating:
recommended = estimate_synthetic_num_evals(docs_df)
print(f"Recommended num_evals: {recommended}")
Dataset size guidance:
On OSS MLflow — use RAGAS TestsetGenerator to generate from your document corpus:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
generator = TestsetGenerator(
llm=LangchainLLMWrapper(your_llm),
embedding_model=LangchainEmbeddingsWrapper(your_embeddings),
)
testset = generator.generate_with_langchain_docs(docs, testset_size=100)
evals_df = testset.to_pandas()
# Convert to MLflow dataset schema and merge
import mlflow
records = [
{"inputs": {"query": row["user_input"]}, "expected_response": row["reference"]}
for _, row in evals_df.iterrows()
]
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(records)
If no document corpus is available — ask the user to provide 50–100 representative queries from production logs or usage history. These are more realistic than synthetic questions and are preferable when available.
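If the user provides such queries, load them into an MLflow dataset with the same APIs used above. The CSV file name and column names here are assumptions about the hand-off format:

import csv
import mlflow

records = []
with open("production_queries.csv", newline="") as f:  # assumed hand-off file
    for row in csv.DictReader(f):
        record = {"inputs": {"query": row["query"]}}
        if row.get("expected_response"):  # known-good answers are optional
            record["expected_response"] = row["expected_response"]
        records.append(record)

dataset = mlflow.genai.datasets.create_dataset(name="production-queries")
dataset.merge_records(records)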
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
Checkpoint - verify before proceeding:
Run evaluation on 3 questions from the dataset before committing to the full run. This catches broken tools, misconfigured scorers, and auth failures early — before they silently corrupt 100-question results.
If you completed Phase A above, the pipeline is already validated — focus the dry run on scorer output only.
import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
dry_run_records = dataset.df.head(3)
Run mlflow.genai.evaluate() on these 3 records using the same wrapper and scorers as the full eval.
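A minimal sketch of that call, assuming `wrapper` and `registered_scorers` are already defined as in the full evaluation script:

dry_run_results = mlflow.genai.evaluate(
    data=dry_run_records,        # the 3 rows selected above
    predict_fn=wrapper,          # same wrapper as the full run
    scorers=registered_scorers,  # scorers registered in Step 2
)
print(dry_run_results.metrics)   # all-0/None values point to a misconfigured scorer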
For each response, check:
- Are all scorer values `0` or `None`? If so, the scorer is misconfigured (check return values — `"pass"`/`"fail"` are silently cast to `None`; use `"yes"`/`"no"` instead).

Decision gate:
Why this matters: Tool failures (403s from docs scraping, GitHub API rate limits) produce empty agent responses that score as 0. Running a 100-question eval only to discover all tools were failing wastes time and produces misleading results. The dry run catches this in under a minute.
Large datasets (50+ questions)? See `references/throughput-guide.md` for throughput optimization — covers parallelism env vars, async predict functions, and dataset splitting for 200+ question evals.
Before launching evaluation, tell the user how long it will take:
Count the dataset questions:
import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
print(f"Dataset size: {len(dataset.df)} questions")
Calculate the estimate — each question runs the agent once and the judge scorer once:
- Larger judge/agent models (e.g., `claude-opus-4`): ~45–90s per question
- Smaller, faster models (e.g., `claude-sonnet-4`): ~20–45s per question

Estimated time = N questions × 30–60s per question ÷ parallelism factor (typically 4–8x)
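A worked example of the arithmetic (numbers are illustrative):

n_questions = 100
low_s, high_s = 30, 60   # seconds per question (agent call + judge scoring)
parallelism = 6          # typical effective parallelism (4-8x)

low_min = n_questions * low_s / parallelism / 60
high_min = n_questions * high_s / parallelism / 60
print(f"Estimated wall-clock time: {low_min:.0f}-{high_min:.0f} minutes")  # ~8-17 minutes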
Tell the user before starting:
"This dataset has N questions. At ~30–60s per question with typical parallelism, evaluation will take approximately X–Y minutes. I'll run it as a background task so you can continue working — I'll summarize the results when it's done."
# Generate evaluation script (specify module and entry point)
uv run python scripts/run_evaluation_template.py \
--module mlflow_agent.agent \
--entry-point run_agent
The generated script creates a wrapper function that:
- calls your agent's entry point (passing through `llm_provider`)
- runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
- saves results to `evaluation_results.csv`

⚠️ CRITICAL: wrapper Signature Must Match Dataset Input Keys
MLflow calls predict_fn(**inputs) - it unpacks the inputs dict as keyword arguments.
| Dataset Record | MLflow Calls | predict_fn Must Be |
|---|---|---|
| `{"inputs": {"query": "..."}}` | `predict_fn(query="...")` | `def wrapper(query):` |
| `{"inputs": {"question": "...", "context": "..."}}` | `predict_fn(question="...", context="...")` | `def wrapper(question, context):` |
Common Mistake (WRONG):
def wrapper(inputs): # ❌ WRONG - MLflow calls wrapper(query=...), so this raises a TypeError
return agent(inputs["query"])
Run the evaluation as a background sub-agent so the main session stays available. Use the Agent tool with run_in_background: true:
Sub-agent instructions (pass these verbatim):
Run the agent evaluation and write results to scratchpad.
Steps:
1. cd <project-directory>
2. Run: uv run python run_agent_evaluation.py
3. When complete, write a summary to scratchpad/eval-results.md with:
- Exit status (success or error message)
- Path to results file (evaluation_results.csv)
- Wall-clock time taken
4. Return only: "Evaluation complete. Results written to scratchpad/eval-results.md"
In the main session, poll for completion by checking for the scratchpad file rather than blocking:
# Poll every 30s using Glob
# Glob("scratchpad/eval-results.md")
# When the file appears, read it and proceed to analysis
Do NOT use TaskOutput to wait for the background agent — that dumps the full transcript (~10–20k tokens) into the main context.
Once scratchpad/eval-results.md appears, run analysis:
# Pattern detection, failure analysis, recommendations
# Reads the CSV produced by mlflow.genai.evaluate() above
uv run python scripts/analyze_results.py evaluation_results.csv
Generates evaluation_report.md with per-scorer pass rates and improvement suggestions.
The script reads {scorer_name}/value and {scorer_name}/rationale columns from the CSV.
It also accepts the legacy JSON format from mlflow traces evaluate for backward compatibility:
uv run python scripts/analyze_results.py evaluation_results.json # legacy format
uv run python scripts/analyze_results.py evaluation_results.csv --output my_report.md # custom output
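For a quick manual sanity check of the CSV outside the script, the per-scorer columns can be read directly. A minimal sketch; how values are encoded depends on your scorers:

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
value_cols = [c for c in df.columns if c.endswith("/value")]
for col in value_cols:
    values = df[col].replace({"yes": 1.0, "no": 0.0})  # normalize string judgments if present
    values = pd.to_numeric(values, errors="coerce")
    print(f"{col}: mean score {values.mean():.2f}, missing {values.isna().sum()}")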
Detailed guides in references/ (load as needed):
- `instrumenting-with-mlflow-tracing` skill (authoritative guide for autolog, decorators, session tracking, verification)

Scripts are self-documenting - run with `--help` for usage details.