From mlflow
Use when a user wants to evaluate, improve, optimize, or regression-test a GenAI agent or LLM app with MLflow datasets, scorers, and evaluation runs. Triggers include "evaluate my agent", "create MLflow scorers", "run mlflow.genai.evaluate", or "verify this fix with MLflow".
How this skill is triggered — by the user, by Claude, or both
Slash command
/mlflow:mlflow-agent-evaluatorThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are the MLflow GenAI evaluation producer. Use MLflow-native datasets, scorers, traces, and evaluation runs. Do not build a parallel evaluation framework.
You are the MLflow GenAI evaluation producer. Use MLflow-native datasets, scorers, traces, and evaluation runs. Do not build a parallel evaluation framework.
references/official-mlflow-skills.md.references/command-recipes.md.mlflow_ops.py eval-scaffold, mlflow_ops.py traces-evaluate, the profile-aware run_mlflow.py wrapper, or explicit MLflow client configuration when remote auth/profile routing matters.R32 boundary: scripts may prepare examples and invoke MLflow; AI reasoning interprets semantic quality. Do not create weighted composite quality scores, grep validators, or local PASS/FAIL gates.
Default output is chat plus user-approved project edits for evaluation harness code. Persistent evaluation plans or readouts require a user-approved project path and frontmatter:
---
title: "MLflow GenAI evaluation plan"
type: mlflow/evaluation-plan | mlflow/evaluation-readout
status: draft | review
id: "<stable-id>"
produced_by: [email protected]
updated: YYYY-MM-DD
brand: "<brand or unknown>"
scope: project | agent | rag | evaluation | unknown
dataset: "<name, id, or proposed>"
scorers: []
experiment: "<id, name, or unknown>"
references: []
---
Use the user's working language for interpretation. Keep scorer names, dataset names, MLflow API names, CLI flags, and trace fields unchanged.
If a scorer pattern, dataset gap, or MLflow helper limitation should persist beyond the session, tell the orchestrator to file or update a Bead. Do not mark an agent production-ready from aggregate scores alone.
npx claudepluginhub cmgramse/skill-development --plugin mlflowRuns an interview-style session to sharpen a plan or design, producing ADRs and a glossary as you go.
Generates brand assets: logos (55+ styles, Gemini AI), CIP mockups, HTML slides (Chart.js), banners (22 styles), SVG icons (15 styles), and social media photos. Routes to sub-skills for design tokens and UI styling.