Evaluates GenAI agents with MLflow 3: `mlflow.genai.evaluate()` code, `@scorer` functions, built-in scorers, trace-based datasets, production monitoring, MemAlign judge alignment, and GEPA prompt optimization.
Install via:

```shell
npx claudepluginhub leary-poken/ai-dev-kit --plugin databricks-ai-dev-kit
```

This skill uses the workspace's default tool permissions.
Before writing any code, **read GOTCHAS.md**: it covers 15+ common mistakes that cause failures.
Bundled reference files:

- references/CRITICAL-interfaces.md
- references/GOTCHAS.md
- references/patterns-context-optimization.md
- references/patterns-datasets.md
- references/patterns-evaluation.md
- references/patterns-judge-alignment.md
- references/patterns-prompt-optimization.md
- references/patterns-scorers.md
- references/patterns-trace-analysis.md
- references/patterns-trace-ingestion.md
- references/user-journeys.md
## Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.
### Workflow 1: New Agent Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |
### Workflow 2: Build Datasets from Production Traces

For building evaluation datasets from production traces.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
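The trace-to-dataset step is mostly reshaping. A self-contained sketch, assuming traces have already been pulled (e.g. via `mlflow.search_traces`) into plain dicts; the field names here are illustrative:

```python
# Traces already fetched and reduced to request/response pairs (illustrative shapes).
raw_traces = [
    {"request": {"query": "reset my password"}, "response": "Open Settings > Security."},
]

eval_records = [
    {
        "inputs": t["request"],  # stays a nested dict, as evaluate() expects
        # Ground truth is added per record, typically after SME review.
        "expectations": {"expected_response": t["response"]},
    }
    for t in raw_traces
]
```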
### Workflow 3: Debug Latency and Cost

For debugging slow or expensive agent execution.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Pattern 6-7) |
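Step 1's span-level profiling reduces to summing durations per span name. A self-contained sketch using toy span records shaped like MLflow trace spans (nanosecond timestamps):

```python
from collections import defaultdict

# Toy spans: name plus start/end in nanoseconds, as stored on an MLflow trace.
spans = [
    {"name": "retriever", "start_time_ns": 0,             "end_time_ns": 400_000_000},
    {"name": "llm",       "start_time_ns": 400_000_000,   "end_time_ns": 2_900_000_000},
    {"name": "llm",       "start_time_ns": 2_900_000_000, "end_time_ns": 5_000_000_000},
]

totals_ms = defaultdict(float)
for s in spans:
    totals_ms[s["name"]] += (s["end_time_ns"] - s["start_time_ns"]) / 1e6

# Highest-latency span type first: that is where optimization effort should go.
ranked = sorted(totals_ms.items(), key=lambda kv: -kv[1])
print(ranked)  # -> [('llm', 4600.0), ('retriever', 400.0)]
```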
### Workflow 4: Regression Testing

For comparing agent versions and finding regressions.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
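Once both runs have aggregate metrics, regression detection is a dict comparison. A sketch with made-up metric names and an assumed 2% relative tolerance (both are illustrative choices, not skill conventions):

```python
# Aggregate metrics from two evaluation runs (illustrative names and values).
baseline  = {"correctness/mean": 0.82, "safety/mean": 0.99}
candidate = {"correctness/mean": 0.78, "safety/mean": 0.99}

# Flag any metric where the candidate is worse than baseline by >2% relative.
regressions = {
    name: (baseline[name], candidate[name])
    for name in baseline
    if candidate.get(name, 0.0) < baseline[name] * 0.98
}
print(regressions)  # -> {'correctness/mean': (0.82, 0.78)}
```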
### Workflow 5: Custom Scorers

For creating project-specific evaluation metrics.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
### Workflow 6: Trace Ingestion and Production Monitoring

For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |
### Workflow 7: Judge Alignment

For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Design base judge with make_judge (any feedback type) | patterns-judge-alignment.md (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | patterns-judge-alignment.md (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | patterns-judge-alignment.md (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | patterns-judge-alignment.md (Pattern 4) |
| 5 | Register aligned judge to experiment | patterns-judge-alignment.md (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | patterns-judge-alignment.md (Pattern 6) |
### Workflow 8: Prompt Optimization

For automatically improving a registered system prompt using optimize_prompts(). Works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md Journey 10.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Build optimization dataset (inputs + expectations) | patterns-prompt-optimization.md (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | patterns-prompt-optimization.md (Pattern 2) |
| 3 | Register new version, promote conditionally | patterns-prompt-optimization.md (Pattern 3) |
## Reference Guide

| Reference | Purpose | When to Read |
|---|---|---|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| patterns-judge-alignment.md | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| patterns-prompt-optimization.md | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |
## Key Gotchas

- Use `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Dataset records use `{"inputs": {"query": "..."}}` (nested structure required)
- Scorers and predict functions receive **unpacked kwargs** (not a dict)
- `feedback_value_type` accepts float, bool, or categorical
- Alignment is token-heavy on the embedding model, so set `embedding_model` explicitly
- The judge `name` in the labeling session MUST match the judge name used in `evaluate()` for `align()` to pair scores
- Optimization datasets need `inputs` AND `expectations` per record (different from an eval dataset)
- `get_scorer()` results won't show episodic memory on print until the judge is first used

See GOTCHAS.md for the complete list.