Systematic evaluation of ML models, experiments, and AI system outputs. Multi-dimensional rubrics, LLM-as-judge, bias detection, and structured comparison frameworks. Use when the user asks to "evaluate model performance", "compare models", "build evaluation rubrics", "assess output quality", "detect model bias", or mentions evaluation frameworks, LLM-as-judge, model comparison, or quality assessment.
Install: `npx claudepluginhub damionrashford/mlx --plugin mlx`
Frameworks for systematic evaluation of ML models, experiment results, and AI system outputs. Covers traditional ML metrics, LLM-as-judge patterns, bias detection, and structured comparison.
Classification metrics:

| Metric | When to use | Formula |
|---|---|---|
| Accuracy | Balanced classes | correct / total |
| Precision | False positives costly | TP / (TP + FP) |
| Recall | False negatives costly | TP / (TP + FN) |
| F1 | Imbalanced classes | 2 * P * R / (P + R) |
| AUC-ROC | Ranking / threshold selection | Area under ROC curve |
| Log loss | Probability calibration | -mean(y*log(p)) |
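As a sanity check, the classification formulas above can be computed directly from a toy prediction vector (a minimal sketch; the arrays are made-up illustration data):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives: 3
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: 1

accuracy = np.mean(y_true == y_pred)                 # 6/8 = 0.75
precision = tp / (tp + fp)                           # 3/4 = 0.75
recall = tp / (tp + fn)                              # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```

In practice, `sklearn.metrics` computes the same quantities; writing them out once makes the precision/recall trade-off in the table concrete.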
Regression metrics:

| Metric | When to use | Sensitivity |
|---|---|---|
| RMSE | Penalize large errors | Outlier sensitive |
| MAE | Robust to outliers | Linear penalty |
| R-squared | Variance explained | Scale independent |
| MAPE | Percentage error | Zero sensitive |
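The regression metrics can likewise be sketched from their definitions (illustration data is made up; note how the single largest error dominates RMSE but not MAE):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err ** 2))   # squared penalty: outlier sensitive
mae = np.mean(np.abs(err))          # linear penalty: 0.5

# R-squared: fraction of variance explained
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# MAPE divides by y_true, so it is undefined when targets include zero
mape = np.mean(np.abs(err / y_true)) * 100
```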
```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.calibration import calibration_curve
import numpy as np

# Multi-dimensional evaluation
report = classification_report(y_true, y_pred, output_dict=True)

# Per-class performance (find weak spots)
cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Calibration (are probabilities reliable?)
fraction_pos, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)

# Fairness across subgroups
for group in ['group_a', 'group_b']:
    mask = df['subgroup'] == group
    group_score = metric(y_true[mask], y_pred[mask])
    print(f"{group}: {group_score:.4f}")
```
When comparing experiments in `results.tsv`:
| Experiment | val_score | test_score | memory_mb | train_time | complexity |
|------------|-----------|------------|-----------|------------|------------|
| exp000 | 0.8523 | 0.8401 | 4096 | 2m | low |
| exp003 | 0.8634 | 0.8521 | 4352 | 5m | medium |
| exp007 | 0.8641 | 0.8530 | 8192 | 45m | high |
Beyond the raw scores, compare on:

- **Statistical significance**: Is the improvement real or noise?
- **Efficiency trade-off**: Score per resource unit
- **Robustness**: Does it hold across conditions?
- **Deployability**: Can it run in production?
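The significance question can be sketched with a paired bootstrap over per-example scores (a minimal illustration, not a full hypothesis test; `scores_a` and `scores_b` are hypothetical per-example metric values from two experiments on the same test set):

```python
import numpy as np

def bootstrap_win_rate(scores_a, scores_b, n_boot=10_000, seed=0):
    """Fraction of bootstrap resamples of the test set where B outscores A.

    Values near 1.0 (or 0.0) suggest a real difference; values near 0.5
    suggest the gap is noise.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_b) - np.asarray(scores_a)  # paired deltas
    n = len(diffs)
    boot_means = np.array([
        diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)
    ])
    return float((boot_means > 0).mean())
```

A paired design (same examples for both models) is what makes this valid; comparing scores drawn from different test sets would conflate model and data differences.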
For evaluating LLM application outputs (RAG, chatbots, agents):
```python
EVAL_PROMPT = """Rate the following AI response on a scale of 1-5 for each dimension.

**Task**: {task_description}
**Input**: {user_input}
**Response**: {model_output}
**Reference** (if available): {reference}

Rate each dimension:
- Relevance (1-5): Does the response address the question?
- Accuracy (1-5): Are the facts correct?
- Completeness (1-5): Does it cover all aspects?
- Clarity (1-5): Is it well-organized and clear?
- Helpfulness (1-5): Would this actually help the user?

Output as JSON:
{{"relevance": X, "accuracy": X, "completeness": X, "clarity": X, "helpfulness": X, "reasoning": "..."}}
"""
```
Pairwise comparison is more reliable than direct scoring for subjective quality:
```python
PAIRWISE_PROMPT = """Which response better answers the question?

**Question**: {question}
**Response A**: {response_a}
**Response B**: {response_b}

Choose: A is better / B is better / Tie
Explain your reasoning in 2-3 sentences.
"""
```
Bias mitigation: Run twice with A/B swapped. If results disagree, mark as tie.
| Bias | Description | Mitigation |
|---|---|---|
| Position | Prefers first response | Swap positions, average |
| Length | Prefers longer responses | Normalize by length |
| Self-enhancement | Prefers own model's style | Use different judge model |
| Verbosity | Equates detail with quality | Explicit rubric criteria |
| Authority | Prefers confident tone | Focus on factual accuracy |
| Approach | Best for | Limitation |
|---|---|---|
| Direct scoring | Objective criteria — factual accuracy, instruction following, toxicity | Score calibration drift, inconsistent scale interpretation |
| Pairwise comparison | Subjective preferences — tone, style, persuasiveness, overall quality | Position bias, length bias |
| Rubric-based | Multi-dimensional quality with defined criteria | Requires upfront rubric design |
Research (MT-Bench, Zheng et al. 2023): pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation. Use direct scoring for objective criteria with clear ground truth; use pairwise for subjective quality comparisons.
| Task type | Primary metrics | Secondary |
|---|---|---|
| Binary pass/fail | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
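For the binary pass/fail row, Cohen's κ can be written out by hand to show what "agreement corrected for chance" means (a sketch for two labelers with binary labels; `sklearn.metrics.cohen_kappa_score` covers the general case):

```python
def cohens_kappa(a, b):
    """Agreement between two binary label vectors, corrected for the
    agreement expected from two independent labelers with the same marginals."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)  # chance agreement on 1s and 0s
    return (p_obs - p_exp) / (1 - p_exp)
```

κ = 1 means perfect agreement; κ = 0 means no better than chance, which is why raw agreement rate alone can look deceptively high on imbalanced labels.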
```python
def evaluate_with_bias_mitigation(question, response_a, response_b, judge_model):
    # Forward pass
    result_1 = judge_model(PAIRWISE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    ))
    # Swapped pass (position bias mitigation)
    result_2 = judge_model(PAIRWISE_PROMPT.format(
        question=question, response_a=response_b, response_b=response_a
    ))
    # Normalize result_2: the A/B labels refer to swapped positions
    swap = {"A": "B", "B": "A", "tie": "tie"}
    if result_1 == swap[result_2]:
        return result_1  # Consistent across orderings — high confidence
    return "tie"         # Inconsistent — position bias suspected, call it a tie
```
```python
def build_rubric(criterion, weight, levels):
    """
    criterion: "Factual Accuracy"
    weight: 0.40
    levels: {5: "All claims verified", 3: "Mostly accurate", 1: "Multiple errors"}
    """
    return {"criterion": criterion, "weight": weight, "levels": levels}
```
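One way to collapse such a rubric into a single score is a weighted average over criteria, normalized by the top level of the 1-5 scale (a sketch; `ratings`, mapping criterion name to the judged level, is an assumed format for collected scores):

```python
def score_with_rubric(rubric_items, ratings, top_level=5):
    """Weighted average of per-criterion levels, scaled to [0, 1]."""
    total_weight = sum(item["weight"] for item in rubric_items)
    weighted = sum(
        item["weight"] * ratings[item["criterion"]] for item in rubric_items
    )
    return weighted / (top_level * total_weight)
```

Normalizing by the total weight means the weights only need to be relative, not sum to exactly 1.0.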
Research (BrowseComp benchmark) shows three factors explain 95% of agent performance variance. Implication: evaluate with realistic token budgets, not unlimited resources. Upgrading from an older model to Claude Sonnet 4.5 or GPT-5.2 provides larger gains than doubling the token budget on the same model.
Build test sets covering multiple difficulty levels:
```python
test_set = {
    "simple": [
        # Single-step, clear answer, common patterns
    ],
    "medium": [
        # Multi-step, some ambiguity, less common patterns
    ],
    "complex": [
        # Many steps, significant ambiguity, edge cases
    ],
    "adversarial": [
        # Deliberately tricky, boundary conditions, known failure modes
    ],
}
```
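A single overall pass rate can hide a cliff at one tier (e.g. perfect on simple, failing on adversarial), so it helps to report per-tier results. A small sketch, assuming `results` maps each tier name to a list of boolean outcomes:

```python
def difficulty_breakdown(results):
    """Per-tier pass rates from {tier: [bool outcomes]}."""
    return {
        tier: sum(outcomes) / len(outcomes)
        for tier, outcomes in results.items()
        if outcomes  # skip empty tiers rather than divide by zero
    }
```

Comparing models tier by tier also shows where an improvement actually comes from: easy cases, hard cases, or both.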
```
=== Evaluation Report ===
Task: [classification/regression/generation/retrieval]
Models compared: [list]
Test set: [size, composition]

Best model: [name]
Primary metric: [metric] = [value] (95% CI: [low, high])

Multi-dimensional comparison:
| Dimension       | Model A | Model B | Winner |
|-----------------|---------|---------|--------|
| Primary metric  |         |         |        |
| Inference speed |         |         |        |
| Memory usage    |         |         |        |
| Robustness      |         |         |        |
| Fairness        |         |         |        |

Recommendation: [which model and why]
Caveats: [limitations of this evaluation]
```