Evaluate model predictions against ground truth using COCO, Open Images, or custom protocols. Use when computing mAP, precision, recall, confusion matrices, or analyzing TP/FP/FN examples for detection, classification, segmentation, or regression tasks.
Install: npx claudepluginhub voxel51/fiftyone-skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
ALWAYS follow these rules:

1. Set the dataset context and review its schema first:

list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")

2. Verify the dataset has both prediction and ground truth fields of compatible types.

3. Ensure the @voxel51/evaluation plugin is installed and enabled:

list_plugins()
# If @voxel51/evaluation not listed:
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")

4. Always confirm with the user before launching or closing the App:

launch_app(dataset_name="my-dataset")
close_app()
list_datasets()
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
Review the available label fields and their types.
Label Types and Compatible Evaluations:
| Label Type | Evaluation Function | Supported Methods |
|---|---|---|
| Detections | evaluate_detections() | coco, open-images |
| Polylines | evaluate_detections() | coco, open-images |
| Keypoints | evaluate_detections() | coco, open-images |
| TemporalDetections | evaluate_detections() | activitynet |
| Classification | evaluate_classifications() | simple, top-k, binary |
| Segmentation | evaluate_segmentations() | simple |
| Regression | evaluate_regressions() | simple |
list_plugins()
If @voxel51/evaluation is not in the list:
download_plugin(
url_or_repo="voxel51/fiftyone-plugins",
plugin_names=["@voxel51/evaluation"]
)
enable_plugin(plugin_name="@voxel51/evaluation")
launch_app(dataset_name="my-dataset")
Ask user for:
- Predictions field (pred_field)
- Ground truth field (gt_field)
- Evaluation key (eval_key) - must be a unique identifier

execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval",
"method": "coco",
"iou": 0.5,
"compute_mAP": true
}
)
After evaluation, the dataset will have new fields:
- {eval_key}_tp - True positive count per sample
- {eval_key}_fp - False positive count per sample
- {eval_key}_fn - False negative count per sample

View only samples with false positives:
set_view(filters={"eval_fp": {"$gt": 0}})
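These per-sample counts roll up into dataset-level precision and recall. A minimal plain-Python sketch (the samples list and the eval_tp/eval_fp/eval_fn keys are illustrative stand-ins for real sample fields):

```python
def precision_recall(samples):
    """Aggregate per-sample TP/FP/FN counts into dataset-level metrics."""
    tp = sum(s["eval_tp"] for s in samples)
    fp = sum(s["eval_fp"] for s in samples)
    fn = sum(s["eval_fn"] for s in samples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

samples = [
    {"eval_tp": 8, "eval_fp": 2, "eval_fn": 1},
    {"eval_tp": 4, "eval_fp": 0, "eval_fn": 3},
]
print(precision_recall(samples))  # (0.857..., 0.75)
```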
Use the Model Evaluation Panel in the App to interactively explore:
To examine individual true positives, false positives, and false negatives, guide users to the Python SDK:
import fiftyone as fo
from fiftyone import ViewField as F
dataset = fo.load_dataset("my-dataset")
# Convert to evaluation patches view
eval_patches = dataset.to_evaluation_patches("eval")
# Count by type
print(eval_patches.count_values("type"))
# Output: {'fn': 246, 'fp': 4131, 'tp': 986}
# View only false positives
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
close_app()
For Detections, Polylines, and Keypoints labels.
COCO-style (default):
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_coco",
"method": "coco",
"iou": 0.5,
"classwise": true,
"compute_mAP": true
}
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| iou | float | 0.5 | IoU threshold for matching |
| classwise | bool | true | Only match objects with the same class |
| compute_mAP | bool | false | Compute mAP, mAR, and PR curves |
| use_masks | bool | false | Use instance masks for IoU (if available) |
| iscrowd | string | null | Attribute name for crowd annotations |
| iou_threshs | string | null | Comma-separated IoU thresholds for mAP |
| max_preds | int | null | Max predictions per sample for mAP |
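To make iou and classwise concrete, here is a hedged sketch of box IoU and greedy confidence-ordered matching in plain Python. It illustrates the general COCO-style matching idea, not FiftyOne's exact implementation; boxes are assumed to be [x1, y1, x2, y2] lists:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match(preds, gts, iou_thresh=0.5, classwise=True):
    """Greedy matching: each prediction (confidence-sorted) claims at most
    one unmatched ground truth with IoU >= threshold. Returns (tp, fp, fn)."""
    preds = sorted(preds, key=lambda p: -p["confidence"])
    unmatched = set(range(len(gts)))
    tp = 0
    for p in preds:
        best, best_iou = None, iou_thresh
        for i in unmatched:
            if classwise and gts[i]["label"] != p["label"]:
                continue
            v = iou(p["box"], gts[i]["box"])
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            unmatched.discard(best)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)

preds = [{"box": [0, 0, 10, 10], "label": "cat", "confidence": 0.9},
         {"box": [50, 50, 60, 60], "label": "dog", "confidence": 0.8}]
gts = [{"box": [1, 1, 10, 10], "label": "cat"},
       {"box": [0, 0, 5, 5], "label": "dog"}]
print(match(preds, gts))  # (1, 1, 1): one TP, one FP, one FN
```

Raising iou_thresh (e.g. to 0.75) makes matches harder to earn, which is exactly what the strict/lenient comparison later in this document exercises.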
Open Images-style:
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_oi",
"method": "open-images",
"iou": 0.5
}
)
Supports additional parameters:
- pos_label_field: Classifications specifying which classes should be evaluated
- neg_label_field: Classifications specifying which classes should NOT be evaluated

ActivityNet-style (temporal):
For TemporalDetections in video datasets:
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_temporal",
"method": "activitynet",
"compute_mAP": true
}
)
For Classification labels.
Simple (default):
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_cls",
"method": "simple"
}
)
The per-sample field {eval_key} stores a boolean indicating whether the prediction was correct.
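The simple method's logic can be sketched in a few lines of plain Python (illustrative only; evaluate_classifications additionally handles label objects, missing values, and confidences):

```python
def evaluate_simple(preds, gts):
    """Per-sample correctness (the boolean stored under {eval_key})
    plus overall accuracy."""
    correct = [p == g for p, g in zip(preds, gts)]
    accuracy = sum(correct) / len(correct)
    return correct, accuracy

correct, acc = evaluate_simple(["cat", "dog", "cat"], ["cat", "dog", "dog"])
print(correct, round(acc, 2))  # [True, True, False] 0.67
```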
Top-k:
Requires predictions with a logits field:
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_topk",
"method": "top-k",
"k": 5
}
)
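The top-k rule can be sketched as follows (the classes list and logits here are hypothetical; a prediction counts as correct when the ground-truth class ranks among the k highest logits):

```python
def top_k_correct(logits, classes, gt_label, k=5):
    """True if gt_label is among the k classes with highest logits."""
    ranked = sorted(zip(logits, classes), key=lambda t: -t[0])
    return gt_label in {c for _, c in ranked[:k]}

classes = ["cat", "dog", "bird", "fish"]
print(top_k_correct([0.1, 2.3, 1.7, -0.4], classes, "bird", k=2))  # True
print(top_k_correct([0.1, 2.3, 1.7, -0.4], classes, "fish", k=2))  # False
```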
Binary:
For binary classifiers:
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_binary",
"method": "binary"
}
)
Per-sample field {eval_key} stores: "tp", "fp", "tn", or "fn".
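A minimal sketch of how each sample maps to one of these four outcomes (the positive label name is an assumption for illustration):

```python
def binary_outcome(pred, gt, positive="positive"):
    """Classify one sample's result as tp/fp/tn/fn, mirroring the
    values stored under {eval_key}."""
    if pred == positive:
        return "tp" if gt == positive else "fp"
    return "fn" if gt == positive else "tn"

pairs = [("positive", "positive"), ("positive", "negative"),
         ("negative", "negative"), ("negative", "positive")]
print([binary_outcome(p, g) for p, g in pairs])  # ['tp', 'fp', 'tn', 'fn']
```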
For Segmentation labels.
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_seg",
"method": "simple",
"bandwidth": 5 # Optional: evaluate only boundary pixels
}
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| bandwidth | int | null | Pixels along contours to evaluate (null = entire mask) |
| average | string | "micro" | Averaging strategy: micro, macro, weighted, samples |
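To clarify the averaging strategies, here is a plain-Python sketch of micro vs. macro precision over pixel labels (flat label lists stand in for mask arrays; FiftyOne operates on actual masks):

```python
def segmentation_scores(pred, gt, classes):
    """Pixelwise per-class TP/FP/FN, combined with micro or macro averaging.
    pred/gt are flat lists of class labels, one entry per pixel."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for p, g in zip(pred, gt):
        if p == g:
            counts[p]["tp"] += 1
        else:
            counts[p]["fp"] += 1
            counts[g]["fn"] += 1

    def prec(c):
        tp, fp = counts[c]["tp"], counts[c]["fp"]
        return tp / (tp + fp) if tp + fp else 0.0

    # micro: pool all pixels; macro: average the per-class scores
    micro_tp = sum(v["tp"] for v in counts.values())
    micro_fp = sum(v["fp"] for v in counts.values())
    micro = micro_tp / (micro_tp + micro_fp)
    macro = sum(prec(c) for c in classes) / len(classes)
    return micro, macro

micro, macro = segmentation_scores(
    ["road", "road", "road", "car"],
    ["road", "road", "car", "car"],
    ["road", "car"],
)
print(round(micro, 3), round(macro, 3))  # 0.75 0.833
```

Micro averaging weights every pixel equally, so frequent classes dominate; macro averaging weights every class equally, which surfaces poor performance on rare classes.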
Per-sample fields:
- {eval_key}_accuracy
- {eval_key}_precision
- {eval_key}_recall

For Regression labels.
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_reg",
"method": "simple",
"metric": "squared_error" # or "absolute_error"
}
)
Per-sample field {eval_key} stores the error value.
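The two metrics can be sketched in plain Python (the per-sample errors correspond to the {eval_key} values; their mean is the aggregate metric):

```python
def regression_errors(preds, gts, metric="squared_error"):
    """Per-sample errors (stored under {eval_key}) and their mean."""
    if metric == "squared_error":
        errors = [(p - g) ** 2 for p, g in zip(preds, gts)]
    else:  # "absolute_error"
        errors = [abs(p - g) for p, g in zip(preds, gts)]
    return errors, sum(errors) / len(errors)

errors, mse = regression_errors([2.0, 3.5], [1.0, 3.0])
print(errors, mse)  # [1.0, 0.25] 0.625
```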
Get info about an existing evaluation, including its metrics and configuration:
execute_operator(
operator_uri="@voxel51/evaluation/get_evaluation_info",
params={
"eval_key": "eval"
}
)
Load the exact view on which an evaluation was performed:
execute_operator(
operator_uri="@voxel51/evaluation/load_evaluation_view",
params={
"eval_key": "eval",
"select_fields": false
}
)
Rename an evaluation:

execute_operator(
operator_uri="@voxel51/evaluation/rename_evaluation",
params={
"eval_key": "eval",
"new_eval_key": "eval_v2"
}
)
Delete an evaluation:

execute_operator(
operator_uri="@voxel51/evaluation/delete_evaluation",
params={
"eval_key": "eval"
}
)
# Verify dataset has detection fields
set_context(dataset_name="my-dataset")
dataset_summary(name="my-dataset")
# Launch app
launch_app(dataset_name="my-dataset")
# Run COCO-style evaluation with mAP
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval",
"method": "coco",
"iou": 0.5,
"compute_mAP": true
}
)
# View samples with more than 5 false positives
set_view(filters={"eval_fp": {"$gt": 5}})
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")
# Evaluate first model
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "model_a_predictions",
"gt_field": "ground_truth",
"eval_key": "eval_model_a",
"method": "coco",
"compute_mAP": true
}
)
# Evaluate second model
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "model_b_predictions",
"gt_field": "ground_truth",
"eval_key": "eval_model_b",
"method": "coco",
"compute_mAP": true
}
)
# Use the Model Evaluation Panel to compare results
set_context(dataset_name="my-classification-dataset")
launch_app(dataset_name="my-classification-dataset")
# Simple classification evaluation
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_cls",
"method": "simple"
}
)
# View misclassified samples
set_view(filters={"eval_cls": false})
set_context(dataset_name="my-dataset")
launch_app(dataset_name="my-dataset")
# Strict evaluation (IoU 0.75)
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_strict",
"method": "coco",
"iou": 0.75,
"compute_mAP": true
}
)
# Lenient evaluation (IoU 0.25)
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_lenient",
"method": "coco",
"iou": 0.25,
"compute_mAP": true
}
)
set_context(dataset_name="my-segmentation-dataset")
launch_app(dataset_name="my-segmentation-dataset")
# Full mask evaluation
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_seg",
"method": "simple"
}
)
# Boundary-only evaluation (5 pixel bandwidth)
execute_operator(
operator_uri="@voxel51/evaluation/evaluate_model",
params={
"pred_field": "predictions",
"gt_field": "ground_truth",
"eval_key": "eval_seg_boundary",
"method": "simple",
"bandwidth": 5
}
)
For more control over evaluation and access to full results, guide users to the Python SDK:
import fiftyone as fo
import fiftyone.zoo as foz
# Load dataset
dataset = fo.load_dataset("my-dataset")
# Evaluate detections
results = dataset.evaluate_detections(
"predictions",
gt_field="ground_truth",
eval_key="eval",
method="coco",
iou=0.5,
compute_mAP=True,
)
# Print classification report
results.print_report()
# Get mAP value
print(f"mAP: {results.mAP():.3f}")
# Plot confusion matrix (interactive)
plot = results.plot_confusion_matrix()
plot.show()
# Plot precision-recall curves
plot = results.plot_pr_curves(classes=["person", "car", "dog"])
plot.show()
# Convert to evaluation patches to view TP/FP/FN
eval_patches = dataset.to_evaluation_patches("eval")
print(eval_patches.count_values("type"))
# View false positives in the App
from fiftyone import ViewField as F
fp_view = eval_patches.match(F("type") == "fp")
session = fo.launch_app(view=fp_view)
Python SDK evaluation methods:
- dataset.evaluate_detections() - Object detection
- dataset.evaluate_classifications() - Classification
- dataset.evaluate_segmentations() - Semantic segmentation
- dataset.evaluate_regressions() - Regression

Results object methods:
- results.print_report() - Print classification report
- results.print_metrics() - Print aggregate metrics
- results.mAP() - Get mAP value (detection only)
- results.mAR() - Get mAR value (detection only)
- results.plot_confusion_matrix() - Interactive confusion matrix
- results.plot_pr_curves() - Precision-recall curves
- results.plot_results() - Scatter plot (regression only)

Error: "No suitable label fields"
- Run dataset_summary() to see available fields and types

Error: "No suitable ground truth fields"
- Verify the dataset has a ground truth field of a type compatible with the predictions

Error: "Evaluation key already exists"
- Choose a different eval_key, or use rename_evaluation / delete_evaluation to clear the existing one
Error: "Plugin not found"
download_plugin(url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/evaluation"])
enable_plugin(plugin_name="@voxel51/evaluation")
mAP is not computed
- Set compute_mAP: true in params

Evaluation is slow
- Computing mAP adds significant overhead; omit compute_mAP when you only need per-sample results

Best practices:
- Use descriptive eval keys, e.g. eval_yolov8_coco, eval_resnet_topk5
- Use to_evaluation_patches() to understand errors