Help us improve
Share bugs, ideas, or general feedback.
Share bugs, ideas, or general feedback.
Share bugs, ideas, or general feedback.
By NVIDIA-NeMo
Manage the full lifecycle of LLM evaluations on Slurm clusters: launch, monitor, debug, and analyze results via NeMo Evaluator SDK. Create custom benchmarks with the BYOB framework and interactively configure evaluation launchers.
npx claudepluginhub nvidia-nemo/evaluatorQuery and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs. WARNING โ this skill performs privileged operations; it SSHes to remote cluster hosts, executes shell commands on cluster infrastructure, transfers files via rsync, and may modify cluster configuration fields (e.g., the SLURM account). Only use with trusted cluster credentials and configs from trusted sources, and require explicit user confirmation before SSH/rsync operations or any cluster-config change.
Interactive config wizard for NeMo Evaluator Launcher (NEL). Use when the user wants to create a new evaluation config from scratch, set up an evaluation from existing configs, or modify a NEL config (deployment, tasks, multi-node, interceptors). ALWAYS triggers on mentions of creating configs, setting up evaluations, configuring models for evaluation, or modifying NEL YAML files. Do NOT use for monitoring, debugging, or analyzing already-running evaluations.
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
Share bugs, ideas, or general feedback.
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge.
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge.
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
Agent-ready playbooks for LLM serving benchmarks, capacity planning, torch-profiler triage, pipeline analysis, compute simulation, SGLang/vLLM SOTA Humanize loops, human code review, production incident triage, and model PR-history dossiers.
Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.
Evaluate and compare ML model performance metrics
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
Agent and skill evaluation harness with MLflow integration
Scaffold a Dokimos Experiment that wires together a dataset, task, evaluators, and optional reporter. Supports parallelism, multiple runs for variance reduction, and server-based reporting.
Compose runnable NVIDIA Nemotron model-customization pipelines from existing repo steps.
[!NOTE] Preview: NeMo Evaluator 0.3.0 โ A ground-up rewrite with a unified
nelCLI, pluggable environment architecture, and built-in agentic eval support is available on thedev/0.3.0branch. Feedback welcome via Issues.
tau2-bench): Conversational agents in dual-control environments (telecom, airline, retail)long-context-eval): Long-context evaluation with configurable sequence lengths (4K to 1M tokens)contamination-detection): Contamination detection - practical and accurate method to detect and quantify training data contamination in large language modelsmteb): Massive Text Embedding BenchmarkNeMo Evaluator SDK is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.
NeMo Evaluator SDK is built on four core principles to provide a reliable and versatile evaluation experience:
The platform consists of two main components:
nemo-evaluator (The Evaluation Core Engine): A Python library that manages the interaction between an evaluation harness and the model being tested.nemo-evaluator-launcher (The CLI and Orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.Most users typically interact with nemo-evaluator-launcher, which serves as a universal gateway to different benchmarks and harnesses. However, it is also possible to interact directly with nemo-evaluator by following this guide.