Stats

Actions

Available In

Tags

NeMo Evaluator SDK

[!NOTE] Preview: NeMo Evaluator 0.3.0 — A ground-up rewrite with a unified nel CLI, pluggable environment architecture, and built-in agentic eval support is available on the dev/0.3.0 branch. Feedback welcome via Issues.

🆕 What's New in 26.01 Release

Telemetry

Anonymous telemetry to help improve the project. See Telemetry for details and opt-out options.

New Evaluation Harnesses

TAU2-Bench (tau2-bench): Conversational agents in dual-control environments (telecom, airline, retail)

RULER (long-context-eval): Long-context evaluation with configurable sequence lengths (4K to 1M tokens)

CoDec (contamination-detection): Contamination detection - practical and accurate method to detect and quantify training data contamination in large language models

MTEB (mteb): Massive Text Embedding Benchmark

NeMo Evaluator SDK is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.

NeMo Evaluator SDK is built on four core principles to provide a reliable and versatile evaluation experience:

Reproducibility by Default: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.

Scale Anywhere: Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.

State-of-the-Art Benchmarking: Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of Supported benchmarks and evaluation harnesses.

Extensible and Customizable: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.

⚙️ How It Works: Launcher and Core Engine

The platform consists of two main components:

nemo-evaluator (The Evaluation Core Engine): A Python library that manages the interaction between an evaluation harness and the model being tested.

nemo-evaluator-launcher (The CLI and Orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.

Most users typically interact with nemo-evaluator-launcher, which serves as a universal gateway to different benchmarks and harnesses. However, it is also possible to interact directly with nemo-evaluator by following this guide.

📊 Supported Benchmarks and Evaluation Harnesses

NeMo Evaluator SDK

[!NOTE] Preview: NeMo Evaluator 0.3.0 — A ground-up rewrite with a unified nel CLI, pluggable environment architecture, and built-in agentic eval support is available on the dev/0.3.0 branch. Feedback welcome via Issues.

🆕 What's New in 26.01 Release

Telemetry

Anonymous telemetry to help improve the project. See Telemetry for details and opt-out options.

New Evaluation Harnesses

TAU2-Bench (tau2-bench): Conversational agents in dual-control environments (telecom, airline, retail)
RULER (long-context-eval): Long-context evaluation with configurable sequence lengths (4K to 1M tokens)
CoDec (contamination-detection): Contamination detection - practical and accurate method to detect and quantify training data contamination in large language models
MTEB (mteb): Massive Text Embedding Benchmark

📖 Documentation

NeMo Evaluator SDK is built on four core principles to provide a reliable and versatile evaluation experience:

Reproducibility by Default: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
Scale Anywhere: Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.
State-of-the-Art Benchmarking: Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of Supported benchmarks and evaluation harnesses.
Extensible and Customizable: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.

⚙️ How It Works: Launcher and Core Engine

The platform consists of two main components:

nemo-evaluator (The Evaluation Core Engine): A Python library that manages the interaction between an evaluation harness and the model being tested.
nemo-evaluator-launcher (The CLI and Orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.

nemo-evaluator-skills

Popularity

What's Inside

Confidence

README

NeMo Evaluator SDK

🆕 What's New in 26.01 Release

Telemetry

New Evaluation Harnesses

📖 Documentation

⚙️ How It Works: Launcher and Core Engine

📊 Supported Benchmarks and Evaluation Harnesses

Similar Plugins

llm-observability

ai-infra-auto-driven-skills

skill-optimizer

DeepEval

training-hub

agent-eval-harness

More by NVIDIA-NeMo

nemotron-customize

NeMo Evaluator SDK

🆕 What's New in 26.01 Release

Telemetry

New Evaluation Harnesses

📖 Documentation

⚙️ How It Works: Launcher and Core Engine

📊 Supported Benchmarks and Evaluation Harnesses

Popularity

Health & Quality

More by NVIDIA-NeMo

nemotron-customize

Similar Plugins

llm-observability

ai-infra-auto-driven-skills

skill-optimizer

DeepEval

training-hub

agent-eval-harness