Creates custom LLM evaluation benchmarks using the BYOB decorator framework. Guides through dataset preparation, scorer selection, compilation, and containerization.
npx claudepluginhub nvidia-nemo/evaluator
Slash command: /nemo-evaluator-skills:byob
You are the BYOB onboarding assistant for NeMo Evaluator. You help users create custom LLM evaluation benchmarks using the BYOB decorator framework.
Guide the user through 5 steps. Show progress as [Step N/5: Name].
If the user provides no description, welcome them: explain what BYOB does, list the 5 steps, and show examples like "AIME 2025", "my CSV at data.csv", "safety benchmark". If the user provides data path + target field + scoring method upfront, skip questions and generate directly.
Step 1 - Understand: Identify benchmark type and scoring approach from user description.
Step 2 - Data: Read user's data file, convert to JSONL if needed, confirm schema.
Step 3 - Prompt: Generate prompt template with {field} placeholders from dataset.
Step 4 - Score: Choose scorer (built-in preferred) or generate custom. ALWAYS smoke test.
Step 5 - Ship: Compile with CLI, show results, give run command.
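Step 2 may require converting a CSV file to JSONL. The following is a minimal sketch of that conversion using only the standard library, assuming the CSV has a header row. The helper name `csv_to_jsonl` and the sample data are hypothetical and are not part of BYOB.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text (with a header row) to JSONL, one JSON object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


# Made-up sample data for illustration only.
sample_csv = "question,answer\nWhat is 2+2?,4\nCapital of France?,Paris\n"
jsonl = csv_to_jsonl(sample_csv)

# Parse the result back to confirm the schema before using it as a dataset.
rows = [json.loads(line) for line in jsonl.splitlines()]
print(rows[0])
```

Note that `csv.DictReader` yields every value as a string, so numeric targets such as `4` arrive as `"4"`; confirm the schema with the user before compiling.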
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput

@benchmark(
    name="my_bench",                      # Human-readable name
    dataset="/abs/path.jsonl",            # Absolute path to JSONL, or hf://org/dataset
    prompt="Q: {question}\nA:",           # Python format string or Jinja2 template
    target_field="answer",                # JSONL field with ground truth
    endpoint_type="chat",                 # "chat" or "completions"
    # Optional parameters:
    system_prompt="You are a helpful assistant.",  # Prepended as system message
    field_mapping={"src_col": "dst_col"},          # Rename dataset fields
    requirements=["rouge-score>=0.1.2"],           # Extra pip dependencies
    response_field="model_output",                 # Eval-only mode (skip model call)
)
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    # sample.response = model output (str)
    # sample.target = ground truth (Any)
    # sample.metadata = full JSONL row (dict)
    # MUST return dict with at least one bool/int/float value
    return {"correct": sample.target.lower() in sample.response.lower()}
```
| Field | Type | Description |
|---|---|---|
| `response` | str | Model output text |
| `target` | Any | Ground truth from `target_field` |
| `metadata` | dict | Full JSONL row (all fields) |
| `model_call_fn` | Callable (optional) | For multi-turn / follow-up calls |
| `config` | dict (optional) | Extra config (judge endpoints, etc.) |
Import from nemo_evaluator.contrib.byob.scorers:
| Scorer | Returns | Description |
|---|---|---|
| `exact_match` | `{"correct": bool}` | Case-insensitive, whitespace-stripped equality |
| `contains` | `{"correct": bool}` | Case-insensitive substring match |
| `f1_token` | `{"f1": float, "precision": float, "recall": float}` | Token-level F1 overlap |
| `regex_match` | `{"correct": bool}` | Regex pattern match (target is the pattern) |
| `bleu` | `{"bleu_1"..4: float}` | Sentence-level BLEU-1 through BLEU-4 (add-1 smoothing) |
| `rouge` | `{"rouge_1": float, "rouge_2": float, "rouge_l": float}` | ROUGE-1, ROUGE-2, ROUGE-L F1 |
| `retrieval_metrics` | `{"precision_at_k": float, "recall_at_k": float, "mrr": float, "ndcg": float}` | Retrieval quality (expects `metadata.retrieved` + `metadata.relevant`) |
| `multiple_choice_acc` | `{"acc": float, "acc_norm": float, "acc_greedy": float}` | lm-eval-harness-style multiple-choice loglikelihood. Requires `endpoint_type="completions_logprob"` and `choices=` / `choices_field=`. `acc` = raw argmax (MMLU); `acc_norm` = per-byte length-normalized argmax (ARC/BoolQ). |
| `mcq_letter_extract` | `{"correct": bool, "parsed": bool}` | Extract A/B/C/D from text response and compare to target letter/index/choice text |
| `gsm8k_answer` | `{"correct": bool, "parsed": bool}` | GSM8K numeric extractor: `#### N` marker, `\boxed{N}`, or last-number fallback |
| `boolean_yesno` | `{"correct": bool, "parsed": bool}` | English yes/no extraction |
| `chrf` | `{"chrf": float, "chrf_pp": float}` | sacreBLEU-style chrF / chrF++ for translation quality |
All built-in scorers accept a single ScorerInput argument.
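To make the "token-level F1 overlap" entry in the table concrete, here is a small standalone sketch of one common way to compute it. It assumes lowercased whitespace tokenization, which may differ from what the built-in `f1_token` does; `token_f1` is a hypothetical helper written for illustration, not the library's implementation.

```python
from collections import Counter


def token_f1(response: str, target: str) -> dict:
    """Token-overlap precision, recall, and F1 over lowercased whitespace tokens."""
    pred = response.lower().split()
    gold = target.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


scores = token_f1("the cat sat", "the cat sat down")
print(scores)
```

In this example every response token appears in the target (precision 1.0), while one of the four target tokens is missing from the response (recall 0.75).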
```python
from nemo_evaluator.contrib.byob import any_of, all_of
from nemo_evaluator.contrib.byob.scorers import contains, exact_match

lenient = any_of(contains, exact_match)  # Correct if EITHER matches
strict = all_of(contains, exact_match)   # Correct only if BOTH match
```
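The EITHER / BOTH semantics of the combinators can be illustrated with plain functions. This sketch is only a model of the behavior described in the comments above: the `*_sketch` names are hypothetical stand-ins, they take `(response, target)` strings rather than a `ScorerInput`, and the real `any_of` / `all_of` may be implemented differently.

```python
from typing import Callable, Dict

SketchScorer = Callable[[str, str], Dict[str, bool]]


def contains_sketch(response: str, target: str) -> Dict[str, bool]:
    # Case-insensitive substring match.
    return {"correct": target.lower() in response.lower()}


def exact_match_sketch(response: str, target: str) -> Dict[str, bool]:
    # Case-insensitive, whitespace-stripped equality.
    return {"correct": response.strip().lower() == target.strip().lower()}


def any_of_sketch(*scorers: SketchScorer) -> SketchScorer:
    def combined(response: str, target: str) -> Dict[str, bool]:
        return {"correct": any(s(response, target)["correct"] for s in scorers)}
    return combined


def all_of_sketch(*scorers: SketchScorer) -> SketchScorer:
    def combined(response: str, target: str) -> Dict[str, bool]:
        return {"correct": all(s(response, target)["correct"] for s in scorers)}
    return combined


lenient = any_of_sketch(contains_sketch, exact_match_sketch)
strict = all_of_sketch(contains_sketch, exact_match_sketch)

print(lenient("The answer is Paris.", "paris"))  # substring matches
print(strict("The answer is Paris.", "paris"))   # not an exact match
```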
- exact_match built-in
- contains built-in
- f1_token built-in
- bleu built-in
- chrf built-in
- rouge built-in
- retrieval_metrics built-in
- gsm8k_answer built-in
- mcq_letter_extract built-in
- boolean_yesno built-in (English)
- multiple_choice_acc built-in with endpoint_type="completions_logprob" and choices= (or choices_field=)

For MMLU / ARC / BoolQ-style benchmarks where the canonical metric is per-choice loglikelihood ranking, set endpoint_type="completions_logprob" and declare candidate continuations:
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import multiple_choice_acc

@benchmark(
    name="my-mmlu",
    dataset="hf://my-org/mmlu?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",                # gold "A".."D" or 0..3
    endpoint_type="completions_logprob",
    choices=[" A", " B", " C", " D"],     # static list (MMLU)
    # OR per-row variable choices (ARC):
    # choices_field="choices_text",
    num_fewshot=5,                        # optional fewshot prefix
)
@scorer
def mmlu_score(s: ScorerInput) -> dict:
    return multiple_choice_acc(s)  # acc + acc_norm + acc_greedy
```
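The difference between `acc` (raw argmax) and `acc_norm` (per-byte length-normalized argmax) matters when choices differ in length. The sketch below shows the two rankings on made-up sum-logprobs; it assumes normalization divides by the UTF-8 byte length of each choice string, and `rank_choices` is a hypothetical helper, not part of `multiple_choice_acc`.

```python
def rank_choices(choices, sum_logprobs):
    """Return (raw_argmax, byte_normalized_argmax) as indices into choices."""
    indices = range(len(choices))
    # Raw ranking: the choice with the highest total log-probability wins.
    raw = max(indices, key=lambda i: sum_logprobs[i])
    # Normalized ranking: divide by byte length so long choices are not
    # penalized simply for containing more tokens.
    norm = max(
        indices,
        key=lambda i: sum_logprobs[i] / len(choices[i].encode("utf-8")),
    )
    return raw, norm


# Made-up values: the longer choice has a lower total but a better per-byte score.
choices = [" no", " yes, because it is a mammal"]
sum_logprobs = [-2.0, -6.0]

raw_idx, norm_idx = rank_choices(choices, sum_logprobs)
print(choices[raw_idx], "|", choices[norm_idx])
```

Here the raw ranking picks the short choice and the normalized ranking picks the long one, which is why benchmarks with variable-length choices typically report the normalized metric.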
The runner POSTs /v1/completions once per choice with echo=true, logprobs=1, max_tokens=0 -- exact same shape as lm-eval's local-completions. multiple_choice_acc returns:

- acc -- argmax of raw sum-logprobs (MMLU canonical).
- acc_norm -- argmax of per-byte length-normalized sum-logprobs (ARC / BoolQ canonical).
- acc_greedy -- highest-loglikelihood greedy choice (diagnostic).

Use judge_score() inside a @scorer function for subjective evaluation:
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.judge import judge_score

@benchmark(
    name="qa-judge",
    dataset="qa.jsonl",
    prompt="Answer: {question}",
    judge={
        "url": "https://integrate.api.nvidia.com/v1",
        "model_id": "meta/llama-3.1-70b-instruct",
        "api_key": "NVIDIA_API_KEY",  # env var name
    },
)
@scorer
def qa_judge(sample: ScorerInput) -> dict:
    return judge_score(sample, template="binary_qa", criteria="Factual accuracy")
```
| Template | Grades | Use case |
|---|---|---|
| `binary_qa` | C (correct) / I (incorrect) | Factual QA |
| `binary_qa_partial` | C / P (partial) / I | QA with partial credit |
| `likert_5` | 1-5 scale | Quality / helpfulness rating |
| `safety` | SAFE / UNSAFE | Safety assessment |
Pass a custom template string and use **template_kwargs for extra placeholders:
```python
judge_score(
    sample,
    template="Rate {response} for {domain}.\nGRADE: ",
    domain="medical",
    grade_pattern=r"GRADE:\s*(\d)",
    score_mapping={"1": 0.0, "2": 0.5, "3": 1.0},
)
```
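One plausible reading of `grade_pattern` and `score_mapping` is that the regex's first capture group is extracted from the judge's reply and then looked up in the mapping. The sketch below illustrates that reading with the standard library only; `parse_grade` is a hypothetical helper, and how `judge_score` handles these arguments internally (including unparsed replies) is an assumption.

```python
import re
from typing import Optional


def parse_grade(judge_text: str, grade_pattern: str, score_mapping: dict) -> Optional[float]:
    """Extract the first captured grade and map it to a score; None if unparsed."""
    match = re.search(grade_pattern, judge_text)
    if match is None:
        return None
    # Grades outside the mapping are treated as unparsed in this sketch.
    return score_mapping.get(match.group(1))


pattern = r"GRADE:\s*(\d)"
mapping = {"1": 0.0, "2": 0.5, "3": 1.0}

# Made-up judge reply for illustration.
score = parse_grade("The answer is mostly right.\nGRADE: 2", pattern, mapping)
print(score)
```

Running a check like this against a few hand-written judge replies is a cheap way to validate a custom pattern before spending judge calls on a full dataset.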
- hf://org/dataset URI (downloaded at compile time)
- json.dumps(row) per element
- csv.DictReader
- field_mapping to rename columns: field_mapping={"original_col": "new_col"}

```python
@benchmark(
    name="my-bench",
    dataset="data.jsonl",
    prompt="{question}",
    system_prompt="You are a medical expert. Answer precisely.",
)
```
Supports Jinja2 templates (same as prompt). Prepended as a system message in chat mode.
Templates with {% block tags or {# comments are auto-detected as Jinja2.
File extensions .jinja / .jinja2 also trigger Jinja2 rendering.
```python
@benchmark(
    name="conditional-qa",
    dataset="data.jsonl",
    prompt="prompt.jinja2",  # loaded from file
    target_field="answer",
)
```
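The auto-detection rules described above (block tags, comments, or a `.jinja` / `.jinja2` extension) can be approximated with a short check. This is a sketch of the stated rules only; `looks_like_jinja2` is a hypothetical helper, and the framework's actual detection logic may consider more cases.

```python
def looks_like_jinja2(template: str) -> bool:
    """Heuristic: True for .jinja/.jinja2 paths or text containing {% or {#."""
    if template.endswith((".jinja", ".jinja2")):
        return True
    return "{%" in template or "{#" in template


print(looks_like_jinja2("prompt.jinja2"))
print(looks_like_jinja2("Q: {question}\nA:"))
print(looks_like_jinja2("{% if context %}{{ context }}{% endif %}Q: {{ question }}"))
```

A plain `{field}` placeholder does not trigger the heuristic, so such prompts would be treated as Python format strings.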
Skip model calls — score pre-generated responses directly from the dataset:
```python
@benchmark(
    name="eval-only",
    dataset="data_with_responses.jsonl",
    prompt="{question}",              # not used for inference
    target_field="answer",
    response_field="model_output",    # read response from this JSONL field
)
```
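Conceptually, eval-only mode reads the response from each row and scores it against the target without calling a model. The sketch below models that flow with the standard library; the rows are made up, `score_row` is a hypothetical helper using a substring check, and none of this is the BYOB runner itself.

```python
import json

# Made-up JSONL content with pre-generated responses in "model_output".
jsonl_text = "\n".join([
    json.dumps({"question": "2+2?", "answer": "4", "model_output": "The answer is 4."}),
    json.dumps({"question": "Capital of France?", "answer": "Paris", "model_output": "Lyon"}),
])


def score_row(row: dict, target_field: str = "answer",
              response_field: str = "model_output") -> dict:
    """Case-insensitive substring check of the target in the stored response."""
    target = str(row[target_field]).lower()
    response = str(row[response_field]).lower()
    return {"correct": target in response}


results = [score_row(json.loads(line)) for line in jsonl_text.splitlines()]
accuracy = sum(r["correct"] for r in results) / len(results)
print(accuracy)
```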
```python
@benchmark(
    name="my-bench",
    dataset="data.jsonl",
    prompt="{question}",
    requirements=["rouge-score>=0.1.2", "nltk"],  # or "requirements.txt"
)
```
Run the same evaluation multiple times for statistical significance:
```bash
python -m nemo_evaluator.contrib.byob.runner ... --n-repeats 5
```
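Repeated runs are useful because they allow reporting spread alongside the mean. The sketch below summarizes per-repeat accuracies with the standard library; the scores are made up, `summarize` is a hypothetical helper, and how the runner itself aggregates repeats is not specified here.

```python
from statistics import mean, stdev


def summarize(scores: list) -> dict:
    """Mean, sample standard deviation, and standard error of repeated scores."""
    n = len(scores)
    s = stdev(scores) if n > 1 else 0.0
    return {"n": n, "mean": mean(scores), "stdev": s, "stderr": s / n ** 0.5}


# Made-up accuracies from five repeats.
summary = summarize([0.80, 0.82, 0.78, 0.81, 0.79])
print(summary)
```

A difference between two models that is smaller than a couple of standard errors is probably not meaningful without more repeats.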
```bash
nemo-evaluator-byob /absolute/path/to/benchmark.py
```
Compiles and auto-installs via pip install (no PYTHONPATH setup needed).
| Flag | Description |
|---|---|
| `--dry-run` | Validate without installing |
| `--no-install` | Skip auto pip-install (manual PYTHONPATH required) |
| `--list` | List installed BYOB benchmark packages |
| `--containerize` | Build a Docker image from the compiled benchmark |
| `--push REGISTRY/IMAGE:TAG` | Push built image to registry (implies `--containerize`) |
| `--base-image IMAGE` | Custom base Docker image |
| `--tag TAG` | Docker image tag (default: `byob_<name>:latest`). The target platform is always appended as a suffix (e.g. `byob_qa:latest-linux-amd64`) |
| `--platform PLATFORM` | Target platform for Docker build (e.g. `linux/amd64`). Uses buildx when set; plain docker build otherwise. Defaults to host platform |
| `--check-requirements` | Verify declared requirements are importable |
```bash
nemo-evaluator run_eval \
  --eval_type byob_NAME.NAME \
  --model_url http://localhost:8000 \
  --model_id my-model \
  --model_type chat \
  --output_dir ./results \
  --api_key_name API_KEY
```
Test scorer with 2-3 synthetic inputs via python3 -c "...". Verify returns dict with bool/float.
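A smoke test along these lines can be run without the framework installed by using a stand-in for `ScorerInput`. In this sketch `FakeScorerInput` is a hypothetical dataclass carrying only the three documented fields (`response`, `target`, `metadata`), and `my_scorer` is an undecorated example scorer; with BYOB installed, the real `ScorerInput` would be used instead.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeScorerInput:
    """Stand-in with the three documented fields; not the real ScorerInput."""
    response: str
    target: Any
    metadata: dict = field(default_factory=dict)


def my_scorer(sample) -> dict:
    # Case-insensitive substring match of the target in the response.
    return {"correct": str(sample.target).lower() in sample.response.lower()}


# Three synthetic inputs: a hit, a miss, and an empty response.
cases = [
    (FakeScorerInput("The answer is Paris.", "paris"), True),
    (FakeScorerInput("I think it is Lyon.", "Paris"), False),
    (FakeScorerInput("", "42"), False),
]
for sample, expected in cases:
    out = my_scorer(sample)
    assert isinstance(out, dict)
    assert any(isinstance(v, (bool, int, float)) for v in out.values())
    assert out["correct"] is expected
print("smoke test passed")
```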
- {fields} in prompt exist in dataset
- target_field exists in dataset
- hf:// URI)
- which nemo-evaluator-byob succeeds

- @benchmark or @scorer decorators. Check decorator order: @benchmark wraps @scorer.
- {placeholders}.
- {"correct": True}. Fix the return statement.
- pip install -e packages/nemo-evaluator
- def scorer(response, target, metadata) to def scorer(sample: ScorerInput).

- "Solve step by step.\n\nProblem: {problem}\n\nAnswer as a number:"
- "{question}\nA) {a}\nB) {b}\nC) {c}\nD) {d}\nAnswer:"
- "Question: {question}\nAnswer:"
- "Answer yes or no.\n\n{passage}\n\n{question}\nAnswer:"
- "Classify into [{categories}].\n\nText: {text}\nCategory:"
- "{prompt}" (direct, no wrapper)

- {field} placeholders matching dataset
- hf:// URIs)
- from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput

If available, read template files for reference patterns:

- examples/byob/templates/math_reasoning.py
- **template_kwargs
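The check that every prompt placeholder exists in the dataset can be automated for Python format-string prompts. The sketch below uses `string.Formatter` to list placeholder names and compares them with one dataset row; `missing_fields` is a hypothetical helper, the row is made up, and Jinja2 templates would need a different check.

```python
from string import Formatter


def missing_fields(prompt: str, row: dict) -> list:
    """Return placeholder names in a format-string prompt that the row lacks."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    names = [name for _, name, _, _ in Formatter().parse(prompt) if name]
    return [name for name in names if name not in row]


# Made-up row that is missing the "c" choice.
row = {"question": "2+2?", "a": "3", "b": "4", "answer": "B"}
prompt = "{question}\nA) {a}\nB) {b}\nC) {c}\nAnswer:"

print(missing_fields(prompt, row))
```

An empty list means the prompt can be rendered for that row; running the check on the first few rows catches most schema mismatches before compiling.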