Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
/plugin marketplace add huggingface/skills
/plugin install huggingface-huggingface-skills@huggingface/skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
examples/USAGE_EXAMPLES.md
examples/artificial_analysis_to_hub.py
examples/example_readme_tables.md
examples/metric_mapping.json
requirements.txt
scripts/evaluation_manager.py
scripts/inspect_eval_uv.py
scripts/inspect_vllm_uv.py
scripts/lighteval_vllm_uv.py
scripts/run_eval_job.py
scripts/run_vllm_eval_job.py
scripts/test_extraction.py

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
uv integration · v1.3.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
- Warn the user and show the existing PR URLs
- Do NOT create a new PR unless the user explicitly confirms

This prevents spamming model repositories with duplicate evaluation PRs.
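If a script or agent needs to do this check programmatically rather than via the CLI, here is a minimal sketch using huggingface_hub's discussions API (the repo id is a placeholder):

```python
# Minimal sketch: list open pull requests for a model repo before creating a new one.
# Assumes huggingface_hub is installed and HF_TOKEN is available in the environment.
from huggingface_hub import HfApi

def open_pull_requests(repo_id: str):
    """Return open discussions that are pull requests for the given model repo."""
    api = HfApi()
    return [
        d for d in api.get_repo_discussions(repo_id=repo_id, repo_type="model")
        if d.is_pull_request and d.status == "open"
    ]

prs = open_pull_requests("username/model-name")  # placeholder repo id
for pr in prs:
    print(f"#{pr.num}: {pr.title} (by {pr.author})")
if prs:
    print("Open PRs exist -- do not create another without explicit user confirmation.")
```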
Use --help for the latest workflow guidance. Works with plain Python or uv run:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
1. get-prs → check for existing open PRs first
2. inspect-tables → find table numbers/columns
3. extract-readme --table N → prints YAML by default
4. --apply (push) or --create-pr to write changes

Key options:
- inspect-tables shows all tables in a README with structure, columns, and sample rows
- --table N extracts from a specific table (required when multiple tables exist)
- --model-column-index takes the index from the inspect output; use --model-name-override only with exact column header text
- --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)

⚠️ Important: Running evaluations locally with the inspect-ai library is only possible on devices with uv installed and sufficient GPU memory.
Benefits: no need for the hf_jobs() MCP tool; scripts can be run directly in the terminal.
When to use: when working directly on a local machine with a GPU available.
nvidia-smi
uv run scripts/train_sft_example.py
The skill includes Python scripts in scripts/ to perform operations.
Requirements:
- uv run (the PEP 723 header auto-installs deps), or pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
- HF_TOKEN environment variable with a write-access token
- AA_API_KEY environment variable (for Artificial Analysis imports)
- .env is loaded automatically if python-dotenv is installed

Recommended flow (matches --help):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
Validation checklist:
- Prefer --model-column-index; if using --model-name-override, the column header text must be exact.

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
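For scripted imports, here is a minimal sketch that combines the .env approach with the import-aa call above (python-dotenv loads the keys into the environment that the child process inherits; the slugs and repo id are the same placeholders as above):

```python
# Minimal sketch: load AA_API_KEY / HF_TOKEN from .env, then run import-aa.
# Assumes python-dotenv is installed and .env sits in the working directory.
import subprocess
from dotenv import load_dotenv

load_dotenv()  # exports .env entries into os.environ; the subprocess inherits them

result = subprocess.run(
    [
        "python", "scripts/evaluation_manager.py", "import-aa",
        "--creator-slug", "anthropic",        # placeholder creator slug
        "--model-name", "claude-sonnet-4",    # placeholder model slug
        "--repo-id", "username/model-name",   # placeholder repo id
        "--create-pr",                        # run get-prs first (see above)
    ],
    capture_output=True, text=True,
)
print(result.stdout if result.returncode == 0 else result.stderr)
```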
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
python scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format:
Tasks use the format suite|task|num_fewshot:
- leaderboard|mmlu|5 - MMLU with 5-shot
- leaderboard|gsm8k|5 - GSM8K with 5-shot
- lighteval|hellaswag|0 - HellaSwag zero-shot
- leaderboard|arc_challenge|25 - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format suite|task|num_fewshot|0 (the trailing 0 is a version flag and can be ignored). Common suites include:
- leaderboard - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- lighteval - Additional lighteval tasks
- bigbench - BigBench tasks
- original - Original benchmark tasks

To use a task from the list, extract the suite|task|num_fewshot portion (without the trailing 0) and pass it to the --tasks parameter. For example:
- leaderboard|mmlu|0 → Use: leaderboard|mmlu|0 (or change to 5 for 5-shot)
- bigbench|abstract_narrative_understanding|0 → Use: bigbench|abstract_narrative_understanding|0
- lighteval|wmt14:hi-en|0 → Use: lighteval|wmt14:hi-en|0

Multiple tasks can be specified as comma-separated values: --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
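If you want to pull candidate tasks programmatically, a small sketch that fetches the task list (via the raw GitHub URL for the file linked above) and strips the trailing version flag so entries can be passed straight to --tasks:

```python
# Minimal sketch: fetch lighteval's all_tasks.txt and drop the trailing version flag
# so entries become valid --tasks values (suite|task|num_fewshot).
import urllib.request

ALL_TASKS_URL = (
    "https://raw.githubusercontent.com/huggingface/lighteval/main/"
    "examples/tasks/all_tasks.txt"
)

def to_task_spec(entry: str) -> str:
    """'leaderboard|mmlu|5|0' -> 'leaderboard|mmlu|5'; already-trimmed entries pass through."""
    parts = entry.strip().split("|")
    return "|".join(parts[:3]) if len(parts) >= 4 else entry.strip()

with urllib.request.urlopen(ALL_TASKS_URL) as resp:
    entries = resp.read().decode().splitlines()

leaderboard_tasks = [to_task_spec(e) for e in entries if e.startswith("leaderboard|")]
print(",".join(leaderboard_tasks[:3]))  # comma-separated value suitable for --tasks
```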
inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):
# Run MMLU with vLLM
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
Via HF Jobs:
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
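For reference, the same kind of run expressed directly against inspect-ai's Python API might look roughly like the sketch below. This is an illustration rather than what inspect_vllm_uv.py actually does, and it assumes the benchmark tasks come from the separately installed inspect_evals package (exact task names can vary by version):

```python
# Rough sketch: run an inspect-ai benchmark against a locally served model.
# Assumes `pip install inspect-ai inspect-evals vllm`; the registry task name is an assumption.
from inspect_ai import eval

eval(
    "inspect_evals/gsm8k",                  # assumed registry name from inspect_evals
    model="vllm/meta-llama/Llama-3.2-1B",   # "vllm/" provider runs the model on the local GPU
    # model="hf/meta-llama/Llama-3.2-1B",   # alternative: HuggingFace Transformers backend
)
```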
Available inspect-ai Tasks:
- mmlu - Massive Multitask Language Understanding
- gsm8k - Grade School Math
- hellaswag - Common sense reasoning
- arc_challenge - AI2 Reasoning Challenge
- truthfulqa - TruthfulQA benchmark
- winogrande - Winograd Schema Challenge
- humaneval - Code generation

The helper script auto-selects hardware and simplifies job submission:
# Auto-detect hardware based on model size
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
python scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
Hardware Recommendations:
| Model Size | Recommended Hardware |
|---|---|
| < 3B params | t4-small |
| 3B - 13B | a10g-small |
| 13B - 34B | a10g-large |
| 34B+ | a100-large |
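One plausible way such auto-selection could map a model's parameter count to a flavor, mirroring the table above (select_flavor is a hypothetical helper, not code from run_vllm_eval_job.py):

```python
# Hypothetical sketch: choose an HF Jobs flavor from a parameter count (in billions),
# following the hardware recommendation table above.
def select_flavor(num_params_billions: float) -> str:
    if num_params_billions < 3:
        return "t4-small"
    if num_params_billions < 13:
        return "a10g-small"
    if num_params_billions < 34:
        return "a10g-large"
    return "a100-large"

assert select_flavor(1.2) == "t4-small"
assert select_flavor(70) == "a100-large"
```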
Top-level help and version:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
Inspect Tables (start here):
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
Extract from README:
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
View / Validate:
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
Check Open PRs (ALWAYS run before --create-pr):
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"
or use the Python helper:
python scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."
Run vLLM Evaluation (Custom Models):
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
python scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
The generated model-index follows this structure:
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
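To check what actually landed in the card, the metadata can be read back with huggingface_hub's ModelCard helper; a minimal sketch (the repo id is a placeholder):

```python
# Minimal sketch: read the model-index back from a model card as EvalResult objects.
from huggingface_hub import ModelCard

card = ModelCard.load("username/model-name")  # placeholder repo id
for res in card.data.eval_results or []:
    print(f"{res.dataset_name} / {res.metric_name} ({res.metric_type}): {res.metric_value}")
```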
- Run get-prs before creating any new PR to avoid duplicates
- Start with inspect-tables: see table structure and get the correct extraction command
- Use --help for guidance: run inspect-tables --help to see the complete workflow
- Preview first: nothing is written until you pass --apply or --create-pr
- Use --table N for multi-table READMEs: required when multiple evaluation tables exist
- Use --model-name-override for comparison tables: copy the exact column header from inspect-tables output
- Use --create-pr when updating models you don't own

When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
- Strips markdown formatting (bold **, links []())
- Replaces - and _ with spaces, then compares lowercase tokens
- Example: "OLMo-3-32B" → {"olmo", "3", "32b"} matches "**Olmo 3 32B**" or "[Olmo-3-32B](...)"

The same normalized matching is used both for column-based tables (benchmarks as rows, models as columns) and for transposed tables (models as rows, benchmarks as columns).
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
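A minimal sketch of the normalization and matching rule described above (an illustrative re-implementation, not the script's actual code):

```python
# Illustrative sketch of exact normalized token matching:
# strip markdown (links, bold), map - and _ to spaces, lowercase, compare token sets.
import re

def normalize_tokens(name: str) -> frozenset:
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # keep link text, drop the URL
    name = name.replace("**", "")                          # strip bold markers
    name = re.sub(r"[-_]", " ", name)                      # hyphens/underscores -> spaces
    return frozenset(name.lower().split())

def is_same_model(target: str, cell: str) -> bool:
    return normalize_tokens(target) == normalize_tokens(cell)

assert is_same_model("OLMo-3-32B", "**Olmo 3 32B**")
assert is_same_model("OLMo-3-32B", "[Olmo-3-32B](https://example.com)")
assert not is_same_model("OLMo-3-32B", "Olmo 3 7B")
```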
Update Your Own Model:
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation"
Update Someone Else's Model (Full Workflow):
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
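Tying those steps together programmatically, a hedged sketch that reuses the discussions check from earlier and only opens a PR when none are already open (the repo id is a placeholder):

```python
# Combined sketch of the workflow above: check for open PRs, then create one only if none exist.
import subprocess
from huggingface_hub import HfApi

repo_id = "other-username/their-model"  # placeholder repo id
open_prs = [
    d for d in HfApi().get_repo_discussions(repo_id=repo_id)
    if d.is_pull_request and d.status == "open"
]

if open_prs:
    # Warn the user and surface existing PRs instead of creating a duplicate.
    for pr in open_prs:
        print(f"Existing open PR #{pr.num}: {pr.title}")
else:
    subprocess.run(
        ["uv", "run", "scripts/evaluation_manager.py", "extract-readme",
         "--repo-id", repo_id, "--create-pr"],
        check=True,
    )
```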
Import Fresh Benchmarks:
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
Issue: "No evaluation tables found in README"
Issue: "Could not find model 'X' in transposed table"
- Use --model-name-override with the exact name from the list
- Example: --model-name-override "**Olmo 3-32B**"

Issue: "AA_API_KEY not set"
Issue: "Token does not have write access"
Issue: "Model not found in Artificial Analysis"
Issue: "Payment required for hardware"
Issue: "vLLM out of memory" or CUDA OOM
- Adjust --gpu-memory-utilization, or use --tensor-parallel-size for multi-GPU

Issue: "Model architecture not supported by vLLM"
- Use --backend hf (inspect-ai) or --backend accelerate (lighteval) for HuggingFace Transformers

Issue: "Trust remote code required"
- Add the --trust-remote-code flag for models with custom code (e.g., Phi-2, Qwen)

Issue: "Chat template not found"
- Pass --use-chat-template for instruction-tuned models that include a chat template

Python Script Integration:
import subprocess
import os

def update_model_evaluations(repo_id, readme_content):
    """Update model card with evaluations from README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr"
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")