Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
/plugin marketplace add huggingface/skills
/plugin install hugging-face-evaluation-manager@huggingface-skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
- examples/USAGE_EXAMPLES.md
- examples/artificial_analysis_to_hub.py
- examples/example_readme_tables.md
- examples/metric_mapping.json
- requirements.txt
- scripts/evaluation_manager.py
- scripts/inspect_eval_uv.py
- scripts/inspect_vllm_uv.py
- scripts/lighteval_vllm_uv.py
- scripts/run_eval_job.py
- scripts/run_vllm_eval_job.py
- scripts/test_extraction.py

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
uv integration (version 1.3.0)
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist: warn the user, show the PR URLs, and do not create a new PR unless the user explicitly confirms.
This prevents spamming model repositories with duplicate evaluation PRs.
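For reference, get-prs is roughly equivalent to listing open pull-request discussions with huggingface_hub. A minimal sketch (the script's exact output format may differ):

from huggingface_hub import get_repo_discussions

def open_pull_requests(repo_id: str):
    """Return open PRs for a model repo (what get-prs checks for)."""
    return [
        d for d in get_repo_discussions(repo_id=repo_id)
        if d.is_pull_request and d.status == "open"
    ]

for pr in open_pull_requests("username/model-name"):
    print(f"#{pr.num}: {pr.title} ({pr.url})")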
Use --help for the latest workflow guidance. Works with plain Python or uv run:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
- get-prs → check for existing open PRs first
- inspect-tables → find table numbers/columns
- extract-readme --table N → prints YAML by default
- --apply (push) or --create-pr to write changes
- inspect-tables to see all tables in a README with structure, columns, and sample rows
- --table N to extract from a specific table (required when multiple tables exist)
- --model-column-index (index from inspect output). Use --model-name-override only with exact column header text.
- --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)

Local evaluation (inspect-ai library):

⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.
Benefits: no need for the hf_jobs() MCP tool; scripts can be run directly in the terminal.
When to use: when the user is working directly on a local machine with a GPU available.
nvidia-smi
uv run scripts/train_sft_example.py
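Before launching a local run, it can help to confirm that a GPU is visible and has enough free memory. A minimal sketch, assuming PyTorch is already installed in the environment:

import torch

# Check GPU availability and free memory before launching a local eval run.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; use HF Jobs instead of local evaluation.")

props = torch.cuda.get_device_properties(0)
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU: {props.name}")
print(f"Free memory: {free_bytes / 1e9:.1f} GB / {total_bytes / 1e9:.1f} GB")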
The skill includes Python scripts in scripts/ to perform operations.
- uv run (PEP 723 header auto-installs deps)
- pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
- HF_TOKEN environment variable with a write-access token
- AA_API_KEY environment variable (for Artificial Analysis imports)
- .env is loaded automatically if python-dotenv is installed

Recommended flow (matches --help):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
Validation checklist:
- Use --model-column-index; if using --model-name-override, the column header text must be exact.

Import from Artificial Analysis: fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
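Under the hood, import-aa queries the Artificial Analysis HTTP API using AA_API_KEY. A rough sketch of such a request; the endpoint path, header name, and response field names below are assumptions and may not match the script exactly:

import os
import requests

# Assumed endpoint and response shape -- check the Artificial Analysis API docs.
AA_ENDPOINT = "https://artificialanalysis.ai/api/v2/data/llms/models"

resp = requests.get(AA_ENDPOINT, headers={"x-api-key": os.environ["AA_API_KEY"]})
resp.raise_for_status()

for model in resp.json().get("data", []):
    # Field names below are illustrative; the real payload may differ.
    if model.get("slug") == "claude-sonnet-4":
        print(model.get("name"), model.get("evaluations"))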
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
  --secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
python scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format:
Tasks use the format suite|task|num_fewshot:
- leaderboard|mmlu|5 - MMLU with 5-shot
- leaderboard|gsm8k|5 - GSM8K with 5-shot
- lighteval|hellaswag|0 - HellaSwag zero-shot
- leaderboard|arc_challenge|25 - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file lists all supported tasks in the format suite|task|num_fewshot|0 (the trailing 0 is a flag controlling automatic few-shot truncation and can be ignored here). Common suites include:
- leaderboard - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- lighteval - Additional lighteval tasks
- bigbench - BigBench tasks
- original - Original benchmark tasks

To use a task from the list, extract the suite|task|num_fewshot portion (without the trailing 0) and pass it to the --tasks parameter. For example:
- leaderboard|mmlu|0 → Use: leaderboard|mmlu|0 (or change to 5 for 5-shot)
- bigbench|abstract_narrative_understanding|0 → Use: bigbench|abstract_narrative_understanding|0
- lighteval|wmt14:hi-en|0 → Use: lighteval|wmt14:hi-en|0

Multiple tasks can be specified as comma-separated values: --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
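If task strings are assembled programmatically, a small validator for the suite|task|num_fewshot format can catch typos before a job is submitted. A sketch; the suite names mirror the common suites listed above, not an exhaustive list:

KNOWN_SUITES = {"leaderboard", "lighteval", "bigbench", "original"}

def parse_task_spec(spec: str) -> tuple[str, str, int]:
    """Split 'suite|task|num_fewshot' and sanity-check each part."""
    parts = spec.split("|")
    if len(parts) != 3:
        raise ValueError(f"Expected 'suite|task|num_fewshot', got: {spec!r}")
    suite, task, fewshot = parts
    if suite not in KNOWN_SUITES:
        print(f"Warning: unrecognized suite {suite!r}; check all_tasks.txt")
    return suite, task, int(fewshot)

# Validate a comma-separated --tasks value before passing it on.
for spec in "leaderboard|mmlu|5,leaderboard|gsm8k|5".split(","):
    print(parse_task_spec(spec))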
inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):
# Run MMLU with vLLM
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
Via HF Jobs:
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
Available inspect-ai Tasks:
- mmlu - Massive Multitask Language Understanding
- gsm8k - Grade School Math
- hellaswag - Common sense reasoning
- arc_challenge - AI2 Reasoning Challenge
- truthfulqa - TruthfulQA benchmark
- winogrande - Winograd Schema Challenge
- humaneval - Code generation

The helper script auto-selects hardware and simplifies job submission:
# Auto-detect hardware based on model size
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
python scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
Hardware Recommendations:
| Model Size | Recommended Hardware |
|---|---|
| < 3B params | t4-small |
| 3B - 13B | a10g-small |
| 13B - 34B | a10g-large |
| 34B+ | a100-large |
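The helper's auto-selection follows this table. A sketch of equivalent logic, assuming the parameter count is already known (the helper itself may infer it differently, e.g. from the model config):

def recommend_hardware(num_params_billion: float) -> str:
    """Map model size to an HF Jobs flavor, mirroring the table above."""
    if num_params_billion < 3:
        return "t4-small"
    if num_params_billion <= 13:
        return "a10g-small"
    if num_params_billion <= 34:
        return "a10g-large"
    return "a100-large"

print(recommend_hardware(1.2))   # t4-small
print(recommend_hardware(70.0))  # a100-large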
Top-level help and version:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
Inspect Tables (start here):
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
Extract from README:
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
View / Validate:
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
Check Open PRs (ALWAYS run before --create-pr):
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
  --secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"
or use the Python helper:
python scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."
Run vLLM Evaluation (Custom Models):
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
python scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
The generated model-index follows this structure:
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
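A model-index block like this can also be attached to a card programmatically with huggingface_hub's metadata_update. A minimal sketch; the repo id and all values are placeholders:

from huggingface_hub import metadata_update

metadata = {
    "model-index": [{
        "name": "Model Name",  # exact name from the table, no markdown
        "results": [{
            "task": {"type": "text-generation"},
            "dataset": {"name": "Benchmark Dataset", "type": "benchmark_type"},
            "metrics": [
                {"name": "MMLU", "type": "mmlu", "value": 85.2},
                {"name": "HumanEval", "type": "humaneval", "value": 72.5},
            ],
            "source": {"name": "Source Name", "url": "https://source-url.com"},
        }],
    }]
}

# create_pr=True opens a pull request instead of pushing directly.
metadata_update("username/model-name", metadata, overwrite=True, create_pr=True)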
Best practices:
- Run get-prs before creating any new PR to avoid duplicates
- inspect-tables: See table structure and get the correct extraction command
- --help for guidance: Run inspect-tables --help to see the complete workflow
- Review the printed YAML before using --apply or --create-pr
- --table N for multi-table READMEs: Required when multiple evaluation tables exist
- --model-name-override for comparison tables: Copy the exact column header from inspect-tables output
- Use --create-pr when updating models you don't own

When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
- Strips markdown formatting (bold **, links []())
- Replaces - and _ with spaces before tokenizing
- Example: "OLMo-3-32B" → {"olmo", "3", "32b"} matches "**Olmo 3 32B**" or "[Olmo-3-32B](...)"

The same matching is applied both for column-based tables (benchmarks as rows, models as columns) and for transposed tables (models as rows, benchmarks as columns).
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
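A sketch of the normalized-token comparison described above (the script's actual implementation may differ in details):

import re

def normalize_tokens(name: str) -> set[str]:
    """Strip markdown, replace -/_ with spaces, lowercase, and tokenize."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "").replace("*", "")        # drop bold/italic markers
    name = name.replace("-", " ").replace("_", " ")
    return set(name.lower().split())

# "OLMo-3-32B" matches "**Olmo 3 32B**" and "[Olmo-3-32B](...)" but not "OLMo-3-7B".
print(normalize_tokens("OLMo-3-32B") == normalize_tokens("**Olmo 3 32B**"))                      # True
print(normalize_tokens("OLMo-3-32B") == normalize_tokens("[Olmo-3-32B](https://example.com)"))   # True
print(normalize_tokens("OLMo-3-32B") == normalize_tokens("OLMo-3-7B"))                           # False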
Update Your Own Model:
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation"
Update Someone Else's Model (Full Workflow):
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
Import Fresh Benchmarks:
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
Issue: "No evaluation tables found in README"
Issue: "Could not find model 'X' in transposed table"
- Use --model-name-override with the exact name from the list
- Example: --model-name-override "**Olmo 3-32B**" (copy the column header exactly as it appears)

Issue: "AA_API_KEY not set"
Issue: "Token does not have write access"
Issue: "Model not found in Artificial Analysis"
Issue: "Payment required for hardware"
Issue: "vLLM out of memory" or CUDA OOM
- Reduce --gpu-memory-utilization, or use --tensor-parallel-size for multi-GPU

Issue: "Model architecture not supported by vLLM"
- Use --backend hf (inspect-ai) or --backend accelerate (lighteval) for HuggingFace Transformers

Issue: "Trust remote code required"
- Add the --trust-remote-code flag for models with custom code (e.g., Phi-2, Qwen)

Issue: "Chat template not found"
- Pass --use-chat-template for instruction-tuned models that include a chat template

Python Script Integration:
import subprocess

def update_model_evaluations(repo_id, readme_content):
    """Update model card with evaluations from README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")