Runs local evaluations on Hugging Face Hub models using inspect-ai or lighteval with vllm, Transformers, or accelerate backends for smoke tests and benchmarking.
From antigravity-awesome-skills:

```shell
npx claudepluginhub sickn33/antigravity-awesome-skills --plugin antigravity-awesome-skills
```

This skill uses the workspace's default tool permissions.
Bundled files:

- examples/USAGE_EXAMPLES.md
- scripts/inspect_eval_uv.py
- scripts/inspect_vllm_uv.py
- scripts/lighteval_vllm_uv.py
Use this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.
This skill is for running evaluations against models on the Hugging Face Hub on local hardware.
It covers:
- inspect-ai with local inference
- lighteval with local inference
- vllm, Hugging Face Transformers, and accelerate backends

It does not cover:

- model-index edits
- eval_results generation or publishing

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the hugging-face-jobs skill and pass it one of the local scripts in this skill.
If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to ~/code/community-evals.
All paths below are relative to the directory containing this
SKILL.md.
| Use case | Script |
|---|---|
| Local inspect-ai eval on a Hub model via inference providers | scripts/inspect_eval_uv.py |
| Local GPU eval with inspect-ai using vllm or Transformers | scripts/inspect_vllm_uv.py |
| Local GPU eval with lighteval using vllm or accelerate | scripts/lighteval_vllm_uv.py |
| Extra command patterns | examples/USAGE_EXAMPLES.md |
Requirements:

- uv run for local execution.
- HF_TOKEN for gated/private models.

Preflight checks:

```shell
uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
```
If nvidia-smi is unavailable, either:

- use scripts/inspect_eval_uv.py for lighter provider-backed evaluation, or
- hand off to the hugging-face-jobs skill if the user wants remote compute.

Framework choice:

- inspect-ai when you want explicit task control and inspect-native flows.
- lighteval when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.

Backend choice:

- vllm for throughput on supported architectures.
- Transformers (--backend hf) or accelerate as compatibility fallbacks.

For smoke tests:

- inspect-ai: add --limit 10 or similar.
- lighteval: add --max-samples 10.
- For remote runs, use hugging-face-jobs with the same script + args.

scripts/inspect_eval_uv.py is best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.
```shell
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```
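The nvidia-smi preflight check above determines which of the local paths to take. A minimal sketch of that decision as a shell helper (hypothetical, not part of the skill's scripts; it only echoes a suggestion):

```shell
#!/bin/sh
# Suggest an eval path based on whether a local NVIDIA GPU is visible.
# The suggested paths mirror this skill's bundled scripts.
if command -v nvidia-smi >/dev/null 2>&1; then
  choice="scripts/inspect_vllm_uv.py or scripts/lighteval_vllm_uv.py (local GPU)"
else
  choice="scripts/inspect_eval_uv.py (inference providers) or the hugging-face-jobs skill"
fi
echo "Suggested path: $choice"
```

Either branch keeps the same script arguments, so the later hand-off to hugging-face-jobs stays trivial.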
Use this path when:

- the task is covered by inspect-evals

scripts/inspect_vllm_uv.py is best when you need to load the Hub model directly, use vllm, or fall back to Transformers for unsupported architectures.
Local GPU:
```shell
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```
Transformers fallback:
```shell
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```
scripts/lighteval_vllm_uv.py is best when the task is naturally expressed as a lighteval task string, especially Open LLM Leaderboard style benchmarks.
Local GPU:
```shell
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```
accelerate fallback:
```shell
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```
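The switch between the vllm invocation and the accelerate fallback above differs only in the extra flags, so it can be scripted. A sketch of building that command (the support check is hard-coded here purely for illustration; only flags shown in this skill are used):

```shell
#!/bin/sh
# Build the lighteval command, appending the compatibility flags when the
# architecture is assumed unsupported by vllm.
MODEL="microsoft/phi-2"
VLLM_SUPPORTED="no"   # assumption for this example
CMD="uv run scripts/lighteval_vllm_uv.py --model $MODEL --tasks 'leaderboard|mmlu|5' --max-samples 20"
if [ "$VLLM_SUPPORTED" = "no" ]; then
  CMD="$CMD --backend accelerate --trust-remote-code"
fi
echo "$CMD"
```

The same pattern applies to inspect_vllm_uv.py with --backend hf.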
This skill intentionally stops at local execution and backend selection.
If the user wants to run the same evaluation remotely, switch to the hugging-face-jobs skill and pass it one of these scripts plus the chosen arguments.
inspect-ai examples:
- mmlu
- gsm8k
- hellaswag
- arc_challenge
- truthfulqa
- winogrande
- humaneval

lighteval task strings use suite|task|num_fewshot:

- leaderboard|mmlu|5
- leaderboard|gsm8k|5
- leaderboard|arc_challenge|25
- lighteval|hellaswag|0

Multiple lighteval tasks can be comma-separated in --tasks.
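The suite|task|num_fewshot convention can be split mechanically, which is handy when looping over benchmarks. A short illustrative sketch (the variable names are mine, not lighteval's):

```shell
#!/bin/sh
# Split a comma-separated lighteval --tasks value into its parts.
TASKS="leaderboard|mmlu|5,leaderboard|gsm8k|5"
IFS=','
for t in $TASKS; do
  suite=$(echo "$t" | cut -d'|' -f1)
  task=$(echo "$t" | cut -d'|' -f2)
  shots=$(echo "$t" | cut -d'|' -f3)
  echo "suite=$suite task=$task fewshot=$shots"
done
unset IFS
```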
- inspect_vllm_uv.py --backend vllm for fast GPU inference on supported architectures.
- inspect_vllm_uv.py --backend hf when vllm does not support the model.
- lighteval_vllm_uv.py --backend vllm for throughput on supported models.
- lighteval_vllm_uv.py --backend accelerate as the compatibility fallback.
- inspect_eval_uv.py when Inference Providers already cover the model and you do not need direct GPU control.

| Model size | Suggested local hardware |
|---|---|
| < 3B | consumer GPU / Apple Silicon / small dev GPU |
| 3B - 13B | stronger local GPU |
| 13B+ | high-memory local GPU or hand off to hugging-face-jobs |
For smoke tests, prefer cheaper local runs plus --limit or --max-samples.
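The sizing table above can be expressed as a simple lookup when scripting smoke tests across models. A sketch (the tier function is illustrative, not part of the skill):

```shell
#!/bin/sh
# Map a rough parameter count (in billions) to the hardware tier above.
tier() {
  if [ "$1" -lt 3 ]; then
    echo "consumer GPU / Apple Silicon / small dev GPU"
  elif [ "$1" -le 13 ]; then
    echo "stronger local GPU"
  else
    echo "high-memory local GPU or hugging-face-jobs"
  fi
}
tier 1    # < 3B
tier 7    # 3B - 13B
tier 70   # 13B+
```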
If you hit out-of-memory errors, try:

- lowering --batch-size
- lowering --gpu-memory-utilization
- handing off to hugging-face-jobs

If vllm does not support the model:

- --backend hf for inspect-ai
- --backend accelerate for lighteval

For gated or private models, set HF_TOKEN. For models that ship custom code, pass --trust-remote-code.

See:
- examples/USAGE_EXAMPLES.md for local command patterns
- scripts/inspect_eval_uv.py
- scripts/inspect_vllm_uv.py
- scripts/lighteval_vllm_uv.py