# /eval — Legacy Evaluation Shortcut

Prefer `/baseline` for base-model evaluation and `/run-experiment` for post-training evaluation, but keep this command for direct inspection.

## Steps

1. Resolve the base model or adapter the user wants to evaluate.
2. Ask whether to use a standard benchmark or a custom eval dataset (both call shapes are sketched after these steps):
   - **Standard**: `run_evaluation(benchmark="mmlu")` — measures general knowledge (200 samples for speed)
   - **Custom**: `run_evaluation(eval_dataset="<your-dataset>", eval_split="test")` — measures loss/perplexity on your own task data (more informative for task-specific fine-tunes)
3. Run `run_evaluation` with the resolved parameters.
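For concreteness, a minimal sketch of the two call shapes from step 2, assuming `run_evaluation` is in scope as the plugin's evaluation tool; it uses only the parameters documented above, and the custom dataset name is a hypothetical placeholder.

```python
# Minimal sketch: assumes run_evaluation is available in scope as the
# plugin's evaluation tool. Only parameters documented above are used.

# Standard benchmark: general-knowledge score on MMLU (200 samples for speed).
run_evaluation(benchmark="mmlu")

# Custom dataset: loss/perplexity on your own task data, scored on its
# "test" split; more informative for task-specific fine-tunes.
# NOTE: "my-org/task-data" is a hypothetical placeholder dataset name.
run_evaluation(eval_dataset="my-org/task-data", eval_split="test")
```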