Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC clusters, or cloud platforms. NVIDIA's enterprise-grade platform, with a container-first architecture for reproducible benchmarking.
Install by adding the marketplace, then installing the plugin:

/plugin marketplace add zechenzhangAGI/AI-research-SKILLs
/plugin install zechenzhangagi-nemo-evaluator-sdk-11-evaluation-nemo-evaluator@zechenzhangAGI/AI-research-SKILLs
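Once installed, the skill drives NeMo Evaluator's launcher, which is configured declaratively and dispatches each benchmark's harness container to the chosen backend. Below is a minimal sketch of what a run configuration might look like, assuming the launcher's Hydra-style YAML layout; the field names (`execution`, `deployment`, `target`, `evaluation`) and all values are illustrative assumptions, not a verified schema:

```yaml
# Illustrative run config -- field names are assumptions, not verified
# against the launcher's actual schema.
defaults:
  - execution: local      # swap for slurm or a cloud executor to change backend
  - deployment: none      # evaluate an endpoint that is already serving
  - _self_

execution:
  output_dir: ./results   # per-benchmark results and logs land here

target:
  api_endpoint:
    url: http://localhost:8000/v1/chat/completions  # hypothetical endpoint
    model_id: my-model                              # hypothetical served model name

evaluation:
  tasks:
    - name: mmlu          # benchmark names come from the harness catalogs
    - name: gsm8k
```

A run would then be launched with something like `nemo-evaluator-launcher run --config-dir . --config-name my_eval` (command shape assumed from the launcher's CLI conventions). Because each harness is containerized, the same config is meant to reproduce the same benchmark setup whether it executes on Docker, Slurm, or a cloud backend.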