Plugin

evals

AI agent evaluation framework based on Anthropic best practices. Create use cases, LLM judges, A/B prompt tests, and model comparisons.

Capabilities

Commands

Agents

Skills

Hooks

MCP Servers

LSP Servers

Install

Run in your terminal

npx claudepluginhub markac007/cg-claude-workspaces-plugins --plugin evals

Components

Commands (6)

/evals:compare-models -- Compare multiple models on the same prompt to find the best performer

/compare-models

- Prerequisite: an existing use case with test cases and a prompt
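To illustrate what comparing models on the same prompt can look like, here is a minimal sketch that ranks models by mean score over a shared set of test cases. The `rank_models` function, the model names, and the score values are hypothetical and are not taken from the plugin. How the plugin itself aggregates scores is not stated on this page.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_models(scores_by_model: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (model, mean score) pairs sorted best-first.

    Models with no scores are skipped, since a mean is undefined for them.
    """
    ranked = [
        (model, mean(scores))
        for model, scores in scores_by_model.items()
        if scores
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-test-case scores in [0, 1]; every model sees the same test cases.
example_scores = {
    "model-a": [1.0, 0.5, 1.0],
    "model-b": [0.5, 0.5, 1.0],
}
ranking = rank_models(example_scores)
```

Keeping the test cases identical across models is what makes the mean scores comparable.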

/evals:compare-prompts -- A/B test two prompt versions to find the better performer

/compare-prompts

**Implements the Science Protocol for prompt experimentation.**
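This page does not describe the steps of the Science Protocol, so the following is only a generic sketch of a paired A/B comparison, not the plugin's method. It assumes both prompt versions were scored on the same test cases in the same order. The `compare_prompts` function and its return fields are hypothetical.

```python
from typing import Dict, List


def compare_prompts(scores_a: List[float], scores_b: List[float]) -> Dict[str, float]:
    """Compare per-test-case scores for prompt A and prompt B.

    Both lists must cover the same test cases in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both prompts must be scored on the same test cases")
    if not scores_a:
        raise ValueError("at least one test case is required")

    pairs = list(zip(scores_a, scores_b))
    return {
        "wins_a": sum(1 for a, b in pairs if a > b),
        "wins_b": sum(1 for a, b in pairs if b > a),
        "ties": sum(1 for a, b in pairs if a == b),
        # Positive values favour prompt B, negative values favour prompt A.
        "mean_diff_b_minus_a": sum(b - a for a, b in pairs) / len(pairs),
    }


summary = compare_prompts([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
```

With only a handful of test cases, a small difference in either direction may be noise, so the win and tie counts are worth reading alongside the mean.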

/evals:create-judge -- Create a custom LLM-as-Judge for evaluating AI outputs

/create-judge
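As a rough idea of what an LLM-as-Judge involves, the sketch below builds a grading prompt from a list of criteria and parses a numeric score out of the judge model's reply. The template wording, the 1 to 5 scale, the `SCORE:` line format, and both function names are assumptions made for illustration. The judge format the plugin actually generates is not shown on this page, and no model API is called here.

```python
import re
from typing import List, Optional

# Hypothetical grading template; the 1-5 scale is an assumption.
JUDGE_TEMPLATE = """You are grading an AI response against these criteria:
{criteria}

Response to grade:
{response}

Explain your reasoning briefly, then end with a line of the form SCORE: <1-5>."""


def build_judge_prompt(criteria: List[str], response: str) -> str:
    """Fill the template with a bulleted criteria list and the response to grade."""
    bullet_list = "\n".join(f"- {item}" for item in criteria)
    return JUDGE_TEMPLATE.format(criteria=bullet_list, response=response)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the score from a `SCORE: n` line, or return None if it is missing."""
    match = re.search(r"^SCORE:\s*([1-5])\s*$", judge_reply, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


prompt = build_judge_prompt(["Answers the question", "Cites no invented facts"], "Paris.")
score = parse_score("The answer is correct and concise.\nSCORE: 4")
```

Returning `None` for an unparseable reply lets the caller retry or flag the case instead of silently recording a wrong score.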


/evals:create-use-case -- Create a new evaluation use case with test cases and scoring criteria

/create-use-case

Save all generated config files to `~/Downloads/evals/<name>/` before moving to the project.

/evals:run-eval -- Execute an evaluation suite against test cases

/run-eval

- Prerequisite: the use case's `config.yaml` exists

/evals:view-results -- Query and display evaluation results and trends

/view-results

- Prerequisite: evaluations have been run (results are stored as JSON files in `Results/`)
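Since results are stored as JSON files in `Results/`, they can also be inspected outside the plugin. The sketch below loads every JSON file in a directory and collects a numeric field across runs. The `score` field name, the file names, and both helper functions are hypothetical, because this page does not document the result schema. The demo writes throwaway files to a temporary directory rather than reading real results.

```python
import json
import tempfile
from pathlib import Path


def load_results(results_dir):
    """Load every *.json file in results_dir, sorted by file name."""
    results = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results.append({"file": path.name, "data": json.load(handle)})
    return results


def score_trend(results, key="score"):
    """Collect the numeric `key` field from each result, skipping files that lack it."""
    return [
        item["data"][key]
        for item in results
        if isinstance(item["data"], dict)
        and isinstance(item["data"].get(key), (int, float))
    ]


# Demo with throwaway files; the "score" field is an assumed schema.
with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp) / "Results"
    results_dir.mkdir()
    (results_dir / "run-001.json").write_text(json.dumps({"score": 0.6}), encoding="utf-8")
    (results_dir / "run-002.json").write_text(json.dumps({"score": 0.8}), encoding="utf-8")
    trend = score_trend(load_results(results_dir))
```

Sorting by file name gives a chronological trend only if the file names sort in run order, which is an assumption here.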

Stats

Version: 1.0.0

Parent Repo Stars: 0

Maintenance: Fair

Available In

cg-plugins
