
prompt-evaluation-runner | agent-evaluation-lab

Skill

prompt-evaluation-runner

From agent-evaluation-lab

Use when evaluating prompts, LLM outputs, red-team suites, or model behavior with local eval configs and safe provider/cost controls.

Popularity

Parent stars: 5
Parent forks: 1
Shared by: 2

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agent-evaluation-lab:prompt-evaluation-runner

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this skill when you need to evaluate an LLM app, test a prompt, or run red-teaming/vulnerability scans against a target model or application.

Supporting Files

references/eval-config-patterns.md

SKILL.md

39 lines · ~522 tokens

Stats

Language: TypeScript
Parent stars: 5
Parent forks: 1
Maintenance: Excellent
Last Commit: May 6, 2026

Prompt Evaluation Runner

When to use

Use this skill when you need to evaluate an LLM app, test a prompt, or run red-teaming/vulnerability scans against a target model or application.

Requirements / Checks

Check whether an evaluation tool is already defined in the project's dependencies, scripts, lockfiles, or local toolchain.
Do not run unpinned external commands such as npx <evaluation-tool>@latest directly.
If the tool is missing, ask before adding a local dev dependency or using an ephemeral runner.
Confirm expected cost, provider, API keys, and network target before execution.
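
The first check above can be sketched as a small lookup over a Node-style manifest. This is a minimal illustration, not part of the skill itself: it assumes the project uses a package.json, and the tool name passed in (example-eval-tool in the usage note) is a hypothetical placeholder. Lockfiles and other toolchains would need their own checks.

```python
import json
from pathlib import Path


def find_eval_tool(project_dir: str, tool_name: str) -> list[str]:
    """Return the package.json locations where tool_name is declared.

    Assumption: a Node-style project. An empty list means "not found",
    which is the cue to ask the user before installing anything.
    """
    manifest = Path(project_dir) / "package.json"
    if not manifest.is_file():
        return []

    data = json.loads(manifest.read_text(encoding="utf-8"))
    found = []

    # Declared as a regular or dev dependency.
    for section in ("dependencies", "devDependencies"):
        if tool_name in data.get(section, {}):
            found.append(section)

    # Invoked from an npm script (matched as a whole word in the command).
    for script_name, command in data.get("scripts", {}).items():
        if tool_name in command.split():
            found.append(f"scripts.{script_name}")

    return found
```

Calling find_eval_tool(".", "example-eval-tool") on a project that lists the tool under devDependencies and uses it in an "eval" script would return both locations. An empty result means nothing should be installed or run until the user agrees.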

Workflow

Define risk: State target behavior, failure mode, provider(s), and budget limits.
Choose assertions: Prefer deterministic checks first: exact/contains/regex/JSON/schema/javascript/python/cost/latency.
Use model graders sparingly: Pin grader provider/model and explain cost/non-determinism.
Configure minimally: Keep config to description, env refs, prompts, providers, default assertions, and tests.
Handle env safely: Use templated env references such as {{env.NAME}}; never hardcode keys.
Execute locally: Run the smallest suite first. Ask before starting long, paid, red-team, or production-targeted runs.
Analyze failures: Separate prompt failures, provider variance, flaky graders, bad fixtures, and config mistakes.
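
To illustrate the "deterministic checks first" step, here is a minimal sketch of contains, regex, JSON, and cost/latency assertions written as plain functions. The EvalResult type, the check names, and the thresholds are all hypothetical and chosen for illustration only. A real evaluation tool would express these checks in its own config format.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    output: str        # raw model output under test
    latency_ms: float  # measured wall-clock latency
    cost_usd: float    # provider-reported or estimated cost


def contains(result: EvalResult, needle: str) -> bool:
    return needle in result.output


def matches(result: EvalResult, pattern: str) -> bool:
    return re.search(pattern, result.output) is not None


def is_json(result: EvalResult) -> bool:
    try:
        json.loads(result.output)
    except json.JSONDecodeError:
        return False
    return True


def within_budget(result: EvalResult, max_latency_ms: float, max_cost_usd: float) -> bool:
    return result.latency_ms <= max_latency_ms and result.cost_usd <= max_cost_usd


def failed_checks(result: EvalResult) -> list[str]:
    """Run every deterministic check and return the names of those that failed."""
    checks = {
        "contains:status": contains(result, '"status"'),
        "regex:ok-or-error": matches(result, r'"status":\s*"(ok|error)"'),
        "is-json": is_json(result),
        # Illustrative limits only; real budgets come from the user.
        "budget": within_budget(result, max_latency_ms=2000, max_cost_usd=0.01),
    }
    return [name for name, passed in checks.items() if not passed]
```

Because each check is a pure function of the recorded output, a failure is reproducible. That makes it easier to separate prompt failures from provider variance or flaky graders when analyzing results.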

Safety Constraints

Do NOT log, echo, or store secrets (API keys) in configuration files or chat output.
Do NOT run evaluations against production endpoints without user consent.
Avoid executing arbitrary remote code or unvetted plugins during evaluation.
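
One way to honor the first constraint is to mask known secret values before any text reaches logs or chat output. The sketch below is a minimal illustration under stated assumptions: the function name and the [REDACTED:NAME] placeholder format are hypothetical, and it can only mask values the caller already knows about.

```python
def redact_secrets(text: str, secrets: dict[str, str]) -> str:
    """Replace each known secret value in text with a named placeholder.

    secrets maps a label (for example an env var name) to its value.
    Longer values are replaced first so that a secret that happens to be
    a substring of another one does not leave a partial leak behind.
    """
    ordered = sorted(secrets.items(), key=lambda item: len(item[1]), reverse=True)
    for name, value in ordered:
        if value:  # skip empty values, which would otherwise match everywhere
            text = text.replace(value, f"[REDACTED:{name}]")
    return text
```

In practice the mapping would be built from the specific environment variables that the evaluation config references, for example {name: os.environ.get(name, "")} for each name, and applied to every string before it is printed or summarized.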

Validation / Done Criteria

Evaluation config is valid, minimal, and uses safe env refs.
Deterministic assertions exist where possible.
Run scope, provider, and cost are reported.
Results are summarized without leaking sensitive data.
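
The "safe env refs" criterion can be spot-checked with a small audit over the config text. This is a heuristic sketch, not a validator for any particular tool: it looks for {{env.NAME}} references and flags lines where a key-like field appears to hold a literal value. The field names in the pattern (apiKey, token, secret) and the 16-character threshold are assumptions made for illustration.

```python
import re

# Templated references of the form {{env.NAME}}.
ENV_REF = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

# Heuristic: a key-like field followed by a long literal token.
SUSPECT = re.compile(
    r"(api[_-]?key|token|secret)\s*[:=]\s*[\"']?([A-Za-z0-9_\-]{16,})",
    re.IGNORECASE,
)


def audit_config(text: str) -> tuple[list[str], list[int]]:
    """Return (env var names referenced, 1-based line numbers that look hardcoded)."""
    env_refs = sorted(set(ENV_REF.findall(text)))
    suspect_lines = [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SUSPECT.search(line)
    ]
    return env_refs, suspect_lines
```

A config passes this check when the second list is empty. The first list doubles as the set of variables to confirm with the user before a run. Because the pattern is only a heuristic, a clean result does not prove the config is free of secrets.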

References

references/eval-config-patterns.md

$ npx claudepluginhub yeaight7/agent-powerups --plugin agent-evaluation-lab
