Your AI writes fast. Temper makes it last.
Intent-driven development with behavioral testing and quality gates for AI-generated code
Website | Getting Started | Releases
AI writes code fast. But "fast" without "right" creates bugs, technical debt, and features that miss the point.
"Why not just tell Claude to be careful?"
You can, and it helps. But AI-generated code has structural failure patterns that "be careful" doesn't address. These aren't sloppiness; they're limitations of how LLMs generate code, and they map to three unanswered questions:
| Question | What Goes Wrong Without It |
|---|---|
| Did we solve the problem? | Feature works but nobody uses it. Wrong problem solved. |
| Does it do the right things? | Happy path works, edge cases ship broken. |
| Does the code work? | Tests pass, but they test implementation details, not behaviors. |
Most AI tools answer only the third. Temper answers all three.
Temper combines three development methodologies in a single artifact called intent.md. Each layer answers a different question and is enforced at a different stage of the pipeline:
```
intent.md
|
+-- Intent Section (IDD)         WHY are we building this?
|     Problem statement
|     Success criteria (each with a Validate: type)
|     Constraints
|
+-- Scenarios Section (BDD)      WHAT should it do?
|     Gherkin Given/When/Then
|     Derived BEFORE architecture
|     Every planned file traces to a scenario
|
+-- /temper:build (TDD)          HOW do we build it?
      Tests written from scenarios
      RED -> GREEN -> REFACTOR
```
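To sketch the BDD-to-TDD handoff described above, here is what a test derived from a "Successful password reset" scenario might look like. All names here (`FakeAuthService`, the method signatures) are hypothetical illustrations, not Temper's actual output or API:

```python
# Hypothetical sketch: a test derived from a Gherkin scenario such as
#   Given a user with a registered email
#   When they reset their password
#   Then they can log in with the new password
# Names are illustrative, not Temper's API.

class FakeAuthService:
    """Minimal in-memory stand-in so the sketch is runnable."""

    def __init__(self):
        self.users = {}

    def register(self, email, password):
        self.users[email] = password

    def reset_password(self, email, new_password):
        if email not in self.users:
            raise KeyError("unknown email")
        self.users[email] = new_password

    def login(self, email, password):
        return self.users.get(email) == password


def test_successful_password_reset():
    # Given a user with a registered email
    auth = FakeAuthService()
    auth.register("user@example.com", "old-secret")
    # When they reset their password
    auth.reset_password("user@example.com", "new-secret")
    # Then they can log in with the new password (and not the old one)
    assert auth.login("user@example.com", "new-secret")
    assert not auth.login("user@example.com", "old-secret")


test_successful_password_reset()
```

Note that the test asserts the behavior the scenario promises (login succeeds with the new password), not implementation details like hashing internals; this is the distinction the "Does the code work?" row above is drawing.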
**Question:** Did we solve the problem?
**When:** Defined during /temper:plan, validated during /temper:review
IDD captures the why behind a feature. Not "add a password reset endpoint" but "users should be able to reset their password without contacting support, completing the flow in under 2 minutes."
The Intent section of intent.md contains:

- A problem statement
- Success criteria, each with a Validate: type that tells review how to check it
- Constraints

Each success criterion gets a validation type. This is what makes IDD mechanical instead of subjective:
| Type | What It Means | How Review Checks It | Example |
|---|---|---|---|
| scenario | Criterion is satisfied when a linked BDD scenario's test passes | Finds the test, runs it, checks PASS | "Users can reset password" -> linked to scenario "Successful password reset" |
| code | Criterion is satisfied when specific code exists | Greps the codebase for the pattern | "POST /api/reset endpoint exists" -> greps for route definition |
| metric | Can't be verified before deployment | Flags for post-deploy monitoring | "Support tickets decrease 30%" -> requires production data |
| manual | Requires human judgment | Flags for human review, non-blocking | "Reset flow feels intuitive" -> UX review needed |
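Putting the table together, the success criteria block of an intent.md might be written like this (the exact syntax is illustrative, not Temper's canonical format):

```markdown
## Success Criteria

- Users can reset their password without contacting support.
  Validate: scenario -> "Successful password reset"
- A POST /api/reset endpoint exists.
  Validate: code -> route definition for POST /api/reset
- Password-reset support tickets decrease 30%.
  Validate: metric -> post-deploy monitoring
- The reset flow feels intuitive.
  Validate: manual -> UX review
```

Under this reading, /temper:review can mechanically resolve the first two criteria (run the linked test, grep for the route) and flags the last two for monitoring and human judgment respectively.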