From llm-evals
Specializes in promptfoo configuration, prompt regression testing, and multi-provider comparison.

```shell
npx claudepluginhub vanman2024/ai-dev-marketplace --plugin llm-evals
```

Capabilities:

1. **Configuration** - Create promptfooconfig.yaml files
2. **Providers** - Configure OpenAI, Anthropic, Google, local models
3. **Assertions** - Define pass/fail criteria
4. **Variables** - Dynamic test case ...
I specialize in promptfoo - the open-source prompt engineering tool for testing and evaluating LLM prompts. I handle configuration, test cases, assertions, and multi-provider comparisons.
A basic `promptfooconfig.yaml`:

```yaml
description: 'My prompt evaluation'

prompts:
  - file://prompts/system.txt
  - file://prompts/user.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-3-5-sonnet-20241022

tests:
  - vars:
      question: 'What is the capital of France?'
    assert:
      - type: contains
        value: 'Paris'
      - type: llm-rubric
        value: 'Answer is factually correct and concise'
```
Providers accept per-model configuration:

```yaml
providers:
  - id: openai:gpt-4o
    config:
      temperature: 0.7
  - id: anthropic:claude-3-5-sonnet-20241022
    config:
      max_tokens: 1024
  - id: google:gemini-2.0-flash
```
promptfoo supports several assertion types, from exact string checks to LLM-graded rubrics:

```yaml
tests:
  - vars:
      input: 'Summarize this article...'
    assert:
      # Exact match
      - type: equals
        value: 'Expected output'
      # Contains check
      - type: contains
        value: 'key phrase'
      # Regex
      - type: regex
        value: "\\d{4}-\\d{2}-\\d{2}"
      # JSON validation
      - type: is-json
      # LLM-based evaluation
      - type: llm-rubric
        value: |
          The response should:
          1. Be factually accurate
          2. Be under 100 words
          3. Not contain hallucinations
      # Similarity
      - type: similar
        value: 'Expected similar text'
        threshold: 0.8
      # Custom function
      - type: javascript
        value: 'output.length < 500'
```
Test cases can also be loaded from an external file:

```yaml
tests: file://datasets/test_cases.json
```

Where `datasets/test_cases.json` contains:

```json
[
  {
    "vars": {
      "question": "What is 2+2?",
      "context": "Basic math"
    },
    "assert": [{ "type": "contains", "value": "4" }]
  }
]
```
```shell
# Run evaluation
npx promptfoo eval

# Run with specific config
npx promptfoo eval -c custom-config.yaml

# Generate HTML report
npx promptfoo eval --output results.html

# View results in browser
npx promptfoo view

# Compare outputs
npx promptfoo eval --table
```