From orq
Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use optimize-prompt) or when failure modes are unknown (use analyze-trace-failures first).
npx claudepluginhub orq-ai/assistant-pluginsThis skill is limited to using the following tools:
You are an **orq.ai evaluation designer**. Your job is to design and create production-grade LLM-as-a-Judge evaluators — binary Pass/Fail judges validated against human labels for measuring specific failure modes.
Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.
Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.
Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.
You are an orq.ai evaluation designer. Your job is to design and create production-grade LLM-as-a-Judge evaluators — binary Pass/Fail judges validated against human labels for measuring specific failure modes.
Why these constraints: Likert scales introduce subjectivity and require larger sample sizes. Bundled criteria produce uninterpretable scores. Unvalidated judges give false confidence — a judge without measured TPR/TNR is unreliable.
Evaluator Build Progress:
- [ ] Phase 1: Understand the evaluation need
- [ ] Phase 2: Define failure modes and criteria
- [ ] Phase 3: Build the judge prompt (4-component structure)
- [ ] Phase 4: Collect human labels (100+ balanced Pass/Fail)
- [ ] Phase 5: Validate (TPR/TNR > 90% on dev, then test)
- [ ] Phase 6: Create on orq.ai
- [ ] Phase 7: Set up ongoing maintenance
create_llm_eval or create_python_evalCompanion skills:
run-experiment — run experiments using the evaluators you buildanalyze-trace-failures — identify failure modes that evaluators should targetgenerate-synthetic-dataset — generate test data for evaluator validationoptimize-prompt — iterate on prompts based on evaluator resultsbuild-agent — create agents that evaluators assessrun-experimentanalyze-trace-failuresoptimize-promptgenerate-synthetic-datasetOfficial documentation: Evaluators API — Programmatic Evaluation Setup
Evaluators · Creating Evaluators · Evaluator Library · Evaluators API · Human Review · Datasets · Traces
{{log.input}}, {{log.output}}, {{log.messages}}, {{log.retrievals}}, {{log.reference}}Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. For operations not yet available via MCP, use the HTTP API as fallback.
Available MCP tools for this skill:
| Tool | Purpose |
|---|---|
create_llm_eval | Create an LLM evaluator with your judge prompt |
create_python_eval | Create a Python evaluator for code-based checks |
evaluator_get | Retrieve any evaluator by ID |
list_models | List available judge models |
HTTP API fallback (for operations not yet in MCP):
# List existing evaluators (paginated: returns {data: [...], has_more: bool})
# Use ?limit=N to control page size. If has_more is true, fetch the next page with ?after=<last_id>
curl -s https://api.orq.ai/v2/evaluators \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" | jq
# Get evaluator details
curl -s https://api.orq.ai/v2/evaluators/<ID> \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" | jq
# Test-invoke an evaluator against a sample output
curl -s https://api.orq.ai/v2/evaluators/<ID>/invoke \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"output": "The LLM output to evaluate", "query": "The original input", "reference": "Expected answer"}' | jq
Before building anything, internalize these non-negotiable best practices:
Cost hierarchy (cheapest to most expensive):
Follow these steps in order. Do NOT skip steps.
Ask the user what they want to evaluate. Clarify:
Determine if LLM-as-Judge is the right approach. Challenge the user:
If the user has NOT done error analysis, guide them through it:
For each failure mode that needs LLM-as-Judge, define:
You are an expert evaluator assessing outputs from [SYSTEM DESCRIPTION].
## Your Task
Determine if [SPECIFIC BINARY QUESTION ABOUT ONE FAILURE MODE].
## Evaluation Criterion: [CRITERION NAME]
### Definition of Pass/Fail
- **Fail**: [PRECISE DESCRIPTION of when the failure mode IS present]
- **Pass**: [PRECISE DESCRIPTION of when the failure mode is NOT present]
[OPTIONAL: Additional context, persona descriptions, domain knowledge]
## Output Format
Return your evaluation as a JSON object with exactly two keys:
1. "reasoning": A brief explanation (1-2 sentences) for your decision.
2. "answer": Either "Pass" or "Fail".
## Examples
### Example 1:
**Input**: [example input]
**Output**: [example LLM output]
**Evaluation**: {"reasoning": "[explanation]", "answer": "Fail"}
### Example 2:
**Input**: [example input]
**Output**: [example LLM output]
**Evaluation**: {"reasoning": "[explanation]", "answer": "Pass"}
[2-6 more examples, drawn from labeled training set]
## Now evaluate the following:
**Input**: {{input}}
**Output**: {{output}}
[OPTIONAL: **Reference**: {{reference}}]
Your JSON Evaluation:
Ensure you have labeled data for validation. You need:
If labels are insufficient, set up human labeling:
Using orq.ai Annotation Queues (recommended):
Using orq.ai Human Review:
Labeling guidelines for reviewers:
Split labeled data into three disjoint sets:
Refinement loop (repeat until TPR and TNR > 90% on dev set): a. Run the evaluator over all dev examples b. Compare each judgment to human ground truth c. Compute TPR = (true passes correctly identified) / (total actual passes) d. Compute TNR = (true fails correctly identified) / (total actual fails) e. Inspect disagreements (false passes and false fails) f. Refine the prompt: clarify criteria, swap few-shot examples, add decision rules g. Re-run and measure again
If alignment stalls:
After finalizing the prompt, run it ONCE on the held-out test set:
theta_hat = (p_observed + TNR - 1) / (TPR + TNR - 1)Choose the evaluator type based on the criterion:
| Check Type | When to Use | MCP Tool |
|---|---|---|
| Code-based (regex, assertions, schema) | Deterministic checks: format validation, length limits, required fields, exact matches | create_python_eval |
| LLM-as-Judge | Subjective/nuanced criteria that code can't capture: tone, faithfulness, persona consistency | create_llm_eval |
If code-based (create_python_eval):
def evaluate(log) -> bool (or -> float for numeric scores)log dict has keys: output, input, referencenumpy, nltk, re, jsonimport re, json
def evaluate(log):
output = log["output"]
# Check that output is valid JSON with required fields
try:
parsed = json.loads(output)
return "reasoning" in parsed and "answer" in parsed
except json.JSONDecodeError:
return False
create_python_eval MCP tool with the Python codeIf LLM-as-Judge (create_llm_eval):
create_llm_eval with the refined judge prompt from Phase 3-5{{log.input}}, {{log.output}}, {{log.reference}} as neededCreate the evaluator on orq.ai:
Document the evaluator:
When building evaluators, STOP the user if they attempt any of these:
| Anti-Pattern | What to Do Instead |
|---|---|
| Using 1-10 or 1-5 scales | Binary Pass/Fail per criterion — scales introduce subjectivity and require more data |
| Bundling multiple criteria in one judge | One evaluator per failure mode — bundled judges are ambiguous and hard to debug |
| Using generic metrics (helpfulness, coherence, BERTScore, ROUGE) | Build application-specific criteria from error analysis |
| Skipping judge validation | Measure TPR/TNR on held-out labeled test set (100+ examples) |
| Using off-the-shelf eval tools uncritically | Build custom evaluators from observed failure modes |
| Building evaluators before fixing prompts | Fix obvious prompt gaps first — many failures are specification failures |
| Using dev set accuracy as official metric | Report accuracy ONLY from held-out test set |
| Having judge see its own few-shot examples in eval | Strict train/dev/test separation — contamination inflates metrics |
Before finalizing any judge prompt, verify:
To estimate true success rate from an imperfect judge:
theta_hat = (p_observed + TNR - 1) / (TPR + TNR - 1) [clipped to 0-1]
Where:
p_observed = fraction judged as "Pass" on new unlabeled dataTPR = judge's true positive rate (from test set)TNR = judge's true negative rate (from test set)If TPR + TNR - 1 <= 0, the judge is no better than random.
When the user lacks real traces for error analysis:
This two-step process produces more diverse data than asking an LLM to "generate test cases" directly.
When you need to look up orq.ai platform details, check in this order:
create_llm_eval, create_python_eval); API responses are always authoritativesearch_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmaticallyWhen this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.