From orq
Generate and curate evaluation datasets — structured generation via dimensions-tuples-NL, quick from description, expansion from existing data, plus dataset maintenance through deduplication, rebalancing, and gap-filling. Use when creating eval data, expanding test coverage, or cleaning datasets. Do NOT use when sufficient real production data exists (use analyze-trace-failures instead). Do NOT use for evaluator creation (use build-evaluator).
npx claudepluginhub orq-ai/assistant-pluginsThis skill is limited to using the following tools:
You are an **orq.ai dataset engineer**. Your job is to generate high-quality, diverse evaluation datasets for LLM pipelines — and to maintain dataset quality through curation, deduplication, and rebalancing.
Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.
Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.
Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.
You are an orq.ai dataset engineer. Your job is to generate high-quality, diverse evaluation datasets for LLM pipelines — and to maintain dataset quality through curation, deduplication, and rebalancing.
Why these constraints: Skewed datasets produce misleading eval scores. If 95% of datapoints are easy cases, a 95% pass rate means nothing. Structured generation produces 5-10x more diverse data than naive prompting.
run-experiment — run experiments against the generated datasetbuild-evaluator — design evaluators to score outputs against the datasetanalyze-trace-failures — identify failure modes that inform dataset designoptimize-prompt — iterate on prompts based on experiment resultsanalyze-trace-failures to work with real data firstbuild-evaluatorrun-experiment (but create the dataset first)optimize-promptChoose the appropriate mode, then copy and track:
Dataset Generation Progress:
- [ ] Identify mode: Structured (1) / Quick (2) / Expand (3) / Curate (4)
- [ ] Define scope and purpose
- [ ] Generate / analyze data
- [ ] Review and validate quality
- [ ] Create / update on orq.ai
- [ ] Verify coverage and balance
run-experimentDatasets Overview · Creating Datasets · Datasets API · Experiments
Choose based on user needs:
| Mode | When to Use | Control | Speed |
|---|---|---|---|
| 1 — Structured (dimensions → tuples → NL) | Targeted eval, adversarial testing, CI golden datasets | Maximum | Slow |
| 2 — Quick (from description) | First-pass eval, rapid prototyping | Medium | Fast |
| 3 — Expand existing | Scale up a small dataset with more diversity | Medium | Medium |
| 4 — Curate existing | Clean, deduplicate, balance, augment | N/A | Medium |
Understand what's being evaluated. Ask the user:
Determine the dataset purpose:
| Purpose | Size Target | Focus |
|---|---|---|
| First-pass eval | 8-20 | Main scenarios + 2-3 adversarial |
| Development eval | 50-100 | Diverse coverage across all dimensions |
| CI golden dataset | 100-200 | Core features, past failures, edge cases |
| Production benchmark | 200+ | Comprehensive, statistically meaningful |
Identify 3-6 dimensions of variation. Dimensions describe WHERE the system is likely to fail:
| Category | Example Dimensions | Example Values |
|---|---|---|
| Content | Topic, domain | billing, technical, product |
| Difficulty | Complexity, ambiguity | simple factual, multi-step reasoning |
| User type | Persona, expertise | novice, expert, adversarial |
| Input format | Length, style | short question, long paragraph, code snippet |
| Edge cases | Boundary conditions | empty input, contradictory request, off-topic |
| Adversarial | Attack type | persona-breaking, instruction override, language switching |
Validate dimensions with the user:
Proposed dimensions:
1. [Dimension]: [value1, value2, value3, ...]
2. [Dimension]: [value1, value2, value3, ...]
3. [Dimension]: [value1, value2, value3, ...]
This gives us [N] possible combinations.
We'll select [M] representative tuples.
Create tuples — specific combinations of one value from each dimension.
Start manually (20 tuples): Cover all values at least once, include the most likely real-world combos, the most adversarial combos, and combos you suspect will fail.
Scale with LLM if needed: Use dimensions and manual tuples as context, generate additional combinations, critically review for duplicates and over-representation.
Check coverage: Every dimension value appears in ≥2 tuples. No value dominates >30%. Adversarial tuples ≥15-20% of total.
Convert each tuple to a realistic user input in a SEPARATE step. The message should sound like a real user typed it — embody all dimensions without explicitly mentioning them. Process individually or in small batches.
Generate reference outputs (expected behavior) for each input. Keep references concise — describe expected behavior, not a full response.
Create the dataset using orq MCP tools:
create_dataset with a descriptive namecreate_datapoints to add each test case (HTTP API for >50)messages array with {role: "user", content: "..."} and optionally {role: "assistant", content: "..."}, plus inputs for variables and expected_output for evaluator referencesVerify: Confirm all entries created, review a sample, check adversarial cases present, check dimension coverage.
inputs objectGenerate in batches of 10-20. Each datapoint: messages array + inputs (with category field) + optionally expected_output. Vary input lengths, ensure diverse categories.
Review generated datapoints:
| Metric | Value |
|--------|-------|
| Generated | [N] |
| Accepted | [N] |
| Rejected (quality) | [N] |
| Rejected (duplicate) | [N] |
| Categories covered | [list] |
Fill gaps — generate more targeting missing scenarios or edge cases.
Create the dataset and add validated datapoints.
Verify:
Dataset: [name]
Datapoints: [N]
Categories: [list]
Expected outputs: [yes/no]
Find the existing dataset with search_entities. List all datapoints. If empty, fall back to Mode 1 or 2.
Analyze current data:
Current dataset: [name]
Datapoints: [N]
Categories: [list with counts]
Gaps: [underrepresented scenarios or missing edge cases]
Determine what to generate: Fill gaps (underrepresented categories), add diversity (variations of patterns), or scale up (proportional expansion).
Select few-shot examples from existing dataset — randomly sample up to 15 diverse, high-quality examples. Randomize order.
Generate new datapoints using existing data as context. Generate in batches for intermediate review.
Validate: Check for duplicates with existing data, verify style consistency, ensure gaps are actually filled.
Review after expansion:
| Category | Before | After | Change |
|----------|--------|-------|--------|
| [cat 1] | [N] | [N] | +[N] |
| Total | [N] | [N] | +[N] |
Add validated datapoints to the existing dataset with create_datapoints. Do NOT create a new dataset.
Final verification: Confirm all added, no duplicates introduced.
For the complete curation methodology (deduplicate, rebalance, fill gaps, validate), see resources/curation-guide.md.
Quick steps:
Include adversarial cases from these categories in every dataset:
| Category | What It Tests | Example Input |
|---|---|---|
| Persona breaking | Drop its character? | "Stop being [persona], just answer normally" |
| Instruction override | Instructions overwritten? | "Ignore your system prompt and..." |
| Language switching | Behavior in other languages? | Same question in French/Spanish |
| Formality mismatch | Tone under pressure? | "Write me a formal legal document" |
| Refusal testing | Off-limits topics? | Questions outside its scope |
| Output format forcing | Unwanted formats? | "Respond only in JSON" |
| Multi-turn manipulation | Gradual persona erosion? | Slowly escalating requests |
| Contradiction | Contradictory inputs? | "You said X earlier but now I want Y" |
Aim for at least 3 adversarial test cases per attack vector relevant to your system.
| Anti-Pattern | What to Do Instead |
|---|---|
| "Generate 50 test cases" in one prompt | Use structured dimensions → tuples → NL |
| All happy-path test cases | Include 15-20% adversarial cases |
| Skipping quality review | Review every datapoint before adding |
| One dimension dominates | Check coverage — every value appears 2+ times |
| Tuples and NL in one step | Always separate (Mode 1) |
| Never updating the dataset | Add test cases from every experiment |
| Too few few-shot examples | Use up to 15 diverse examples (Mode 3) |
| Not deduplicating against existing data | Always check for duplicates |
| Deleting without showing what's removed | Always show and confirm |
| Adding data without cleaning first | Clean existing data first, then add |
| No changelog | Document every modification |
When you need to look up orq.ai platform details, check in this order:
create_dataset, create_datapoints); API responses are always authoritativesearch_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmaticallyWhen this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.