Run cross-framework agent comparisons using evaluatorq from orqkit — compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when comparing agents, benchmarking, or wanting side-by-side evaluation. Do NOT use when comparing only orq.ai configurations with no external agents (use run-experiment instead).
Install with `npx claudepluginhub orq-ai/assistant-plugins`. This skill is limited to using the tools listed in the table below.
You are an **orq.ai agent comparison specialist**. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using `evaluatorq` ([orqkit](https://github.com/orq-ai/orqkit)), then viewing results in the orq.ai Experiment UI.
Supported comparison modes are listed under "Common configurations" below.

Dataset and evaluator requirements: create the dataset with the `generate-synthetic-dataset` skill, or use `{ dataset_id: "..." }` (Python) / `{ datasetId: "..." }` (TypeScript) to load one from the platform; create the evaluator with the `build-evaluator` skill.

Why these constraints: biased datasets produce meaningless rankings; inline datasets bypass validation; different models confound framework comparisons; untested agents waste experiment budget on invocation errors.
- `generate-synthetic-dataset` — create the evaluation dataset
- `build-evaluator` — design the LLM-as-a-judge evaluator
- `run-experiment` — run orq.ai-native experiments (when no external agents are involved)
- `build-agent` — create orq.ai agents to include in comparisons
- `analyze-trace-failures` — diagnose agent failures from trace data

Copy this to track progress:
Agent Comparison Progress:
- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset)
- [ ] Phase 3: Create evaluator (→ build-evaluator)
- [ ] Phase 4: Generate comparison script
- [ ] Phase 5: Run and view results in orq.ai
Official documentation: Evaluatorq Tutorial · Experiments · Evaluators · Agent Responses API · Datasets
Requires the `evaluatorq` (Python) or `@orq-ai/evaluatorq` (TypeScript) package and a set `ORQ_API_KEY`. Available MCP tools:

| Tool | Purpose |
|---|---|
| `search_entities` | Find orq.ai agent keys (use `type: "agent"`) |
| `create_dataset` | Create a dataset |
| `create_datapoints` | Populate dataset with test cases |
| `create_llm_eval` | Create an LLM-as-a-judge evaluator |
Ensure the `ORQ_API_KEY` environment variable is set, then install the library: `pip install evaluatorq orq-ai-sdk` (Python) or `npm install @orq-ai/evaluatorq` (TypeScript).

Ask the user which agents to compare. For each agent, determine:
For orq.ai agents, get the agent key:
Use the `search_entities` MCP tool with `type: "agent"` to find available agents.

For external agents, confirm they can be called from Python/TypeScript.
Ask the user's language preference: Python or TypeScript. Default to Python if no preference.
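Since a missing `ORQ_API_KEY` only surfaces once the generated evaluation script runs, a fail-fast check at this point can save a wasted run. A minimal sketch (the helper name is hypothetical, not part of any SDK):

```python
import os

def require_api_key(env=os.environ) -> str:
    """Fail fast if ORQ_API_KEY is missing, before any experiment budget is spent."""
    key = env.get("ORQ_API_KEY")
    if not key:
        raise SystemExit("ORQ_API_KEY is not set; export it before running the script")
    return key

# Passing an explicit mapping makes the check easy to demo and test:
print(require_api_key({"ORQ_API_KEY": "demo-key"}))  # → demo-key
```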
Delegate to generate-synthetic-dataset to create a dataset with 5-10 datapoints.
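For orientation, such a dataset might look like the following sketch; the `input`/`expected` field names and the sample questions are assumptions for illustration, not the schema `generate-synthetic-dataset` actually emits.

```python
# Illustrative datapoints; field names and contents are assumptions, not
# the generate-synthetic-dataset output schema.
dataset = [
    {"input": "Summarize the refund policy.", "expected": "Refunds within 30 days."},
    {"input": "Which plans include SSO?", "expected": "Enterprise plans only."},
    {"input": "How do I rotate an API key?", "expected": "Settings > API keys > Rotate."},
    {"input": "List the supported regions.", "expected": "EU and US."},
    {"input": "What is the default rate limit?", "expected": "60 requests per minute."},
]
assert 5 <= len(dataset) <= 10  # enough to rank agents without inflating cost
print(f"{len(dataset)} datapoints ready")
```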
Critical reminders for cross-framework comparison datasets: keep inputs framework-neutral and unbiased, and load the dataset from the platform rather than inlining it (see the constraints above).
Delegate to build-evaluator to create an LLM-as-a-judge evaluator. Save the returned evaluator ID.
For quick experiments, use the create_llm_eval MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see gotchas).
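As an illustration of that wording distinction, a judge prompt might look like the sketch below; the exact template is an assumption, not the prompt `create_llm_eval` produces.

```python
# Illustrative LLM-as-a-judge prompt. The key detail from the gotcha above:
# score "factual correctness", not similarity "compared to the reference",
# which would penalize valid alternative answers.
JUDGE_PROMPT = """You are an impartial evaluator.
Score the response from 1 to 5 for factual correctness, completeness,
and relevance to the question. Do not penalize stylistic differences.

Question: {question}
Response: {response}

Return only the numeric score."""

print(JUDGE_PROMPT.format(question="What year did Apollo 11 land?", response="1969"))
```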
Select job patterns from resources/job-patterns.md for each agent's framework.
Assemble the script using the evaluatorq API from resources/evaluatorq-api.md:
Wire the jobs, dataset, and evaluator into a single `evaluatorq()` call.

Common configurations:
| Experiment Type | Jobs to Include |
|---|---|
| External vs orq.ai | One external job + one orq.ai job |
| orq.ai vs orq.ai | Two orq.ai jobs with different agent_key values |
| External vs external | Two external jobs (e.g., LangGraph + CrewAI) |
| Multi-agent | Three or more jobs of any type |
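Whatever the configuration, the underlying pattern is the same: every agent runs over the same datapoints and one judge scores all outputs. The sketch below illustrates that pattern with plain-Python stand-ins; it is not the `evaluatorq` API (see resources/evaluatorq-api.md for the real signatures), and the stub agents and judge are hypothetical.

```python
from typing import Callable, Dict, List

Datapoint = Dict[str, str]
Agent = Callable[[str], str]

def compare_agents(
    agents: Dict[str, Agent],
    dataset: List[Datapoint],
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """Run every agent over the same datapoints and return mean judge scores."""
    scores: Dict[str, float] = {}
    for name, agent in agents.items():
        per_item = [judge(dp["input"], agent(dp["input"])) for dp in dataset]
        scores[name] = sum(per_item) / len(per_item)
    return scores

# Hypothetical stand-ins: real jobs would invoke LangGraph, CrewAI, an
# orq.ai agent, etc., and the judge would be an LLM-as-a-judge call.
agents = {
    "agent_a": lambda q: q.upper(),
    "agent_b": lambda q: q,
}
dataset = [{"input": "hello"}, {"input": "world"}]
judge = lambda question, answer: 1.0 if answer.isupper() else 0.0

print(compare_agents(agents, dataset, judge))  # agent_a: 1.0, agent_b: 0.0
```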
Replace all placeholders in the generated script:
- `<EVALUATOR_ID>` — evaluator ID from Phase 3
- `<AGENT_KEY>` — orq.ai agent key(s) from Phase 1
- `<experiment-name>` — descriptive experiment name

Run the script:
```shell
# Python
export ORQ_API_KEY="your-key"
python evaluate.py
```

```shell
# TypeScript
export ORQ_API_KEY="your-key"
npx tsx evaluate.ts
```
View the results in the orq.ai Experiment UI.
If issues arise, check resources/gotchas.md for common pitfalls.
Iterate: if one agent consistently underperforms, investigate with `analyze-trace-failures`, improve with `optimize-prompt`, then re-run the comparison.
After running the comparison:
When you need to look up orq.ai platform details, check in this order:
Use `search_orq_ai_documentation` or `get_page_orq_ai_documentation` to look up platform docs programmatically.

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.