Creates offline evaluation tests for Output SDK workflows using @outputai/evals: verify() evaluators, YAML datasets, eval workflows, and CLI tests.
The `@outputai/evals` package provides an offline evaluation framework for testing workflow quality using datasets and evaluators. This is **complementary** to the runtime `evaluator()` from `@outputai/core`:
| Aspect | Runtime Evaluators (@outputai/core) | Offline Eval Tests (@outputai/evals) |
|---|---|---|
| When | During workflow execution | After execution, at test time |
| Where | evaluators.ts in workflow folder | tests/evals/ in workflow folder |
| Purpose | Live quality scoring with confidence | Dataset-driven pass/fail verification |
| Triggered by | Workflow orchestration | output workflow test CLI command |
| Returns | EvaluationBooleanResult, etc. | Verdict helpers (pass/partial/fail) |
Use offline eval testing when you want to validate workflow behavior against known datasets, build regression test suites, or assess subjective quality with LLM judges.
- `tests/evals/` or `tests/datasets/`
- `verify()` from `@outputai/evals`
- `evalWorkflow()`
- `output workflow test` commands

Add a `tests/` directory inside the workflow folder:
src/workflows/{workflow_name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # Runtime evaluators (optional)
├── types.ts
└── tests/
    ├── datasets/
    │   ├── happy_path.yml
    │   └── edge_case.yml
    └── evals/
        ├── evaluators.ts # Offline eval test evaluators
        ├── workflow.ts # Eval workflow definition
        └── judge_topic@v1.prompt # LLM judge prompts (optional)
`verify()`

Import `verify` and `Verdict` from `@outputai/evals` (not `@outputai/core`):
// tests/evals/evaluators.ts
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
`verify()` Signature

`verify(options, checkFn)`

Options:

- `name`: unique evaluator identifier (snake_case)
- `input`: Zod schema for the workflow input (optional, defaults to `z.any()`)
- `output`: Zod schema for the workflow output (optional, defaults to `z.any()`)

Check function receives:

{
  input, // typed workflow input
  output, // typed workflow output
  context: {
    ground_truth: Record<string, unknown> // from dataset YAML
  }
}
Returns: any Verdict helper result.
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
export const evaluateSum = verify(
  {
    name: 'evaluate_sum',
    input: z.object({ values: z.array(z.number()) }),
    output: z.object({ result: z.number() })
  },
  ({ input, output }) =>
    Verdict.equals(output.result, input.values.reduce((a, b) => a + b, 0))
);
Ground truth values come from the dataset YAML and are available via context.ground_truth:
export const lengthCheck = verify(
  { name: 'length_check', input: blogInput, output: blogOutput },
  ({ output, context }) =>
    Verdict.gte(output.blog_post.length, Number(context.ground_truth.min_length ?? 100))
);
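The `?? 100` fallback in the check above applies only when `min_length` is missing (or `null`) in the dataset. A minimal standalone sketch of that expression follows; `minLengthFrom` is a hypothetical helper name used for illustration and is not part of `@outputai/evals`.

```typescript
// Hypothetical helper mirroring the expression used in lengthCheck above.
type GroundTruth = Record<string, unknown>;

function minLengthFrom(groundTruth: GroundTruth, fallback = 100): number {
  // `??` falls back only on undefined or null, so an explicit 0 is kept.
  return Number(groundTruth.min_length ?? fallback);
}

const fromDataset = minLengthFrom({ min_length: 250 });   // 250
const fromDefault = minLengthFrom({});                    // 100
const fromString = minLengthFrom({ min_length: '250' });  // 250 (numeric strings are coerced)
const notANumber = minLengthFrom({ min_length: 'long' }); // NaN (a plain >= comparison with NaN is false)
```

A non-numeric value in the dataset therefore yields `NaN` rather than the default, so it is worth keeping `min_length` numeric in the YAML.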
All deterministic helpers return results with confidence 1.0.
| Method | Description |
|---|---|
| `Verdict.equals(actual, expected)` | Strict equality (`===`) |
| `Verdict.closeTo(actual, expected, tolerance)` | Within numeric tolerance |
| `Verdict.gt(actual, threshold)` | Greater than |
| `Verdict.gte(actual, threshold)` | Greater than or equal |
| `Verdict.lt(actual, threshold)` | Less than |
| `Verdict.lte(actual, threshold)` | Less than or equal |
| `Verdict.inRange(actual, min, max)` | Within inclusive range |

| Method | Description |
|---|---|
| `Verdict.contains(haystack, needle)` | String includes substring |
| `Verdict.matches(value, pattern)` | Regex match |
| `Verdict.includesAll(actual, expected)` | Array contains all expected values |
| `Verdict.includesAny(actual, expected)` | Array contains at least one expected value |

| Method | Description |
|---|---|
| `Verdict.isTrue(value)` | Value is `true` |
| `Verdict.isFalse(value)` | Value is `false` |

| Method | Description |
|---|---|
| `Verdict.pass(reasoning?)` | Explicit pass |
| `Verdict.partial(confidence, reasoning?, feedback?)` | Partial pass with confidence |
| `Verdict.fail(reasoning, feedback?)` | Explicit fail |
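To make a few of the table descriptions concrete, the sketch below writes them as plain TypeScript predicates. These are local stand-ins for illustration only, not the `Verdict` implementations; in particular, treating the `closeTo` tolerance as an inclusive bound is an assumption.

```typescript
// Local stand-ins for illustration; they are NOT the @outputai/evals implementations.
const closeTo = (actual: number, expected: number, tolerance: number): boolean =>
  Math.abs(actual - expected) <= tolerance; // assumption: the bound is inclusive

const inRange = (actual: number, min: number, max: number): boolean =>
  actual >= min && actual <= max; // inclusive, as the table states

const includesAll = <T>(actual: T[], expected: T[]): boolean =>
  expected.every((item) => actual.includes(item));

const includesAny = <T>(actual: T[], expected: T[]): boolean =>
  expected.some((item) => actual.includes(item));

const tags = ['payments', 'stripe', 'pricing'];
const checks = {
  closeTo: closeTo(0.1 + 0.2, 0.3, 1e-9),            // true, despite floating-point error
  inRange: inRange(10, 1, 10),                       // true, upper bound is included
  includesAll: includesAll(tags, ['stripe', 'api']), // false, 'api' is missing
  includesAny: includesAny(tags, ['stripe', 'api'])  // true, 'stripe' is present
};
```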
For subjective quality assessments, use judge functions with .prompt files:
import { verify, judgeVerdict, judgeScore, judgeLabel } from '@outputai/evals';
// Returns pass/partial/fail verdict from an LLM
export const evaluateTopic = verify(
  { name: 'evaluate_topic', input: blogInput, output: blogOutput },
  async ({ input, output, context }) =>
    judgeVerdict({
      prompt: 'judge_topic@v1',
      variables: {
        blog_title: output.title,
        blog_post: output.blog_post,
        required_topic: String(context.ground_truth.required_topic ?? input.topic)
      }
    })
);

// Returns a numeric score from an LLM
export const evaluateQuality = verify(
  { name: 'evaluate_quality', input: blogInput, output: blogOutput },
  async ({ input, output }) =>
    judgeScore({
      prompt: 'judge_quality@v1',
      variables: { blog_title: output.title, blog_post: output.blog_post, topic: input.topic }
    })
);

// Returns a string label from an LLM
export const evaluateTone = verify(
  { name: 'evaluate_tone', input: blogInput, output: blogOutput },
  async ({ output }) =>
    judgeLabel({
      prompt: 'judge_tone@v1',
      variables: { blog_title: output.title, blog_post: output.blog_post }
    })
);
`.prompt` File Format

Judge prompt files live alongside evaluators in `tests/evals/`:
# tests/evals/judge_topic@v1.prompt
---
provider: anthropic
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1000
---
<system>
You are an evaluation judge. Assess whether a blog post is faithfully about the required topic.
Return a JSON object with:
- verdict: "pass" if the blog clearly focuses on the topic, "partial" if it mentions the topic but lacks depth, "fail" if it is not about the topic
- reasoning: a brief explanation of your judgment
</system>
<user>
Required topic: {{ required_topic }}
Blog title: {{ blog_title }}
Blog post:
{{ blog_post }}
Judge whether this blog post is faithfully about the required topic.
</user>
The eval workflow wires evaluators together and defines how to interpret results.
// tests/evals/workflow.ts
import { evalWorkflow } from '@outputai/evals';
import { evaluateSum } from './evaluators.js';
export default evalWorkflow({
  name: 'simple_eval',
  evals: [
    {
      evaluator: evaluateSum,
      criticality: 'required',
      interpret: { type: 'boolean' }
    }
  ]
});
Each entry in the evals array has:
- `evaluator`: the function created by `verify()`
- `criticality`: `'required'` (affects pass/fail) or `'informational'` (reported but doesn't block)
- `interpret`: how to convert the evaluator's return value into a verdict

| Type | Evaluator Returns | Mapping |
|---|---|---|
| `{ type: 'boolean' }` | `Verdict.equals()`, `Verdict.gte()`, etc. | `true` = pass, `false` = fail |
| `{ type: 'verdict' }` | `judgeVerdict()` or `Verdict.pass/partial/fail()` | Direct pass-through |
| `{ type: 'number', pass: 0.7, partial: 0.4 }` | `judgeScore()` | `>= pass` = pass, `>= partial` = partial, else fail |
| `{ type: 'string', pass: ['a', 'b'], partial: ['c'] }` | `judgeLabel()` | Label in pass list = pass, in partial list = partial, else fail |
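The `number` and `string` mappings in the table can be sketched as two small functions. This is an illustrative reading of the table, not the library's code; `interpretNumber`, `interpretString`, and the type names are hypothetical.

```typescript
type VerdictLabel = 'pass' | 'partial' | 'fail';

interface NumberInterpret { type: 'number'; pass: number; partial: number }
interface StringInterpret { type: 'string'; pass: string[]; partial: string[] }

// Threshold mapping from the table: >= pass, then >= partial, else fail.
function interpretNumber(score: number, rule: NumberInterpret): VerdictLabel {
  if (score >= rule.pass) return 'pass';
  if (score >= rule.partial) return 'partial';
  return 'fail';
}

// Label mapping from the table: pass list first, then partial list, else fail.
function interpretString(label: string, rule: StringInterpret): VerdictLabel {
  if (rule.pass.includes(label)) return 'pass';
  if (rule.partial.includes(label)) return 'partial';
  return 'fail';
}

const quality: NumberInterpret = { type: 'number', pass: 0.7, partial: 0.4 };
const tone: StringInterpret = {
  type: 'string',
  pass: ['professional', 'informative'],
  partial: ['casual']
};

const verdicts: VerdictLabel[] = [
  interpretNumber(0.82, quality),     // 'pass'
  interpretNumber(0.7, quality),      // 'pass' (the threshold itself passes)
  interpretNumber(0.55, quality),     // 'partial'
  interpretNumber(0.1, quality),      // 'fail'
  interpretString('casual', tone),    // 'partial'
  interpretString('sarcastic', tone)  // 'fail'
];
```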
export default evalWorkflow({
  name: 'blog_generator_eval',
  evals: [
    {
      evaluator: lengthOfOutput,
      criticality: 'required',
      interpret: { type: 'boolean' }
    },
    {
      evaluator: evaluateTopic,
      criticality: 'required',
      interpret: { type: 'verdict' }
    },
    {
      evaluator: evaluateQuality,
      criticality: 'required',
      interpret: { type: 'number', pass: 0.7, partial: 0.4 }
    },
    {
      evaluator: evaluateContent,
      criticality: 'informational',
      interpret: { type: 'boolean' }
    },
    {
      evaluator: evaluateTone,
      criticality: 'informational',
      interpret: { type: 'string', pass: ['professional', 'informative'], partial: ['casual'] }
    }
  ]
});
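One way to read the `criticality` values in the workflow above is that only failed `'required'` evals block a run, while `'informational'` results are reported without blocking. The sketch below illustrates that reading with hypothetical names (`EvalOutcome`, `blockingFailures`). How a `partial` verdict on a required eval is counted is not stated here, so treating it as non-blocking is an assumption.

```typescript
type VerdictLabel = 'pass' | 'partial' | 'fail';
type Criticality = 'required' | 'informational';

interface EvalOutcome {
  name: string;
  criticality: Criticality;
  verdict: VerdictLabel;
}

// Hypothetical aggregation: only failed 'required' evals block the run.
// Assumption: a 'partial' verdict on a required eval does not block.
function blockingFailures(outcomes: EvalOutcome[]): string[] {
  return outcomes
    .filter((o) => o.criticality === 'required' && o.verdict === 'fail')
    .map((o) => o.name);
}

const outcomes: EvalOutcome[] = [
  { name: 'length_of_output', criticality: 'required', verdict: 'pass' },
  { name: 'evaluate_topic', criticality: 'required', verdict: 'fail' },
  { name: 'evaluate_tone', criticality: 'informational', verdict: 'fail' }
];

const blocking = blockingFailures(outcomes); // ['evaluate_topic']
const runPassed = blocking.length === 0;     // false
```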
The eval workflow name must end in _eval and match the pattern {workflow_name}_eval. The CLI resolves this automatically — output workflow test blog_generator looks for blog_generator_eval.
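The naming convention can be expressed as a small check. The helpers below are hypothetical and only illustrate the `{workflow_name}_eval` pattern; the CLI's own resolution code is not shown in this document.

```typescript
// Hypothetical helpers illustrating the {workflow_name}_eval convention.
const EVAL_SUFFIX = '_eval';

function evalWorkflowNameFor(workflowName: string): string {
  return `${workflowName}${EVAL_SUFFIX}`;
}

function isEvalNameFor(workflowName: string, evalName: string): boolean {
  return evalName === evalWorkflowNameFor(workflowName);
}

const resolved = evalWorkflowNameFor('blog_generator');                 // 'blog_generator_eval'
const matches = isEvalNameFor('blog_generator', 'blog_generator_eval'); // true
const mismatch = isEvalNameFor('blog_generator', 'blog_eval');          // false
```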
Datasets are YAML files in tests/datasets/. Each file represents one test case.
name: basic_input
input:
  values:
    - 1
    - 2
    - 3
    - 4
    - 5
last_output:
  output:
    result: 15
  executionTimeMs: 100
  date: '2026-02-13T00:00:00.000Z'
Ground truth provides expected values for evaluators. You can set global values and per-evaluator overrides:
name: stripe_blog
input:
  topic: "Stripe the payment processor"
  requirements: "Include a link to https://stripe.com/en-gb/pricing"
last_output:
  output:
    title: "Stripe: The Modern Payment Processing Platform"
    blog_post: |
      Stripe has revolutionized online payment processing...
  executionTimeMs: 5000
  date: '2026-02-16T00:00:00.000Z'
ground_truth:
  notes: "Known good case"
  evals:
    length_of_output:
      min_length: 100
    evaluate_topic:
      required_topic: "Stripe the payment processor"
    evaluate_content:
      required_content: "https://stripe.com/en-gb/pricing"
The ground_truth.evals.<evaluator_name> values are merged with the top-level ground truth and passed to the evaluator via context.ground_truth.
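The merge described above can be sketched as follows. This is an illustration, not the framework's code: `groundTruthFor` is a hypothetical name, and two details are assumptions, namely that per-evaluator values override top-level keys of the same name and that the `evals` key itself is not passed through to the evaluator.

```typescript
type GroundTruth = Record<string, unknown>;
type DatasetGroundTruth = GroundTruth & { evals?: Record<string, GroundTruth> };

// Hypothetical merge: shared top-level values first, then per-evaluator overrides.
function groundTruthFor(groundTruth: DatasetGroundTruth, evaluatorName: string): GroundTruth {
  const { evals, ...shared } = groundTruth;
  return { ...shared, ...(evals?.[evaluatorName] ?? {}) };
}

// Mirrors the ground_truth block of the stripe_blog dataset.
const stripeBlog: DatasetGroundTruth = {
  notes: 'Known good case',
  evals: {
    length_of_output: { min_length: 100 },
    evaluate_topic: { required_topic: 'Stripe the payment processor' }
  }
};

const forLength = groundTruthFor(stripeBlog, 'length_of_output');
// { notes: 'Known good case', min_length: 100 }
const forTone = groundTruthFor(stripeBlog, 'evaluate_tone');
// { notes: 'Known good case' } (no per-evaluator entry, so only shared values)
```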
`output workflow test <workflow_name>`

Runs evaluations against all datasets for a workflow.

| Flag | Description |
|---|---|
| `--cached` | Use cached output from dataset files (skip workflow execution) |
| `--save` | Run workflow fresh and save output + eval results back to dataset files |
| `--dataset <names>` | Comma-separated list of dataset names to run (default: all) |
| `--format <type>` | Output format: `text` (default) or `json` |

Execution flow:

1. Loads the datasets from `tests/datasets/`.
2. Without `--cached`: executes the workflow for each dataset to get fresh output.
3. Runs the `{workflow_name}_eval` workflow.

`output workflow dataset list <workflow_name>`

Lists all datasets for a workflow with their cached status.
| Flag | Description |
|---|---|
| `--format <type>` | Output format: `table` (default), `text`, or `json` |
`output workflow dataset generate <workflow_name> [scenario]`

Generates a new dataset file by running the workflow.

| Flag | Description |
|---|---|
| `--input <json>` | Workflow input as a JSON string or file path |
| `--name <name>` | Dataset filename (defaults to scenario name) |
| `--trace <path>` | Generate from a local trace file instead of running the workflow |
| `--download` | Download traces from S3 and convert to datasets |
| `--limit <n>` | Max traces to download from S3 (default: 5) |
# Generate dataset from inline JSON input
output workflow dataset generate my_workflow --input '{"key": "value"}' --name my_test
# Generate from a scenario file
output workflow dataset generate my_workflow basic
# Run evals with cached output (fast, no re-execution)
output workflow test my_workflow --cached
# Run evals fresh and save results
output workflow test my_workflow --save
# Run specific datasets only
output workflow test my_workflow --dataset happy_path,edge_case
# List all datasets
output workflow dataset list my_workflow
# 1. Start the dev server
npm run output:dev
# 2. Generate datasets from real workflow runs
output workflow dataset generate blog_generator --input '{"topic": "AI"}' --name ai_post
# 3. Edit the dataset YAML to add ground_truth values for your evaluators
# 4. Run evals with --save to cache output and eval results
output workflow test blog_generator --save
# 5. Iterate on evaluators, re-run with cached output (fast)
output workflow test blog_generator --cached
# 6. List all datasets
output workflow dataset list blog_generator
- `verify`, `Verdict` imported from `@outputai/evals` (not `@outputai/core`)
- `evalWorkflow` imported from `@outputai/evals`
- Relative imports use the `.js` extension
- Eval workflow name follows the `{workflow_name}_eval` pattern
- Datasets are in `tests/datasets/`
- Evaluators and the eval workflow are in `tests/evals/`
- Each evaluator `name` is in snake_case
- `criticality` is set to `'required'` or `'informational'` for each eval
- `interpret` type matches evaluator return type
- `.prompt` files are in `tests/evals/` alongside evaluators
- `z` is imported from `@outputai/core` (not `zod`)

Related skills:

- `output-dev-evaluator-function`: Runtime evaluators using `evaluator()` from `@outputai/core`
- `output-dev-scenario-file`: Creating scenario JSON files for workflow execution
- `output-dev-folder-structure`: Understanding project directory layout
- `output-dev-prompt-file`: Creating `.prompt` files for LLM operations