Creates evaluator functions in evaluators.ts for Output SDK workflows to implement quality assessment, validation logic, and LLM-powered content evaluation with confidence scores.
npx claudepluginhub growthxai/output --plugin outputai
This skill documents how to create evaluator functions in evaluators.ts for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.
For smaller workflows, use a single evaluators.ts file:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # All evaluators in one file
├── types.ts
└── ...
For larger workflows with many evaluators, use an evaluators/ folder:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators/ # Evaluators split into individual files
│ ├── quality.ts
│ ├── accuracy.ts
│ └── completeness.ts
├── types.ts
└── ...
Important: evaluator() calls MUST be in files containing 'evaluators' in the path:
src/workflows/my_workflow/evaluators.ts ✓
src/workflows/my_workflow/evaluators/quality.ts ✓
src/shared/evaluators/common_evaluators.ts ✓
src/workflows/my_workflow/helpers.ts ✗ (cannot contain evaluator() calls)

Evaluators are Temporal activities with strict import rules to ensure deterministic replay.
Evaluators may import from these relative locations:

- ./utils.js, ./types.js, ./helpers.js
- ./lib/helpers.js
- ../../shared/utils/*.js
- ../../shared/clients/*.js
- ../../shared/services/*.js

Example of WRONG imports:
// WRONG - evaluators cannot import other evaluators
import { otherEvaluator } from '../../shared/evaluators/other.js'; // ✗
import { anotherEvaluator } from './other_evaluators.js'; // ✗
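By contrast, relative imports from the allowed locations are fine. A minimal sketch, where the imported names and file names are illustrative:
// CORRECT - relative imports from allowed locations (illustrative names)
import { truncate } from './utils.js'; // ✓ local utility
import type { BlogContent } from './types.js'; // ✓ local types
import { fetchClient } from '../../shared/clients/http.js'; // ✓ shared client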
// CORRECT - Import from @outputai/core
import {
evaluator,
z,
EvaluationBooleanResult,
EvaluationNumberResult,
EvaluationStringResult,
EvaluationFeedback
} from '@outputai/core';
// WRONG - Never import z from zod
import { z } from 'zod';
// CORRECT - Use @outputai/llm wrapper
import { generateText, Output } from '@outputai/llm';
// WRONG - Never call LLM providers directly
import OpenAI from 'openai';
All imports MUST use .js extension:
// CORRECT
import { BlogContent } from './types.js';
// WRONG - Missing .js extension
import { BlogContent } from './types';
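Taken together, a typical import header for an `evaluators.ts` file looks like the sketch below (the `BlogContent` type and `./types.js` module are illustrative):
import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import type { BlogContent } from './types.js'; // local types (illustrative)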
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const myEvaluator = evaluator({
name: 'my_evaluator',
description: 'Description of what this evaluator assesses',
inputSchema: z.object({ /* input schema */ }),
fn: async (input) => {
// Evaluation logic
return new EvaluationBooleanResult({
value: true,
confidence: 0.95
});
}
});
The `name` field is the unique identifier for the evaluator. Use snake_case.
name: 'evaluate_content_quality'
The `description` field is a human-readable description of what the evaluator assesses.
description: 'Evaluate the quality and completeness of generated content'
The `inputSchema` field is a schema (built with `z` from `@outputai/core`) for validating evaluator input.
inputSchema: z.object({
content: z.string(),
expectedLength: z.number()
})
The `fn` field is the evaluator execution function. It returns an evaluation result with a value and confidence.
fn: async (input) => {
const isValid = input.content.length >= input.expectedLength;
return new EvaluationBooleanResult({
value: isValid,
confidence: 0.95
});
}
Use `EvaluationBooleanResult` for pass/fail or true/false evaluations:
import { EvaluationBooleanResult } from '@outputai/core';
return new EvaluationBooleanResult({
value: true, // boolean result
confidence: 0.95, // 0.0 to 1.0
reasoning: 'Optional explanation of the evaluation'
});
Use `EvaluationNumberResult` for numeric scores or ratings:
import { EvaluationNumberResult } from '@outputai/core';
return new EvaluationNumberResult({
value: 85, // numeric result (e.g., 0-100 score)
confidence: 0.85, // 0.0 to 1.0
reasoning: 'Optional explanation of the score'
});
Use `EvaluationStringResult` for categorical or text-based evaluations:
import { EvaluationStringResult } from '@outputai/core';
return new EvaluationStringResult({
value: 'positive', // string result (e.g., category, sentiment, label)
confidence: 0.9, // 0.0 to 1.0
reasoning: 'Optional explanation of the classification'
});
| Property | Type | Required | Description |
|---|---|---|---|
| `value` | boolean, number, or string | Yes | The evaluation result |
| `confidence` | number (0.0-1.0) | Yes | Confidence in the evaluation |
| `reasoning` | string | No | Explanation of the evaluation |
| `name` | string | No | Name for this specific result (useful in dimensions) |
| `feedback` | EvaluationFeedback[] | No | Array of feedback objects with issues and suggestions |
| `dimensions` | EvaluationResult[] | No | Nested results for multi-dimensional evaluation |
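As a sketch of how the optional properties compose, the example below combines `reasoning`, `name`, `feedback`, and `dimensions` on a single result (all values are illustrative):
return new EvaluationNumberResult({
  value: 72, // overall score
  confidence: 0.8,
  reasoning: 'Coherent writing, but several sections drift off topic',
  name: 'overall_quality',
  feedback: [
    new EvaluationFeedback({
      issue: 'Sections 2 and 4 drift from the stated topic',
      suggestion: 'Tie each section back to the main argument',
      priority: 'medium'
    })
  ],
  dimensions: [
    new EvaluationNumberResult({ value: 85, confidence: 0.9, name: 'coherence' }),
    new EvaluationNumberResult({ value: 60, confidence: 0.8, name: 'relevance' })
  ]
});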
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateCompleteness = evaluator({
name: 'evaluate_completeness',
description: 'Check if content meets minimum length requirements',
inputSchema: z.object({
content: z.string(),
minLength: z.number().default(100)
}),
fn: async ({ content, minLength }) => {
const isComplete = content.length >= minLength;
return new EvaluationBooleanResult({
value: isComplete,
confidence: 1.0,
reasoning: isComplete
? `Content has ${content.length} characters, meets minimum of ${minLength}`
: `Content has ${content.length} characters, below minimum of ${minLength}`
});
}
});
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateGibberish = evaluator({
name: 'evaluate_gibberish',
description: 'Check if a given string is gibberish',
inputSchema: z.string(),
fn: async content => {
const gibberishPatterns = ['foo', 'bar', 'lorem', 'ipsum'];
const isGibberish = gibberishPatterns.some(p => content.toLowerCase().includes(p));
return new EvaluationBooleanResult({
value: !isGibberish,
confidence: 0.95
});
}
});
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
export const evaluateReadability = evaluator({
name: 'evaluate_readability',
description: 'Calculate readability score based on sentence structure',
inputSchema: z.object({
content: z.string()
}),
fn: async ({ content }) => {
const sentences = content.split(/[.!?]+/).filter(s => s.trim());
const words = content.split(/\s+/).filter(w => w.trim());
const avgWordsPerSentence = words.length / Math.max(sentences.length, 1);
// Simple readability score (lower avg words = more readable)
const score = Math.max(0, Math.min(100, 100 - (avgWordsPerSentence - 15) * 5));
return new EvaluationNumberResult({
value: Math.round(score),
confidence: 0.8,
reasoning: `Average ${avgWordsPerSentence.toFixed(1)} words per sentence`
});
}
});
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
export const evaluateSentiment = evaluator({
name: 'evaluate_sentiment',
description: 'Classify the sentiment of content',
inputSchema: z.object({
content: z.string()
}),
fn: async ({ content }) => {
const positiveWords = ['great', 'excellent', 'amazing', 'good', 'love'];
const negativeWords = ['bad', 'terrible', 'awful', 'hate', 'poor'];
const lowerContent = content.toLowerCase();
const positiveCount = positiveWords.filter(w => lowerContent.includes(w)).length;
const negativeCount = negativeWords.filter(w => lowerContent.includes(w)).length;
let sentiment: string;
let confidence: number;
if (positiveCount > negativeCount) {
sentiment = 'positive';
confidence = Math.min(0.95, 0.6 + positiveCount * 0.1);
} else if (negativeCount > positiveCount) {
sentiment = 'negative';
confidence = Math.min(0.95, 0.6 + negativeCount * 0.1);
} else {
sentiment = 'neutral';
confidence = 0.7;
}
return new EvaluationStringResult({
value: sentiment,
confidence,
reasoning: `Found ${positiveCount} positive and ${negativeCount} negative indicators`
});
}
});
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateSignalToNoise = evaluator({
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of content',
inputSchema: z.object({
title: z.string(),
content: z.string()
}),
fn: async ({ title, content }) => {
const { output } = await generateText({
prompt: 'signal_noise@v1', // References prompts/signal_noise@v1.prompt
variables: {
title,
content
},
output: Output.object({
schema: z.object({
score: z.number().describe('Signal-to-noise score 0-100')
})
})
});
return new EvaluationNumberResult({
value: output.score,
confidence: 0.85
});
}
});
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateFactualAccuracy = evaluator({
name: 'evaluate_factual_accuracy',
description: 'Check if content contains factual claims that can be verified',
inputSchema: z.object({
content: z.string(),
topic: z.string()
}),
fn: async ({ content, topic }) => {
const { output } = await generateText({
prompt: 'factual_check@v1',
variables: { content, topic },
output: Output.object({
schema: z.object({
isFactual: z.boolean().describe('Whether content appears factually accurate'),
confidence: z.number().describe('Confidence in assessment 0-1'),
issues: z.array(z.string()).optional().describe('Any factual issues found')
})
})
});
return new EvaluationBooleanResult({
value: output.isFactual,
confidence: output.confidence,
reasoning: output.issues?.length
? `Issues found: ${output.issues.join(', ')}`
: 'No factual issues detected'
});
}
});
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateContentCategory = evaluator({
name: 'evaluate_content_category',
description: 'Classify content into a category',
inputSchema: z.object({
content: z.string(),
categories: z.array(z.string())
}),
fn: async ({ content, categories }) => {
const { output } = await generateText({
prompt: 'categorize_content@v1',
variables: {
content,
categories: categories.join(', ')
},
output: Output.object({
schema: z.object({
category: z.string().describe('The best matching category'),
confidence: z.number().describe('Confidence in classification 0-1'),
explanation: z.string().describe('Why this category was chosen')
})
})
});
return new EvaluationStringResult({
value: output.category,
confidence: output.confidence,
reasoning: output.explanation
});
}
});
Use the feedback field to provide actionable improvement suggestions alongside your evaluation result. Import EvaluationFeedback from @outputai/core to create feedback objects.
import { evaluator, z, EvaluationStringResult, EvaluationFeedback } from '@outputai/core';
export const evaluateWithFeedback = evaluator({
name: 'evaluate_with_feedback',
description: 'Evaluate content quality and provide actionable feedback',
inputSchema: z.string(),
fn: async (response) => {
const feedback: EvaluationFeedback[] = [];
if (response.length < 50) {
feedback.push(new EvaluationFeedback({
issue: 'Response is too short',
suggestion: 'Expand the response with more detail',
priority: 'medium'
}));
}
return new EvaluationStringResult({
value: feedback.length === 0 ? 'good' : 'needs_improvement',
confidence: 0.85,
feedback: feedback
});
}
});
| Property | Type | Description |
|---|---|---|
| `issue` | string | The problem identified |
| `suggestion` | string | Recommended fix |
| `priority` | string | Priority level (e.g., 'low', 'medium', 'high') |
Use the dimensions field to nest EvaluationResult instances for sub-scores. Each dimension should use the name field to identify it.
import { evaluator, z, EvaluationStringResult, EvaluationNumberResult } from '@outputai/core';
// Assumed local scoring helpers; see the sketch after this example
import { calculateCoherence, calculateRelevance } from './helpers.js';
export const evaluateMultiDimensional = evaluator({
name: 'evaluate_multi_dimensional',
description: 'Evaluate content across multiple quality dimensions',
inputSchema: z.string(),
fn: async (response) => {
const coherenceScore = calculateCoherence(response);
const relevanceScore = calculateRelevance(response);
const overallScore = (coherenceScore + relevanceScore) / 2;
return new EvaluationStringResult({
value: overallScore > 0.7 ? 'high_quality' : 'low_quality',
confidence: 0.9,
dimensions: [
new EvaluationNumberResult({
value: coherenceScore,
confidence: 0.85,
name: 'coherence'
}),
new EvaluationNumberResult({
value: relevanceScore,
confidence: 0.88,
name: 'relevance'
})
]
});
}
});
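`calculateCoherence` and `calculateRelevance` above are assumed to be local helpers (hence the `./helpers.js` import). Trivial stand-ins might look like:
// helpers.ts - illustrative stand-ins for the scoring helpers used above
export function calculateCoherence(text: string): number {
  // Naive proxy: share of sentences with more than three words
  const sentences = text.split(/[.!?]+/).filter(s => s.trim());
  const substantial = sentences.filter(s => s.trim().split(/\s+/).length > 3);
  return sentences.length ? substantial.length / sentences.length : 0;
}

export function calculateRelevance(text: string): number {
  // Placeholder heuristic; a real implementation would compare against a topic
  return Math.min(1, text.length / 1000);
}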
The following complete evaluators file is based on a real workflow evaluator file:
import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { blogContentSchema } from './types.js';
import type { BlogContent } from './types.js';
// Simple boolean evaluator
export const evaluateMinimumLength = evaluator({
name: 'evaluate_minimum_length',
description: 'Check if blog content meets minimum length requirements',
inputSchema: blogContentSchema,
fn: async (input: BlogContent) => {
const MIN_TOKENS = 500;
const meetsRequirement = input.tokenCount >= MIN_TOKENS;
return new EvaluationBooleanResult({
value: meetsRequirement,
confidence: 1.0,
reasoning: `Content has ${input.tokenCount} tokens (minimum: ${MIN_TOKENS})`
});
}
});
// LLM-powered number evaluator
export const evaluateSignalToNoise = evaluator({
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of blog content',
inputSchema: blogContentSchema,
fn: async (input: BlogContent) => {
const { output } = await generateText({
prompt: 'signal_noise@v1',
variables: {
title: input.title,
content: input.content
},
output: Output.object({
schema: z.object({
score: z.number().describe('Signal-to-noise score 0-100')
})
})
});
return new EvaluationNumberResult({
value: output.score,
confidence: 0.85
});
}
});
// LLM-powered boolean evaluator
export const evaluateRelevance = evaluator({
name: 'evaluate_relevance',
description: 'Check if content is relevant to the stated topic',
inputSchema: z.object({
content: z.string(),
topic: z.string(),
keywords: z.array(z.string())
}),
fn: async ({ content, topic, keywords }) => {
const { output } = await generateText({
prompt: 'relevance_check@v1',
variables: { content, topic, keywords: keywords.join(', ') },
output: Output.object({
schema: z.object({
isRelevant: z.boolean(),
relevanceScore: z.number().describe('Relevance score 0-1'),
explanation: z.string()
})
})
});
return new EvaluationBooleanResult({
value: output.isRelevant,
confidence: output.relevanceScore,
reasoning: output.explanation
});
}
});
// Boolean for pass/fail decisions
return new EvaluationBooleanResult({ value: true, confidence: 0.9 });
// Number for scores and ratings
return new EvaluationNumberResult({ value: 85, confidence: 0.85 });
// String for categories, labels, or classifications
return new EvaluationStringResult({ value: 'positive', confidence: 0.9 });
// High confidence for deterministic checks
confidence: 1.0 // e.g., length checks, pattern matching
// Medium confidence for heuristic-based evaluations
confidence: 0.85 // e.g., LLM-based assessments
// Lower confidence for uncertain evaluations
confidence: 0.7 // e.g., subjective quality judgments
// Good - reasoning explains exactly why the evaluation failed
return new EvaluationBooleanResult({
value: false,
confidence: 0.95,
reasoning: `Content contains ${errorCount} grammatical errors, exceeding threshold of ${maxErrors}`
});
// Good - single responsibility
export const evaluateGrammar = evaluator({ ... });
export const evaluateReadability = evaluator({ ... });
export const evaluateTone = evaluator({ ... });
// Avoid - doing too much in one evaluator
export const evaluateEverything = evaluator({ ... });
// Good - clear what is being evaluated
name: 'evaluate_content_originality'
name: 'evaluate_factual_accuracy'
name: 'evaluate_sentiment_alignment'
// Avoid - vague names
name: 'check'
name: 'validate'
name: 'evaluate_stuff'
// Good - feedback pairs each issue with a concrete fix
feedback: [
new EvaluationFeedback({
issue: 'Missing conclusion paragraph',
suggestion: 'Add a summary paragraph at the end',
priority: 'high'
})
]
// Good - dimensions expose named sub-scores
dimensions: [
new EvaluationNumberResult({ value: 8, confidence: 0.9, name: 'coherence' }),
new EvaluationNumberResult({ value: 6, confidence: 0.85, name: 'relevance' })
]
Before finishing, check that:

- `evaluator`, `z`, and result types are imported from `@outputai/core`
- `generateText` and `Output` are imported from `@outputai/llm` if using LLM (not a direct provider)
- `.describe()` is used instead of `.min()`/`.max()` on `z.number()`
- All relative imports use the `.js` extension
- Every evaluator defines `name`, `description`, `inputSchema`, and `fn`
- Names use snake_case
- Each evaluator returns a result type (`EvaluationBooleanResult`, `EvaluationNumberResult`, or `EvaluationStringResult`)
- `EvaluationFeedback` is imported from `@outputai/core` when using feedback
- Feedback objects include `issue`, `suggestion`, and `priority`
- Dimensions use the `name` field to identify sub-evaluations

Related skills:

- output-dev-workflow-function - Orchestrating evaluators in workflow.ts
- output-dev-step-function - Creating step functions
- output-dev-types-file - Defining evaluator input schemas
- output-dev-prompt-file - Creating prompt files for LLM-powered evaluators
- output-dev-folder-structure - Understanding project layout