From outputai
Creates evaluator functions in evaluators.ts for Output SDK workflows to implement quality assessment, validation logic, and LLM-powered content evaluation with confidence scores.
How this skill is triggered — by the user, by Claude, or both
Slash command
/outputai:output-dev-evaluator-functionThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
This skill documents how to create evaluator functions in `evaluators.ts` for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.
This skill documents how to create evaluator functions in evaluators.ts for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.
For smaller workflows, use a single evaluators.ts file:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # All evaluators in one file
├── types.ts
└── ...
For larger workflows with many evaluators, use an evaluators/ folder:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators/ # Evaluators split into individual files
│ ├── quality.ts
│ ├── accuracy.ts
│ └── completeness.ts
├── types.ts
└── ...
Important: evaluator() calls MUST be in files containing 'evaluators' in the path:
src/workflows/my_workflow/evaluators.ts ✓src/workflows/my_workflow/evaluators/quality.ts ✓src/shared/evaluators/common_evaluators.ts ✓src/workflows/my_workflow/helpers.ts ✗ (cannot contain evaluator() calls)Evaluators are Temporal activities with strict import rules to ensure deterministic replay.
./utils.js, ./types.js, ./helpers.js./lib/helpers.js../../shared/utils/*.js../../shared/clients/*.js../../shared/services/*.jsExample of WRONG imports:
// WRONG - evaluators cannot import other evaluators
import { otherEvaluator } from '../../shared/evaluators/other.js'; // ✗
import { anotherEvaluator } from './other_evaluators.js'; // ✗
// CORRECT - Import from @outputai/core
import {
evaluator,
z,
EvaluationBooleanResult,
EvaluationNumberResult,
EvaluationStringResult,
EvaluationFeedback
} from '@outputai/core';
// WRONG - Never import z from zod
import { z } from 'zod';
// CORRECT - Use @outputai/llm wrapper
import { generateText, Output } from '@outputai/llm';
// WRONG - Never call LLM providers directly
import OpenAI from 'openai';
All imports MUST use .js extension:
// CORRECT
import { BlogContent } from './types.js';
// WRONG - Missing .js extension
import { BlogContent } from './types';
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const myEvaluator = evaluator({
name: 'my_evaluator',
description: 'Description of what this evaluator assesses',
inputSchema: z.object({ /* input schema */ }),
fn: async (input) => {
// Evaluation logic
return new EvaluationBooleanResult({
value: true,
confidence: 0.95
});
}
});
Unique identifier for the evaluator. Use snake_case.
name: 'evaluate_content_quality'
Human-readable description of what the evaluator assesses.
description: 'Evaluate the quality and completeness of generated content'
Schema for validating evaluator input.
inputSchema: z.object({
content: z.string(),
expectedLength: z.number()
})
The evaluator execution function. Returns an evaluation result with value and confidence.
fn: async (input) => {
const isValid = input.content.length >= input.expectedLength;
return new EvaluationBooleanResult({
value: isValid,
confidence: 0.95
});
}
Use for pass/fail or true/false evaluations:
import { EvaluationBooleanResult } from '@outputai/core';
return new EvaluationBooleanResult({
value: true, // boolean result
confidence: 0.95, // 0.0 to 1.0
reasoning: 'Optional explanation of the evaluation'
});
Use for numeric scores or ratings:
import { EvaluationNumberResult } from '@outputai/core';
return new EvaluationNumberResult({
value: 85, // numeric result (e.g., 0-100 score)
confidence: 0.85, // 0.0 to 1.0
reasoning: 'Optional explanation of the score'
});
Use for categorical or text-based evaluations:
import { EvaluationStringResult } from '@outputai/core';
return new EvaluationStringResult({
value: 'positive', // string result (e.g., category, sentiment, label)
confidence: 0.9, // 0.0 to 1.0
reasoning: 'Optional explanation of the classification'
});
| Property | Type | Required | Description |
|---|---|---|---|
value | boolean, number, or string | Yes | The evaluation result |
confidence | number (0.0-1.0) | Yes | Confidence in the evaluation |
reasoning | string | No | Explanation of the evaluation |
name | string | No | Name for this specific result (useful in dimensions) |
feedback | EvaluationFeedback[] | No | Array of feedback objects with issues and suggestions |
dimensions | EvaluationResult[] | No | Nested results for multi-dimensional evaluation |
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateCompleteness = evaluator({
name: 'evaluate_completeness',
description: 'Check if content meets minimum length requirements',
inputSchema: z.object({
content: z.string(),
minLength: z.number().default(100)
}),
fn: async ({ content, minLength }) => {
const isComplete = content.length >= minLength;
return new EvaluationBooleanResult({
value: isComplete,
confidence: 1.0,
reasoning: isComplete
? `Content has ${content.length} characters, meets minimum of ${minLength}`
: `Content has ${content.length} characters, below minimum of ${minLength}`
});
}
});
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateGibberish = evaluator({
name: 'evaluate_gibberish',
description: 'Check if a given string is gibberish',
inputSchema: z.string(),
fn: async content => {
const gibberishPatterns = ['foo', 'bar', 'lorem', 'ipsum'];
const isGibberish = gibberishPatterns.some(p => content.toLowerCase().includes(p));
return new EvaluationBooleanResult({
value: !isGibberish,
confidence: 0.95
});
}
});
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
export const evaluateReadability = evaluator({
name: 'evaluate_readability',
description: 'Calculate readability score based on sentence structure',
inputSchema: z.object({
content: z.string()
}),
fn: async ({ content }) => {
const sentences = content.split(/[.!?]+/).filter(s => s.trim());
const words = content.split(/\s+/).filter(w => w.trim());
const avgWordsPerSentence = words.length / Math.max(sentences.length, 1);
// Simple readability score (lower avg words = more readable)
const score = Math.max(0, Math.min(100, 100 - (avgWordsPerSentence - 15) * 5));
return new EvaluationNumberResult({
value: Math.round(score),
confidence: 0.8,
reasoning: `Average ${avgWordsPerSentence.toFixed(1)} words per sentence`
});
}
});
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
export const evaluateSentiment = evaluator({
name: 'evaluate_sentiment',
description: 'Classify the sentiment of content',
inputSchema: z.object({
content: z.string()
}),
fn: async ({ content }) => {
const positiveWords = ['great', 'excellent', 'amazing', 'good', 'love'];
const negativeWords = ['bad', 'terrible', 'awful', 'hate', 'poor'];
const lowerContent = content.toLowerCase();
const positiveCount = positiveWords.filter(w => lowerContent.includes(w)).length;
const negativeCount = negativeWords.filter(w => lowerContent.includes(w)).length;
let sentiment: string;
let confidence: number;
if (positiveCount > negativeCount) {
sentiment = 'positive';
confidence = Math.min(0.95, 0.6 + positiveCount * 0.1);
} else if (negativeCount > positiveCount) {
sentiment = 'negative';
confidence = Math.min(0.95, 0.6 + negativeCount * 0.1);
} else {
sentiment = 'neutral';
confidence = 0.7;
}
return new EvaluationStringResult({
value: sentiment,
confidence,
reasoning: `Found ${positiveCount} positive and ${negativeCount} negative indicators`
});
}
});
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateSignalToNoise = evaluator({
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of content',
inputSchema: z.object({
title: z.string(),
content: z.string()
}),
fn: async ({ title, content }) => {
const { output } = await generateText({
prompt: 'signal_noise@v1', // References prompts/signal_noise@v1.prompt
variables: {
title,
content
},
output: Output.object({
schema: z.object({
score: z.number().describe('Signal-to-noise score 0-100')
})
})
});
return new EvaluationNumberResult({
value: output.score,
confidence: 0.85
});
}
});
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateFactualAccuracy = evaluator({
name: 'evaluate_factual_accuracy',
description: 'Check if content contains factual claims that can be verified',
inputSchema: z.object({
content: z.string(),
topic: z.string()
}),
fn: async ({ content, topic }) => {
const { output } = await generateText({
prompt: 'factual_check@v1',
variables: { content, topic },
output: Output.object({
schema: z.object({
isFactual: z.boolean().describe('Whether content appears factually accurate'),
confidence: z.number().describe('Confidence in assessment 0-1'),
issues: z.array(z.string()).optional().describe('Any factual issues found')
})
})
});
return new EvaluationBooleanResult({
value: output.isFactual,
confidence: output.confidence,
reasoning: output.issues?.length
? `Issues found: ${output.issues.join(', ')}`
: 'No factual issues detected'
});
}
});
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateContentCategory = evaluator({
name: 'evaluate_content_category',
description: 'Classify content into a category',
inputSchema: z.object({
content: z.string(),
categories: z.array(z.string())
}),
fn: async ({ content, categories }) => {
const { output } = await generateText({
prompt: 'categorize_content@v1',
variables: {
content,
categories: categories.join(', ')
},
output: Output.object({
schema: z.object({
category: z.string().describe('The best matching category'),
confidence: z.number().describe('Confidence in classification 0-1'),
explanation: z.string().describe('Why this category was chosen')
})
})
});
return new EvaluationStringResult({
value: output.category,
confidence: output.confidence,
reasoning: output.explanation
});
}
});
Use the feedback field to provide actionable improvement suggestions alongside your evaluation result. Import EvaluationFeedback from @outputai/core to create feedback objects.
import { evaluator, z, EvaluationStringResult, EvaluationFeedback } from '@outputai/core';
export const evaluateWithFeedback = evaluator({
name: 'evaluate_with_feedback',
description: 'Evaluate content quality and provide actionable feedback',
inputSchema: z.string(),
fn: async (response) => {
const feedback = [];
if (response.length < 50) {
feedback.push(new EvaluationFeedback({
issue: 'Response is too short',
suggestion: 'Expand the response with more detail',
priority: 'medium'
}));
}
return new EvaluationStringResult({
value: feedback.length === 0 ? 'good' : 'needs_improvement',
confidence: 0.85,
feedback: feedback
});
}
});
| Property | Type | Description |
|---|---|---|
issue | string | The problem identified |
suggestion | string | Recommended fix |
priority | string | Priority level (e.g., 'low', 'medium', 'high') |
Use the dimensions field to nest EvaluationResult instances for sub-scores. Each dimension should use the name field to identify it.
import { evaluator, z, EvaluationStringResult, EvaluationNumberResult } from '@outputai/core';
export const evaluateMultiDimensional = evaluator({
name: 'evaluate_multi_dimensional',
description: 'Evaluate content across multiple quality dimensions',
inputSchema: z.string(),
fn: async (response) => {
const coherenceScore = calculateCoherence(response);
const relevanceScore = calculateRelevance(response);
const overallScore = (coherenceScore + relevanceScore) / 2;
return new EvaluationStringResult({
value: overallScore > 0.7 ? 'high_quality' : 'low_quality',
confidence: 0.9,
dimensions: [
new EvaluationNumberResult({
value: coherenceScore,
confidence: 0.85,
name: 'coherence'
}),
new EvaluationNumberResult({
value: relevanceScore,
confidence: 0.88,
name: 'relevance'
})
]
});
}
});
Based on a real workflow evaluator file:
import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { blogContentSchema } from './types.js';
import type { BlogContent, QualityMetrics } from './types.js';
// Simple boolean evaluator
export const evaluateMinimumLength = evaluator({
name: 'evaluate_minimum_length',
description: 'Check if blog content meets minimum length requirements',
inputSchema: blogContentSchema,
fn: async (input: BlogContent) => {
const MIN_TOKENS = 500;
const meetsRequirement = input.tokenCount >= MIN_TOKENS;
return new EvaluationBooleanResult({
value: meetsRequirement,
confidence: 1.0,
reasoning: `Content has ${input.tokenCount} tokens (minimum: ${MIN_TOKENS})`
});
}
});
// LLM-powered number evaluator
export const evaluateSignalToNoise = evaluator({
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of blog content',
inputSchema: blogContentSchema,
fn: async (input: BlogContent) => {
const { output } = await generateText({
prompt: 'signal_noise@v1',
variables: {
title: input.title,
content: input.content
},
output: Output.object({
schema: z.object({
score: z.number().describe('Signal-to-noise score 0-100')
})
})
});
return new EvaluationNumberResult({
value: output.score,
confidence: 0.85
});
}
});
// LLM-powered boolean evaluator
export const evaluateRelevance = evaluator({
name: 'evaluate_relevance',
description: 'Check if content is relevant to the stated topic',
inputSchema: z.object({
content: z.string(),
topic: z.string(),
keywords: z.array(z.string())
}),
fn: async ({ content, topic, keywords }) => {
const { output } = await generateText({
prompt: 'relevance_check@v1',
variables: { content, topic, keywords: keywords.join(', ') },
output: Output.object({
schema: z.object({
isRelevant: z.boolean(),
relevanceScore: z.number().describe('Relevance score 0-1'),
explanation: z.string()
})
})
});
return new EvaluationBooleanResult({
value: output.isRelevant,
confidence: output.relevanceScore,
reasoning: output.explanation
});
}
});
// Boolean for pass/fail decisions
return new EvaluationBooleanResult({ value: true, confidence: 0.9 });
// Number for scores and ratings
return new EvaluationNumberResult({ value: 85, confidence: 0.85 });
// String for categories, labels, or classifications
return new EvaluationStringResult({ value: 'positive', confidence: 0.9 });
// High confidence for deterministic checks
confidence: 1.0 // e.g., length checks, pattern matching
// Medium confidence for heuristic-based evaluations
confidence: 0.85 // e.g., LLM-based assessments
// Lower confidence for uncertain evaluations
confidence: 0.7 // e.g., subjective quality judgments
return new EvaluationBooleanResult({
value: false,
confidence: 0.95,
reasoning: `Content contains ${errorCount} grammatical errors, exceeding threshold of ${maxErrors}`
});
// Good - single responsibility
export const evaluateGrammar = evaluator({ ... });
export const evaluateReadability = evaluator({ ... });
export const evaluateTone = evaluator({ ... });
// Avoid - doing too much in one evaluator
export const evaluateEverything = evaluator({ ... });
// Good - clear what is being evaluated
name: 'evaluate_content_originality'
name: 'evaluate_factual_accuracy'
name: 'evaluate_sentiment_alignment'
// Avoid - vague names
name: 'check'
name: 'validate'
name: 'evaluate_stuff'
feedback: [
new EvaluationFeedback({
issue: 'Missing conclusion paragraph',
suggestion: 'Add a summary paragraph at the end',
priority: 'high'
})
]
dimensions: [
new EvaluationNumberResult({ value: 8, confidence: 0.9, name: 'coherence' }),
new EvaluationNumberResult({ value: 6, confidence: 0.85, name: 'relevance' })
]
evaluator, z, result types imported from @outputai/coregenerateText and Output imported from @outputai/llm if using LLM (not direct provider).describe() instead of .min()/.max() on z.number().js extensionname, description, inputSchema, fnsnake_caseEvaluationBooleanResult, EvaluationNumberResult, or EvaluationStringResult)EvaluationFeedback imported from @outputai/core when using feedbackissue, suggestion, and priorityname field to identify sub-evaluationsoutput-dev-workflow-function - Orchestrating evaluators in workflow.tsoutput-dev-step-function - Creating step functionsoutput-dev-types-file - Defining evaluator input schemasoutput-dev-prompt-file - Creating prompt files for LLM-powered evaluatorsoutput-dev-folder-structure - Understanding project layoutnpx claudepluginhub growthxai/output --plugin outputaiCreates offline evaluation tests for Output SDK workflows using @outputai/evals: verify() evaluators, YAML datasets, eval workflows, and CLI tests.
Builds LangSmith evaluation pipelines: create LLM-as-Judge/custom evaluators, capture agent outputs/trajectories via run functions, run locally with evaluate() or CLI.
Creates and runs LLM-as-judge and code evaluators on Arize: hallucination, faithfulness, correctness, relevance scoring, experiment evaluation, column mapping, and continuous monitoring.