Create and manage evaluators in Adaline to score prompt outputs. Use when setting up LLM-as-a-judge, JavaScript, text-matcher, cost, latency, or response-length evaluators for quality assessment.
```bash
npx claudepluginhub adaline/skills --plugin skills
```

This skill uses the workspace's default tool permissions.
Evaluators define scoring criteria for prompt outputs. Each evaluator run produces a grade (pass/fail), a numeric score, and a reason. Multiple evaluators can be stacked on the same prompt to assess different quality dimensions simultaneously.
Key terms:

- Grade — pass or fail
- Group — ai-powered, code, or performance

Set these environment variables when your Adaline credentials are available:

- ADALINE_API_KEY — your workspace API key (from Settings > API Keys at app.adaline.ai)
- promptId — the prompt to attach evaluators to (from the Adaline editor URL)
- datasetId — the dataset to link evaluators to (from the Adaline dashboard)

Base URL: https://api.adaline.ai/v2

You can start integrating before you have credentials. All code examples use placeholder values — replace them with real values when ready.
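For local testing you can stash these in your shell. The values below are placeholders, and PROMPT_ID and DATASET_ID are just convenience variables for the curl examples, not names the API itself reads:

```bash
# Placeholders: substitute real values from app.adaline.ai.
export ADALINE_API_KEY="your-api-key"
export PROMPT_ID="your-prompt-id"
export DATASET_ID="your-dataset-id"
```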
LLM-as-a-judge: Uses a model to score outputs against a custom rubric. Most versatile option — use for qualitative assessment, tone, helpfulness, accuracy, or any criterion expressible in natural language.
Config:

```json
{
  "type": "llm-as-a-judge",
  "group": "ai-powered",
  "value": "Does the response answer the user's question completely and without hallucination? Score pass if all claims are grounded in the context provided."
}
```
The value field is the rubric string. Write it as a clear instruction to the judge model. Include what constitutes pass vs. fail.
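A rubric that spells out both outcomes might look like this (the rubric text is illustrative, not a required template):

```json
{
  "type": "llm-as-a-judge",
  "group": "ai-powered",
  "value": "Evaluate the response for factual accuracy. Pass: every claim is supported by the provided context and the question is fully answered. Fail: any claim is unsupported or contradicted, or the question is only partially answered."
}
```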
JavaScript: Runs a custom JavaScript function body against the output. Return 'pass' or 'fail' as a string. Use for structural checks, format validation, or logic that text matching cannot express.
Config:

```json
{
  "type": "javascript",
  "group": "code",
  "value": "const parsed = JSON.parse(output); return parsed.hasOwnProperty('answer') ? 'pass' : 'fail';"
}
```
The function body receives output (string), input (string), and expectedOutput (string) as variables.
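Note that JSON.parse throws on malformed output, so the example above would error rather than fail on non-JSON responses. A more defensive function body (a sketch, assuming you want malformed output to count as a failure) guards the parse:

```javascript
// Function body for a "javascript" evaluator.
// Available variables: output (string), input (string), expectedOutput (string).
try {
  const parsed = JSON.parse(output);
  // Pass only if the response is valid JSON with an "answer" field.
  return Object.prototype.hasOwnProperty.call(parsed, "answer") ? "pass" : "fail";
} catch (e) {
  // Malformed JSON counts as a failure instead of a runtime error.
  return "fail";
}
```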
Text Matcher: Checks the output string against a pattern using a built-in operator. Use for exact or partial string validation without custom code.
Operators: regex, equals, starts-with, ends-with, contains-all, contains-any, not-contains-any
Config:

```json
{
  "type": "text-matcher",
  "group": "code",
  "value": {
    "operator": "contains-all",
    "patterns": ["summary", "conclusion"]
  }
}
```
For regex, equals, starts-with, ends-with: use a single "pattern" string field.
For contains-all, contains-any, not-contains-any: use a "patterns" array.
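For example, the single-pattern form with the regex operator (the pattern itself is illustrative; it checks that the output looks like a JSON object):

```json
{
  "type": "text-matcher",
  "group": "code",
  "value": {
    "operator": "regex",
    "pattern": "^\\{[\\s\\S]*\\}$"
  }
}
```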
Cost: Passes when the token cost of the prompt run meets a threshold. Use to enforce spending SLAs.
Config:

```json
{
  "type": "cost",
  "group": "performance",
  "value": {
    "threshold": {
      "value": 0.01,
      "unit": "USD",
      "operator": "less"
    }
  }
}
```
Operators: less, greater, equals
Unit: USD
Latency: Passes when the response time of the prompt run meets a threshold. Use to enforce latency budgets.
Config:

```json
{
  "type": "latency",
  "group": "performance",
  "value": {
    "threshold": {
      "value": 2000,
      "unit": "ms",
      "operator": "less"
    }
  }
}
```
Operators: less, greater, equals
Unit: ms
Response Length: Passes when the length of the model response meets a threshold. Use to enforce brevity or minimum coverage requirements.
Config:

```json
{
  "type": "response-length",
  "group": "performance",
  "value": {
    "threshold": {
      "value": 200,
      "unit": "words",
      "operator": "less"
    }
  }
}
```
Operators: less, greater, equals
Units: words, tokens, characters
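To enforce a minimum instead of a maximum, flip the operator; for example (values illustrative), this passes only when the response exceeds 50 words:

```json
{
  "type": "response-length",
  "group": "performance",
  "value": {
    "threshold": {
      "value": 50,
      "unit": "words",
      "operator": "greater"
    }
  }
}
```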
| Goal | Evaluator to use |
|---|---|
| Need to assess tone, accuracy, or helpfulness | Use LLM-as-a-judge with an explicit rubric |
| Need to validate JSON structure or field presence | Use JavaScript evaluator |
| Need to check if output contains required phrases | Use Text Matcher with contains-all |
| Need to ensure responses stay under a cost budget | Use Cost evaluator with less operator |
| Need to catch slow prompt runs | Use Latency evaluator with less operator |
| Need to prevent overly long or too-short responses | Use Response Length evaluator |
To create an evaluator, POST its config to the prompt's evaluators endpoint:

```bash
curl -X POST https://api.adaline.ai/v2/prompts/{promptId}/evaluators \
  -H "Authorization: Bearer $ADALINE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "your-project-id",
    "title": "Factual accuracy judge",
    "datasetId": "your-dataset-id",
    "config": {
      "type": "llm-as-a-judge",
      "group": "ai-powered",
      "value": "Does the response answer the question accurately without hallucination?"
    }
  }'
```
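The same request in JavaScript, as a sketch (assumes a Node 18+ runtime with global fetch; see references/api.md for the actual response schema):

```javascript
// Create an LLM-as-a-judge evaluator on a prompt.
// Placeholder IDs: substitute real values from your Adaline workspace.
const promptId = "your-prompt-id";

const res = await fetch(`https://api.adaline.ai/v2/prompts/${promptId}/evaluators`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ADALINE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    projectId: "your-project-id",
    title: "Factual accuracy judge",
    datasetId: "your-dataset-id",
    config: {
      type: "llm-as-a-judge",
      group: "ai-powered",
      value: "Does the response answer the question accurately without hallucination?",
    },
  }),
});

if (!res.ok) throw new Error(`Evaluator creation failed: ${res.status}`);
console.log(await res.json());
```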
See references/api.md for the full REST API reference with all endpoints, request schemas, and curl examples.