Run and manage evaluations in Adaline to test prompt quality at scale. Use when running batch evaluations, checking evaluation status, analyzing results, or setting up continuous evaluation pipelines.
Install the skill:

`npx claudepluginhub adaline/skills --plugin skills`

This skill uses the workspace's default tool permissions.
Evaluations execute prompts against datasets and score responses using evaluators. They run in the cloud asynchronously — you trigger a run and poll for results.
Key terms:

- Grade — each result row is scored pass, fail, or unknown.
- Status — an evaluation moves through one of these paths (a type-level sketch follows below):
  - queued -> running -> completed
  - queued -> running -> failed
  - queued -> running -> cancelling -> cancelled

Status meanings:

- queued — evaluation accepted, waiting for a worker slot (up to 5 concurrent evaluations)
- running — actively processing rows; check progress for completion percentage
- completed — all rows processed; metrics and results are available
- failed — the evaluation could not complete; check error details
- cancelling — cancel requested; rows already in-flight may still complete
- cancelled — evaluation stopped before finishing
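A minimal sketch of this lifecycle as a TypeScript union type with a terminal-state check; the type and helper names are illustrative, not from an official Adaline SDK:

```ts
// Status strings match the lifecycle above; the names below are
// illustrative, not part of an official Adaline SDK.
type EvaluationStatus =
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "cancelling"
  | "cancelled";

// completed, failed, and cancelled are terminal: stop polling on these.
const TERMINAL_STATUSES = new Set<EvaluationStatus>([
  "completed",
  "failed",
  "cancelled",
]);

function isTerminal(status: EvaluationStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}
```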
Set these environment variables when your Adaline credentials are available:

- ADALINE_API_KEY — your workspace API key (from Settings > API Keys at app.adaline.ai)
- promptId — the prompt to evaluate (from the Adaline dashboard or prompts API)
- datasetId — the dataset to run against (from the datasets API)
- evaluatorId — optional; if omitted, all active evaluators attached to the prompt run

You can start integrating before you have credentials. All code examples use placeholder values — replace them with real values when ready.
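Before the endpoint walkthrough below, a minimal client-setup sketch, assuming Node 18+ (built-in fetch) and bearer-token auth; the base URL and header scheme are assumptions to verify against references/api.md:

```ts
// Placeholder values: replace with real IDs from your Adaline workspace.
const API_KEY = process.env.ADALINE_API_KEY ?? "adl_api_key_placeholder";
const BASE_URL = "https://api.adaline.ai/v1"; // assumed base URL, not confirmed

const promptId = "prompt_id_placeholder";
const datasetId = "dataset_id_placeholder";

// Assumed auth scheme (bearer token); verify in references/api.md.
const headers = {
  Authorization: `Bearer ${API_KEY}`,
  "Content-Type": "application/json",
};
```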
| Symptom | First Fix |
|---|---|
| Evaluation stuck in queued | Check the concurrent limit (max 5); wait for another run to complete or cancel it |
| High unknownRuns count | Check evaluator configuration; unknown means the evaluator could not produce a grade |
| Evaluation failed immediately | Verify promptId, datasetId, and that the dataset has rows |
| Results show unexpected fail grades | Filter results by grade=fail and inspect the reason field for each row |
| Cost higher than expected | Check totalCost in metrics; compare totalTokenCount against prompt length |
POST /prompts/{promptId}/evaluations
Pass datasetId in the request body. Optionally pass evaluatorIds to run specific evaluators only. The response returns status queued immediately.
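A hedged sketch of starting a run, building on the setup above; the response shape ({ id, status }) is an assumption:

```ts
// Start an evaluation run; the response should arrive with status "queued".
async function startEvaluation(
  promptId: string,
  datasetId: string,
  evaluatorIds?: string[],
): Promise<{ id: string; status: string }> {
  const res = await fetch(`${BASE_URL}/prompts/${promptId}/evaluations`, {
    method: "POST",
    headers,
    // evaluatorIds is optional; omit it to run all active evaluators.
    body: JSON.stringify({ datasetId, ...(evaluatorIds ? { evaluatorIds } : {}) }),
  });
  if (!res.ok) throw new Error(`Start failed: HTTP ${res.status}`);
  return res.json() as Promise<{ id: string; status: string }>;
}
```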
GET /prompts/{promptId}/evaluations/{id}
Check status and progress.percent. Repeat until status reaches a terminal state (completed, failed, or cancelled).
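A polling sketch using the status and progress fields described in Key terms; the 5-second interval is an arbitrary choice, not a documented recommendation:

```ts
// Poll until the run reaches a terminal state, logging progress while running.
async function waitForEvaluation(promptId: string, id: string): Promise<any> {
  for (;;) {
    const res = await fetch(
      `${BASE_URL}/prompts/${promptId}/evaluations/${id}`,
      { headers },
    );
    if (!res.ok) throw new Error(`Poll failed: HTTP ${res.status}`);
    const evaluation = await res.json();

    if (evaluation.status === "running" && evaluation.progress) {
      console.log(`progress: ${evaluation.progress.percent}%`);
    }
    if (["completed", "failed", "cancelled"].includes(evaluation.status)) {
      return evaluation; // terminal state reached
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}
```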
GET /prompts/{promptId}/evaluations/{id}/results
Each result row contains grade, score, reason, cost, latency, and tokenCount. Filter by grade=fail to focus debugging on failures.
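A sketch of pulling failures for debugging; whether the API accepts grade=fail as a query parameter is not confirmed here, so this version filters client-side:

```ts
// Fetch all result rows, then keep only failures for debugging.
// (The API may also accept grade=fail as a query parameter; this sketch
// filters client-side to avoid relying on that.)
async function getFailures(promptId: string, id: string) {
  const res = await fetch(
    `${BASE_URL}/prompts/${promptId}/evaluations/${id}/results`,
    { headers },
  );
  if (!res.ok) throw new Error(`Results failed: HTTP ${res.status}`);
  // Assumes the body is an array of rows; adjust if it is wrapped in an object.
  const rows: Array<{ grade: string; score: number; reason: string }> =
    await res.json();
  return rows.filter((row) => row.grade === "fail");
}
```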
POST /prompts/{promptId}/evaluations/{id}/cancel
Use when a run is no longer needed or blocking a new one.
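A cancellation sketch under the same assumptions; per the lifecycle above, rows already in flight may still complete:

```ts
// Request cancellation; status moves through "cancelling" to "cancelled".
async function cancelEvaluation(promptId: string, id: string): Promise<void> {
  const res = await fetch(
    `${BASE_URL}/prompts/${promptId}/evaluations/${id}/cancel`,
    { method: "POST", headers },
  );
  if (!res.ok) throw new Error(`Cancel failed: HTTP ${res.status}`);
}
```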
Metrics are available on the evaluation object once status is completed:
| Field | Description |
|---|---|
| passRuns | Number of rows graded pass |
| failRuns | Number of rows graded fail |
| unknownRuns | Number of rows where the grade could not be determined |
| totalCost | Summed cost across all rows (dollars) |
| totalLatency | Summed latency across all rows (milliseconds) |
| totalTokenCount | Summed token count across all rows |
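To make the table concrete, a small sketch that derives a pass rate from these fields; the flat interface shape is an assumption based on the field names above:

```ts
// Field names mirror the metrics table; the flat shape is an assumption.
interface EvaluationMetrics {
  passRuns: number;
  failRuns: number;
  unknownRuns: number;
  totalCost: number; // dollars
  totalLatency: number; // milliseconds
  totalTokenCount: number;
}

// Pass rate over graded rows. Excluding unknown rows from the denominator
// is a judgment call; include them if you treat unknown as failure.
function passRate(m: EvaluationMetrics): number {
  const graded = m.passRuns + m.failRuns;
  return graded === 0 ? 0 : m.passRuns / graded;
}
```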
Progress is available on the evaluation object while status is running:
| Field | Description |
|---|---|
| totalRows | Total rows in the dataset |
| completedRows | Rows that finished (pass or fail grade) |
| failedRows | Rows that errored during processing |
| percent | Integer 0–100 |
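And a matching sketch for the progress object, again assuming a flat shape with the field names above:

```ts
// Field names mirror the progress table; the flat shape is an assumption.
interface EvaluationProgress {
  totalRows: number;
  completedRows: number; // finished with a pass or fail grade
  failedRows: number; // errored during processing
  percent: number; // integer 0-100
}

// Example: a one-line progress summary for logs.
function formatProgress(p: EvaluationProgress): string {
  return `${p.percent}% (${p.completedRows}/${p.totalRows} done, ${p.failedRows} errored)`;
}
```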
Adaline supports three prompt execution modes in evaluations. Filter results by grade=fail and read the reason field to diagnose prompt issues. See references/api.md for the full REST API reference with request/response schemas and curl examples.