Run and manage evaluations in Adaline to test prompt quality at scale. Use when running batch evaluations, checking evaluation status, analyzing results, or setting up continuous evaluation pipelines.
Install the skill:

`npx claudepluginhub adaline/skills --plugin skills`

This skill uses the workspace's default tool permissions.
Evaluations execute prompts against datasets and score responses using evaluators. They run in the cloud asynchronously — you trigger a run and poll for results.
Key terms:

- Grade — each result row is scored pass, fail, or unknown.
- Status — an evaluation moves through one of these paths (a type-level sketch follows below):
  - queued -> running -> completed
  - queued -> running -> failed
  - queued -> running -> cancelling -> cancelled

Status meanings:

- queued — evaluation accepted, waiting for a worker slot (up to 5 concurrent evaluations)
- running — actively processing rows; check progress for completion percentage
- completed — all rows processed; metrics and results are available
- failed — the evaluation could not complete; check error details
- cancelling — cancel requested; rows already in-flight may still complete
- cancelled — evaluation stopped before finishing
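A minimal sketch of this lifecycle as a TypeScript union type with a terminal-state check; the type and helper names are illustrative, not from an official Adaline SDK:

```ts
// Status strings match the lifecycle above; the names below are
// illustrative, not part of an official Adaline SDK.
type EvaluationStatus =
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "cancelling"
  | "cancelled";

// completed, failed, and cancelled are terminal: stop polling on these.
const TERMINAL_STATUSES = new Set<EvaluationStatus>([
  "completed",
  "failed",
  "cancelled",
]);

function isTerminal(status: EvaluationStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}
```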
Set these environment variables when your Adaline credentials are available:

- ADALINE_API_KEY — your workspace API key (from Settings > API Keys at app.adaline.ai)
- promptId — the prompt to evaluate (from the Adaline dashboard or prompts API)
- datasetId — the dataset to run against (from the datasets API)
- evaluatorId — optional; if omitted, all active evaluators attached to the prompt run

You can start integrating before you have credentials. All code examples use placeholder values — replace them with real values when ready.
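Before the endpoint walkthrough below, a minimal client-setup sketch, assuming Node 18+ (built-in fetch) and bearer-token auth; the base URL and header scheme are assumptions to verify against references/api.md:

```ts
// Placeholder values: replace with real IDs from your Adaline workspace.
const API_KEY = process.env.ADALINE_API_KEY ?? "adl_api_key_placeholder";
const BASE_URL = "https://api.adaline.ai/v1"; // assumed base URL, not confirmed

const promptId = "prompt_id_placeholder";
const datasetId = "dataset_id_placeholder";

// Assumed auth scheme (bearer token); verify in references/api.md.
const headers = {
  Authorization: `Bearer ${API_KEY}`,
  "Content-Type": "application/json",
};
```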
| Symptom | First Fix |
|---|---|
| Evaluation stuck in queued | Check the concurrent limit (max 5); wait for another run to complete or cancel it |
| High unknownRuns count | Check evaluator configuration; unknown means the evaluator could not produce a grade |
| Evaluation failed immediately | Verify promptId, datasetId, and that the dataset has rows |
| Results show unexpected fail grades | Filter results by grade=fail and inspect the reason field for each row |
| Cost higher than expected | Check totalCost in metrics; compare totalTokenCount against prompt length |
POST /prompts/{promptId}/evaluations
Pass datasetId in the request body. Optionally pass evaluatorIds to run specific evaluators only. The response returns status queued immediately.
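A hedged sketch of starting a run, building on the setup above; the response shape ({ id, status }) is an assumption:

```ts
// Start an evaluation run; the response should arrive with status "queued".
async function startEvaluation(
  promptId: string,
  datasetId: string,
  evaluatorIds?: string[],
): Promise<{ id: string; status: string }> {
  const res = await fetch(`${BASE_URL}/prompts/${promptId}/evaluations`, {
    method: "POST",
    headers,
    // evaluatorIds is optional; omit it to run all active evaluators.
    body: JSON.stringify({ datasetId, ...(evaluatorIds ? { evaluatorIds } : {}) }),
  });
  if (!res.ok) throw new Error(`Start failed: HTTP ${res.status}`);
  return res.json() as Promise<{ id: string; status: string }>;
}
```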
GET /prompts/{promptId}/evaluations/{id}
Check status and progress.percent. Repeat until status reaches a terminal state (completed, failed, or cancelled).
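A polling sketch using the status and progress fields described in Key terms; the 5-second interval is an arbitrary choice, not a documented recommendation:

```ts
// Poll until the run reaches a terminal state, logging progress while running.
async function waitForEvaluation(promptId: string, id: string): Promise<any> {
  for (;;) {
    const res = await fetch(
      `${BASE_URL}/prompts/${promptId}/evaluations/${id}`,
      { headers },
    );
    if (!res.ok) throw new Error(`Poll failed: HTTP ${res.status}`);
    const evaluation = await res.json();

    if (evaluation.status === "running" && evaluation.progress) {
      console.log(`progress: ${evaluation.progress.percent}%`);
    }
    if (["completed", "failed", "cancelled"].includes(evaluation.status)) {
      return evaluation; // terminal state reached
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}
```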
GET /prompts/{promptId}/evaluations/{id}/results
Each result row contains grade, score, reason, cost, latency, and tokenCount. Filter by grade=fail to focus debugging on failures.
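A sketch of pulling failures for debugging; whether the API accepts grade=fail as a query parameter is not confirmed here, so this version filters client-side:

```ts
// Fetch all result rows, then keep only failures for debugging.
// (The API may also accept grade=fail as a query parameter; this sketch
// filters client-side to avoid relying on that.)
async function getFailures(promptId: string, id: string) {
  const res = await fetch(
    `${BASE_URL}/prompts/${promptId}/evaluations/${id}/results`,
    { headers },
  );
  if (!res.ok) throw new Error(`Results failed: HTTP ${res.status}`);
  // Assumes the body is an array of rows; adjust if it is wrapped in an object.
  const rows: Array<{ grade: string; score: number; reason: string }> =
    await res.json();
  return rows.filter((row) => row.grade === "fail");
}
```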
POST /prompts/{promptId}/evaluations/{id}/cancel
Use when a run is no longer needed or blocking a new one.
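A cancellation sketch under the same assumptions; per the lifecycle above, rows already in flight may still complete:

```ts
// Request cancellation; status moves through "cancelling" to "cancelled".
async function cancelEvaluation(promptId: string, id: string): Promise<void> {
  const res = await fetch(
    `${BASE_URL}/prompts/${promptId}/evaluations/${id}/cancel`,
    { method: "POST", headers },
  );
  if (!res.ok) throw new Error(`Cancel failed: HTTP ${res.status}`);
}
```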
Metrics are available on the evaluation object once status is completed:
| Field | Description |
|---|---|
| passRuns | Number of rows graded pass |
| failRuns | Number of rows graded fail |
| unknownRuns | Number of rows where the grade could not be determined |
| totalCost | Summed cost across all rows (dollars) |
| totalLatency | Summed latency across all rows (milliseconds) |
| totalTokenCount | Summed token count across all rows |
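To make the table concrete, a small sketch that derives a pass rate from these fields; the flat interface shape is an assumption based on the field names above:

```ts
// Field names mirror the metrics table; the flat shape is an assumption.
interface EvaluationMetrics {
  passRuns: number;
  failRuns: number;
  unknownRuns: number;
  totalCost: number; // dollars
  totalLatency: number; // milliseconds
  totalTokenCount: number;
}

// Pass rate over graded rows. Excluding unknown rows from the denominator
// is a judgment call; include them if you treat unknown as failure.
function passRate(m: EvaluationMetrics): number {
  const graded = m.passRuns + m.failRuns;
  return graded === 0 ? 0 : m.passRuns / graded;
}
```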
Progress is available on the evaluation object while status is running:
| Field | Description |
|---|---|
| totalRows | Total rows in the dataset |
| completedRows | Rows that finished (pass or fail grade) |
| failedRows | Rows that errored during processing |
| percent | Integer 0–100 |
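And a matching sketch for the progress object, again assuming a flat shape with the field names above:

```ts
// Field names mirror the progress table; the flat shape is an assumption.
interface EvaluationProgress {
  totalRows: number;
  completedRows: number; // finished with a pass or fail grade
  failedRows: number; // errored during processing
  percent: number; // integer 0-100
}

// Example: a one-line progress summary for logs.
function formatProgress(p: EvaluationProgress): string {
  return `${p.percent}% (${p.completedRows}/${p.totalRows} done, ${p.failedRows} errored)`;
}
```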
Adaline supports three prompt execution modes in evaluations. Filter results by grade=fail and read the reason field to diagnose prompt issues. See references/api.md for the full REST API reference with request/response schemas and curl examples.