From the evaluation plugin
Defines task success metrics, such as completion rate, time to completion, and intervention rate, to evaluate whether the AI actually helps users achieve their goals, beyond output quality alone.
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
This skill uses the workspace's default tool permissions.
Output quality doesn't guarantee task success. The AI might produce a beautiful response that doesn't actually help the user do what they came to do. Task success metrics measure the end-to-end outcome.
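One way to keep these signals separate in practice is to log task outcomes as their own events rather than inferring them from response scores. Below is a minimal sketch in Python; the `track` helper, event names, and all field values are hypothetical stand-ins for whatever analytics pipeline is actually in use.

```python
from datetime import datetime, timezone

def track(event: str, properties: dict) -> None:
    """Placeholder: forward the event to whatever analytics backend is in use."""
    print(event, properties)

# Per-response signal: how good did the output look?
track("response_scored", {"task_id": "t-123", "quality": 0.92})

# End-to-end signal: did the user actually finish what they came to do?
track("task_completed", {
    "task_id": "t-123",
    "completed": False,       # the response read well, but the goal was not met
    "interventions": 2,       # manual corrections the user had to make along the way
    "ended_at": datetime.now(timezone.utc).isoformat(),
})
```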
For each user task, define:
- Completion rate: the share of attempts where the user actually reaches their goal.
- Time to completion: how long it takes the user to get there.
- Intervention rate: how often the user has to step in to correct or take over.
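As a rough illustration of how these might be rolled up, the sketch below assumes a simple list of per-task session records; the field names (`completed`, `duration_seconds`, `interventions`) and the example values are purely illustrative, not from any particular analytics tool.

```python
from statistics import median

def task_success_metrics(sessions: list[dict]) -> dict:
    """Roll per-task session records up into the three headline metrics."""
    if not sessions:
        return {}

    completed = [s for s in sessions if s["completed"]]
    return {
        # Share of tasks where the user reached their goal.
        "completion_rate": len(completed) / len(sessions),
        # Median wall-clock seconds for tasks that did finish.
        "median_seconds_to_completion": (
            median(s["duration_seconds"] for s in completed) if completed else None
        ),
        # Share of tasks that needed at least one manual correction or takeover.
        "intervention_rate": sum(1 for s in sessions if s["interventions"] > 0) / len(sessions),
    }

# Illustrative values only: two finished tasks, one abandoned after an intervention.
print(task_success_metrics([
    {"completed": True,  "duration_seconds": 40,  "interventions": 0},
    {"completed": True,  "duration_seconds": 95,  "interventions": 2},
    {"completed": False, "duration_seconds": 310, "interventions": 1},
]))
```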
These can diverge from output quality: a response can read beautifully and still leave the user's task unfinished.