From evaluation
Tracks AI product quality over time, detecting drift, degradation, and improvements using golden test sets, automated evals, dashboards, and alerts. Useful for maintaining AI reliability.
```
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
```

This skill uses the workspace's default tool permissions.
AI products change over time — models get updated, usage patterns shift, and quality can drift without anyone noticing. Longitudinal measurement is how you track quality across time and catch degradation before users do.
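As a concrete illustration of the loop, here is a minimal TypeScript sketch of a golden-set run that appends a dated pass rate to a history file. The file names (golden-set.json, eval-history.csv), the runModel placeholder, and the exact-match scorer are illustrative assumptions, not the skill's actual implementation.

```ts
import { readFileSync, appendFileSync } from "node:fs";

type GoldenCase = { input: string; expected: string };

// Placeholder for the model or agent call under test (assumption).
async function runModel(input: string): Promise<string> {
  return `echo: ${input}`; // swap in the real product call
}

// Crude exact-match scorer; real golden sets often need graded scoring.
function score(actual: string, expected: string): number {
  return actual.trim() === expected.trim() ? 1 : 0;
}

async function main(): Promise<void> {
  const cases: GoldenCase[] = JSON.parse(readFileSync("golden-set.json", "utf8"));
  let passed = 0;
  for (const c of cases) {
    passed += score(await runModel(c.input), c.expected);
  }
  const passRate = passed / cases.length;
  // Append one dated row per run; this series is what you trend over time.
  appendFileSync("eval-history.csv", `${new Date().toISOString()},${passRate.toFixed(3)}\n`);
  console.log(`golden-set pass rate: ${(passRate * 100).toFixed(1)}%`);
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```

Run on a schedule, the accumulated rows form the time series that dashboards and alerts consume.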
When measurements show drift:
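One way that response might be automated, sketched under assumptions (the eval-history.csv format from the sketch above, a 7-run baseline window, a 5-point drop threshold; none of these are the skill's actual procedure), is a check that compares the latest pass rate against a trailing baseline and fails loudly on a significant drop.

```ts
import { readFileSync } from "node:fs";

const WINDOW = 7;       // trailing runs that form the baseline (assumption)
const THRESHOLD = 0.05; // alert on a drop of more than 5 points (assumption)

// Each row of eval-history.csv is "ISO timestamp,pass rate" (see sketch above).
const rates = readFileSync("eval-history.csv", "utf8")
  .trim()
  .split("\n")
  .map((line) => parseFloat(line.split(",")[1]));

const latest = rates[rates.length - 1];
const trailing = rates.slice(-1 - WINDOW, -1);

if (trailing.length === 0) {
  console.log("not enough history for a baseline yet");
  process.exit(0);
}

const baseline = trailing.reduce((sum, r) => sum + r, 0) / trailing.length;

if (latest < baseline - THRESHOLD) {
  // A real setup would page someone or post to a channel here.
  console.error(`DRIFT: latest ${latest.toFixed(2)} vs baseline ${baseline.toFixed(2)}`);
  process.exitCode = 1;
} else {
  console.log(`OK: latest ${latest.toFixed(2)} (baseline ${baseline.toFixed(2)})`);
}
```

Wired into CI or a scheduled job, a check like this turns a one-off eval into the early-warning signal described above.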