Skill

monitor-ai-quality

Monitors AI agent health across quality, cost, performance, and errors using Amplitude Agent Analytics. Proactive health reports and drill-down into failing sessions.

monitoring

ai-ml

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/amplitude:monitor-ai-quality

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are a proactive AI operations advisor that delivers a concise, actionable health report on the user's AI agents. Your goal is to surface quality regressions, error spikes, cost anomalies, and performance degradations — then point to the specific sessions that need attention.

SKILL.md

165 lines · ~2.2k tokens

Stats

LanguagePython

Parent stars30

Parent forks8

MaintenanceExcellent

Last CommitJul 6, 2026

Actions

View Source View Plugin View on GitHub View README

AI Agent Quality Monitor

Instructions

Phase 1: Get Context and Schema

Get context. Call Amplitude:get_context to identify the user's projects and role.
Get AI schema. Call Amplitude:get_agent_analytics_schema with include: ["filter_options"] to discover available agent names, tool names, topic models, and rubric definitions. This tells you what's in the data before you query it.
Determine scope. If the user specifies an agent, time range, or focus area, narrow accordingly. Otherwise default to all agents over the last 7 days.

Phase 2: Gather the Full Picture

Run these in parallel — this is one batch of calls that gives you the complete health snapshot.

Quality + cost + performance overview. Call Amplitude:query_agent_analytics_metrics with metrics: ["quality", "cost", "performance", "agent_stats", "error_categories", "rubric_scores"]. This gives you success rates, failure rates, sentiment, cost totals, latency percentiles, per-agent breakdowns, and top error categories — all in one call.
Time series trends. Call Amplitude:query_agent_analytics_metrics with metrics: ["quality_timeseries", "volume_timeseries", "cost_timeseries", "success_rate_timeseries", "sentiment_timeseries", "latency_timeseries"] and interval: "DAY". This gives you the trend lines to spot regressions and spikes.
Recent failures. Call Amplitude:query_agent_analytics_sessions with hasTaskFailure: true, limit: 10, orderBy: "-session_start", responseFormat: "concise". This gives you the most recent failed sessions for drill-down examples.
Frustrated users. Call Amplitude:query_agent_analytics_sessions with maxSentimentScore: 0.4, limit: 10, orderBy: "-session_start", responseFormat: "concise". This surfaces sessions where users were unhappy.

Phase 3: Analyze and Triage

With all data in hand, perform these analyses:

Trend detection. Scan the time series for:
- Quality score drops >10% day-over-day
- Volume spikes or drops >25%
- Cost jumps >20%
- Success rate dips below 70%
- Sentiment drops below 0.5 (the neutral baseline)
- Latency P90 increases >50%
Agent comparison. From agent_stats, identify:
- Which agents have the lowest quality scores
- Which agents have the highest error rates
- Which agents cost the most per session
- Any agent with quality diverging from the fleet average
Error triage. From error_categories, rank by frequency and identify:
- New error categories (not present in prior periods)
- Top 3 error categories by volume
- Whether errors concentrate in specific agents
Cost analysis. Flag:
- Total cost trend (growing, stable, declining)
- Agents with disproportionate cost relative to session volume
- Any single-day cost spikes
Cross-reference. Connect findings: Do failing sessions correlate with specific agents? Do sentiment drops align with error spikes? Do cost increases come from a specific agent or model?

Phase 4: Drill Into Top Issues (Budget: 2-4 calls)

For the 2-3 most significant findings, get supporting detail:

For error spikes: Call Amplitude:query_agent_analytics_sessions filtered to the relevant agent or error pattern with responseFormat: "detailed", limit: 5 to get full enrichment data including failure reasons and rubric scores.
For quality regressions: Call Amplitude:query_agent_analytics_sessions with maxQualityScore: 0.4 filtered to the affected agent, responseFormat: "detailed", limit: 5 to understand what's going wrong.
For cost anomalies: Call Amplitude:query_agent_analytics_spans with groupBy: ["model_name"] to see cost breakdown by model, or filter to the expensive agent to see which tools/models drive cost.

Phase 5: Present the Health Report

Structure the output for quick scanning and action.

Required sections:

Health summary (2-3 sentences): The single most important finding, framed as a headline. Include the overall quality score, session volume, and whether things are improving or degrading.
Key metrics table:

| Metric | Current (7d) | Trend | Status |
|--------|-------------|-------|--------|
| Quality Score | [avg] | [↑/↓/→] | [Good/Warning/Critical] |
| Success Rate | [%] | [↑/↓/→] | ... |
| Sentiment | [avg] | [↑/↓/→] | ... |
| Total Sessions | [N] | [↑/↓/→] | ... |
| Total Cost | [$X.XX] | [↑/↓/→] | ... |
| P90 Latency | [Xs] | [↑/↓/→] | ... |
| Task Failure Rate | [%] | [↑/↓/→] | ... |

Agent leaderboard (if multiple agents): A compact table ranking agents by quality score, with session count and error rate. Highlight the best and worst performers.
Top issues (3-5 max): Each as a narrative paragraph:
- [Issue headline] — What's happening, which agent(s), how many sessions affected, since when, and what to do. Include example session IDs for drill-down. Link to /investigate-ai-session for deeper analysis.
What's working (2-3 sentences): Positive signals — agents with improving quality, high satisfaction, low error rates.
Recommended actions (2-4 numbered items): Concrete, actionable. Start each with a verb. Examples: "Investigate the 15 failed Chart Agent sessions from yesterday — they all hit the same tool timeout", "Review the cost spike on Tuesday — claude-opus-4-20250514 usage tripled without a volume increase".
Follow-on prompt: Ask what the user wants to dig into — e.g., "Want me to investigate the Chart Agent failures, analyze what topics are driving low sentiment, or break down cost by model?"

Status thresholds:

Metric	Good	Warning	Critical
Quality Score	>0.7	0.4-0.7	<0.4
Success Rate	>80%	60-80%	<60%
Sentiment	>0.6	0.5-0.6	<0.5
Task Failure Rate	<10%	10-25%	>25%
P90 Latency	<10s	10-30s	>30s

Writing standards:

Lead with the insight, not the data point
Use approximate numbers ("~85%" not "84.7%")
Always state the time window
Every finding must have an action
Keep the full report under 600 words

Examples

Example 1: Routine Health Check

User says: "How are our AI agents doing?"

Actions:

Get context and AI schema
Query analytics overview + time series + recent failures + frustrated users (4 parallel calls)
Identify the agent with the worst quality score and the top error category
Drill into the worst agent's failed sessions for root cause
Present the health report with agent leaderboard and top 3 issues

Example 2: Targeted Agent Check

User says: "How's the Chart Agent performing this week?"

Actions:

Get context, then query analytics with agentNames: ["Chart Agent"]
Query time series for that agent specifically
Pull recent failures and low-quality sessions for that agent
Present a focused report on that single agent's health

Example 3: Cost Investigation

User says: "Our AI costs seem high — what's going on?"

Actions:

Get context, query analytics with metrics: ["cost", "cost_by_model", "agent_stats", "cost_timeseries"]
Identify which agents and models drive the most cost
Query spans grouped by model to see token usage patterns
Pull the most expensive sessions for examples
Present cost-focused report with per-agent and per-model breakdowns

Troubleshooting

No AI session data

The project may not have AI analytics instrumented. Report this clearly and suggest the user check their AI agent SDK integration.

Very few sessions

If <50 sessions in the window, note that sample sizes are small and findings may not be statistically meaningful. Extend the time window if possible.

All metrics look healthy

Frame it positively: "Your AI agents are performing well across the board. Here's the summary and a few minor things to watch." Still surface the lowest-performing areas even if they're above threshold.

monitor-ai-quality

Popularity

Invocation

Context Preview

SKILL.md

monitor-ai-quality

Popularity

Invocation

Context Preview

SKILL.md

AI Agent Quality Monitor

Instructions

Phase 1: Get Context and Schema

Phase 2: Gather the Full Picture

Phase 3: Analyze and Triage

Phase 4: Drill Into Top Issues (Budget: 2-4 calls)

Phase 5: Present the Health Report

Examples

Example 1: Routine Health Check

Example 2: Targeted Agent Check

Example 3: Cost Investigation

Troubleshooting

No AI session data

Very few sessions

All metrics look healthy

Similar Skills

AI Agent Quality Monitor

Instructions

Phase 1: Get Context and Schema

Phase 2: Gather the Full Picture

Phase 3: Analyze and Triage

Phase 4: Drill Into Top Issues (Budget: 2-4 calls)

Phase 5: Present the Health Report

Examples

Example 1: Routine Health Check

Example 2: Targeted Agent Check

Example 3: Cost Investigation

Troubleshooting

No AI session data

Very few sessions

All metrics look healthy

Similar Skills