Monitors active and recently completed experiments across Amplitude projects, triages by importance, analyzes, and reports on the most impactful ones.
From analytics-skillsnpx claudepluginhub amplitude/builder-skills --plugin analytics-skillsThis skill uses the workspace's default tool permissions.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Enforces A/B test setup with gates for hypothesis locking, metrics definition, sample size calculation, assumptions checks, and execution readiness before implementation.
Designs, audits, and improves analytics tracking systems using Signal Quality Index for reliable, decision-ready data in marketing, product, and growth.
Scan active and recently completed experiments, surface what needs attention, and report on the ones that matter.
This is a monitoring skill — keep output accessible to non-experts. Avoid statistical jargon (no p-values, no power analysis). For deep-dive analysis of a specific experiment, use the analyze-experiments skill instead.
get_experiments: 3-5 IDs max per call. Filter using search results BEFORE fetching.query_experiment responses are large. Extract only summary objects and validity flags. Ignore timeseries, xValues, bulk arrays.search does NOT match metric IDs in queries. Search with entityTypes: ["METRIC"], empty queries, limitPerQuery: 50, scoped to project. Match IDs from results.The report has two parts:
Do NOT duplicate information between the summary table and the details. The summary table is the single source of truth for the portfolio state. Details expand on specific experiments.
Amplitude:get_context. If multiple projects, ask which to monitor.Amplitude:search({
entityTypes: ["EXPERIMENT"],
appIds: [projectId],
queries: [],
sortOrder: "lastModified",
sortDirection: "DESC",
limitPerQuery: 50
})
Amplitude:get_experiments in batches of 3-5 IDs (only the filtered set from Step 1).Amplitude:search({ entityTypes: ["METRIC"], appIds: [projectId], queries: [], limitPerQuery: 50 }). Build { id: name } mapping.Amplitude:query_experiment({ id: "<id>" }) (no metricIds) to get primary metric results and data quality flags. This data feeds both the summary table and the deep-dives.CRITICAL: Never show raw metric IDs to the user. If a metric name can't be resolved, describe it by its role (e.g., "primary metric", "guardrail metric") — do NOT display strings like "rrgtky08" or "Metric gvgb5efj". Metric IDs are internal identifiers that mean nothing to a human reader.
This is the top of the report. It should be self-contained — someone reading only this section gets the full picture.
## Experiment Monitor: [Project Name]
Date: [Today] | Project: [Name] ([ID])
| Experiment | State | Duration | Lift | Verdict |
|------------|-------|----------|------|---------|
Column definitions:
Verdict vocabulary:
| Verdict | When to use |
|---|---|
| Ship | Significant positive primary, guardrails clean, data quality good |
| Iterate | Positive signal but guardrail regression or quality concern |
| Monitor | Running, not yet significant, nothing to act on |
| Abandon | Significant negative primary, or critical data quality issues |
| Decided: Ship | Recently completed, team decided to ship |
| Decided: Don't ship | Recently completed, team decided not to ship |
| Fix config | Query error, broken exposure event, or severely imbalanced traffic |
| Needs metrics | Running without metrics — can't evaluate |
| N/A | Survey/nudge deployment, not a feature experiment |
**Act now:**
- [Experiment] — [specific action with owner name]
**Keep watching:**
- [Experiment] — [brief status + when to check back]
**Needs setup:**
- [Experiment] — [what's missing + owner name]
Every experiment from the table should appear in exactly one action category. If there are no experiments in a category, omit that category.
Below the summary, provide additional detail organized into sections. Each experiment appears in exactly ONE section — no duplication.
Deep-dive on experiments with Ship, Iterate, or Abandon verdicts (up to 3). These have significant results worth examining.
For each, output the full report in a single pass:
---
## [Experiment Name]
[State] | [Duration]d | Control: [N] / Treatment: [N] | [Link to experiment]
Data Quality (first — before results):
Check data quality BEFORE reporting metric results. If there are critical issues, the reader needs that context before interpreting any numbers.
If everything passes, one line: "Data quality checks all pass." Then proceed to primary metric.
If issues found, report before the primary metric:
### Data Quality: [N issues found]
SRM (Sample Ratio Mismatch):
srmDetected field from the API responsesrmDetected: true: always report first and prominentlyTraffic allocation changes:
Sample size:
Validity flags — only report failures:
| Flag | What it means when it fails | Severity |
|---|---|---|
isVariancePositive = false | Metric data is invalid — statistical tests can't run | Critical |
isConfidenceIntervalNotFlipped = false | Calculation error in results — don't trust the numbers | Critical |
isMeanValid = false | Metric values are broken (NaN/infinite) — can't analyze | Critical |
statsAssumptionsMetForWholeExperiment = false | Statistical assumptions aren't met — results may be unreliable | High |
hasSuspiciousUplift = true | Unusually large effect — may be a measurement error, not a real change | High |
isPointEstimateInsideConfidenceInterval = false | Internal math inconsistency — results may be wrong | High |
isStandardErrorLargeEnough = false | Too much noise to get reliable estimates | Medium |
If SRM is detected or a Critical flag fails:
⚠️ Results below should be interpreted with caution due to [issue].
Primary Metric:
### [Metric Name]: [Plain-language verdict]
| Variant | Value | Lift | 95% CI |
|---------|-------|------|--------|
| Control | [X] | — | — |
| Treatment | [Y] | [+Z%] | [A% to B%] |
[One sentence plain-language interpretation.]
Framing rules:
Verdict words:
Add one sentence on practical significance only if lift is very small (<2%) or very large (>20%).
Guardrails & Secondary Metrics:
Call Amplitude:query_experiment({ id: "<id>", metricIds: [...] }).
Focus on catching regressions, not narrating every metric.
If no significant regressions: "All guardrails and secondary metrics are clean — no significant regressions detected."
If significant regressions found:
- **[Metric Name]:** Significant regression ([−X%], CI: [A% to B%]) — [one sentence on impact]
If large directional regressions (>20% relative lift) that aren't yet significant:
- **[Metric Name]:** Not significant, but directionally negative ([−X%]) — worth watching
Small directional changes that aren't significant should be omitted. If 5+ metrics, note that multiple comparisons increase the chance of false positives.
Feedback (only if primary is significant):
Call Amplitude:get_feedback_insights with experiment-related keywords. Report 2-3 themes in one line each. If no relevant feedback, skip the section.
Verdict:
### Verdict: [SHIP / ITERATE / MONITOR / ABANDON]
1. [Primary metric finding — one sentence]
2. [Quality / guardrail status — one sentence]
3. [What to do next — one sentence]
Decision matrix (internal reference, don't output):
| Signal | Ship | Iterate | Monitor | Abandon |
|---|---|---|---|---|
| Primary | Significant positive + practical | Significant positive but guardrail regression | Not yet significant, still running | Significant negative |
| Quality | All pass | Minor flags | All pass or insufficient data | SRM or critical flags |
| Guardrails | Clean | 1 regression needs fixing | Clean or not enough data | Significant regression on critical metric |
| Experiment state | Completed | Completed | Running | Completed or running |
For experiments with Monitor verdict, provide a one-line status each:
---
### Running Experiments
**[Name]** — Running [X]d, Control: [N] / Treatment: [N]. [Primary metric status — e.g., "trending positive (+3.3%) but not yet significant"]. Data quality clean. [Link]
**[Name]** — Running [X]d, Control: [N] / Treatment: [N]. Too early for conclusions — [reason, e.g., "sample sizes are 100-1,000 range"]. [Link]
Keep these tight — one line per experiment. Include the link at the end for anyone who wants to check in Amplitude.
For experiments with Decided: Ship or Decided: Don't ship verdict:
---
### Recently Decided
- **[Name]** — [Shipped/Rolled back] on [date] after [X] days. [One sentence on outcome or rationale]. [Link]
If the experiment was shipped without metrics or without statistical significance, note that briefly.
For experiments with Fix config, Needs metrics, or N/A verdict:
---
### Needs Setup
- **[Name]** — [Issue description]. Owner: [owner].
- **[Name]** — Survey/nudge deployment, not a feature experiment. No analysis needed.
Each experiment appears in exactly one detail section. Do NOT list an experiment in both the summary table actions AND a detail section with the same information — the detail section expands on the action item, it doesn't repeat it.
summary + validity flags onlyquery_experiment: Report the error clearly with the experiment owner's name. Flag as "Fix config" in the summary. Common causes: broken exposure event property, misconfigured experiment key.