From posthog
Diagnoses bias, anomalies, and strange-looking results in PostHog experiments. Covers empty exposures, sample ratio mismatch, identity fragmentation, and significance traps.
npx claudepluginhub anthropics/claude-plugins-official --plugin posthog
Slash command: /posthog:diagnosing-experiment-results
This skill answers: **My PostHog experiment results look wrong, biased, or empty — what's going on?**
Match the user's complaint in the dispatch table, then read the matching reference file for the diagnostic.
Each diagnostic in the reference files is tagged [HIGH], [MEDIUM], or [LOW] based on how
strongly it's verified — [HIGH] is verified directly in PostHog code, [MEDIUM] is partially or
team-source verified, [LOW] describes SDK/external behavior that wasn't verified here. Treat [LOW]
items as hypotheses to test, not facts to assert.
If the user refers to an experiment by name or description, load the finding-experiments skill first to
resolve it to a concrete ID.
Call experiment-get and pull these fields. They are inputs for almost every diagnostic:
- parameters.feature_flag_variants[].rollout_percentage: the variant split
- parameters.rollout_percentage: the overall rollout (% of users entering the experiment)
- exposure_criteria.multiple_variant_handling: defaults to "exclude" if absent
- exposure_criteria.exposure_event: null means default $feature_flag_called
- exposure_criteria.filterTestAccounts: defaults to true
- feature_flag.active, status (draft / running / paused / stopped), start_date, end_date
- feature_flag.filters.groups[].variant: any non-null value is a forced-variant override on the matched cohort (release-condition assignment, not randomized). Surfaces A7 by default.
- stats_config: Bayesian (default) or Frequentist
Before asking the user clarifying questions, pull the diagnostic snapshot in references/diagnostic-snapshot.md. Most diagnostics in this skill can be confirmed or ruled out from that data without an interview.
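As a rough illustration of the field list above, the sketch below pulls those inputs out of an experiment-get payload and applies the stated defaults. It assumes the payload is a plain nested dict shaped like the dotted paths; the function name, the output keys, and the per-variant `key` field are hypothetical and not confirmed by this skill.

```python
from typing import Any


def summarize_experiment_config(experiment: dict[str, Any]) -> dict[str, Any]:
    """Collect the diagnostic inputs, applying the defaults listed above."""
    parameters = experiment.get("parameters") or {}
    criteria = experiment.get("exposure_criteria") or {}
    flag = experiment.get("feature_flag") or {}
    groups = (flag.get("filters") or {}).get("groups") or []

    filter_test_accounts = criteria.get("filterTestAccounts")

    return {
        # Variant split: share of exposed users per variant ("key" is an assumed field name).
        "variant_split": {
            v.get("key"): v.get("rollout_percentage")
            for v in parameters.get("feature_flag_variants") or []
        },
        # Overall rollout: share of users entering the experiment at all.
        "rollout_percentage": parameters.get("rollout_percentage"),
        # Absent means "exclude".
        "multiple_variant_handling": criteria.get("multiple_variant_handling") or "exclude",
        # Null means the default $feature_flag_called event.
        "exposure_event": criteria.get("exposure_event") or "$feature_flag_called",
        # Absent means true.
        "filter_test_accounts": True if filter_test_accounts is None else filter_test_accounts,
        "flag_active": flag.get("active"),
        "status": experiment.get("status"),
        "start_date": experiment.get("start_date"),
        "end_date": experiment.get("end_date"),
        # Any non-null variant on a release condition is a forced-variant override.
        "forced_variants": [g["variant"] for g in groups if g.get("variant") is not None],
        "stats_config": experiment.get("stats_config"),
    }
```

Reading every field through one helper keeps the defaults in a single place, so a missing key is never mistaken for a deliberate setting.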
| User says... | Diagnostic group |
|---|---|
| "Smaller variant looks biased" / banner says bias | A — bias & skew |
| "Variant ratio doesn't match my split" / SRM warning | A — bias & skew |
| "Why isn't it 50/50?" / "users in both groups" | A — bias & skew |
| "Users in both control and test" / high $multiple % | A — bias & skew |
| Multi-variant exposure on a server-rendered app | A — bias & skew |
| Banner about feature-flag/experiment state mismatch | A — bias & skew |
| "Migrating distinct_id" / "switching from anonymous to user_id" mid-run | A — bias & skew |
| Metric count is much smaller than exposures (e.g. 10× or 100× gap) | A — bias & skew (route here before D) |
| "Experiment shows 0 / not enough data" / empty | B — empty experiment |
| "Variant always undefined / false" | B — empty experiment |
| "$feature_flag_called fires but no exposures show up" | B — empty experiment |
| "Experiment says running but exposures haven't moved in weeks/months" | B — empty experiment |
| "Significance keeps flipping as we run longer" | C — interpretation traps |
| "Significance was declared, then it wasn't significant anymore" | C — interpretation traps |
| "30/16 split at 46 exposures, is this broken?" | C — interpretation traps |
| "A/A test is showing significant results" | C — interpretation traps |
| "Many metrics — some significant, some not" | C — interpretation traps |
| "Bayesian says 96% chance to win — should we ship?" | C — interpretation traps |
| "Confidence intervals overlap — does that mean not significant?" | C — interpretation traps |
| "An external tool (significance calculator or AI agent) disagrees with PostHog" | C — interpretation traps |
| "Should I ship? Primary is up but a secondary is down" | C — interpretation traps |
| "PostHog numbers ≠ my SQL count" | D — numbers vs SQL |
| "Funnel says X% but my raw event count says Y" | D — numbers vs SQL |
| "Sum of revenue looks wrong" / "breakdown shows 'none'" | D — numbers vs SQL |
| "Recordings panel doesn't match the stats" | D — numbers vs SQL |
| "I applied a filter but the user count didn't change" | D — numbers vs SQL |
| "I want to slice results by current person properties (as of now, not as of exposure)" | D — numbers vs SQL |
| "Changed split / rollout / metric / criteria mid-run, now odd" | E — mid-run changes |
| "Ended/shipped — flag now flipped to 0/100 unexpectedly" | E — mid-run changes |
| "Long-term metric moves opposite from primary" | E — mid-run changes |
| "Retention metric counts users I didn't expect" | E — mid-run changes |
| "Can't convert the feature flag back to a simple (boolean) flag after the experiment ends" | E — mid-run changes |
| "How do I restart an experiment with new variants?" | E — mid-run changes |
| Metric line is rendered but the result block is empty / no chance-to-win or significance | E — mid-run changes (E13 legacy methodology) |
If the symptom is unclear, ask one clarifying question before picking. Most diagnostics have different fixes — do not guess.
After matching the symptom in Step 2 and reading the relevant reference file(s), list each diagnostic that applies before recommending an action.
Surface co-occurring mechanisms independently — even when one is more salient, don't collapse them into a single "wait" or "fix" recommendation. Different mechanisms have different fixes: a systematic bias (e.g. uneven-split + Exclude) doesn't resolve by waiting; a statistical pattern (e.g. small-sample variance) does. Bundling them leaves the bias in place after the user follows the bundled advice.
Only list mechanisms that have a path to verification in the project state — config (from
experiment-get), snapshot data, activity log, or repo source. Config-derived mechanisms count: an
80/20 split with default multiple_variant_handling="exclude" is visible in experiment-get and is
therefore enumerable. Naming a mechanism with no source (e.g. SRM when the snapshot shows a clean
variant ratio) is not.
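The rule above (only name mechanisms that the project state can verify) can be sketched for the config-derived case. This is a minimal illustration, not the skill's implementation: the function name and the returned labels are hypothetical, and it covers only the two config signals mentioned in this document.

```python
from __future__ import annotations

from typing import Optional


def config_derived_mechanisms(
    variant_split: dict[str, float],
    multiple_variant_handling: Optional[str],
    forced_variants: list[str],
) -> list[str]:
    """List mechanisms that are visible in the experiment config alone."""
    mechanisms = []
    handling = multiple_variant_handling or "exclude"  # absent means "exclude"

    # An uneven split combined with Exclude is enumerable from config.
    is_uneven = len(set(variant_split.values())) > 1
    if is_uneven and handling == "exclude":
        mechanisms.append("uneven split + Exclude")

    # A non-null variant on a release condition is a forced-variant override.
    if forced_variants:
        mechanisms.append("forced-variant override")

    return mechanisms


# An 80/20 split with the default handling yields one enumerable mechanism.
print(config_derived_mechanisms({"control": 80, "test": 20}, None, []))
```

Mechanisms that need snapshot data, such as SRM, are deliberately left out here: they are only enumerable once the snapshot shows evidence for them.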
Group A (bias & skew): Variants don't look balanced, one variant looks biased, the in-app warning banner appeared, or users are
showing up under multiple variants. Covers the uneven-split + Exclude interaction, SRM, identity
fragmentation, bootstrap × /decide mismatch, and flag/experiment state inconsistency.
→ See references/bias-and-skew.md
Group B (empty experiment): A frequent pain point. Covers SDK call (wrong evaluation method, identify() timing, dedup),
exposure capture (custom event missing variant property, required properties, ad-blockers), and
exposure-criteria match (test-account filter, eligibility ordering, events firing before exposure).
→ See references/empty-experiment.md
Group C (interpretation traps): Significance flipping, A/A test showing significance, Bayesian vs Frequentist confusion, multiple comparisons, low-volume variance, peeking / early stopping. Includes the legacy stats issue (A/A tests historically over-fired before the new Bayesian module) and how the win-probability methodology changed in Jan 2025 (single test vs control, not control vs all variants).
→ See references/interpretation.md
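For the low-volume case in the dispatch table ("30/16 split at 46 exposures"), a quick goodness-of-fit check shows how little evidence such a split carries. The sketch below is a standard two-variant chi-square test using only the Python standard library; it is not PostHog's own SRM check, and the function name is hypothetical.

```python
import math


def two_variant_split_check(observed_a: int, observed_b: int, expected_share_a: float = 0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-variant split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a

    chi2 = (
        (observed_a - expected_a) ** 2 / expected_a
        + (observed_b - expected_b) ** 2 / expected_b
    )
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = two_variant_split_check(30, 16)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")
```

For 30 vs 16 under an intended 50/50 split, the p-value comes out at roughly 0.04. SRM alerts commonly use a much stricter cutoff (0.001 is a frequent convention, assumed here for illustration rather than taken from PostHog), so at 46 exposures this is more consistent with small-sample variance than with a broken split.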
Group D (numbers vs SQL): The experiment page applies an exposure scope, $multiple exclusion, test-account filter, and date range
that ad-hoc SQL almost never replicates. Covers funnel attribution (only first→last step counts for stats),
breakdowns (read from the exposure event, not the metric event), the "sum of revenue" mean-of-per-user
confusion, and the recordings-panel-vs-stats divergence.
→ See references/numbers-vs-sql.md
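To make the gap between a raw count and the experiment page concrete, the sketch below applies the four scoping rules named above to a toy event list. The data shapes, field names, and the reading of "exposure scope" as "exposed users only, events after first exposure" are assumptions for illustration, not PostHog's query logic.

```python
from datetime import datetime


def scoped_metric_count(events, exposures, test_accounts, start, end):
    """Count metric events after applying the experiment-style scoping rules."""
    count = 0
    for event in events:
        user = event["distinct_id"]
        exposure = exposures.get(user)
        if exposure is None:
            continue  # exposure scope: user never entered the experiment
        if len(exposure["variants"]) > 1:
            continue  # $multiple exclusion: user was seen in more than one variant
        if user in test_accounts:
            continue  # test-account filter
        if not (start <= event["timestamp"] <= end):
            continue  # outside the experiment date range
        if event["timestamp"] < exposure["first_seen"]:
            continue  # event fired before the user's first exposure
        count += 1
    return count


start, end = datetime(2025, 1, 1), datetime(2025, 1, 31)
exposures = {
    "u1": {"variants": {"control"}, "first_seen": datetime(2025, 1, 5)},
    "u2": {"variants": {"control", "test"}, "first_seen": datetime(2025, 1, 6)},
    "qa": {"variants": {"test"}, "first_seen": datetime(2025, 1, 2)},
}
events = [
    {"distinct_id": "u1", "timestamp": datetime(2025, 1, 10)},  # kept
    {"distinct_id": "u1", "timestamp": datetime(2025, 1, 3)},   # before exposure
    {"distinct_id": "u2", "timestamp": datetime(2025, 1, 10)},  # multiple variants
    {"distinct_id": "qa", "timestamp": datetime(2025, 1, 10)},  # test account
    {"distinct_id": "u3", "timestamp": datetime(2025, 1, 10)},  # never exposed
]

raw_count = len(events)
scoped_count = scoped_metric_count(events, exposures, {"qa"}, start, end)
print(raw_count, scoped_count)
```

Each dropped event corresponds to one filter, so comparing a user's SQL against this checklist usually identifies which scope the ad-hoc query is missing.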
Group E (mid-run changes): Increasing rollout is safe; decreasing calls for caution; changing the variant split is an anti-pattern; adding metrics mid-run is p-hacking; ship-variant can rewrite the flag in surprising ways; reset clears results, not the flag. Also covers retention-metric quirks (first-event-must-be-after-exposure design), "matured users" filtering, and long-term vs short-term metric divergence.
→ See references/mid-run-changes.md
Surface diagnostics first (Step 3). Then recommend — but scope what you recommend to what the experiment's current state permits.
See configuring-experiment-rollout and its reference file
references/changing-distribution-after-launch.md for the mid-run rules.
On a stopped or archived experiment, don't preemptively offer reversal of a state mutation (ship-variant flag rewrite, manual flag edit, reset, archive). If the user asks "why did X happen?", explain X; don't append a "here's how to undo it" coda. That pattern assumes intent the user didn't signal. Conditional offers like "if this wasn't intended, you could…" or "want me to revert it?" count as preemptive too. Only the user explicitly naming the reversal action ("how do I undo this?", "can I roll back ship-variant?", "how do I get the 50/50 split back?") is a request to surface reversal mechanics.
Use consistent terminology: variant split (between variants) is distinct from rollout (overall %
entering); the $feature_flag_called exposure event is distinct from a custom exposure event; the
Exclude / First seen options control multivariate handling, not exposure.
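The split-versus-rollout distinction is easiest to see with arithmetic: the share of all users landing in a variant is the overall rollout multiplied by that variant's split. The helper below is a hypothetical sketch of that calculation and follows only from the definitions given above.

```python
def share_of_all_users(rollout_percentage: float, variant_split: dict) -> dict:
    """Percent of ALL users in each variant: overall rollout times variant split."""
    return {
        variant: rollout_percentage * split_percentage / 100
        for variant, split_percentage in variant_split.items()
    }


# 50% rollout with an 80/20 split: 40% of all users see control, 10% see test,
# and the remaining 50% never enter the experiment.
shares = share_of_all_users(50, {"control": 80, "test": 20})
print(shares)
```

Changing the rollout scales every variant's share equally, while changing the split shifts users between variants, which is why the two are treated differently mid-run.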