From autoresearch
Computes MAD-based confidence scores for experiment results to determine if improvements exceed noise. Use after 3+ positive metric data points.
npx claudepluginhub pbdeuchler/llm-plugins --plugin autoresearch

This skill uses the workspace's default tool permissions.
Determines whether an observed improvement is real or within measurement noise using Median Absolute Deviation (MAD).
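If there are fewer than 3 positive data points, report `confidence: null`.

Given all metric values in the current segment (positive values only):

1. Compute the median of the values.
2. For each value, compute `|value - median|`. Sort those absolute deviations and take their median; this is the MAD.
3. Find the best `keep`-status metric value (respecting optimization direction).
4. Compute the delta: `|best_kept - baseline|`.
5. Compute the confidence: `delta / MAD`.

Edge cases:

- If the MAD is 0, return `null`: there is no measurable noise to compare against.
- If no `keep` results exist yet: return `null`.
- If the best kept value does not improve on the baseline, return `null`: there is no improvement to score.

The confidence score is a multiple of the session's noise floor: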
| Score | Meaning | Action |
|---|---|---|
| ≥ 2.0× | Improvement likely real | Safe to trust |
| 1.0×–2.0× | Marginal — could be noise | Consider re-running to confirm |
| < 1.0× | Within noise floor | Treat as no improvement |
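The computation and the interpretation table above can be sketched in Python as follows. This is a minimal illustration, not the plugin's actual implementation: the names `mad_confidence`, `interpret`, and `lower_is_better` are hypothetical, the baseline is passed in explicitly, and `statistics.median` is assumed for both medians (for an even number of values it averages the two middle ones, which the skill text does not specify). "No improvement" is read here as the best kept value not being strictly better than the baseline.

```python
from statistics import median
from typing import Optional, Sequence


def mad_confidence(
    values: Sequence[float],
    kept: Sequence[float],
    baseline: float,
    lower_is_better: bool = True,  # hypothetical flag for optimization direction
) -> Optional[float]:
    """Return delta / MAD, or None when the score is undefined."""
    positive = [v for v in values if v > 0]  # positive values only
    if len(positive) < 3 or not kept:
        return None  # too few data points, or no keep results yet
    center = median(positive)
    mad = median(abs(v - center) for v in positive)
    if mad == 0:
        return None  # no measurable noise to compare against
    best_kept = min(kept) if lower_is_better else max(kept)
    improved = best_kept < baseline if lower_is_better else best_kept > baseline
    if not improved:
        return None  # no improvement to score
    return abs(best_kept - baseline) / mad


def interpret(score: Optional[float]) -> str:
    """Map a score to the tiers in the table above."""
    if score is None:
        return "no score"
    if score >= 2.0:
        return "likely real"
    if score >= 1.0:
        return "marginal"
    return "within noise"


if __name__ == "__main__":
    runs = [15200, 15400, 14800, 15100, 14600]
    score = mad_confidence(runs, kept=[14600], baseline=15200)
    print(score, interpret(score))  # 2.0 likely real
```

Returning `None` rather than `0.0` for the undefined cases keeps "no score" distinguishable from "within noise" when the value is later serialized.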
When logging an experiment result to `autoresearch.jsonl`:

- If the score is `null`, still write `"confidence": null` in the JSONL record.
- When deciding `keep` vs `discard`: the confidence score is advisory. It never auto-discards. But flag improvements below 1.0× in your ASI notes as "within noise — may not be real."

Example:

Runs: [15200, 15400, 14800, 15100, 14600]
Median: 15100
Deviations: [100, 300, 300, 0, 500] → sorted: [0, 100, 300, 300, 500]
MAD: 300
Baseline: 15200 (first run)
Best kept: 14600
Delta: |14600 - 15200| = 600
Confidence: 600 / 300 = 2.0× ← improvement is likely real
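As a rough sketch of the logging step, the snippet below appends one record per result to a JSONL file and stores a missing score as JSON `null`. Only the `"confidence"` key and the `autoresearch.jsonl` file name come from the text above; the function name `append_result` and the `metric` and `status` field names are assumptions made for illustration, and the real record schema may differ. The status is supplied by the caller and never derived from the score, reflecting that the score is advisory.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def append_result(
    path: Path,
    metric: float,
    status: str,
    confidence: Optional[float],
) -> str:
    """Append one JSON line and return it; None is written as null."""
    record = {
        "metric": metric,          # hypothetical field name
        "status": status,          # hypothetical field name; chosen by the caller
        "confidence": confidence,  # None serializes to JSON null
    }
    line = json.dumps(record)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


if __name__ == "__main__":
    # Write to a temporary directory so the sketch leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "autoresearch.jsonl"
        print(append_result(log, 14600, "keep", 2.0))
        print(append_result(log, 14600, "keep", None))
```

Appending one line per result keeps earlier records untouched, so a score can be recomputed later from the full history of the segment.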