QA an analysis before sharing with stakeholders — methodology checks, accuracy verification, and bias detection. Use when reviewing an analysis for errors, checking for survivorship bias, validating aggregation logic, or preparing documentation for reproducibility.
Validates data analysis for accuracy, bias, and reproducibility before stakeholder delivery.
```
/plugin marketplace add bauhaus-infau/infau-skill-base
/plugin install data@infau-skills
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.
Run through this checklist before sharing any analysis with stakeholders.

## Common data analysis pitfalls
### Join fan-out

The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
```sql
-- Check row count before and after the join
SELECT COUNT(*) FROM table_a;                                    -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)
```
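The fan-out and the `COUNT(DISTINCT ...)` fix can be reproduced in a throwaway sqlite3 sandbox; a minimal sketch (table and column names here are illustrative, not from any real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE table_b (id INTEGER PRIMARY KEY, a_id INTEGER)")
cur.executemany("INSERT INTO table_a (id) VALUES (?)", [(1,), (2,), (3,)])
# Each table_a row matches multiple table_b rows, so the join fans out
cur.executemany(
    "INSERT INTO table_b (id, a_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)],
)

base_count = cur.execute("SELECT COUNT(*) FROM table_a").fetchone()[0]  # 3
joined_count = cur.execute(
    "SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id"
).fetchone()[0]  # 6 -- inflated by the fan-out
distinct_count = cur.execute(
    "SELECT COUNT(DISTINCT a.id) FROM table_a a JOIN table_b b ON a.id = b.a_id"
).fetchone()[0]  # 3 -- correct entity count
```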
How to prevent:
Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins.

### Survivorship bias

The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
- Averaging customer lifetime value over only accounts that still exist, excluding churned accounts
- Studying what successful projects have in common without examining the failed ones
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
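A toy illustration of the distortion, with invented numbers: averaging lifetime revenue over only the accounts still active nearly doubles the figure versus the full cohort.

```python
# Hypothetical cohort: (lifetime_revenue, still_active)
cohort = [
    (1200, True), (950, True), (800, True),  # survivors
    (150, False), (90, False), (60, False),  # churned -- often missing from "current" tables
]

survivors_avg = (
    sum(r for r, active in cohort if active) / sum(1 for _, a in cohort if a)
)
full_avg = sum(r for r, _ in cohort) / len(cohort)
# survivors_avg ~= 983 vs full_avg ~= 542: the survivor-only view
# overstates average revenue by roughly 80%.
```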
### Incomplete period comparison

The problem: Comparing a partial period to a full period.
Examples:
- Comparing this month's revenue ten days in against last month's full 31 days
- A "declining" weekly trend whose last data point is a partial week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
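The complete-periods filter can be sketched with the standard library alone (the event dates are made up):

```python
from datetime import date

def complete_months(event_dates, today):
    """Keep only events from months that have fully elapsed as of `today`."""
    current_month = (today.year, today.month)
    return [d for d in event_dates if (d.year, d.month) != current_month]

events = [date(2024, 1, 15), date(2024, 2, 10), date(2024, 3, 5), date(2024, 3, 20)]
kept = complete_months(events, today=date(2024, 3, 25))
# March is still in progress on 2024-03-25, so its events are excluded.
```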
### Shifting denominator

The problem: The denominator changes between periods, making rates incomparable.
Examples:
- A conversion rate where the definition of "active user" was changed mid-year
- A churn rate compared across periods that used different eligibility rules for the customer base
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
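A made-up numeric illustration: if the "active user" definition widens between periods, the rate appears to fall even though conversion behavior is unchanged.

```python
# Period 1: denominator = users with at least one session (hypothetical figures)
converters_p1, active_p1 = 50, 1_000
rate_p1 = converters_p1 / active_p1  # 5.0%

# Period 2: same converters, but "active" now includes anyone who logged in,
# doubling the denominator and halving the apparent rate.
converters_p2, active_p2 = 50, 2_000
rate_p2 = converters_p2 / active_p2  # 2.5%
```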
### Average of averages

The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
- Averaging per-region average order values when regions differ greatly in order volume
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
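A worked example with invented group sizes shows how large the error can be:

```python
# Two segments with very different sizes (numbers are invented)
groups = [
    {"n": 100, "avg_order_value": 10.0},  # 100 orders averaging $10
    {"n": 10, "avg_order_value": 50.0},   # 10 orders averaging $50
]

naive = sum(g["avg_order_value"] for g in groups) / len(groups)  # 30.0
weighted = (
    sum(g["n"] * g["avg_order_value"] for g in groups)
    / sum(g["n"] for g in groups)
)
# weighted ~= 13.64: the naive "average of averages" more than
# doubles the true average order value.
```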
### Timezone mismatch

The problem: Different data sources use different timezones, causing misalignment.
Examples:
- Web analytics recorded in local time joined to billing events stored in UTC
- Daily totals that disagree across systems because the "day" boundary falls at different instants
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
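A quick sketch of normalizing to UTC with the standard library's zoneinfo (the timestamps are invented); note that before conversion the two records carry different calendar dates, which is exactly how daily aggregates drift apart:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The same instant recorded by two systems in different timezones
pacific = datetime(2024, 3, 1, 23, 30, tzinfo=ZoneInfo("America/Los_Angeles"))
utc_event = datetime(2024, 3, 2, 7, 30, tzinfo=ZoneInfo("UTC"))

pacific_as_utc = pacific.astimezone(ZoneInfo("UTC"))
# Converted to UTC, they align -- and the Pacific record moves
# from March 1 to March 2.
```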
### Segments defined by the outcome

The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
- Defining "power users" by high engagement, then reporting that power users are highly engaged
- Comparing retention of "healthy" accounts when account health was itself defined using retention
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
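The circularity is easy to demonstrate with made-up numbers: a segment defined by the outcome will always "outperform" on that outcome, by construction.

```python
# Hypothetical per-user event counts for one month
events = [1, 2, 2, 3, 4, 5, 11, 12, 15, 25]

# Circular: "engaged" is DEFINED as >= 10 events, and then we "discover"
# that engaged users have far more events -- true by construction.
engaged = [e for e in events if e >= 10]
engaged_avg = sum(engaged) / len(engaged)  # 15.75
overall_avg = sum(events) / len(events)    # 8.0
```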
## Result sanity checking

For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check |
|---|---|
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
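Several of these checks can be codified as assertions at the end of an analysis script; a minimal sketch, where the metric names and thresholds are placeholders to adapt:

```python
def sanity_check(metrics):
    """Raise AssertionError if key figures fail basic smell tests."""
    assert 0.0 <= metrics["conversion_rate"] <= 1.0, "rate outside [0, 1]"
    assert metrics["user_count"] > 0, "non-positive user count"
    # Segment shares should sum to ~100% (tolerance for rounding)
    assert abs(sum(metrics["segment_pct"]) - 100.0) < 0.5, "segments don't sum to ~100%"
    return True

ok = sanity_check({
    "conversion_rate": 0.042,
    "user_count": 12_500,
    "segment_pct": [61.2, 25.3, 13.5],
})
```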
## Documentation standards

Every non-trivial analysis should include:
```markdown
## Analysis: [Title]

### Question
[The specific question being answered]

### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])

### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]

### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]

### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

### SQL Queries
[All queries used, with comments]

### Caveats
- [Things the reader should know before acting on this]
```
### Code headers

For any code (SQL, Python) that may be reused, include a header like this:
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""