QA data analyses for methodology, accuracy, biases, and pitfalls before stakeholder sharing. Spot-checks calculations, SQL results, visualizations, and conclusions.
From data: `npx claudepluginhub cy-wali/knowledge --plugin data`

This skill uses the workspace's default tool permissions.
If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Review an analysis for accuracy, methodology, and potential biases before sharing it with stakeholders. The review generates a confidence assessment and improvement suggestions.
/validate-data <analysis to review>
The analysis can be:
Examine:
Work through the checklist below -- data quality, calculation, reasonableness, and presentation checks.
Systematically review against the detailed pitfall catalog below (join explosion, survivorship bias, incomplete period comparison, denominator shifting, average of averages, timezone mismatches, selection bias).
Where possible, spot-check:
Apply the result sanity-checking techniques below (magnitude checks, cross-validation, red-flag detection).
If the analysis includes charts:
Review whether:
Provide specific, actionable suggestions:
Rate the analysis on a 3-level scale:
Ready to share -- Analysis is methodologically sound, calculations verified, caveats noted. Minor suggestions for improvement but nothing blocking.
Share with noted caveats -- Analysis is largely correct but has specific limitations or assumptions that must be communicated to stakeholders. List the required caveats.
Needs revision -- Found specific errors, methodological issues, or missing analyses that should be addressed before sharing. List the required changes with priority order.
## Validation Report
### Overall Assessment: [Ready to share | Share with noted caveats | Needs revision]
### Methodology Review
[Findings about approach, data selection, definitions]
### Issues Found
1. [Severity: High/Medium/Low] [Issue description and impact]
2. ...
### Calculation Spot-Checks
- [Metric]: [Verified / Discrepancy found]
- ...
### Visualization Review
[Any issues with charts or visual presentation]
### Suggested Improvements
1. [Improvement and why it matters]
2. ...
### Required Caveats for Stakeholders
- [Caveat that must be communicated]
- ...
Run through this checklist before sharing any analysis with stakeholders.
### Join explosion

The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
```sql
-- Check row count before and after the join
SELECT COUNT(*) FROM table_a;                                   -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)
```
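The inflation is easy to reproduce end to end. A minimal sketch using Python's built-in sqlite3 module (table names and row counts are invented for illustration):

```python
import sqlite3

# In-memory database with a 1:N relationship: each order has many line items.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO items VALUES (1, 'a'), (1, 'b'), (1, 'c'), (2, 'd');
""")

# COUNT(*) through the join inflates the order count: 2 orders become 4 rows.
inflated = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN items i ON o.id = i.order_id"
).fetchone()[0]

# COUNT(DISTINCT o.id) counts each order once, regardless of join fan-out.
correct = conn.execute(
    "SELECT COUNT(DISTINCT o.id) FROM orders o JOIN items i ON o.id = i.order_id"
).fetchone()[0]

print(inflated, correct)  # 4 2
```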
How to prevent:
Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins.

### Survivorship bias

The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
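A toy numeric illustration (spend figures invented): if churned customers were the low spenders, averaging only over today's surviving customers overstates the metric.

```python
# All customers who ever signed up, with their monthly spend. Churned
# customers tend to be the low spenders -- if we only query the customers
# table as it exists today, they vanish from the average.
all_customers = [
    {"spend": 100, "churned": False},
    {"spend": 90,  "churned": False},
    {"spend": 20,  "churned": True},
    {"spend": 10,  "churned": True},
]

survivors = [c["spend"] for c in all_customers if not c["churned"]]
everyone = [c["spend"] for c in all_customers]

avg_survivors = sum(survivors) / len(survivors)  # 95.0 -- looks great
avg_everyone = sum(everyone) / len(everyone)     # 55.0 -- the honest number
```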
### Incomplete period comparison

The problem: Comparing a partial period to a full period.
Examples:
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
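One way to enforce the complete-period rule in plain Python (the dates are illustrative):

```python
from datetime import date
import calendar

def is_complete_month(year: int, month: int, as_of: date) -> bool:
    """A month is complete only if its last day is strictly before `as_of`."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    return last_day < as_of

# On Nov 15, October is complete but November is not -- so a
# November-vs-October comparison would pit a half month against a full one.
today = date(2024, 11, 15)
print(is_complete_month(2024, 10, today))  # True
print(is_complete_month(2024, 11, today))  # False
```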
### Denominator shifting

The problem: The denominator changes between periods, making rates incomparable.
Examples:
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
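A small worked example of the effect (figures invented): the raw conversions never change, but redefining the denominator between periods manufactures apparent growth.

```python
# Q1: conversion rate computed over all visitors.
q1_conversions, q1_visitors = 50, 1000
q1_rate = q1_conversions / q1_visitors  # 0.05 -> 5%

# Q2: someone redefined the denominator to "engaged visitors" only.
q2_conversions, q2_engaged = 50, 500
q2_rate = q2_conversions / q2_engaged   # 0.10 -> 10%, but not real growth

# Identical raw behavior; the rate "doubled" purely from the definition change.
print(q1_rate, q2_rate)
```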
### Average of averages

The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
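A worked example of how far off the naive approach can be (group sizes and averages invented):

```python
# Two groups of very different sizes, each with a pre-computed average.
groups = [
    {"n": 1000, "avg": 10.0},   # large group, low average
    {"n": 10,   "avg": 100.0},  # small group, high average
]

# Wrong: naive average of the pre-computed averages.
naive = sum(g["avg"] for g in groups) / len(groups)  # 55.0

# Right: weight by group size (equivalent to aggregating from raw data).
weighted = (sum(g["n"] * g["avg"] for g in groups)
            / sum(g["n"] for g in groups))           # ~10.89

print(naive, weighted)
```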
### Timezone mismatches

The problem: Different data sources use different timezones, causing misalignment.
Examples:
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
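A minimal sketch with Python's standard datetime module (a fixed UTC-5 offset is used for illustration; real analyses should use proper IANA timezones):

```python
from datetime import datetime, timezone, timedelta

EASTERN = timezone(timedelta(hours=-5))  # fixed offset, illustration only

# The same instant recorded by two systems in different timezones.
app_event = datetime(2024, 3, 1, 23, 30, tzinfo=EASTERN)
warehouse_event = datetime(2024, 3, 2, 4, 30, tzinfo=timezone.utc)

# A naive date comparison says these fall on different days...
print(app_event.date() != warehouse_event.date())  # True

# ...but normalized to UTC they are the same instant.
print(app_event.astimezone(timezone.utc) == warehouse_event)  # True
```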
### Selection bias

The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
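A toy illustration of the circularity (session counts invented): if "engaged" is defined by activity in the same window as the retention measurement, engaged users retain by construction.

```python
users = [
    {"id": 1, "sessions": 50, "retained": True},
    {"id": 2, "sessions": 40, "retained": True},
    {"id": 3, "sessions": 2,  "retained": False},
    {"id": 4, "sessions": 1,  "retained": False},
]

# Circular: "engaged" is defined by session counts measured over the same
# window as retention -- of course engaged users look like they retain better.
engaged = [u for u in users if u["sessions"] >= 10]
retention_engaged = sum(u["retained"] for u in engaged) / len(engaged)
print(retention_engaged)  # 1.0 -- an artifact of the segment definition

# Better: segment on a pre-treatment trait (e.g. sessions in week 1 only),
# then measure retention in a strictly later window.
```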
For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check |
|---|---|
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
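Several of these checks are mechanical enough to automate as a pre-share gate. A sketch with placeholder metric names and arbitrary thresholds:

```python
def sanity_check(metrics: dict) -> list[str]:
    """Return a list of red flags found in a metrics dict.
    Keys and thresholds are illustrative; adapt to your own metrics."""
    flags = []
    rate = metrics.get("conversion_rate")
    if rate is not None and not (0.0 <= rate <= 1.0):
        flags.append(f"conversion_rate outside [0, 1]: {rate}")
    growth = metrics.get("mom_growth")
    if growth is not None and growth > 0.5:
        flags.append(f"MoM growth over 50% -- verify it's not a data issue: {growth}")
    shares = metrics.get("segment_shares")
    if shares and abs(sum(shares) - 1.0) > 0.01:
        flags.append(f"segment shares sum to {sum(shares):.3f}, not ~1.0")
    return flags

print(sanity_check({"conversion_rate": 1.2, "segment_shares": [0.5, 0.4]}))
```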
Every non-trivial analysis should include:
## Analysis: [Title]
### Question
[The specific question being answered]
### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])
### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]
### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]
### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]
### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]
### SQL Queries
[All queries used, with comments]
### Caveats
- [Things the reader should know before acting on this]
For any code (SQL, Python) that may be reused:
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""
/validate-data Review this quarterly revenue analysis before I send it to the exec team: [analysis]
/validate-data Check my churn analysis -- I'm comparing Q4 churn rates to Q3 but Q4 has a shorter measurement window
/validate-data Here's a SQL query and its results for our conversion funnel. Does the logic look right? [query + results]