Systematic approach to understanding a dataset's structure, quality, and analytical potential before diving into analysis.
Start every new dataset encounter by answering these questions:
Assign every column to a type category and collect the statistics appropriate to it:
**Numeric columns:**
- Minimum, maximum, mean, median
- Standard deviation
- Percentile ladder: p1, p5, p25, p75, p95, p99
- Count of zeros
- Count of negatives (flag if unexpected)
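As a minimal sketch, the numeric checks above can be collected in a single scan. The `orders` table and `amount` column are hypothetical placeholders; substitute the table and column under profile.

```sql
-- Hypothetical example: profile one numeric column in a single pass (PostgreSQL)
SELECT
  MIN(amount)                                          AS min_value,
  MAX(amount)                                          AS max_value,
  AVG(amount)                                          AS mean_value,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)  AS median_value,
  STDDEV(amount)                                       AS std_dev,
  PERCENTILE_CONT(ARRAY[0.01, 0.05, 0.25, 0.75, 0.95, 0.99])
    WITHIN GROUP (ORDER BY amount)                     AS percentile_ladder,
  COUNT(*) FILTER (WHERE amount = 0)                   AS zero_count,
  COUNT(*) FILTER (WHERE amount < 0)                   AS negative_count
FROM orders;  -- placeholder table name
```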
**String columns:**
- Shortest length, longest length, average length
- Count of empty strings
- Format regularity (do values follow a consistent pattern?)
- Case consistency (uniform upper, uniform lower, or mixed?)
- Count of values with leading or trailing whitespace
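A sketch of the mechanical string checks, assuming a hypothetical `customers.email` column; format regularity usually needs a domain-specific pattern and is not shown here.

```sql
-- Hypothetical example: profile one string column (PostgreSQL)
SELECT
  MIN(LENGTH(email))                             AS shortest_length,
  MAX(LENGTH(email))                             AS longest_length,
  ROUND(AVG(LENGTH(email)), 1)                   AS average_length,
  COUNT(*) FILTER (WHERE email = '')             AS empty_strings,
  COUNT(*) FILTER (WHERE email <> TRIM(email))   AS untrimmed_values,
  COUNT(*) FILTER (WHERE email <> LOWER(email)
                     AND email <> UPPER(email))  AS mixed_case_values
FROM customers;  -- placeholder table name
```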
**Date and timestamp columns:**
- Earliest date, latest date
- Count of nulls
- Count of future dates (flag if the domain forbids them)
- Distribution across months or weeks
- Gaps in expected daily/weekly cadence
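A sketch of the temporal checks, assuming a hypothetical `events.created_at` timestamp column:

```sql
-- Hypothetical example: profile one timestamp column (PostgreSQL)
SELECT
  MIN(created_at)                             AS earliest,
  MAX(created_at)                             AS latest,
  COUNT(*) FILTER (WHERE created_at IS NULL)  AS null_count,
  COUNT(*) FILTER (WHERE created_at > NOW())  AS future_dates
FROM events;

-- Monthly distribution to spot thin or missing periods
SELECT DATE_TRUNC('month', created_at) AS month, COUNT(*) AS row_count
FROM events
GROUP BY 1
ORDER BY 1;
```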
**Boolean columns:**
- True count, false count, null count
- True proportion
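A sketch for boolean columns, assuming a hypothetical `users.is_active` flag; the proportion excludes nulls.

```sql
-- Hypothetical example: profile one boolean column (PostgreSQL)
SELECT
  COUNT(*) FILTER (WHERE is_active)          AS true_count,
  COUNT(*) FILTER (WHERE NOT is_active)      AS false_count,
  COUNT(*) FILTER (WHERE is_active IS NULL)  AS null_count,
  ROUND(AVG(is_active::int), 4)              AS true_proportion  -- nulls excluded by AVG
FROM users;  -- placeholder table name
```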
After examining columns individually, look for connections:
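One connection worth verifying early is whether a candidate foreign key actually resolves. This sketch assumes the hypothetical `events.user_id` → `users.id` relationship used in the template below.

```sql
-- Rows whose candidate foreign key has no match in the referenced table (hypothetical)
SELECT COUNT(*) AS orphaned_rows
FROM events e
LEFT JOIN users u ON u.id = e.user_id
WHERE e.user_id IS NOT NULL  -- ignore rows where the key is simply absent
  AND u.id IS NULL;          -- keep only keys that fail to resolve
```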
Assign each column a tier:
Scan for:
Patterns that suggest the data may be unreliable:
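One such pattern is a supposedly unique key that repeats. A sketch, assuming a hypothetical `orders.order_id` column expected to be unique:

```sql
-- Values that occur more than once in a column expected to be unique (hypothetical)
SELECT order_id, COUNT(*) AS occurrences
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;
```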
When profiling a numeric column, classify its shape:
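A coarse bucketed histogram is usually enough to see whether the distribution is roughly symmetric, skewed, or multimodal. The bucket bounds below are placeholders; in practice take them from the observed minimum and maximum.

```sql
-- Coarse 20-bucket histogram to eyeball distribution shape (PostgreSQL, hypothetical column)
SELECT
  WIDTH_BUCKET(amount, 0, 1000, 20) AS bucket,  -- bounds 0..1000 are placeholders
  COUNT(*)                          AS rows_in_bucket
FROM orders
GROUP BY bucket
ORDER BY bucket;
```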
For any temporal data, investigate:
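For example, missing days in an expected daily cadence can be found by joining against a generated calendar. A sketch, assuming the hypothetical `events.created_at` column again; on large tables, casting inside the join is slow and a pre-aggregated daily count is preferable.

```sql
-- Days with no rows between the earliest and latest event (PostgreSQL)
WITH bounds AS (
  SELECT MIN(created_at)::date AS first_day,
         MAX(created_at)::date AS last_day
  FROM events
),
calendar AS (
  SELECT GENERATE_SERIES(first_day, last_day, INTERVAL '1 day')::date AS day
  FROM bounds
)
SELECT c.day
FROM calendar c
LEFT JOIN events e ON e.created_at::date = c.day
WHERE e.created_at IS NULL  -- no event landed on this day
ORDER BY c.day;
```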
Surface natural groupings by:
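Cardinality and top values for a categorical column are a quick way to surface candidate groupings. A sketch against the hypothetical `events.event_type` column:

```sql
-- Distinct-value count and top values with share of rows (hypothetical column)
SELECT COUNT(DISTINCT event_type) AS distinct_values
FROM events;

SELECT
  event_type,
  COUNT(*)                                              AS row_count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2)    AS pct_of_rows
FROM events
GROUP BY event_type
ORDER BY row_count DESC
LIMIT 10;
```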
Across numeric columns:
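Pairwise linear correlations can be computed directly in the warehouse. A sketch with hypothetical column pairs on the placeholder `orders` table:

```sql
-- Pairwise linear correlation between numeric columns (PostgreSQL, hypothetical columns)
SELECT
  CORR(quantity, revenue) AS quantity_vs_revenue,
  CORR(discount, revenue) AS discount_vs_revenue
FROM orders;
```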
## Table: [schema.table_name]
**Purpose**: [What this table captures]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Approximate Rows**: [count, as of date]
**Refresh Cadence**: [real-time / hourly / daily / weekly]
**Responsible Team**: [owner]
### Important Columns
| Column | Type | Meaning | Sample Values | Notes |
|--------|------|---------|---------------|-------|
| user_id | STRING | Unique user handle | "usr_abc123" | References users.id |
| event_type | STRING | Action category | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | USD transaction amount | 29.99, 149.00 | Null for non-purchases |
| created_at | TIMESTAMP | Event occurrence time | 2024-01-15 14:23:01 | Partition column |
### Join Paths
- Links to `users` via `user_id`
- Links to `products` via `product_id`
- Parent of `event_details` (one-to-many on event_id)
### Known Caveats
- [Document any quality issues]
- [Note analytical gotchas]
### Typical Query Use Cases
- [List common analytical patterns against this table]
When working directly against a warehouse, use these patterns:
```sql
-- Enumerate tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Inspect column metadata (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'target_table'
ORDER BY ordinal_position;

-- Rank tables by storage footprint (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row count per table (general approach)
-- Execute individually: SELECT COUNT(*) FROM table_name
```
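When exact counts are too slow on large tables, catalog statistics give a rough estimate. This sketch assumes PostgreSQL and the `public` schema; the figures are only as fresh as the last VACUUM or ANALYZE.

```sql
-- Approximate row counts from catalog statistics (PostgreSQL)
SELECT relname AS table_name, reltuples::bigint AS approx_rows
FROM pg_catalog.pg_class
WHERE relkind = 'r'                              -- ordinary tables only
  AND relnamespace = 'public'::regnamespace      -- schema is an assumption
ORDER BY approx_rows DESC;
```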
When navigating an unfamiliar warehouse: