From ai-analyst
Deep-profiles active datasets for schema structure, value distributions, temporal patterns, correlations, completeness gaps, and anomalies. Use after connecting datasets or before analysis.
```
npx claudepluginhub ai-analyst-lab/ai-analyst-plugin --plugin ai-analyst
```

This skill uses the workspace's default tool permissions.
Deep-profile the active dataset to understand schema structure, value distributions, temporal patterns, correlations, completeness gaps, and anomalies. Produces a comprehensive profile report that serves as the foundation for analysis planning and data quality assessment.
Start with the schema profile, then record the run as `last_profiled` in `manifest.yaml`:

```python
from helpers.data_helpers import get_connection_for_profiling
from helpers.schema_profiler import profile_source

# Get a connection (auto-detects DuckDB vs CSV from the active dataset)
conn_info = get_connection_for_profiling()

# Run the full schema profile: introspects all tables, covering column names,
# types, nullability, row counts, sample values, basic statistics, and date
# detection
schema = profile_source(conn_info)
```
Record the output. `schema` contains the full table inventory with column-level metadata. Use it to identify which tables have the most rows, which columns are dates, and which are numeric: these determine the order and depth of the deep profiling that follows.
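As a concrete illustration, you can rank tables for deep profiling straight from this output. In the sketch below, the `name` and `date_columns` keys appear in this skill's other snippets, while `row_count` is an assumption about what `profile_source()` returns:

```python
# Sketch: rank tables by date coverage and size before deep profiling.
# Assumes each entry in schema["tables"] carries "name" and "date_columns"
# (used elsewhere in this skill) plus a "row_count" field (assumed here).
candidates = sorted(
    schema["tables"],
    key=lambda t: (len(t.get("date_columns") or []), t.get("row_count", 0)),
    reverse=True,
)
for t in candidates[:5]:
    print(t["name"], t.get("row_count"), t.get("date_columns"))
```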
For each table in the schema, load the data and run the deep profiling functions. Prioritize tables with the most rows and the most date/numeric columns.
```python
from helpers.data_helpers import read_table
from helpers.deep_profiler import (
    profile_distributions,
    profile_temporal_patterns,
    profile_completeness,
)

for table_info in schema["tables"]:
    table_name = table_info["name"]
    df = read_table(table_name)

    # Distribution analysis on all numeric columns
    distributions = profile_distributions(df)

    # Completeness assessment: null rates, zeros, empty strings, constant cols
    completeness = profile_completeness(df)

    # Temporal pattern analysis (only if the table has date columns)
    temporal = None
    if table_info.get("date_columns"):
        primary_date = table_info["date_columns"][0]
        temporal = profile_temporal_patterns(df, primary_date, freq="D")
```
Important: for large tables (>50K rows), `profile_source()` already samples, but `read_table()` loads the full CSV. If a table has more than 100K rows, sample before running deep profiling:
```python
if len(df) > 100_000:
    df = df.sample(n=100_000, random_state=42)
```
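To keep this consistent across the loop, a small wrapper can apply the cap; `load_for_profiling` is a name introduced here for illustration, not one of the plugin's helpers:

```python
from helpers.data_helpers import read_table


def load_for_profiling(table_name, cap=100_000, seed=42):
    """Load a table, downsampling to at most `cap` rows for deep profiling."""
    df = read_table(table_name)
    if len(df) > cap:
        df = df.sample(n=cap, random_state=seed)
    return df
```

A fixed `random_state` keeps the sample, and therefore the profile, reproducible across runs; remember that null rates and distribution statistics computed on a sample are estimates for the full table.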
Run correlation and anomaly detection on tables that contain key business metrics (revenue, counts, rates). Identify these tables by looking for columns with names like `revenue`, `amount`, `total`, `count`, `rate`, `price`, `quantity`.
```python
import pandas as pd

from helpers.deep_profiler import profile_correlations, profile_anomalies

# Correlations: find relationships between numeric columns
correlations = profile_correlations(df, threshold=0.5)

# Anomaly detection requires a date column and pre-aggregated data;
# aggregate to daily granularity first if the table has event-level rows
if table_info.get("date_columns"):
    primary_date = table_info["date_columns"][0]
    # Only run on tables with a clear date + metric pattern
    metric_cols = [
        c for c in df.select_dtypes(include="number").columns
        if c not in ("id", table_name.rstrip("s") + "_id")
    ]
    if metric_cols:
        # Aggregate to daily for anomaly detection
        daily = (
            df.groupby(pd.to_datetime(df[primary_date]).dt.date)[metric_cols]
            .sum()
            .reset_index()
        )
        daily.rename(columns={daily.columns[0]: primary_date}, inplace=True)
        anomalies = profile_anomalies(
            daily,
            date_col=primary_date,
            metric_cols=metric_cols,
            window=14,
        )
```
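If you want to sanity-check the helper's output, a plain rolling z-score over the same daily frame is a reasonable cross-check. This is a generic sketch reusing `daily`, `metric_cols`, and `primary_date` from the block above; it is not `profile_anomalies()`'s actual algorithm, and the 3-sigma threshold is an arbitrary choice:

```python
# Generic cross-check (not profile_anomalies' internals): flag days where a
# metric deviates more than 3 sigma from its trailing 14-day baseline.
for col in metric_cols:
    s = daily[col].astype(float)
    baseline = s.rolling(window=14, min_periods=7)
    z = (s - baseline.mean()) / baseline.std()
    flagged = daily.loc[z.abs() > 3, [primary_date, col]]
    if not flagged.empty:
        print(f"{col}: {len(flagged)} candidate anomalies")
```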
Write the full profile report to `<workspace>/knowledge/datasets/{active}/last_profile.md`. Use `schema_to_markdown()` for the schema portion, then append the deep profiling results.
```python
from helpers.data_helpers import schema_to_markdown, detect_active_source

source = detect_active_source()
active_dataset = source["source"]

# Build the schema markdown section
schema_md = schema_to_markdown(schema)
```
Assemble the full report using the template below and write it to `<workspace>/knowledge/datasets/{active_dataset}/last_profile.md`:
```markdown
# Data Profile: {dataset_name}

**Profiled at:** {ISO timestamp}
**Source:** {connection type} ({path or schema prefix})
**Tables:** {count} | **Total rows:** {sum}

---

## Summary of Findings

| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER | X | {brief list} |
| WARNING | X | {brief list} |
| INFO | X | {brief list} |

---

## Schema Overview

{output of schema_to_markdown()}

---

## Distribution Analysis

### {table_name}

| Column | Shape | Skewness | Outliers (IQR) | Recommended Transform |
|--------|-------|----------|----------------|-----------------------|
| {col} | {shape} | {skew} | {n_outliers} | {transform or "none"} |

---

## Temporal Patterns

### {table_name} ({date_column})

- **Date range:** {min} to {max}
- **Coverage:** {actual}/{expected} periods ({pct}%)
- **Gaps:** {count} gaps found {list if any}
- **Trend:** {trend direction}
- **Seasonality:** {detected or not}
- **Day-of-week pattern:** {summary}

---

## Completeness

### {table_name}

| Column | Status | Null % | Zeros | Empty Strings | Constant? |
|--------|--------|--------|-------|---------------|-----------|
| {col} | {status} | {pct} | {count} | {count} | {yes/no} |

---

## Correlations

### {table_name}

| Column A | Column B | Correlation | Strength | Direction |
|----------|----------|-------------|----------|-----------|
| {col_a} | {col_b} | {r} | {strength} | {direction} |

---

## Anomalies

### {table_name}

{anomaly summary}

| Metric | Spikes | Drops | Details |
|--------|--------|-------|---------|
| {metric} | {count} | {count} | {top anomalies with dates} |

---

## Recommendations

- **BLOCKER items:** {must fix before analysis}
- **WARNING items:** {note as caveats}
- **Suggested analysis focus:** {tables/columns with most signal}
```
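Putting it together, writing the report file might look like the sketch below; the `workspace` variable and the way sections are joined are illustrative, while `active_dataset` and `schema_md` come from the snippet above:

```python
from datetime import datetime, timezone
from pathlib import Path

workspace = Path(".")  # illustrative: substitute the real workspace root

report_path = (
    workspace / "knowledge" / "datasets" / active_dataset / "last_profile.md"
)
report_path.parent.mkdir(parents=True, exist_ok=True)

sections = [
    f"# Data Profile: {active_dataset}",
    f"**Profiled at:** {datetime.now(timezone.utc).isoformat()}",
    "## Schema Overview",
    schema_md,
    # ...append the distribution, temporal, completeness, correlation,
    # and anomaly sections rendered from the deep-profiling results...
]
report_path.write_text("\n\n".join(sections))
```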
Apply these rules consistently across all sections:
| Severity | Condition |
|---|---|
| BLOCKER | >50% nulls in a key metric column; entire date ranges missing (coverage <50%); constant columns that should have variance; very strong correlations (r>0.95) suggesting duplicate columns |
| WARNING | 5-50% nulls; heavy-tailed or bimodal distributions in metric columns; date coverage 50-90%; moderate anomalies detected; skewness >3 suggesting data quality issues |
| INFO | <5% nulls; normal or mild skew distributions; full date coverage; no anomalies; expected correlations (e.g., quantity and revenue) |
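Applied in code, the null-rate thresholds from this table reduce to a small mapping; `null_severity` is a hypothetical helper written here for illustration, not part of the plugin:

```python
def null_severity(null_pct: float) -> str:
    """Map a key metric column's null percentage to the rubric above."""
    if null_pct > 50:
        return "BLOCKER"
    if null_pct >= 5:
        return "WARNING"
    return "INFO"
```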
- For CSV sources, read data with `read_table()`. The schema profiler handles this internally, but deep profiling should also use the CSV path.
- Run `profile_source()` first, so you know which columns are dates, which are numeric, and what the cardinality looks like before deep profiling.
- Save the report to `last_profile.md` so future sessions can reference it without re-profiling.