Skill
data-profiler
Generate data profiles with column stats, correlations, and missing patterns for DataFrames. Use for EDA and data discovery.
From majestic-data
Install
Run in your terminal:
$ npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
Tool Access
This skill is limited to using the following tools:
Read Write Edit Bash
Supporting Assets
scripts/profiling.py
Skill Content
Data Profiler
Audience: Data engineers and analysts exploring new datasets.
Goal: Generate comprehensive profiles including statistics, correlations, and missing patterns.
Scripts
Execute profiling functions from scripts/profiling.py:
from scripts.profiling import (
    profile_dataframe,
    print_profile_summary,
    profile_correlations,
    profile_missing_patterns,
)
Usage Examples
Basic Profiling
import pandas as pd
from scripts.profiling import profile_dataframe, print_profile_summary
df = pd.read_csv('data.csv')
profile = profile_dataframe(df)
print_profile_summary(profile)
Output:
Shape: 10,000 rows x 15 columns
Memory: 1.23 MB
Column Summary:
id (int64): 10,000 unique, no nulls
email (object): 9,847 unique, 1.53% null
revenue (float64): 3,421 unique, no nulls
created_at (datetime64[ns]): 365 unique, no nulls
Correlation Analysis
from scripts.profiling import profile_correlations
corr = profile_correlations(df, threshold=0.7)
if corr['high_correlations']:
    print("Highly correlated columns:")
    for c in corr['high_correlations']:
        print(f"  {c['col1']} <-> {c['col2']}: {c['correlation']}")
Missing Data Patterns
from scripts.profiling import profile_missing_patterns
missing = profile_missing_patterns(df)
for col, stats in missing.items():
    if col != 'co_missing_columns':
        print(f"{col}: {stats['percent']}% missing, max {stats['consecutive_max']} consecutive")
# Check for columns missing together
if 'co_missing_columns' in missing:
    for col1, col2, pct in missing['co_missing_columns']:
        print(f"{col1} and {col2} both missing {pct}% of time")
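For readers curious how such statistics could be derived, here is a minimal sketch using plain pandas. It is not the code in scripts/profiling.py: the names `longest_null_run` and `sketch_missing_patterns` and the `min_pct` parameter are hypothetical, and the result only mirrors the keys used above (`percent`, `consecutive_max`, `co_missing_columns`).

```python
from itertools import combinations

import pandas as pd


def longest_null_run(series: pd.Series) -> int:
    """Length of the longest run of consecutive missing values."""
    is_null = series.isna()
    if not is_null.any():
        return 0
    # Every non-null value starts a new group; the nulls that follow it share
    # its group id, so summing nulls per group yields the run lengths.
    group_ids = (~is_null).cumsum().to_numpy()
    return int(is_null.groupby(group_ids).sum().max())


def sketch_missing_patterns(df: pd.DataFrame, min_pct: float = 1.0) -> dict:
    """Hypothetical helper: per-column missing stats plus co-missing pairs."""
    result = {}
    for col in df.columns:
        n_null = int(df[col].isna().sum())
        if n_null == 0:
            continue  # only report columns that actually have gaps
        result[col] = {
            "count": n_null,
            "percent": round(100 * n_null / len(df), 2),
            "consecutive_max": longest_null_run(df[col]),
        }

    # Share of rows in which both columns of a pair are missing at once.
    pairs = []
    for a, b in combinations(list(result), 2):
        both_pct = round(100 * float((df[a].isna() & df[b].isna()).mean()), 2)
        if both_pct >= min_pct:
            pairs.append((a, b, both_pct))
    if pairs:
        result["co_missing_columns"] = pairs
    return result
```

Columns without nulls are left out of the result here; whether the real script does the same is an assumption.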
Profile Output Schema
shape: [rows, columns]
memory_mb: float
columns:
  column_name:
    dtype: string
    null_count: int
    null_pct: float
    unique_count: int
    unique_pct: float
    # Numeric columns add:
    min: float
    max: float
    mean: float
    std: float
    median: float
    zeros: int
    negatives: int
    # String columns add:
    min_length: int
    max_length: int
    top_values: {value: count}
    # Datetime columns add:
    min_date: string
    max_date: string
    date_range_days: int
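To make the schema concrete, the sketch below builds one entry of the `columns` mapping for a single Series. It assumes plain pandas and is not the implementation in scripts/profiling.py; `sketch_column_profile` is a hypothetical name, and limiting `top_values` to five entries is an arbitrary choice.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_column_profile(s: pd.Series) -> dict:
    """Hypothetical helper returning one entry of the `columns` mapping."""
    n = len(s)
    non_null = s.dropna()
    profile = {
        "dtype": str(s.dtype),
        "null_count": int(s.isna().sum()),
        "null_pct": float(round(100 * s.isna().mean(), 2)) if n else 0.0,
        "unique_count": int(s.nunique()),
        "unique_pct": float(round(100 * s.nunique() / n, 2)) if n else 0.0,
    }
    if non_null.empty:
        return profile  # nothing type-specific to report

    if ptypes.is_datetime64_any_dtype(s):
        profile.update({
            "min_date": str(non_null.min()),
            "max_date": str(non_null.max()),
            "date_range_days": int((non_null.max() - non_null.min()).days),
        })
    elif ptypes.is_numeric_dtype(s):
        profile.update({
            "min": float(non_null.min()),
            "max": float(non_null.max()),
            "mean": float(non_null.mean()),
            "std": float(non_null.std()),
            "median": float(non_null.median()),
            "zeros": int((non_null == 0).sum()),
            "negatives": int((non_null < 0).sum()),
        })
    else:
        # Treat everything else as text for the purpose of this sketch.
        lengths = non_null.astype(str).str.len()
        top = non_null.value_counts().head(5)
        profile.update({
            "min_length": int(lengths.min()),
            "max_length": int(lengths.max()),
            "top_values": {str(k): int(v) for k, v in top.items()},
        })
    return profile
```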
Analysis Dimensions
Numeric Columns
- Min, max, range, mean, median, mode
- Standard deviation, variance, skewness, kurtosis
- Percentiles (5, 25, 50, 75, 95)
- Zero count, negative count
- Outlier detection (IQR method)
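The numeric measures listed above can be sketched with standard pandas calls. This is an illustration only: `sketch_numeric_stats` is a hypothetical name, and the 1.5 x IQR fence is the conventional choice, assumed here rather than taken from the skill's script.

```python
import pandas as pd


def sketch_numeric_stats(s: pd.Series) -> dict:
    """Hypothetical helper: distribution stats and IQR outliers for one column."""
    x = s.dropna().astype(float)
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    # Conventional Tukey fences: values beyond 1.5 * IQR from the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < lower) | (x > upper)]
    modes = x.mode()
    return {
        "min": x.min(),
        "max": x.max(),
        "range": x.max() - x.min(),
        "mean": x.mean(),
        "median": x.median(),
        "mode": modes.iloc[0] if not modes.empty else None,
        "std": x.std(),
        "variance": x.var(),
        "skewness": x.skew(),
        "kurtosis": x.kurt(),  # pandas reports excess kurtosis (normal = 0)
        "percentiles": {p: x.quantile(p / 100) for p in (5, 25, 50, 75, 95)},
        "zeros": int((x == 0).sum()),
        "negatives": int((x < 0).sum()),
        "outlier_count": int(len(outliers)),
        "outlier_bounds": (lower, upper),
    }
```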
String Columns
- Min/max/avg length
- Pattern analysis (emails, phones, URLs)
- Top N frequent values
- Whitespace issues (leading/trailing)
- Case distribution (upper/lower/mixed)
- Empty string count
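A rough sketch of the string checks above, assuming pandas string methods. The name `sketch_string_stats` and the regular expressions in `PATTERNS` are hypothetical; the patterns are deliberately loose and would need tightening for real validation.

```python
import pandas as pd

# Loose illustrative patterns, not production-grade validators.
PATTERNS = {
    "email": r"[^@\s]+@[^@\s]+\.[^@\s]+",
    "url": r"https?://\S+",
    "phone": r"\+?[\d\s().-]{7,}",
}


def sketch_string_stats(s: pd.Series, top_n: int = 5) -> dict:
    """Hypothetical helper: length, whitespace, case and pattern stats."""
    x = s.dropna().astype(str)
    if x.empty:
        return {}
    lengths = x.str.len()
    stripped = x.str.strip()
    is_upper = x.str.isupper()
    is_lower = x.str.islower()
    # A value has cased characters if upper- and lower-casing it differ.
    has_cased = x.str.upper() != x.str.lower()
    return {
        "min_length": int(lengths.min()),
        "max_length": int(lengths.max()),
        "avg_length": float(lengths.mean()),
        "empty_strings": int((x == "").sum()),
        "whitespace_issues": int((x != stripped).sum()),
        "case": {
            "upper": int(is_upper.sum()),
            "lower": int(is_lower.sum()),
            "mixed": int((has_cased & ~is_upper & ~is_lower).sum()),
        },
        # Percentage of values that fully match each pattern (after stripping).
        "pattern_pct": {
            name: float(round(100 * stripped.str.fullmatch(rx).mean(), 2))
            for name, rx in PATTERNS.items()
        },
        "top_values": x.value_counts().head(top_n).to_dict(),
    }
```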
DateTime Columns
- Min/max dates, date range span
- Missing dates in sequence
- Day of week distribution
- Hour distribution (if timestamp)
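One way the datetime checks above could look, as a sketch. `sketch_datetime_stats` is a hypothetical name, and "missing dates in sequence" is interpreted here as calendar days between the minimum and maximum with no observation, which is an assumption about the intended meaning.

```python
import pandas as pd


def sketch_datetime_stats(s: pd.Series) -> dict:
    """Hypothetical helper: range, daily gaps and weekday/hour distributions."""
    x = pd.to_datetime(s.dropna())
    if x.empty:
        return {}
    start, end = x.min(), x.max()

    # Assumes a daily cadence: any calendar day without a row counts as missing.
    observed_days = set(x.dt.date)
    expected = pd.date_range(start.normalize(), end.normalize(), freq="D")
    missing = [str(d.date()) for d in expected if d.date() not in observed_days]

    stats = {
        "min_date": str(start),
        "max_date": str(end),
        "date_range_days": int((end - start).days),
        "missing_dates": missing,
        "day_of_week": x.dt.day_name().value_counts().to_dict(),
    }
    # Only report hours when at least one value carries a time component.
    if bool((x != x.dt.normalize()).any()):
        stats["hour"] = x.dt.hour.value_counts().sort_index().to_dict()
    return stats
```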
Categorical Columns
- Cardinality, value distribution
- Imbalance ratio
- Rare categories (< 1%)
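The categorical measures can be sketched from a single `value_counts` call. `sketch_categorical_stats` is a hypothetical name, and the imbalance ratio is defined here as the most frequent count divided by the least frequent one, which is one common definition and an assumption about what the skill computes.

```python
import pandas as pd


def sketch_categorical_stats(s: pd.Series, rare_threshold: float = 0.01) -> dict:
    """Hypothetical helper: cardinality, distribution, imbalance, rare values."""
    counts = s.dropna().value_counts()
    if counts.empty:
        return {}
    shares = counts / counts.sum()
    return {
        "cardinality": int(counts.size),
        "distribution": shares.round(4).to_dict(),
        # Assumed definition: largest class size over smallest class size.
        "imbalance_ratio": float(counts.max() / counts.min()),
        # Categories covering less than `rare_threshold` of non-null rows.
        "rare_categories": shares[shares < rare_threshold].index.tolist(),
    }
```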
Correlation Analysis
import numpy as np

# Pairwise Pearson correlations between numeric columns only
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
# Highly correlated pairs (> 0.8)
high_corr = []
cols = correlation_matrix.columns
# Walk the upper triangle so each pair is checked once and the diagonal is skipped
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr.append((
                cols[i],
                cols[j],
                correlation_matrix.iloc[i, j]
            ))
Quality Flags
Automatically flag:
- High nulls: > 50% missing values
- Constant column: Only 1 unique value
- High cardinality: Unique ratio > 95% (possible ID)
- Suspected duplicates: Based on key columns
- Data type mismatch: Numeric stored as string
- Future dates: Dates beyond today
- Negative values: In typically positive columns
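The flag rules above translate fairly directly into pandas checks. The sketch below is an assumption about how they might be wired together, not the skill's own code: `sketch_quality_flags`, its `positive_cols` and `key_cols` parameters, and the flag names are all hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_quality_flags(df: pd.DataFrame, positive_cols=(), key_cols=None) -> dict:
    """Hypothetical helper applying the flag rules listed above."""
    n = len(df)
    column_flags = {}
    for col in df.columns:
        s = df[col]
        non_null = s.dropna()
        flags = []
        if n and s.isna().mean() > 0.5:
            flags.append("high_nulls")
        if non_null.nunique() == 1:
            flags.append("constant")
        if n and s.nunique() / n > 0.95:
            flags.append("high_cardinality")
        if ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s):
            # Text column whose every non-null value parses as a number.
            parsed = pd.to_numeric(non_null, errors="coerce")
            if not non_null.empty and parsed.notna().all():
                flags.append("numeric_as_string")
        if ptypes.is_datetime64_any_dtype(s):
            # Match the column's timezone so the comparison is valid.
            if (non_null > pd.Timestamp.now(tz=s.dt.tz)).any():
                flags.append("future_dates")
        if col in positive_cols and ptypes.is_numeric_dtype(s):
            if (non_null < 0).any():
                flags.append("negative_values")
        if flags:
            column_flags[col] = flags

    # Rows repeating an earlier row on the key columns (all columns if None).
    duplicate_rows = int(df.duplicated(subset=key_cols).sum())
    return {"columns": column_flags, "duplicate_rows": duplicate_rows}
```

Which columns count as "typically positive" cannot be inferred from the data alone, so this sketch takes them as an explicit argument.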
Report Sections
Executive Summary
- Dataset shape (rows x columns), memory footprint
- Overall quality score, critical issues count
Column-by-Column Analysis
- Statistics table, distribution histogram (ASCII for terminal)
- Top values (for categorical), quality warnings
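An ASCII histogram for terminal output can be drawn with `numpy.histogram` and string repetition. This is a minimal sketch; `ascii_histogram` and its `bins` and `width` parameters are hypothetical, and the real report layout may differ.

```python
import numpy as np
import pandas as pd


def ascii_histogram(s: pd.Series, bins: int = 5, width: int = 30) -> str:
    """Hypothetical helper: one text row per bin, bar scaled to the tallest bin."""
    x = s.dropna().to_numpy(dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    peak = max(int(counts.max()), 1)  # avoid dividing by zero on empty input
    lines = []
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(width * int(count) / peak))
        lines.append(f"{lo:>10.2f} to {hi:>10.2f} | {bar} {int(count)}")
    return "\n".join(lines)
```

Usage: `print(ascii_histogram(df['revenue'], bins=10))`, where `revenue` stands in for any numeric column.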
Relationships
- Correlation heatmap summary
- Potential foreign key relationships, column dependencies
Recommendations
- Suggested data type optimizations
- Columns to investigate, potential data quality rules
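Data type suggestions could be generated along these lines. The sketch assumes `pandas.to_numeric` with its `downcast` option for numeric columns and a simple uniqueness ratio for text columns; `suggest_dtype_optimizations` and `category_ratio` are hypothetical names, and the 0.5 default is arbitrary.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_dtype_optimizations(df: pd.DataFrame, category_ratio: float = 0.5) -> dict:
    """Hypothetical helper: map column name to a smaller suggested dtype."""
    suggestions = {}
    for col in df.columns:
        s = df[col]
        if ptypes.is_integer_dtype(s):
            smaller = pd.to_numeric(s, downcast="integer").dtype
            if smaller != s.dtype:
                suggestions[col] = str(smaller)
        elif ptypes.is_float_dtype(s):
            # Note: float32 saves memory but reduces precision.
            smaller = pd.to_numeric(s, downcast="float").dtype
            if smaller != s.dtype:
                suggestions[col] = str(smaller)
        elif ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s):
            # Few distinct values relative to rows: a categorical is cheaper.
            if len(s) and s.nunique() / len(s) < category_ratio:
                suggestions[col] = "category"
    return suggestions
```

The result is advisory only; applying a suggestion would be a separate step such as `df[col].astype(suggestions[col])`.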
Output Formats
- Markdown Report: Full detailed report with tables
- JSON Summary: Machine-readable profile for programmatic use
- HTML Dashboard: Interactive report with charts (if ydata-profiling available)
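For the JSON summary, profile dictionaries built with pandas and numpy usually contain values that `json.dumps` rejects, such as `numpy.int64` or `pandas.Timestamp`. The sketch below shows one way to normalize them first; `to_json_safe` and `profile_to_json` are hypothetical helpers, not functions exported by scripts/profiling.py.

```python
import datetime
import json
import math

import numpy as np
import pandas as pd


def to_json_safe(value):
    """Recursively convert numpy/pandas values into JSON-serializable types."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, np.generic):
        value = value.item()  # numpy scalar to plain Python scalar
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()  # also covers pandas.Timestamp
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return None  # NaN and infinity are not valid JSON
    return value


def profile_to_json(profile: dict, indent: int = 2) -> str:
    """Hypothetical helper: serialize a profile dictionary to a JSON string."""
    return json.dumps(to_json_safe(profile), indent=indent)
```

Usage: `print(profile_to_json(profile))`, assuming `profile` is a dictionary shaped like the schema in this document.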
Dependencies
pandas
numpy
Stats
Parent Repo Stars: 30
Parent Repo Forks: 6
Last Commit: Mar 15, 2026