Skill: data-profiler

Generate data profiles with column stats, correlations, and missing patterns for DataFrames. Use for EDA and data discovery.

From majestic-data
Install

Run in your terminal:

$ npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
Tool Access

This skill is limited to using the following tools:

Read, Write, Edit, Bash
Supporting Assets
scripts/profiling.py
Skill Content

Data Profiler

Audience: Data engineers and analysts exploring new datasets.

Goal: Generate comprehensive profiles including statistics, correlations, and missing patterns.

Scripts

Execute profiling functions from scripts/profiling.py:

from scripts.profiling import (
    profile_dataframe,
    print_profile_summary,
    profile_correlations,
    profile_missing_patterns
)

Usage Examples

Basic Profiling

import pandas as pd
from scripts.profiling import profile_dataframe, print_profile_summary

df = pd.read_csv('data.csv')
profile = profile_dataframe(df)
print_profile_summary(profile)

Output:

Shape: 10,000 rows x 15 columns
Memory: 1.23 MB

Column Summary:
  id (int64): 10,000 unique, no nulls
  email (object): 9,847 unique, 1.53% null
  revenue (float64): 3,421 unique, no nulls
  created_at (datetime64[ns]): 365 unique, no nulls

Correlation Analysis

from scripts.profiling import profile_correlations

corr = profile_correlations(df, threshold=0.7)

if corr['high_correlations']:
    print("Highly correlated columns:")
    for c in corr['high_correlations']:
        print(f"  {c['col1']} <-> {c['col2']}: {c['correlation']}")

Missing Data Patterns

from scripts.profiling import profile_missing_patterns

missing = profile_missing_patterns(df)

for col, stats in missing.items():
    if col != 'co_missing_columns':
        print(f"{col}: {stats['percent']}% missing, max {stats['consecutive_max']} consecutive")

# Check for columns missing together
if 'co_missing_columns' in missing:
    for col1, col2, pct in missing['co_missing_columns']:
        print(f"{col1} and {col2} both missing {pct}% of time")

Profile Output Schema

shape: [rows, columns]
memory_mb: float
columns:
  column_name:
    dtype: string
    null_count: int
    null_pct: float
    unique_count: int
    unique_pct: float
    # Numeric columns add:
    min: float
    max: float
    mean: float
    std: float
    median: float
    zeros: int
    negatives: int
    # String columns add:
    min_length: int
    max_length: int
    top_values: {value: count}
    # Datetime columns add:
    min_date: string
    max_date: string
    date_range_days: int
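
A minimal sketch of reading the returned profile, assuming profile_dataframe yields a dict matching the schema above (key names are taken from the schema, not verified against scripts/profiling.py):

import pandas as pd
from scripts.profiling import profile_dataframe

df = pd.read_csv('data.csv')
profile = profile_dataframe(df)

print(f"{profile['shape'][0]:,} rows x {profile['shape'][1]} columns, "
      f"{profile['memory_mb']:.2f} MB")

# Iterate per-column stats; numeric-only keys may be absent for other dtypes
for name, col in profile['columns'].items():
    line = f"{name} ({col['dtype']}): {col['null_pct']}% null"
    if 'mean' in col:
        line += f", mean={col['mean']:.2f}"
    print(line)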

Analysis Dimensions

Numeric Columns

  • Min, max, range, mean, median, mode
  • Standard deviation, variance, skewness, kurtosis
  • Percentiles (5, 25, 50, 75, 95)
  • Zero count, negative count
  • Outlier detection (IQR method; sketched after this list)
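
For reference, a standalone sketch of the IQR outlier rule mentioned above; the helper name iqr_outliers and the 1.5 multiplier are illustrative choices, not taken from scripts/profiling.py:

import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# Example: outliers = iqr_outliers(df['revenue'])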

String Columns

  • Min/max/avg length
  • Pattern analysis (emails, phones, URLs)
  • Top N frequent values
  • Whitespace issues (leading/trailing)
  • Case distribution (upper/lower/mixed)
  • Empty string count (several of these checks are sketched after this list)
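
A plain-pandas sketch of several of the string checks above; the function name and return keys are illustrative and independent of scripts/profiling.py:

import pandas as pd

def string_checks(series: pd.Series, top_n: int = 5) -> dict:
    s = series.dropna().astype(str)
    lengths = s.str.len()
    return {
        'min_length': int(lengths.min()),
        'max_length': int(lengths.max()),
        'avg_length': round(float(lengths.mean()), 1),
        'whitespace_issues': int((s != s.str.strip()).sum()),
        'empty_strings': int((s == '').sum()),
        'top_values': s.value_counts().head(top_n).to_dict(),
    }

# Example: string_checks(df['email'])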

DateTime Columns

  • Min/max dates, date range span
  • Missing dates in sequence
  • Day of week distribution
  • Hour distribution (if timestamp); a combined sketch of these date checks follows
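
A sketch of the date checks along the same lines, assuming a datetime-convertible column (function and key names are illustrative):

import pandas as pd

def datetime_checks(series: pd.Series) -> dict:
    s = pd.to_datetime(series.dropna())
    days = s.dt.normalize()
    full_range = pd.date_range(days.min(), days.max(), freq='D')
    return {
        'min_date': str(s.min()),
        'max_date': str(s.max()),
        'date_range_days': (s.max() - s.min()).days,
        'missing_days_in_sequence': len(full_range.difference(days.unique())),
        'day_of_week_counts': s.dt.day_name().value_counts().to_dict(),
        'hour_counts': s.dt.hour.value_counts().sort_index().to_dict(),
    }

# Example: datetime_checks(df['created_at'])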

Categorical Columns

  • Cardinality, value distribution
  • Imbalance ratio
  • Rare categories (< 1%); see the sketch after this list
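
A sketch of the categorical checks; here the imbalance ratio is read as the most common category count over the least common, which is one reasonable interpretation, and both the function and example column are illustrative:

import pandas as pd

def categorical_checks(series: pd.Series, rare_pct: float = 1.0) -> dict:
    counts = series.value_counts(dropna=True)
    freqs = counts / counts.sum() * 100
    return {
        'cardinality': int(counts.size),
        'imbalance_ratio': float(counts.iloc[0] / counts.iloc[-1]),
        'rare_categories': freqs[freqs < rare_pct].index.tolist(),
    }

# Example: categorical_checks(df['plan_type'])  # hypothetical column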

Correlation Analysis

import numpy as np

# df is the DataFrame profiled in the earlier examples
correlation_matrix = df.select_dtypes(include=[np.number]).corr()

# Collect highly correlated pairs (|r| > 0.8) from the upper triangle
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

Quality Flags

Automatically flag (a sketch of a few of these checks follows the list):

  • High nulls: > 50% missing values
  • Constant column: Only 1 unique value
  • High cardinality: Unique ratio > 95% (possible ID)
  • Suspected duplicates: Based on key columns
  • Data type mismatch: Numeric stored as string
  • Future dates: Dates beyond today
  • Negative values: In typically positive columns
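
An illustrative subset of these flags in plain pandas; the thresholds mirror the list above, and the function is a sketch rather than the skill's implementation:

import pandas as pd

def quality_flags(df: pd.DataFrame) -> dict:
    flags = {}
    today = pd.Timestamp.today()
    for col in df.columns:
        s = df[col]
        col_flags = []
        if s.isna().mean() > 0.5:
            col_flags.append('high_nulls')
        if s.nunique(dropna=True) <= 1:
            col_flags.append('constant_column')
        if len(s) and s.nunique(dropna=True) / len(s) > 0.95:
            col_flags.append('high_cardinality')
        if pd.api.types.is_datetime64_any_dtype(s) and (s > today).any():
            col_flags.append('future_dates')
        if col_flags:
            flags[col] = col_flags
    return flags

# Example: quality_flags(df)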

Report Sections

Executive Summary

  • Dataset shape (rows x columns), memory footprint
  • Overall quality score, critical issues count

Column-by-Column Analysis

  • Statistics table, distribution histogram (ASCII for terminal; see the sketch below)
  • Top values (for categorical), quality warnings
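
A minimal sketch of the kind of ASCII histogram referred to above; the bin count and bar width are arbitrary choices:

import numpy as np
import pandas as pd

def ascii_histogram(series: pd.Series, bins: int = 10, width: int = 40) -> str:
    counts, edges = np.histogram(series.dropna(), bins=bins)
    scale = width / max(counts.max(), 1)
    lines = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        bar = '#' * int(count * scale)
        lines.append(f"{left:10.2f} - {right:10.2f} | {bar} {count}")
    return '\n'.join(lines)

# Example: print(ascii_histogram(df['revenue']))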

Relationships

  • Correlation heatmap summary
  • Potential foreign key relationships, column dependencies

Recommendations

  • Suggested data type optimizations
  • Columns to investigate, potential data quality rules

Output Formats

  • Markdown Report: Full detailed report with tables
  • JSON Summary: Machine-readable profile for programmatic use (export sketch after this list)
  • HTML Dashboard: Interactive report with charts (if ydata-profiling available)
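
For the JSON summary, one possible export path, assuming the profile dict only needs help serializing numpy and timestamp scalars (handled here with default=str):

import json
import pandas as pd
from scripts.profiling import profile_dataframe

df = pd.read_csv('data.csv')
profile = profile_dataframe(df)

with open('profile.json', 'w') as f:
    json.dump(profile, f, indent=2, default=str)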

Dependencies

pandas
numpy