From arthur0824hao-skills
Exploratory Data Analysis skill for CSV and parquet datasets with deterministic profiling, drift/anomaly scans, contract generation and validation, and optional memory writeback into skill-system-memory. The implementation is Polars-first (lazy scan for large files and early `--sample` head), includes high-cardinality guards for profile/importance/contract flows, and supports categorical correlation with Cramer's V. Use when building or reviewing tabular fraud/risk/data-quality workflows, profiling new datasets, checking leakage or drift, or saving/validating data contracts.
npx claudepluginhub arthur0824hao/skills --plugin document-skills

This skill uses the workspace's default tool permissions.
Use `scripts/eda.py` for deterministic EDA artifacts. The current stable backend is tabular EDA, and the multimodal entrypoint is `explore`.
```bash
python3 scripts/eda.py detect-modality --input data_root
python3 scripts/eda.py explore --input data_or_folder
python3 scripts/eda.py explore --input data_or_folder --modality graph
python3 scripts/eda.py graph-viz --input data.csv --features amount,score --id-column account_id --label Class --edge-mode knn --topk 50 --normalize l2 --similarity cosine --output /tmp/eda_graph
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml
```
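The tabular commands above follow a fixed order: profile first, then the analysis passes that reuse `profile.yaml`, then the contract. A minimal sketch of scripting that order is shown here. The helper name `build_pipeline` is hypothetical and not part of `scripts/eda.py`; it only assembles argument lists from the flags listed above and does not execute anything.

```python
# Hypothetical helper: assembles argv lists for the tabular workflow.
# Only subcommands and flags that appear in the command list are used.
EDA = ["python3", "scripts/eda.py"]

# Analysis passes that take --input, --target and --profile.
ANALYSIS_OPS = (
    "distribution-report",
    "correlation-matrix",
    "anomaly-profiling",
    "feature-importance-scan",
    "leakage-detector",
)


def build_pipeline(input_path, target, out_dir):
    """Return argv lists: profile-dataset, analysis passes, save-contract."""
    base = out_dir.rstrip("/")
    profile = f"{base}/profile.yaml"
    contract = f"{base}/contract.yaml"

    steps = [
        EDA + ["profile-dataset", "--input", input_path,
               "--target", target, "--output", out_dir]
    ]
    for op in ANALYSIS_OPS:
        steps.append(
            EDA + [op, "--input", input_path,
                   "--target", target, "--profile", profile]
        )
    steps.append(
        EDA + ["save-contract", "--profile", profile, "--output", contract]
    )
    return steps
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the skill root, so a failing step stops the sequence before a contract is written from an incomplete profile.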
- `profile-dataset` creates `profile.yaml` and `report.md`.
- `explore` reports detected modalities and the selected backend; tabular mode may immediately route into existing tabular profiling.
- `graph-viz` emits `graph_viz/index.html`, `graph_viz/graph.json`, and `graph_viz/sliders.json`, and appends graph-viz references to `profile.yaml` and `report.md`.
- `graph-viz` records `renderer_hint`, `preview_applied`, and browser-edge limits so large-graph fallback is explicit rather than silent.
- The follow-up analysis operations update `profile.yaml` and append sections to `report.md`.
- `save-contract` emits `contract.yaml`.
- `validate-contract` prints JSON `PASS` / `FAIL` with a violation list.
- Sampling uses `.head(N)` when `--sample` is used.
- Treat `profile.yaml` as the machine-readable source of truth; `report.md` is the human-readable companion.
- Inputs are read with Polars lazy scans (`scan_csv`/`scan_parquet`), with materialization delayed until needed.
- High-cardinality guards: a column with `>50 unique` values skips one-hot encoding in feature importance, and profile truncates categorical columns (`>100 unique` or `>50% row cardinality`) to the top-20 values.
- Memory writeback goes through `skill-system-memory/scripts/mem.py store` when available.
- When `EDA_DISABLE_MEM_PY=1` is set, fallback payloads are written under `.memory/pending/`.
- Use `--no-memory` for deterministic tests or when no writeback is desired.
- `save-contract` derives column requirements from `profile.yaml`.
- High-cardinality columns get `cardinality_range` rules instead of `allowed_values`.
- `validate-contract` fails closed and returns machine-readable violations.
- `graph-viz` is a tabular-to-graph visualization flow, not a replacement for graph-native modality EDA.
- `graph.json` is the viewer payload authority; `sliders.json` is the UI-control authority.
- `renderer_hint=canvas` means the full interactive force layout is expected to be browser-safe.
- `renderer_hint=webgl` means dataset scale or edge volume exceeded the canvas-friendly threshold; the shipped viewer still loads, but preview edges are preferred by default.
- `--max-browser-edges` controls when preview fallback is applied. Raising it may crash Chromium on very large graphs.

Example: Esun-style feature-bank payload generalized into EDA input/output conventions:
```bash
python3 scripts/eda.py graph-viz \
  --input Work/Study/GNN/FraudDetect/esun_data/combined_features.csv \
  --features senior28_01,senior28_02,senior28_03,senior28_04 \
  --id-column account_id \
  --label is_fraud \
  --edge-mode knn \
  --topk 50 \
  --normalize l2 \
  --similarity cosine \
  --max-browser-edges 60000 \
  --output /tmp/esun_graph_viz
```
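The flags `--edge-mode knn`, `--topk`, `--normalize l2`, and `--similarity cosine` suggest that each row becomes a node and is linked to its most similar rows. The sketch below illustrates that idea in plain Python. It is an assumption about what those flags mean, not the code in `scripts/eda.py`; the function names are hypothetical, and the brute-force pairwise loop is only practical for small inputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_cosine_edges(features, topk):
    """Build directed kNN edges from a dict of node id -> feature vector.

    Returns a list of (source, neighbor, similarity) tuples with at most
    `topk` edges per source node.
    """
    unit = {node: l2_normalize(vec) for node, vec in features.items()}
    edges = []
    for src, u in unit.items():
        # After L2 normalization the dot product equals cosine similarity.
        sims = [
            (sum(a * b for a, b in zip(u, v)), dst)
            for dst, v in unit.items()
            if dst != src
        ]
        # Highest similarity first; ties broken by node id for determinism.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        edges.extend((src, dst, sim) for sim, dst in sims[:topk])
    return edges
```

With `N` nodes this produces at most `N * topk` directed edges, which is the quantity a limit such as `--max-browser-edges` would plausibly be compared against.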
Example: generic customer risk dataset with a multi-class label column:
```bash
python3 scripts/eda.py graph-viz \
  --input data/customer_risk.parquet \
  --features amount,velocity_score,merchant_entropy,geo_distance \
  --id-column customer_id \
  --label segment \
  --edge-mode knn \
  --topk 25 \
  --normalize l2 \
  --similarity cosine \
  --output /tmp/customer_graph_viz
```
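Separately from graph-viz, the high-cardinality thresholds stated earlier (more than 50 unique values skips one-hot encoding, and more than 100 unique values or more than 50% row cardinality truncates to the top 20) can be sketched as a small decision function. This is an illustration only: the function names and the returned dict layout are hypothetical, and "row cardinality" is read here as unique values divided by row count, which is an assumption.

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50     # above this, one-hot encoding is skipped
TRUNCATE_MAX_UNIQUE = 100   # above this, the value list is truncated
TRUNCATE_MAX_RATIO = 0.5    # unique/rows above this also truncates
TOP_N = 20                  # values kept after truncation


def skip_one_hot(n_unique):
    """True when a column is too wide to one-hot encode."""
    return n_unique > ONE_HOT_MAX_UNIQUE


def should_truncate(n_unique, n_rows):
    """True when the profile should keep only the top values."""
    if n_unique > TRUNCATE_MAX_UNIQUE:
        return True
    return n_rows > 0 and n_unique / n_rows > TRUNCATE_MAX_RATIO


def profile_categorical(values):
    """Summarize one categorical column under the guards above."""
    counts = Counter(values)
    n_unique, n_rows = len(counts), len(values)
    truncated = should_truncate(n_unique, n_rows)
    top = counts.most_common(TOP_N) if truncated else counts.most_common()
    return {
        "n_unique": n_unique,
        "truncated": truncated,
        "one_hot_skipped": skip_one_hot(n_unique),
        "top_values": top,
    }
```

A column with 60 distinct values across 1,000 rows would skip one-hot encoding but keep its full value list, while an ID-like column with one value per row would be truncated by the ratio rule even if it had fewer than 100 rows.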
```json
{
  "schema_version": "2.0",
  "id": "skill-system-eda",
  "version": "1.1.0",
  "capabilities": [
    "eda-detect",
    "eda-graph-viz",
    "eda-profile",
    "eda-distribution",
    "eda-correlation",
    "eda-anomaly",
    "eda-feature-importance",
    "eda-leakage",
    "eda-contract-save",
    "eda-contract-validate"
  ],
  "effects": ["fs.read", "fs.write", "proc.exec"],
  "operations": {
    "profile-dataset": {
      "description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "output": { "type": "string", "required": true },
        "sample": { "type": "integer", "required": false },
        "no_memory": { "type": "boolean", "required": false }
      },
      "output": {
        "description": "Artifact paths for the generated EDA profile",
        "fields": { "profile": "string", "report": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
      }
    },
    "detect-modality": {
      "description": "Detect dataset modality and return all matching modality tags.",
      "input": {
        "input": { "type": "string", "required": true }
      },
      "output": {
        "description": "Detected modalities",
        "fields": { "modalities": "array", "path": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "detect-modality", "--input", "{input}"]
      }
    },
    "explore-dataset": {
      "description": "Detect dataset modality and route to the appropriate EDA backend.",
      "input": {
        "input": { "type": "string", "required": true },
        "modality": { "type": "string", "required": false },
        "output": { "type": "string", "required": false }
      },
      "output": {
        "description": "Detected modalities, selected modality, and backend routing result",
        "fields": { "detected_modalities": "array", "selected_modality": "string", "status": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "explore", "--input", "{input}"]
      }
    },
    "graph-viz": {
      "description": "Build reusable graph visualization artifacts for tabular or graph datasets.",
      "input": {
        "input": { "type": "string", "required": true },
        "features": { "type": "string", "required": false },
        "id_column": { "type": "string", "required": false },
        "label": { "type": "string", "required": false },
        "edge_mode": { "type": "string", "required": false },
        "edge_input": { "type": "string", "required": false },
        "topk": { "type": "integer", "required": false },
        "normalize": { "type": "string", "required": false },
        "similarity": { "type": "string", "required": false },
        "sample": { "type": "integer", "required": false },
        "output": { "type": "string", "required": true }
      },
      "output": {
        "description": "Graph visualization artifact paths and integration outputs",
        "fields": { "html": "string", "graph_json": "string", "slider_config": "string", "renderer_hint": "string", "profile": "string", "report": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "graph-viz", "--input", "{input}", "--output", "{output}"]
      }
    },
    "distribution-report": {
      "description": "Append distribution and class-conditional analysis to an existing profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "distribution-report", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "correlation-matrix": {
      "description": "Compute feature and target correlations and append them to profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "correlation-matrix", "--input", "{input}", "--profile", "{profile}"]
      }
    },
    "anomaly-profiling": {
      "description": "Compare class-conditional distributions and effect sizes.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "anomaly-profiling", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "feature-importance-scan": {
      "description": "Rank features with mutual information and optional tree importances.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "feature-importance-scan", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "leakage-detector": {
      "description": "Detect high-correlation, target-encoding, and temporal leakage indicators.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "leakage-detector", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "save-contract": {
      "description": "Generate a data contract from a saved EDA profile.",
      "input": {
        "profile": { "type": "string", "required": true },
        "output": { "type": "string", "required": true }
      },
      "output": { "description": "Contract path", "fields": { "contract": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "save-contract", "--profile", "{profile}", "--output", "{output}"]
      }
    },
    "validate-contract": {
      "description": "Validate a new dataset against a saved contract and emit PASS/FAIL JSON.",
      "input": {
        "input": { "type": "string", "required": true },
        "contract": { "type": "string", "required": true }
      },
      "output": { "description": "Validation status and violations", "fields": { "status": "string", "violations": "array" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "validate-contract", "--input", "{input}", "--contract", "{contract}"]
      }
    }
  },
  "stdout_contract": {
    "last_line_json": true
  }
}
```
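The manifest pairs each operation with an `entrypoints.unix` argument template and declares `"last_line_json": true` under `stdout_contract`. A caller could use both roughly as sketched below. The function names `expand_entrypoint` and `parse_last_line_json` are hypothetical, and reading `last_line_json` as "the final non-empty stdout line is a JSON document" is an assumption about the contract, not a confirmed behavior.

```python
import json


def expand_entrypoint(manifest, operation, params, platform="unix"):
    """Fill the {placeholder} slots of an operation's argv template.

    Raises KeyError when the operation, platform, or a parameter is missing.
    """
    template = manifest["operations"][operation]["entrypoints"][platform]
    return [part.format(**params) for part in template]


def parse_last_line_json(stdout):
    """Parse the final non-empty line of captured stdout as JSON."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])


if __name__ == "__main__":
    # Trimmed copy of the validate-contract entry from the manifest.
    manifest = {
        "operations": {
            "validate-contract": {
                "entrypoints": {
                    "unix": ["python3", "scripts/eda.py", "validate-contract",
                             "--input", "{input}", "--contract", "{contract}"]
                }
            }
        }
    }
    argv = expand_entrypoint(
        manifest,
        "validate-contract",
        {"input": "new_data.csv", "contract": "/tmp/eda/contract.yaml"},
    )
    print(argv)

    # Made-up stdout for illustration; real violation messages may differ.
    fake_stdout = 'log line\n{"status": "FAIL", "violations": ["example"]}\n'
    print(parse_last_line_json(fake_stdout)["status"])
```

Keeping the expansion strict (a missing parameter raises instead of leaving a literal `{input}` in the command) matches the fail-closed stance the skill takes for contract validation.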