Cleans raw credit risk data for pre-loan modeling: loads and formats the data; filters out abnormal periods and variables with high missing rates, low IV, high PSI, or high correlation; denoises with Null Importance; and generates reports.
From awesome-copilot

npx claudepluginhub ctr26/dotfiles --plugin awesome-copilot

This skill uses the workspace's default tool permissions.
references/analysis.py, references/func.py, scripts/example.py
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
| Function | Purpose | Module |
|---|---|---|
| `get_dataset()` | Load and format data | `references.func` |
| `org_analysis()` | Organization sample analysis | `references.func` |
| `missing_check()` | Calculate missing rate | `references.func` |
| `drop_abnormal_ym()` | Filter abnormal months | `references.analysis` |
| `drop_highmiss_features()` | Drop high missing rate features | `references.analysis` |
| `drop_lowiv_features()` | Drop low IV features | `references.analysis` |
| `drop_highpsi_features()` | Drop high PSI features | `references.analysis` |
| `drop_highnoise_features()` | Null Importance denoising | `references.analysis` |
| `drop_highcorr_features()` | Drop high correlation features | `references.analysis` |
| `iv_distribution_by_org()` | IV distribution statistics | `references.analysis` |
| `psi_distribution_by_org()` | PSI distribution statistics | `references.analysis` |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | `references.analysis` |
| `export_cleaning_report()` | Export cleaning report | `references.analysis` |
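
To make the missing-rate, IV, and PSI filters in the table more concrete, here is a minimal sketch of the quantities they rely on, written with the standard library only. It is not the code in `references.analysis`: the names `missing_rate`, `information_value`, `population_stability_index`, and `keep_feature` are hypothetical, the bin edges are supplied by the caller, the `EPS` smoothing is an assumption, and the direction of each threshold comparison is assumed from the parameter descriptions. Only the overall checks are shown; the per-organization and per-month checks are left out.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed smoothing so empty bins do not produce log(0)


def _bin_index(value, edges):
    """Index of the bin holding value; missing values (None) get their own bin."""
    if value is None:
        return -1
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def missing_rate(values):
    """Share of missing (None) entries in a feature."""
    return sum(v is None for v in values) / len(values)


def information_value(values, labels, edges):
    """IV = sum over bins of (good% - bad%) * ln(good% / bad%); label 1 means bad."""
    good, bad = Counter(), Counter()
    for v, y in zip(values, labels):
        (bad if y == 1 else good)[_bin_index(v, edges)] += 1
    n_good, n_bad = sum(good.values()), sum(bad.values())
    if n_good == 0 or n_bad == 0:
        raise ValueError("IV needs at least one good and one bad sample")
    iv = 0.0
    for b in set(good) | set(bad):
        pg = max(good[b] / n_good, EPS)
        pb = max(bad[b] / n_bad, EPS)
        iv += (pg - pb) * math.log(pg / pb)
    return iv


def population_stability_index(expected, actual, edges):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    exp = Counter(_bin_index(v, edges) for v in expected)
    act = Counter(_bin_index(v, edges) for v in actual)
    psi = 0.0
    for b in set(exp) | set(act):
        pe = max(exp[b] / len(expected), EPS)
        pa = max(act[b] / len(actual), EPS)
        psi += (pa - pe) * math.log(pa / pe)
    return psi


def keep_feature(train_values, train_labels, recent_values, edges,
                 missing_ratio=0.6, overall_iv_threshold=0.1, psi_threshold=0.1):
    """Apply the three overall checks in order; True means the feature survives."""
    if missing_rate(train_values) > missing_ratio:
        return False
    if information_value(train_values, train_labels, edges) < overall_iv_threshold:
        return False
    return population_stability_index(train_values, recent_values, edges) <= psi_threshold


# Toy data: one feature split at 2, with labels where 1 marks a bad sample.
train = [1, 1, 1, 3, 3, 3]
labels = [0, 0, 1, 0, 1, 1]
print(keep_feature(train, labels, recent_values=[1, 1, 1, 3, 3, 3], edges=[2]))
```

The defaults in `keep_feature` mirror the documented `missing_ratio`, `overall_iv_threshold`, and `psi_threshold` values. Comparing a training window against a more recent window is one common way to compute PSI; the pipeline's own choice of windows may differ.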
- `DATA_PATH`: Data file path (Parquet format is preferred)
- `DATE_COL`: Date column name
- `Y_COL`: Label column name
- `ORG_COL`: Organization column name
- `KEY_COLS`: Primary key column name list
- `OOS_ORGS`: Out-of-sample organization list
- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)
- `min_ym_sample`: Minimum total sample count per month (default 500)
- `missing_ratio`: Overall missing rate threshold (default 0.6)
- `overall_iv_threshold`: Overall IV threshold (default 0.1)
- `org_iv_threshold`: Single organization IV threshold (default 0.1)
- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)
- `psi_threshold`: PSI threshold (default 0.1)
- `max_months_ratio`: Maximum unstable month ratio (default 1/3)
- `max_orgs`: Maximum unstable organization count (default 6)
- `n_estimators`: Number of trees (default 100)
- `max_depth`: Maximum tree depth (default 5)
- `gain_threshold`: Gain difference threshold (default 50)
- `max_corr`: Correlation threshold (default 0.9)
- `top_n_keep`: Keep top N features by original gain ranking (default 20)

The generated Excel report contains the following sheets: