From journalism-tools
Preprocessing workflow for journalistic data analysis emphasizing transparency, provenance, and human oversight. Use when: (1) Loading messy data files (Excel, CSV, JSON) into analysis-ready format, (2) Auditing data quality before analysis, (3) Cleaning data with full transformation documentation, (4) Preparing data for investigative journalism projects. Core principle: No silent transformations—every change is documented and approved.
npx claudepluginhub nhagar/claude-plugins-journalism --plugin journalism-tools
How this skill is triggered: by the user, by Claude, or both
Slash command
/journalism-tools:structured-data-preprocessing-journalism
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Preprocessing data for journalism requires higher standards than typical data science: every transformation must be traceable, every decision documented, and the human must approve substantive changes.
Analyze preprocessed data for investigative journalism with full transparency. Use when a journalist has clean, preprocessed data ready for analysis and needs to identify patterns, anomalies, relationships, or statistical findings that support a story. Triggers include requests to analyze data, find patterns, identify outliers, cross-reference records, calculate statistics, or answer specific investigative questions. Complements the structured-data-preprocessing skill. Emphasizes simple, legible analyses over complex methods—every finding must be explainable to editors and defensible under scrutiny.
Writes clear, step-by-step instructions for cleaning messy datasets, specifying standardisation, correction, and removal steps for analysis readiness.
Validates CSV/TSV/Excel files and data analyses for quality, completeness, uniqueness, accuracy, consistency, outliers, and bias using qsv stats and frequency tools.
Preprocessing data for journalism requires higher standards than typical data science: every transformation must be traceable, every decision documented, and the human must approve substantive changes.
1. LOAD → Ingest data, establish provenance columns
2. AUDIT → Systematically examine every column for issues
3. REPORT → Present findings, proposed fixes, questions to user
4. TRANSFORM → After approval, execute documented transformations
5. VALIDATE → Confirm transformations, output final dataset + audit trail
Before loading, clarify:
Always add these columns to loaded data:
'_source_file' # Original filename
'_source_sheet' # Sheet name (if Excel) or 'csv'
'_source_row' # 1-indexed row number in original file
'_load_timestamp' # When this record was loaded
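As a rough illustration of the provenance columns listed above, the following sketch attaches them while reading a CSV using only the Python standard library. The function name `load_csv_with_provenance` and the sample file name `payments.csv` are hypothetical, and the sketch assumes `_source_row` counts data rows from 1 without counting the header line; adjust this if the project counts physical lines instead.

```python
import csv
import io
from datetime import datetime, timezone

def load_csv_with_provenance(handle, source_file):
    """Read CSV rows as dicts and attach the four provenance columns.

    Hypothetical helper: one possible way to establish provenance at load time.
    """
    # One timestamp per load, so every record from the same run matches.
    loaded_at = datetime.now(timezone.utc).isoformat()
    rows = []
    # Assumption: _source_row is 1-indexed over data rows, header excluded.
    for row_number, record in enumerate(csv.DictReader(handle), start=1):
        record["_source_file"] = source_file
        record["_source_sheet"] = "csv"  # would be the sheet name for Excel input
        record["_source_row"] = row_number
        record["_load_timestamp"] = loaded_at
        rows.append(record)
    return rows

# Illustrative in-memory data standing in for a real file.
sample = io.StringIO("name,amount\nAcme Corp,100\nACME CORP.,250\n")
rows = load_csv_with_provenance(sample, "payments.csv")
print(rows[0])
```

Values are kept as raw strings on purpose, so that no type coercion happens silently before the audit step.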
Systematically examine every column.
| Category | What to Check |
|---|---|
| Type | Is inferred type correct? Mixed types? |
| Missing | How many nulls? Pattern to missingness? |
| Cardinality | Unique values vs total rows |
| Distribution | Outliers? Impossible values? |
| Text quality | Encoding issues? Entity variations? Typos? |
| Dates | Consistent format? Future or distant past dates? |
| Numeric | Scale consistent? Negative where unexpected? |
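A few of the checks in the table (type, missing, cardinality) could be sketched as below. This is a minimal example, assuming values arrive as raw strings (as they would from a CSV reader); `infer_type` and `audit_column` are hypothetical names, and the type rules are deliberately crude. Distribution, date, and text-quality checks are not covered here.

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string as 'missing', 'int', 'float', or 'text'."""
    text = value.strip() if value is not None else ""
    if text == "":
        return "missing"
    # Try the narrowest numeric type first, then fall back to text.
    for label, cast in (("int", int), ("float", float)):
        try:
            cast(text)
            return label
        except ValueError:
            continue
    return "text"

def audit_column(rows, column):
    """Summarize missingness, cardinality, and inferred types for one column."""
    values = [row.get(column) for row in rows]
    type_counts = Counter(infer_type(v) for v in values)
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    observed_types = [t for t in type_counts if t != "missing"]
    return {
        "column": column,
        "total_rows": len(values),
        "missing": type_counts.get("missing", 0),
        "unique_values": len(set(present)),
        "type_counts": dict(type_counts),
        "mixed_types": len(observed_types) > 1,
    }

# Illustrative rows only.
rows = [
    {"amount": "100"},
    {"amount": "250.5"},
    {"amount": "n/a"},
    {"amount": ""},
]
report = audit_column(rows, "amount")
print(report)
```

Note that placeholder strings such as `n/a` are counted as text rather than missing; whether to treat them as nulls is a decision for the human reviewer, not the audit code.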
Generate a report for human review. See references/report-template.md for format.
# Data Quality Report: [Dataset Name]
## Summary
- Total rows / columns / columns with issues
## Critical Issues (Require Decision)
[Issues that could affect analysis validity]
## Warnings (Review Recommended)
[Issues that may or may not need fixing]
## Proposed Transformations
[Each transformation with rationale]
## Questions for Human Review
[Decisions that require domain knowledge]
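The template above could be filled in programmatically along these lines. This is only a sketch: `render_report` is a hypothetical helper, the findings passed to it are invented for illustration, and the full format in references/report-template.md may differ.

```python
def render_report(dataset_name, summary, sections):
    """Render a markdown data quality report from pre-collected findings.

    summary: dict of label -> value for the Summary section.
    sections: dict of heading -> list of findings, in display order.
    """
    lines = [f"# Data Quality Report: {dataset_name}", "## Summary"]
    lines += [f"- {label}: {value}" for label, value in summary.items()]
    for heading, items in sections.items():
        lines.append(f"## {heading}")
        # Say so explicitly when a section is empty, rather than omitting it.
        lines += [f"- {item}" for item in items] or ["- None identified"]
    return "\n".join(lines)

# Hypothetical findings, for illustration only.
report = render_report(
    "payments.csv",
    {"Total rows": 2, "Total columns": 2, "Columns with issues": 1},
    {
        "Critical Issues (Require Decision)": ["amount: mixed numeric and text values"],
        "Warnings (Review Recommended)": [],
        "Proposed Transformations": ["Trim whitespace in name"],
        "Questions for Human Review": ["Are 'Acme Corp' and 'ACME CORP.' the same entity?"],
    },
)
print(report)
```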
After human approval, execute transformations with full documentation.
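One way to make the approval gate and the documentation requirement concrete is sketched below. The helper `apply_transformation`, its `approved` flag, and the log field names are assumptions made for illustration; the sketch assumes each row already carries the `_source_file` and `_source_row` provenance columns.

```python
def apply_transformation(rows, column, transform, rule, approved, log):
    """Apply transform to one column, recording every changed value in log."""
    # Refuse to run anything the human has not signed off on.
    if not approved:
        raise PermissionError(f"Transformation '{rule}' has not been approved")
    for row in rows:
        old = row[column]
        new = transform(old)
        if new != old:
            # Log only real changes, keyed back to the original file and row.
            log.append({
                "source_file": row["_source_file"],
                "source_row": row["_source_row"],
                "column": column,
                "old_value": old,
                "new_value": new,
                "rule": rule,
            })
            row[column] = new
    return rows

# Illustrative rows only.
rows = [
    {"name": " Acme Corp ", "_source_file": "payments.csv", "_source_row": 1},
    {"name": "Beta LLC", "_source_file": "payments.csv", "_source_row": 2},
]
log = []
apply_transformation(rows, "name", str.strip, "trim_whitespace", approved=True, log=log)
print(log)
```

The accumulated `log` entries could then be written out with `csv.DictWriter` so that every change can be traced back to a source row.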
- cleaned_[name].csv - Provenance columns preserved
- transformation_log.csv - Every change documented
- data_audit_report.md - Issues, decisions, resolutions
- entity_mapping_[column].csv - If standardization applied

| Decision Type | Artifact to Generate |
|---|---|
| Entity variations | Frequency table + proposed mapping |
| Outliers | Distribution summary + flagged values |
| Missing data | Missingness by column summary |
| Duplicates | Sample duplicate groups |
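For the entity variations row, the frequency table and proposed mapping might be produced as follows. This is a sketch under stated assumptions: `normalize` and `propose_entity_mapping` are hypothetical helpers, the matching key (lowercase, punctuation removed, whitespace collapsed) is a crude stand-in for whatever matching the project actually needs, and the output is a proposal for human review, not something to apply automatically.

```python
import string
from collections import Counter, defaultdict

def normalize(name):
    """Crude matching key: lowercase, drop punctuation, collapse whitespace."""
    stripped = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

def propose_entity_mapping(values):
    """Return (frequency table, proposed variant -> canonical mapping)."""
    frequency = Counter(values)
    groups = defaultdict(list)
    for variant in frequency:
        groups[normalize(variant)].append(variant)
    mapping = {}
    for variants in groups.values():
        if len(variants) < 2:
            continue  # no variation to resolve
        # Propose the most frequent spelling; ties go to the first one seen.
        canonical = max(variants, key=lambda v: frequency[v])
        for variant in variants:
            if variant != canonical:
                mapping[variant] = canonical
    return frequency, mapping

# Illustrative values only.
values = ["Acme Corp", "ACME CORP.", "Acme Corp", "acme corp", "Beta LLC"]
frequency, mapping = propose_entity_mapping(values)
print(frequency.most_common())
print(mapping)
```

Showing the frequency counts alongside the mapping lets the reviewer see how much data each proposed merge would affect before approving it.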
references/report-template.md - Full report template with examples