Help us improve
Share bugs, ideas, or general feedback.
Writes clear, step-by-step instructions for cleaning messy datasets, specifying standardisation, correction, and removal steps for analysis readiness.
npx claudepluginhub ur-grue/autopunk-media-skills --plugin autopunk-media-skillsHow this skill is triggered — by the user, by Claude, or both
Slash command
/autopunk-media-skills:data-cleaning-briefThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Writes clear, step-by-step instructions for cleaning a messy or inconsistent dataset — specifying exactly what needs to be standardised, corrected, or removed to make the data ready for analysis and publication.
Preprocessing workflow for journalistic data analysis emphasizing transparency, provenance, and human oversight. Use when: (1) Loading messy data files (Excel, CSV, JSON) into analysis-ready format, (2) Auditing data quality before analysis, (3) Cleaning data with full transformation documentation, (4) Preparing data for investigative journalism projects. Core principle: No silent transformations—every change is documented and approved.
Provides patterns for pre-model data preprocessing pipelines covering cleaning, validation, transformation, ETL orchestration. Use for raw data deduplication, schema validation, format conversion before EDA or modeling.
Cleans CSV/TSV/Excel files: normalizes headers, trims whitespace, removes duplicates, fixes row lengths, validates data using qsv tools.
Share bugs, ideas, or general feedback.
Writes clear, step-by-step instructions for cleaning a messy or inconsistent dataset — specifying exactly what needs to be standardised, corrected, or removed to make the data ready for analysis and publication.
Required: A description of the dataset and the problems you can see in it — specific examples of inconsistent values, formatting problems, missing data, or structural issues. Column names and a small sample of the messy rows. Optional: The intended analysis goal (what you will do with the data once it is clean); any cleaning decisions that have already been made; the tool the analyst will use (Excel, Python, R, SQL); the deadline.
A structured brief (300–500 words) with two sections: (1) Cleaning checklist — numbered steps in recommended order, each naming the column affected, describing the problem, and specifying the action to take. (2) Decisions requiring editorial sign-off — a separate list of any steps where reasonable people could disagree on the right approach, with the options and recommended default. Plain language throughout; minimal technical jargon, but tool-specific syntax can be included in parentheses where helpful.
Dataset: Procurement contracts database, 48,000 rows, obtained via FOI. Problems I can see in the data:
contractor_name field has inconsistent capitalisation: "BRIDGEPOINT CONSTRUCTION", "Bridgepoint Construction", "bridgepoint construction ltd" all appear to be the same companycontract_value field mixes formats: some entries are "€4,200,000", others are "4200000", others are "4.2M"award_date field has multiple date formats: "15/03/2024", "2024-03-15", "March 15 2024"contract_value is blankcontract_category field uses two different classification systems mixed together: some rows use a national category code, others use an EU CPV code
Goal: Analysis of corporate concentration — which companies are winning the most by value.Cleaning Checklist
Complete steps in this order. Do not change column names unless specified.
Remove duplicate rows (contract_id column)
Problem: 87 rows share identical values in contract_id, awarding_authority, contractor_name, and award_date.
Action: Deduplicate keeping one row per unique contract_id. Where contract_id is the same but any other field differs, flag for manual review rather than automatically deleting.
Expected result: Row count reduces from ~48,000 to approximately 47,913.
Standardise award_date to ISO 8601 (award_date column)
Problem: Three date formats in use: DD/MM/YYYY, YYYY-MM-DD, "Month DD YYYY".
Action: Convert all values to YYYY-MM-DD. Assume DD/MM/YYYY for ambiguous dates (e.g. "01/03/2024" = 1 March, not 3 January). Flag any dates where day/month are both ≤ 12 and format is ambiguous.
Standardise contract_value to numeric (contract_value column)
Problem: Values in three formats: "€4,200,000", "4200000", "4.2M".
Action: Strip currency symbols and commas. Convert "M" suffix notation to full number (4.2M → 4200000). Store as integer. Do not add decimal places.
Handle blank contract_value rows
Problem: 340 rows have no contract value.
Action: Do not impute or estimate. Leave as null/blank. Document the count (340 rows) and percentage (~0.7%) in the methodology note. Do not exclude these rows from the dataset — they may still be useful for non-value analyses.
Standardise contractor_name for grouping (contractor_name column)
Problem: Same company appears in multiple capitalisation variants.
Action: Convert all values to title case as a first pass (e.g. "BRIDGEPOINT CONSTRUCTION" → "Bridgepoint Construction"). Then run a deduplication check for near-matches (Levenshtein distance ≤ 2) and produce a review list of probable duplicates for manual confirmation. Do not automatically merge names — present the list for journalist sign-off.
Flag contract_category classification system (contract_category column)
Problem: National category codes and EU CPV codes are mixed in a single column with no flag indicating which system applies.
Action: Do not clean or merge these yet. Add a new column category_system with values "national" or "cpv" based on the pattern of the code (CPV codes follow a numeric XXXXXXXX format; national codes may differ). Preserve the original column unchanged.
Note: Analysis using this column will need to treat the two systems separately unless a mapping table is available.
Decisions Requiring Editorial Sign-Off