From claudecode-research-harness-workflow
Generates and executes reproducible data cleaning, harmonization, reshaping, and merging scripts from a plan, preserving raw data.
npx claudepluginhub maxwell2732/claudecode-research-harness-workflow --plugin claudecode-research-harness-workflow
How this skill is triggered: by the user, by Claude, or both
Slash command
/claudecode-research-harness-workflow:research-harness-clean [--plan PATH] [--dry-run] [--task TASK-ID]
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Generate and run reproducible data cleaning, harmonization, reshaping, and merging scripts.
Generate and run reproducible data cleaning, harmonization, reshaping, and merging scripts. Raw data is never modified. Every cleaning decision is documented in the cleaning report.
This skill runs after /research-harness-audit and before /research-harness-plan.
| Input | Action |
|---|---|
| /research-harness-clean | Run all cleaning tasks in reports/data_cleaning_plan.md |
| /research-harness-clean --task 3 | Run only cleaning task 3 from the plan |
| /research-harness-clean --dry-run | Write scripts but do not execute them; report what would run |
| /research-harness-clean --plan PATH | Use a cleaning plan at a custom path |
Before writing any script:
- Confirm that reports/data_audit_report.md exists. If it does not exist, stop: the audit must run first.
- Confirm that reports/data_cleaning_plan.md exists. If it does not exist, stop. Tell the user to fill in templates/data_cleaning_plan.md and save it as reports/data_cleaning_plan.md.
- Confirm that the raw data is in data/raw/.
- Confirm that every output path is under data/processed/ or data/intermediate/, not data/raw/.

If any pre-flight check fails: report which check failed and stop. Do not write scripts around the failure.
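The pre-flight checks could be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the skill's own implementation: `preflight()` and its arguments are hypothetical names, and the checks on data/raw/ and on output paths reflect one reading of the list above.

```r
# Pre-flight sketch. `preflight()` is a hypothetical helper, not part of the skill.
# It returns a character vector of failures; an empty vector means all checks passed.
preflight <- function(root = ".", output_paths = character()) {
  failures <- character()

  if (!file.exists(file.path(root, "reports", "data_audit_report.md"))) {
    failures <- c(failures, "reports/data_audit_report.md is missing: run the audit first")
  }
  if (!file.exists(file.path(root, "reports", "data_cleaning_plan.md"))) {
    failures <- c(failures, "reports/data_cleaning_plan.md is missing: fill in templates/data_cleaning_plan.md")
  }
  if (!dir.exists(file.path(root, "data", "raw"))) {
    failures <- c(failures, "data/raw/ is missing")
  }

  # Outputs may only target data/processed/ or data/intermediate/ (assumed
  # to be given as paths relative to the project root).
  allowed <- grepl("^data/(processed|intermediate)/", output_paths)
  if (any(!allowed)) {
    failures <- c(failures, paste("output path not allowed:", output_paths[!allowed]))
  }
  failures
}

failures <- preflight(output_paths = "data/processed/analysis_ready.csv")
if (length(failures) > 0) {
  # Report which checks failed; the caller should stop here rather than continue.
  cat("Pre-flight failed:\n", paste0("- ", failures, "\n"), sep = "")
} else {
  cat("Pre-flight passed\n")
}
```

Returning the list of failures, rather than stopping at the first one, lets the report name every failed check at once.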
Read reports/data_cleaning_plan.md in full.
Identify:
If any merge task has unknown merge keys or unknown expected match rate: stop before writing any merge script. Report the ambiguity and ask the user for clarification.
Write a single cleaning script (or one script per major task if the cleaning plan is large).
Script requirements:
R skeleton:
# Project: <study name>
# Task: data cleaning
# Date: YYYY-MM-DD
# Analyst: Claude Code
library(here)
library(dplyr)
log_file <- here("logs", "clean_YYYYMMDD.log")
sink(log_file, append = FALSE, split = TRUE)
cat("=== Data Cleaning Log ===\n")
cat("Started:", format(Sys.time()), "\n\n")
# --- Load ---
df <- read.csv(here("data", "raw", "filename.csv"))
cat("Loaded:", nrow(df), "rows\n")
# --- Filter ---
n_before <- nrow(df)
df <- df |> filter(...)
cat("Dropped:", n_before - nrow(df), "rows —", "reason\n")
cat("Remaining:", nrow(df), "rows\n")
# --- Save ---
write.csv(df, here("data", "processed", "analysis_ready.csv"), row.names = FALSE)
cat("\nFinished:", format(Sys.time()), "\n")
sink()
Stata skeleton:
* Project: <study name>
* Task: data cleaning
* Date: YYYY-MM-DD
* Analyst: Claude Code
log using "logs/clean_YYYYMMDD.log", replace text
use "data/raw/filename.dta", clear
display "Loaded: `c(N)' rows"
* Filter
count
keep if ...
display "Remaining after filter: `c(N)'"
save "data/processed/analysis_ready.dta", replace
log close
Implement each task in the cleaning plan in order:
Renaming: rename variables exactly as specified in the plan. Do not rename variables not in the plan.
Type conversion: convert types as specified. Log any values that cannot be converted.
Date parsing: parse dates to ISO 8601 (YYYY-MM-DD). Log any unparseable dates.
Missing-value coding: recode non-standard missing codes to NA (R) or . (Stata). Log the count of values recoded.
Duplicate check: identify and handle duplicates per the plan (drop first / drop last / flag / stop). Log duplicates found.
ID consistency: verify IDs are consistent per the plan. Log any inconsistencies.
Unit harmonization: apply conversion factors as specified. Do not guess conversion factors not in the plan.
Winsorization / outlier flags: apply only if explicitly specified in the plan. Log bounds and count of affected observations.
Reshaping: reshape as specified. Log the row count before and after.
Derived variables: generate each derived variable per the formula in the plan. Log the count of non-missing values produced.
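As an illustration of the "apply, then log" pattern that runs through the tasks above, here is a minimal base R sketch covering three of them. The data frame, the column names, the -99 missing code, and the "keep the first row per id" rule are all assumptions made up for the example; in a real run each would come from the cleaning plan.

```r
# Hypothetical example data; column names and the -99 missing code are
# illustrative assumptions, not values from any real cleaning plan.
df <- data.frame(
  id     = c(1, 2, 2, 3),
  income = c(52000, -99, -99, 61000),
  visit  = c("2021-03-05", "2021-13-40", "2021-13-40", "2021-07-19"),
  stringsAsFactors = FALSE
)

# Missing-value coding: recode the non-standard code to NA and log the count.
is_code <- !is.na(df$income) & df$income == -99
df$income[is_code] <- NA
cat("Recoded to NA (income):", sum(is_code), "\n")

# Date parsing: as.Date() with an explicit format yields NA for values it
# cannot parse, so those can be counted and logged before overwriting.
parsed <- as.Date(df$visit, format = "%Y-%m-%d")
bad_dates <- is.na(parsed) & !is.na(df$visit)
cat("Unparseable dates (visit):", sum(bad_dates), "\n")
df$visit <- parsed

# Duplicate check: the assumed plan rule here is "keep the first row per id".
dup <- duplicated(df$id)
cat("Duplicate ids found:", sum(dup), "\n")
df <- df[!dup, , drop = FALSE]
cat("Remaining:", nrow(df), "rows\n")
```

Each step computes the affected rows first and only then changes the data, so the logged count always describes exactly what was modified.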
For each merge task in the cleaning plan, in execution order:
Before merging:
Merge:
After merging:
Write merge_report.md entry:
For every merge, copy the templates/merge_report.md block and fill in all fields. Append to reports/merge_report.md.
If merge keys are ambiguous or missing from the plan: stop. Do not guess. Document the problem and ask for clarification.
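The before/merge/after sequence for a single many-to-one merge might look like the following base R sketch. The two data frames and the key `hh_id` are hypothetical, and the logged quantities (pre/post row counts and match rate) are a guess at the fields a merge report entry needs; the actual fields are defined by templates/merge_report.md.

```r
# Hypothetical inputs; `hh_id` stands in for whatever key the plan names.
persons    <- data.frame(hh_id = c(1, 1, 2, 3, 5), age = c(34, 36, 51, 29, 44))
households <- data.frame(hh_id = c(1, 2, 3, 4), region = c("N", "S", "S", "E"))

# Before merging: the key must be unique on the "one" side of a many-to-one merge.
stopifnot(!anyDuplicated(households$hh_id))
n_left  <- nrow(persons)
n_right <- nrow(households)

# Merge: left join that keeps every row of `persons`.
# The marker column records which left rows found a match.
households$in_right <- TRUE
merged <- merge(persons, households, by = "hh_id", all.x = TRUE)
merged$in_right <- !is.na(merged$in_right)

# After merging: a many-to-one left join must not change the left row count.
stopifnot(nrow(merged) == n_left)
match_rate <- mean(merged$in_right)
unmatched_right <- sum(!households$hh_id %in% persons$hh_id)

cat("Rows before (left/right):", n_left, "/", n_right, "\n")
cat("Rows after:", nrow(merged), "\n")
cat("Matched left rows:", sum(merged$in_right), "\n")
cat("Match rate:", round(match_rate, 3), "\n")
cat("Right rows with no match:", unmatched_right, "\n")
```

The two `stopifnot()` calls make the script fail loudly on a duplicated key or an unexpected row count, instead of silently producing a wrong merge.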
Run the cleaning script. Capture the exit code.
cc:blocked in analysis_plan.md. Do not fabricate output.
After the script runs:
- Confirm that data/raw/ was not modified (check that file sizes match or use git status data/raw/).

Copy templates/data_cleaning_report.md to reports/data_cleaning_report.md and fill in all sections.
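One way to capture the exit code and check data/raw/ in a single step is sketched below. It compares MD5 checksums taken before and after the run, which is a stricter substitute for the file-size or git status check mentioned above, not the method the skill prescribes. `snapshot_dir()` and `run_with_raw_guard()` are hypothetical helper names, and the script path in the usage comment is a placeholder.

```r
# Snapshot MD5 checksums for every file under a directory.
snapshot_dir <- function(dir) {
  files <- sort(list.files(dir, recursive = TRUE, full.names = TRUE))
  tools::md5sum(files)
}

# Run `run` (any function that executes the cleaning script and returns its
# exit code) and report whether `raw_dir` changed while it ran.
run_with_raw_guard <- function(run, raw_dir = file.path("data", "raw")) {
  before <- snapshot_dir(raw_dir)
  exit_code <- run()
  after <- snapshot_dir(raw_dir)
  list(exit_code = exit_code, raw_unmodified = identical(before, after))
}

# Usage sketch; "scripts/clean.R" is a placeholder path, not one from the skill.
# system2() returns the command's exit status when output is not captured.
# result <- run_with_raw_guard(function() system2("Rscript", "scripts/clean.R"))
# if (result$exit_code != 0 || !result$raw_unmodified) {
#   cat("Cleaning run failed verification\n")
# }
```

Checksums catch edits that leave the file size unchanged, at the cost of reading every raw file twice.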
All of these must be present:
- […] reports/merge_report.md)
- […] data/raw/
- logs/clean_YYYYMMDD.log exists
- data/processed/ output file exists
- reports/data_cleaning_report.md exists with all sections populated
- reports/merge_report.md exists with one entry per merge (if any merges occurred)
- No files in data/raw/ were modified
- Every merge has a merge_report.md entry with pre/post counts and match rates
- reports/data_cleaning_report.md verification section: PASS
- data/raw/ unmodified

Tell the user:
Cleaning complete. Review reports/data_cleaning_report.md and reports/merge_report.md. If any merge had unresolved problems: resolve them before continuing.
When satisfied with the cleaned data, run /research-harness-plan to generate the executable analysis plan.