From ai4ss-skills
Builds auditable social-science data pipelines: raw files to analysis samples with merge audits, missingness checks, and reproducible scripts.
Slash command: `/ai4ss-skills:research-data-builder`
Build research datasets as an auditable workflow, not as one-off data wrangling. The default output is a runnable pipeline plus row-count, merge, missingness, and provenance evidence that another researcher can inspect.
- agents/openai.yaml
- examples/invalid_merge_audit_no_review_path.csv
- examples/valid_merge_audit.csv
- examples/valid_sample_flow.csv
- examples/valid_variable_provenance.csv
- references/audit-schema.md
- references/pipeline.md
- references/prompt-pack.md
- references/quality-gates.md
- references/worked-example.md
- scripts/check_runtime_contract.py
- scripts/validate_data_audits.py
This skill answers: "Where did the data come from, and how did the sample change?" Its value is not cleaning data faster; it is making row loss, merge ambiguity, variable construction, and extraction uncertainty visible before any result is interpreted.
Never overwrite raw data. Read project instructions first, write scripts and derived data to explicit output paths, and make every sample-size change explainable.
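One way to make every sample-size change explainable is to record the row count after each step and write the record out as an audit file. The sketch below is a minimal illustration using only the Python standard library; the class name `SampleFlow`, the column names, and the toy rows are hypothetical, and the actual sample-flow schema is the one defined in references/audit-schema.md.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real schema lives in references/audit-schema.md.
FLOW_COLUMNS = ["step", "n_before", "n_after", "n_dropped", "reason"]


class SampleFlow:
    """Record the row count after each pipeline step so every drop has a stated reason."""

    def __init__(self):
        self.steps = []  # one dict per recorded step

    def record(self, step, rows, reason):
        """Store the row count after `step` and the rows lost since the previous step."""
        n_after = len(rows)
        # The first step has no predecessor, so nothing counts as dropped.
        n_before = self.steps[-1]["n_after"] if self.steps else n_after
        self.steps.append({
            "step": step,
            "n_before": n_before,
            "n_after": n_after,
            "n_dropped": n_before - n_after,
            "reason": reason,
        })
        return rows

    def write(self, path):
        """Write the recorded steps to a CSV file, creating parent folders if needed."""
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FLOW_COLUMNS)
            writer.writeheader()
            writer.writerows(self.steps)


# Toy rows standing in for a raw file; a real pipeline would read from the raw folder.
rows = [
    {"city": "A", "year": 2019, "gdp": 1.0},
    {"city": "A", "year": 2020, "gdp": None},
    {"city": "B", "year": 2020, "gdp": 2.5},
]
flow = SampleFlow()
rows = flow.record("load_raw", rows, "all raw rows")
rows = flow.record(
    "drop_missing_gdp",
    [r for r in rows if r["gdp"] is not None],
    "gdp missing",
)
for step in flow.steps:
    print(step)
```

Calling `flow.write("output/audit/sample_flow.csv")` at the end of a build script would save the record next to the other audit artifacts, so the evidence does not live only in a chat summary.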
- Path.exists(), Path.glob(), rg --files, or quoted shell patterns before reading them. Run scripts/check_runtime_contract.py for path/glob checks; do not let unquoted zsh globs decide whether data exists.
- config resolve correctly.

This skill realizes the Data strategy part of the MIDA spine. It makes sampling, source selection, measurement, extraction, linkage, transformation, missingness, and provenance inspectable before any answer strategy or claim depends on the data.
The skill must preserve upstream Model and Inquiry fields when present, but it does not choose the estimand, target quantity, or identification strategy.
When an upstream .aiss model exists, data artifacts must preserve ai4ss_model_path, relevant concept or bridge ids, and check status. Data work can repair the empirical bridge evidence, but it must not silently rewrite the model.
- study_design_brief.md, study_design_declaration.csv, research_model.aiss, route cards, raw data, source files, variable dictionaries, extraction rules, DDI metadata, or an analysis plan's data requirements.
- sample_flow.csv, merge_audit.csv, and variable_provenance.csv when applicable; for survey cleaning, ddi-metadata.yaml, cleaning_contract, clean data, cleaning script, and processing event audit when routed through ai4ss-skills.
- route_id, design_source, target_inquiry, data_source, unit_of_analysis, sample_restrictions, constructed_variables, known_data_gaps, ai4ss_model_path, model_id, concept_id, causal_id, bridge_id, ai4ss_check_status, validation_commands, next_skill_route.
- research-analysis-runner, methods-reviewer, academic-writing-scaffold, research-slides-builder, study-design-builder, or ask_author.

Use this skill for confirmed data and pipeline work. Do not use it to choose the research design; hand ambiguous design choices to study-design-builder. Do not use it to run the first analysis package unless the task is only data feasibility; hand analysis execution to research-analysis-runner. Do not use it to certify empirical identification or result claims; hand those checks to methods-reviewer. Do not use it to write paper text; hand verified audit artifacts to academic-writing-scaffold.
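The field names and skill names listed above can be checked mechanically before handing work to another skill. The sketch below assumes the field list is meant as one flat handoff record and the skill list as the allowed values of next_skill_route; that reading, the function name `handoff_problems`, the split into core and model fields, and the example values are all assumptions, not something the skill documents.

```python
# Field and route names are copied from the lists above. Treating them as one
# flat record, and treating every core field as required, is an assumption.
CORE_FIELDS = [
    "route_id", "design_source", "target_inquiry", "data_source",
    "unit_of_analysis", "sample_restrictions", "constructed_variables",
    "known_data_gaps", "validation_commands", "next_skill_route",
]
NEXT_ROUTES = {
    "research-analysis-runner", "methods-reviewer", "academic-writing-scaffold",
    "research-slides-builder", "study-design-builder", "ask_author",
}


def handoff_problems(record):
    """Return a list of problems found in a handoff record (empty if none)."""
    problems = [f"missing field: {name}" for name in CORE_FIELDS if name not in record]

    # When an upstream model is declared, the check status and at least one
    # concept or bridge id should travel with the data artifacts.
    if record.get("ai4ss_model_path"):
        if "ai4ss_check_status" not in record:
            problems.append("missing model field: ai4ss_check_status")
        if not (record.get("concept_id") or record.get("bridge_id")):
            problems.append("missing model field: concept_id or bridge_id")

    route = record.get("next_skill_route")
    if route is not None and route not in NEXT_ROUTES:
        problems.append(f"unknown next_skill_route: {route}")
    return problems


# Hypothetical record for a project without an upstream .aiss model.
record = {
    "route_id": "demo-001",
    "design_source": "study_design_brief.md",
    "target_inquiry": "hypothetical inquiry",
    "data_source": "hypothetical city-year source file",
    "unit_of_analysis": "city-year",
    "sample_restrictions": "hypothetical year window",
    "constructed_variables": ["policy_exposure"],
    "known_data_gaps": [],
    "validation_commands": ["scripts/validate_data_audits.py sample_flow <path>"],
    "next_skill_route": "research-analysis-runner",
}
print(handoff_problems(record))
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in a single pass.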
Step -1: Orient
-> Read AGENTS.md, README, docs/research_design.*, variable dictionaries, and the file tree.
-> Identify raw, interim, analysis, scripts, output, and log directories.
-> If boundaries conflict, stop and ask for the project source of truth.
Step 0: Classify the task
-> New analysis sample: follow references/pipeline.md.
-> Merge or matching repair: follow references/audit-schema.md before changing code.
-> Text-to-structure extraction: require source snippets, extraction rules, confidence flags, and manual-review outputs.
-> Survey/codebook cleaning: when `.dta`, `.sav`, codebook PDF/docx, or `ddi-metadata.yaml` is central, route through the `ai4ss-skills` DDI harness: `codebook-parse` -> `cleaning-contract` -> `cleaning-execute`.
-> Existing pipeline bug: inspect logs, data columns, and the smallest failing step before editing.
Step 1: Plan before edits
-> List files to read, files to modify, expected outputs, and validation checks.
-> Do not touch raw files, credentials, or confidential folders.
-> Run `scripts/check_runtime_contract.py --cwd <project> --path <input-or-quoted-glob> --data <input-data> --required-columns <cols> --key-columns <keys> --python-import <module> --r-package <pkg>` for the checks that match the pipeline step.
Step 2: Build in stages
-> Preserve raw -> interim -> analysis separation.
-> Add deterministic scripts under scripts/.
-> Put tables, figures, logs, and audits under output/ or docs/.
Step 3: Validate
-> Report row counts, unique IDs, year ranges, duplicates, missingness, merge rates, and constructed-variable rules.
-> Save audit artifacts, not only chat summaries.
-> Re-run the exact data-building command from a clean shell after code changes; do not validate stale derived files.
-> Update an AI-use ledger when AI-assisted extraction or transformation affects a manuscript, shared dataset, or teaching artifact.
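Several of the Step 3 checks (row counts, required columns, duplicate keys, missingness) can be sketched on a single CSV file. This is a minimal illustration using only the Python standard library, not the implementation of scripts/check_runtime_contract.py; the function name `audit_csv`, the report keys, the demo data, and the rule that an empty cell counts as missing are all assumptions.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def audit_csv(path, required_columns, key_columns):
    """Report row count, absent columns, duplicate keys, and per-column missingness."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"input not found: {path}")

    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        rows = list(reader)

    absent_keys = [k for k in key_columns if k not in columns]
    if absent_keys:
        raise KeyError(f"key columns not in file: {absent_keys}")

    # Count how often each key combination appears; more than once is a duplicate.
    key_counts = Counter(tuple(r[k] for k in key_columns) for r in rows)
    duplicate_keys = {k: n for k, n in key_counts.items() if n > 1}

    # Assumption: an empty or whitespace-only cell counts as missing.
    missing_share = {}
    for col in columns:
        n_missing = sum(1 for r in rows if (r[col] or "").strip() == "")
        missing_share[col] = n_missing / len(rows) if rows else 0.0

    return {
        "n_rows": len(rows),
        "missing_columns": [c for c in required_columns if c not in columns],
        "duplicate_keys": duplicate_keys,
        "missing_share": missing_share,
    }


# Demo on a throwaway file so the sketch never touches project data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "panel.csv"
    demo.write_text("city,year,gdp\nA,2019,1.0\nA,2020,\nA,2020,3.1\n", encoding="utf-8")
    report = audit_csv(demo, ["city", "year", "gdp", "pop"], ["city", "year"])
print(report)
```

In the demo, the report flags the absent `pop` column, the repeated `("A", "2020")` key, and one missing `gdp` value out of three rows. Projects that code missing values differently (for example `NA` or survey-specific codes) would need a different missingness rule.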
For a full pipeline, produce or update:
- scripts/10_build_panel.R or scripts/merge_panel.py.
- data/interim/ or data/analysis/.
- output/logs/<step>.log with command, timestamp, package versions when relevant, and success or failure.
- output/audit/sample_flow.csv or .md.
- output/audit/merge_audit.csv when any merge or match occurs.
- docs/changelog.md entry when files change.
- ai4ss-skills: ddi-metadata.yaml, the declared cleaning_contract, <stem>-cleaning.R, <stem>-clean.csv, and a processing event audit.
- scripts/check_runtime_contract.py --cwd <project> ... to check files/globs, Python imports, R packages, data schema, duplicate CSV keys, expected outputs, and output freshness. Quote shell globs.
- scripts/validate_data_audits.py sample_flow <path> to check sample-flow columns.
- scripts/validate_data_audits.py merge_audit <path> to check merge-audit columns.
- scripts/validate_data_audits.py variable_provenance <path> to check provenance columns.
- scripts/validate_ai4ss_model.py <path-to-research_model.aiss> when data artifacts depend on a declared AI4SS model.

| File | Content | Read when |
|---|---|---|
| pipeline.md | Stage-by-stage data pipeline pattern, file layout, and validation checks | Starting or reorganizing a data workflow |
| audit-schema.md | Schemas for sample flow, merge audit, variable provenance, and changelog entries | Designing outputs or reviewing whether evidence is sufficient |
| prompt-pack.md | Copy-ready prompts for project intake, merge repair, text extraction, and pipeline debugging | Turning a user request into an agent task |
| quality-gates.md | Stop/go checks, failure modes, and minimum evidence for each data stage | Deciding whether a pipeline output is trustworthy |
| worked-example.md | City-year policy panel example with inputs, outputs, logs, and audit artifacts | Teaching or demonstrating the skill |
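
A merge audit of the kind these references describe can be computed before the join is run, so that merge ambiguity is visible up front. The sketch below is a hedged illustration using only the Python standard library; the function name `merge_audit`, the report keys, and the toy rows are hypothetical and do not reproduce the schema in audit-schema.md.

```python
from collections import Counter


def merge_audit(left_rows, right_rows, keys):
    """Summarize a planned left join on `keys` without performing it."""

    def key_of(row):
        return tuple(row[k] for k in keys)

    left_counts = Counter(key_of(r) for r in left_rows)
    right_counts = Counter(key_of(r) for r in right_rows)

    # Left rows whose key appears at least once on the right.
    left_matched = sum(n for k, n in left_counts.items() if k in right_counts)
    left_unmatched = sum(n for k, n in left_counts.items() if k not in right_counts)
    right_unmatched = sum(n for k, n in right_counts.items() if k not in left_counts)

    # A key repeated on the right would multiply left rows in the join.
    duplicate_right_keys = sorted(k for k, n in right_counts.items() if n > 1)

    n_left = len(left_rows)
    return {
        "n_left": n_left,
        "n_right": len(right_rows),
        "left_matched": left_matched,
        "left_unmatched": left_unmatched,
        "right_unmatched": right_unmatched,
        "match_rate": left_matched / n_left if n_left else 0.0,
        "duplicate_right_keys": duplicate_right_keys,
    }


# Toy city-year rows; a real pipeline would load these from interim files.
left = [
    {"city": "A", "year": 2019},
    {"city": "A", "year": 2020},
    {"city": "B", "year": 2020},
]
right = [
    {"city": "A", "year": 2019, "policy": 1},
    {"city": "A", "year": 2020, "policy": 0},
    {"city": "A", "year": 2020, "policy": 1},  # duplicate key on the right side
    {"city": "C", "year": 2020, "policy": 1},
]
audit = merge_audit(left, right, ["city", "year"])
print(audit)
```

Here two of three left rows match, one right row has no partner, and the repeated `("A", 2020)` key on the right is flagged because it would silently add a row to the merged sample.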
npx claudepluginhub siyaozheng/ai4ss-skills --plugin ai4ss-skills