From medsci-presentation
Generates a citable data dictionary/codebook from tabular datasets (CSV/TSV/Excel/Parquet/Stata/SAS). Profiles each variable's role, type, missingness, and distributions, flagging coded values as [NEEDS DICTIONARY].
npx claudepluginhub aperivue/medsci-skills --plugin medsci-literature
Slash command
/medsci-presentation:generate-codebook
You help a medical researcher turn a raw tabular dataset into a structured,
citable data dictionary (codebook). This is the generator side of the
dictionary-first workflow: it produces the artifact that /define-variables and
dictionary-first QC later consume. You generate code and review output — you do
not invent the meaning of coded values.
A codebook describes what is in the data, not what the codes mean. Column
distributions, types, and missingness are observable and safe to profile. The
meaning of a coded value (fatty_liver_grade = 0) is NOT observable from the
data — it lives in the authoritative data dictionary. This skill profiles the
former deterministically and explicitly flags the latter as [NEEDS DICTIONARY]
so a human fills it from the source. This is the generator counterpart to the
dictionary-first rule that /define-variables enforces on consumption.
${CLAUDE_SKILL_DIR}/references/codebook_schema.md: the
codebook.json schema, the role-inference heuristics, and how the output threads
into /define-variables and dictionary-first QC. Read this before interpreting output.
Run the bundled profiler rather than describing columns from memory:
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" data.csv --out-dir .
Supports .csv/.tsv/.xlsx/.parquet/.dta/.sas7bdat. Flags: --max-levels N
(categorical cutoff, default 20), --json-only, --md-only. The script is
pandas-only, runs locally, and never sends data anywhere.
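As an illustration of how a pandas-only script could cover the listed extensions, the sketch below dispatches on the file suffix. The `READERS` table and `load_table` helper are hypothetical names, not taken from `generate_codebook.py`, and the bundled script may load files differently. Note that `pd.read_excel` and `pd.read_parquet` rely on optional engines (for example openpyxl and pyarrow) being installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical extension-to-reader table; the bundled script may differ.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
    ".dta": pd.read_stata,
    ".sas7bdat": pd.read_sas,
}


def load_table(path: str) -> pd.DataFrame:
    """Load a tabular file locally, choosing the reader from its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix or path}")
    return READERS[suffix](path)
```

Everything here reads from the local filesystem only, which matches the statement that no data leaves the machine.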
Run generate_codebook.py on the dataset. It writes codebook.json (machine-
readable) and codebook.md (review table), reporting per variable: role
(id / continuous / categorical / binary / date / text), dtype, missingness,
unique count, level frequencies or quantile summary, and a needs_dictionary flag.
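The per-variable report described above can be approximated in a few lines of pandas. This is a minimal sketch under assumed heuristics: `infer_role` and `profile_column` are hypothetical helpers, the key names are illustrative, and the real rules live in `codebook_schema.md` and the bundled script.

```python
import pandas as pd


def infer_role(s: pd.Series, max_levels: int = 20) -> str:
    """Guess a role from observable properties only (illustrative rules)."""
    values = s.dropna()
    n_unique = values.nunique()
    if pd.api.types.is_datetime64_any_dtype(s):
        return "date"
    if n_unique == 2:
        return "binary"
    if pd.api.types.is_numeric_dtype(s):
        # All-distinct integers look like identifiers.
        if pd.api.types.is_integer_dtype(s) and n_unique == len(values):
            return "id"
        # Few distinct numbers look like codes, many like a measurement.
        return "categorical" if n_unique <= max_levels else "continuous"
    return "categorical" if n_unique <= max_levels else "text"


def profile_column(s: pd.Series, max_levels: int = 20) -> dict:
    """Summarize one column: role, dtype, missingness, and distribution."""
    role = infer_role(s, max_levels)
    entry = {
        "role": role,
        "dtype": str(s.dtype),
        "missing_pct": round(float(s.isna().mean()) * 100, 1),
        "n_unique": int(s.nunique()),
    }
    if role == "continuous":
        q = s.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
        entry["quantiles"] = {str(k): float(v) for k, v in q.items()}
    elif role in ("categorical", "binary"):
        counts = s.value_counts()
        entry["levels"] = {str(k): int(v) for k, v in counts.items()}
    return entry
```

Heuristics like these misfire on small samples (a short column of distinct integer ages would be read as an id), which is why the inferred roles go through a user approval gate.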
Present codebook.md and walk the user through it. Gate: the user confirms
the inferred roles (e.g., an integer-coded scale mis-read as continuous, or an id
column). Do not proceed to definition work until the user approves the role
assignments.
For every variable flagged needs_dictionary: true, the level codes are
uninterpretable without the authoritative source. Gate: ask the user to
supply the meaning of each code from the real data dictionary (file/sheet/row),
or to confirm none exists. Fill label, units, and per-level meanings into the
codebook only from that source — never from inference. If the user cannot
supply it, leave the [NEEDS DICTIONARY] marker in place; do not erase it.
The completed codebook.json becomes the input dictionary for /define-variables
(operationalization) and the citation source for dictionary-first QC. Gate:
confirm with the user that no needs_dictionary flags remain unresolved before
the codebook is treated as authoritative for downstream analysis.
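The final gate can be backed by a mechanical check before asking the user to confirm. The sketch below assumes a hypothetical layout in which codebook.json has a top-level `variables` list whose entries carry `name` and `needs_dictionary` keys; the actual schema is defined in `codebook_schema.md` and may differ.

```python
import json
from pathlib import Path


def load_codebook(path: str) -> dict:
    """Read a codebook JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def unresolved_variables(codebook: dict) -> list[str]:
    """Return names of variables still flagged as needing the dictionary.

    Assumes a hypothetical schema: {"variables": [{"name": ..., "needs_dictionary": ...}]}.
    """
    return [
        v["name"]
        for v in codebook.get("variables", [])
        if v.get("needs_dictionary")
    ]


# Example with an in-memory codebook in the assumed layout.
example = {
    "variables": [
        {"name": "age", "needs_dictionary": False},
        {"name": "sex", "needs_dictionary": True},
    ]
}
pending = unresolved_variables(example)
if pending:
    print("Unresolved dictionary lookups:", ", ".join(pending))
```

A non-empty result means the codebook should not yet be treated as authoritative.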
.dta), SAS (.sas7bdat).
[NEEDS DICTIONARY]).
/clean-data.
/deidentify before sharing.
/define-variables (this skill feeds it).
codebook.json as its data dictionary input.
codebook.json (schema in references) and codebook.md (review table with a
"Columns requiring dictionary lookup" section). Summarize the counts
(rows, columns, needs_dictionary_count) in chat; do not paste the full JSON.
Input cohort.csv:
patient_id,age,sex,fatty_liver_grade,smoking_status,visit_date
1001,54,1,0,never,2023-01-15
1002,61,2,2,former,2023-02-03
Run:
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" cohort.csv --out-dir .
# -> {"n_rows": ..., "n_columns": 6, "needs_dictionary_count": 2, "outputs": [...]}
codebook.md (excerpt):
| Variable | Role | Missing % | Unique | Needs dictionary |
|---|---|---|---|---|
| `patient_id` | id | 0.0 | N | |
| `age` | continuous | 0.0 | ... | |
| `sex` | binary | 0.0 | 2 | ⚠️ YES |
| `fatty_liver_grade` | categorical | 0.0 | 5 | ⚠️ YES |
| `smoking_status` | categorical | 0.0 | 3 | |
| `visit_date` | date | 0.0 | ... | |
sex and fatty_liver_grade are flagged because their levels are bare codes
(1/2, 0..4). smoking_status is not flagged — its levels are already
human-readable. The reviewer then:
Fills sex: 1 = male, 2 = female and fatty_liver_grade: 0 = none … 4 = suspected
into the codebook from that source (citing file > sheet > row).
Confirms no [NEEDS DICTIONARY] flags remain, then hands codebook.json to
/define-variables.
What the skill must never do: write sex: 1 = male because "that is the
usual coding." If the dictionary is unavailable, the flag stays.
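The distinction drawn in this example (bare codes such as 1/2 or 0..4 versus readable levels such as never/former) can be sketched as a simple check. `looks_like_bare_codes` is a hypothetical helper, not the bundled script's rule: it only treats a low-cardinality column as coded when every level parses as a number, and it deliberately makes no attempt to guess what the codes mean.

```python
import pandas as pd


def looks_like_bare_codes(s: pd.Series, max_levels: int = 20) -> bool:
    """True when a low-cardinality column's levels are all numeric codes."""
    values = s.dropna()
    if values.empty or values.nunique() > max_levels:
        return False
    as_text = values.astype(str).str.strip()
    # A level is "bare" if it parses as a number (e.g. 0, 1, 2.0).
    return bool(pd.to_numeric(as_text, errors="coerce").notna().all())


cohort = pd.DataFrame(
    {
        "sex": [1, 2, 1, 2],
        "fatty_liver_grade": [0, 2, 4, 1],
        "smoking_status": ["never", "former", "never", "current"],
    }
)
flags = {col: looks_like_bare_codes(cohort[col]) for col in cohort.columns}
```

On a small sample this also flags genuine measurements with few distinct values, so the result is a prompt for human review, never a substitute for it.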
[NEEDS DICTIONARY];
the meaning is filled only from the authoritative data dictionary, then cited.