From medsci-presentation
Guides medical researchers in de-identifying clinical data before LLM analysis using a local Python CLI with regex-based PHI detection. Supports 10 country locales (kr, us, jp, cn, de, uk, fr, ca, au, in) and CSV/TSV/Excel input.
npx claudepluginhub aperivue/medsci-skills --plugin medsci-literature
Slash command: /medsci-presentation:deidentify
deidentify.py, locales/_template.json, locales/au.json, locales/ca.json, locales/cn.json, locales/de.json, locales/fr.json, locales/in.json, locales/jp.json, locales/kr.json, locales/uk.json, locales/us.json, references/date_shift_guide.md, references/hipaa_18_identifiers.md, references/korean_phi_patterns.md, skill.yml, tests/README.md, tests/test_clean.csv, tests/test_edge_cases.csv, tests/test_phi_korean.csv
De-identifies PHI via HIPAA safe harbor (removes 18 identifiers) and expert determination methods. Assesses re-identification risks, limited datasets, and data agreements.
Profiles and flags issues in clinical CSV/Excel data (missing values, outliers, duplicates, type mismatches) via a three-stage workflow with researcher approval gates. Does not auto-clean.
Guides PHI data handling per HIPAA: 18 identifiers, Safe Harbor/Expert Determination de-identification, minimum necessary principle, RBAC access controls, audit logging, encryption at rest/transit, secure disposal.
You are guiding a medical researcher through data de-identification. The actual de-identification is performed by a standalone Python script that runs WITHOUT any LLM. Your role is to explain, guide, and verify — not to see or process raw PHI data.
${CLAUDE_SKILL_DIR}/references/hipaa_18_identifiers.md: HIPAA Safe Harbor checklist
${CLAUDE_SKILL_DIR}/references/korean_phi_patterns.md: Korean-specific regex patterns
${CLAUDE_SKILL_DIR}/references/date_shift_guide.md: Date shifting best practices
Read relevant references before advising the researcher.
openpyxl (for .xlsx files): pip install openpyxl
Ask the researcher:
Based on answers, recommend the appropriate command:
python deidentify.py full <file> --locale <code>
python deidentify.py scan <file> --locale <code> first
Available locale codes: kr (Korea), us (USA), jp (Japan), cn (China), de (Germany), uk (United Kingdom), fr (France), ca (Canada), au (Australia), in (India).
If --locale is omitted, the script shows an interactive country selection menu.
Users can provide a custom locale file via --locale-file custom.json.
Guide the researcher to run the script. The script is located at:
${CLAUDE_SKILL_DIR}/deidentify.py
Full pipeline (recommended for most users):
python ${CLAUDE_SKILL_DIR}/deidentify.py full data.xlsx \
--locale kr \
--output-dir ./deidentified/ \
--auto-accept-safe
Step-by-step (for careful review):
# Step 1: Scan
python ${CLAUDE_SKILL_DIR}/deidentify.py scan data.xlsx --locale kr --output-dir ./deidentified/
# Step 2: Review (interactive)
python ${CLAUDE_SKILL_DIR}/deidentify.py review ./deidentified/scan_report.json
# Step 3: Apply
python ${CLAUDE_SKILL_DIR}/deidentify.py apply ./deidentified/reviewed_report.json
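The three steps above can be assembled programmatically, for example to print them for the researcher to copy. The following is a minimal sketch: `build_steps` and `run_steps` are hypothetical helpers, not part of deidentify.py, and the report file names are taken from the commands shown above. The review step is interactive, so `run_steps` leaves stdin and stdout attached to the terminal, and it assumes the script exits non-zero on failure.

```python
import shlex
import subprocess
from pathlib import Path


def build_steps(script, data_file, locale, output_dir):
    """Return argv lists for the scan, review, and apply steps, in order."""
    out = Path(output_dir)
    return [
        ["python", str(script), "scan", data_file,
         "--locale", locale, "--output-dir", str(out)],
        ["python", str(script), "review", str(out / "scan_report.json")],
        ["python", str(script), "apply", str(out / "reviewed_report.json")],
    ]


def run_steps(steps):
    """Run each step in the researcher's terminal; stop at the first failure.

    Assumption: the script signals failure with a non-zero exit code.
    """
    for argv in steps:
        print("$", shlex.join(argv))
        subprocess.run(argv, check=True)


# Hypothetical paths, for illustration only.
steps = build_steps("deidentify.py", "data.xlsx", "kr", "./deidentified/")
for argv in steps:
    print(shlex.join(argv))
```

Only the commands are printed here; calling `run_steps(steps)` is left to the researcher, since the raw file must be processed on their machine.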
Options:
--locale CODE: Country locale for PHI patterns (kr, us, jp, cn, de, uk, fr, ca, au, in)
--locale-file PATH: Custom locale JSON file (copy locales/_template.json to create one)
--auto-accept-safe: Skip confirmation for columns classified as SAFE (faster for large datasets)
--hash-mapping: Store SHA-256 hashes instead of original values in mapping file (one-way, more secure)
--output-dir: Where to save de-identified file, mapping, and audit log
-v/--verbose: Enable debug logging
The script's terminal review has three passes:
Coach the researcher. Deliver these prompts in the researcher's preferred language:
After the script completes, help the researcher verify:
Read the audit log (safe — contains only hashes):
cat ./deidentified/audit_log.csv | head -20
Verify the number of changes, affected columns, and PHI types.
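For a quick tally of the audit log, something like the sketch below may help. The column layout of audit_log.csv is not documented here, so the field name is passed in by the caller; `count_by` is a hypothetical helper, and names such as `column` or `phi_type` are assumptions to be replaced with whatever the header row actually contains.

```python
import csv
from collections import Counter


def count_by(audit_log_path, field):
    """Count audit log rows per distinct value of one header field."""
    with open(audit_log_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if field not in (reader.fieldnames or []):
            raise KeyError(
                f"{field!r} not in audit log header: {reader.fieldnames}"
            )
        # Counter consumes the reader while the file is still open.
        return Counter(row[field] for row in reader)
```

Usage, assuming the header has a field called `column`: `count_by("./deidentified/audit_log.csv", "column")` returns how many changes were logged per column, which can be compared against what the researcher expected.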
Spot-check the de-identified file (safe — PHI already removed): Read a few rows to confirm pseudonyms (P0001, etc.), date shifts, and [REDACTED] markers appear where expected.
Check that sensitive columns are actually removed: Verify no original names, phone numbers, or RRN values remain.
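The residual check can be partly automated with an independent scan of the de-identified CSV. This is a sketch under stated assumptions: the three patterns below are illustrative only and are not the patterns shipped in the locale packs, and `find_residual_phi` and `scan_csv` are hypothetical helpers. The function reports cell positions rather than cell contents, so its output stays free of PHI even when something was missed.

```python
import csv
import re

# Illustrative patterns only; NOT the patterns used by deidentify.py.
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "kr_rrn": re.compile(r"\b\d{6}-\d{7}\b"),             # 6 digits, hyphen, 7 digits
    "kr_mobile": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"),
}


def find_residual_phi(rows):
    """Return (row_index, column_index, label) for every suspicious cell."""
    hits = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            for label, pattern in RESIDUAL_PATTERNS.items():
                if pattern.search(str(cell)):
                    hits.append((r, c, label))
    return hits


def scan_csv(path):
    """Scan a de-identified CSV file; the header row is index 0."""
    with open(path, newline="", encoding="utf-8") as f:
        return find_residual_phi(csv.reader(f))
```

An empty result does not prove the file is clean; it only means these particular patterns found nothing, so the manual spot-check still applies.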
Mapping file security:
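One check that fits under this heading is confirming that the mapping file is readable by its owner only, in line with the 0600 permissions mentioned in the methods template. The sketch below assumes a POSIX file system; `is_owner_only` and `restrict_to_owner` are hypothetical helpers, not functions of deidentify.py. Neither opens the file, so the contents are never read.

```python
import os
import stat


def is_owner_only(path):
    """True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path):
    """Set permissions to 0600: read and write for the owner, nothing else."""
    os.chmod(path, 0o600)
```

The researcher would run this on `./deidentified/mapping.json` locally; on Windows the permission bits behave differently, so this check is not meaningful there.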
Generate a de-identification methods paragraph for the manuscript or IRB:
Template:
Protected health information was removed from the dataset prior to analysis using a rule-based de-identification tool (deidentify.py, medsci-skills) with the [COUNTRY] locale pattern pack. The tool scanned column names and cell values using regex patterns for country-specific identifiers (e.g., national ID numbers, phone numbers), email addresses, dates, and addresses. Each column classification was reviewed by the researcher in an interactive terminal session. Names were replaced with pseudonyms (P0001, P0002, ...), dates were shifted by a random per-patient offset (±365 days) preserving relative temporal intervals, and direct identifiers (phone numbers, email addresses, national ID numbers) were suppressed. A total of [N] cells across [M] columns were de-identified. The de-identification mapping file was stored separately under restricted access (file permissions 0600).
Customize based on the actual audit log statistics.
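To explain to a researcher why a per-patient offset preserves relative intervals, a small illustration can help. The sketch below is not the code of deidentify.py; it only mirrors the behaviour the template describes (sequential pseudonyms and one random offset within ±365 days per patient), and all function names are hypothetical.

```python
import random
from datetime import date, timedelta


def assign_pseudonyms(patient_ids):
    """Map each distinct ID to P0001, P0002, ... in first-seen order."""
    mapping = {}
    for pid in patient_ids:
        if pid not in mapping:
            mapping[pid] = f"P{len(mapping) + 1:04d}"
    return mapping


def assign_offsets(patient_ids, max_days=365, rng=None):
    """Draw one offset per patient, uniformly from [-max_days, +max_days].

    Note: an offset of 0 is possible here, which would leave dates unchanged.
    """
    if rng is None:
        rng = random.SystemRandom()
    return {
        pid: timedelta(days=rng.randint(-max_days, max_days))
        for pid in dict.fromkeys(patient_ids)
    }


def deidentify_visits(visits, max_days=365, rng=None):
    """visits: list of (patient_id, date). Returns (pseudonym, shifted_date)."""
    ids = [pid for pid, _ in visits]
    pseudonyms = assign_pseudonyms(ids)
    offsets = assign_offsets(ids, max_days, rng)
    return [(pseudonyms[pid], d + offsets[pid]) for pid, d in visits]
```

Because every date belonging to one patient moves by the same offset, the gap between two visits of that patient is unchanged, while gaps between different patients are not preserved.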
clean-data in the research pipeline
/clean-data for data quality profiling
/analyze-stats can safely process the de-identified output
/write-paper Methods section should reference the de-identification process
/write-protocol can use the HIPAA/PIPA reference files for protocol documentation
| File | Contains PHI? | Safe for Claude? | Purpose |
|---|---|---|---|
| *_deidentified.xlsx/csv | No | Yes | De-identified data for analysis |
| mapping.json | YES | No | Original ↔ pseudonym mapping |
| audit_log.csv | No (hashes only) | Yes | What was changed and where |
| scan_report.json | No | Yes | Column classification results |
| reviewed_report.json | No | Yes | Researcher-reviewed classifications |
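The table can be turned into a default-deny guard that is consulted before any file is opened in an LLM session. This is a sketch, with `safe_for_claude` as a hypothetical helper; the file name patterns come from the table, reading `*_deidentified.xlsx/csv` as either extension. Anything not listed, including the raw input file, is treated as unsafe.

```python
import os
from fnmatch import fnmatchcase

# File names from the table above.
UNSAFE_PATTERNS = ("mapping.json",)
SAFE_PATTERNS = (
    "*_deidentified.xlsx",
    "*_deidentified.csv",
    "audit_log.csv",
    "scan_report.json",
    "reviewed_report.json",
)


def safe_for_claude(path):
    """Return True only for files the table marks as safe; deny by default."""
    name = os.path.basename(path).lower()
    if any(fnmatchcase(name, p) for p in UNSAFE_PATTERNS):
        return False
    return any(fnmatchcase(name, p) for p in SAFE_PATTERNS)
```

The check looks at the file name only, so it depends on the output files keeping the names the script gave them.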
Supported (v1):
--locale-file with template
NOT supported (planned for v2):