From Claude-Data-Wrangler
Replace PII (or other sensitive values) in a dataset with synthetic but realistic substitutes, preserving statistical shape, formats, and referential integrity where needed. Use after pii-flag has identified sensitive cells and the user wants the dataset anonymised but still analytically usable.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin Claude-Data-WranglerThis skill uses the workspace's default tool permissions.
Replace sensitive values with synthetic ones that preserve shape and joinability.
Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.
Share bugs, ideas, or general feedback.
Replace sensitive values with synthetic ones that preserve shape and joinability.
pii-flag has produced a report and the user wants remediation by substitution (rather than drop or hash).Choose per column based on the kind of data:
| Category | Strategy |
|---|---|
| Names | Generate synthetic names via Faker (locale-matched if possible). |
| Emails | Synthetic email using generated name + fake domain (@example.com). |
| Phone numbers | Faker generator with country locale matching the original. |
| Addresses | Faker address; preserve country/city-level granularity if needed for analysis. |
| Dates of birth | Shift by a random offset (per-row or constant), OR bucket to age range and resample. |
| Government IDs | Generate format-valid but non-real IDs (e.g. valid checksum, clearly-fake prefix). |
| Credit cards | Use PAN test ranges (e.g. 4111-1111-...) which are Luhn-valid but publicly known as test numbers. |
| Geocoordinates | Add uniform noise within a radius (e.g. ±500m) OR snap to city centroid. |
| Free text with PII | Presidio-anonymizer with per-entity strategies. |
| Arbitrary categorical | Sample from original distribution (preserves frequencies) or a uniform placeholder. |
| Numerical (non-PII but sensitive) | Add calibrated noise or use synthetic generation libraries (sdv, synthcity). |
pii-flag). Do not re-detect from scratch here._synthetic suffix. The original file is never modified.pii-flag on the output as a verification step. Report residual findings.hf-dataset-push, ensure the dataset card marks the data as synthetic-overlaid with a clear disclaimer.pip install pandas faker
# optional
pip install presidio-anonymizer sdv synthcity
sdv CTGAN, sampling from the original distribution) and document the chosen approach.