From claude-data-analyst
Scan a dataset and flag columns or values that appear to contain personally identifiable information (PII). Use when the user wants a quick privacy audit of a CSV/Parquet/Excel file before sharing, publishing, or ingesting into another system.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin claude-data-analyst
This skill uses the workspace's default tool permissions.
First-pass privacy scan of a dataset. Identifies columns likely to contain PII and samples matching rows.
standard (default); strict also flags quasi-identifiers like ZIP, DOB, gender.

- duckdb: regex-based column scans at speed.
- uv run --with presidio-analyzer python -c '...': Microsoft Presidio for ML-based entity detection when regex is insufficient.
- ripgrep (rg): for ad-hoc text-file scans before structured analysis.

Run column-level checks in two passes:
Match column headers (case-insensitive) against PII vocabulary:
- name, first_name, last_name, email, phone, mobile, address, street, ssn, nino, passport, national_id, credit_card, iban, account_number, dob, date_of_birth
- zip, postcode, gender, ethnicity, age (quasi-identifiers)

Sample up to 1000 rows per string column and test:
- Email: [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
- Phone: \+?\d[\d\s\-()]{7,}\d
- SSN: \d{3}-\d{2}-\d{4}

Write <dataset>-pii-report.md containing:
| Column | Detection basis | Confidence | Sample matches (redacted) | Recommendation |
Confidence levels: high (name match and value regex match), medium (name match only), low (value pattern only).
End the report with:
Never print raw PII values into the report — always mask (e.g. j***@example.com, ***-**-1234).
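One way to produce masks in the style of the examples above is sketched below. The helper names (`mask_email`, `mask_keep_tail`, `mask_value`) are hypothetical, and keeping the email domain and the last four characters visible is an assumption drawn from the two examples, not a rule the skill states.

```python
def mask_email(value):
    """Keep the first character of the local part and the full domain."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def mask_keep_tail(value, keep=4):
    """Star out letters and digits except the last `keep`, keeping separators.

    Values with `keep` or fewer letters/digits are masked entirely so that
    short values are never printed raw.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = total - keep if total > keep else total
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append("*")
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def mask_value(value):
    """Pick a masking style based on whether the value looks like an email."""
    if "@" in value:
        return mask_email(value)
    return mask_keep_tail(value)
```

With these helpers, `mask_value("john@example.com")` gives `j***@example.com` and `mask_value("123-45-1234")` gives `***-**-1234`. Applying the mask before any value reaches the report writer keeps raw samples out of the output file.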