From Claude-Data-Wrangler
Assess whether a dataset uses a consistent Unicode character set and normalisation form across its text columns. Detects mixed scripts, mixed normalisation forms (NFC/NFD/NFKC/NFKD), mojibake, mixed encodings, zero-width characters, confusables (homoglyphs), and BOM issues. Produces a remediation script with proposed fixes. Use when downstream text processing, search, or storage depends on clean Unicode hygiene.
```shell
npx claudepluginhub danielrosehill/claude-code-plugins --plugin Claude-Data-Wrangler
```

This skill uses the workspace's default tool permissions.
Audit and remediate Unicode issues in text columns.
Checks:

- Identify text columns via `dtype == object` and sampled values.
- Mixed normalisation forms (NFC/NFD/NFKC/NFKD) within a column: a hazard for deduplication, embedding pipelines (e.g. vector-upsert), or search indexing.
- Mixed encodings: detect with `chardet` or `charset-normalizer`; flag mismatches.
- Confusables / homoglyphs (e.g. `аpple` written with Cyrillic `а` instead of Latin `a`). Detect via `unicodedata.category` / `unicodedata.name` script inspection; optionally use `confusable_homoglyphs` / `uniseg` to flag values likely to be spoofed or accidentally copy-pasted.
- Smart punctuation: curly quotes (", ", ', ') vs ASCII quotes; em dash / en dash vs hyphen-minus.
- Mojibake (e.g. `Ã©` for `é`, `â€™` for `'`, `ä¸­` for `中`), detected via statistical patterns.
- Case-folding hazards (Turkish `I`/`ı`, German `ß`): flag if the dataset relies on case-insensitive matching.

Findings and proposed fixes are written to `unicode_report.md`.
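A minimal stdlib-only sketch of a few of the checks above, assuming Python 3.8+ for `unicodedata.is_normalized`; the helper names here are illustrative, not part of the skill:

```python
import unicodedata

# Zero-width characters and stray BOMs that silently break search and joins.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def normalization_forms(s: str) -> set[str]:
    """Normalisation forms the string already satisfies."""
    return {f for f in ("NFC", "NFD", "NFKC", "NFKD")
            if unicodedata.is_normalized(f, s)}

def scripts(s: str) -> set[str]:
    """Rough script inspection via the first word of unicodedata.name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in s if ch.isalpha()}

def zero_width_positions(s: str) -> list[int]:
    """Indices of invisible zero-width characters / stray BOMs."""
    return [i for i, ch in enumerate(s) if ch in ZERO_WIDTH]

# Cyrillic 'а' (U+0430) hiding in an otherwise Latin word:
print(scripts("\u0430pple"))                 # {'CYRILLIC', 'LATIN'}
print(normalization_forms("cafe\u0301"))     # {'NFD', 'NFKD'}
print(zero_width_positions("foo\u200bbar"))  # [3]
```

A value whose script set has more than one member, or a column whose rows disagree on normalisation form, is what the report flags.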
Remediation:

- Normalise with `unicodedata.normalize('NFC', s)` (or the user's preferred form).
- Repair mojibake with `ftfy.fix_text(...)`.
- Write fixes to new columns (`<col>_nfc` or `<col>_clean`) by default; overwrite in place only on explicit request, and follow the backup policy in CONVENTIONS.md.

Dependencies:

```shell
pip install pandas charset-normalizer ftfy
# optional
pip install confusable-homoglyphs
```
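The non-destructive new-column convention can be sketched as follows; `add_normalized_column` is a hypothetical helper, not an API the skill exposes:

```python
import unicodedata

import pandas as pd

def add_normalized_column(df: pd.DataFrame, col: str, form: str = "NFC") -> pd.DataFrame:
    """Write normalised values to a new <col>_<form> column; original untouched."""
    out = df.copy()
    out[f"{col}_{form.lower()}"] = df[col].map(
        lambda s: unicodedata.normalize(form, s) if isinstance(s, str) else s
    )
    return out

# 'cafe\u0301' (NFD) and 'caf\u00e9' (NFC) render identically but compare unequal.
df = pd.DataFrame({"name": ["cafe\u0301", "caf\u00e9"]})
out = add_normalized_column(df, "name")
print(out["name_nfc"].nunique())  # 1: both rows normalise to the same value
```

Keeping the original column alongside the cleaned one makes the remediation auditable and reversible without touching backups.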
The Python stdlib `unicodedata` covers most detection needs.
`ftfy.fix_text` is heuristic and occasionally over-corrects; preview changes before applying them at scale. Follow the backup policy in CONVENTIONS.md before any in-place mutation of text values.
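One way to preview before applying at scale is to collect only the values a fixer would change. This sketch uses a hypothetical `preview_fixes` helper, demonstrated with `unicodedata.normalize` so it runs without ftfy; the same pattern applies to `ftfy.fix_text`:

```python
import unicodedata

def preview_fixes(values, fix, limit=20):
    """Return up to `limit` (before, after) pairs the fix would change."""
    changed = []
    for v in values:
        fixed = fix(v)
        if fixed != v:
            changed.append((v, fixed))
            if len(changed) >= limit:
                break
    return changed

sample = ["caf\u00e9", "cafe\u0301", "apple"]
nfc = lambda s: unicodedata.normalize("NFC", s)
print(preview_fixes(sample, nfc))  # only the NFD row appears
```

Reviewing this sample by hand before a bulk rewrite catches over-corrections cheaply.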