From Claude-Data-Wrangler
Assess whether a dataset uses a consistent Unicode character set and normalisation form across its text columns. Detects mixed scripts, mixed normalisation forms (NFC/NFD/NFKC/NFKD), mojibake, mixed encodings, zero-width characters, confusables (homoglyphs), and BOM issues. Produces a remediation script with proposed fixes. Use when downstream text processing, search, or storage depends on clean Unicode hygiene.
```shell
npx claudepluginhub danielrosehill/claude-code-plugins --plugin Claude-Data-Wrangler
```

This skill uses the workspace's default tool permissions.
Audit and remediate Unicode issues in text columns.
Checks:

- Identify text columns via `dtype == object` and sampled values.
- Mixed normalisation forms (NFC/NFD/NFKC/NFKD) within a column: a hazard for deduplication, embedding pipelines (e.g. vector-upsert), or search indexing.
- Mixed encodings: detect with `chardet` or `charset-normalizer`; flag mismatches.
- Confusables / homoglyphs (e.g. `аpple` written with Cyrillic `а` instead of Latin `a`). Detect via `unicodedata.category` / `unicodedata.name` script inspection; optionally use `confusable_homoglyphs` / `uniseg` to flag values likely to be spoofed or accidentally copy-pasted.
- Smart punctuation: curly quotes (", ", ', ') vs ASCII quotes; em dash / en dash vs hyphen-minus.
- Mojibake (e.g. `Ã©` for `é`, `â€™` for `'`, `ä¸­` for `中`), detected via statistical patterns.
- Case-folding hazards (Turkish `I`/`ı`, German `ß`): flag if the dataset relies on case-insensitive matching.

Findings and proposed fixes are written to `unicode_report.md`.
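A minimal stdlib-only sketch of a few of the checks above, assuming Python 3.8+ for `unicodedata.is_normalized`; the helper names here are illustrative, not part of the skill:

```python
import unicodedata

# Zero-width characters and stray BOMs that silently break search and joins.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def normalization_forms(s: str) -> set[str]:
    """Normalisation forms the string already satisfies."""
    return {f for f in ("NFC", "NFD", "NFKC", "NFKD")
            if unicodedata.is_normalized(f, s)}

def scripts(s: str) -> set[str]:
    """Rough script inspection via the first word of unicodedata.name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in s if ch.isalpha()}

def zero_width_positions(s: str) -> list[int]:
    """Indices of invisible zero-width characters / stray BOMs."""
    return [i for i, ch in enumerate(s) if ch in ZERO_WIDTH]

# Cyrillic 'а' (U+0430) hiding in an otherwise Latin word:
print(scripts("\u0430pple"))                 # {'CYRILLIC', 'LATIN'}
print(normalization_forms("cafe\u0301"))     # {'NFD', 'NFKD'}
print(zero_width_positions("foo\u200bbar"))  # [3]
```

A value whose script set has more than one member, or a column whose rows disagree on normalisation form, is what the report flags.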
Remediation:

- Normalise with `unicodedata.normalize('NFC', s)` (or the user's preferred form).
- Repair mojibake with `ftfy.fix_text(...)`.
- Write fixes to new columns (`<col>_nfc` or `<col>_clean`) by default; overwrite in place only on explicit request, and follow the backup policy in CONVENTIONS.md.

Dependencies:

```shell
pip install pandas charset-normalizer ftfy
# optional
pip install confusable-homoglyphs
```
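The non-destructive new-column convention can be sketched as follows; `add_normalized_column` is a hypothetical helper, not an API the skill exposes:

```python
import unicodedata

import pandas as pd

def add_normalized_column(df: pd.DataFrame, col: str, form: str = "NFC") -> pd.DataFrame:
    """Write normalised values to a new <col>_<form> column; original untouched."""
    out = df.copy()
    out[f"{col}_{form.lower()}"] = df[col].map(
        lambda s: unicodedata.normalize(form, s) if isinstance(s, str) else s
    )
    return out

# 'cafe\u0301' (NFD) and 'caf\u00e9' (NFC) render identically but compare unequal.
df = pd.DataFrame({"name": ["cafe\u0301", "caf\u00e9"]})
out = add_normalized_column(df, "name")
print(out["name_nfc"].nunique())  # 1: both rows normalise to the same value
```

Keeping the original column alongside the cleaned one makes the remediation auditable and reversible without touching backups.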
The Python stdlib `unicodedata` covers most detection needs.
`ftfy.fix_text` is heuristic and occasionally over-corrects; preview changes before applying them at scale. Follow the backup policy in CONVENTIONS.md before any in-place mutation of text values.
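One way to preview before applying at scale is to collect only the values a fixer would change. This sketch uses a hypothetical `preview_fixes` helper, demonstrated with `unicodedata.normalize` so it runs without ftfy; the same pattern applies to `ftfy.fix_text`:

```python
import unicodedata

def preview_fixes(values, fix, limit=20):
    """Return up to `limit` (before, after) pairs the fix would change."""
    changed = []
    for v in values:
        fixed = fix(v)
        if fixed != v:
            changed.append((v, fixed))
            if len(changed) >= limit:
                break
    return changed

sample = ["caf\u00e9", "cafe\u0301", "apple"]
nfc = lambda s: unicodedata.normalize("NFC", s)
print(preview_fixes(sample, nfc))  # only the NFD row appears
```

Reviewing this sample by hand before a bulk rewrite catches over-corrections cheaply.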