Skill

synonym-cluster

Identify tokens or phrases that refer to the same concept but appear in different forms — transcription variants from voice notes, spelling variants, acronyms vs expansions, aliases. Use on any voice-note or STT-derived corpus before frequency/NER/topic work, or to deduplicate entity lists.

npx claudepluginhub danielrosehill/claude-code-plugins --plugin text-corpus-analysis

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Collapse variant surface forms to a canonical concept.

SKILL.md

Similar Skills

ui-ux-pro-max

72.7k

Provides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.

ui-ux-pro-max

context7-mcp

51.8k

Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.

context7-plugin

seo-cannibalization-detector

36.5k

Analyzes multiple pages for keyword overlap, SEO cannibalization risks, and content duplication. Suggests differentiation, consolidation, and resolution strategies when reviewing similar content.

antigravity-bundle-seo-specialist

Stats

Stars0

Forks0

Last CommitApr 23, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Synonym Cluster

Collapse variant surface forms to a canonical concept.

When to use

Voice notepad corpora: "GitHub" / "git hub" / "get hub" / "gitbub".
Entity lists with aliases: "Daniel Rosehill" / "Daniel R." / "DR".
Mixed acronym/expansion usage: "LLM" / "large language model".

Procedure

Candidate extraction: take unique tokens or NER spans from word-frequency / ner-extraction. Keep surface counts.
Pair-score candidates:
- Lexical: RapidFuzz token_sort_ratio, Jaro-Winkler, or Levenshtein normalized by length. Fast, handles typos and transcription slips well.
- Phonetic: Metaphone / Double Metaphone for STT variants ("get hub" ≈ "github" phonetically).
- Embedding: sentence-transformer similarity for semantic aliases ("LLM" vs "large language model") — needed when lexical/phonetic fails.
Cluster: agglomerative clustering with a similarity threshold (lexical ≥0.85, embedding ≥0.88). Union-find is fine for sparse pair sets.
Pick canonical: the highest-frequency member of each cluster, or the longest well-formed member for acronym cases.
Review: surface clusters with borderline similarity (0.80-0.90) for manual confirmation — automated merging at that band causes false positives. For expert judgment on a specific cluster, a single LLM call is fine.

Output:

[
  {"canonical": "GitHub", "variants": ["git hub", "get hub", "gitbub"], "total_count": 1423},
  {"canonical": "large language model", "variants": ["LLM", "LLMs"], "total_count": 892}
]

Apply: produce a replacement map and run over the corpus (re.sub with word boundaries), or keep as a lookup for downstream skills.

Cost control

No LLM needed for the clustering pass itself. Embeddings (if used) are a one-time cost on the unique-vocabulary set — typically 5-50k unique tokens, not N docs. Cheap.

Pitfalls

Don't merge across semantic categories (person names ≠ place names) — keep NER labels during clustering.
Acronyms collide ("MS" = Microsoft / Multiple Sclerosis / manuscript). Context required — escalate ambiguous clusters to an LLM.