Skill

suggest-categories

Derive N categories from the dominant themes of a corpus — the user says "give me 10 categories for these 1000 notes" or "propose 20 labels that would cover most of this data". Produces a proposed category list with definitions, coverage estimates, and example documents.

npx claudepluginhub danielrosehill/claude-code-plugins --plugin text-corpus-analysis

Tool Access

This skill uses the workspace's default tool permissions.

Stats

Stars: 0

Forks: 0

Last Commit: Apr 23, 2026

Suggest Categories

Bottom-up category derivation. Given a corpus and a target N, produce a workable categorization scheme.

When to use

"What are the 10 main themes in these 500 GitHub repo READMEs?"
"Give me 15 categories that would cover my voice notes."
"I need to extend my taxonomy from 12 to 18 — propose the 6 new ones."

Procedure

1. Sample, don't sweep. A stratified sample of 300-1000 documents usually surfaces the same dominant themes as the full corpus. For corpora larger than about 2k documents, always sample.
2. Embed + cluster (choose-approach recommends this):
   - Embeddings: all-MiniLM-L6-v2 (local, free) or text-embedding-3-small (cloud, cheap).
   - Cluster: HDBSCAN if N is flexible, k-means with k=N if an exact count is required, or BERTopic for topic-word output.
3. Label clusters with an LLM (a cheap model is enough; one call per cluster):
   - Input: top-10 c-TF-IDF terms + 3-5 exemplar docs.
   - Ask for: {label, 1-sentence definition, distinguishing examples}.
4. Coverage check: run the proposed categories through categorize-corpus on the sample with a confidence threshold. Any document that lands in "none" or "low-confidence" shows where the scheme has gaps.
5. Iterate: merge overlapping categories, split oversized ones, and add catch-alls sparingly. Aim for roughly balanced cluster sizes unless the user explicitly wants long-tail granularity.
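The sampling and labeling steps above can be sketched in plain Python. This is a minimal illustration, not the skill's actual implementation: the `source` field used as the stratum key, the function names, and the prompt wording are all assumptions made for the example, and the embedding and clustering steps are left out because they depend on whichever library is chosen.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, target_size, seed=0):
    """Draw about target_size docs, keeping each stratum's share of the corpus."""
    if len(docs) <= target_size:
        return list(docs)
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for members in strata.values():
        # Proportional allocation, with a floor of one doc so rare strata survive.
        quota = max(1, round(target_size * len(members) / len(docs)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


def build_label_prompt(top_terms, exemplars):
    """Assemble one labeling request: up to 10 top terms and up to 5 exemplars."""
    lines = [
        "Name the category these documents share.",
        "Top terms: " + ", ".join(top_terms[:10]),
        "Exemplars:",
    ]
    lines += [f"- {text}" for text in exemplars[:5]]
    # Hypothetical reply format mirroring {label, definition, distinguishing examples}.
    lines.append('Reply as JSON with keys "label", "definition", "distinguishing_examples".')
    return "\n".join(lines)


# Hypothetical corpus: 4000 notes, a quarter of them voice transcripts.
docs = [{"id": i, "source": "voice" if i % 4 == 0 else "typed"} for i in range(4000)]
sample = stratified_sample(docs, key=lambda d: d["source"], target_size=400)
prompt = build_label_prompt(
    ["docker", "k8s", "deploy"],
    ["Notes on rolling back a failed deploy."],
)
print(len(sample))
print(prompt)
```

Because of the one-document floor and rounding, the sample size can drift slightly from `target_size` when there are many small strata; for theme discovery that drift is usually harmless.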

Output:

{
  "proposed_categories": [
    {
      "label": "Infrastructure & DevOps",
      "definition": "Notes about servers, containers, CI/CD, deployment, monitoring.",
      "estimated_share": 0.14,
      "exemplars": ["doc_id_123", "doc_id_456"],
      "top_terms": ["docker", "k8s", "deploy", ...]
    }
  ],
  "uncovered_share": 0.07,
  "notes": "..."
}
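As a rough sketch of where `estimated_share` and `uncovered_share` could come from, the snippet below counts per-document assignments on the sample. The input shape (one label or `None` per document) and the function name `summarize_coverage` are assumptions for illustration; the skill itself may compute these differently.

```python
from collections import Counter


def summarize_coverage(assignments, labels):
    """Turn per-document assignments (label or None) into share estimates."""
    total = len(assignments)
    if total == 0:
        raise ValueError("no assignments to summarize")
    counts = Counter(a for a in assignments if a is not None)
    categories = [
        {"label": label, "estimated_share": round(counts[label] / total, 2)}
        for label in labels
    ]
    # Documents with no assigned category count toward the uncovered share.
    uncovered = sum(a is None for a in assignments) / total
    return {"proposed_categories": categories, "uncovered_share": round(uncovered, 2)}


# Hypothetical sample of 100 documents; "Everything else" is a placeholder label.
assignments = ["Infrastructure & DevOps"] * 14 + ["Everything else"] * 79 + [None] * 7
summary = summarize_coverage(assignments, ["Infrastructure & DevOps", "Everything else"])
print(summary)
```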

Extending an existing taxonomy (N → N+k)

1. Run categorize-corpus with the current N categories.
2. Isolate the low-confidence and "none" docs.
3. Run this skill on that residue with target k.
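Isolating the residue might look like the following. The record shape (`doc_id`, `label`, `confidence`) and the 0.5 threshold are assumptions for the example; the real categorize-corpus output format is not specified here.

```python
def split_residue(results, threshold=0.5):
    """Separate confidently categorized docs from the residue to re-cluster."""
    covered, residue = [], []
    for record in results:
        # A doc joins the residue if it got no category or a weak match.
        if record["label"] == "none" or record["confidence"] < threshold:
            residue.append(record["doc_id"])
        else:
            covered.append(record["doc_id"])
    return covered, residue


# Hypothetical categorize-corpus results.
results = [
    {"doc_id": "doc_id_123", "label": "Infrastructure & DevOps", "confidence": 0.91},
    {"doc_id": "doc_id_456", "label": "Infrastructure & DevOps", "confidence": 0.32},
    {"doc_id": "doc_id_789", "label": "none", "confidence": 0.0},
]
covered, residue = split_residue(results)
print(covered, residue)
```

The `residue` list is then the input corpus for this skill, with the target set to k rather than N.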

Cost control

One LLM call per cluster (typically 10-30 clusters), so labeling cost is negligible.
Embeddings account for most of the cost. Batch 100 or more documents per request. For 10k docs × 500 tokens (5M tokens in total): ~$0.10 at cloud prices, free locally.
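The arithmetic behind that estimate, plus a simple batching helper, can be sketched as follows. The rate of $0.02 per million tokens is an assumption implied by the ~$0.10 figure above, not a quoted price, so check the provider's current pricing before relying on it.

```python
import math


def embedding_cost_usd(n_docs, tokens_per_doc, usd_per_million_tokens):
    """Estimate embedding spend from corpus size and a per-million-token rate."""
    return n_docs * tokens_per_doc / 1_000_000 * usd_per_million_tokens


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items, one per API request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Assumed rate: $0.02 per million tokens (implied by the ~$0.10 estimate).
cost = embedding_cost_usd(n_docs=10_000, tokens_per_doc=500, usd_per_million_tokens=0.02)
n_requests = sum(1 for _ in batches(list(range(10_000)), size=100))
print(f"${cost:.2f} across {n_requests} requests")
assert math.isclose(cost, 0.10)
```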