From nickcrew-claude-ctx-plugin
Curates datasets for ML/LLM training: designs schemas, cleans/deduplicates data, handles class imbalance, creates stratified train/val/test splits, and writes dataset cards.
npx claudepluginhub nickcrew/claude-cortex
This skill uses the workspace's default tool permissions.
This skill covers the full lifecycle of dataset creation and curation for machine learning and LLM tasks. It addresses dataset schema design, data collection strategies, quality filtering, deduplication, class imbalance mitigation, stratified train/val/test splits, annotation guideline writing, and dataset card documentation. Good datasets are the foundation of reliable models — this skill helps teams avoid the most common data quality pitfalls that lead to poor generalization, evaluation leakage, and biased models.
| Task | Approach |
|---|---|
| Define dataset schema | List fields, types, required vs optional, allowed values, and examples |
| Remove duplicates | Hash-based exact dedup + MinHash/LSH for near-duplicate detection |
| Fix class imbalance | Oversample minority (SMOTE) or undersample majority; adjust loss weights |
| Create train/val/test splits | Stratified split by label; ensure no overlap of entities across splits |
| Document the dataset | Write a dataset card with provenance, schema, statistics, and limitations |
| Validate annotation quality | Compute inter-annotator agreement (Cohen's kappa or Krippendorff's alpha) |
| Handle missing values | Decide per-field: impute, drop row, or add "unknown" category |
| Detect label noise | Use confident learning (cleanlab) or cross-validation outlier detection |
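
The "Validate annotation quality" row can be illustrated with a small, dependency-free sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the sample labels are hypothetical and exist only for illustration; a real project might prefer an established library implementation.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical double-annotated sample (illustrative labels only).
annotator_1 = ["refund", "billing", "refund", "other", "billing", "refund"]
annotator_2 = ["refund", "billing", "billing", "other", "billing", "refund"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a plain percent-agreement figure when label frequencies are skewed.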
Define the task and schema — Before collecting any data, write the schema: every field name, data type, allowed values, and whether it is required. For classification datasets, enumerate all valid labels and their definitions. Ambiguous schemas cause inconsistent annotations and training failures.
Establish collection strategy — Determine the data source: human-annotated, LLM-generated, web-scraped, synthetic, or a combination. Document collection date, source URLs, licenses, and any sampling decisions. Ensure the collection covers the full input distribution the model will encounter in production.
Write annotation guidelines — Create a guideline document for labelers that defines every label, provides positive and negative examples for each, and includes decision rules for edge cases. Pilot the guidelines with 2–3 annotators on a sample of 50 items and iterate before full annotation begins.
Run quality filtering — Remove items that are too short, too long, contain encoding errors, are in the wrong language, or fail domain-specific quality checks. Log how many items were removed at each filter step and why. Preserve a raw snapshot before filtering.
Deduplicate the dataset — Apply exact deduplication first (hash the text or key fields). Then apply near-duplicate detection using MinHash + LSH (e.g., datasketch library) or sentence embedding cosine similarity. Aim to remove items with >80% overlap. Keep the highest-quality copy when deduplicating.
Assess and address class imbalance — Compute class distribution. If any class has less than 5% of the majority class count, consider: (a) collecting more data for minority classes, (b) oversampling with augmentation, (c) applying class weights in the loss function, or (d) using stratified sampling. Document the chosen approach and its rationale.
Create stratified splits — Split data into train/val/test sets preserving class distribution in each split. Use an 80/10/10 or 70/15/15 ratio as a starting point. For datasets with identifiable entities (users, documents, companies), ensure the same entity never appears in multiple splits to prevent leakage. Test set must remain untouched until final evaluation.
Measure annotation quality — For human-labeled data, compute inter-annotator agreement on a sample (at least 10% of the dataset labeled by 2+ annotators). Cohen's kappa > 0.7 is acceptable; < 0.6 indicates guideline issues. Resolve disagreements through adjudication, not random selection.
Audit for biases and coverage gaps — Analyze the dataset across demographic attributes, time periods, domains, and edge cases. Identify over-represented and under-represented slices. Document known limitations in the dataset card. If possible, compare distribution to real production data.
Write the dataset card — Document: dataset name, version, task, schema, collection methodology, source licenses, size (rows and tokens), class distribution, split sizes, known limitations and biases, and citation. Use the Hugging Face dataset card format for compatibility with the ecosystem.
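
The quality filtering step above asks for a log of how many items each filter removes. The following is a minimal sketch of one way to do that, assuming text items; the filter names, the 10 and 2000 character limits, and the `run_filters` helper are illustrative assumptions, not part of the skill itself.

```python
from collections import Counter

def too_short(text, min_chars=10):
    return len(text) < min_chars

def too_long(text, max_chars=2000):
    return len(text) > max_chars

def has_encoding_errors(text):
    # U+FFFD is the replacement character left behind by lossy decoding.
    return "\ufffd" in text

# Ordered (reason, predicate) pairs; an item is dropped at the first match.
FILTERS = [
    ("too_short", too_short),
    ("too_long", too_long),
    ("encoding_error", has_encoding_errors),
]

def run_filters(texts):
    """Return (kept_texts, removal_counts) so each step's impact is logged."""
    kept, removed = [], Counter()
    for text in texts:
        reason = next((name for name, check in FILTERS if check(text)), None)
        if reason is None:
            kept.append(text)
        else:
            removed[reason] += 1
    return kept, removed

# The raw list is left untouched, which doubles as the pre-filter snapshot.
raw = ["hi", "Please refund my last order.", "Broken \ufffd text here", "x" * 2001]
kept, removed = run_filters(raw)
print(f"kept {len(kept)} of {len(raw)}; removed: {dict(removed)}")
```

Attributing each dropped item to the first filter it fails keeps the counts additive, so the per-step numbers sum to the total removed.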
Input: A support team wants to train a model to classify customer emails into 8 intent categories: billing, refund, technical-issue, account-access, product-question, cancellation, complaint, and other.
Output:
Schema definition:
{
  "id": "string (UUID)",
  "text": "string (customer email body, min 10 chars, max 2000 chars)",
  "label": "enum: [billing, refund, technical-issue, account-access, product-question, cancellation, complaint, other]",
  "confidence": "float 0.0–1.0 (annotator confidence)",
  "annotator_id": "string",
  "annotation_date": "ISO 8601 date",
  "source": "enum: [zendesk, email-archive, synthetic]"
}
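
A schema like this is most useful when it is enforced in code. The sketch below checks one record against the fields listed above; `validate_record` is a hypothetical helper, and treating every field as required is an assumption, since the schema does not mark fields as required or optional.

```python
import uuid
from datetime import date

LABELS = {"billing", "refund", "technical-issue", "account-access",
          "product-question", "cancellation", "complaint", "other"}
SOURCES = {"zendesk", "email-archive", "synthetic"}

def validate_record(rec):
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    try:
        uuid.UUID(str(rec.get("id")))
    except ValueError:
        errors.append("id is not a UUID")
    text = rec.get("text")
    if not isinstance(text, str) or not 10 <= len(text) <= 2000:
        errors.append("text must be a string of 10 to 2000 chars")
    if rec.get("label") not in LABELS:
        errors.append("label is not an allowed value")
    conf = rec.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be in [0.0, 1.0]")
    annotator = rec.get("annotator_id")
    if not isinstance(annotator, str) or not annotator:
        errors.append("annotator_id is missing")
    try:
        date.fromisoformat(str(rec.get("annotation_date")))
    except ValueError:
        errors.append("annotation_date is not an ISO 8601 date")
    if rec.get("source") not in SOURCES:
        errors.append("source is not an allowed value")
    return errors

# Illustrative records; the UUID and annotator ID are made up.
good = {
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "text": "I'd like a refund for my order from last week",
    "label": "refund",
    "confidence": 0.9,
    "annotator_id": "ann-01",
    "annotation_date": "2024-03-15",
    "source": "zendesk",
}
bad = dict(good, label="refunds", confidence=1.4)
print(validate_record(good))
print(validate_record(bad))
```

Returning all violations at once, rather than raising on the first, makes it easier to report per-field error counts across a whole batch.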
Collection plan:
[NAME], emails with [EMAIL], account numbers with [ACCT]
Annotation guidelines excerpt:
Label: "refund"
Definition: Customer is requesting money back for a purchase.
Positive examples:
- "I'd like a refund for my order from last week"
- "Please return the charge to my credit card"
Negative examples (do NOT label as refund):
- "I want to exchange my item" → label as product-question
- "Why was I charged twice?" → label as billing
Edge case: "I want to cancel and get my money back" → label as cancellation (intent is to cancel)
Deduplication approach:
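from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode('utf8'))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept_ids = []
items = []  # fill with (id, text) pairs, ordered best-quality first
for item_id, text in items:
    signature = get_minhash(text)
    if not lsh.query(signature):  # no indexed near-duplicate yet
        lsh.insert(item_id, signature)  # first item seen represents its cluster
        kept_ids.append(item_id)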
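
Near-duplicate detection is meant to run after exact deduplication. A minimal sketch of that first pass, using only the standard library, might look like the following; the `normalize` and `exact_dedup` helpers and the sample emails are hypothetical, and the normalization rules (lowercasing, whitespace collapsing) are an assumption that should be tuned to the data.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_dedup(items):
    """Keep the first (id, text) pair seen for each normalized-text hash."""
    seen, kept = set(), []
    for item_id, text in items:
        digest = hashlib.sha256(normalize(text).encode("utf8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((item_id, text))
    return kept

# Illustrative input only.
emails = [
    ("a1", "Please refund my order."),
    ("a2", "please  refund my order. "),  # same text after normalization
    ("a3", "Why was I charged twice?"),
]
unique = exact_dedup(emails)
print([item_id for item_id, _ in unique])
```

Because the first occurrence wins, sorting the input by quality before calling `exact_dedup` is one way to keep the highest-quality copy.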
Split strategy:
Total: 5,200 items (after dedup and filtering)
Train: 4,160 (80%) — stratified by label
Val: 520 (10%) — stratified by label
Test: 520 (10%) — stratified by label; set aside immediately, not used for development
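
The split sizes above can be reproduced with a short, dependency-free sketch. `stratified_split` is a hypothetical helper, and the records are synthetic stand-ins with perfectly balanced classes, which real email data will not have; libraries such as scikit-learn offer stratified splitting as well.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key="label", ratios=(0.8, 0.1, 0.1), seed=13):
    """Split records into train/val/test, preserving per-label proportions."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * ratios[1])
        n_test = round(len(group) * ratios[2])
        n_train = len(group) - n_val - n_test  # remainder goes to train
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Synthetic stand-in: 8 labels x 650 items = 5,200 records.
labels = ["billing", "refund", "technical-issue", "account-access",
          "product-question", "cancellation", "complaint", "other"]
records = [{"id": f"{label}-{i}", "label": label}
           for label in labels for i in range(650)]
train, val, test = stratified_split(records)
print(len(train), len(val), len(test))
```

Splitting each label group separately is what guarantees that rare classes appear in all three sets. When items share an entity such as a customer ID, the grouping would need to happen at the entity level instead, which this sketch does not attempt.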
Input: A manufacturing team wants to train a binary classifier to detect product defects from camera images. They have 10,000 raw images, but only ~400 are defective.
Output:
Schema definition:
image_id: string (factory_line + timestamp)
image_path: string (relative path to image file)
label: enum: [defective, non-defective]
defect_type: enum: [scratch, dent, discoloration, missing-component, none] (sub-type for defective)
image_width: int (pixels)
image_height: int (pixels)
capture_date: ISO 8601 date
production_line: string
annotated_by: string (annotator ID or "automated")
Quality filtering rules:
Class imbalance strategy:
Original: 9,600 non-defective, 400 defective (24:1 imbalance)
Option 1 — Data augmentation for minority class:
Apply: horizontal flip, ±15° rotation, brightness ±20%, add Gaussian noise
Target: 2,000 defective images (5× augmentation)
Result: 9,600 non-defective, 2,000 defective (4.8:1 ratio) — more manageable
Option 2 — Class weighting (simpler, use if augmentation is not feasible):
class_weight = {0: 1.0, 1: 24.0} # inverse frequency weighting
Apply in model training loss function
Recommendation: Use both — augment to 2,000 AND apply 4.8:1 class weight
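
The weights quoted above follow from dividing the majority class count by each class count. A small sketch of that arithmetic, assuming class 0 is non-defective and class 1 is defective (the helper name `inverse_frequency_weights` is hypothetical):

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by majority_count / class_count (majority gets 1.0)."""
    majority = max(class_counts.values())
    return {cls: majority / count for cls, count in class_counts.items()}

# Counts from the example: before and after augmenting the defective class.
original = inverse_frequency_weights({0: 9600, 1: 400})
augmented = inverse_frequency_weights({0: 9600, 1: 2000})
print(original)   # {0: 1.0, 1: 24.0}
print(augmented)  # {0: 1.0, 1: 4.8}
```

The resulting dictionary has the same shape as the `class_weight` mapping shown in Option 2, so it can be recomputed whenever the class counts change.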
Dataset card excerpt:
Dataset Name: Manufacturing Defect Detection v1.2
Task: Binary image classification (defective / non-defective)
Size: 11,600 images (9,600 non-defective, 2,000 defective after augmentation)
Source: Factory line cameras, Line A and Line B, 2024-01 to 2024-06
License: Internal use only (proprietary)
Known Limitations:
- Only covers Lines A and B; Line C has different lighting conditions
- Defective samples over-represent scratches (60% of defects)
- No samples from night shift (different ambient light)
Split: Train 80% / Val 10% / Test 10% (stratified by label and production line)
- Use cleanlab to automatically detect likely mislabeled examples in existing datasets
- The Hugging Face datasets library handles streaming, caching, and map operations efficiently for large datasets
- Add a data_source field to every item; it's invaluable when debugging distribution shift
- Mark augmented samples with an augmented: true flag for traceability