Automatically constructs large-scale image datasets from web sources using semantic query expansion with N-grams and CNN-based filtering to reduce bias and boost generalization for image classification.
Install via the plugin hub:

```shell
npx claudepluginhub lunartech-x/superpowers --plugin superpowers
```
This skill provides a framework for automatically collecting diverse, high-quality image datasets from the web using semantic query expansion and progressive CNN-based filtering. The methodology addresses key challenges: search-query bias, noisy retrieved images, and poor generalization of models trained on narrowly sourced data.
Key Innovation: Uses Google Books Ngrams Corpora for query expansion to capture richer semantic descriptions, then progressively filters using CNNs.
Use this skill when you need a large, labeled image dataset for classification and manual collection or labeling is impractical.
Initial Query Definition:
```python
initial_queries = ["dog", "car", "airplane"]
```
Semantic Expansion with N-gram Corpora:
```python
def expand_query_with_ngrams(query, ngram_data):
    """
    Expand a query using Google Books Ngrams:
    - Find co-occurring terms
    - Add synonyms and related concepts
    - Include descriptive modifiers

    Example: "dog" → ["dog breed", "puppy", "canine",
                      "dog playing", "dog running", ...]
    """
    # Get bigrams containing the query
    bigrams = get_ngrams(query, n=2, ngram_data=ngram_data)
    # Get trigrams for added context
    trigrams = get_ngrams(query, n=3, ngram_data=ngram_data)
    # Combine and rank by relevance/frequency
    return rank_by_relevance(bigrams + trigrams)
```
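As a concrete illustration, here is a runnable toy version of the expansion step. The in-memory frequency table and the `expand_query` helper are stand-ins for the real Google Books Ngrams files and the ranking logic:

```python
from collections import Counter

# Toy n-gram frequency table; the real pipeline would read these
# counts from the Google Books Ngrams dataset files.
TOY_NGRAMS = Counter({
    ("dog", "breed"): 900,
    ("dog", "running"): 700,
    ("happy", "dog"): 650,
    ("dog", "food"): 400,
    ("dog", "playing", "fetch"): 300,
})

def expand_query(query, ngram_table, top_k=3):
    """Return the top_k n-grams containing `query`, ranked by corpus frequency."""
    matches = [
        (" ".join(gram), freq)
        for gram, freq in ngram_table.items()
        if query in gram
    ]
    matches.sort(key=lambda pair: -pair[1])
    return [phrase for phrase, _ in matches[:top_k]]

print(expand_query("dog", TOY_NGRAMS))
# → ['dog breed', 'dog running', 'happy dog']
```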
Visual Saliency Filtering:
```python
def filter_expansions(expansions, visual_model, threshold=0.5):
    """
    Remove expansions that are:
    - Visually non-salient (abstract concepts)
    - Irrelevant to the visual domain
    - Too generic or too specific

    Uses a pre-trained visual model to score saliency.
    """
    filtered = []
    for exp in expansions:
        # Keep only expansions that name a visually identifiable concept
        saliency_score = visual_model.predict_saliency(exp)
        if saliency_score > threshold:
            filtered.append(exp)
    return filtered
```
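To make the filtering logic tangible without a trained model, here is a toy version where a hardcoded concreteness lookup stands in for the visual saliency scorer (the scores and phrases are illustrative assumptions):

```python
# Stand-in "visual model": a concreteness lookup. A real pipeline would
# score saliency with a pretrained vision model, not a fixed table.
CONCRETENESS = {
    "dog breed": 0.9, "puppy": 0.95, "dog training tips": 0.2,
    "dog history": 0.15, "dog running": 0.85,
}

def filter_by_concreteness(expansions, threshold=0.5):
    """Drop expansions that don't name a visually concrete concept."""
    return [e for e in expansions if CONCRETENESS.get(e, 0.0) > threshold]

print(filter_by_concreteness(list(CONCRETENESS)))
# → ['dog breed', 'puppy', 'dog running']
```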
Multi-Query Image Collection:
```python
def collect_images(expanded_queries, images_per_query=500):
    """
    Retrieve images using the expanded queries:
    - Use multiple search engines
    - Collect metadata (source URL, originating query)
    - Diversify sources to reduce bias
    """
    all_images = []
    for query in expanded_queries:
        images = search_engine.image_search(query, num_results=images_per_query)
        for img in images:
            img['source_query'] = query
        all_images.extend(images)
    return all_images
```
Initial Preprocessing:
```python
def preprocess_images(images):
    """
    - Remove duplicates (perceptual hash)
    - Validate image format
    - Resize to standard dimensions
    - Remove corrupted files
    """
    pass
```
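A minimal concrete version of this step, assuming Pillow is available; the 8×8 average hash is a simple stand-in for production perceptual hashing libraries:

```python
from PIL import Image

def average_hash(img, hash_size=8):
    """Tiny perceptual hash: downscale to grayscale, threshold at the mean."""
    small = img.convert("L").resize((hash_size, hash_size))
    pixels = list(small.getdata())
    avg = sum(pixels) / len(pixels)
    return "".join("1" if p > avg else "0" for p in pixels)

def preprocess(images, size=(224, 224)):
    """Deduplicate by perceptual hash, then resize; skips unreadable inputs."""
    seen, cleaned = set(), []
    for img in images:
        try:
            h = average_hash(img)
        except OSError:  # corrupted file
            continue
        if h in seen:  # perceptual duplicate of an earlier image
            continue
        seen.add(h)
        cleaned.append(img.convert("RGB").resize(size))
    return cleaned
```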
Feature Extraction:
```python
import numpy as np

def extract_features(images, cnn_model):
    """
    Extract deep features using a pre-trained CNN
    (e.g., VGG or ResNet activations from the penultimate layer).
    """
    features = [cnn_model.extract_features(img) for img in images]
    return np.array(features)
```
Cluster Analysis:
```python
from sklearn.cluster import KMeans

def cluster_and_filter(features, images, n_clusters=10,
                       density_threshold=0.5, min_coherence=0.5):
    """
    Cluster images by visual similarity:
    - Identify core clusters (likely relevant)
    - Remove outlier clusters (likely noise)
    - Keep images from dense, coherent clusters
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    clusters = kmeans.fit_predict(features)
    # Per-cluster statistics (density, coherence, ...), keyed by cluster id
    cluster_stats = analyze_clusters(clusters, features)
    # Keep only the ids of dense, coherent clusters
    valid_ids = {
        cid for cid, stats in cluster_stats.items()
        if stats['density'] > density_threshold
        and stats['coherence'] > min_coherence
    }
    return [img for img, cid in zip(images, clusters) if cid in valid_ids]
```
Initial CNN Training:
```python
def train_initial_classifier(clustered_images, num_classes):
    """
    Train an initial CNN classifier on the clustered data:
    - Use cluster assignments as pseudo-labels
    - Fine-tune a pre-trained model
    """
    model = load_pretrained_cnn(num_classes=num_classes)
    return fine_tune(model, clustered_images)
```
Progressive Refinement:
```python
def progressive_filtering(images, model, iterations=3):
    """
    Iteratively refine the dataset:
    1. Classify all images with the current model
    2. Remove low-confidence predictions
    3. Retrain the model on the refined set
    4. Repeat
    """
    for i in range(iterations):
        # Predict on all remaining images
        predictions = model.predict(images)
        # Keep only confident samples; the threshold tightens per iteration
        confident_samples = [
            (img, pred) for img, pred in zip(images, predictions)
            if pred['confidence'] > confidence_threshold(i)
        ]
        # Retrain on the refined set
        model = train_classifier(confident_samples)
        images = [s[0] for s in confident_samples]
    return images, model
```
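A runnable miniature of this loop, with scikit-learn's LogisticRegression standing in for the CNN and a linear threshold schedule; both are illustrative assumptions, not the original setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Two well-separated Gaussian classes with 10% injected label noise.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
flip = rng.choice(200, size=20, replace=False)
y[flip] = 1 - y[flip]

def progressive_filter(X, y, iterations=3):
    model = LogisticRegression()
    for i in range(iterations):
        model.fit(X, y)
        conf = model.predict_proba(X).max(axis=1)
        pred = model.predict(X)
        threshold = 0.6 + 0.1 * i      # tightens every round
        keep = conf > threshold
        # Keep confident samples and adopt their pseudo-labels
        X, y = X[keep], pred[keep]
    return X, y, model

X_f, y_f, model = progressive_filter(X, y)
print(f"kept {len(X_f)} of 200 samples")
```

Because the surviving samples take the model's pseudo-labels, the injected label noise is largely corrected rather than merely discarded.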
Quality Verification:
```python
def verify_dataset_quality(dataset, stl10_test, cifar10_test):
    """
    Evaluate dataset quality:
    - Cross-dataset generalization (test on STL-10, CIFAR-10)
    - Class balance analysis
    - Diversity metrics
    """
    # Train a classifier on the generated dataset
    model = train_classifier(dataset)
    # Test on external benchmark datasets
    stl10_accuracy = evaluate(model, stl10_test)
    cifar10_accuracy = evaluate(model, cifar10_test)
    return {
        'cross_dataset_acc': (stl10_accuracy + cifar10_accuracy) / 2,
        'class_balance': compute_balance(dataset),
        'diversity': compute_diversity(dataset),
    }
```
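`compute_balance` is left abstract above; one common choice (an assumption here, not necessarily the original metric) is the normalized entropy of the label distribution:

```python
import numpy as np
from collections import Counter

def class_balance(labels):
    """Normalized entropy of the label distribution: 1.0 = perfectly balanced."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(len(counts)) if len(counts) > 1 else 1.0

print(class_balance(["dog"] * 50 + ["car"] * 50))   # → 1.0
print(class_balance(["dog"] * 90 + ["car"] * 10))   # well below 1.0
```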
Export Dataset:
```
dataset/
├── train/
│   ├── class_1/
│   ├── class_2/
│   └── ...
├── val/
├── test/
├── metadata.json
└── dataset_stats.md
```
Based on original research.
```shell
# Deep learning
pip install torch torchvision   # or tensorflow

# Clustering
pip install scikit-learn

# Image processing
pip install pillow opencv-python

# N-gram data: download Google Books Ngrams from
# https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
```