Automatically constructs large-scale image datasets from web sources using semantic query expansion with N-grams and CNN-based filtering to reduce bias and boost generalization for image classification.
Install via the plugin hub:

```shell
npx claudepluginhub lunartech-x/superpowers --plugin superpowers
```
This skill provides a framework for automatically collecting diverse, high-quality image datasets from the web using semantic query expansion and progressive CNN-based filtering. The methodology addresses key challenges: search-query bias, noisy retrieved images, and poor generalization of models trained on narrowly sourced data.
Key Innovation: Uses Google Books Ngrams Corpora for query expansion to capture richer semantic descriptions, then progressively filters using CNNs.
Use this skill when you need a large, labeled image dataset for classification and manual collection or labeling is impractical.
Initial Query Definition:
```python
initial_queries = ["dog", "car", "airplane"]
```
Semantic Expansion with N-gram Corpora:
```python
def expand_query_with_ngrams(query, ngram_data):
    """
    Expand a query using Google Books Ngrams:
    - Find co-occurring terms
    - Add synonyms and related concepts
    - Include descriptive modifiers

    Example: "dog" → ["dog breed", "puppy", "canine",
                      "dog playing", "dog running", ...]
    """
    # Get bigrams containing the query
    bigrams = get_ngrams(query, n=2, ngram_data=ngram_data)
    # Get trigrams for added context
    trigrams = get_ngrams(query, n=3, ngram_data=ngram_data)
    # Combine and rank by relevance/frequency
    return rank_by_relevance(bigrams + trigrams)
```
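As a concrete illustration, here is a runnable toy version of the expansion step. The in-memory frequency table and the `expand_query` helper are stand-ins for the real Google Books Ngrams files and the ranking logic:

```python
from collections import Counter

# Toy n-gram frequency table; the real pipeline would read these
# counts from the Google Books Ngrams dataset files.
TOY_NGRAMS = Counter({
    ("dog", "breed"): 900,
    ("dog", "running"): 700,
    ("happy", "dog"): 650,
    ("dog", "food"): 400,
    ("dog", "playing", "fetch"): 300,
})

def expand_query(query, ngram_table, top_k=3):
    """Return the top_k n-grams containing `query`, ranked by corpus frequency."""
    matches = [
        (" ".join(gram), freq)
        for gram, freq in ngram_table.items()
        if query in gram
    ]
    matches.sort(key=lambda pair: -pair[1])
    return [phrase for phrase, _ in matches[:top_k]]

print(expand_query("dog", TOY_NGRAMS))
# → ['dog breed', 'dog running', 'happy dog']
```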
Visual Saliency Filtering:
```python
def filter_expansions(expansions, visual_model, threshold=0.5):
    """
    Remove expansions that are:
    - Visually non-salient (abstract concepts)
    - Irrelevant to the visual domain
    - Too generic or too specific

    Uses a pre-trained visual model to score saliency.
    """
    filtered = []
    for exp in expansions:
        # Keep only expansions that name a visually identifiable concept
        saliency_score = visual_model.predict_saliency(exp)
        if saliency_score > threshold:
            filtered.append(exp)
    return filtered
```
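To make the filtering logic tangible without a trained model, here is a toy version where a hardcoded concreteness lookup stands in for the visual saliency scorer (the scores and phrases are illustrative assumptions):

```python
# Stand-in "visual model": a concreteness lookup. A real pipeline would
# score saliency with a pretrained vision model, not a fixed table.
CONCRETENESS = {
    "dog breed": 0.9, "puppy": 0.95, "dog training tips": 0.2,
    "dog history": 0.15, "dog running": 0.85,
}

def filter_by_concreteness(expansions, threshold=0.5):
    """Drop expansions that don't name a visually concrete concept."""
    return [e for e in expansions if CONCRETENESS.get(e, 0.0) > threshold]

print(filter_by_concreteness(list(CONCRETENESS)))
# → ['dog breed', 'puppy', 'dog running']
```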
Multi-Query Image Collection:
```python
def collect_images(expanded_queries, images_per_query=500):
    """
    Retrieve images using the expanded queries:
    - Use multiple search engines
    - Collect metadata (source URL, originating query)
    - Diversify sources to reduce bias
    """
    all_images = []
    for query in expanded_queries:
        images = search_engine.image_search(query, num_results=images_per_query)
        for img in images:
            img['source_query'] = query
        all_images.extend(images)
    return all_images
```
Initial Preprocessing:
```python
def preprocess_images(images):
    """
    - Remove duplicates (perceptual hash)
    - Validate image format
    - Resize to standard dimensions
    - Remove corrupted files
    """
    pass
```
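A minimal concrete version of this step, assuming Pillow is available; the 8×8 average hash is a simple stand-in for production perceptual hashing libraries:

```python
from PIL import Image

def average_hash(img, hash_size=8):
    """Tiny perceptual hash: downscale to grayscale, threshold at the mean."""
    small = img.convert("L").resize((hash_size, hash_size))
    pixels = list(small.getdata())
    avg = sum(pixels) / len(pixels)
    return "".join("1" if p > avg else "0" for p in pixels)

def preprocess(images, size=(224, 224)):
    """Deduplicate by perceptual hash, then resize; skips unreadable inputs."""
    seen, cleaned = set(), []
    for img in images:
        try:
            h = average_hash(img)
        except OSError:  # corrupted file
            continue
        if h in seen:  # perceptual duplicate of an earlier image
            continue
        seen.add(h)
        cleaned.append(img.convert("RGB").resize(size))
    return cleaned
```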
Feature Extraction:
```python
import numpy as np

def extract_features(images, cnn_model):
    """
    Extract deep features using a pre-trained CNN
    (e.g., VGG or ResNet activations from the penultimate layer).
    """
    features = [cnn_model.extract_features(img) for img in images]
    return np.array(features)
```
Cluster Analysis:
```python
from sklearn.cluster import KMeans

def cluster_and_filter(features, images, n_clusters=10,
                       density_threshold=0.5, min_coherence=0.5):
    """
    Cluster images by visual similarity:
    - Identify core clusters (likely relevant)
    - Remove outlier clusters (likely noise)
    - Keep images from dense, coherent clusters
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    clusters = kmeans.fit_predict(features)
    # Per-cluster statistics (density, coherence, ...), keyed by cluster id
    cluster_stats = analyze_clusters(clusters, features)
    # Keep only the ids of dense, coherent clusters
    valid_ids = {
        cid for cid, stats in cluster_stats.items()
        if stats['density'] > density_threshold
        and stats['coherence'] > min_coherence
    }
    return [img for img, cid in zip(images, clusters) if cid in valid_ids]
```
Initial CNN Training:
```python
def train_initial_classifier(clustered_images, num_classes):
    """
    Train an initial CNN classifier on the clustered data:
    - Use cluster assignments as pseudo-labels
    - Fine-tune a pre-trained model
    """
    model = load_pretrained_cnn(num_classes=num_classes)
    return fine_tune(model, clustered_images)
```
Progressive Refinement:
```python
def progressive_filtering(images, model, iterations=3):
    """
    Iteratively refine the dataset:
    1. Classify all images with the current model
    2. Remove low-confidence predictions
    3. Retrain the model on the refined set
    4. Repeat
    """
    for i in range(iterations):
        # Predict on all remaining images
        predictions = model.predict(images)
        # Keep only confident samples; the threshold tightens per iteration
        confident_samples = [
            (img, pred) for img, pred in zip(images, predictions)
            if pred['confidence'] > confidence_threshold(i)
        ]
        # Retrain on the refined set
        model = train_classifier(confident_samples)
        images = [s[0] for s in confident_samples]
    return images, model
```
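A runnable miniature of this loop, with scikit-learn's LogisticRegression standing in for the CNN and a linear threshold schedule; both are illustrative assumptions, not the original setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Two well-separated Gaussian classes with 10% injected label noise.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
flip = rng.choice(200, size=20, replace=False)
y[flip] = 1 - y[flip]

def progressive_filter(X, y, iterations=3):
    model = LogisticRegression()
    for i in range(iterations):
        model.fit(X, y)
        conf = model.predict_proba(X).max(axis=1)
        pred = model.predict(X)
        threshold = 0.6 + 0.1 * i      # tightens every round
        keep = conf > threshold
        # Keep confident samples and adopt their pseudo-labels
        X, y = X[keep], pred[keep]
    return X, y, model

X_f, y_f, model = progressive_filter(X, y)
print(f"kept {len(X_f)} of 200 samples")
```

Because the surviving samples take the model's pseudo-labels, the injected label noise is largely corrected rather than merely discarded.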
Quality Verification:
```python
def verify_dataset_quality(dataset, stl10_test, cifar10_test):
    """
    Evaluate dataset quality:
    - Cross-dataset generalization (test on STL-10, CIFAR-10)
    - Class balance analysis
    - Diversity metrics
    """
    # Train a classifier on the generated dataset
    model = train_classifier(dataset)
    # Test on external benchmark datasets
    stl10_accuracy = evaluate(model, stl10_test)
    cifar10_accuracy = evaluate(model, cifar10_test)
    return {
        'cross_dataset_acc': (stl10_accuracy + cifar10_accuracy) / 2,
        'class_balance': compute_balance(dataset),
        'diversity': compute_diversity(dataset),
    }
```
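`compute_balance` is left abstract above; one common choice (an assumption here, not necessarily the original metric) is the normalized entropy of the label distribution:

```python
import numpy as np
from collections import Counter

def class_balance(labels):
    """Normalized entropy of the label distribution: 1.0 = perfectly balanced."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(len(counts)) if len(counts) > 1 else 1.0

print(class_balance(["dog"] * 50 + ["car"] * 50))   # → 1.0
print(class_balance(["dog"] * 90 + ["car"] * 10))   # well below 1.0
```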
Export Dataset:
```
dataset/
├── train/
│   ├── class_1/
│   ├── class_2/
│   └── ...
├── val/
├── test/
├── metadata.json
└── dataset_stats.md
```
Based on original research.
```shell
# Deep learning
pip install torch torchvision   # or tensorflow

# Clustering
pip install scikit-learn

# Image processing
pip install pillow opencv-python

# N-gram data: download Google Books Ngrams from
# https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
```