Generates large-scale labeled image datasets via web scraping and LMMs like Gemini Vision, achieving ~95% metadata accuracy for classification and object detection. Use for custom ML training data when manual labeling is impractical.
```
npx claudepluginhub lunartech-x/superpowers --plugin superpowers
```
This skill provides a scalable, reusable framework for automatically generating labeled image datasets using web scraping combined with Large Multimodal Models (LMMs) for metadata generation. The methodology addresses the challenge of manual data collection being resource-intensive, error-prone, and time-consuming.
Key Capabilities:
- Multi-source web scraping (Google Images, Bing Images, domain-specific sites)
- Automatic metadata generation with an LMM (Gemini Vision or equivalent), with ~95% reported accuracy
- Rule-based filtering with confidence thresholds and category validation
- Export to standard annotation formats (COCO, CSV, YOLO) with train/val/test splits
Use this skill when:
- You need custom training data for image classification or object detection
- Manual labeling would be too resource-intensive, error-prone, or slow
- Suitable source images can be collected from the web
Define Target Categories:
List the object classes the dataset should cover (e.g., construction elements such as steel beams, columns, and roof trusses).
Design Search Queries:
```python
# Generate diverse search queries by combining each category with modifiers
categories = ["structural steel beam", "steel column construction", "roof truss"]
query_variations = [
    f"{cat} {mod}"
    for cat in categories
    for mod in ["photo", "site", "construction", "building"]
]
```
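With the three categories and four modifiers above, this expands to 12 distinct queries; growing the modifier list is a cheap way to diversify the scraped pool.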
Set Collection Parameters:
Choose how many images to collect per query (e.g., the num_images=1000 default below) and any minimum resolution or format requirements to enforce at download time.
Implement Multi-Source Scraping:
```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_images(query, num_images=1000):
    """
    Scrape images from multiple sources:
    - Google Images
    - Bing Images
    - Domain-specific sites
    """
    images = []
    # Use appropriate rate limiting
    # Respect robots.txt
    # Store source URLs for attribution
    return images
```
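As a minimal sketch of one source, the following pulls image URLs from a single results page with requests and BeautifulSoup. The endpoint, user agent, and one-second delay are assumptions, not part of the original skill; substitute a page you are actually permitted to scrape and honor its robots.txt.

```python
import time
import requests
from bs4 import BeautifulSoup

def fetch_image_urls(query, max_urls=50, delay_s=1.0):
    # Hypothetical endpoint: substitute the image-search or
    # domain-specific page you are actually permitted to scrape.
    url = f"https://example.com/search?q={requests.utils.quote(query)}"
    resp = requests.get(url, headers={"User-Agent": "dataset-builder/0.1"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if src and src.startswith("http"):
            urls.append(src)
        if len(urls) >= max_urls:
            break
    time.sleep(delay_s)  # crude rate limiting between successive requests
    return urls
```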
Image Download and Storage:
```python
def download_images(image_urls, output_dir):
    """
    Download images with:
    - Duplicate detection (hash-based)
    - Format validation
    - Resolution filtering
    - Metadata preservation
    """
    pass
```
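One possible shape for that download loop, using the imagehash and Pillow packages from the install list; the filename scheme, sidecar metadata file, and 512px minimum side are assumptions.

```python
import io
import os
import requests
import imagehash
from PIL import Image

def download_images(image_urls, output_dir, min_side=512):
    os.makedirs(output_dir, exist_ok=True)
    seen_hashes = set()
    for i, url in enumerate(image_urls):
        try:
            data = requests.get(url, timeout=10).content
            img = Image.open(io.BytesIO(data))
            img.verify()  # format validation; raises on corrupt files
            img = Image.open(io.BytesIO(data))  # reopen after verify()
        except Exception:
            continue  # skip unreachable or invalid images
        if min(img.size) < min_side:
            continue  # resolution filtering
        h = imagehash.phash(img)  # perceptual hash for duplicate detection
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        path = os.path.join(output_dir, f"{i:06d}.jpg")
        img.convert("RGB").save(path, "JPEG")
        # Preserve the source URL as sidecar metadata for attribution
        with open(path + ".src.txt", "w") as f:
            f.write(url)
```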
Initial Filtering:
Remove corrupt files, near-duplicates, and under-resolution images before the LMM stage, so API calls are only spent on plausible candidates.
Configure LMM (Gemini Vision or equivalent):
```python
import os

import google.generativeai as genai
import PIL.Image

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-flash')

def generate_metadata(image_path, categories):
    """
    Use the LMM to analyze an image and generate structured metadata.
    """
    image = PIL.Image.open(image_path)
    prompt = f"""
    Analyze this image and determine:
    1. Does it contain any of these objects: {categories}?
    2. If yes, which specific category?
    3. Confidence level (high/medium/low)
    4. Object location description (for detection tasks)
    5. Image quality assessment

    Return a structured JSON response.
    """
    response = model.generate_content([prompt, image])
    return parse_response(response.text)
```
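parse_response is not defined in the skill; a minimal sketch, assuming the model returns a JSON object possibly wrapped in surrounding text or a markdown fence:

```python
import json
import re

def parse_response(text):
    # Pull the first {...} span out of the reply, tolerating fences and prose
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {"category": None, "confidence": "low", "raw": text}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"category": None, "confidence": "low", "raw": text}
```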
Batch Processing:
```python
def process_dataset(image_dir, categories, batch_size=100):
    """
    Process images in batches with:
    - Rate limiting
    - Error handling
    - Progress tracking
    - Checkpoint saving
    """
    results = []
    # get_batches and save_checkpoint are helpers to supply (see below)
    for batch in get_batches(image_dir, batch_size):
        batch_results = [
            generate_metadata(img, categories)
            for img in batch
        ]
        results.extend(batch_results)
        save_checkpoint(results)
    return results
```
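The two helpers are assumed rather than provided by the skill; one possible shape:

```python
import json
import os

def get_batches(image_dir, batch_size):
    # Yield lists of image paths, batch_size at a time
    paths = sorted(
        os.path.join(image_dir, f)
        for f in os.listdir(image_dir)
        if f.lower().endswith((".jpg", ".jpeg", ".png"))
    )
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

def save_checkpoint(results, path="checkpoint.json"):
    # Overwrite the checkpoint so an interrupted run can resume from it
    with open(path, "w") as f:
        json.dump(results, f)
```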
Quality Metrics:
Spot-check LMM labels against a manually verified sample and track the confidence distribution over time; the original work reports roughly 95% metadata accuracy.
Apply Category Rules:
```python
def filter_by_rules(metadata, rules):
    """
    Apply domain-specific rules:
    - Minimum confidence threshold (e.g., 0.8)
    - Category-specific validation
    - Cross-reference with the search query
    """
    filtered = []
    for item in metadata:
        # Assumes the LMM's high/medium/low confidence strings have been
        # mapped to numeric scores before this step
        if item['confidence'] >= rules['min_confidence']:
            if validate_category(item, rules):
                filtered.append(item)
    return filtered
```
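validate_category is left undefined above; a minimal sketch, together with one possible mapping from the confidence strings to scores (both are assumptions, not part of the original skill):

```python
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}  # assumed mapping

def validate_category(item, rules):
    # Keep only items whose label is a known target category and is
    # consistent with the search query that surfaced the image.
    if item.get("category") not in rules["categories"]:
        return False
    query = item.get("source_query", "")
    return not query or item["category"] in query
```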
Handle Edge Cases:
Decide up front how to treat images that match multiple categories, borderline-confidence items, and LMM responses that fail to parse; routing them to a manual review queue is a safe default.
Generate Dataset Structure:
```
dataset/
├── images/
│   ├── category_1/
│   ├── category_2/
│   └── ...
├── annotations/
│   ├── metadata.json
│   └── labels.csv
├── splits/
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
└── README.md
```
Create Annotation Files:
```python
def create_annotations(filtered_data, output_dir):
    """
    Generate standard annotation formats:
    - COCO format (for object detection)
    - CSV with labels (for classification)
    - YOLO format (if needed)
    """
    pass
```
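A sketch of the classification CSV path, assuming each filtered item carries a file path and category (field names are assumptions):

```python
import csv
import os

def write_labels_csv(filtered_data, output_dir):
    # Minimal classification annotations: one (filename, label) row per image
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "labels.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        for item in filtered_data:
            writer.writerow([item["path"], item["category"]])
```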
Split Dataset:
Write stratified train/val/test splits so every category keeps the same proportion in each file under splits/; a sketch follows.
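A sketch using scikit-learn from the dependency list below; the 80/10/10 ratio and fixed seed are assumptions:

```python
from sklearn.model_selection import train_test_split

def make_splits(filenames, labels, seed=42):
    # Stratify on labels so each split preserves the class balance
    train_f, rest_f, train_l, rest_l = train_test_split(
        filenames, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    val_f, test_f, _, _ = train_test_split(
        rest_f, rest_l, test_size=0.5, stratify=rest_l, random_state=seed
    )
    return train_f, val_f, test_f  # write each list to splits/*.txt
```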
Install the dependencies used in the original research:
```bash
# Core
pip install requests beautifulsoup4 selenium pillow

# LMM
pip install google-generativeai  # or openai for GPT-4V

# Image processing
pip install imagehash opencv-python

# Dataset tools
pip install pandas scikit-learn
```