From superpowers
Automates web-scale data collection for research datasets: LLM generates search queries, navigates pages, extracts structured data, and performs quality control with human-in-the-loop oversight.
```bash
npx claudepluginhub lunartech-x/superpowers --plugin superpowers
```

This skill is limited to using the following tools:
- Enables web searches prioritizing academic/scientific sources, URL extraction for pages/PDFs, bulk data enrichment from the web, and deep research reports via parallel-cli.
- Operates the anysite CLI for web data extraction from LinkedIn/Instagram/Twitter, batch API processing, dataset pipelines with scheduling/transforms/exports, SQL queries, PostgreSQL/SQLite loading, and LLM data analysis (summarize/classify/enrich).
- Conducts AI-powered deep research on any topic via triggers like '/deep-research [topic]' or 'deep research on [topic]'. Uses interactive AskUserQuestion for focus, output, and audience selection.
This skill provides a human-in-the-loop framework for automating web-scale data collection using Large Language Models. Manual data collection is time-consuming and error-prone; this skill addresses that by automating search query generation, page navigation, structured data extraction, and quality control.
Key Innovation: Human-in-the-loop design allows researchers to inspect and adjust decisions at each stage, ensuring alignment with research objectives while mitigating LLM hallucinations and search engine bias.
Use this skill when building research datasets from web sources, when manual collection would be too slow or error-prone, or when LLM-driven extraction needs human oversight at each stage.
Define Target Dataset:
```python
dataset_spec = {
    "name": "Clinical Trial Sites",
    "description": "Collect information about clinical trial sites including location, specialties, and contact information",
    "fields": [
        {"name": "site_name", "type": "string", "required": True},
        {"name": "location", "type": "string", "required": True},
        {"name": "specialties", "type": "list", "required": False},
        {"name": "contact_email", "type": "email", "required": False},
        {"name": "phone", "type": "phone", "required": False},
        {"name": "website", "type": "url", "required": False}
    ],
    "constraints": [
        "Only include active sites",
        "Focus on US-based facilities",
        "Prefer academic medical centers"
    ]
}
```
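The spec doubles as a validation contract: the `type` and `required` fields can be enforced mechanically on every collected record. A minimal sketch (the regex-based type checks below are illustrative assumptions, not part of the skill):

```python
import re

# Illustrative type validators; real use would need stricter rules per field.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str) and v.strip() != "",
    "list":   lambda v: isinstance(v, list),
    "email":  lambda v: isinstance(v, str) and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v),
    "phone":  lambda v: isinstance(v, str) and re.match(r"^[\d\s()+.-]{7,}$", v),
    "url":    lambda v: isinstance(v, str) and v.startswith(("http://", "https://")),
}

def validate_record(record, dataset_spec):
    """Return a list of validation errors for one extracted record."""
    errors = []
    for field in dataset_spec["fields"]:
        name, ftype, required = field["name"], field["type"], field["required"]
        value = record.get(name)
        if value is None:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not TYPE_CHECKS[ftype](value):
            errors.append(f"field {name} failed {ftype} check: {value!r}")
    return errors
```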
Human Review Point: approve or modify the schema before query generation begins.
LLM-Based Query Generation:
```python
import json

def generate_search_queries(dataset_spec, llm):
    """
    Use the LLM to generate diverse search queries
    from the dataset description.
    """
    prompt = f"""
    Given this dataset specification:
    {json.dumps(dataset_spec, indent=2)}

    Generate 20 diverse search engine queries that would help
    find web pages containing this information.

    Consider:
    - Different phrasings of the same concept
    - Specific vs. general queries
    - Including and excluding certain terms
    - Different source types (directories, databases, articles)

    Return as a JSON list of queries.
    """
    response = llm.generate(prompt)  # `llm` is any text-generation client
    return parse_json(response)      # tolerant parsing of the LLM's JSON output
```
Query Diversification:
```python
def diversify_queries(initial_queries, llm):
    """
    Expand queries to reduce search engine bias:
    - Add synonyms
    - Vary query structure
    - Include different geographic modifiers
    - Add temporal modifiers if relevant
    """
    diversified = []
    for query in initial_queries:
        variations = llm.generate_variations(query)
        diversified.extend(variations)
    # Remove duplicates and near-duplicates
    return deduplicate(diversified)
```
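`deduplicate` is left abstract above. A minimal sketch using standard-library string similarity (embedding-based similarity would scale better, but this keeps the example dependency-free; the 0.9 threshold is an assumption to tune):

```python
from difflib import SequenceMatcher

def deduplicate(queries, threshold=0.9):
    """
    Drop exact and near-duplicate queries: a query is kept only if it is
    less than `threshold` similar to every query already kept.
    """
    kept = []
    for q in queries:
        qn = q.lower().strip()
        if all(SequenceMatcher(None, qn, k.lower().strip()).ratio() < threshold
               for k in kept):
            kept.append(q)
    return kept
```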
Human Review Point: add or remove queries before search execution.
Search Execution:
```python
def execute_search(queries, search_engine="google"):
    """
    Execute search queries and collect URLs.
    """
    all_results = []
    for query in queries:
        results = search_api.search(
            query,
            num_results=50,
            engine=search_engine
        )
        for result in results:
            result['source_query'] = query
        all_results.extend(results)
    return deduplicate_urls(all_results)
```
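`deduplicate_urls` can be as simple as keying results on a normalized URL. A sketch, assuming each result dict carries a `url` key and that dropping query strings and fragments is acceptable for the sources in question:

```python
from urllib.parse import urlsplit, urlunsplit

def deduplicate_urls(results):
    """
    Keep one result per normalized URL: lowercase host, drop query string
    and fragment, strip trailing slash. The first result seen wins, so its
    source_query is the one preserved.
    """
    seen = {}
    for result in results:
        parts = urlsplit(result["url"])
        key = urlunsplit((parts.scheme, parts.netloc.lower(),
                          parts.path.rstrip("/"), "", ""))
        seen.setdefault(key, result)
    return list(seen.values())
```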
Page Relevance Scoring:
```python
def score_page_relevance(url, page_content, dataset_spec, llm):
    """
    Use the LLM to assess page relevance to the dataset spec.
    """
    prompt = f"""
    Dataset objective: {dataset_spec['description']}

    Page URL: {url}
    Page content (first 5000 chars): {page_content[:5000]}

    Score this page's relevance (0-10) and explain:
    1. Does it contain relevant data points?
    2. Is the data structured or extractable?
    3. Is this a primary source or aggregator?

    Return JSON: {{"score": X, "reasoning": "...", "data_fields_present": [...]}}
    """
    return llm.generate(prompt)
```
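A usage sketch for this scorer: fetch each candidate page, parse the LLM's JSON verdict, and keep pages above a cutoff. `candidate_urls`, `fetch_page`, and the 7/10 threshold are assumptions for illustration, and real use would need tolerant parsing of the LLM output:

```python
import json

relevant_pages = []
for url in candidate_urls:
    content = fetch_page(url)
    # Assumes the LLM returned clean JSON matching the requested format
    assessment = json.loads(score_page_relevance(url, content, dataset_spec, llm))
    if assessment["score"] >= 7:
        relevant_pages.append({"url": url, "content": content, **assessment})
```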
Human Review Point: adjust the scoring criteria and spot-check borderline pages.
Schema-Guided Extraction:
```python
import json

def extract_data(page_content, dataset_spec, llm):
    """
    Extract structured data according to the schema.
    """
    prompt = f"""
    Extract the following fields from this page content:

    Fields to extract:
    {json.dumps(dataset_spec['fields'], indent=2)}

    Page content:
    {page_content}

    Rules:
    - Only extract explicitly stated information
    - Mark uncertain extractions with a confidence score
    - Return null for missing required fields
    - Flag potential hallucination risks

    Return JSON matching the schema.
    """
    extracted = llm.generate(prompt)
    return validate_extraction(extracted, dataset_spec)
```
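`validate_extraction` is not defined above; at minimum it should parse the LLM's response and enforce the spec's `required` flags (it could also run the type checks sketched in the dataset-spec step). A sketch under the assumption that the LLM returns a single JSON object:

```python
import json

def validate_extraction(extracted, dataset_spec):
    """
    Parse the LLM's JSON response and record which required fields are
    missing or null. Real use needs tolerant parsing and retry logic.
    """
    record = json.loads(extracted) if isinstance(extracted, str) else extracted
    missing = [f["name"] for f in dataset_spec["fields"]
               if f["required"] and record.get(f["name"]) is None]
    record["_validation_errors"] = [f"missing required field: {m}" for m in missing]
    return record
```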
Hallucination Mitigation:
```python
def verify_extraction(extracted_data, page_content, llm):
    """
    Verify extracted data against the source to prevent hallucination.
    """
    verification_results = []
    for field, value in extracted_data.items():
        # Check if the value appears verbatim or closely in the source
        if not find_in_source(value, page_content):
            # Use the LLM to verify the derivation
            prompt = f"""
            Verify this extraction:
            Field: {field}
            Extracted value: {value}
            Source text: {page_content}

            Is this value:
            1. Directly stated in the source
            2. Reasonably derived from the source
            3. Possibly hallucinated

            Return a confidence score and evidence.
            """
            verification = llm.generate(prompt)
            verification_results.append(verification)
    return flag_low_confidence(verification_results)
```
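The verbatim check `find_in_source` is the piece that keeps LLM verification calls rare. One possible implementation, using fuzzy matching over sliding windows (the 0.85 threshold is an assumption to tune per dataset):

```python
from difflib import SequenceMatcher

def find_in_source(value, page_content, threshold=0.85):
    """
    Report whether `value` appears (near-)verbatim in the source text.
    Lists are checked element by element; other values by their string form.
    """
    if isinstance(value, list):
        return all(find_in_source(v, page_content, threshold) for v in value)
    text = str(value).lower()
    haystack = page_content.lower()
    if text in haystack:
        return True
    # Fuzzy fallback: compare against windows of similar length in the source
    window = max(len(text), 1)
    step = max(window // 2, 1)
    for i in range(0, max(len(haystack) - window, 0) + 1, step):
        if SequenceMatcher(None, text, haystack[i:i + window]).ratio() >= threshold:
            return True
    return False
```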
Human Review Point: correct flagged extractions before validation.
Cross-Validation:
```python
def cross_validate(dataset, external_sources):
    """
    Validate extracted data against known sources.
    """
    validation_results = []
    for record in dataset:
        # Check against external databases/APIs
        external_match = lookup_external(record, external_sources)
        if external_match:
            agreement = compute_agreement(record, external_match)
            validation_results.append({
                'record': record,
                'external_match': external_match,
                'agreement': agreement
            })
    return validation_results
```
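`compute_agreement` can start as a simple field-overlap ratio. A sketch, assuming case-insensitive string equality is an acceptable first-pass comparator:

```python
def compute_agreement(record, external_match):
    """
    Fraction of shared, non-null fields whose values agree. Real use would
    need per-field comparators (e.g. phone normalization, address matching).
    """
    shared = [k for k in record if k in external_match
              and record[k] is not None and external_match[k] is not None]
    if not shared:
        return 0.0
    agree = sum(str(record[k]).strip().lower() ==
                str(external_match[k]).strip().lower() for k in shared)
    return agree / len(shared)
```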
Consistency Checks:
```python
def check_consistency(dataset):
    """
    Check for internal consistency:
    - Duplicate detection
    - Conflicting values
    - Outlier detection
    - Format validation
    """
    issues = []

    # Duplicate detection
    issues.extend(find_duplicates(dataset))

    # Value consistency (same entity, different values)
    issues.extend(find_conflicts(dataset))

    # Outlier detection
    issues.extend(detect_outliers(dataset))

    return issues
```
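Of the helpers above, conflict detection is the most dataset-specific. A sketch keyed on `site_name` to match the example spec (real use would pick whatever field identifies an entity):

```python
from collections import defaultdict

def find_conflicts(dataset, key_field="site_name"):
    """
    Group records by a key field and flag fields whose values disagree
    within a group (same entity, different values).
    """
    groups = defaultdict(list)
    for record in dataset:
        groups[str(record.get(key_field, "")).strip().lower()].append(record)
    issues = []
    for key, records in groups.items():
        if len(records) > 1:
            for field in records[0]:
                values = {str(r.get(field)) for r in records}
                if len(values) > 1:
                    issues.append({"type": "conflict", "entity": key,
                                   "field": field, "values": sorted(values)})
    return issues
```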
Human Review Point: resolve conflicts and decide how to handle outliers.
Generate Research-Ready Output:
```python
import json
import pandas as pd

def export_dataset(dataset, output_format="csv"):
    """
    Export the dataset in standard research formats.
    """
    # CSV for tabular data
    if output_format == "csv":
        pd.DataFrame(dataset).to_csv("dataset.csv", index=False)
    # JSON for nested data
    elif output_format == "json":
        with open("dataset.json", "w") as f:
            json.dump(dataset, f, indent=2)

    # Generate data dictionary and provenance log alongside the data
    generate_data_dictionary(dataset)
    generate_provenance_log(dataset)
```
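`generate_provenance_log` only works if provenance was carried through the pipeline. A sketch, assuming each record kept its `source_url`, `source_query`, and extraction `confidence`:

```python
import json

def generate_provenance_log(dataset, path="provenance.jsonl"):
    """
    Write one JSON line per record with its source URL, originating query,
    and extraction confidence, so every value can be traced back.
    """
    with open(path, "w") as f:
        for record in dataset:
            f.write(json.dumps({
                "source_url": record.get("source_url"),
                "source_query": record.get("source_query"),
                "confidence": record.get("confidence"),
            }) + "\n")
```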
Documentation:
```markdown
# Dataset Documentation

## Collection Methodology
- Queries used: [list]
- Sources searched: [list]
- Date range: [dates]

## Quality Metrics
- Total records: X
- Verified records: Y%
- Human-reviewed: Z%

## Limitations
- Search engine bias mitigation: [description]
- Known gaps: [description]

## Provenance
- Each record includes source URL
- Extraction confidence scores included
```
Human Review Checkpoints:

| Phase | Checkpoint | Decision |
|---|---|---|
| 1 | Dataset spec review | Approve/modify schema |
| 2 | Query review | Add/remove queries |
| 3 | Page relevance | Adjust scoring criteria |
| 4 | Extraction review | Correct extractions |
| 5 | Quality review | Resolve conflicts |
| 6 | Final approval | Approve dataset |
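A minimal sketch of the gate each phase passes through (the approve/edit/abort choices mirror the Decision column above; the console prompt is an assumption, and any review UI would do):

```python
def human_review_checkpoint(phase, artifact, prompt):
    """
    Show the phase's artifact (spec, queries, extractions, ...) and block
    until the researcher approves, edits, or aborts.
    """
    while True:
        print(f"\n=== Phase {phase} review: {prompt} ===")
        print(artifact)
        choice = input("[a]pprove / [e]dit / a[b]ort? ").strip().lower()
        if choice == "a":
            return artifact
        if choice == "b":
            raise SystemExit(f"Pipeline stopped at phase {phase} review")
        if choice == "e":
            artifact = input("Enter replacement (JSON or text): ")
```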
Dependencies:

```bash
# LLM
pip install openai  # or anthropic, google-generativeai

# Web scraping
pip install requests beautifulsoup4 selenium

# Data handling
pip install pandas

# Search APIs (optional)
pip install googlesearch-python
```