Automatically scrapes websites by analyzing page structure, handling pagination/anti-blocking, discovering article series using Playwright and Crawl4AI. Zero config needed.
npx claudepluginhub yrzhe/claude-skills --plugin intelligent-web-scraper
Skill contents:
experiences/
├── README.md
├── lessons_learned.md
└── site_patterns.json
references/
├── anti-block-strategies.md
├── crawl4ai-reference.md
├── link-discovery-guide.md
└── pagination-patterns.md
scripts/
├── check_deps.py
├── concurrent_scraper.py
├── config.py
├── crawl4ai_wrapper.py
├── local_browser_scraper.py
└── progress_manager.py
requirements.txt
setup.sh
A self-aware, self-learning web scraping agent that accumulates experience over time.
IMPORTANT: Before using this skill, ensure these dependencies are installed.
# Navigate to skill directory and run setup
cd ~/.claude/skills/intelligent-web-scraper
chmod +x setup.sh && ./setup.sh
# Or for VPS/headless environment:
./setup.sh --headless
The setup script will:
# Check if everything is installed correctly
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py
# Auto-install missing dependencies
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --install
# Output as JSON (for programmatic use)
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --json
If you prefer manual installation:
# 1. Install Python dependencies
pip install -r ~/.claude/skills/intelligent-web-scraper/requirements.txt
# 2. Install Playwright browser
python -m playwright install chromium
# 3. Set up Crawl4AI
crawl4ai-setup
# 4. For VPS/headless Linux, also run:
python -m playwright install-deps chromium
| Tool | Purpose | Installation |
|---|---|---|
| Playwright MCP | Browser automation | Must be configured in Claude Code MCP settings |
| Python 3.9+ | Script execution | brew install python or system package manager |
| Crawl4AI | Advanced crawling | Included in setup.sh |
# Quick verification
python -c "from crawl4ai import AsyncWebCrawler; print('Crawl4AI OK')"
python -c "from playwright.sync_api import sync_playwright; print('Playwright OK')"
# Full check
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py
For servers without GUI (VPS, Docker, CI):
# 1. Run setup in headless mode
./setup.sh --headless
# 2. The scraper will automatically use headless browser
# No GUI required - all scraping works via headless Chromium
# 3. Check configuration
python3 ~/.claude/skills/intelligent-web-scraper/scripts/config.py
Note: local_browser_scraper.py (CDP mode) requires a GUI and is not available in headless environments. Use crawl4ai_wrapper.py instead.
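For reference, crawl4ai_wrapper.py builds on Crawl4AI's AsyncWebCrawler, which runs headless Chromium by default. A minimal sketch of calling Crawl4AI directly (standard Crawl4AI quickstart API; the wrapper script adds options and progress handling on top):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Headless Chromium under the hood - works on VPS/Docker with no GUI
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())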
Use this when the user wants to scrape with their local browser (Comet, Chrome, etc.).
This allows scraping while preserving the user's login sessions and cookies!
# 1. Install websockets (one time)
pip install websockets
# 2. Launch Comet/Chrome with debugging (or use the script)
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
--launch comet \
--url "https://example.com"
# 3. Scrape data from current page
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
--extract articles \
--output data.json
Launch browser with debugging:
python scripts/local_browser_scraper.py --launch comet --url "https://douban.com"
python scripts/local_browser_scraper.py --launch chrome --url "https://example.com"
Scrape from current tab:
# Get page text
python scripts/local_browser_scraper.py --extract text
# Get all links
python scripts/local_browser_scraper.py --extract links
# Get articles
python scripts/local_browser_scraper.py --extract articles
# Custom JavaScript
python scripts/local_browser_scraper.py --extract "document.querySelectorAll('h1').length"
Scrape specific tab (by URL pattern):
python scripts/local_browser_scraper.py --url-pattern "douban.com" --extract articles
Built-in extractors: text, html, title, links, images, tables, articles, metadata
CRITICAL: Always use user's EXISTING profile to preserve login sessions!
# Comet - use existing profile
/Applications/Comet.app/Contents/MacOS/Comet \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/Library/Application Support/Comet" \
"https://example.com" &
# Chrome - use existing profile
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/Library/Application Support/Google/Chrome" \
"https://example.com" &
| Browser | User Data Directory |
|---|---|
| Comet | ~/Library/Application Support/Comet |
| Chrome | ~/Library/Application Support/Google/Chrome |
| Edge | ~/Library/Application Support/Microsoft Edge |
| Brave | ~/Library/Application Support/BraveSoftware/Brave-Browser |
NEVER use a temp directory like /tmp/browser-debug - this creates an empty profile with no logins!
Why use the existing profile: it keeps the user's logged-in sessions and cookies, so pages behind authentication load exactly as they do for the user.
| Use Local Browser | Use Playwright |
|---|---|
| User requests specific browser | Automated bulk scraping |
| Need user's login session | Don't need authentication |
| User wants to see the page | Headless scraping OK |
| Site blocks headless browsers | Standard sites |
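local_browser_scraper.py drives the debug browser over CDP/websockets. The same idea can be sketched with Playwright's connect_over_cdp (illustrative only, not how the shipped script is implemented):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Attach to the Comet/Chrome instance started with --remote-debugging-port=9222
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]                 # existing profile, logins intact
    page = context.pages[0] if context.pages else context.new_page()
    print(page.title(), page.url)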
CRITICAL: Scroll Loading Requirements
Use browser_evaluate to execute scrolling, for example:
window.scrollTo(0, document.body.scrollHeight)
CRITICAL: Detail Page Scraping Requirements
┌─────────────────────────────────────────────────────────────┐
│ SCRAPE REQUEST │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Check Experience Database │
│ - Read experiences/site_patterns.json │
│ - Look for matching domain/URL pattern │
│ - If found: Use learned selectors and strategies │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Execute Scraping │
│ - Use learned patterns OR discover new ones │
│ - Apply anti-blocking strategies │
│ - Extract data │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Learn & Record │
│ - If successful: Save patterns to site_patterns.json │
│ - If failed: Record lesson in lessons_learned.md │
│ - Update success rate and timestamps │
└─────────────────────────────────────────────────────────────┘
Location: ~/.claude/skills/intelligent-web-scraper/experiences/
experiences/
├── site_patterns.json # Learned patterns for each domain
├── lessons_learned.md # Accumulated wisdom and gotchas
└── README.md # How the learning system works
{
"domain.com": {
"url_patterns": {
"/movie/*/comments": {
"selectors": {
"container": ".comment-list",
"item": ".comment-item",
"fields": {
"author": ".comment-info a",
"content": ".comment-content span",
"rating": ".rating"
}
},
"pagination": {
"type": "page_number",
"selector": ".paginator a"
},
"scroll_loading": {
"required": true,
"max_scrolls": 20,
"scroll_delay_ms": 1500,
"notes": "Page uses infinite scroll loading"
},
"detail_links": {
"required": true,
"selector": "a.read-more, a:has-text('→')",
"fields_in_detail": ["fullContent", "images", "comments"],
"notes": "List page only has titles, need to scrape detail pages for full content"
},
"anti_block": {
"min_delay": 3,
"max_delay": 8,
"needs_login": false
},
"success_count": 15,
"last_success": "2024-01-15",
"notes": "Rate limit after 50 requests"
}
}
}
}
# ALWAYS do this first
experiences_path = "~/.claude/skills/intelligent-web-scraper/experiences/site_patterns.json"
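A minimal sketch of that lookup, assuming the site_patterns.json structure shown above (hypothetical helper, not one of the shipped scripts):

import fnmatch, json
from pathlib import Path
from urllib.parse import urlparse

def find_learned_pattern(url):
    db = Path(experiences_path).expanduser()
    if not db.exists():
        return None
    patterns = json.loads(db.read_text())
    parsed = urlparse(url)
    site = patterns.get(parsed.netloc.removeprefix("www."))
    if not site:
        return None
    for glob, entry in site.get("url_patterns", {}).items():
        if fnmatch.fnmatch(parsed.path, glob):
            return entry  # learned selectors, pagination, scroll_loading, anti_block...
    return None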
# 1. Navigate to URL
browser_navigate(url)
# 2. Wait for load
browser_wait_for(time=3)
# 3. [IMPORTANT] Scroll to load all content
browser_evaluate('''async () => {
let prevHeight = 0;
let currentHeight = document.body.scrollHeight;
let noChangeCount = 0;
while (noChangeCount < 3) {
prevHeight = currentHeight;
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 1500));
currentHeight = document.body.scrollHeight;
if (currentHeight === prevHeight) {
noChangeCount++;
} else {
noChangeCount = 0;
}
}
return document.querySelectorAll('article, [class*="item"]').length;
}''')
# 4. Get snapshot (DOM structure)
browser_snapshot()
# 5. Take screenshot for visual analysis
browser_take_screenshot()
Analyze and identify:
## Page Analysis Report
**URL**: [target]
**Domain Experience**: [Yes, used N times / No, first time]
### Structure Detected
- **Page Type**: [type]
- **Items Found**: ~[N] items on current page
- **Pagination**: [type] ([N] pages estimated)
### Extraction Plan
| Field | Selector | Sample |
|-------|----------|--------|
| title | .item-title | "Example Title" |
| price | .price | "$99.00" |
Continue scraping? [Y/n]
Apply appropriate strategy based on pagination type:
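For next-link or page-number pagination, a hedged sketch of the loop (Playwright sync API; the selectors are placeholders and would normally come from the experience database). Infinite scroll is handled by the scroll loop from the analysis step instead:

from playwright.sync_api import sync_playwright

def scrape_all_pages(start_url, item_selector=".comment-item",
                     next_selector=".paginator .next", max_pages=50):
    results = []
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_pages):
            results += page.eval_on_selector_all(
                item_selector, "els => els.map(e => e.innerText)")
            nxt = page.query_selector(next_selector)
            if not nxt:
                break                               # last page reached
            nxt.click()
            page.wait_for_load_state("networkidle")
            page.wait_for_timeout(3000)             # polite delay between pages
    return results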
When list page content is incomplete, this step is required:
// 1. Extract all entries and their detail links
const entries = await browser_evaluate(() => {
const items = document.querySelectorAll('article, .item');
return Array.from(items).map(item => ({
title: item.querySelector('h2, h3, .title')?.textContent,
summary: item.querySelector('p, .desc')?.textContent,
    detailUrl: item.querySelector('a[href*="detail"], a[href*="view"]')?.href // ':has-text()' is Playwright-only, not valid in querySelector
}));
});
// 2. Visit detail pages one by one
for (const entry of entries) {
if (entry.detailUrl) {
// Open new tab
browser_tabs({ action: 'new' });
browser_navigate(entry.detailUrl);
browser_wait_for({ time: 2 });
// Extract detail content
const detail = await browser_evaluate(() => ({
fullContent: document.querySelector('article, .content, main')?.innerText,
images: Array.from(document.querySelectorAll('img')).map(i => i.src),
metadata: { /* ... */ }
}));
entry.detail = detail;
// Close tab, return to list
browser_tabs({ action: 'close' });
    await new Promise(r => setTimeout(r, 2000)); // Polite delay between detail pages
}
}
When detail page scraping is needed:
CRITICAL: After EVERY successful scrape, update the experience database.
# Update site_patterns.json with:
{
"selectors": discovered_selectors,
"pagination": detected_pagination,
"anti_block": {
"min_delay": actual_delay_used,
"max_delay": max_delay_used,
"blocked_at": request_count_when_blocked or null
},
"success_count": previous_count + 1,
"last_success": today,
"notes": any_special_observations
}
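A sketch of that update (hypothetical helper; field names follow the site_patterns.json example above):

import json
from datetime import date
from pathlib import Path

def record_success(domain, url_pattern, learned):
    db = Path("~/.claude/skills/intelligent-web-scraper/experiences/site_patterns.json").expanduser()
    data = json.loads(db.read_text()) if db.exists() else {}
    entry = data.setdefault(domain, {}).setdefault("url_patterns", {}).setdefault(url_pattern, {})
    entry.update(learned)                                   # selectors, pagination, anti_block...
    entry["success_count"] = entry.get("success_count", 0) + 1
    entry["last_success"] = date.today().isoformat()
    db.write_text(json.dumps(data, ensure_ascii=False, indent=2))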
If scraping fails, add entry to lessons_learned.md:
## [Date] - domain.com - [URL Pattern]
**Issue**: [What went wrong]
**Cause**: [Why it happened]
**Solution**: [How to avoid next time]
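And the failure path, appending an entry in that format (hypothetical helper):

from datetime import date
from pathlib import Path

def record_lesson(domain, url_pattern, issue, cause, solution):
    path = Path("~/.claude/skills/intelligent-web-scraper/experiences/lessons_learned.md").expanduser()
    with path.open("a") as f:
        f.write(f"\n## {date.today().isoformat()} - {domain} - {url_pattern}\n"
                f"**Issue**: {issue}\n**Cause**: {cause}\n**Solution**: {solution}\n")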
| Level | Delay | Trigger |
|---|---|---|
| 0 (Normal) | 2-5s | Default |
| 1 (Caution) | 5-10s | Single 429/503 |
| 2 (Careful) | 10-20s | Repeated warnings |
| 3 (Critical) | 30-60s | Multiple blocks |
| 4 (Pause) | STOP | Captcha detected |
BLOCK_SIGNALS = [
# HTTP Status
(429, "rate_limited"),
(403, "forbidden"),
(503, "service_unavailable"),
# Page Content
("captcha", "captcha_required"),
("verification code", "captcha_required"),
("access denied", "blocked"),
("please wait", "rate_limited"),
]
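A sketch of how those signals map to the backoff levels in the table above (hypothetical helper; the real escalation logic may differ):

import random, time

LEVEL_DELAYS = {0: (2, 5), 1: (5, 10), 2: (10, 20), 3: (30, 60)}

def backoff(status, body, level):
    text = body.lower()
    if "captcha" in text or "verification code" in text:
        raise RuntimeError("Captcha detected - pause and ask the user (level 4)")
    if status in (429, 503) or "please wait" in text:
        level = min(level + 1, 3)
    elif status == 403 or "access denied" in text:
        level = min(level + 2, 3)
    time.sleep(random.uniform(*LEVEL_DELAYS[level]))
    return level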
Series discovery looks at several signals:
- Prev/Next links
- Table of contents
- URL pattern analysis
- Title similarity
score = (
url_similarity * 30 + # Same path structure
series_indicator * 25 + # Contains series keywords
position_weight * 20 + # In navigation area
title_similarity * 15 + # Similar to current title
same_domain * 10 # Must be same domain
)
# Include if score >= 50
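A sketch of the scoring (difflib ratios stand in for the similarity measures; all helper names and thresholds below are illustrative):

from difflib import SequenceMatcher

SERIES_WORDS = ("part", "chapter", "episode", "series")

def series_score(cand_url, cand_title, cur_url, cur_title, in_nav_area, same_domain):
    if not same_domain:
        return 0                                      # same domain is mandatory
    url_sim = SequenceMatcher(None, cand_url, cur_url).ratio()
    title_sim = SequenceMatcher(None, cand_title.lower(), cur_title.lower()).ratio()
    has_keyword = any(w in (cand_title + cand_url).lower() for w in SERIES_WORDS)
    return (url_sim * 30 + (25 if has_keyword else 0)
            + (20 if in_nav_area else 0) + title_sim * 15 + 10)

# Include the candidate if series_score(...) >= 50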
| Format | Use Case | Extension |
|---|---|---|
| JSON | Structured data (products, lists) | .json |
| Markdown | Article content | .md |
| CSV | Tabular data | .csv |
| Auto | Let AI decide based on content | - |
User: Scrape movie reviews from https://movie.douban.com/subject/1292052/comments
Agent:
1. [Check] No existing experience for douban.com/subject/*/comments
2. [Analyze] Opening page, taking screenshot...
- Detected: Comment list with pagination
- Container: #comments
- Items: .comment-item
- Pagination: Page numbers (.paginator)
3. [Confirm] Found ~20 comments per page, 500+ pages
Continue? [Y/n]
4. [Execute] Scraping with 3-5s delays...
5. [Learn] Saved pattern to site_patterns.json
- Next time: Will use learned selectors automatically
User: Scrape reviews for another movie https://movie.douban.com/subject/26363254/comments
Agent:
1. [Check] Found experience for douban.com/subject/*/comments
- Last used: 2024-01-15 (15 successful runs)
- Known selectors: .comment-item, .comment-content
- Known rate limit: 3-8s delay recommended
2. [Execute] Using learned patterns directly...
3. [Update] Incremented success_count to 16
User: This is part 3 of a tutorial, get the whole series
https://blog.example.com/react-tutorial/part-3
Agent:
1. [Analyze] Detected series indicators:
- Prev link: /react-tutorial/part-2
- Next link: /react-tutorial/part-4
- Sidebar TOC: 8 chapters found
2. [Discover] Complete series (8 articles):
- Part 1: Introduction
- Part 2: Components
- Part 3: State (current)
- ...
3. [Confirm] Scrape all 8 articles?
4. [Execute] Scraping series in order...
5. [Output] Saved to react-tutorial/ directory
User: Scrape https://v0.app/changelog using Comet browser
Agent:
1. [Check] Found experience for v0.app/changelog
- scroll_loading: required (infinite scroll)
- detail_links: some entries have detail pages
2. [Navigate] Opening page in Comet browser...
3. [Scroll] Scrolling to load all content...
- Scroll 1: 50 articles found
- Scroll 2: 50 articles (no change)
- Scroll 3: 50 articles (confirmed complete)
4. [Extract] Extracting list page data...
- 50 changelog entries extracted
- 12 entries have detail links (View announcement →)
5. [Detail] Scraping detail pages...
- Entry 1: "Build with Glean via MCP" → fetching detail...
- Entry 2: "Linear MCP Integration" → fetching detail...
- ... (with 2s delay between each)
6. [Save] Merged data saved to v0_changelog.json
7. [Learn] Updated pattern:
- scroll_loading.max_scrolls: 3 (sufficient)
- detail_links: most are Twitter links (skip)
Scraping can be interrupted and resumed from where it left off.
~/.claude/skills/intelligent-web-scraper/progress/
├── example_com_abc123.json # Progress for task
├── another_task.json # Another task
└── ...
Each scraping task automatically saves:
# Start scraping (progress auto-saved)
python scripts/crawl4ai_wrapper.py https://example.com --output data.json
# If interrupted (Ctrl+C), progress is saved
# Resume by running the same command again
# List all progress files
python scripts/progress_manager.py --list
# Show resumable tasks
python scripts/progress_manager.py --resumable
# Check specific task status
python scripts/progress_manager.py --status example_com_abc123
# Delete progress file
python scripts/progress_manager.py --delete example_com_abc123
# Clean up old completed tasks
python scripts/progress_manager.py --cleanup 7 # Delete >7 days old
from progress_manager import ProgressManager
# Start or resume a task
progress = ProgressManager("my_task", start_url="https://example.com")
# Check if resuming
if progress.has_progress():
print(f"Resuming from {progress.state.completed_urls} URLs")
# During scraping
for url in urls:
if progress.is_completed(url):
continue # Skip already done
data = scrape(url)
progress.mark_completed(url, data)
# Finish
progress.finish()
Scrape multiple URLs in parallel with intelligent rate limiting.
# Scrape URLs from file
python scripts/concurrent_scraper.py urls.txt --output results.json
# High concurrency (10 parallel, 3 per domain)
python scripts/concurrent_scraper.py urls.txt -c 10 -d 3
# With rate limiting
python scripts/concurrent_scraper.py urls.txt --rate-limit 5 --domain-rate-limit 1
# Custom delays
python scripts/concurrent_scraper.py urls.txt --min-delay 2 --max-delay 5
| Option | Default | Description |
|---|---|---|
| `--concurrency, -c` | 5 | Max parallel requests total |
| `--per-domain, -d` | 2 | Max parallel requests per domain |
| `--rate-limit, -r` | 10 | Global rate limit (req/s) |
| `--domain-rate-limit` | 2 | Per-domain rate limit (req/s) |
| `--min-delay` | 1.0 | Min delay between requests (s) |
| `--max-delay` | 3.0 | Max delay between requests (s) |
| `--retries` | 3 | Max retries per URL |
| `--no-progress` | - | Disable resume capability |
| `--task-id` | - | Custom task ID for progress |
from concurrent_scraper import ConcurrentScraper, ConcurrencyConfig
config = ConcurrencyConfig(
max_concurrent=5,
max_per_domain=2,
global_rate_limit=10.0,
per_domain_rate_limit=2.0,
)
scraper = ConcurrentScraper(config)
results = await scraper.scrape_urls(url_list)
# Print summary
scraper.print_summary()
┌─────────────────────────────────────────────────────────────┐
│ Request Flow │
└─────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────┐
│ Global Rate Limiter │ (10 req/s max)
│ (Token Bucket) │
└────────────────────────┘
│
▼
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Domain A │ │ Domain B │ │ Domain C │
│ 2 req/s │ │ 2 req/s │ │ 2 req/s │
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Semaphore│ │ Semaphore│ │ Semaphore│
│ (max 2) │ │ (max 2) │ │ (max 2) │
└──────────┘ └──────────┘ └──────────┘
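A sketch of the semaphore layer of that flow (illustrative shape only, not the shipped concurrent_scraper.py; the token-bucket rate limiters sit in front of this):

import asyncio
from collections import defaultdict
from urllib.parse import urlparse

class ConcurrencyLimiter:
    def __init__(self, max_concurrent=5, max_per_domain=2):
        self.global_sem = asyncio.Semaphore(max_concurrent)
        self.domain_sems = defaultdict(lambda: asyncio.Semaphore(max_per_domain))

    async def run(self, url, fetch):
        domain = urlparse(url).netloc
        # Global cap first, then the per-domain cap, as in the diagram
        async with self.global_sem, self.domain_sems[domain]:
            return await fetch(url)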
Location: ~/.claude/skills/intelligent-web-scraper/scripts/
| Script | Purpose | Usage |
|---|---|---|
| `crawl4ai_wrapper.py` | Advanced Crawl4AI wrapper | `<url> --output file.json` |
| `concurrent_scraper.py` | Concurrent multi-URL scraping | `urls.txt -c 5 -o results.json` |
| `local_browser_scraper.py` | Local browser (Comet/Chrome) CDP scraping | `--launch comet --url URL` |
| `progress_manager.py` | Progress management (resume support) | `--list` / `--resumable` |
| `check_deps.py` | Dependency checking and installation | `--check` / `--install` |
| `config.py` | Environment configuration | (displays current config) |
CRITICAL: After each scraping task is completed, the scraping script must be saved in the data output directory!
[data_output_directory]/
├── scrape_[site_name].py # Scraping script for this site
├── [site_name]_data.json # Scraped results
└── README.md # Optional: data description
Reasons:
Script naming convention: scrape_[domain_or_short_name].py
Examples: scrape_v0_changelog.py, scrape_douban_comments.py. Scripts should include:
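As a rough illustration, a saved per-site script might look like the following sketch (URL, selectors, and filenames are placeholders, not learned values; real scripts may use Crawl4AI or Playwright as appropriate):

# scrape_example_site.py - illustrative skeleton of a saved per-site script
"""Scrape https://example.com/list - what is extracted, pagination strategy, known delays."""
import asyncio, json
from pathlib import Path
from crawl4ai import AsyncWebCrawler

START_URL = "https://example.com/list"               # placeholder
OUTPUT = "example_site_data.json"

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=START_URL)
        # ...apply learned selectors / pagination handling here...
        Path(OUTPUT).write_text(json.dumps(
            {"url": START_URL, "content": str(result.markdown)},
            ensure_ascii=False, indent=2))

if __name__ == "__main__":
    asyncio.run(main())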
Detailed technical references live in references/: anti-block-strategies.md, crawl4ai-reference.md, link-discovery-guide.md, and pagination-patterns.md.
ALWAYS follow this after every scraping task:
- Check site_patterns.json before starting
- Update site_patterns.json with new/refined patterns
- Record failures in lessons_learned.md
Scroll loading fields:
- scroll_loading.required: whether scrolling is needed
- scroll_loading.max_scrolls: how many scrolls to reach the bottom
- scroll_loading.scroll_delay_ms: wait time after each scroll
Detail link fields:
- detail_links.required: whether detail scraping is needed
- detail_links.selector: detail link selector
- detail_links.fields_in_detail: fields only available on the detail page