Discovers sitemaps and APIs via interactive reconnaissance, recommends the optimal scraping strategy (sitemap, API, Playwright, or hybrid), implements it iteratively, and guides conversion into production TypeScript Apify Actors.
This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.
When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:
DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.
1. Open site in real browser (Playwright MCP)
2. Monitor network traffic (Chrome DevTools via Playwright)
3. Test site interactions
4. Assess protection mechanisms
5. Generate Intelligence Report
See: workflows/reconnaissance.md for complete reconnaissance guide with MCP examples
Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.
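For orientation, a minimal sketch of step 2 (network monitoring) in plain Playwright, outside the MCP tools. The target URL is a placeholder; the script logs every JSON response so hidden APIs surface while you interact with the page:

```typescript
import { chromium } from 'playwright';

// Headed browser so you can click around the site manually.
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Log every JSON response: hidden APIs show up here during interaction.
page.on('response', async (response) => {
  const type = response.headers()['content-type'] ?? '';
  if (type.includes('application/json')) {
    console.log(response.request().method(), response.status(), response.url());
  }
});

await page.goto('https://example.com'); // placeholder target
// The browser stays open on purpose: paginate, filter, and search,
// then review the logged endpoints for API candidates.
```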
After Phase 1 reconnaissance, validate findings with automated checks:
# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml
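These checks can also be scripted. A minimal sketch using Node's built-in fetch (the site URL is a placeholder):

```typescript
// Check robots.txt for Sitemap directives and probe common sitemap locations.
const site = 'https://example.com'; // placeholder

const robots = await fetch(`${site}/robots.txt`).then((r) => (r.ok ? r.text() : ''));
const declared = robots.match(/^sitemap:\s*(\S+)/gim) ?? [];
console.log('Declared in robots.txt:', declared);

for (const path of ['/sitemap.xml', '/sitemap_index.xml']) {
  const res = await fetch(site + path, { method: 'HEAD' });
  console.log(path, res.status); // 200 = found, 404 = keep looking
}
```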
Log findings clearly.
Why this matters: sitemaps provide instant URL discovery (roughly 60x faster than crawling).
Prompt user:
Should I check for JSON APIs first? (Highly recommended)
Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain
Check for APIs? [Y/n]
If yes, guide user:
Check common API path patterns: /api/, /v1/, /v2/, /graphql, /_next/data/.
Log findings:
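A hedged sketch of this probe-and-log step with got-scraping. The base URL and paths are illustrative; note that 401/403 responses are themselves useful signal that a route exists:

```typescript
import { gotScraping } from 'got-scraping';

const base = 'https://example.com'; // placeholder
const candidates = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

for (const path of candidates) {
  try {
    const res = await gotScraping({ url: base + path, throwHttpErrors: false });
    // 200/401/403 suggest the route exists; 404 usually means it does not.
    console.log(path, res.statusCode, res.headers['content-type']);
  } catch (err) {
    console.log(path, 'request failed:', (err as Error).message);
  }
}
```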
Automatically assess the combined Phase 1-2 findings, then present 2-3 options with clear reasoning:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required
Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
✓ Use sitemap to get all 1,234 product URLs instantly
✓ Extract product IDs from URLs
✓ Fetch data via API (fast, reliable JSON)
Estimated time: 8-12 minutes
Complexity: Low-Medium
Data quality: Excellent
Speed: Very Fast
⚡ Option 2: Sitemap + Playwright
✓ Use sitemap for URLs
✓ Scrape HTML with Playwright
Estimated time: 15-20 minutes
Complexity: Medium
Data quality: Good
Speed: Fast
🔧 Option 3: Pure API (if sitemap fails)
✓ Discover product IDs through API exploration
✓ Fetch all data via API
Estimated time: 10-15 minutes
Complexity: Medium
Data quality: Excellent
Speed: Fast
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds
Proceed with Option 1? [Y/n]
Key principle: implement the scraper incrementally, starting simple and adding complexity only as needed.
Core pattern: see workflows/implementation.md for complete implementation patterns and code examples.
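As orientation, a minimal sketch of the iterative fallback chain (mirroring examples/iterative-fallback.js). The target URL is a placeholder, and the API step is left as a stub since it depends entirely on what reconnaissance found:

```typescript
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

const target = 'https://example.com'; // placeholder target

// 1. Cheapest first: sitemap discovery.
const robots = await RobotsFile.find(target);
let urls = await robots.parseUrlsFromSitemaps();

// 2. No sitemap? Fall back to an API discovered in Phase 1 reconnaissance
//    (stub: the endpoint depends on what recon actually found).
// if (urls.length === 0) urls = await fetchUrlsFromDiscoveredApi();

// 3. Still nothing? Last resort: seed a browser crawl from the homepage
//    and discover links as you go (see strategies/playwright-scraping.md).
if (urls.length === 0) urls = [target];

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    await Dataset.pushData({ url: request.url, title: await page.title() });
  },
});
await crawler.run(urls);
```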
Convert the scraper into a production-ready Apify Actor when the user asks to productionize, deploy, or publish it.
Core pattern: scaffold with the apify create command (CRITICAL).
See: workflows/productionization.md for the complete productionization workflow, and the apify/ directory for all Actor development guides.
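Once scaffolded with apify create, the entry point follows a standard shape. A minimal sketch of src/main.ts (the startUrls input field is illustrative, not this skill's schema; define real fields in .actor/input_schema.json):

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Illustrative input shape; the real schema lives in .actor/input_schema.json.
const { startUrls = [] } = (await Actor.getInput<{ startUrls?: string[] }>()) ?? {};

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    // Push extracted items to the Actor's default dataset.
    await Actor.pushData({ url: request.url, title: await page.title() });
  },
});

await crawler.run(startUrls);
await Actor.exit();
```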
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |
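As a quick illustration of the anti-blocking row above, a hedged sketch combining proxies with crawlee's built-in fingerprint generation. The proxy URL is a placeholder; confirm authorization and see strategies/anti-blocking.md before using this:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy; supply real, authorized proxies in production.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ['http://user:pass@proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  browserPoolOptions: {
    useFingerprints: true, // crawlee rotates realistic fingerprints via fingerprint-suite
  },
  async requestHandler({ page, request }) {
    console.log(request.url, await page.title());
  },
});

await crawler.run(['https://example.com']); // placeholder target
```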
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';
// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
const data = await page.evaluate(() => ({
title: document.title,
// ... extract data
}));
await Dataset.pushData(data);
},
});
await crawler.addRequests(urls);
await crawler.run();
See examples/sitemap-basic.js for complete example.
import { gotScraping } from 'got-scraping';
const productIds = [123, 456, 789];
for (const id of productIds) {
const response = await gotScraping({
url: `https://api.example.com/products/${id}`,
responseType: 'json',
});
console.log(response.body);
}
See examples/api-scraper.js for complete example.
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();
// Extract IDs from URLs
const productIds = urls
.map(url => url.match(/\/products\/(\d+)/)?.[1])
.filter(Boolean);
// Fetch data via API
for (const id of productIds) {
const data = await gotScraping({
url: `https://api.shop.com/v1/products/${id}`,
responseType: 'json',
});
// Process data
}
See examples/hybrid-sitemap-api.js for complete example.
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
For: Step-by-step workflow guides for each phase
- workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
- workflows/implementation.md - Phase 4 iterative implementation patterns
- workflows/productionization.md - Phase 5 Apify Actor creation workflow

For: Detailed guides on specific scraping approaches

- strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
- strategies/api-discovery.md - Finding and using APIs
- strategies/playwright-scraping.md - Browser-based scraping
- strategies/cheerio-scraping.md - HTTP-only scraping
- strategies/hybrid-approaches.md - Combining strategies
- strategies/anti-blocking.md - Fingerprinting & proxies for blocked sites

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

- examples/sitemap-basic.js - Simple sitemap scraper
- examples/api-scraper.js - Pure API approach
- examples/playwright-basic.js - Basic Playwright scraper
- examples/hybrid-sitemap-api.js - Combined approach
- examples/iterative-fallback.js - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

- apify/examples/basic-scraper/ - Sitemap + Playwright
- apify/examples/anti-blocking/ - Fingerprinting + proxies
- apify/examples/hybrid-api/ - Sitemap + API (optimal)

For: Quick patterns and troubleshooting

- reference/regex-patterns.md - Common URL regex patterns
- reference/selector-guide.md - Playwright selector strategies
- reference/fingerprint-patterns.md - Common fingerprint configurations
- reference/anti-patterns.md - What NOT to do

For: Creating production Apify Actors

- apify/README.md - When and how to use Apify
- apify/typescript-first.md - Why TypeScript for actors
- apify/cli-workflow.md - apify create workflow (CRITICAL)
- apify/initialization.md - Complete setup guide
- apify/input-schemas.md - Input validation patterns
- apify/configuration.md - actor.json setup
- apify/deployment.md - Testing and deployment
- apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
- Start with the simplest approach that works.
- Always investigate before implementing.
- Build incrementally.
- When productionizing, use apify create (never manual setup).

Remember: Sitemaps first, APIs second, scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
Confirm the user has authorization before using fingerprint spoofing, proxy rotation, or anti-detection techniques — these circumvent security controls that protect the target site, and legitimate uses (own-site testing, authorized research) require explicit confirmation.
Before scraping more than 100 pages from a single site, confirm the user has reviewed the target's Terms of Service — large-scale extraction without ToS awareness exposes the user to legal liability.
When login walls are encountered, report the requirement and ask the user to provide their own credentials — do not attempt to bypass or circumvent authentication systems, as unauthorized access to protected content violates the site owner's access controls.