Discovers sitemaps and APIs via interactive reconnaissance, recommends the optimal scraping strategy (sitemap, API, Playwright, or hybrid), implements it iteratively, and guides conversion into production TypeScript Apify Actors.
This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.
When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:
DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.
1. Open site in real browser (Playwright MCP)
2. Monitor network traffic (Chrome DevTools via Playwright)
3. Test site interactions
4. Assess protection mechanisms
5. Generate Intelligence Report
See: workflows/reconnaissance.md for complete reconnaissance guide with MCP examples
Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.
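For orientation, a minimal sketch of step 2 (network monitoring) in plain Playwright, outside the MCP tools. The target URL is a placeholder; the script logs every JSON response so hidden APIs surface while you interact with the page:

```typescript
import { chromium } from 'playwright';

// Headed browser so you can click around the site manually.
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Log every JSON response: hidden APIs show up here during interaction.
page.on('response', async (response) => {
  const type = response.headers()['content-type'] ?? '';
  if (type.includes('application/json')) {
    console.log(response.request().method(), response.status(), response.url());
  }
});

await page.goto('https://example.com'); // placeholder target
// The browser stays open on purpose: paginate, filter, and search,
// then review the logged endpoints for API candidates.
```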
After Phase 1 reconnaissance, validate findings with automated checks:
# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml
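These checks can also be scripted. A minimal sketch using Node's built-in fetch (the site URL is a placeholder):

```typescript
// Check robots.txt for Sitemap directives and probe common sitemap locations.
const site = 'https://example.com'; // placeholder

const robots = await fetch(`${site}/robots.txt`).then((r) => (r.ok ? r.text() : ''));
const declared = robots.match(/^sitemap:\s*(\S+)/gim) ?? [];
console.log('Declared in robots.txt:', declared);

for (const path of ['/sitemap.xml', '/sitemap_index.xml']) {
  const res = await fetch(site + path, { method: 'HEAD' });
  console.log(path, res.status); // 200 = found, 404 = keep looking
}
```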
Log findings clearly.
Why this matters: sitemaps provide instant URL discovery (roughly 60x faster than crawling).
Prompt user:
Should I check for JSON APIs first? (Highly recommended)
Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain
Check for APIs? [Y/n]
If yes, guide user:
Check common API path patterns: /api/, /v1/, /v2/, /graphql, /_next/data/.
Log findings:
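A hedged sketch of this probe-and-log step with got-scraping. The base URL and paths are illustrative; note that 401/403 responses are themselves useful signal that a route exists:

```typescript
import { gotScraping } from 'got-scraping';

const base = 'https://example.com'; // placeholder
const candidates = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

for (const path of candidates) {
  try {
    const res = await gotScraping({ url: base + path, throwHttpErrors: false });
    // 200/401/403 suggest the route exists; 404 usually means it does not.
    console.log(path, res.statusCode, res.headers['content-type']);
  } catch (err) {
    console.log(path, 'request failed:', (err as Error).message);
  }
}
```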
Automatically assess the combined Phase 1-2 findings, then present 2-3 options with clear reasoning:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required
Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
✓ Use sitemap to get all 1,234 product URLs instantly
✓ Extract product IDs from URLs
✓ Fetch data via API (fast, reliable JSON)
Estimated time: 8-12 minutes
Complexity: Low-Medium
Data quality: Excellent
Speed: Very Fast
⚡ Option 2: Sitemap + Playwright
✓ Use sitemap for URLs
✓ Scrape HTML with Playwright
Estimated time: 15-20 minutes
Complexity: Medium
Data quality: Good
Speed: Fast
🔧 Option 3: Pure API (if sitemap fails)
✓ Discover product IDs through API exploration
✓ Fetch all data via API
Estimated time: 10-15 minutes
Complexity: Medium
Data quality: Excellent
Speed: Fast
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds
Proceed with Option 1? [Y/n]
Key principle: implement the scraper incrementally, starting simple and adding complexity only as needed.
Core pattern: see workflows/implementation.md for complete implementation patterns and code examples.
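As orientation, a minimal sketch of the iterative fallback chain (mirroring examples/iterative-fallback.js). The target URL is a placeholder, and the API step is left as a stub since it depends entirely on what reconnaissance found:

```typescript
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

const target = 'https://example.com'; // placeholder target

// 1. Cheapest first: sitemap discovery.
const robots = await RobotsFile.find(target);
let urls = await robots.parseUrlsFromSitemaps();

// 2. No sitemap? Fall back to an API discovered in Phase 1 reconnaissance
//    (stub: the endpoint depends on what recon actually found).
// if (urls.length === 0) urls = await fetchUrlsFromDiscoveredApi();

// 3. Still nothing? Last resort: seed a browser crawl from the homepage
//    and discover links as you go (see strategies/playwright-scraping.md).
if (urls.length === 0) urls = [target];

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    await Dataset.pushData({ url: request.url, title: await page.title() });
  },
});
await crawler.run(urls);
```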
Convert the scraper into a production-ready Apify Actor when the user asks to productionize, deploy, or publish it.
Core pattern: scaffold with the apify create command (CRITICAL).
See: workflows/productionization.md for the complete productionization workflow, and the apify/ directory for all Actor development guides.
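Once scaffolded with apify create, the entry point follows a standard shape. A minimal sketch of src/main.ts (the startUrls input field is illustrative, not this skill's schema; define real fields in .actor/input_schema.json):

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Illustrative input shape; the real schema lives in .actor/input_schema.json.
const { startUrls = [] } = (await Actor.getInput<{ startUrls?: string[] }>()) ?? {};

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    // Push extracted items to the Actor's default dataset.
    await Actor.pushData({ url: request.url, title: await page.title() });
  },
});

await crawler.run(startUrls);
await Actor.exit();
```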
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |
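As a quick illustration of the anti-blocking row above, a hedged sketch combining proxies with crawlee's built-in fingerprint generation. The proxy URL is a placeholder; confirm authorization and see strategies/anti-blocking.md before using this:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy; supply real, authorized proxies in production.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ['http://user:pass@proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  browserPoolOptions: {
    useFingerprints: true, // crawlee rotates realistic fingerprints via fingerprint-suite
  },
  async requestHandler({ page, request }) {
    console.log(request.url, await page.title());
  },
});

await crawler.run(['https://example.com']); // placeholder target
```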
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';
// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
const data = await page.evaluate(() => ({
title: document.title,
// ... extract data
}));
await Dataset.pushData(data);
},
});
await crawler.addRequests(urls);
await crawler.run();
See examples/sitemap-basic.js for complete example.
import { gotScraping } from 'got-scraping';
const productIds = [123, 456, 789];
for (const id of productIds) {
const response = await gotScraping({
url: `https://api.example.com/products/${id}`,
responseType: 'json',
});
console.log(response.body);
}
See examples/api-scraper.js for complete example.
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();
// Extract IDs from URLs
const productIds = urls
.map(url => url.match(/\/products\/(\d+)/)?.[1])
.filter(Boolean);
// Fetch data via API
for (const id of productIds) {
const data = await gotScraping({
url: `https://api.shop.com/v1/products/${id}`,
responseType: 'json',
});
// Process data
}
See examples/hybrid-sitemap-api.js for complete example.
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
For: Step-by-step workflow guides for each phase
- workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
- workflows/implementation.md - Phase 4 iterative implementation patterns
- workflows/productionization.md - Phase 5 Apify Actor creation workflow

For: Detailed guides on specific scraping approaches

- strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
- strategies/api-discovery.md - Finding and using APIs
- strategies/playwright-scraping.md - Browser-based scraping
- strategies/cheerio-scraping.md - HTTP-only scraping
- strategies/hybrid-approaches.md - Combining strategies
- strategies/anti-blocking.md - Fingerprinting & proxies for blocked sites

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

- examples/sitemap-basic.js - Simple sitemap scraper
- examples/api-scraper.js - Pure API approach
- examples/playwright-basic.js - Basic Playwright scraper
- examples/hybrid-sitemap-api.js - Combined approach
- examples/iterative-fallback.js - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

- apify/examples/basic-scraper/ - Sitemap + Playwright
- apify/examples/anti-blocking/ - Fingerprinting + proxies
- apify/examples/hybrid-api/ - Sitemap + API (optimal)

For: Quick patterns and troubleshooting

- reference/regex-patterns.md - Common URL regex patterns
- reference/selector-guide.md - Playwright selector strategies
- reference/fingerprint-patterns.md - Common fingerprint configurations
- reference/anti-patterns.md - What NOT to do

For: Creating production Apify Actors

- apify/README.md - When and how to use Apify
- apify/typescript-first.md - Why TypeScript for actors
- apify/cli-workflow.md - apify create workflow (CRITICAL)
- apify/initialization.md - Complete setup guide
- apify/input-schemas.md - Input validation patterns
- apify/configuration.md - actor.json setup
- apify/deployment.md - Testing and deployment
- apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
- Start with the simplest approach that works.
- Always investigate before implementing.
- Build incrementally.
- When productionizing, use apify create (never manual setup).

Remember: Sitemaps first, APIs second, scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
Confirm the user has authorization before using fingerprint spoofing, proxy rotation, or anti-detection techniques — these circumvent security controls that protect the target site, and legitimate uses (own-site testing, authorized research) require explicit confirmation.
Before scraping more than 100 pages from a single site, confirm the user has reviewed the target's Terms of Service — large-scale extraction without ToS awareness exposes the user to legal liability.
When login walls are encountered, report the requirement and ask the user to provide their own credentials — do not attempt to bypass or circumvent authentication systems, as unauthorized access to protected content violates the site owner's access controls.