From agent-almanac
Extracts data from JS-rendered or anti-bot-protected web pages using scrapling Python library's tiered fetchers (HTTP, stealth Chromium, browser automation) and CSS selectors. Use when WebFetch fails for dynamic or protected content.
npx claudepluginhub pjt222/agent-almanac
Extract data from web pages that resist simple HTTP requests — JS-rendered content, Cloudflare-protected sites, and dynamic SPAs — using scrapling's three-tier fetcher architecture and CSS-based data extraction.
Use when WebFetch or requests.get() returns empty or blocked responses.
Determine which scrapling fetcher matches the target site's defenses.
# Decision matrix:
# 1. Fetcher — static HTML, no JS, no anti-bot (fastest)
# 2. StealthyFetcher — Cloudflare/Turnstile, TLS fingerprint checks
# 3. DynamicFetcher — JS-rendered SPAs, click/scroll interactions
# Quick probe: try Fetcher first, escalate on failure
from scrapling import Fetcher
fetcher = Fetcher()
response = fetcher.get("https://example.com/target-page")
if response.status == 200 and response.get_all_text():
    print("Fetcher tier sufficient")
else:
    print("Escalate to StealthyFetcher or DynamicFetcher")
| Signal | Recommended Tier |
|---|---|
| Static HTML, no protection | Fetcher |
| 403/503, Cloudflare challenge page | StealthyFetcher |
| Page loads but content area is empty | DynamicFetcher |
| Need to click buttons or scroll | DynamicFetcher |
| altcha CAPTCHA present | None (cannot be automated) |
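The table can also be read as a small decision helper. A minimal sketch (the function name recommend_tier is hypothetical; it uses only the signals listed above):
# Hedged sketch: map response signals from the table to a tier name.
def recommend_tier(response):
    text = response.get_all_text().lower()
    if "altcha" in text or "proof of work" in text:
        return None  # proof-of-work CAPTCHA: cannot be automated
    if response.status in (403, 503):
        return "StealthyFetcher"  # challenge page or TLS fingerprint check
    if not text.strip():
        return "DynamicFetcher"  # page loaded but content renders client-side
    return "Fetcher"  # static HTML was enough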
Expected: One of the three tiers is identified. For most modern sites, StealthyFetcher is the correct starting point.
On failure: If all three tiers return blocked responses, check whether the site uses altcha CAPTCHA (proof-of-work challenge that cannot be bypassed). If so, document the limitation and provide manual extraction instructions instead.
Set up the selected fetcher with appropriate options.
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher
# Tier 1: Fast HTTP with TLS fingerprint impersonation
fetcher = Fetcher()
fetcher.configure(
    timeout=30,
    retries=3,
    follow_redirects=True
)
# Tier 2: Headless Chromium with anti-detection
fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True  # wait for all network requests to settle
)
# Tier 3: Full browser automation
fetcher = DynamicFetcher()
fetcher.configure(
    headless=True,
    timeout=90,
    network_idle=True,
    wait_selector="div.results"  # wait for a specific element before extracting
)
Expected: Fetcher instance is configured and ready. No errors on instantiation. For StealthyFetcher and DynamicFetcher, a Chromium binary is available (scrapling manages this automatically on first run).
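If the automatic Chromium setup fails, installing the binary and retrying usually recovers. A minimal sketch, assuming a failed instantiation or configure() signals a missing binary (the exact failure point may vary):
# Hedged sketch: install Chromium on first failure, then retry once.
import subprocess
import sys
from scrapling import StealthyFetcher

try:
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60)
except Exception:
    # Browser binary likely missing; install it, then retry.
    subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"], check=True)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60)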
On failure:
- playwright or browser binary not found -- run python -m playwright install chromium
- Timeouts after configure() -- increase the timeout value or check network connectivity
- ImportError for scrapling -- pip install scrapling
Navigate to the target URL and extract structured data using CSS selectors.
# Fetch the page
response = fetcher.get("https://example.com/target-page")
# Single element extraction
title = response.find("h1.page-title")
if title:
    print(title.get_all_text())
# Multiple elements
items = response.find_all("div.result-item")
for item in items:
    name = item.find("span.name")
    price = item.find("span.price")
    if name and price:  # guard: find() returns None when a selector misses
        print(f"{name.get_all_text()}: {price.get_all_text()}")
# Get attribute values
links = response.find_all("a.product-link")
urls = [link.get("href") for link in links]
# Get raw HTML content of an element
detail_html = response.find("div.description").html_content
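When each item yields several fields, collecting them into dicts keeps extraction to a single pass. A short sketch reusing the hypothetical selectors above:
# Hedged sketch: assemble per-item fields into structured records.
records = []
for item in response.find_all("div.result-item"):
    name = item.find("span.name")
    link = item.find("a.product-link")
    records.append({
        "name": name.get_all_text() if name else None,
        "url": link.get("href") if link else None,
    })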
Key API reference:
| Method | Purpose |
|---|---|
| response.find("selector") | First matching element |
| response.find_all("selector") | All matching elements |
| element.get("attr") | Attribute value (href, src, data-*) |
| element.get_all_text() | All text content, recursively |
| element.html_content | Raw inner HTML |
Expected: Extracted data matches the visible page content. Elements are non-None and text content is non-empty for populated pages.
On failure:
- find() returns None -- inspect the actual HTML (response.html_content) to verify the selector; the page may use different class names than expected
- Empty get_all_text() -- content may be inside a shadow DOM or an iframe; try DynamicFetcher with a wait_selector
- AttributeError on .css_first() -- this is not part of the scrapling API (common confusion with other libraries)
Implement fallback logic for CAPTCHA detection, empty responses, and session requirements.
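Before the full tiered fallback below, the most common single recovery for empty content is a re-fetch on the browser tier with a wait_selector. A minimal sketch, assuming the same hypothetical URL and selector used above:
# Hedged retry: re-fetch with the browser tier and wait for the content node.
from scrapling import DynamicFetcher

fetcher = DynamicFetcher()
fetcher.configure(
    headless=True,
    timeout=90,
    network_idle=True,
    wait_selector="div.result-item"  # block until the results render
)
response = fetcher.get("https://example.com/target-page")
items = response.find_all("div.result-item")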
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

def scrape_with_fallback(url, selector):
    """Try each fetcher tier in order, with CAPTCHA detection."""
    tiers = [
        ("Fetcher", Fetcher),
        ("StealthyFetcher", StealthyFetcher),
        ("DynamicFetcher", DynamicFetcher),
    ]
    for tier_name, tier_class in tiers:
        fetcher = tier_class()
        if tier_name == "Fetcher":
            fetcher.configure(timeout=60)  # plain HTTP tier: headless does not apply
        else:
            fetcher.configure(headless=True, timeout=60)
        try:
            response = fetcher.get(url)
        except Exception as error:
            print(f"{tier_name} failed: {error}")
            continue
        # Detect CAPTCHA / challenge pages
        page_text = response.get_all_text().lower()
        if "altcha" in page_text or "proof of work" in page_text:
            print("altcha CAPTCHA detected -- cannot automate")
            return None
        if response.status in (403, 503):
            print(f"{tier_name} blocked (HTTP {response.status}), escalating")
            continue
        result = response.find(selector)
        if result and result.get_all_text().strip():
            return result.get_all_text()
        print(f"{tier_name} returned empty content, escalating")
    print("All tiers exhausted. Manual extraction required.")
    return None
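A usage sketch (the URL and selector are placeholders):
# Example call with a hypothetical target page and selector
text = scrape_with_fallback("https://example.com/target-page", "div.results")
if text:
    print(text[:200])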
Expected: Function returns extracted text on success, or None with a diagnostic message when all tiers fail. CAPTCHA pages are detected and reported rather than retried indefinitely.
On failure:
- scrape_with_fallback() returns None for every target -- the site likely cannot be automated; document the limitation and fall back to manual extraction as noted in the tier-selection step
Implement delays and respect site policies before running at scale.
import time
import urllib.robotparser
from scrapling import StealthyFetcher

def check_robots_txt(base_url, target_path):
    """Check if scraping is allowed by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch("*", f"{base_url}{target_path}")

def scrape_urls(urls, selector, delay=1.0):
    """Scrape multiple URLs with rate limiting."""
    results = []
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60)
    for url in urls:
        response = fetcher.get(url)
        data = response.find(selector)
        if data:
            results.append(data.get_all_text())
        time.sleep(delay)  # respect the server
    return results
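Combining the two helpers, gate the bulk run on robots.txt. A usage sketch (the base URL and paths are placeholders):
# Hedged usage: check robots.txt before the bulk scrape (hypothetical paths).
base = "https://example.com"
paths = ["/catalog?page=1", "/catalog?page=2"]
if all(check_robots_txt(base, p) for p in paths):
    rows = scrape_urls([base + p for p in paths], "div.result-item", delay=2.0)
    print(f"Extracted {len(rows)} items")
else:
    print("robots.txt disallows one or more paths -- aborting")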
Ethical scraping checklist:
- Check robots.txt before scraping -- respect Disallow directives
Expected: Scraping runs at a controlled rate. robots.txt is checked before bulk operations. No 429 responses are triggered.
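If a 429 does slip through despite the delay, back off exponentially rather than retrying immediately. A minimal sketch (the helper name and doubling schedule are assumptions):
# Hedged sketch: exponential backoff on HTTP 429 (Too Many Requests).
import time

def fetch_with_backoff(fetcher, url, max_attempts=4, base_delay=2.0):
    delay = base_delay
    for _ in range(max_attempts):
        response = fetcher.get(url)
        if response.status != 429:
            return response
        time.sleep(delay)  # server asked us to slow down
        delay *= 2  # double the wait after each retry
    return None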
On failure:
- robots.txt disallows the path -- respect the directive; do not override it
Verification checklist:
- The configure() method is used (not deprecated constructor kwargs)
- The .find() / .find_all() API is used (not .css_first() or other library methods)
- robots.txt is checked before bulk operations
Common pitfalls:
- Using .css_first() instead of .find(): scrapling uses .find() and .find_all() for element selection -- .css_first() belongs to a different library and will raise AttributeError
- Starting with DynamicFetcher: try Fetcher first, then escalate -- DynamicFetcher is 10-50x slower due to full browser startup
- Passing options to the constructor instead of configure(): scrapling v0.4.x deprecated passing options to the constructor; always use the configure() method