Rotates proxies (datacenter, residential, mobile) for web scraping when stealth techniques fail. Integrates with scrapling, supports session stickiness, and monitors cost and pool health. For legitimate traffic and ethical use only.
Network-layer escalation for scraping campaigns where client-side stealth has already been exhausted. Proxy rotation is a last resort, not a default — it is expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how to use it well.
Use when:
- headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) has been tried and the target still returns 403/429/geo-blocks
- robots.txt permits the path

Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would be circumventing geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.
Gate the entire workflow on a documented legal and ethical review. Skipping this step is the single biggest source of harm.
```python
# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if the scope is large?
```
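One way to keep this gate from being skipped is to make it executable: a preflight function that refuses to run until every question has a recorded, justified answer. This is a minimal sketch; the function name, question keys, and answer format are illustrative assumptions, not part of any library.

```python
# Hypothetical preflight gate: answers come from a written review, not guesses.
REVIEW_QUESTIONS = [
    "data_is_public",
    "robots_txt_permits_path",
    "tos_allows_automation",
    "personal_data_has_legal_basis",
    "no_geo_or_paywall_circumvention",
    "no_adequate_public_api",
    "owner_contacted_if_large_scope",
]

def preflight(answers: dict) -> None:
    # Every question needs an explicit ok=True plus written notes; the first
    # missing or falsy answer stops the procedure (fail closed).
    for q in REVIEW_QUESTIONS:
        entry = answers.get(q)
        if not entry or not entry.get("ok") or not entry.get("notes"):
            raise RuntimeError(f"Legal/ethical review incomplete: {q}")

preflight({
    "data_is_public": {"ok": True, "notes": "No login; pages are search-indexed."},
    # ... remaining answers; any gap raises before a single request is sent.
})
```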
Expected: Every question has a defensible written answer. The first "no" or "unknown" stops the procedure until resolved.
On failure:
Different pool types have different cost, detectability, and ethical profiles. Pick the cheapest tier that actually solves your block.
| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |
Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. The pool operator's consent model varies — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending your scraping traffic through your home router, do not send yours through theirs.
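If you want the decision recorded in code rather than prose, a small mapping from the observed block to the cheapest viable tier doubles as documentation. A sketch; the signal names are illustrative assumptions:

```python
# Hypothetical tier picker: cheapest pool that clears the observed block.
def choose_pool_tier(block_signal: str) -> str:
    tiers = {
        "geo_shift_only": "datacenter",          # no real anti-bot, wrong country only
        "datacenter_asn_blocked": "residential",  # 403s a home connection doesn't get
        "residential_blocked": "mobile",          # rare; carrier-grade NAT, last resort
    }
    if block_signal not in tiers:
        raise ValueError("No matching block signal; rotation may not be needed at all.")
    return tiers[block_signal]
```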
Expected: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).
On failure:
Wire the proxy into scrapling fetchers. Read credentials from environment
variables — never hard-code, never commit a .env to git.
```python
import os
import random

from scrapling import StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # http://user:pass@gateway.example:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated proxy URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)
```
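Before the real scrape, it is worth confirming that rotation actually happens. A sketch using the ipify echo endpoint mentioned below; it assumes the scrapling response exposes the raw body as `.body`:

```python
# Each rotated request should exit through a different IP (most of the time).
seen_ips = {fetch_with_rotation("https://api.ipify.org").body for _ in range(5)}
print(f"{len(seen_ips)} distinct egress IPs across 5 requests")
assert len(seen_ips) > 1, "Every request exited the same IP; rotation is not working."
```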
Expected: Requests succeed and the egress IP varies between calls.
Confirm by hitting an IP-echo endpoint (e.g. https://api.ipify.org) before
running the real scrape.
On failure:
Decide rotation granularity per workload, then keep the pool healthy.
```python
# Sticky session for stateful flows (login, multi-page checkout-like crawls).
# Most providers expose a session ID via the username:
#   user-session-abc123:pass@gateway.example:7777
# All requests with the same session ID exit through the same IP for ~10 min.
# Per-request rotation (Pattern B above) stays the default for anonymous bulk scraping.
```
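Since the session ID rides in the proxy username, a tiny helper can build sticky URLs on demand. This is a sketch: the `user-session-<id>` username format mirrors the comment above, but every provider has its own convention, so check yours.

```python
# Hypothetical helper; username format follows the provider example above.
def sticky_proxy_url(user, password, session_id, gateway="gateway.example:7777"):
    # Same session_id => same exit IP (until the provider's sticky TTL expires).
    return f"http://{user}-session-{session_id}:{password}@{gateway}"

login_fetcher = StealthyFetcher()
login_fetcher.configure(
    headless=True,
    timeout=60,
    proxy=sticky_proxy_url(os.environ["PROXY_USER"], os.environ["PROXY_PASS"], "checkout42"),
)
```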
```python
import time

# Pool health check — call before a bulk run; returns only responsive proxies.
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            f = StealthyFetcher()
            f.configure(proxy=proxy, timeout=10)
            r = f.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass
    return alive

# Backoff on transient proxy failures (407 proxy auth, 502/503 gateway errors).
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass
        time.sleep(2 ** attempt)
    return None
```
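Wiring the health check into a run is then one line of filtering; a sketch:

```python
live = check_pool(POOL, sample_size=len(POOL))  # probe the whole pool before a bulk run
if not live:
    raise RuntimeError("No live proxies; abort before spending a single credit.")
POOL[:] = live  # mutate in place so fetch_with_rotation only sees healthy exits
```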
Expected: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies are skipped instead of looping.
On failure:
Proxy traffic has a per-GB cost and a per-request cost. Runaway scrapers generate runaway invoices. Always include limits and an abort.
```python
import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:  # target_urls: your list of URLs to scrape
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation
```
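The Expected check below asks for per-proxy success rates, which ScrapeBudget does not track by itself. A sketch of a scoreboard; it assumes you also modify fetch_with_rotation to return the proxy it picked:

```python
from collections import defaultdict

# Hypothetical per-proxy scoreboard so a consistently bad egress IP can be
# identified and excluded from the pool.
proxy_stats = defaultdict(lambda: {"ok": 0, "fail": 0})

def record_proxy(proxy, success):
    proxy_stats[proxy]["ok" if success else "fail"] += 1

def report_proxies():
    for proxy, s in sorted(proxy_stats.items()):
        total = s["ok"] + s["fail"]
        print(f"{proxy}: {s['ok']}/{total} succeeded")
```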
Expected: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.
On failure:
Final checks:
- No credentials committed or logged (grep for `gateway.`, `proxy=`, the provider hostname)
- `.env` (or equivalent) is in `.gitignore`
- robots.txt is still respected — rotation does not override it

Anti-patterns:
- Jumping to rotation without trying StealthyFetcher and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
- Ignoring robots.txt because "we have rotation now": rotation does not grant permission. The directive is the directive.