Local-Web-Capture
General-purpose web page scraper via the user's localhost (preserves Israeli IP for geo-restricted sites). Handles any page that is NOT a clean article — SPAs, branch locators, product catalogs, search result pages, government portals, dashboards, API endpoints discovered via devtools, or anything requiring raw HTML, rendered DOM, network/XHR responses, or structured extraction. Supports scripted user interaction (click "load more", scroll, type into search, iterate a city dropdown) for sites that hide data behind UI actions — a naive "scrape this URL" will return nothing on those. Always tries to identify and call an unauthenticated backend JSON API first — UI automation is the fallback, not the default. Canonical user invocation provides a triple: (URL, interactive target to click/type/scroll, data to extract) — the skill loads the URL, activates the target in a loop until it stops producing new content, then extracts the requested data. Trigger phrases: "scrape this page", "dump the HTML", "find the API behind this page", "capture this SPA", "get all the X from this page", "extract the JSON", "scrape branch list / store list / catalog", "load URL, click this, then extract that", "I had to click to load everything", any non-article URL. For clean article extraction use `scrape-article` instead.
`npx claudepluginhub danielrosehill/claude-code-plugins --plugin Local-Web-Capture`

This skill uses the workspace's default tool permissions.
Flexible scraper for arbitrary web pages. Unlike `scrape-article`, does not assume a readability-friendly article body exists. Supports four output modes: raw HTML, rendered DOM HTML, extracted structured data (JSON/CSV), and network-response capture (finding the backing API of an SPA).
Requests must originate from this machine. Do not route through a hosted reader (Jina, Firecrawl SaaS, ScrapingBee, etc.). The whole reason this plugin exists is to use the user's Israeli IP.
Before reaching for any browser automation, try to identify and call the site's backend API directly. Most modern SPAs are a thin shell around a JSON API, and that API is usually reachable unauthenticated from the same origin — once you know the endpoint and any query params, a plain Fetcher.get (rung 1) replaces everything else.
Workflow:
1. Load the page once in `network` mode (Playwright with a response listener) and list every JSON endpoint it hits. Also watch for interactions (search, filter, paginate) that trigger additional endpoints — run them once, capture the URL templates.
2. Replay the most promising endpoint with `curl` or `Fetcher.get`. Confirm it returns the same data unauthenticated, outside the browser.
3. If it does, record the endpoint in `sites.yaml` under `api_endpoints:` and use it going forward. Skip browser automation entirely on future runs. This is dramatically faster, more reliable, and handles pagination without clicking.
4. If the API is blocked or requires auth, fall back to UI automation (`interactive` mode). Report both attempts in the output: "Tried API at <url> → <result>. Fell back to UI interaction because <reason>." This way the user can see whether the API route is worth pushing on (e.g. via a captured auth token) before accepting the slower UI path.
Do not skip the API check because "this SPA probably needs Playwright." Check first — the check costs one page load.
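As a concrete sketch of that rung-1 replacement — assuming a JSON endpoint has already been spotted in a capture; the URL and the body-parsing line are illustrative, not a confirmed API of any real site:

```python
import json
from scrapling.fetchers import Fetcher

# Hypothetical endpoint discovered via network capture
api_url = "https://example.co.il/api/branches?city=all"
page = Fetcher.get(api_url, stealthy_headers=True, follow_redirects=True, timeout=20)
if page.status == 200:
    data = json.loads(page.html_content)  # raw response body (attribute per the rung-1 snippet below)
```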
Ask (or infer from the goal):
- `raw` — plain HTML as served. Use for server-rendered sites, or to inspect the initial payload of an SPA before JS runs.
- `rendered` — HTML after JS execution. Use for SPAs where content is injected client-side on load with no user interaction required.
- `interactive` — rendered + scripted user interactions (click "load more" buttons, scroll, type into a search box, select a dropdown, paginate). Many SPAs hide data behind a "load all" or per-city filter — e.g. the Israel Post branch locator requires clicking through cities or a "show all branches" button before the full list appears. A naive one-shot scrape of the landing URL will return an empty or partial dataset. If the user says anything like "I had to click to load all the branches", the mode is `interactive`.
- `extract` — structured data pulled from the DOM or embedded JSON blobs. User specifies what to extract (e.g. "all branch ids and names", "product name + price", "table rows"). Often pairs with `interactive` or `network`.
- `network` — capture XHR / fetch responses while loading (and optionally interacting with) the page. Use when the goal is to discover the backing API of an SPA — this is what you want for branch locators, store finders, autocomplete endpoints. Often the right first move for SPAs: catch the API once, then skip the browser on future runs.

Decision order for an unknown SPA:
1. `network` first — if the page fires a clean JSON endpoint, use that directly from here on.
2. `interactive` + `network` combined: script the interactions while capturing responses.
3. DOM scraping (`extract` off a `rendered` / `interactive` capture) if no JSON endpoint is exposed.

Do not default to `rendered` — it is the slowest. Start at `raw` unless the user's goal obviously needs JS.
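A cheap way to honor that rule is to probe the raw HTML before escalating; a minimal sketch, where the marker string is a hypothetical token you expect to see in the target data:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get(url, stealthy_headers=True, follow_redirects=True, timeout=20)
# If the target data is already in the served HTML, rung 1 suffices — no JS needed.
needs_js = "branch-list" not in page.html_content  # hypothetical marker
```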
`interactive` mode — scripted user actions

Use when the data only appears after the user does something.
The cleanest way to drive this mode is for the user to provide three things:
1. the URL to load,
2. the interactive target (what to click, type into, or scroll),
3. the data to extract once the page settles.
Given that triple, the recipe is fixed:
```
load URL
→ repeat: activate the interactive target, wait for new content
  (stop when the target disappears, is disabled, or row count stops growing)
→ once settled, extract the requested data from the final DOM
→ save as JSON (with a sidecar raw HTML dump in case extraction needs tweaking)
```
Resolve the target by kind:
- Button ("load more"): click it in a loop until it disappears or stops adding rows.
- Dropdown: enumerate the `<option>` values, load each, concatenate + dedupe results.
- Search box: `page.type(sel, query, delay=120)`, wait for suggestions, click the match.
- Infinite scroll: `page.mouse.wheel` until `document.body.scrollHeight` stops growing.

Always pair with network capture so you keep raw JSON payloads even if the DOM extraction is brittle.
Echo back what you resolved before running: "Loading X, will click selector Y until it stops producing new rows, then extract Z." One-line confirmation, then proceed under auto mode.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(locale="he-IL").new_page()
    page.goto(url, wait_until="networkidle")
    # Click "load more" until it disappears or stops adding rows
    while True:
        btn = page.query_selector("button:has-text('טען עוד')")  # "load more" — or data-testid, etc.
        if not btn or not btn.is_visible():
            break
        before = page.eval_on_selector_all("ul.branches li", "els => els.length")
        btn.click()
        page.wait_for_timeout(800)
        after = page.eval_on_selector_all("ul.branches li", "els => els.length")
        if after == before:
            break
    html = page.content()
    browser.close()
```
If the page filters by city and there's no "all" option, enumerate the city `<option>` values and load each in turn, concatenating results. Deduplicate by the id field the API returns.
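A minimal sketch of that loop — the selectors and the `data-id` attribute are hypothetical; adapt them to the actual page:

```python
# Enumerate the city <option> values (skipping empty placeholder options)
values = page.eval_on_selector_all(
    "select#city option", "els => els.map(e => e.value).filter(Boolean)"
)
branches = {}
for v in values:
    page.select_option("select#city", v)   # triggers the filtered reload
    page.wait_for_timeout(1000)
    for el in page.query_selector_all("ul.branches li"):
        branches[el.get_attribute("data-id")] = el.inner_text()  # dedupe by id
```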
Some sites only fire the search on per-character input (e.g. Maccabi medicines). Use `page.type(selector, query, delay=120)` rather than `fill`, then wait for the suggestion dropdown.
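For example — selector and query are hypothetical:

```python
# Per-character typing fires the keyup-driven search; fill() would not.
page.type("input#drug-search", "אקמול", delay=120)   # "Acamol", a sample query
page.wait_for_selector("ul.suggestions li")
page.click("ul.suggestions li:first-child")
```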
```python
# Scroll until the page height stops growing (infinite scroll)
prev_height = 0
while True:
    page.mouse.wheel(0, 4000)
    page.wait_for_timeout(600)
    h = page.evaluate("document.body.scrollHeight")
    if h == prev_height:
        break
    prev_height = h
```
Pair with `network` capture

Run the interaction script with the response listener from `network` mode attached. That way, even if the DOM scrape is brittle, you have the raw JSON payloads saved.
If the interaction required is ambiguous ("click through all the cities" — are there 50 or 500? captcha? rate-limit?), do one city end-to-end first, show the user the result, confirm before looping. Auto mode does not license a 10-minute headless browser run without checkpointing.
Same rungs as `scrape-article`. Consult `sites.yaml` first.
Fetcher (raw)

```python
from scrapling.fetchers import Fetcher
page = Fetcher.get(url, stealthy_headers=True, follow_redirects=True, timeout=20)
# page.status, page.html_content
```
StealthyFetcher (Camoufox, rendered + anti-bot)

```python
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch(url, headless=True, network_idle=True, humanize=True)
```
PlayWrightFetcher (rendered, no stealth)

```python
from scrapling.fetchers import PlayWrightFetcher
page = PlayWrightFetcher.fetch(url, headless=True, network_idle=True)
```
Login walls: defer to `scrape-authenticated`.
`network` mode — capturing API calls

For SPAs the fastest win is usually to identify the JSON endpoint powering the UI and call it directly. Two approaches:
```python
from playwright.sync_api import sync_playwright
import json, pathlib

responses = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx = browser.new_context(locale="he-IL")
    page = ctx.new_page()

    def on_response(resp):
        ct = resp.headers.get("content-type", "")
        if "json" in ct or resp.url.endswith(".json"):
            try:
                responses.append({"url": resp.url, "status": resp.status, "body": resp.json()})
            except Exception:
                pass

    page.on("response", on_response)
    page.goto(url, wait_until="networkidle")
    # Optionally interact: page.fill("input#city", "ירושלים")  # "Jerusalem"; then page.wait_for_timeout(2000)
    browser.close()

pathlib.Path(out_path).write_text(json.dumps(responses, ensure_ascii=False, indent=2))
```
After capture, report the distinct endpoint hosts + paths so the user can decide which to call directly next time.
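A small helper for that report, assuming the responses file written by the script above (same `out_path`):

```python
import json, pathlib
from urllib.parse import urlparse

responses = json.loads(pathlib.Path(out_path).read_text())
# Distinct host + path, query strings stripped
endpoints = sorted({urlparse(r["url"]).netloc + urlparse(r["url"]).path for r in responses})
print("\n".join(endpoints))
```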
Then replay the chosen endpoint with `curl` or `Fetcher.get`, with headers copied from devtools. Record the endpoint under the domain's entry in `sites.yaml` as `api_endpoints:`.
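A replay sketch — the endpoint, params, and header values are placeholders for whatever devtools actually showed (written with `requests` for clarity; `curl` with `-H` flags is equivalent):

```python
import requests

# Hypothetical endpoint + headers copied from the devtools request pane
resp = requests.get(
    "https://example.co.il/api/branches",
    params={"city": "all"},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "Referer": "https://example.co.il/branches",
    },
    timeout=20,
)
resp.raise_for_status()
branches = resp.json()
```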
`extract` mode — structured data out of the DOM

Prefer, in order:
<script type="application/ld+json">, <script id="__NEXT_DATA__">, window.__INITIAL_STATE__ = .... Parse that directly; it's the cleanest source.page.css("selector").attrib["data-id"] + .text for each row. Preserve Hebrew verbatim.Do not scrape heuristically if a JSON source exists on the page.
CLI fallbacks:
- `curl -sL -A "Mozilla/5.0" "<url>"` — sanity check for `raw` mode when Scrapling acts up.
- `wget -qO- "<url>"` — same.
- `jq` — for slicing captured JSON.

Save under the resolved capture root (per reference/save-location.md), subdirectory `pages/YYYY/MM/`.
Filename: `YYYY-MM-DD--HHMM--<domain>--<slug>--<shorthash>.<ext>`
Extensions:
- `.html` for raw / rendered
- `.json` for network / extract
- `.md` only if the user explicitly wants a markdown rendering

Write a sidecar `.meta.json` next to every capture:
```json
{
  "url": "...",
  "domain": "...",
  "captured_at": "<ISO now, Israel time>",
  "mode": "raw|rendered|extract|network",
  "extractor": "scrapling-static|scrapling-stealth|scrapling-playwright|playwright-direct|curl",
  "rung": 1,
  "status": 200,
  "notes": "optional — flags like 'captcha hit', 'login wall', 'found API at /api/branches'"
}
```
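A sketch of writing that sidecar, assuming the capture was already saved to a `pathlib.Path` named `capture_path` and that `url`, `domain`, and `page` are in scope (field values illustrative):

```python
import json
from datetime import datetime
from zoneinfo import ZoneInfo

meta = {
    "url": url,
    "domain": domain,
    "captured_at": datetime.now(ZoneInfo("Asia/Jerusalem")).isoformat(),
    "mode": "raw",
    "extractor": "scrapling-static",
    "rung": 1,
    "status": page.status,
    "notes": "",
}
# e.g. ...--branches--a1b2c3.html -> ...--branches--a1b2c3.meta.json
capture_path.with_suffix(".meta.json").write_text(
    json.dumps(meta, ensure_ascii=False, indent=2)
)
```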
Same file as `scrape-article`. Record per domain:
```yaml
doar.israelpost.co.il:
  strategy: playwright        # SPA
  notes: "Branch locator; backing API at /api/..."
  api_endpoints:
    - https://.../api/branches
```
In the final report:
- `network` mode: list distinct JSON endpoints hit, one per line.
- `extract` mode: show a 3-row preview of the extracted data.
- Never silently downgrade a `network` capture to an HTML dump — if no JSON was seen, say so.