Local-Web-Capture
General-purpose web page scraper via the user's localhost (preserves Israeli IP for geo-restricted sites). Handles any page that is NOT a clean article — SPAs, branch locators, product catalogs, search result pages, government portals, dashboards, API endpoints discovered via devtools, or anything requiring raw HTML, rendered DOM, network/XHR responses, or structured extraction. Supports scripted user interaction (click "load more", scroll, type into search, iterate a city dropdown) for sites that hide data behind UI actions — a naive "scrape this URL" will return nothing on those. Always tries to identify and call an unauthenticated backend JSON API first — UI automation is the fallback, not the default. Canonical user invocation provides a triple: (URL, interactive target to click/type/scroll, data to extract) — the skill loads the URL, activates the target in a loop until it stops producing new content, then extracts the requested data. Trigger phrases: "scrape this page", "dump the HTML", "find the API behind this page", "capture this SPA", "get all the X from this page", "extract the JSON", "scrape branch list / store list / catalog", "load URL, click this, then extract that", "I had to click to load everything", any non-article URL. For clean article extraction use `scrape-article` instead.
`npx claudepluginhub danielrosehill/claude-code-plugins --plugin Local-Web-Capture`

This skill uses the workspace's default tool permissions.
Flexible scraper for arbitrary web pages. Unlike `scrape-article`, does not assume a readability-friendly article body exists. Supports four output modes: raw HTML, rendered DOM HTML, extracted structured data (JSON/CSV), and network-response capture (finding the backing API of an SPA).
Requests must originate from this machine. Do not route through a hosted reader (Jina, Firecrawl SaaS, ScrapingBee, etc.). The whole reason this plugin exists is to use the user's Israeli IP.
Before reaching for any browser automation, try to identify and call the site's backend API directly. Most modern SPAs are a thin shell around a JSON API, and that API is usually reachable unauthenticated from the same origin — once you know the endpoint and any query params, a plain Fetcher.get (rung 1) replaces everything else.
Workflow:
1. Load the page once in `network` mode (Playwright with a response listener) and list every JSON endpoint it hits. Also watch for interactions (search, filter, paginate) that trigger additional endpoints — run them once, capture the URL templates.
2. Replay the most promising endpoint with `curl` or `Fetcher.get`. Confirm it returns the same data unauthenticated, outside the browser.
3. If it does, record the endpoint in `sites.yaml` under `api_endpoints:` and use it going forward. Skip browser automation entirely on future runs. This is dramatically faster, more reliable, and handles pagination without clicking.
4. If the API is blocked or requires auth, fall back to UI automation (`interactive` mode). Report both attempts in the output: "Tried API at <url> → <result>. Fell back to UI interaction because <reason>." This way the user can see whether the API route is worth pushing on (e.g. via a captured auth token) before accepting the slower UI path.
Do not skip the API check because "this SPA probably needs Playwright." Check first — the check costs one page load.
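As a concrete sketch of that rung-1 replacement — assuming a JSON endpoint has already been spotted in a capture; the URL and the body-parsing line are illustrative, not a confirmed API of any real site:

```python
import json
from scrapling.fetchers import Fetcher

# Hypothetical endpoint discovered via network capture
api_url = "https://example.co.il/api/branches?city=all"
page = Fetcher.get(api_url, stealthy_headers=True, follow_redirects=True, timeout=20)
if page.status == 200:
    data = json.loads(page.html_content)  # raw response body (attribute per the rung-1 snippet below)
```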
Ask (or infer from the goal):
- `raw` — plain HTML as served. Use for server-rendered sites, or to inspect the initial payload of an SPA before JS runs.
- `rendered` — HTML after JS execution. Use for SPAs where content is injected client-side on load with no user interaction required.
- `interactive` — rendered + scripted user interactions (click "load more" buttons, scroll, type into a search box, select a dropdown, paginate). Many SPAs hide data behind a "load all" or per-city filter — e.g. the Israel Post branch locator requires clicking through cities or a "show all branches" button before the full list appears. A naive one-shot scrape of the landing URL will return an empty or partial dataset. If the user says anything like "I had to click to load all the branches", the mode is `interactive`.
- `extract` — structured data pulled from the DOM or embedded JSON blobs. User specifies what to extract (e.g. "all branch ids and names", "product name + price", "table rows"). Often pairs with `interactive` or `network`.
- `network` — capture XHR / fetch responses while loading (and optionally interacting with) the page. Use when the goal is to discover the backing API of an SPA — this is what you want for branch locators, store finders, autocomplete endpoints. Often the right first move for SPAs: catch the API once, then skip the browser on future runs.

Decision order for an unknown SPA:
1. `network` first — if the page fires a clean JSON endpoint, use that directly from here on.
2. `interactive` + `network` combined: script the interactions while capturing responses.
3. DOM scraping (`extract` off a `rendered` / `interactive` capture) if no JSON endpoint is exposed.

Do not default to `rendered` — it is the slowest. Start at `raw` unless the user's goal obviously needs JS.
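A cheap way to honor that rule is to probe the raw HTML before escalating; a minimal sketch, where the marker string is a hypothetical token you expect to see in the target data:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get(url, stealthy_headers=True, follow_redirects=True, timeout=20)
# If the target data is already in the served HTML, rung 1 suffices — no JS needed.
needs_js = "branch-list" not in page.html_content  # hypothetical marker
```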
`interactive` mode — scripted user actions

Use when the data only appears after the user does something.
The cleanest way to drive this mode is for the user to provide three things:
1. the URL to load,
2. the interactive target (what to click, type into, or scroll),
3. the data to extract once the page settles.
Given that triple, the recipe is fixed:
```
load URL
→ repeat: activate the interactive target, wait for new content
  (stop when the target disappears, is disabled, or row count stops growing)
→ once settled, extract the requested data from the final DOM
→ save as JSON (with a sidecar raw HTML dump in case extraction needs tweaking)
```
Resolve the target by kind:
- Button ("load more"): click it in a loop until it disappears or stops adding rows.
- Dropdown: enumerate the `<option>` values, load each, concatenate + dedupe results.
- Search box: `page.type(sel, query, delay=120)`, wait for suggestions, click the match.
- Infinite scroll: `page.mouse.wheel` until `document.body.scrollHeight` stops growing.

Always pair with network capture so you keep raw JSON payloads even if the DOM extraction is brittle.
Echo back what you resolved before running: "Loading X, will click selector Y until it stops producing new rows, then extract Z." One-line confirmation, then proceed under auto mode.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(locale="he-IL").new_page()
    page.goto(url, wait_until="networkidle")
    # Click "load more" until it disappears or stops adding rows
    while True:
        btn = page.query_selector("button:has-text('טען עוד')")  # "load more" — or data-testid, etc.
        if not btn or not btn.is_visible():
            break
        before = page.eval_on_selector_all("ul.branches li", "els => els.length")
        btn.click()
        page.wait_for_timeout(800)
        after = page.eval_on_selector_all("ul.branches li", "els => els.length")
        if after == before:
            break
    html = page.content()
    browser.close()
```
If the page filters by city and there's no "all" option, enumerate the city `<option>` values and load each in turn, concatenating results. Deduplicate by the id field the API returns.
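A minimal sketch of that loop — the selectors and the `data-id` attribute are hypothetical; adapt them to the actual page:

```python
# Enumerate the city <option> values (skipping empty placeholder options)
values = page.eval_on_selector_all(
    "select#city option", "els => els.map(e => e.value).filter(Boolean)"
)
branches = {}
for v in values:
    page.select_option("select#city", v)   # triggers the filtered reload
    page.wait_for_timeout(1000)
    for el in page.query_selector_all("ul.branches li"):
        branches[el.get_attribute("data-id")] = el.inner_text()  # dedupe by id
```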
Some sites only fire the search on per-character input (e.g. Maccabi medicines). Use `page.type(selector, query, delay=120)` rather than `fill`, then wait for the suggestion dropdown.
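For example — selector and query are hypothetical:

```python
# Per-character typing fires the keyup-driven search; fill() would not.
page.type("input#drug-search", "אקמול", delay=120)   # "Acamol", a sample query
page.wait_for_selector("ul.suggestions li")
page.click("ul.suggestions li:first-child")
```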
```python
# Scroll until the page height stops growing (infinite scroll)
prev_height = 0
while True:
    page.mouse.wheel(0, 4000)
    page.wait_for_timeout(600)
    h = page.evaluate("document.body.scrollHeight")
    if h == prev_height:
        break
    prev_height = h
```
Pair with `network` capture

Run the interaction script with the response listener from `network` mode attached. That way, even if the DOM scrape is brittle, you have the raw JSON payloads saved.
If the interaction required is ambiguous ("click through all the cities" — are there 50 or 500? captcha? rate-limit?), do one city end-to-end first, show the user the result, confirm before looping. Auto mode does not license a 10-minute headless browser run without checkpointing.
Same rungs as `scrape-article`. Consult `sites.yaml` first.
Fetcher (raw)

```python
from scrapling.fetchers import Fetcher
page = Fetcher.get(url, stealthy_headers=True, follow_redirects=True, timeout=20)
# page.status, page.html_content
```
StealthyFetcher (Camoufox, rendered + anti-bot)

```python
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch(url, headless=True, network_idle=True, humanize=True)
```
PlayWrightFetcher (rendered, no stealth)

```python
from scrapling.fetchers import PlayWrightFetcher
page = PlayWrightFetcher.fetch(url, headless=True, network_idle=True)
```
Login walls: defer to `scrape-authenticated`.
`network` mode — capturing API calls

For SPAs the fastest win is usually to identify the JSON endpoint powering the UI and call it directly. Two approaches:
```python
from playwright.sync_api import sync_playwright
import json, pathlib

responses = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx = browser.new_context(locale="he-IL")
    page = ctx.new_page()

    def on_response(resp):
        ct = resp.headers.get("content-type", "")
        if "json" in ct or resp.url.endswith(".json"):
            try:
                responses.append({"url": resp.url, "status": resp.status, "body": resp.json()})
            except Exception:
                pass

    page.on("response", on_response)
    page.goto(url, wait_until="networkidle")
    # Optionally interact: page.fill("input#city", "ירושלים")  # "Jerusalem"; then page.wait_for_timeout(2000)
    browser.close()

pathlib.Path(out_path).write_text(json.dumps(responses, ensure_ascii=False, indent=2))
```
After capture, report the distinct endpoint hosts + paths so the user can decide which to call directly next time.
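A small helper for that report, assuming the responses file written by the script above (same `out_path`):

```python
import json, pathlib
from urllib.parse import urlparse

responses = json.loads(pathlib.Path(out_path).read_text())
# Distinct host + path, query strings stripped
endpoints = sorted({urlparse(r["url"]).netloc + urlparse(r["url"]).path for r in responses})
print("\n".join(endpoints))
```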
Then replay the chosen endpoint with `curl` or `Fetcher.get`, with headers copied from devtools. Record the endpoint under the domain's entry in `sites.yaml` as `api_endpoints:`.
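A replay sketch — the endpoint, params, and header values are placeholders for whatever devtools actually showed (written with `requests` for clarity; `curl` with `-H` flags is equivalent):

```python
import requests

# Hypothetical endpoint + headers copied from the devtools request pane
resp = requests.get(
    "https://example.co.il/api/branches",
    params={"city": "all"},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "Referer": "https://example.co.il/branches",
    },
    timeout=20,
)
resp.raise_for_status()
branches = resp.json()
```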
`extract` mode — structured data out of the DOM

Prefer, in order:
<script type="application/ld+json">, <script id="__NEXT_DATA__">, window.__INITIAL_STATE__ = .... Parse that directly; it's the cleanest source.page.css("selector").attrib["data-id"] + .text for each row. Preserve Hebrew verbatim.Do not scrape heuristically if a JSON source exists on the page.
CLI fallbacks:
- `curl -sL -A "Mozilla/5.0" "<url>"` — sanity check for `raw` mode when Scrapling acts up.
- `wget -qO- "<url>"` — same.
- `jq` — for slicing captured JSON.

Save under the resolved capture root (per reference/save-location.md), subdirectory `pages/YYYY/MM/`.
Filename: `YYYY-MM-DD--HHMM--<domain>--<slug>--<shorthash>.<ext>`
Extensions:
- `.html` for raw / rendered
- `.json` for network / extract
- `.md` only if the user explicitly wants a markdown rendering

Write a sidecar `.meta.json` next to every capture:
```json
{
  "url": "...",
  "domain": "...",
  "captured_at": "<ISO now, Israel time>",
  "mode": "raw|rendered|extract|network",
  "extractor": "scrapling-static|scrapling-stealth|scrapling-playwright|playwright-direct|curl",
  "rung": 1,
  "status": 200,
  "notes": "optional — flags like 'captcha hit', 'login wall', 'found API at /api/branches'"
}
```
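A sketch of writing that sidecar, assuming the capture was already saved to a `pathlib.Path` named `capture_path` and that `url`, `domain`, and `page` are in scope (field values illustrative):

```python
import json
from datetime import datetime
from zoneinfo import ZoneInfo

meta = {
    "url": url,
    "domain": domain,
    "captured_at": datetime.now(ZoneInfo("Asia/Jerusalem")).isoformat(),
    "mode": "raw",
    "extractor": "scrapling-static",
    "rung": 1,
    "status": page.status,
    "notes": "",
}
# e.g. ...--branches--a1b2c3.html -> ...--branches--a1b2c3.meta.json
capture_path.with_suffix(".meta.json").write_text(
    json.dumps(meta, ensure_ascii=False, indent=2)
)
```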
Same file as `scrape-article`. Record per domain:
```yaml
doar.israelpost.co.il:
  strategy: playwright        # SPA
  notes: "Branch locator; backing API at /api/..."
  api_endpoints:
    - https://.../api/branches
```
In the final report:
- `network` mode: list distinct JSON endpoints hit, one per line.
- `extract` mode: show a 3-row preview of the extracted data.
- Never silently downgrade a `network` capture to an HTML dump — if no JSON was seen, say so.