Web scraping guide for sub-agents. Covers Firecrawl CLI fallback scraping when WebFetch fails (JS-heavy sites, anti-bot walls, 403 errors, empty content) and advanced capabilities like structured data extraction with Zod schemas, multi-page crawls, and search-plus-scrape. Use when WebFetch returns garbage or empty pages, when you need typed data from a page (prices, features, specs), or when you need to ingest multiple pages from a site.
Install: npx claudepluginhub nathanvale/side-quest-plugins --plugin newsroom

This skill uses the workspace's default tool permissions.
**Required tools for consuming agents**: WebFetch, Bash(bunx firecrawl-cli *), Read
Integration: Any newsroom sub-agent should consult this skill when WebFetch fails or when structured/multi-page scraping is needed.
| Need | Tool | Details |
|---|---|---|
| Page content as markdown | WebFetch first, then Firecrawl CLI | See below |
| Structured data from a page (prices, features, specs) | Firecrawl extract | Read references/structured-extraction.md |
| Multiple pages from one site | Firecrawl crawl | Read references/crawling.md |
| Search the web + scrape results | Firecrawl search | Read references/crawling.md |
WebFetch is free, fast, and already available. Use it by default.
Works for: blogs, news articles, documentation, static pages, most forum threads.
Switch to Firecrawl CLI when WebFetch returns:

- 403s or other anti-bot errors
- Empty or near-empty content
- Garbage output from JS-heavy pages

Do NOT retry WebFetch on the same URL -- it will fail again.
Requires: firecrawl-cli (install: npm install -g firecrawl-cli or use via bunx firecrawl-cli). Authenticates via FIRECRAWL_API_KEY env var or firecrawl auth --api-key <key>.
If firecrawl-cli is not installed or FIRECRAWL_API_KEY is unset, skip to Step 4 (Report Gaps). Do not retry or attempt workarounds.
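The availability check above can be sketched as a small shell guard. This is a hypothetical helper (the `scrape_or_gap` name and the gap message are illustrative, not part of the CLI):

```shell
# Hypothetical guard: scrape only when bunx and the API key are available;
# otherwise report the gap and stop -- no retries or workarounds.
scrape_or_gap() {
  url="$1"
  out="$2"
  if ! command -v bunx >/dev/null 2>&1 || [ -z "${FIRECRAWL_API_KEY:-}" ]; then
    echo "GAP: firecrawl unavailable for $url"
    return 1
  fi
  bunx firecrawl-cli scrape "$url" -o "$out"
}
```

With `FIRECRAWL_API_KEY` unset (or `bunx` missing), the helper prints the gap line and returns nonzero instead of invoking the CLI.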
Output to stdout (default -- pipe or capture as needed):
bunx firecrawl-cli scrape "<url>"
Output to file (more token-efficient -- read from disk instead of context):
bunx firecrawl-cli scrape "<url>" -o /tmp/scrape-output.md
Then use the Read tool on /tmp/scrape-output.md to pull only what you need into context.
Handles: JS rendering, dynamic content, basic anti-bot bypass, clean Markdown output (strips nav, headers, footers with --only-main-content).
Does NOT handle: login-gated content, CAPTCHAs, form filling, aggressive Cloudflare Turnstile.
For multiple URLs, scrape each separately to different files:
bunx firecrawl-cli scrape "<url1>" -o /tmp/scrape-1.md
bunx firecrawl-cli scrape "<url2>" -o /tmp/scrape-2.md
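The per-URL pattern above generalizes to a loop. A minimal dry-run sketch with placeholder URLs -- the `echo` prints each command instead of running it; drop it to actually scrape:

```shell
# Dry run: build one scrape command per URL, writing to numbered files.
i=1
for url in "https://example.com/a" "https://example.com/b"; do
  echo bunx firecrawl-cli scrape "$url" -o "/tmp/scrape-$i.md"
  i=$((i + 1))
done
```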
The CLI is beta (released Jan 2026) -- expect quirks and flag changes. Run bunx firecrawl-cli scrape --help for current options.
If both WebFetch and Firecrawl fail: