From brightdata-plugin
Scrapes web content as clean markdown, HTML, JSON, or screenshots using the Bright Data CLI (bdata scrape). Handles single pages, URL lists, and paginated listings; verifies CLI setup and hands off to specialized skills where appropriate.
```shell
npx claudepluginhub brightdata/skills --plugin brightdata-plugin
```

This skill uses the workspace's default tool permissions.
Get clean content (markdown, HTML, JSON, screenshot) from one or more URLs via the Bright Data CLI. This skill owns the "fetch raw or lightly-structured content" job. For platform-specific structured data (Amazon, LinkedIn, TikTok, etc.), **stop and use `data-feeds` instead** — you'll get clean JSON without selector logic.
Before any scrape, verify the CLI is installed and authenticated:
```shell
if ! command -v bdata >/dev/null 2>&1; then
  echo "bdata CLI not installed — see bright-data-best-practices/references/cli-setup.md"
elif ! bdata zones >/dev/null 2>&1; then
  echo "bdata not authenticated — run: bdata login (or: bdata login --device for SSH)"
fi
```
If either check fails, halt and route the user to skills/bright-data-best-practices/references/cli-setup.md. Do not attempt the legacy curl fallback silently — ask the user first.
| Situation | Action |
|---|---|
| Single URL | `bdata scrape <url> -f markdown` |
| Small list (≤ ~20 URLs) | shell loop, one at a time (see references/patterns.md) |
| Larger list (dozens+) | `xargs -P 4` with a parallelism cap (see references/patterns.md) |
| Paginated listing | scrape page 1 → extract next-page URL → append → repeat (see references/examples.md) |
| JS-heavy / login-gated / interaction-required | escalate to `bdata browser` (see brightdata-cli skill) |
| Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, … | stop — hand off to `data-feeds` |
| No URL yet, just a topic | hand off to `search` |
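For the larger-list row above, the capped-parallelism pattern can be sketched roughly as follows (a sketch only: the exact recipe lives in references/patterns.md, and this version assumes a urls.txt file with one URL per line; the cksum-based filename derivation is just one illustrative choice to avoid collisions):

```shell
# Scrape every URL in urls.txt, at most 4 in flight at a time,
# writing one markdown file per URL. Skips cleanly if bdata is absent.
if command -v bdata >/dev/null 2>&1 && [ -s urls.txt ]; then
  xargs -P 4 -n 1 sh -c '
    url="$1"
    # Derive a stable output filename from a checksum of the URL.
    out="$(printf "%s" "$url" | cksum | cut -d" " -f1).md"
    bdata scrape "$url" -f markdown -o "$out"
  ' _ < urls.txt
fi
```

Passing the URL as a positional argument to `sh -c` (rather than splicing `{}` into the command string) keeps URLs with shell metacharacters from being interpreted.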
Core commands:
```shell
# Clean markdown (default)
bdata scrape "https://example.com/article" -f markdown -o article.md

# Raw HTML (when you need the DOM)
bdata scrape "https://example.com" -f html -o page.html

# Structured JSON (when the Unlocker returns parsed fields)
bdata scrape "https://example.com" -f json --pretty -o page.json

# Visual snapshot (saves PNG)
bdata scrape "https://example.com" -f screenshot -o page.png

# Geo-targeted (override the exit country)
bdata scrape "https://example.com" --country de -f markdown
```
Full flag reference: references/flags.md.
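The paginated-listing flow from the routing table can be sketched as a loop (a sketch under assumptions: extract_next_url is a hypothetical helper, real next-page extraction depends on the site's markup, and references/examples.md has the full recipe):

```shell
# Follow a paginated listing: scrape a page, extract the next-page URL,
# repeat. extract_next_url is a placeholder; adapt it per site.
extract_next_url() {
  # Assumption: the saved markdown contains a link labelled "Next".
  grep -oE '\[Next\]\([^)]+\)' "$1" | head -n 1 | sed -E 's/.*\((.*)\)/\1/'
}

if command -v bdata >/dev/null 2>&1; then
  url="https://example.com/listing?page=1"
  page=1
  while [ -n "$url" ] && [ "$page" -le 10 ]; do   # hard cap: never loop forever
    bdata scrape "$url" -f markdown -o "page-$page.md"
    url="$(extract_next_url "page-$page.md")"
    page=$((page + 1))
  done
fi
```

The hard page cap is a deliberate safety net: a listing whose "Next" link loops back on itself would otherwise scrape forever.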
Before reporting success, verify the output:

- The file exists and is non-empty: `test -s "$out_path"` — or, for stdout, at least 200 bytes of content.
- The body is not a block page. Red flags: "Access Denied", "Just a moment", "Attention Required", "Checking your browser", "captcha", "cf-browser-verification", "cloudflare" (with < 2KB total body).
- The content has the expected shape (e.g., a price should match `\$\d`; an article should contain at least one `<h1>` or `#` heading).

If a check fails, retry with a different `--country` (e.g., `--country de` if the origin site is US), then escalate to `bdata browser` for full JS rendering (hand off to the brightdata-cli skill). Do not report success until all checks above pass.
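Those output checks can be sketched as a reusable shell helper (assumptions: the scrape was saved to a file, and the marker list and size floors are taken from the checklist above):

```shell
# Return 0 only if "$1" looks like a genuine scrape result.
verify_scrape() {
  out_path="$1"
  # 1. Output exists and clears a minimal size floor (~200 bytes).
  test -s "$out_path" || return 1
  [ "$(wc -c < "$out_path")" -ge 200 ] || return 1
  # 2. No block-page markers anywhere in the body.
  if grep -qiE 'Access Denied|Just a moment|Attention Required|Checking your browser|captcha|cf-browser-verification' "$out_path"; then
    return 1
  fi
  # 3. "cloudflare" only counts as a block marker when the body is tiny (< 2KB).
  if [ "$(wc -c < "$out_path")" -lt 2048 ] && grep -qi 'cloudflare' "$out_path"; then
    return 1
  fi
  # 4. Content-shape check: an article should carry at least one heading.
  grep -qE '^#|<h1' "$out_path" || return 1
}
```

Call it after each scrape and only retry (different `--country`, then `bdata browser`) when it returns non-zero.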
Common mistakes to avoid:

- Suppressing stderr with `2>/dev/null` — you'll miss auth failures and rate-limit errors.
- Running `bdata scrape` on Amazon/LinkedIn/TikTok/Instagram/YouTube/Reddit URLs — these are supported by `data-feeds` and return structured data directly. Scraping loses the structure.
- Running `bdata scrape` sequentially for large lists instead of using `xargs -P 4` (or similar) with a parallelism cap.
- Calling `curl` against api.brightdata.com directly — legacy path; use it only when the CLI isn't available.

References:

- references/flags.md — every flag with when-to-use notes.
- references/patterns.md — shell-loop batching, xargs parallelism, pagination recipe, retry/backoff, block-page recovery chain, legacy curl fallback.
- references/examples.md — (1) single page → markdown, (2) batch a list of URLs with a parallelism cap, (3) paginated listing, (4) block-page recovery.