From brightdata-plugin
Builds AI-generated Bright Data scrapers from natural language via `bdata scraper create` and runs collectors against URLs with `bdata scraper run`. Activates when users describe data they want extracted from a URL.
How this skill is triggered — by the user, by Claude, or both
Slash command
/brightdata-plugin:scraper-studioThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Build a scraper from natural language, then run it. Two commands live in this skill:
Build a scraper from natural language, then run it. Two commands live in this skill:
bdata scraper create <url> <description> — describe what you want in plain English; Bright Data's AI Flow generates a scraper template and returns a collector_id.bdata scraper run <collector_id> <url> — run that collector (or any existing one from the Bright Data web UI) against a URL and get the extracted data back.The bridge between the two is collector_id. It is printed by create and consumed by run. Always save it.
For pre-built scrapers on platforms like Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc., stop and use the data-feeds skill instead — those scrapers already exist, are faster, cheaper, and more reliable than building a new one. Use Scraper Studio when no pre-built scraper covers the target site, or when the user wants a custom shape of output for an existing platform.
if ! command -v bdata >/dev/null 2>&1; then
echo "bdata CLI not installed — see bright-data-best-practices/references/cli-setup.md"
elif ! bdata zones >/dev/null 2>&1; then
echo "bdata not authenticated — run: bdata login (or: bdata login --device for SSH)"
fi
Halt and route to setup if either check fails. Both commands require an authenticated CLI.
| Situation | Action |
|---|---|
| User describes data they want from a URL, no scraper exists yet | bdata scraper create <url> "<description>" → save the collector_id |
User has a collector_id and wants data from one URL | bdata scraper run <collector_id> <url> (default async + poll) |
User has a collector_id and wants data from many URLs | bdata scraper run <collector_id> --urls "u1,u2,..." or --input-file urls.txt (single batch call) |
| Page is small and you want fast feedback (≤ ~50 s, single URL) | bdata scraper run … --sync |
| Scraper ran but returned wrong / empty / partial data | inspect the output, then bdata scraper heal <collector_id> "<what's wrong>" → review preview → approve → re-run to verify |
| Site is a known platform (Amazon, LinkedIn, TikTok, …) | stop — use data-feeds skill |
| You want SERP / discovery, not extraction | use search skill |
| You want a one-off raw page fetch | use scrape skill |
scraper createGenerate a scraper from a URL + plain-English description.
bdata scraper create <url> "<description>" [--name <name>] \
[--deliver-webhook <url>] [--timeout <seconds>] \
[--json | --pretty] [-o <path>] [--timing] [-k <api-key>]
The description is the most important input. A good description names every field you want and any conditions on how to find them. See references/prompts.md for examples of strong vs. weak descriptions.
# Minimal
bdata scraper create https://example.com/product/1 \
"Extract title, price, currency, image URL, and availability \
from this product page. If the price has a strike-through \
original price, capture both as price and original_price."
# Save the full AI output for inspection
bdata scraper create https://example.com/product/1 \
"Extract title, price, and image URL" \
--name product-scraper-v1 \
--pretty -o create.json
create chains three Bright Data API calls — surface this to the user so they can debug from logs:
POST /dca/collector — creates an empty scraper template with a stub webhook delivery target (https://example.com/webhook by default). Returns a collector_id like c_mp3tuab31lswoxvpws.POST /dca/collectors/{collector_id}/automate_template — triggers Bright Data's AI Flow with the description + URL.GET .../automate_template/progress (polled) — waits for status: "done". Generation typically takes 5–10 minutes for moderately complex pages.collector_idEvery failure path in create (AI trigger fails, polling times out, generation finishes with status: "failed") still leaves a partially-built collector at the printed collector_id. Always tell the user the id is recoverable — they can:
https://brightdata.com/cp/scrapers/{collector_id} to finish or inspect it in the web UI.Never claim "create failed, start over" without surfacing the collector_id from the response.
--timeout — default 600 sAI generation can run 5–10 min for complex pages. If the page is simple, the default is plenty. For an elaborate description on a heavy site, raise it:
bdata scraper create https://complex-site.com/page \
"Extract all 30 fields …" \
--timeout 1200
On Timeout after N seconds waiting for AI generation, the collector_id is still printed. Re-check progress in the web UI rather than re-running create (which builds a new collector).
--deliver-webhook — placeholder by defaultThe CLI sets a stub webhook (https://example.com/webhook). This satisfies the API contract but does not deliver anything. For CLI use, leave the stub — you'll fetch results synchronously via scraper run. The real delivery target can be reconfigured in the Bright Data web UI if the user wants webhook delivery for production runs.
scraper runExecute a scraper against one or more URLs.
bdata scraper run <collector_id> [url] \
[--urls "u1,u2,..." | --input-file <path>] \
[--sync [--sync-timeout 25-50]] \
[--timeout <seconds>] [--name <name>] [--version <version>] \
[--json | --pretty] [-o <path>] [--timing] [-k <api-key>]
Pick exactly one input source: positional <url>, --urls, or --input-file. Combining them errors with only one input source.
Multi-URL routes through /dca/trigger as a single API call with an array body — the canonical pattern from the Scraper Studio Node and Python reference projects (triggerWithUrls / trigger_with_urls). One snapshot, one poll loop, one merged result array. Do not hand-roll a for url in $(cat urls.txt); do bdata scraper run ... loop — that's N API calls for what should be one.
┌─────────────────────────────────────────────────┐
│ Expected to finish in ≤ ~50 seconds? │
└─────────────────────────────────────────────────┘
yes no
│ │
▼ ▼
┌──────────────────────┐ ┌────────────────────────┐
│ --sync │ │ default (no flag) │
│ /dca/crawl │ │ trigger_immediate + │
│ one-shot, 25–50 s │ │ poll get_result │
└──────────────────────┘ └────────────────────────┘
│
│ on 202 timeout: response_id is printed —
│ re-run WITHOUT --sync to poll for it
▼
(cleanly fall back to async)
| Mode | Endpoint | When to use |
|---|---|---|
| Default (single URL, async + poll) | /dca/trigger_immediate → poll /dca/get_result | Anything you expect to take more than ~50 s, anything paginated, anything you're not sure about. This is the safe default for one URL. |
--sync (single URL) | /dca/crawl | Single-page extractions you expect to complete in under 50 s. Faster path: one request, no polling. Incompatible with multi-URL — /dca/crawl accepts only one URL. |
Multi-URL (--urls / --input-file) | /dca/trigger (array body) → poll /dca/dataset | 2+ URLs through the same collector. One API call, one snapshot ID, one merged result array. Use the longer --timeout (default 3600s) for big batches. |
# Default — async + poll (recommended for most cases)
bdata scraper run c_mp3tuab31lswoxvpws https://www.amazon.com/dp/B08N5WRWNW
# Pretty-printed, saved to disk
bdata scraper run c_mp3tuab31lswoxvpws https://www.amazon.com/dp/B08N5WRWNW \
--pretty -o product.json
# Sync mode for a fast page
bdata scraper run c_mp3tuab31lswoxvpws https://example.com/p/1 --sync
# Sync with a tighter server timeout
bdata scraper run c_mp3tuab31lswoxvpws https://example.com/p/1 \
--sync --sync-timeout 30
# Multi-URL via comma-separated list — one batch call
bdata scraper run c_mp3tuab31lswoxvpws \
--urls "https://example.com/p/1,https://example.com/p/2,https://example.com/p/3" \
--pretty -o products.json
# Multi-URL via file (one URL per line; # comments and blanks ignored)
bdata scraper run c_mp3tuab31lswoxvpws --input-file urls.txt -o products.json
# Multi-URL via JSON array
echo '["https://example.com/p/1","https://example.com/p/2"]' > urls.json
bdata scraper run c_mp3tuab31lswoxvpws --input-file urls.json
--sync-timeout is bounded to 25–50Anything outside that range exits with --sync-timeout must be between 25 and 50 seconds. Default is 50. Use values < 50 only when you want to fail fast and fall back to async on slow pages.
--sync timeout recoveryIf /dca/crawl server-side times out, it returns 202 with {"error":"crawl_results_timeout","response_id":"r_late_..."}. The CLI surfaces the response_id and exits with a "Re-run without --sync" message. Do exactly that — drop --sync and re-run the same command; the default async path will poll /dca/get_result for the existing response_id. Do not treat sync timeout as a hard failure.
When a single URL expands to more pages than the realtime job limit allows (paginated listings, infinite scroll, "all reviews" pages, etc.), the realtime endpoints return an error like:
Request generated 501 pages and exceeded realtime job limit of 51 pages
The CLI detects this and automatically falls back to the batch flow (/dca/trigger → poll /dca/dataset). A one-line notice is printed. No flag is needed. Batch jobs use a longer poll interval (10 s) and a longer default timeout (1 hour).
If the user only sees the notice and a long wait, that is expected — large jobs take time. Do not "fix" the fallback by switching APIs or restructuring the scraper; let it run.
--name and --version--name <name> — tag this run in the Bright Data dashboard. Useful when you're scripting many runs and want to find one later.--version <version> — pin a specific scraper version (commonly dev when iterating in the web UI). Omit to use the latest published version.scraper healFix an existing scraper in place when it ran but returned wrong, empty, or partial data. The scraper's collector_id stays the same — it is improved, not replaced.
bdata scraper heal <collector_id> "<what's wrong>" [--url <verify-url>] \
[--timeout <seconds>] [--max-retries <n>] [--no-retry] \
[--json | --pretty] [-o <path>] [--timing] [-k <api-key>]
You are the detector. The CLI never decides on its own that a scraper is broken — you inspect the run output and decide. A heal is slow and billable, so only heal when the data is actually wrong, not just legitimately empty.
The <prompt> is required and is the most important input. Name exactly what is wrong and what the correct output should be: "The price field returns null — the selector moved into a <span data-testid=...>. Capture price and currency again." Vague prompts ("fix it") produce vague heals. The prompt is capped at 1000 characters.
POST /dca/collectors/{collector_id}/refactor_template with {prompt, custom_input: []} — triggers the AI self-healing job.GET .../refactor_template/progress (polled) — waits for status: "done", same job shape and timing as automate_template.heal (without --auto-approve) usually ends at an approval gate rather
than completing immediately. The response carries status: "awaiting_approval"
with two key fields:
preview_result — sample rows the fixed scraper would produce, so you can
judge whether the fix is correct before committing it.diff_summary — a summary of what changed in the template.next_step — points at bdata scraper approve <collector_id> to commit
(or --reject to discard).{
"collector_id": "c_mp3tuab31lswoxvpws",
"status": "awaiting_approval",
"preview_result": [{"price": "29.99", "currency": "USD"}],
"diff_summary": "Updated price selector from .price to span[data-testid='price']",
"next_step": "bdata scraper approve c_mp3tuab31lswoxvpws"
}
Review preview_result, then run bdata scraper approve <collector_id> to
commit the fix — or pass --reject to discard it and re-heal with a sharper
prompt. approve polls to done and hands back a next_step =
bdata scraper run <id> <url> to verify.
Pass --url <verify-url> to heal so the approve and run steps are concrete.
The full self-healing loop is now:
run → inspect → heal → review preview → approve → run → verify
To skip the gate entirely, use heal --auto-approve — it approves automatically
and polls through to done.
If a heal fails (429 cap exhausted, timeout, terminal failed), the existing scraper is unchanged and still works as it did before. The CLI says so and prints the collector_id. Unlike a failed create, nothing half-built is left behind.
scraper approveA scraper heal (without --auto-approve) stops at an approval gate:
status: "awaiting_approval", with preview_result (sample rows the fixed
scraper would produce) and a diff_summary. Review the preview, then commit
the fix:
bdata scraper approve <collector_id> [--reject] [--url <verify-url>] \
[--timeout <seconds>] [--json | --pretty] [-o <path>] [-k <api-key>]
POST /dca/collectors/{id}/resume_automation_job {"message": true}), then polls to done and hands back a
next_step = bdata scraper run <id> <url> to verify.--reject sends {"message": false} to discard the proposed fix; re-heal
with a sharper prompt to try again.approve may stop at
awaiting_approval again — just run it again.awaiting_approval is not a failure — it means the fix is ready and
waiting for your decision.
Capture the collector_id cleanly with jq, then chain into run with the multi-URL batch path:
# 1. Create the scraper, save the AI output, extract the collector_id
bdata scraper create https://example.com/product/1 \
"Extract title, price, currency, image URL, and availability" \
--pretty -o create.json
# 2. Pull the collector_id (or copy from the human-readable summary)
COLLECTOR_ID=$(jq -r '.collector_id // .id' create.json)
echo "Built: $COLLECTOR_ID"
# 3. Run it on every URL in one batch — single API call, merged result array
bdata scraper run "$COLLECTOR_ID" --input-file urls.txt \
--pretty -o out/results.json
Do not wrap bdata scraper run in a for url in $(cat urls.txt); do ... done loop — that's N API calls and N snapshots. Use --input-file urls.txt (or --urls "...") instead; the CLI POSTs all of them to /dca/trigger in a single array body and returns one merged result array.
For more end-to-end recipes (batch input file shapes, error recovery, web-UI handoff), see references/recipes.md.
Inventing command names. The commands are exactly bdata scraper create and bdata scraper run. There is no bdata generate, no bdata scrape-batch, no bdata data, no bdata build. If you're tempted to use one of those, you're hallucinating — run bdata scraper --help to verify.
Re-running create after a timeout. Generation creates a fresh collector every time. If polling times out, the half-built collector is still printed in the error output. Resume it in the web UI or wait and re-poll — don't burn another collector.
Defaulting to --sync for everything. Sync caps at 50 s server-side. Any paginated page or heavy SPA will time out. Default async is the right choice unless you specifically know the page is small.
Skipping Scraper Studio for known platforms. If the target is Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc., the data-feeds skill exposes pre-built scrapers that are faster, cheaper, and more reliable. Only build a custom scraper when no pre-built one exists for the site, or when the user explicitly wants a custom output shape.
Treating sync timeout as failure. A --sync 202 timeout returns a valid response_id — the job is still running on the backend. Re-run the same command without --sync to pick it up.
Throwing away collector_id / response_id on error. The CLI prints them in every failure path on purpose. Both are recoverable via the API and the web UI. Always surface them to the user.
Trying to disable the batch auto-fallback. It has no flag. The fallback is the correct behavior when a URL expands past the realtime page limit. Let it run.
Hand-rolling a for url in $(cat urls.txt); do bdata scraper run ... loop. That's N API calls, N snapshot IDs, and N poll loops for what the API natively treats as one batch. Use --input-file urls.txt or --urls "u1,u2,..." — the CLI posts the whole array to /dca/trigger in a single request and returns one merged result array. This mirrors the canonical triggerWithUrls / trigger_with_urls helpers from the Scraper Studio reference SDKs.
Vague descriptions in create. A description like "scrape the page" produces a generic scraper. Name every field, name conditions ("if there's a sale price, capture both"), name disambiguators ("the price near the title, not in the recommendations sidebar"). See references/prompts.md.
Re-running create to fix a broken scraper. That builds a new
collector and orphans the old one. To fix an existing scraper, use
bdata scraper heal <collector_id> "<what's wrong>" — it mutates the
scraper in place so your saved collector_id keeps working and improves.
Treating awaiting_approval as a failure. It is the normal end state
of a heal — the fix is computed and waiting for your decision. Review
preview_result, then bdata scraper approve <id> (or --reject). Use
heal --auto-approve to skip the gate.
| Symptom | Cause | Fix |
|---|---|---|
bdata: command not found | CLI not installed | See brightdata-cli for install. |
Invalid or expired API key | Not logged in | bdata login (or bdata login --device for SSH). |
create returns no id | API call failed before template was created | Check --timing output for the failing request; verify network and account status. |
Timeout after 600 seconds waiting for AI generation | Page is complex; default poll exceeded | The collector_id is still printed. Open it in the web UI; or re-run with --timeout 1200. |
status: "failed" from progress poll | AI Flow couldn't build the template | Improve the description — be more specific about fields and selectors. Try again with a cleaner URL (e.g. a canonical product page, not a search result). |
--sync 202 with crawl_results_timeout | Page took > sync server cap | Re-run without --sync to poll /dca/get_result for the printed response_id. |
--sync-timeout must be between 25 and 50 seconds | Out-of-range value | Use a value in [25, 50]. |
Request generated N pages and exceeded realtime job limit | URL expanded too far | This is handled automatically — wait for the batch fallback notice and let it poll. |
No data returned, just [] or {} | Selectors didn't match | The scraper template ran but extracted nothing. Open https://brightdata.com/cp/scrapers/{collector_id} to inspect/edit; or rebuild with a tighter description. |
scraper create that the AI Flow can act on. Examples of strong vs. weak descriptions, field-listing patterns, and conditional rules.collector_id from create, batch run a list of URLs, handle sync→async fallback, recover from a failed create, list scrapers from the dashboard./dca/collector, /dca/collectors/{id}/automate_template, /dca/trigger_immediate, /dca/get_result, /dca/crawl, /dca/trigger, /dca/dataset). Read this when debugging unexpected CLI output or when the user wants to hit the API directly.npx claudepluginhub brightdata/skills --plugin brightdata-pluginBuilds production-ready web scrapers for any site using Bright Data infrastructure. Guides site analysis, API selection, selector extraction, pagination, and implementation.
Scrapes web pages, SERPs, e-commerce, YouTube, and ChatGPT via ScrapingBee CLI with smart path extraction and AI rules for LLMs. Handles JS, CAPTCHAs, anti-bot automatically.
Runs Apify web scraping Actors, manages datasets, creates tasks, and retrieves crawl results via the Composio Apify integration.