Skill

scraper-studio

Builds AI-generated Bright Data scrapers from natural language via `bdata scraper create` and runs collectors against URLs with `bdata scraper run`. Activates when users describe data they want extracted from a URL.

automation

Popularity

Stars

201

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/brightdata-plugin:scraper-studio

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Build a scraper from natural language, then run it. Two commands live in this skill:

Supporting Files

references/api-flow.mdreferences/prompts.mdreferences/recipes.md

SKILL.md

369 lines · ~5.7k tokens(exceeds 5k compaction limit)

Stats

LanguageShell

Stars201

Forks30

MaintenanceExcellent

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Bright Data — Scraper Studio

Build a scraper from natural language, then run it. Two commands live in this skill:

bdata scraper create <url> <description> — describe what you want in plain English; Bright Data's AI Flow generates a scraper template and returns a collector_id.
bdata scraper run <collector_id> <url> — run that collector (or any existing one from the Bright Data web UI) against a URL and get the extracted data back.

The bridge between the two is collector_id. It is printed by create and consumed by run. Always save it.

For pre-built scrapers on platforms like Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc., stop and use the data-feeds skill instead — those scrapers already exist, are faster, cheaper, and more reliable than building a new one. Use Scraper Studio when no pre-built scraper covers the target site, or when the user wants a custom shape of output for an existing platform.

Setup gate (run first)

if ! command -v bdata >/dev/null 2>&1; then
    echo "bdata CLI not installed — see bright-data-best-practices/references/cli-setup.md"
elif ! bdata zones >/dev/null 2>&1; then
    echo "bdata not authenticated — run: bdata login  (or: bdata login --device for SSH)"
fi

Halt and route to setup if either check fails. Both commands require an authenticated CLI.

Pick your path

Situation	Action
User describes data they want from a URL, no scraper exists yet	`bdata scraper create <url> "<description>"` → save the `collector_id`
User has a `collector_id` and wants data from one URL	`bdata scraper run <collector_id> <url>` (default async + poll)
User has a `collector_id` and wants data from many URLs	`bdata scraper run <collector_id> --urls "u1,u2,..."` or `--input-file urls.txt` (single batch call)
Page is small and you want fast feedback (≤ ~50 s, single URL)	`bdata scraper run … --sync`
Scraper ran but returned wrong / empty / partial data	inspect the output, then `bdata scraper heal <collector_id> "<what's wrong>"` → review preview → approve → re-run to verify
Site is a known platform (Amazon, LinkedIn, TikTok, …)	stop — use `data-feeds` skill
You want SERP / discovery, not extraction	use `search` skill
You want a one-off raw page fetch	use `scrape` skill

Action 1 — `scraper create`

Generate a scraper from a URL + plain-English description.

bdata scraper create <url> "<description>" [--name <name>] \
    [--deliver-webhook <url>] [--timeout <seconds>] \
    [--json | --pretty] [-o <path>] [--timing] [-k <api-key>]

The description is the most important input. A good description names every field you want and any conditions on how to find them. See references/prompts.md for examples of strong vs. weak descriptions.

# Minimal
bdata scraper create https://example.com/product/1 \
    "Extract title, price, currency, image URL, and availability \
     from this product page. If the price has a strike-through \
     original price, capture both as price and original_price."

# Save the full AI output for inspection
bdata scraper create https://example.com/product/1 \
    "Extract title, price, and image URL" \
    --name product-scraper-v1 \
    --pretty -o create.json

What happens under the hood

create chains three Bright Data API calls — surface this to the user so they can debug from logs:

POST /dca/collector — creates an empty scraper template with a stub webhook delivery target (https://example.com/webhook by default). Returns a collector_id like c_mp3tuab31lswoxvpws.
POST /dca/collectors/{collector_id}/automate_template — triggers Bright Data's AI Flow with the description + URL.
GET .../automate_template/progress (polled) — waits for status: "done". Generation typically takes 5–10 minutes for moderately complex pages.

Critical: hold the `collector_id`

Every failure path in create (AI trigger fails, polling times out, generation finishes with status: "failed") still leaves a partially-built collector at the printed collector_id. Always tell the user the id is recoverable — they can:

Open https://brightdata.com/cp/scrapers/{collector_id} to finish or inspect it in the web UI.
Re-trigger generation programmatically against the same id.
Delete it from the UI if they want a clean slate.

Never claim "create failed, start over" without surfacing the collector_id from the response.

`--timeout` — default 600 s

AI generation can run 5–10 min for complex pages. If the page is simple, the default is plenty. For an elaborate description on a heavy site, raise it:

bdata scraper create https://complex-site.com/page \
    "Extract all 30 fields …" \
    --timeout 1200

On Timeout after N seconds waiting for AI generation, the collector_id is still printed. Re-check progress in the web UI rather than re-running create (which builds a new collector).

`--deliver-webhook` — placeholder by default

The CLI sets a stub webhook (https://example.com/webhook). This satisfies the API contract but does not deliver anything. For CLI use, leave the stub — you'll fetch results synchronously via scraper run. The real delivery target can be reconfigured in the Bright Data web UI if the user wants webhook delivery for production runs.

Action 2 — `scraper run`

Execute a scraper against one or more URLs.

bdata scraper run <collector_id> [url] \
    [--urls "u1,u2,..." | --input-file <path>] \
    [--sync [--sync-timeout 25-50]] \
    [--timeout <seconds>] [--name <name>] [--version <version>] \
    [--json | --pretty] [-o <path>] [--timing] [-k <api-key>]

Pick exactly one input source: positional <url>, --urls, or --input-file. Combining them errors with only one input source.

Multi-URL routes through /dca/trigger as a single API call with an array body — the canonical pattern from the Scraper Studio Node and Python reference projects (triggerWithUrls / trigger_with_urls). One snapshot, one poll loop, one merged result array. Do not hand-roll a for url in $(cat urls.txt); do bdata scraper run ... loop — that's N API calls for what should be one.

Choosing the run mode

                ┌─────────────────────────────────────────────────┐
                │ Expected to finish in ≤ ~50 seconds?            │
                └─────────────────────────────────────────────────┘
                 yes                                       no
                  │                                         │
                  ▼                                         ▼
       ┌──────────────────────┐                ┌────────────────────────┐
       │   --sync             │                │  default (no flag)     │
       │   /dca/crawl         │                │  trigger_immediate +   │
       │   one-shot, 25–50 s  │                │  poll get_result       │
       └──────────────────────┘                └────────────────────────┘
                  │
                  │  on 202 timeout: response_id is printed —
                  │  re-run WITHOUT --sync to poll for it
                  ▼
            (cleanly fall back to async)

Mode	Endpoint	When to use
Default (single URL, async + poll)	`/dca/trigger_immediate` → poll `/dca/get_result`	Anything you expect to take more than ~50 s, anything paginated, anything you're not sure about. This is the safe default for one URL.
`--sync` (single URL)	`/dca/crawl`	Single-page extractions you expect to complete in under 50 s. Faster path: one request, no polling. Incompatible with multi-URL — `/dca/crawl` accepts only one URL.
Multi-URL (`--urls` / `--input-file`)	`/dca/trigger` (array body) → poll `/dca/dataset`	2+ URLs through the same collector. One API call, one snapshot ID, one merged result array. Use the longer `--timeout` (default 3600s) for big batches.

# Default — async + poll (recommended for most cases)
bdata scraper run c_mp3tuab31lswoxvpws https://www.amazon.com/dp/B08N5WRWNW

# Pretty-printed, saved to disk
bdata scraper run c_mp3tuab31lswoxvpws https://www.amazon.com/dp/B08N5WRWNW \
    --pretty -o product.json

# Sync mode for a fast page
bdata scraper run c_mp3tuab31lswoxvpws https://example.com/p/1 --sync

# Sync with a tighter server timeout
bdata scraper run c_mp3tuab31lswoxvpws https://example.com/p/1 \
    --sync --sync-timeout 30

# Multi-URL via comma-separated list — one batch call
bdata scraper run c_mp3tuab31lswoxvpws \
    --urls "https://example.com/p/1,https://example.com/p/2,https://example.com/p/3" \
    --pretty -o products.json

# Multi-URL via file (one URL per line; # comments and blanks ignored)
bdata scraper run c_mp3tuab31lswoxvpws --input-file urls.txt -o products.json

# Multi-URL via JSON array
echo '["https://example.com/p/1","https://example.com/p/2"]' > urls.json
bdata scraper run c_mp3tuab31lswoxvpws --input-file urls.json

`--sync-timeout` is bounded to 25–50

Anything outside that range exits with --sync-timeout must be between 25 and 50 seconds. Default is 50. Use values < 50 only when you want to fail fast and fall back to async on slow pages.

`--sync` timeout recovery

If /dca/crawl server-side times out, it returns 202 with {"error":"crawl_results_timeout","response_id":"r_late_..."}. The CLI surfaces the response_id and exits with a "Re-run without --sync" message. Do exactly that — drop --sync and re-run the same command; the default async path will poll /dca/get_result for the existing response_id. Do not treat sync timeout as a hard failure.

Silent auto-fallback to batch (paginated / large pages)

When a single URL expands to more pages than the realtime job limit allows (paginated listings, infinite scroll, "all reviews" pages, etc.), the realtime endpoints return an error like:

Request generated 501 pages and exceeded realtime job limit of 51 pages

The CLI detects this and automatically falls back to the batch flow (/dca/trigger → poll /dca/dataset). A one-line notice is printed. No flag is needed. Batch jobs use a longer poll interval (10 s) and a longer default timeout (1 hour).

If the user only sees the notice and a long wait, that is expected — large jobs take time. Do not "fix" the fallback by switching APIs or restructuring the scraper; let it run.

`--name` and `--version`

--name <name> — tag this run in the Bright Data dashboard. Useful when you're scripting many runs and want to find one later.
--version <version> — pin a specific scraper version (commonly dev when iterating in the web UI). Omit to use the latest published version.

Action 3 — `scraper heal`

Fix an existing scraper in place when it ran but returned wrong, empty, or partial data. The scraper's collector_id stays the same — it is improved, not replaced.

bdata scraper heal <collector_id> "<what's wrong>" [--url <verify-url>] \
    [--timeout <seconds>] [--max-retries <n>] [--no-retry] \
    [--json | --pretty] [-o <path>] [--timing] [-k <api-key>]

You are the detector. The CLI never decides on its own that a scraper is broken — you inspect the run output and decide. A heal is slow and billable, so only heal when the data is actually wrong, not just legitimately empty.

The <prompt> is required and is the most important input. Name exactly what is wrong and what the correct output should be: "The price field returns null — the selector moved into a <span data-testid=...>. Capture price and currency again." Vague prompts ("fix it") produce vague heals. The prompt is capped at 1000 characters.

What happens under the hood

POST /dca/collectors/{collector_id}/refactor_template with {prompt, custom_input: []} — triggers the AI self-healing job.
GET .../refactor_template/progress (polled) — waits for status: "done", same job shape and timing as automate_template.

Output + the verify loop

heal (without --auto-approve) usually ends at an approval gate rather than completing immediately. The response carries status: "awaiting_approval" with two key fields:

preview_result — sample rows the fixed scraper would produce, so you can judge whether the fix is correct before committing it.
diff_summary — a summary of what changed in the template.
next_step — points at bdata scraper approve <collector_id> to commit (or --reject to discard).

{
  "collector_id": "c_mp3tuab31lswoxvpws",
  "status": "awaiting_approval",
  "preview_result": [{"price": "29.99", "currency": "USD"}],
  "diff_summary": "Updated price selector from .price to span[data-testid='price']",
  "next_step": "bdata scraper approve c_mp3tuab31lswoxvpws"
}

Review preview_result, then run bdata scraper approve <collector_id> to commit the fix — or pass --reject to discard it and re-heal with a sharper prompt. approve polls to done and hands back a next_step = bdata scraper run <id> <url> to verify.

Pass --url <verify-url> to heal so the approve and run steps are concrete. The full self-healing loop is now:

run → inspect → heal → review preview → approve → run → verify

To skip the gate entirely, use heal --auto-approve — it approves automatically and polls through to done.

Failure is non-destructive

If a heal fails (429 cap exhausted, timeout, terminal failed), the existing scraper is unchanged and still works as it did before. The CLI says so and prints the collector_id. Unlike a failed create, nothing half-built is left behind.

Action 4 — `scraper approve`

A scraper heal (without --auto-approve) stops at an approval gate: status: "awaiting_approval", with preview_result (sample rows the fixed scraper would produce) and a diff_summary. Review the preview, then commit the fix:

bdata scraper approve <collector_id> [--reject] [--url <verify-url>] \
    [--timeout <seconds>] [--json | --pretty] [-o <path>] [-k <api-key>]

Approves by default (POST /dca/collectors/{id}/resume_automation_job {"message": true}), then polls to done and hands back a next_step = bdata scraper run <id> <url> to verify.
--reject sends {"message": false} to discard the proposed fix; re-heal with a sharper prompt to try again.
If a heal needs multiple approvals, approve may stop at awaiting_approval again — just run it again.

awaiting_approval is not a failure — it means the fix is ready and waiting for your decision.

Full create-then-run workflow

Capture the collector_id cleanly with jq, then chain into run with the multi-URL batch path:

# 1. Create the scraper, save the AI output, extract the collector_id
bdata scraper create https://example.com/product/1 \
    "Extract title, price, currency, image URL, and availability" \
    --pretty -o create.json

# 2. Pull the collector_id (or copy from the human-readable summary)
COLLECTOR_ID=$(jq -r '.collector_id // .id' create.json)
echo "Built: $COLLECTOR_ID"

# 3. Run it on every URL in one batch — single API call, merged result array
bdata scraper run "$COLLECTOR_ID" --input-file urls.txt \
    --pretty -o out/results.json

Do not wrap bdata scraper run in a for url in $(cat urls.txt); do ... done loop — that's N API calls and N snapshots. Use --input-file urls.txt (or --urls "...") instead; the CLI POSTs all of them to /dca/trigger in a single array body and returns one merged result array.

For more end-to-end recipes (batch input file shapes, error recovery, web-UI handoff), see references/recipes.md.

Common mistakes

Inventing command names. The commands are exactly bdata scraper create and bdata scraper run. There is no bdata generate, no bdata scrape-batch, no bdata data, no bdata build. If you're tempted to use one of those, you're hallucinating — run bdata scraper --help to verify.
Re-running create after a timeout. Generation creates a fresh collector every time. If polling times out, the half-built collector is still printed in the error output. Resume it in the web UI or wait and re-poll — don't burn another collector.
Defaulting to --sync for everything. Sync caps at 50 s server-side. Any paginated page or heavy SPA will time out. Default async is the right choice unless you specifically know the page is small.
Skipping Scraper Studio for known platforms. If the target is Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc., the data-feeds skill exposes pre-built scrapers that are faster, cheaper, and more reliable. Only build a custom scraper when no pre-built one exists for the site, or when the user explicitly wants a custom output shape.
Treating sync timeout as failure. A --sync 202 timeout returns a valid response_id — the job is still running on the backend. Re-run the same command without --sync to pick it up.
Throwing away collector_id / response_id on error. The CLI prints them in every failure path on purpose. Both are recoverable via the API and the web UI. Always surface them to the user.
Trying to disable the batch auto-fallback. It has no flag. The fallback is the correct behavior when a URL expands past the realtime page limit. Let it run.
Hand-rolling a for url in $(cat urls.txt); do bdata scraper run ... loop. That's N API calls, N snapshot IDs, and N poll loops for what the API natively treats as one batch. Use --input-file urls.txt or --urls "u1,u2,..." — the CLI posts the whole array to /dca/trigger in a single request and returns one merged result array. This mirrors the canonical triggerWithUrls / trigger_with_urls helpers from the Scraper Studio reference SDKs.
Vague descriptions in create. A description like "scrape the page" produces a generic scraper. Name every field, name conditions ("if there's a sale price, capture both"), name disambiguators ("the price near the title, not in the recommendations sidebar"). See references/prompts.md.
Re-running create to fix a broken scraper. That builds a new collector and orphans the old one. To fix an existing scraper, use bdata scraper heal <collector_id> "<what's wrong>" — it mutates the scraper in place so your saved collector_id keeps working and improves.
Treating awaiting_approval as a failure. It is the normal end state of a heal — the fix is computed and waiting for your decision. Review preview_result, then bdata scraper approve <id> (or --reject). Use heal --auto-approve to skip the gate.

Troubleshooting

Symptom	Cause	Fix
`bdata: command not found`	CLI not installed	See `brightdata-cli` for install.
`Invalid or expired API key`	Not logged in	`bdata login` (or `bdata login --device` for SSH).
`create` returns no `id`	API call failed before template was created	Check `--timing` output for the failing request; verify network and account status.
`Timeout after 600 seconds waiting for AI generation`	Page is complex; default poll exceeded	The `collector_id` is still printed. Open it in the web UI; or re-run with `--timeout 1200`.
`status: "failed"` from progress poll	AI Flow couldn't build the template	Improve the description — be more specific about fields and selectors. Try again with a cleaner URL (e.g. a canonical product page, not a search result).
`--sync` 202 with `crawl_results_timeout`	Page took > sync server cap	Re-run without `--sync` to poll `/dca/get_result` for the printed `response_id`.
`--sync-timeout must be between 25 and 50 seconds`	Out-of-range value	Use a value in `[25, 50]`.
`Request generated N pages and exceeded realtime job limit`	URL expanded too far	This is handled automatically — wait for the batch fallback notice and let it poll.
No data returned, just `[]` or `{}`	Selectors didn't match	The scraper template ran but extracted nothing. Open `https://brightdata.com/cp/scrapers/{collector_id}` to inspect/edit; or rebuild with a tighter description.

Reference files

references/prompts.md — How to write a description for scraper create that the AI Flow can act on. Examples of strong vs. weak descriptions, field-listing patterns, and conditional rules.
references/recipes.md — End-to-end recipes: capture collector_id from create, batch run a list of URLs, handle sync→async fallback, recover from a failed create, list scrapers from the dashboard.
references/api-flow.md — Exact REST endpoints, payloads, and status sentinels the CLI uses (/dca/collector, /dca/collectors/{id}/automate_template, /dca/trigger_immediate, /dca/get_result, /dca/crawl, /dca/trigger, /dca/dataset). Read this when debugging unexpected CLI output or when the user wants to hit the API directly.

scraper-studio

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

scraper-studio

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Bright Data — Scraper Studio

Setup gate (run first)

Pick your path

Action 1 — scraper create

What happens under the hood

Critical: hold the collector_id

--timeout — default 600 s

--deliver-webhook — placeholder by default

Action 2 — scraper run

Choosing the run mode

--sync-timeout is bounded to 25–50

--sync timeout recovery

Silent auto-fallback to batch (paginated / large pages)

--name and --version

Action 3 — scraper heal

What happens under the hood

Output + the verify loop

Failure is non-destructive

Action 4 — scraper approve

Full create-then-run workflow

Common mistakes

Troubleshooting

Reference files

Similar Skills

Bright Data — Scraper Studio

Setup gate (run first)

Pick your path

Action 1 — scraper create

What happens under the hood

Critical: hold the collector_id

--timeout — default 600 s

--deliver-webhook — placeholder by default

Action 2 — scraper run

Choosing the run mode

--sync-timeout is bounded to 25–50

--sync timeout recovery

Silent auto-fallback to batch (paginated / large pages)

--name and --version

Action 3 — scraper heal

What happens under the hood

Output + the verify loop

Failure is non-destructive

Action 4 — scraper approve

Full create-then-run workflow

Common mistakes

Troubleshooting

Reference files

Similar Skills

Action 1 — `scraper create`

Critical: hold the `collector_id`

`--timeout` — default 600 s

`--deliver-webhook` — placeholder by default

Action 2 — `scraper run`

`--sync-timeout` is bounded to 25–50

`--sync` timeout recovery

`--name` and `--version`

Action 3 — `scraper heal`

Action 4 — `scraper approve`

Action 1 — `scraper create`

Critical: hold the `collector_id`

`--timeout` — default 600 s

`--deliver-webhook` — placeholder by default

Action 2 — `scraper run`

`--sync-timeout` is bounded to 25–50

`--sync` timeout recovery

`--name` and `--version`

Action 3 — `scraper heal`

Action 4 — `scraper approve`