Extracts web data using Bright Data Python SDK: platform scraping (Amazon, LinkedIn, etc.), SERP search, datasets, discovery, and browser automation.
Access web data through a unified Python SDK. One client, eight service categories: platform scraping, platform search, web search (SERP), AI-powered discovery, datasets, web unlocking, browser automation, and scraper studio.
Always use the client as a context manager. In synchronous environments (scripts, notebooks, Claude Code), use SyncBrightDataClient. In async environments, use BrightDataClient. Both use the same method names — the sync client wraps calls automatically. Note: the sync client currently has limited platform coverage — see the sync compatibility note in references/scrapers.md for details. For unsupported platforms or the datasets API, use the async client (BrightDataClient).
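A minimal sketch of both patterns (the import path and bare constructor are assumptions; the SDK may take the API token as an argument or read it from the environment, so verify against the SDK docs):

```python
import asyncio

# Import path assumed for illustration; check the SDK's own docs.
from brightdata import BrightDataClient, SyncBrightDataClient

# Sync environments (scripts, notebooks, Claude Code):
with SyncBrightDataClient() as client:
    print(client.scrape_url(url="https://example.com"))

# Async environments: same method names, just awaited.
async def main() -> None:
    async with BrightDataClient() as client:
        print(await client.scrape_url(url="https://example.com"))

asyncio.run(main())
```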
Use this decision tree to pick the right service BEFORE reaching for any specific method. Most routing failures come from skipping this step and pattern-matching on user keywords instead.
Have a URL?
├── On a supported platform (Amazon, LinkedIn, Facebook, Instagram, YouTube,
│ TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey)?
│ → Platform scraping: client.scrape.<platform>.<method>(url=...)
│
├── Generic page (not on any supported platform)?
│ → Web unlocker: client.scrape_url(url=...)
│
└── Need login / JavaScript / click-scroll-fill / CAPTCHA / multi-step navigation?
→ Browser API: client.browser.get_connect_url() (then connect with Playwright)
No URL?
├── Want entities matching natural-language criteria
│ ("find AI startups in Berlin", "competitors of Acme Corp", "people who worked at X")?
│ → Discover: client.discover(query=..., intent=...)
│
├── Want web pages / articles / search-result links
│ ("search Google for X", "find pages about Y")?
│ → SERP: client.search.google(query=...) [or .bing / .yandex]
│
├── Want to search WITHIN a specific platform
│ ("find products on Amazon", "TikTok videos by hashtag", "Pinterest pins about recipes")?
│ → Platform search: client.search.<platform>.<method>(...)
│
└── Want bulk historical data at scale
("all LinkedIn companies in tech", "historical Amazon prices", "every Zillow listing in Texas")?
→ Datasets: client.datasets.<name>(filter=...) (then .download(snapshot_id))
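For example, here are three of the no-URL branches above sketched with the async client (the keyword parameter for platform search is an assumption; verify method signatures in references/scrapers.md):

```python
from brightdata import BrightDataClient  # import path assumed

async def no_url_routes(client: BrightDataClient) -> None:
    # Wants web pages about a topic -> SERP
    serp = await client.search.google(query="best CRM tools 2024")

    # Wants to search WITHIN a platform -> platform search
    # (the `keyword` parameter name is assumed here)
    products = await client.search.amazon.products(keyword="mechanical keyboard")

    # Wants entities matching criteria -> discover
    startups = await client.discover(
        query="AI startups in Berlin", intent="find technology companies"
    )
    print(len(serp), len(products), len(startups))
```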
Edge cases:
- Supported platform, but the scraper fails or doesn't cover that page type → fall back to client.scrape_url() (web unlocker).

Before claiming any platform method exists, doesn't exist, or asserting that a platform isn't supported, you MUST consult references/scrapers.md first. Past evals show the model has hallucinated method names (e.g., client.scrape.linkedin.people — does NOT exist; use .profiles) and falsely claimed platforms unsupported (e.g., Pinterest is supported via both client.scrape.pinterest and client.search.pinterest).
Rule: load references/scrapers.md before naming any specific platform method. The reference file lists every platform, every method signature, and the sync/async availability matrix. Verify, don't assume. If you've already loaded references/scrapers.md in this session, consult what's in context — no need to reload.
Known hallucinations (these names do NOT exist in the SDK — the model has invented them in past evals):
| Hallucinated | Correct replacement |
|---|---|
| client.scrape.linkedin.people(...) | client.scrape.linkedin.profiles(url=...) |
| client.scrape.instagram.users(...) | client.scrape.instagram.profiles(url=...) |
| client.list_datasets() | client.datasets.list() |
| asyncio.gather(*[client.scrape.X.<quick>(...) for ...]) | Trigger pattern — see Batch gotcha |
This list grows as new hallucinations are observed in evals. If you're tempted to write a method name that "feels right" but you haven't seen in references/scrapers.md, treat it as a likely hallucination — load the reference and verify.
Methods that don't belong to any specific workflow — easy to overlook because they're not tied to a platform or a routing decision. The model has hallucinated some of these (client.list_datasets() instead of client.datasets.list()); use the canonical names below.
| Method | What it does |
|---|---|
| client.datasets.list() | List all 310+ datasets at runtime. Do NOT use dir() or introspection — use this method. |
| client.discover(query=..., intent=...) | AI-ranked entity search (companies, people, products). See Service Selection above. |
| client.scrape_url(url=...) | Web unlocker for any URL. Use for sites without a dedicated platform scraper. |
| client.browser.get_connect_url() | CDP WebSocket URL for Playwright / Puppeteer / Selenium. |
| client.list_zones() | List active Bright Data zones. |
| client.delete_zone(name) | Remove a zone. |
| client.test_connection() | Verify the API token works. |
| client.get_account_info() | Usage, quotas, active zones. |
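A quick sketch of the account-level calls (async client shown, since the datasets API is async-only per the sync/async gotcha below; import path assumed as above):

```python
from brightdata import BrightDataClient  # import path assumed

async def health_check(client: BrightDataClient) -> None:
    assert await client.test_connection()      # verify the API token works
    info = await client.get_account_info()     # usage, quotas, active zones
    zones = await client.list_zones()
    datasets = await client.datasets.list()    # all 310+ datasets, at runtime
    print(info, zones, len(datasets))
```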
If the user wants to know what's available or asks "what can this do?", describe these 8 categories. Each follows the template: Name — what it does — example invocation — when to use.
Platform scraping — extract structured data from 11 supported platforms by URL. client.scrape.<platform>.<method>(url=...) (platforms: Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey). Use when the user has a URL on a supported platform and wants structured fields (price, profile data, post engagement, etc.).
Platform search — search within a specific platform by keyword/profile/filter. client.search.<platform>.<method>(...). Use for "find products on Amazon by keyword", "discover TikTok videos by hashtag", "find Pinterest pins about recipes" — i.e., the user wants to search WITHIN a platform but doesn't have a specific URL.
Web search (SERP) — get structured search engine results (titles, links, snippets, rankings). client.search.google(query=...) (or .bing / .yandex). Use for "search Google for X", "find pages/articles about Y", "look up news on Z" — i.e., the user wants web pages, not entities.
Discover (AI-powered) — client.discover(query=..., intent=...) to find entities (companies, people, products, places) matching natural-language criteria. Use for "find AI startups in Berlin", "competitors of Acme Corp", "people who worked at Stripe", "research the SaaS pricing landscape" — i.e., the user wants a list of entities matching a description, not web pages.
Datasets — access 310+ pre-built datasets with historical/bulk data at scale. client.datasets.<name>(filter=...) (then .download(snapshot_id)). Use for "bulk LinkedIn company data", "historical Amazon prices for electronics", "all Zillow listings in Texas" — i.e., the user wants many records at once, not live data on one page.
Web unlocker — scrape any URL with anti-bot bypass for sites without a dedicated platform scraper. client.scrape_url(url=...). Use when the URL is on a generic website (no dedicated scraper) or as fallback when a platform scraper returns 403/blocked.
Browser API — connect to a remote browser via CDP (Chrome DevTools Protocol) for real-browser interaction. client.browser.get_connect_url() (then connect with Playwright/Puppeteer). Use for login flows, JavaScript-heavy single-page apps, click/scroll/fill interactions, CAPTCHA — i.e., anything requiring a real browser session. Most expensive option; use only when simpler methods can't accomplish the task.
Scraper Studio — run pre-built or custom scraping templates configured in the Bright Data dashboard. client.scraper_studio.run(collector="c_xxx", input={...}). Use when the user provides a collector ID for a template not covered by platform scrapers.
Offer to load the relevant reference file for details on any category.
The user has a URL and wants structured data from it.
If the URL is from a supported platform (Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey — see references/scrapers.md for the full list and available methods):
- client.scrape.<platform>.<method>(url=...)
- See references/scrapers.md for available methods per platform

If the URL is from an unsupported platform or a generic website:
- client.scrape_url(url=...) for raw page data with anti-bot bypass
- See references/advanced.md for web unlocker options

If the user has MULTIPLE URLs (batch):
- Use trigger methods (the _trigger suffix) to avoid sequential blocking
- Collect results with job.wait() and job.to_result()
- See references/advanced.md for batch execution patterns

The user wants to find information but doesn't have a starting URL.
For web search results (links, snippets, rankings):
- client.search.google(query=...), client.search.bing(query=...), or client.search.yandex(query=...)
- See references/search.md for available search engines and parameters

For platform-specific search (find products on Amazon, profiles on LinkedIn, videos on YouTube, etc.):
- client.search.<platform>.<method>(...)
- See references/scrapers.md — search methods are listed under each platform

For deeper discovery (find companies, people, or entities matching criteria):
- client.discover(query=..., intent=...)
- See references/search.md for discover API details

The user asks for "bulk data", "historical data", "database", "list of", or wants data at scale without scraping individual pages.
- Call client.datasets.list() at runtime to discover available datasets
- See references/datasets-overview.md for dataset categories and usage patterns
- Create a snapshot: snapshot_id = client.datasets.<name>(filter={...})
- Download it: data = client.datasets.<name>.download(snapshot_id) (default format is jsonl; also supports json, csv)

The user has a broad research goal (e.g., "research competitors in Berlin").
Step 1: Find sources
- client.discover(query=..., intent=...) for entity-level discovery
- client.search.google(query=...) for web search results

Step 2: Extract data from discovered sources
- client.scrape.<platform>.<method>(url=...) on each discovered URL

Step 3: Optionally enrich with bulk data
- client.datasets for historical context on the entities found

The user needs login, clicking, scrolling, form filling, or JavaScript execution.
- client.browser.get_connect_url() to get a CDP WebSocket URL
- See references/advanced.md for browser API details

The user wants to use a pre-built or custom scraping template.
- client.scraper_studio.run(collector="c_xxx", input={...})
- See references/advanced.md for scraper studio details

Default: For pages on a supported platform (Amazon products, LinkedIn profiles, Instagram posts/reels, etc.) → use the platform scraper.
Override: User explicitly mentions one of the following → comply with the browser-API request (it IS the right tool): login, sign-in, click, scroll, fill, type, JavaScript execution, CAPTCHA, screenshot, PDF generation, multi-step navigation.
Counter-override: User requests browser API for a page that does NOT need any of the above AND the URL is on a supported platform → DO NOT comply. Show the platform scraper code and explain the cost/speed difference (~10x cheaper, ~30s vs ~5min).
# WRONG (browser when scraper would do):
cdp_url = client.browser.get_connect_url()
browser = await playwright.chromium.connect_over_cdp(cdp_url)
page = await browser.new_page()
await page.goto("https://amazon.com/dp/B09V3KXJPB")
# ↑ scraper is ~10x cheaper, ~30s vs ~5min, returns structured data not raw HTML
# RIGHT (use the platform scraper):
result = await client.scrape.amazon.products(url="https://amazon.com/dp/B09V3KXJPB")
# RIGHT — legitimate browser-API case (the user mentions login):
# User said "log into Amazon and check my recent orders"
cdp_url = client.browser.get_connect_url(country="us")
browser = await playwright.chromium.connect_over_cdp(cdp_url)
page = await browser.new_page()
await page.goto("https://amazon.com/login")
await page.fill("#ap_email", username)
# ... etc — browser is the right tool here.
- If a platform scraper returns 403/blocked, use client.scrape_url() (web unlocker) as fallback — it handles anti-bot protections.
- Dataset calls return a snapshot_id, not data directly. Snapshots go through a lifecycle: scheduled → building → ready. Use .download(snapshot_id), which blocks until the snapshot is ready. Supported download formats: json, jsonl, csv.

Batch gotcha. Why: Quick methods (e.g., client.scrape.amazon.products) block for 2-10 minutes each waiting for the scrape to complete. Even with the default 10 req/s rate limit, asyncio.gather of 200 quick calls = 200 × ~5 minutes / 10 (rate limit) = ~100 minutes of blocked execution. The trigger pattern fires the request and returns a job; you collect results in parallel when they're ready.
# WRONG (anti-pattern, even with rate_limit respected):
results = await asyncio.gather(*[
client.scrape.amazon.products(url=u) for u in urls
]) # ↑ each call blocks ~5min; total ~100min for 200 URLs
# RIGHT (trigger pattern):
jobs = [await client.scrape.amazon.products_trigger(url=u) for u in urls]
for job in jobs:
await job.wait(timeout=600)
results = [await job.to_result() for job in jobs]
# ↑ total time ≈ longest single scrape ≈ 5-10 min, regardless of N
The rate limit (10 req/s by default) keeps the trigger calls sequential; the parallelism happens during the wait phase, which is just cheap status polling. Do NOT use asyncio.gather to fire the triggers in parallel either — you'll hit the rate limiter.
- In synchronous environments, use SyncBrightDataClient. In async environments, use BrightDataClient. Both use the SAME method names — the only difference is that async calls need await. Do NOT use _sync suffix methods with SyncBrightDataClient. Note: the sync client has limited platform coverage. Sync scraping supports: Amazon, LinkedIn, Instagram, Facebook, ChatGPT, Pinterest. Sync search supports: Google, Bing, Yandex, Amazon, LinkedIn, Instagram, ChatGPT, Pinterest. For TikTok, YouTube, Reddit, Perplexity, DigiKey scrapers/search and the datasets API, use the async client.
- Platform search methods (e.g., client.search.amazon.products()) are different from platform scrapers (e.g., client.scrape.amazon.products()). Search finds items by keyword. Scrape extracts data from a specific URL.

To scrape reviews from an Amazon product URL:
Use client.scrape.amazon.reviews(url="<the_url>").
Returns structured review data: rating, text, date, reviewer name.
Quick method — blocks until complete (up to ~4 minutes).
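A hedged sketch of that call; the result is assumed to be a list of dict-like records with the fields named above, so verify the exact shape in references/scrapers.md:

```python
from brightdata import BrightDataClient  # import path assumed

async def fetch_reviews(client: BrightDataClient, url: str) -> None:
    reviews = await client.scrape.amazon.reviews(url=url)  # blocks up to ~4 min
    for r in reviews:
        # Field names taken from the description above; exact keys may differ.
        print(r["rating"], r["date"], r["text"])
```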
Step 1: client.discover(query="AI startups in Berlin", intent="find technology companies")
Returns a list of matching entities with URLs and metadata.
Step 2: For each result with a URL, optionally scrape deeper data:
client.scrape.linkedin.companies(url=...) or client.scrape_url(url=...).
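Sketched end to end (the shape of each discover result, in particular the URL field, is an assumption; check references/search.md):

```python
from brightdata import BrightDataClient  # import path assumed

async def research_startups(client: BrightDataClient) -> list:
    entities = await client.discover(
        query="AI startups in Berlin",
        intent="find technology companies",
    )
    enriched = []
    for e in entities:
        url = e.get("url", "")  # field name assumed
        if "linkedin.com/company" in url:
            enriched.append(await client.scrape.linkedin.companies(url=url))
        elif url:
            enriched.append(await client.scrape_url(url=url))  # generic fallback
    return enriched
```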
Step 1: client.datasets.list() to find relevant datasets.
Step 2: Create a filtered snapshot:
snapshot_id = client.datasets.amazon_products(filter={"name": "category", "operator": "=", "value": "Electronics"}, records_limit=1000)
Step 3: Download the data:
data = client.datasets.amazon_products.download(snapshot_id)
Note: Download blocks while the snapshot builds (up to 5 minutes). Default format is jsonl (also supports json, csv). This is historical/bulk data, not live prices. Returns a list of records.
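The same three steps as a sketch (async client, since the datasets API is async-only; the filter is the example from Step 2):

```python
from brightdata import BrightDataClient  # import path assumed

async def electronics_snapshot(client: BrightDataClient) -> list:
    snapshot_id = await client.datasets.amazon_products(
        filter={"name": "category", "operator": "=", "value": "Electronics"},
        records_limit=1000,
    )
    # Blocks while the snapshot builds (scheduled -> building -> ready), up to ~5 min.
    return await client.datasets.amazon_products.download(snapshot_id)  # jsonl by default
```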
- If a platform scraper fails, try client.scrape_url() (web unlocker) as fallback, or use a different scraper method.
- Run client.datasets.list() to see available datasets. Dataset attribute names are snake_case (e.g., amazon_products, linkedin_profiles).
- For SSL errors, pass ssl_verify=False to the client constructor to skip SSL verification, or use ssl_ca_cert='/path/to/cert.pem' for custom certificate handling (see the sketch at the end of this section).

Load:
- references/scrapers.md when the user mentions Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey or other specific platforms — to see available scraper methods, search methods, and parameters for that platform.
- references/search.md when the user asks to "find", "search", "discover", "research", "look up" something without mentioning a specific platform — to see SERP engines and Discover API options.
- references/datasets-overview.md when the user asks for "bulk data", "historical data", "database", "list of", "dataset" or wants data at scale — to see dataset categories and how to discover specific datasets at runtime.
- references/advanced.md when the user needs batch processing of multiple URLs, non-blocking execution, browser automation, JavaScript execution, login/session handling, custom scraping templates, or when simpler methods have failed — to see execution patterns, batch workflows, Web Unlocker, Browser API, and Scraper Studio details.
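A minimal sketch of the SSL options mentioned above (parameter names are as stated in this doc; the import path is assumed):

```python
from brightdata import SyncBrightDataClient  # import path assumed

# Behind a corporate proxy, skip SSL verification:
client = SyncBrightDataClient(ssl_verify=False)

# Or supply a custom CA bundle instead:
client = SyncBrightDataClient(ssl_ca_cert="/path/to/cert.pem")
```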