Extracts web data using Bright Data Python SDK: platform scraping (Amazon, LinkedIn, etc.), SERP search, datasets, discovery, and browser automation.
Access web data through a unified Python SDK. One client, eight service categories: platform scraping, platform search, web search (SERP), AI-powered discovery, datasets, web unlocking, browser automation, and scraper studio.
Always use the client as a context manager. In synchronous environments (scripts, notebooks, Claude Code), use SyncBrightDataClient. In async environments, use BrightDataClient. Both use the same method names — the sync client wraps calls automatically. Note: the sync client currently has limited platform coverage — see the sync compatibility note in references/scrapers.md for details. For unsupported platforms or the datasets API, use the async client (BrightDataClient).
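A minimal sketch of both patterns (the import path and bare constructor are assumptions; the SDK may take the API token as an argument or read it from the environment, so verify against the SDK docs):

```python
import asyncio

# Import path assumed for illustration; check the SDK's own docs.
from brightdata import BrightDataClient, SyncBrightDataClient

# Sync environments (scripts, notebooks, Claude Code):
with SyncBrightDataClient() as client:
    print(client.scrape_url(url="https://example.com"))

# Async environments: same method names, just awaited.
async def main() -> None:
    async with BrightDataClient() as client:
        print(await client.scrape_url(url="https://example.com"))

asyncio.run(main())
```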
Use this decision tree to pick the right service BEFORE reaching for any specific method. Most routing failures come from skipping this step and pattern-matching on user keywords instead.
Have a URL?
├── On a supported platform (Amazon, LinkedIn, Facebook, Instagram, YouTube,
│ TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey)?
│ → Platform scraping: client.scrape.<platform>.<method>(url=...)
│
├── Generic page (not on any supported platform)?
│ → Web unlocker: client.scrape_url(url=...)
│
└── Need login / JavaScript / click-scroll-fill / CAPTCHA / multi-step navigation?
→ Browser API: client.browser.get_connect_url() (then connect with Playwright)
No URL?
├── Want entities matching natural-language criteria
│ ("find AI startups in Berlin", "competitors of Acme Corp", "people who worked at X")?
│ → Discover: client.discover(query=..., intent=...)
│
├── Want web pages / articles / search-result links
│ ("search Google for X", "find pages about Y")?
│ → SERP: client.search.google(query=...) [or .bing / .yandex]
│
├── Want to search WITHIN a specific platform
│ ("find products on Amazon", "TikTok videos by hashtag", "Pinterest pins about recipes")?
│ → Platform search: client.search.<platform>.<method>(...)
│
└── Want bulk historical data at scale
("all LinkedIn companies in tech", "historical Amazon prices", "every Zillow listing in Texas")?
→ Datasets: client.datasets.<name>(filter=...) (then .download(snapshot_id))
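For example, here are three of the no-URL branches above sketched with the async client (the keyword parameter for platform search is an assumption; verify method signatures in references/scrapers.md):

```python
from brightdata import BrightDataClient  # import path assumed

async def no_url_routes(client: BrightDataClient) -> None:
    # Wants web pages about a topic -> SERP
    serp = await client.search.google(query="best CRM tools 2024")

    # Wants to search WITHIN a platform -> platform search
    # (the `keyword` parameter name is assumed here)
    products = await client.search.amazon.products(keyword="mechanical keyboard")

    # Wants entities matching criteria -> discover
    startups = await client.discover(
        query="AI startups in Berlin", intent="find technology companies"
    )
    print(len(serp), len(products), len(startups))
```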
Edge cases:
- Supported platform, but the scraper fails or doesn't cover that page type → fall back to client.scrape_url() (web unlocker).

Before claiming any platform method exists, doesn't exist, or asserting that a platform isn't supported, you MUST consult references/scrapers.md first. Past evals show the model has hallucinated method names (e.g., client.scrape.linkedin.people — does NOT exist; use .profiles) and falsely claimed platforms unsupported (e.g., Pinterest is supported via both client.scrape.pinterest and client.search.pinterest).
Rule: load references/scrapers.md before naming any specific platform method. The reference file lists every platform, every method signature, and the sync/async availability matrix. Verify, don't assume. If you've already loaded references/scrapers.md in this session, consult what's in context — no need to reload.
Known hallucinations (these names do NOT exist in the SDK — the model has invented them in past evals):
| Hallucinated | Correct replacement |
|---|---|
| client.scrape.linkedin.people(...) | client.scrape.linkedin.profiles(url=...) |
| client.scrape.instagram.users(...) | client.scrape.instagram.profiles(url=...) |
| client.list_datasets() | client.datasets.list() |
| asyncio.gather(*[client.scrape.X.<quick>(...) for ...]) | Trigger pattern — see Batch gotcha |
This list grows as new hallucinations are observed in evals. If you're tempted to write a method name that "feels right" but you haven't seen in references/scrapers.md, treat it as a likely hallucination — load the reference and verify.
Methods that don't belong to any specific workflow — easy to overlook because they're not tied to a platform or a routing decision. The model has hallucinated some of these (client.list_datasets() instead of client.datasets.list()); use the canonical names below.
| Method | What it does |
|---|---|
| client.datasets.list() | List all 310+ datasets at runtime. Do NOT use dir() or introspection — use this method. |
| client.discover(query=..., intent=...) | AI-ranked entity search (companies, people, products). See Service Selection above. |
| client.scrape_url(url=...) | Web unlocker for any URL. Use for sites without a dedicated platform scraper. |
| client.browser.get_connect_url() | CDP WebSocket URL for Playwright / Puppeteer / Selenium. |
| client.list_zones() | List active Bright Data zones. |
| client.delete_zone(name) | Remove a zone. |
| client.test_connection() | Verify the API token works. |
| client.get_account_info() | Usage, quotas, active zones. |
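A quick sketch of the account-level calls (async client shown, since the datasets API is async-only per the sync/async gotcha below; import path assumed as above):

```python
from brightdata import BrightDataClient  # import path assumed

async def health_check(client: BrightDataClient) -> None:
    assert await client.test_connection()      # verify the API token works
    info = await client.get_account_info()     # usage, quotas, active zones
    zones = await client.list_zones()
    datasets = await client.datasets.list()    # all 310+ datasets, at runtime
    print(info, zones, len(datasets))
```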
If the user wants to know what's available or asks "what can this do?", describe these 8 categories. Each follows the template: Name — what it does — example invocation — when to use.
Platform scraping — extract structured data from 11 supported platforms by URL. client.scrape.<platform>.<method>(url=...) (platforms: Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey). Use when the user has a URL on a supported platform and wants structured fields (price, profile data, post engagement, etc.).
Platform search — search within a specific platform by keyword/profile/filter. client.search.<platform>.<method>(...). Use for "find products on Amazon by keyword", "discover TikTok videos by hashtag", "find Pinterest pins about recipes" — i.e., the user wants to search WITHIN a platform but doesn't have a specific URL.
Web search (SERP) — get structured search engine results (titles, links, snippets, rankings). client.search.google(query=...) (or .bing / .yandex). Use for "search Google for X", "find pages/articles about Y", "look up news on Z" — i.e., the user wants web pages, not entities.
Discover (AI-powered) — client.discover(query=..., intent=...) to find entities (companies, people, products, places) matching natural-language criteria. Use for "find AI startups in Berlin", "competitors of Acme Corp", "people who worked at Stripe", "research the SaaS pricing landscape" — i.e., the user wants a list of entities matching a description, not web pages.
Datasets — access 310+ pre-built datasets with historical/bulk data at scale. client.datasets.<name>(filter=...) (then .download(snapshot_id)). Use for "bulk LinkedIn company data", "historical Amazon prices for electronics", "all Zillow listings in Texas" — i.e., the user wants many records at once, not live data on one page.
Web unlocker — scrape any URL with anti-bot bypass for sites without a dedicated platform scraper. client.scrape_url(url=...). Use when the URL is on a generic website (no dedicated scraper) or as fallback when a platform scraper returns 403/blocked.
Browser API — connect to a remote browser via CDP (Chrome DevTools Protocol) for real-browser interaction. client.browser.get_connect_url() (then connect with Playwright/Puppeteer). Use for login flows, JavaScript-heavy single-page apps, click/scroll/fill interactions, CAPTCHA — i.e., anything requiring a real browser session. Most expensive option; use only when simpler methods can't accomplish the task.
Scraper Studio — run pre-built or custom scraping templates configured in the Bright Data dashboard. client.scraper_studio.run(collector="c_xxx", input={...}). Use when the user provides a collector ID for a template not covered by platform scrapers.
Offer to load the relevant reference file for details on any category.
The user has a URL and wants structured data from it.
If the URL is from a supported platform (Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey — see references/scrapers.md for the full list and available methods):
- client.scrape.<platform>.<method>(url=...)
- See references/scrapers.md for available methods per platform

If the URL is from an unsupported platform or a generic website:
- client.scrape_url(url=...) for raw page data with anti-bot bypass
- See references/advanced.md for web unlocker options

If the user has MULTIPLE URLs (batch):
- Use trigger methods (the _trigger suffix) to avoid sequential blocking
- Collect results with job.wait() and job.to_result()
- See references/advanced.md for batch execution patterns

The user wants to find information but doesn't have a starting URL.
For web search results (links, snippets, rankings):
- client.search.google(query=...), client.search.bing(query=...), or client.search.yandex(query=...)
- See references/search.md for available search engines and parameters

For platform-specific search (find products on Amazon, profiles on LinkedIn, videos on YouTube, etc.):
- client.search.<platform>.<method>(...)
- See references/scrapers.md — search methods are listed under each platform

For deeper discovery (find companies, people, or entities matching criteria):
- client.discover(query=..., intent=...)
- See references/search.md for discover API details

The user asks for "bulk data", "historical data", "database", "list of", or wants data at scale without scraping individual pages.
- Call client.datasets.list() at runtime to discover available datasets
- See references/datasets-overview.md for dataset categories and usage patterns
- Create a snapshot: snapshot_id = client.datasets.<name>(filter={...})
- Download it: data = client.datasets.<name>.download(snapshot_id) (default format is jsonl; also supports json, csv)

The user has a broad research goal (e.g., "research competitors in Berlin").
Step 1: Find sources
- client.discover(query=..., intent=...) for entity-level discovery
- client.search.google(query=...) for web search results

Step 2: Extract data from discovered sources
- client.scrape.<platform>.<method>(url=...) on each discovered URL

Step 3: Optionally enrich with bulk data
- client.datasets for historical context on the entities found

The user needs login, clicking, scrolling, form filling, or JavaScript execution.
- client.browser.get_connect_url() to get a CDP WebSocket URL
- See references/advanced.md for browser API details

The user wants to use a pre-built or custom scraping template.
- client.scraper_studio.run(collector="c_xxx", input={...})
- See references/advanced.md for scraper studio details

Default: For pages on a supported platform (Amazon products, LinkedIn profiles, Instagram posts/reels, etc.) → use the platform scraper.
Override: User explicitly mentions one of the following → comply with the browser-API request (it IS the right tool): login, sign-in, click, scroll, fill, type, JavaScript execution, CAPTCHA, screenshot, PDF generation, multi-step navigation.
Counter-override: User requests browser API for a page that does NOT need any of the above AND the URL is on a supported platform → DO NOT comply. Show the platform scraper code and explain the cost/speed difference (~10x cheaper, ~30s vs ~5min).
# WRONG (browser when scraper would do):
cdp_url = client.browser.get_connect_url()
browser = await playwright.chromium.connect_over_cdp(cdp_url)
page = await browser.new_page()
await page.goto("https://amazon.com/dp/B09V3KXJPB")
# ↑ scraper is ~10x cheaper, ~30s vs ~5min, returns structured data not raw HTML
# RIGHT (use the platform scraper):
result = await client.scrape.amazon.products(url="https://amazon.com/dp/B09V3KXJPB")
# RIGHT — legitimate browser-API case (the user mentions login):
# User said "log into Amazon and check my recent orders"
cdp_url = client.browser.get_connect_url(country="us")
browser = await playwright.chromium.connect_over_cdp(cdp_url)
page = await browser.new_page()
await page.goto("https://amazon.com/login")
await page.fill("#ap_email", username)
# ... etc — browser is the right tool here.
- If a platform scraper returns 403/blocked, use client.scrape_url() (web unlocker) as fallback — it handles anti-bot protections.
- Dataset calls return a snapshot_id, not data directly. Snapshots go through a lifecycle: scheduled → building → ready. Use .download(snapshot_id), which blocks until the snapshot is ready. Supported download formats: json, jsonl, csv.

Batch gotcha. Why: Quick methods (e.g., client.scrape.amazon.products) block for 2-10 minutes each waiting for the scrape to complete. Even with the default 10 req/s rate limit, asyncio.gather of 200 quick calls = 200 × ~5 minutes / 10 (rate limit) = ~100 minutes of blocked execution. The trigger pattern fires the request and returns a job; you collect results in parallel when they're ready.
# WRONG (anti-pattern, even with rate_limit respected):
results = await asyncio.gather(*[
client.scrape.amazon.products(url=u) for u in urls
]) # ↑ each call blocks ~5min; total ~100min for 200 URLs
# RIGHT (trigger pattern):
jobs = [await client.scrape.amazon.products_trigger(url=u) for u in urls]
for job in jobs:
await job.wait(timeout=600)
results = [await job.to_result() for job in jobs]
# ↑ total time ≈ longest single scrape ≈ 5-10 min, regardless of N
The rate limit (10 req/s by default) keeps the trigger calls sequential; the parallelism happens during the wait phase, which is just cheap status polling. Do NOT use asyncio.gather to fire the triggers in parallel either — you'll hit the rate limiter.
- In synchronous environments, use SyncBrightDataClient. In async environments, use BrightDataClient. Both use the SAME method names — the only difference is that async calls need await. Do NOT use _sync suffix methods with SyncBrightDataClient. Note: the sync client has limited platform coverage. Sync scraping supports: Amazon, LinkedIn, Instagram, Facebook, ChatGPT, Pinterest. Sync search supports: Google, Bing, Yandex, Amazon, LinkedIn, Instagram, ChatGPT, Pinterest. For TikTok, YouTube, Reddit, Perplexity, DigiKey scrapers/search and the datasets API, use the async client.
- Platform search methods (e.g., client.search.amazon.products()) are different from platform scrapers (e.g., client.scrape.amazon.products()). Search finds items by keyword. Scrape extracts data from a specific URL.

To scrape reviews from an Amazon product URL:
Use client.scrape.amazon.reviews(url="<the_url>").
Returns structured review data: rating, text, date, reviewer name.
Quick method — blocks until complete (up to ~4 minutes).
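A hedged sketch of that call; the result is assumed to be a list of dict-like records with the fields named above, so verify the exact shape in references/scrapers.md:

```python
from brightdata import BrightDataClient  # import path assumed

async def fetch_reviews(client: BrightDataClient, url: str) -> None:
    reviews = await client.scrape.amazon.reviews(url=url)  # blocks up to ~4 min
    for r in reviews:
        # Field names taken from the description above; exact keys may differ.
        print(r["rating"], r["date"], r["text"])
```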
Step 1: client.discover(query="AI startups in Berlin", intent="find technology companies")
Returns a list of matching entities with URLs and metadata.
Step 2: For each result with a URL, optionally scrape deeper data:
client.scrape.linkedin.companies(url=...) or client.scrape_url(url=...).
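Sketched end to end (the shape of each discover result, in particular the URL field, is an assumption; check references/search.md):

```python
from brightdata import BrightDataClient  # import path assumed

async def research_startups(client: BrightDataClient) -> list:
    entities = await client.discover(
        query="AI startups in Berlin",
        intent="find technology companies",
    )
    enriched = []
    for e in entities:
        url = e.get("url", "")  # field name assumed
        if "linkedin.com/company" in url:
            enriched.append(await client.scrape.linkedin.companies(url=url))
        elif url:
            enriched.append(await client.scrape_url(url=url))  # generic fallback
    return enriched
```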
Step 1: client.datasets.list() to find relevant datasets.
Step 2: Create a filtered snapshot:
snapshot_id = client.datasets.amazon_products(filter={"name": "category", "operator": "=", "value": "Electronics"}, records_limit=1000)
Step 3: Download the data:
data = client.datasets.amazon_products.download(snapshot_id)
Note: Download blocks while the snapshot builds (up to 5 minutes). Default format is jsonl (also supports json, csv). This is historical/bulk data, not live prices. Returns a list of records.
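The same three steps as a sketch (async client, since the datasets API is async-only; the filter is the example from Step 2):

```python
from brightdata import BrightDataClient  # import path assumed

async def electronics_snapshot(client: BrightDataClient) -> list:
    snapshot_id = await client.datasets.amazon_products(
        filter={"name": "category", "operator": "=", "value": "Electronics"},
        records_limit=1000,
    )
    # Blocks while the snapshot builds (scheduled -> building -> ready), up to ~5 min.
    return await client.datasets.amazon_products.download(snapshot_id)  # jsonl by default
```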
- If a platform scraper fails, try client.scrape_url() (web unlocker) as fallback, or use a different scraper method.
- Run client.datasets.list() to see available datasets. Dataset attribute names are snake_case (e.g., amazon_products, linkedin_profiles).
- For SSL errors, pass ssl_verify=False to the client constructor to skip SSL verification, or use ssl_ca_cert='/path/to/cert.pem' for custom certificate handling (see the sketch at the end of this section).

Load:
- references/scrapers.md when the user mentions Amazon, LinkedIn, Facebook, Instagram, YouTube, TikTok, Reddit, ChatGPT, Perplexity, Pinterest, DigiKey or other specific platforms — to see available scraper methods, search methods, and parameters for that platform.
- references/search.md when the user asks to "find", "search", "discover", "research", "look up" something without mentioning a specific platform — to see SERP engines and Discover API options.
- references/datasets-overview.md when the user asks for "bulk data", "historical data", "database", "list of", "dataset" or wants data at scale — to see dataset categories and how to discover specific datasets at runtime.
- references/advanced.md when the user needs batch processing of multiple URLs, non-blocking execution, browser automation, JavaScript execution, login/session handling, custom scraping templates, or when simpler methods have failed — to see execution patterns, batch workflows, Web Unlocker, Browser API, and Scraper Studio details.
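A minimal sketch of the SSL options mentioned above (parameter names are as stated in this doc; the import path is assumed):

```python
from brightdata import SyncBrightDataClient  # import path assumed

# Behind a corporate proxy, skip SSL verification:
client = SyncBrightDataClient(ssl_verify=False)

# Or supply a custom CA bundle instead:
client = SyncBrightDataClient(ssl_ca_cert="/path/to/cert.pem")
```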