From brightdata-plugin
Builds a RAG pipeline or custom search engine that uses Bright Data's Discover API, with its intent-ranked web results and parsed page content, as the retrieval/ingestion layer for LLMs or vector stores.
npx claudepluginhub brightdata/skills --plugin brightdata-plugin
How this skill is triggered (by the user, by Claude, or both)
Slash command
/brightdata-plugin:rag-pipeline
The summary Claude sees in its skill listing (used to decide when to auto-load this skill)
Use Discover as the **retrieval layer** for an LLM app or a custom search engine.
<!-- AUTO-GENERATED by export-plugins.py — DO NOT EDIT -->
Use Discover as the retrieval layer for an LLM app or a custom search engine.
Discover already returns intent-ranked, relevance-scored results with parsed
page content, so it does the "search + fetch + clean" stage of RAG for you. This
is a code/architecture skill built on the discover-api skill — read that
for API mechanics (trigger/poll, modes, params, limits).
Pick the right neighbor: a written brief → live-research; markdown of specific
URLs you already have → scrape; structured platform records → data-feeds.
Does the corpus change every query, or is it a stable knowledge base?
├── Per-query, always-fresh ("ground each answer in live web data")
│     → LIVE RETRIEVAL: Discover(include_content) at query time → top-k → LLM
│       Pros: always current, no storage. Cons: per-query latency + cost.
│
└── Reused across many queries ("build a knowledge base / search engine")
      → INGESTION: Discover(include_content) → chunk → embed → vector store
        then at query time: embed query → vector search → (rerank) → LLM
        Pros: fast queries, cacheable. Cons: can go stale (re-ingest on a schedule).
Many systems do both: an ingested base for breadth + a live Discover call for freshness, merged before the LLM.
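The merge step for the combined setup can be sketched as below. This is a minimal illustration, not part of the skill: the hit shape `{ link, content, score }`, the names `normalizeUrl` and `mergeHits`, and the list of tracking parameters are all assumptions, and it further assumes the two score scales (vector similarity and relevance_score) are already comparable.

```javascript
// Hypothetical hit shape: { link, content, score }.
// Assumes `score` is already comparable across both sources.

// Normalize a URL so the same page reached via different variants dedups.
function normalizeUrl(link) {
  try {
    const u = new URL(link);
    u.hash = '';
    // Drop common tracking parameters (illustrative list, not exhaustive).
    for (const p of [...u.searchParams.keys()]) {
      if (/^utm_|^fbclid$|^gclid$/i.test(p)) u.searchParams.delete(p);
    }
    const host = u.hostname.replace(/^www\./, '');
    const path = u.pathname.replace(/\/+$/, '');
    return `${host}${path}${u.search}`;
  } catch {
    return link; // not a parseable URL: fall back to the raw string
  }
}

// Merge hits from the ingested base with hits from a live Discover call.
function mergeHits(baseHits, liveHits, k = 6) {
  const best = new Map();
  // Live hits go first, so on a score tie the fresher copy is kept.
  for (const hit of [...liveHits, ...baseHits]) {
    const key = normalizeUrl(hit.link);
    const seen = best.get(key);
    if (!seen || hit.score > seen.score) best.set(key, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, k);
}
```

Deduplicating on the normalized URL before the final sort keeps one page from occupying several of the k context slots.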
Pattern: on each user question, run Discover with a sharp intent, take the
top-k by relevance_score, and pass their content as context to the LLM. The
LLM cites the links.
import { bdclient } from '@brightdata/sdk';
const client = new bdclient(); // reads BRIGHTDATA_API_TOKEN from the environment

async function retrieve(question, k = 6) {
  const res = await client.discover(question, {
    intent: `authoritative sources that directly answer: ${question}`,
    includeContent: true,
    numResults: Math.min(k * 2, 20), // over-fetch, then trim
  });
  // NOTE: the JS SDK returns a WRAPPER object, not a bare array:
  // { success, data: [ {link,title,description,relevance_score,content?} ], totalResults, cost, taskId, ... }
  // The result rows are in `.data` (CLI/REST use `.results` instead; see discover-api).
  if (!res.success) throw new Error(`discover failed: ${res.error ?? 'unknown'}`);
  return (res.data ?? [])
    // Drop block pages, error pages, and near-empty bodies before ranking.
    .filter(r => r.content && !/just a moment|captcha|access denied|not found/i.test(r.content) && r.content.length > 200)
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .slice(0, k);
}
// const sources = await retrieve(question);
// → build a prompt from sources[].content, ask the LLM to answer WITH [n] citations to sources[].link
Full prompt-assembly + citation pattern: references/code.md.
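As a rough idea of what the prompt and citation step might look like, here is a minimal sketch. It is not the contents of references/code.md: the function names `buildPrompt` and `unknownCitations`, the instruction wording, and the per-source character cap are assumptions made for illustration. It expects `sources` in the shape returned by `retrieve()` above.

```javascript
// Number the sources 1..n and build a grounded prompt.
// `sources`: [{ link, title?, content }] as returned by retrieve().
function buildPrompt(question, sources, maxCharsPerSource = 4000) {
  const context = sources
    .map((s, i) =>
      `[${i + 1}] ${s.title ?? s.link}\nURL: ${s.link}\n${s.content.slice(0, maxCharsPerSource)}`)
    .join('\n\n');
  return [
    'Answer the question using only the sources below.',
    'Cite every claim with its source number in square brackets, e.g. [2].',
    'If the sources do not contain the answer, say so.',
    '',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Return the citation numbers in `answer` that do not map to a retrieved source.
function unknownCitations(answer, sources) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  return [...new Set(cited)].filter(n => n < 1 || n > sources.length);
}
```

An empty array from `unknownCitations` means every `[n]` in the answer points at a source that was actually retrieved; anything else is a hallucinated citation worth rejecting or retrying.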
Pattern: discover broadly (high volume — zeroRanking via REST is ideal here),
chunk each page's content, embed the chunks, upsert into a vector store with the
source URL as metadata. At query time: embed the query, vector-search, optionally
rerank, then feed to the LLM.
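The chunk-with-metadata part of that pattern can be sketched as follows. This is an illustrative sketch only: `chunkText`, `toChunkRecords`, the record shape, and the character-window sizes are assumptions, and fixed character windows stand in for whatever chunking strategy a real pipeline uses. It only skips null or empty bodies; block-page filtering is assumed to happen separately.

```javascript
// Split text into overlapping character windows.
function chunkText(text, size = 1200, overlap = 200) {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

// Turn Discover result rows into records ready to embed and upsert.
// Each record carries the source URL so answers can cite it later.
function toChunkRecords(pages, size = 1200, overlap = 200) {
  return pages
    .filter(p => typeof p.content === 'string' && p.content.trim().length > 0)
    .flatMap(p =>
      chunkText(p.content, size, overlap).map((text, i) => ({
        id: `${p.link}#${i}`, // hypothetical id scheme: URL plus chunk index
        text,
        metadata: {
          link: p.link,
          title: p.title ?? null,
          relevance_score: p.relevance_score ?? null,
        },
      })));
}
```

Deriving the id from the URL and chunk index makes re-ingestion an overwrite rather than a duplicate, provided the vector store upserts by id.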
Stages: discover → dedup → chunk → embed → upsert (ingest), then
embed query → search → rerank → generate (serve). Provider-agnostic code for
both stages, including chunking and metadata, is in
references/code.md.
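To make the serve stage concrete, here is a small in-memory stand-in for a vector store. It is a sketch under stated assumptions, not the interface from references/code.md: `createMemoryIndex`, its `upsert`/`search`/`size` methods, and the injected `embed` function (`async (text) => number[]`) are hypothetical names, and brute-force cosine similarity replaces a real vector database. Reranking is left out.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// In-memory index. `embed` is injected so any embedding provider can be used.
function createMemoryIndex(embed) {
  const rows = [];
  return {
    // Insert records ({ id, text, metadata }), overwriting rows with the same id.
    async upsert(records) {
      for (const r of records) {
        const row = { ...r, vector: await embed(r.text) };
        const at = rows.findIndex(x => x.id === r.id);
        if (at >= 0) rows[at] = row; else rows.push(row);
      }
    },
    // Embed the query and return the k most similar rows, best first.
    async search(query, k = 5) {
      const q = await embed(query);
      return rows
        .map(r => ({ id: r.id, text: r.text, metadata: r.metadata, score: cosine(q, r.vector) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k);
    },
    size: () => rows.length,
  };
}
```

Because each hit keeps its `metadata`, the source link travels with the chunk all the way to prompt construction.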
For bulk corpus building, prefer the raw REST "mode":"zeroRanking" flow (max raw
results, no ranking) from the discover-api skill — but note it ignores
num_results and does not support include_content, so you fetch content
separately (Discover standard/deep with content, or the scrape skill).
Rules:
- Keep each source's link (and ideally title + relevance_score) with every chunk. RAG without citations is unverifiable.
- Validate content before embedding. Skip block pages and empty bodies (oversized PDFs return null content). Embedding garbage poisons retrieval.
- Use relevance_score. Discover's score is a strong prior for top-k selection before (or instead of) a reranker.
- Keep num_results ≤ 20 per call; dedup by normalized URL across calls so one article via three aggregators isn't triple-weighted.

Verify:
- Real content made it into the index: spot-check stored chunks.
- Every [n] the LLM emits maps to a real source link in the retrieved set.

Avoid:
- Embedding content without filtering block pages / nulls.
- Treating num_results as unlimited (cap 20) or expecting include_content under zeroRanking.

See also:
- references/code.md: runnable JS + Python for both architectures: live retrieval with prompt+citation assembly, and the full ingestion pipeline (discover → dedup → chunk → embed → upsert → query), with a provider-agnostic embedder/vector-store interface.
- discover-api: the retrieval API (trigger/poll, modes, include_content, limits). Read first.
- live-research: one-off synthesized report instead of a standing system.
- scrape: fetch markdown for specific URLs you already have.
- js-sdk-best-practices / python-sdk-best-practices: client.discover() option details and batch patterns.