From actionbook
Extracts structured data from websites like product listings, tables, or search results, generating executable Playwright scripts and JSON/CSV output.
npx claudepluginhub actionbook/actionbook --plugin actionbookThis skill uses the workspace's default tool permissions.
Activate when the user wants to **obtain data** from a website:
Extracts structured data like tables, lists, and prices from web pages using multi-strategy scraping with pagination, validation, transforms, and CSV/JSON export.
Builds production-ready web scrapers for any website using Bright Data APIs including Web Unlocker, Browser, and SERP. Guides site analysis, selector extraction, pagination handling, and code implementation in Python or Node.js.
Automatically scrapes websites by analyzing page structure, handling pagination/anti-blocking, discovering article series using Playwright and Crawl4AI. Zero config needed.
Share bugs, ideas, or general feedback.
Activate when the user wants to obtain data from a website:
The deliverable is always two artifacts:
.cjs file that reproduces the extraction without Actionbook at runtime.Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
User request
│
├─► actionbook search "<site> <intent>"
│ ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
│ └─ No results / low score ──► Fallback
│
└─► Fallback: actionbook browser open <url>
├─ actionbook browser snapshot (accessibility tree → find selectors)
├─ actionbook browser screenshot (visual confirmation)
└─ manual selector discovery via DOM inspection
Priority order for selector sources:
| Priority | Source | When |
|---|---|---|
| 1 | actionbook get | Site is indexed, health score ≥ 70% |
| 2 | actionbook browser snapshot | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Pages may render a shell first, then stream or hydrate content.
// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
const items = document.querySelectorAll('[data-item]');
return items.length > 0 && !document.querySelector('[data-pending]');
});
Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
const items = await page.$$eval('[data-row]', rows =>
rows.map(r => ({ text: r.textContent.trim() }))
);
for (const item of items) {
if (!allItems.find(i => i.text === item.text)) allItems.push(item);
}
await container.evaluate(el => el.scrollBy(0, 600));
await page.waitForTimeout(300);
const currentTop = await container.evaluate(el => el.scrollTop);
if (currentTop === previousTop) break;
previousTop = currentTop;
scrolls += 1;
}
Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
New content appends when the user scrolls near the bottom.
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1200);
const newCount = await page.$$eval('.item', els => els.length);
if (newCount > itemCount) {
itemCount = newCount;
noGrowthStreak = 0;
} else {
noGrowthStreak += 1;
}
scrolls += 1;
}
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
Multi-page results behind "Next" buttons or numbered pages.
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
const pageData = await page.$$eval('.result-item', items =>
items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
);
allData.push(...pageData);
const nextBtn = await page.$('a.next-page:not([disabled])');
if (!nextBtn) break;
const previousUrl = page.url();
const previousFirstItem = await page
.$eval('.result-item', el => el.textContent?.trim() || '')
.catch(() => '');
await nextBtn.click();
// Post-click detection only: advance must be caused by this click
const advanced = await Promise.any([
page
.waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
.then(() => true),
page
.waitForFunction(
prev => {
const first = document.querySelector('.result-item');
return !!first && (first.textContent || '').trim() !== prev;
},
previousFirstItem,
{ timeout: 5000 }
)
.then(() => true),
]).catch(() => false);
if (!advanced) break;
await page.waitForLoadState('networkidle').catch(() => {});
pageIndex += 1;
}
Identify from the user request:
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
Use this routing strictly:
Path A (default when get is good): requested fields are covered by get selectors and quality is acceptable.
get selectors and move to script draft quickly.browser text, quick scroll checks) before finalizing script strategy.snapshot / screenshot) before first draft unless probe/sample validation shows mismatch.get selectors and mark source as actionbook_get.Path B (partial / unstable): get exists but required fields are missing, selector resolves 0 elements, or validation fails.
Path C (no usable coverage): search/get has no usable result.
Path A mechanism detection timing:
actionbook browser open "<url>" (if current tab context is unknown/stale)Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):
actionbook browser open "<url>" # if not already open
actionbook browser snapshot # focus on failed field/container mapping
# actionbook browser screenshot # optional visual confirmation for failed area
Path C full fallback (no usable coverage):
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
Mechanism probes (run when script strategy needs confirmation):
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
# If count increases, treat page as lazy-load/infinite-scroll.
Fallback trigger conditions:
actionbook get cannot map all required fields.actionbook get selectors return empty/unstable values in sample run.Write a standalone Playwright script (extract_<domain>_<slug>.cjs) that:
load — see mechanisms above).JSON.stringify / CSV).maxPages, maxScrolls, timeout budget) to avoid infinite loops.Script template:
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('<URL>', { waitUntil: 'domcontentloaded' });
// -- wait for readiness --
await page.waitForSelector('<container>', { state: 'visible' });
// -- extract --
const data = await page.$$eval('<item-selector>', items =>
items.map(el => ({
// fields mapped from user request
}))
);
// -- output --
const fs = require('fs');
fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
console.log(`Extracted ${data.length} items → output.json`);
await browser.close();
})();
Run the script to confirm it works:
node extract_<domain>_<slug>.cjs
Validation rules:
| Check | Pass condition |
|---|---|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against actionbook browser text |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
Present to the user:
.cjs file they can re-run anytime.Every extract invocation produces:
| Artifact | Path | Format |
|---|---|---|
| Playwright script | ./extract_<domain>_<slug>.cjs | Standalone Node.js script using playwright |
| Extracted data | ./output.json (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
When multiple selector types are available from actionbook get:
| Priority | Type | Reason |
|---|---|---|
| 1 | data-testid | Stable, test-oriented, rarely changes |
| 2 | aria-label | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
| Error | Action |
|---|---|
actionbook search returns no results | Fall back to snapshot + screenshot |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer waitForTimeout, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with headless: false or using their own browser session via actionbook setup extension mode |