npx claudepluginhub actionbook/actionbook --plugin actionbookWant just this skill?
Then install: npx claudepluginhub u/[userId]/[slug]
Extract structured data from websites and produce an executable Playwright script plus extracted data. Use when the user wants to scrape, extract, pull, collect, or harvest data from any website — product listings, tables, search results, feeds, profiles, or any repeating content.
This skill uses the workspace's default tool permissions.
When to Use This Skill
Activate when the user wants to obtain data from a website:
- "Extract all product prices from this page"
- "Scrape the table of results from ..."
- "Pull the list of authors and titles from arXiv search results"
- "Collect all job listings from this page"
- "Get the data from this dashboard table"
- "Harvest review scores from ..."
- "Download all the links/images/cards from ..."
The deliverable is always two artifacts:
- Executable Playwright script — a standalone
.cjsfile that reproduces the extraction without Actionbook at runtime. - Extracted data — JSON (default), CSV, or user-specified format written to disk.
Decision Strategy
Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
User request
│
├─► actionbook search "<site> <intent>"
│ ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
│ └─ No results / low score ──► Fallback
│
└─► Fallback: actionbook browser open <url>
├─ actionbook browser snapshot (accessibility tree → find selectors)
├─ actionbook browser screenshot (visual confirmation)
└─ manual selector discovery via DOM inspection
Priority order for selector sources:
| Priority | Source | When |
|---|---|---|
| 1 | actionbook get | Site is indexed, health score ≥ 70% |
| 2 | actionbook browser snapshot | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.
Mechanism-Aware Script Strategy
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Streaming / SSR / RSC hydration
Pages may render a shell first, then stream or hydrate content.
// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
const items = document.querySelectorAll('[data-item]');
return items.length > 0 && !document.querySelector('[data-pending]');
});
Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.
Virtualized lists / virtual DOM
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
const items = await page.$$eval('[data-row]', rows =>
rows.map(r => ({ text: r.textContent.trim() }))
);
for (const item of items) {
if (!allItems.find(i => i.text === item.text)) allItems.push(item);
}
await container.evaluate(el => el.scrollBy(0, 600));
await page.waitForTimeout(300);
const currentTop = await container.evaluate(el => el.scrollTop);
if (currentTop === previousTop) break;
previousTop = currentTop;
scrolls += 1;
}
Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
Infinite scroll / lazy loading
New content appends when the user scrolls near the bottom.
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1200);
const newCount = await page.$$eval('.item', els => els.length);
if (newCount > itemCount) {
itemCount = newCount;
noGrowthStreak = 0;
} else {
noGrowthStreak += 1;
}
scrolls += 1;
}
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
Pagination
Multi-page results behind "Next" buttons or numbered pages.
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
const pageData = await page.$$eval('.result-item', items =>
items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
);
allData.push(...pageData);
const nextBtn = await page.$('a.next-page:not([disabled])');
if (!nextBtn) break;
const previousUrl = page.url();
const previousFirstItem = await page
.$eval('.result-item', el => el.textContent?.trim() || '')
.catch(() => '');
await nextBtn.click();
// Post-click detection only: advance must be caused by this click
const advanced = await Promise.any([
page
.waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
.then(() => true),
page
.waitForFunction(
prev => {
const first = document.querySelector('.result-item');
return !!first && (first.textContent || '').trim() !== prev;
},
previousFirstItem,
{ timeout: 5000 }
)
.then(() => true),
]).catch(() => false);
if (!advanced) break;
await page.waitForLoadState('networkidle').catch(() => {});
pageIndex += 1;
}
Execution Chain
Step 1: Understand the target
Identify from the user request:
- URL — the page to extract from
- Data shape — what fields / columns are needed
- Scope — single page, paginated, infinite scroll, or multi-page crawl
- Output format — JSON (default), CSV, or other
Step 2: Obtain selectors and choose execution path
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
Use this routing strictly:
-
Path A (default when
getis good): requested fields are covered bygetselectors and quality is acceptable.- Start from
getselectors and move to script draft quickly. - You may run lightweight mechanism probes (
browser text, quick scroll checks) before finalizing script strategy. - Do not run full fallback (
snapshot/screenshot) before first draft unless probe/sample validation shows mismatch. - Field mapping must default to
getselectors and mark source asactionbook_get.
- Start from
-
Path B (partial / unstable):
getexists but required fields are missing, selector resolves 0 elements, or validation fails.- Run targeted fallback only for failed fields/steps.
-
Path C (no usable coverage): search/get has no usable result.
- Run full fallback discovery.
Step 3: Probe page mechanisms and fallback only when needed
Path A mechanism detection timing:
- Run minimal probes either before final script draft or during sample validation.
- Before any probe command, ensure the correct page context is open:
actionbook browser open "<url>"(if current tab context is unknown/stale)
- If probes/sample run indicate mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.
Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):
actionbook browser open "<url>" # if not already open
actionbook browser snapshot # focus on failed field/container mapping
# actionbook browser screenshot # optional visual confirmation for failed area
Path C full fallback (no usable coverage):
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
Mechanism probes (run when script strategy needs confirmation):
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
# If count increases, treat page as lazy-load/infinite-scroll.
Fallback trigger conditions:
actionbook getcannot map all required fields.actionbook getselectors return empty/unstable values in sample run.- Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).
Step 4: Generate Playwright script
Write a standalone Playwright script (extract_<domain>_<slug>.cjs) that:
- Navigates to the target URL.
- Waits for the correct readiness signal (not just
load— see mechanisms above). - Handles the detected mechanism (virtual scroll, pagination, etc.).
- Extracts data into structured objects.
- Writes output to disk (
JSON.stringify/ CSV). - Closes the browser.
- Enforces guardrails (
maxPages,maxScrolls, timeout budget) to avoid infinite loops.
Script template:
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('<URL>', { waitUntil: 'domcontentloaded' });
// -- wait for readiness --
await page.waitForSelector('<container>', { state: 'visible' });
// -- extract --
const data = await page.$$eval('<item-selector>', items =>
items.map(el => ({
// fields mapped from user request
}))
);
// -- output --
const fs = require('fs');
fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
console.log(`Extracted ${data.length} items → output.json`);
await browser.close();
})();
Step 5: Execute and validate
Run the script to confirm it works:
node extract_<domain>_<slug>.cjs
Validation rules:
| Check | Pass condition |
|---|---|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against actionbook browser text |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
Step 6: Deliver
Present to the user:
- Script path — the
.cjsfile they can re-run anytime. - Data path — the output JSON/CSV file.
- Record count — how many items were extracted.
- Notes — any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").
Output Contract
Every extract invocation produces:
| Artifact | Path | Format |
|---|---|---|
| Playwright script | ./extract_<domain>_<slug>.cjs | Standalone Node.js script using playwright |
| Extracted data | ./output.json (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
Selector Priority
When multiple selector types are available from actionbook get:
| Priority | Type | Reason |
|---|---|---|
| 1 | data-testid | Stable, test-oriented, rarely changes |
| 2 | aria-label | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
Error Handling
| Error | Action |
|---|---|
actionbook search returns no results | Fall back to snapshot + screenshot |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer waitForTimeout, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with headless: false or using their own browser session via actionbook setup extension mode |
Similar Skills
You MUST use this before any creative work - creating features, building components, adding functionality, or modifying behavior. Explores user intent, requirements and design before implementation.