Guide AI agents to generate complete PageObject pattern web scraper projects using Playwright and TypeScript with Docker deployment. Supports agent-browser site analysis for automated selector discovery. Keywords: scraper, playwright, pageobject, web scraping, docker, typescript, data extraction, automation.
npx claudepluginhub joshuarweaver/cascade-content-creation-misc-1 --plugin jwynia-agent-skills-1

This skill uses the workspace's default tool permissions.
Generate complete, runnable web scraper projects using the PageObject pattern with Playwright and TypeScript. This skill produces site-specific scrapers with typed data extraction, Docker deployment, and optional agent-browser integration for automated site analysis.
Bundled resources:

- assets/configs/ — docker-compose.yml.md, dockerfile.md, package.json.md, playwright.config.ts.md, tsconfig.json.md
- assets/examples/ — ecommerce-scraper.md, multi-page-pagination.md
- assets/templates/ — base-page.ts.md, component.ts.md, data-schema.ts.md, page-object.ts.md, scraper-runner.ts.md
- data/ — selector-patterns.json, site-archetypes.json
- references/ — agent-browser-workflow.md, anti-patterns.md, docker-setup.md, pageobject-pattern.md, playwright-selectors.md
- scripts/ — generate-page-object.ts
Use this skill when:
Do NOT use this skill when:
Each page on the target site maps to one PageObject class. Locators are defined in the constructor, and scraping logic lives in methods. Page objects never contain assertions or business logic — they extract and return data.
Prefer selectors in this order: data-testid > id > semantic HTML (role, aria-label) > structured CSS classes > text content. Avoid positional selectors (nth-child) and layout-dependent paths. See references/playwright-selectors.md for the full hierarchy.
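As an illustration (not part of the generated templates), the hierarchy can be encoded as a small helper that picks the strongest available selector for a discovered element; the attribute names here are assumptions:

```typescript
// Candidate attributes discovered for an element (all optional).
interface ElementInfo {
  testId?: string;
  id?: string;
  role?: string;
  className?: string;
  text?: string;
}

// Return the most resilient Playwright-style selector available, following:
// data-testid > id > role > CSS class > text content.
function preferredSelector(el: ElementInfo): string | undefined {
  if (el.testId) return `[data-testid="${el.testId}"]`;
  if (el.id) return `#${el.id}`;
  if (el.role) return `role=${el.role}`;
  if (el.className) return `.${el.className.split(/\s+/)[0]}`;
  if (el.text) return `text=${el.text}`;
  return undefined;
}
```

Note that positional selectors never appear in the fallback chain: an element with no stable hook returns `undefined` rather than an nth-child path.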
Reusable UI patterns (pagination, data tables, search bars) are modeled as component classes that page objects compose via properties. Only BasePage uses inheritance — everything else composes.
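A minimal sketch of this composition rule, using a local stand-in type for Playwright's `Locator` so the example is self-contained (the class and selector names are illustrative, not the templates' exact API):

```typescript
// Stand-in for Playwright's Locator: can narrow scope and count matches.
interface Scope {
  locator(selector: string): Scope;
  count(): Promise<number>;
}

// Reusable component: receives a parent scope, never creates its own.
class Pagination {
  constructor(private readonly scope: Scope) {}
  async hasNextPage(): Promise<boolean> {
    return (await this.scope.locator('a.next:not(.disabled)').count()) > 0;
  }
}

// Page object composes the component as a property (no inheritance).
class ProductListingPage {
  readonly pagination: Pagination;
  constructor(root: Scope) {
    this.pagination = new Pagination(root.locator('nav.pagination'));
  }
}
```

Because `Pagination` only sees the scope it is handed, the same class can serve any page object that contains a pagination control.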
All scraped data flows through Zod schemas for validation. This catches selector drift (when a site changes its markup) at extraction time rather than downstream. See assets/templates/data-schema.ts.md.
Generated projects include a Dockerfile using Microsoft's official Playwright images and a docker-compose.yml with volume mounts for output data and debug screenshots. This ensures consistent browser environments across machines.
Use agent-browser to navigate the target site, capture accessibility tree snapshots, and automatically discover selectors. This is the preferred mode when the agent has access to the agent-browser CLI.
Prerequisites: If agent-browser is not already installed, add it as a skill first:
npx skills add vercel-labs/agent-browser
Workflow:
# 1. Open the target page
agent-browser open https://example.com/products
# 2. Capture interactive snapshot with element references
agent-browser snapshot -i --json > snapshot.json
# 3. Capture scoped sections for focused analysis
agent-browser snapshot -i --json -s "main" > main-content.json
agent-browser snapshot -i --json -s "nav" > navigation.json
# 4. Test dynamic behavior (pagination, load-more)
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i --json > after-click.json
# 5. Close when done
agent-browser close
What the agent does with snapshots: it identifies interactive elements by their references (@e1, @e2, etc.) and their roles, then uses them to discover stable selectors for the page objects. See references/agent-browser-workflow.md for the complete workflow reference.
The user describes the target site's page structure and the agent maps it to page objects. The agent asks structured questions:
The agent then maps the answers onto page objects and components, consulting data/site-archetypes.json for common site structure archetypes.

Generate a complete runnable project in one operation using the scaffolder script:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "my-scraper" \
--url "https://example.com" \
--pages "ProductListing,ProductDetail" \
--fields "title,price,image_url,description"
This produces a project with all source files, configuration, Docker setup, and an entry point ready to run. See the Scripts Reference section for full options.
| Category | Approach | Details |
|---|---|---|
| Framework | Playwright | playwright package, not @playwright/test |
| Language | TypeScript | Strict mode, ES2022 target |
| Pattern | PageObject | One class per page, compose components |
| Selectors | Resilient | data-testid > id > role > CSS class > text |
| Wait strategy | Auto-wait | Playwright built-in, plus networkidle for navigation |
| Validation | Zod | Schema per page object's output type |
| Output | JSON + CSV | Configurable via storage utility |
| Docker | Official image | mcr.microsoft.com/playwright:v1.48.0-jammy |
| Retry | Exponential backoff | 3 attempts default, configurable |
| Screenshots | On error | Saved to screenshots/ for debugging |
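The retry decision above can be sketched as a small helper; the names and defaults here are illustrative, not the generated template's exact API:

```typescript
interface RetryOptions {
  attempts?: number;    // total attempts, default 3
  baseDelayMs?: number; // first backoff delay, doubled on each retry
}

// Run `fn`, retrying on failure with exponential backoff (delay, 2*delay, 4*delay, ...).
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const attempts = opts.attempts ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 1000;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping each page navigation or scrape call in such a helper keeps transient network failures from killing an entire run.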
Follow this sequence when generating a scraper:
Ask the user for:
Use Mode 1 (agent-browser) or Mode 2 (manual description) to understand:
Create a plan listing:
Show the user the page object map before generating code. Include class names, field names, and the execution flow. Wait for confirmation.
Use the templates in assets/templates/ as the foundation:
- base-page.ts.md — BasePage abstract class
- page-object.ts.md — Site-specific page object
- component.ts.md — Reusable components
- scraper-runner.ts.md — Orchestrator
- data-schema.ts.md — Zod validation schemas

Provide the complete project with:
- Configuration files from assets/configs/

Abstract class providing navigate(), waitForPageLoad(), screenshot(), and getText() helpers. All page objects extend this.
import type { Page } from 'playwright';

export abstract class BasePage {
constructor(protected readonly page: Page) {}
async navigate(url: string): Promise<void> { /* ... */ }
async screenshot(name: string): Promise<void> { /* ... */ }
}
See: assets/templates/base-page.ts.md
Site-specific class with locators as readonly properties, scrape methods returning typed data, and navigation methods for multi-page flows.
import type { Locator } from 'playwright';
import { BasePage } from './BasePage'; // path depends on the generated project layout

export class ProductListingPage extends BasePage {
readonly productCards: Locator;
readonly nextButton: Locator;
async scrapeProducts(): Promise<Product[]> { /* ... */ }
async goToNextPage(): Promise<boolean> { /* ... */ }
}
See: assets/templates/page-object.ts.md
Reusable UI pattern (Pagination, DataTable) that receives a parent locator scope and provides extraction methods.
import type { Locator, Page } from 'playwright';

export class Pagination {
constructor(private page: Page, private scope: Locator) {}
async hasNextPage(): Promise<boolean> { /* ... */ }
async goToNext(): Promise<void> { /* ... */ }
}
See: assets/templates/component.ts.md
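A component like this is typically driven by a loop in the page object or runner. A generic sketch against an assumed interface (the real templates' method names may differ):

```typescript
// Minimal shape a paginated page object is assumed to expose.
interface PaginatedSource<T> {
  scrapePage(): Promise<T[]>;
  hasNextPage(): Promise<boolean>;
  goToNextPage(): Promise<void>;
}

// Collect records across pages, with a safety cap to avoid infinite loops
// when a site's "next" control never disables.
async function scrapeAllPages<T>(source: PaginatedSource<T>, maxPages = 50): Promise<T[]> {
  const all: T[] = [];
  for (let i = 0; i < maxPages; i++) {
    all.push(...await source.scrapePage());
    if (!(await source.hasNextPage())) break;
    await source.goToNextPage();
  }
  return all;
}
```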
Orchestrator that launches the browser, creates page objects, iterates through pages, collects data, validates with schemas, and writes output.
import { chromium } from 'playwright';

export class SiteScraper {
async run(): Promise<void> {
const browser = await chromium.launch();
const page = await browser.newPage();
// navigate, scrape, validate, write
}
}
See: assets/templates/scraper-runner.ts.md
Zod schemas that validate scraped records, catching selector drift and malformed data at extraction time.
import { z } from 'zod';

export const ProductSchema = z.object({
title: z.string().min(1),
price: z.number().positive(),
});
See: assets/templates/data-schema.ts.md
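At extraction time, validation typically filters and reports bad records rather than throwing on the first one. Zod's `safeParse` returns `{ success: true, data }` or `{ success: false, error }`; the sketch below codes against that shape so it stays self-contained — in a generated project you would pass `ProductSchema` directly:

```typescript
// Structural match for zod's safeParse result.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: unknown };

interface SafeParser<T> {
  safeParse(input: unknown): ParseResult<T>;
}

// Keep valid records and count failures, so selector drift is visible, not silent.
function validateRecords<T>(raw: unknown[], schema: SafeParser<T>): { valid: T[]; failed: number } {
  const valid: T[] = [];
  let failed = 0;
  for (const record of raw) {
    const result = schema.safeParse(record);
    if (result.success) valid.push(result.data);
    else failed++; // in a real runner: log result.error and save a debug screenshot
  }
  return { valid, failed };
}
```

A sudden spike in `failed` is usually the first symptom that the site's markup changed under your selectors.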
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Monolith Scraper | All scraping logic in one file | Split into PageObject classes per page |
| Sleep Waiter | Using setTimeout/fixed delays | Use Playwright auto-wait and networkidle |
| Unvalidated Pipeline | No schema validation on output | Add Zod schemas for every data type |
| Selector Lottery | Fragile positional selectors | Use resilient selector hierarchy |
| Silent Failure | Swallowing errors without logging | Log failures and save debug screenshots |
| Unthrottled Crawler | No delay between requests | Add configurable request delays |
| Hardcoded Config | URLs and selectors in code | Use environment variables and config files |
| No Retry Logic | Single attempt per request | Implement exponential backoff |
See references/anti-patterns.md for the extended catalog with examples and fixes.
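The "Hardcoded Config" and "Unthrottled Crawler" fixes usually meet in a small config loader; a sketch with illustrative variable names:

```typescript
interface ScraperConfig {
  baseUrl: string;
  requestDelayMs: number;
  maxRetries: number;
}

// Read configuration from environment variables, with explicit defaults,
// so URLs and throttling live outside the code.
function loadConfig(env: Record<string, string | undefined> = process.env): ScraperConfig {
  return {
    baseUrl: env.BASE_URL ?? 'https://example.com',
    requestDelayMs: Number(env.REQUEST_DELAY_MS ?? 1000),
    maxRetries: Number(env.MAX_RETRIES ?? 3),
  };
}
```

The same variables can then be set in docker-compose.yml, keeping local runs and container runs in sync.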
Generate a complete scraper project:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts [options]
Options:
--name <name> Project name (required)
--path <path> Target directory (default: ./)
--url <url> Target site base URL
--pages <pages> Comma-separated page names (e.g., ProductListing,ProductDetail)
--fields <fields> Comma-separated data fields (e.g., title,price,rating)
--no-docker Skip Docker setup
--no-validation Skip Zod validation setup
--json Output as JSON
-h, --help Show help
Examples:
# Scaffold a product scraper
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "shop-scraper" --url "https://shop.example.com" \
--pages "ProductListing,ProductDetail" --fields "title,price,image_url"
# Minimal scraper without Docker
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "blog-scraper" --no-docker
Generate a single PageObject class for an existing project:
deno run --allow-read --allow-write scripts/generate-page-object.ts [options]
Options:
--name <name> Class name (required)
--url <url> Page URL (for documentation comment)
--fields <fields> Comma-separated data fields
--selectors <json> JSON map of field to selector
--with-pagination Include pagination methods
--output <path> Output file path (default: stdout)
--json Output as JSON
-h, --help Show help
Examples:
# Generate a page object with known selectors
deno run --allow-read --allow-write scripts/generate-page-object.ts \
--name "ProductListing" --url "https://shop.example.com/products" \
--fields "title,price,rating" \
--selectors '{"title":".product-title","price":".product-price","rating":".star-rating"}' \
--with-pagination --output src/pages/ProductListingPage.ts
# Quick generation to stdout
deno run --allow-read scripts/generate-page-object.ts \
--name "SearchResults" --fields "title,url,snippet"
| Template | Purpose |
|---|---|
base-page.ts.md | Abstract BasePage with navigation, screenshots, text helpers |
page-object.ts.md | Site-specific page object with locators and scrape methods |
component.ts.md | Reusable components: Pagination, DataTable |
scraper-runner.ts.md | Orchestrator: browser launch, iteration, collection, output |
data-schema.ts.md | Zod schemas for scraped data validation |
| Config | Purpose |
|---|---|
dockerfile.md | Multi-stage Dockerfile using official Playwright image |
docker-compose.yml.md | Service with data/screenshots volume mounts |
tsconfig.json.md | Strict TypeScript with ES2022 target |
package.json.md | playwright, zod, tsx dependencies |
playwright.config.ts.md | Scraper-focused Playwright configuration |
| Reference | Purpose |
|---|---|
pageobject-pattern.md | PageObject pattern adapted for scraping |
playwright-selectors.md | Selector strategies and resilience hierarchy |
docker-setup.md | Docker configuration and deployment |
agent-browser-workflow.md | Agent-browser analysis workflow |
anti-patterns.md | Extended anti-pattern catalog |
| Example | Purpose |
|---|---|
ecommerce-scraper.md | Complete multi-page product scraper walkthrough |
multi-page-pagination.md | Pagination handling strategies |
| File | Purpose |
|---|---|
selector-patterns.json | Common selectors organized by UI element type |
site-archetypes.json | Website structure archetypes with typical pages and fields |
User: "I need a scraper for an online bookstore. I want to get book titles, authors, prices, and ratings from the catalog pages."
Agent workflow:
- Consults site-archetypes.json — matches the ecommerce archetype
- Plans page objects:
  - BookListingPage — catalog with pagination
  - BookDetailPage — individual book page (if detail scraping needed)
  - Pagination component — shared pagination handler
- Proposes selectors:
  - title → [itemprop="name"] or .book-title
  - author → [itemprop="author"] or .book-author
  - price → [itemprop="price"] or .price
  - rating → .star-rating or [data-rating]
- Validates output against a Book type

This skill connects to:
This skill does NOT: