Guide AI agents to generate complete PageObject pattern web scraper projects using Playwright and TypeScript with Docker deployment. Supports agent-browser site analysis for automated selector discovery. Keywords: scraper, playwright, pageobject, web scraping, docker, typescript, data extraction, automation.
npx claudepluginhub joshuarweaver/cascade-content-creation-misc-1 --plugin jwynia-agent-skills-1

This skill uses the workspace's default tool permissions.
Generate complete, runnable web scraper projects using the PageObject pattern with Playwright and TypeScript. This skill produces site-specific scrapers with typed data extraction, Docker deployment, and optional agent-browser integration for automated site analysis.
Bundled resources:

- assets/configs/ — docker-compose.yml.md, dockerfile.md, package.json.md, playwright.config.ts.md, tsconfig.json.md
- assets/examples/ — ecommerce-scraper.md, multi-page-pagination.md
- assets/templates/ — base-page.ts.md, component.ts.md, data-schema.ts.md, page-object.ts.md, scraper-runner.ts.md
- data/ — selector-patterns.json, site-archetypes.json
- references/ — agent-browser-workflow.md, anti-patterns.md, docker-setup.md, pageobject-pattern.md, playwright-selectors.md
- scripts/ — generate-page-object.ts
Use this skill when:
Do NOT use this skill when:
Each page on the target site maps to one PageObject class. Locators are defined in the constructor, and scraping logic lives in methods. Page objects never contain assertions or business logic — they extract and return data.
Prefer selectors in this order: data-testid > id > semantic HTML (role, aria-label) > structured CSS classes > text content. Avoid positional selectors (nth-child) and layout-dependent paths. See references/playwright-selectors.md for the full hierarchy.
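As an illustration (not part of the generated templates), the hierarchy can be encoded as a small helper that picks the strongest available selector for a discovered element; the attribute names here are assumptions:

```typescript
// Candidate attributes discovered for an element (all optional).
interface ElementInfo {
  testId?: string;
  id?: string;
  role?: string;
  className?: string;
  text?: string;
}

// Return the most resilient Playwright-style selector available, following:
// data-testid > id > role > CSS class > text content.
function preferredSelector(el: ElementInfo): string | undefined {
  if (el.testId) return `[data-testid="${el.testId}"]`;
  if (el.id) return `#${el.id}`;
  if (el.role) return `role=${el.role}`;
  if (el.className) return `.${el.className.split(/\s+/)[0]}`;
  if (el.text) return `text=${el.text}`;
  return undefined;
}
```

Note that positional selectors never appear in the fallback chain: an element with no stable hook returns `undefined` rather than an nth-child path.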
Reusable UI patterns (pagination, data tables, search bars) are modeled as component classes that page objects compose via properties. Only BasePage uses inheritance — everything else composes.
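A minimal sketch of this composition rule, using a local stand-in type for Playwright's `Locator` so the example is self-contained (the class and selector names are illustrative, not the templates' exact API):

```typescript
// Stand-in for Playwright's Locator: can narrow scope and count matches.
interface Scope {
  locator(selector: string): Scope;
  count(): Promise<number>;
}

// Reusable component: receives a parent scope, never creates its own.
class Pagination {
  constructor(private readonly scope: Scope) {}
  async hasNextPage(): Promise<boolean> {
    return (await this.scope.locator('a.next:not(.disabled)').count()) > 0;
  }
}

// Page object composes the component as a property (no inheritance).
class ProductListingPage {
  readonly pagination: Pagination;
  constructor(root: Scope) {
    this.pagination = new Pagination(root.locator('nav.pagination'));
  }
}
```

Because `Pagination` only sees the scope it is handed, the same class can serve any page object that contains a pagination control.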
All scraped data flows through Zod schemas for validation. This catches selector drift (when a site changes its markup) at extraction time rather than downstream. See assets/templates/data-schema.ts.md.
Generated projects include a Dockerfile using Microsoft's official Playwright images and a docker-compose.yml with volume mounts for output data and debug screenshots. This ensures consistent browser environments across machines.
Use agent-browser to navigate the target site, capture accessibility tree snapshots, and automatically discover selectors. This is the preferred mode when the agent has access to the agent-browser CLI.
Prerequisites: If agent-browser is not already installed, add it as a skill first:
npx skills add vercel-labs/agent-browser
Workflow:
# 1. Open the target page
agent-browser open https://example.com/products
# 2. Capture interactive snapshot with element references
agent-browser snapshot -i --json > snapshot.json
# 3. Capture scoped sections for focused analysis
agent-browser snapshot -i --json -s "main" > main-content.json
agent-browser snapshot -i --json -s "nav" > navigation.json
# 4. Test dynamic behavior (pagination, load-more)
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i --json > after-click.json
# 5. Close when done
agent-browser close
What the agent does with snapshots: it identifies interactive elements by their references (@e1, @e2, etc.) and their roles, then uses them to discover stable selectors for the page objects. See references/agent-browser-workflow.md for the complete workflow reference.
The user describes the target site's page structure and the agent maps it to page objects. The agent asks structured questions:
The agent then maps the answers onto page objects and components, consulting data/site-archetypes.json for common site structure archetypes.

Generate a complete runnable project in one operation using the scaffolder script:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "my-scraper" \
--url "https://example.com" \
--pages "ProductListing,ProductDetail" \
--fields "title,price,image_url,description"
This produces a project with all source files, configuration, Docker setup, and an entry point ready to run. See the Scripts Reference section for full options.
| Category | Approach | Details |
|---|---|---|
| Framework | Playwright | playwright package, not @playwright/test |
| Language | TypeScript | Strict mode, ES2022 target |
| Pattern | PageObject | One class per page, compose components |
| Selectors | Resilient | data-testid > id > role > CSS class > text |
| Wait strategy | Auto-wait | Playwright built-in, plus networkidle for navigation |
| Validation | Zod | Schema per page object's output type |
| Output | JSON + CSV | Configurable via storage utility |
| Docker | Official image | mcr.microsoft.com/playwright:v1.48.0-jammy |
| Retry | Exponential backoff | 3 attempts default, configurable |
| Screenshots | On error | Saved to screenshots/ for debugging |
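The retry decision above can be sketched as a small helper; the names and defaults here are illustrative, not the generated template's exact API:

```typescript
interface RetryOptions {
  attempts?: number;    // total attempts, default 3
  baseDelayMs?: number; // first backoff delay, doubled on each retry
}

// Run `fn`, retrying on failure with exponential backoff (delay, 2*delay, 4*delay, ...).
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const attempts = opts.attempts ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 1000;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping each page navigation or scrape call in such a helper keeps transient network failures from killing an entire run.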
Follow this sequence when generating a scraper:
Ask the user for:
Use Mode 1 (agent-browser) or Mode 2 (manual description) to understand:
Create a plan listing:
Show the user the page object map before generating code. Include class names, field names, and the execution flow. Wait for confirmation.
Use the templates in assets/templates/ as the foundation:
- base-page.ts.md — BasePage abstract class
- page-object.ts.md — Site-specific page object
- component.ts.md — Reusable components
- scraper-runner.ts.md — Orchestrator
- data-schema.ts.md — Zod validation schemas

Provide the complete project with:
- Configuration files from assets/configs/

Abstract class providing navigate(), waitForPageLoad(), screenshot(), and getText() helpers. All page objects extend this.
import type { Page } from 'playwright';

export abstract class BasePage {
constructor(protected readonly page: Page) {}
async navigate(url: string): Promise<void> { /* ... */ }
async screenshot(name: string): Promise<void> { /* ... */ }
}
See: assets/templates/base-page.ts.md
Site-specific class with locators as readonly properties, scrape methods returning typed data, and navigation methods for multi-page flows.
import type { Locator } from 'playwright';
import { BasePage } from './BasePage'; // path depends on the generated project layout

export class ProductListingPage extends BasePage {
readonly productCards: Locator;
readonly nextButton: Locator;
async scrapeProducts(): Promise<Product[]> { /* ... */ }
async goToNextPage(): Promise<boolean> { /* ... */ }
}
See: assets/templates/page-object.ts.md
Reusable UI pattern (Pagination, DataTable) that receives a parent locator scope and provides extraction methods.
import type { Locator, Page } from 'playwright';

export class Pagination {
constructor(private page: Page, private scope: Locator) {}
async hasNextPage(): Promise<boolean> { /* ... */ }
async goToNext(): Promise<void> { /* ... */ }
}
See: assets/templates/component.ts.md
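A component like this is typically driven by a loop in the page object or runner. A generic sketch against an assumed interface (the real templates' method names may differ):

```typescript
// Minimal shape a paginated page object is assumed to expose.
interface PaginatedSource<T> {
  scrapePage(): Promise<T[]>;
  hasNextPage(): Promise<boolean>;
  goToNextPage(): Promise<void>;
}

// Collect records across pages, with a safety cap to avoid infinite loops
// when a site's "next" control never disables.
async function scrapeAllPages<T>(source: PaginatedSource<T>, maxPages = 50): Promise<T[]> {
  const all: T[] = [];
  for (let i = 0; i < maxPages; i++) {
    all.push(...await source.scrapePage());
    if (!(await source.hasNextPage())) break;
    await source.goToNextPage();
  }
  return all;
}
```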
Orchestrator that launches the browser, creates page objects, iterates through pages, collects data, validates with schemas, and writes output.
import { chromium } from 'playwright';

export class SiteScraper {
async run(): Promise<void> {
const browser = await chromium.launch();
const page = await browser.newPage();
// navigate, scrape, validate, write
}
}
See: assets/templates/scraper-runner.ts.md
Zod schemas that validate scraped records, catching selector drift and malformed data at extraction time.
import { z } from 'zod';

export const ProductSchema = z.object({
title: z.string().min(1),
price: z.number().positive(),
});
See: assets/templates/data-schema.ts.md
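At extraction time, validation typically filters and reports bad records rather than throwing on the first one. Zod's `safeParse` returns `{ success: true, data }` or `{ success: false, error }`; the sketch below codes against that shape so it stays self-contained — in a generated project you would pass `ProductSchema` directly:

```typescript
// Structural match for zod's safeParse result.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: unknown };

interface SafeParser<T> {
  safeParse(input: unknown): ParseResult<T>;
}

// Keep valid records and count failures, so selector drift is visible, not silent.
function validateRecords<T>(raw: unknown[], schema: SafeParser<T>): { valid: T[]; failed: number } {
  const valid: T[] = [];
  let failed = 0;
  for (const record of raw) {
    const result = schema.safeParse(record);
    if (result.success) valid.push(result.data);
    else failed++; // in a real runner: log result.error and save a debug screenshot
  }
  return { valid, failed };
}
```

A sudden spike in `failed` is usually the first symptom that the site's markup changed under your selectors.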
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Monolith Scraper | All scraping logic in one file | Split into PageObject classes per page |
| Sleep Waiter | Using setTimeout/fixed delays | Use Playwright auto-wait and networkidle |
| Unvalidated Pipeline | No schema validation on output | Add Zod schemas for every data type |
| Selector Lottery | Fragile positional selectors | Use resilient selector hierarchy |
| Silent Failure | Swallowing errors without logging | Log failures and save debug screenshots |
| Unthrottled Crawler | No delay between requests | Add configurable request delays |
| Hardcoded Config | URLs and selectors in code | Use environment variables and config files |
| No Retry Logic | Single attempt per request | Implement exponential backoff |
See references/anti-patterns.md for the extended catalog with examples and fixes.
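The "Hardcoded Config" and "Unthrottled Crawler" fixes usually meet in a small config loader; a sketch with illustrative variable names:

```typescript
interface ScraperConfig {
  baseUrl: string;
  requestDelayMs: number;
  maxRetries: number;
}

// Read configuration from environment variables, with explicit defaults,
// so URLs and throttling live outside the code.
function loadConfig(env: Record<string, string | undefined> = process.env): ScraperConfig {
  return {
    baseUrl: env.BASE_URL ?? 'https://example.com',
    requestDelayMs: Number(env.REQUEST_DELAY_MS ?? 1000),
    maxRetries: Number(env.MAX_RETRIES ?? 3),
  };
}
```

The same variables can then be set in docker-compose.yml, keeping local runs and container runs in sync.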
Generate a complete scraper project:
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts [options]
Options:
--name <name> Project name (required)
--path <path> Target directory (default: ./)
--url <url> Target site base URL
--pages <pages> Comma-separated page names (e.g., ProductListing,ProductDetail)
--fields <fields> Comma-separated data fields (e.g., title,price,rating)
--no-docker Skip Docker setup
--no-validation Skip Zod validation setup
--json Output as JSON
-h, --help Show help
Examples:
# Scaffold a product scraper
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "shop-scraper" --url "https://shop.example.com" \
--pages "ProductListing,ProductDetail" --fields "title,price,image_url"
# Minimal scraper without Docker
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
--name "blog-scraper" --no-docker
Generate a single PageObject class for an existing project:
deno run --allow-read --allow-write scripts/generate-page-object.ts [options]
Options:
--name <name> Class name (required)
--url <url> Page URL (for documentation comment)
--fields <fields> Comma-separated data fields
--selectors <json> JSON map of field to selector
--with-pagination Include pagination methods
--output <path> Output file path (default: stdout)
--json Output as JSON
-h, --help Show help
Examples:
# Generate a page object with known selectors
deno run --allow-read --allow-write scripts/generate-page-object.ts \
--name "ProductListing" --url "https://shop.example.com/products" \
--fields "title,price,rating" \
--selectors '{"title":".product-title","price":".product-price","rating":".star-rating"}' \
--with-pagination --output src/pages/ProductListingPage.ts
# Quick generation to stdout
deno run --allow-read scripts/generate-page-object.ts \
--name "SearchResults" --fields "title,url,snippet"
| Template | Purpose |
|---|---|
base-page.ts.md | Abstract BasePage with navigation, screenshots, text helpers |
page-object.ts.md | Site-specific page object with locators and scrape methods |
component.ts.md | Reusable components: Pagination, DataTable |
scraper-runner.ts.md | Orchestrator: browser launch, iteration, collection, output |
data-schema.ts.md | Zod schemas for scraped data validation |
| Config | Purpose |
|---|---|
dockerfile.md | Multi-stage Dockerfile using official Playwright image |
docker-compose.yml.md | Service with data/screenshots volume mounts |
tsconfig.json.md | Strict TypeScript with ES2022 target |
package.json.md | playwright, zod, tsx dependencies |
playwright.config.ts.md | Scraper-focused Playwright configuration |
| Reference | Purpose |
|---|---|
pageobject-pattern.md | PageObject pattern adapted for scraping |
playwright-selectors.md | Selector strategies and resilience hierarchy |
docker-setup.md | Docker configuration and deployment |
agent-browser-workflow.md | Agent-browser analysis workflow |
anti-patterns.md | Extended anti-pattern catalog |
| Example | Purpose |
|---|---|
ecommerce-scraper.md | Complete multi-page product scraper walkthrough |
multi-page-pagination.md | Pagination handling strategies |
| File | Purpose |
|---|---|
selector-patterns.json | Common selectors organized by UI element type |
site-archetypes.json | Website structure archetypes with typical pages and fields |
User: "I need a scraper for an online bookstore. I want to get book titles, authors, prices, and ratings from the catalog pages."
Agent workflow:
- Consults site-archetypes.json — matches the ecommerce archetype
- Plans page objects:
  - BookListingPage — catalog with pagination
  - BookDetailPage — individual book page (if detail scraping needed)
  - Pagination component — shared pagination handler
- Proposes selectors:
  - title → [itemprop="name"] or .book-title
  - author → [itemprop="author"] or .book-author
  - price → [itemprop="price"] or .price
  - rating → .star-rating or [data-rating]
- Validates output against a Book type

This skill connects to:
This skill does NOT: