From zyte-web-data
Generates a Scrapy spider that wires web-poet page objects for item extraction and navigation into a working crawler with pagination and subcategory support.
How this skill is triggered — by the user, by Claude, or both
Slash command
/zyte-web-data:scrape-create-spider [project-dir] [item-page] [nav-page]
This skill is limited to the following tools:
You are generating a Scrapy spider that wires together web-poet page objects (item extraction + navigation) into a working crawler.
Read python-environments.md and docs-access.md from ${CLAUDE_SKILL_DIR}/../scrape/references.
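The raw argument string is $ARGUMENTS. Split it into 3 whitespace-separated positional arguments:
- project-dir: the Scrapy project directory
- item-page: import path of the item extraction page object (e.g., books_project.pages.books_toscrape_com.ProductPage)
- nav-page: import path of the navigation page object (e.g., books_project.pages.books_toscrape_com.NavigationPage)
Plus, taken from the surrounding prompt text (not from the argument string):
- the start URLs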
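As a rough illustration of the split, assuming none of the values contains embedded whitespace, the three positional arguments could be separated like this. The `SpiderArgs` container and the `parse_arguments` helper are hypothetical names used only for this sketch; they are not part of the skill.

```python
from typing import NamedTuple


class SpiderArgs(NamedTuple):
    """Hypothetical container for the three positional arguments."""

    project_dir: str
    item_page: str
    nav_page: str


def parse_arguments(raw: str) -> SpiderArgs:
    """Split the raw argument string on whitespace into exactly 3 values."""
    parts = raw.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 arguments, got {len(parts)}: {parts!r}")
    return SpiderArgs(*parts)


args = parse_arguments(
    "./books_project "
    "books_project.pages.books_toscrape_com.ProductPage "
    "books_project.pages.books_toscrape_com.NavigationPage"
)
print(args.project_dir)
print(args.item_page)
print(args.nav_page)
```

Failing loudly on a wrong argument count is a design choice of the sketch: it surfaces a malformed invocation before any file is written.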
Detect the project name from {project_dir}.
Use the provided PO import paths to determine the module and class names for imports. Parse start URLs to derive the spider name from the domain.
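The derivation steps above can be sketched in plain Python. All helper names here are hypothetical. Reading the project name from the `[settings] default` entry of `scrapy.cfg` is an assumption about how detection could work, and the naming rule (non-alphanumeric characters become underscores, then CamelCase for the class) is inferred from the `books_toscrape_com` / `BooksToscrapeCom` examples rather than stated by the skill.

```python
import configparser
from pathlib import Path
from urllib.parse import urlparse


def detect_project_name(project_dir: str) -> str:
    """Assumption: take the package name from scrapy.cfg's '[settings] default'."""
    config = configparser.ConfigParser()
    config.read(Path(project_dir) / "scrapy.cfg")
    settings_module = config["settings"]["default"]  # e.g. "books_project.settings"
    return settings_module.split(".")[0]


def split_import_path(path: str) -> tuple[str, str]:
    """Split 'pkg.pages.module.ClassName' into ('pkg.pages.module', 'ClassName')."""
    module, _, class_name = path.rpartition(".")
    return module, class_name


def spider_names(start_url: str) -> tuple[str, str]:
    """Derive (spider_name, spider_class) from the start URL's host name."""
    host = urlparse(start_url).hostname or ""
    # Inferred rule: every non-alphanumeric character becomes an underscore.
    spider_name = "".join(c if c.isalnum() else "_" for c in host)
    spider_class = "".join(part.capitalize() for part in spider_name.split("_"))
    return spider_name, spider_class


print(split_import_path("books_project.pages.books_toscrape_com.ProductPage"))
print(spider_names("https://books.toscrape.com/"))
```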
Read references/scrapy-poet-reference.md for spider patterns.
Write a spider to {project_name}/spiders/{spider_name}.py.
The spider uses the navigation PO to discover links and the item extraction PO to extract data. Pattern:
import scrapy
from scrapy_poet import DummyResponse

from {project_name}.pages.{module} import {ItemPage}, {NavPage}


class {SpiderClass}(scrapy.Spider):
    name = "{spider_name}"
    start_urls = ["{start_url}"]

    async def parse(self, response: DummyResponse, nav: {NavPage}):
        """Parse list/category pages: extract navigation links."""
        nav_item = await nav.to_item()

        # Follow item links → item extraction PO
        for link in nav_item.items or []:
            yield scrapy.Request(link["url"], callback=self.parse_item)

        # Follow pagination
        if nav_item.next_page:
            yield scrapy.Request(nav_item.next_page, callback=self.parse)

        # Follow subcategories
        for link in nav_item.subcategories or []:
            yield scrapy.Request(link["url"], callback=self.parse)

    async def parse_item(self, response: DummyResponse, page: {ItemPage}):
        """Extract item data."""
        yield await page.to_item()
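The link-routing logic of `parse` can be examined in isolation. The following dependency-free sketch mirrors its three steps without Scrapy. `plan_requests` and the `SimpleNamespace` stand-in for the navigation item are hypothetical, and the sketch assumes, as the pattern does, that links are dicts with a `url` key and that any field may be `None`.

```python
from types import SimpleNamespace


def plan_requests(nav_item) -> list[tuple[str, str]]:
    """Return (url, callback_name) pairs in the order the spider would yield them."""
    planned = []
    # Item links go to the item extraction callback.
    for link in nav_item.items or []:
        planned.append((link["url"], "parse_item"))
    # The next page, if present, is parsed as another listing page.
    if nav_item.next_page:
        planned.append((nav_item.next_page, "parse"))
    # Subcategories are listing pages as well.
    for link in nav_item.subcategories or []:
        planned.append((link["url"], "parse"))
    return planned


nav_item = SimpleNamespace(
    items=[{"url": "https://books.toscrape.com/catalogue/a_1/index.html"}],
    next_page="https://books.toscrape.com/catalogue/page-2.html",
    subcategories=None,  # missing field: the `or []` guard skips it
)
for url, callback in plan_requests(nav_item):
    print(callback, url)
```

The `or []` guards are what keep a page without item links or subcategories from raising a `TypeError` when the field is `None`.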
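Key points:
- parse is the default callback for start_urls
- Annotate response: DummyResponse since we only need the PO, not the raw response
- Pagination and subcategory links are sent back to parse
- The spider name is derived from the domain (e.g., books_toscrape_com)
- The spider class name is the CamelCase form of the spider name (e.g., BooksToscrapeCom)
If the site requires Zyte API (e.g., detected during spec building), add: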
custom_settings = {
    "ZYTE_API_TRANSPARENT_MODE": True,
}
Read the scrapy-zyte-api reference:
references/scrapy-zyte-api-reference.md
Run a test crawl that saves items to a file so you can inspect them:
cd {project_dir} && uv run scrapy crawl {spider_name} -s CLOSESPIDER_ITEMCOUNT=5 -o items.jsonl 2>&1
If the crawl fails (non-zero exit, exceptions in output):
- Check that SCRAPY_POET_DISCOVER includes the pages module
- Set ZYTE_API_LOG_REQUESTS=True if using Zyte API
If the crawl succeeds, read items.jsonl and check for obvious data-quality issues. If you find any, read the relevant page object, diagnose and fix the root cause, delete items.jsonl, and re-run. Repeat up to 2 more times (3 attempts in total). If items still look wrong after 3 attempts, stop and report what you found.
Only declare the spider complete once items look correct.
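What counts as an "obvious data-quality issue" is left to judgment. As a minimal sketch, the check below flags three cases: no items at all, a field that is empty in every item, and exact duplicate items. The `find_quality_issues` helper and the sample data are hypothetical, and these heuristics are assumptions, not the skill's own criteria.

```python
import json
import tempfile
from pathlib import Path


def find_quality_issues(path: str) -> list[str]:
    """Return descriptions of obvious problems found in a JSON Lines file."""
    text = Path(path).read_text(encoding="utf-8")
    items = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not items:
        return ["no items were scraped"]

    issues = []
    # A field that is empty in every item usually points to a broken selector.
    fields = sorted({key for item in items for key in item})
    for field in fields:
        if all(item.get(field) in (None, "", [], {}) for item in items):
            issues.append(f"field {field!r} is empty in every item")

    # Identical items suggest the same page was extracted more than once.
    seen = set()
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key in seen:
            issues.append("duplicate item found")
            break
        seen.add(key)
    return issues


# Hypothetical sample standing in for a real items.jsonl.
sample = Path(tempfile.mkdtemp()) / "items.jsonl"
sample.write_text(
    '{"name": "A", "price": null}\n{"name": "B", "price": null}\n',
    encoding="utf-8",
)
print(find_quality_issues(str(sample)))
```

An empty result from such a check does not prove the items are correct; it only rules out the crudest failures before a manual read of the file.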
Created spider at {project_name}/spiders/{spider_name}.py:
Start URL: {start_url}
Navigation: {NavPage} → follows items, pagination, subcategories
Extraction: {ItemPage} via parse_item
Run: cd {project_dir} && uv run scrapy crawl {spider_name}
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-data