From zyte-web-data
Downloads diverse pages, compares HTML variants, extracts values, and optionally presents for browser review to validate extraction specs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/zyte-web-data:scrape-spec [site-path][site-path]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are expanding and validating an extraction spec that was drafted by `/scrape-define`. Download diverse detail and listing pages, compare HTML variants, extract values, and optionally present for browser-based review.
You are expanding and validating an extraction spec that was drafted by /scrape-define. Download diverse detail and listing pages, compare HTML variants, extract values, and optionally present for browser-based review.
Read ${CLAUDE_SKILL_DIR}/../scrape/references/python-environments.md.
The input is a spec folder with at least one data type containing a schema and 1 page (from /scrape-define). The output is the same folder with more pages, validated values, and a navigation data type.
The raw argument string is $ARGUMENTS — a single value, used as-is:
.scrape/books-toscrapeRead {site_path}/spec.json to get url and data_types.
For the primary data type (first in data_types that isn't "navigation"), read:
{site_path}/{data_type}/spec.json → schema, html_variantDerive site_name from site_path (last component, e.g. books-toscrape).
Stage 1 doesn't store pages — only schema and values. Stage 2 downloads all pages fresh using /scrape-explore-site in a subagent.
Agent(description="explore-site", prompt="Run /scrape-explore-site {url} .scrape/.work/{site_name}/explore 3 2")
This downloads the start page + 3 detail pages + 2 list pages, classifies links, and generates navigation values. All output goes to .scrape/.work/{site_name}/explore/.
If the subagent reports that the site is blocked, invoke /scrape-zyte-login. After it returns, re-run the scrape-explore-site subagent above. Only proceed once exploration succeeds.
Then distribute pages to the right data-type subfolders:
mkdir -p {site_path}/{data_type}/pages {site_path}/navigation/pages {site_path}/navigation/values
# Detail pages → data type
for d in .scrape/.work/{site_name}/explore/pages/detail-*; do
[ -d "$d" ] && cp -r "$d" {site_path}/{data_type}/pages/
done
# Start + list pages → navigation
for d in .scrape/.work/{site_name}/explore/pages/start-* .scrape/.work/{site_name}/explore/pages/list-*; do
[ -d "$d" ] && cp -r "$d" {site_path}/navigation/pages/
done
# Navigation values (generated by explore-site)
cp .scrape/.work/{site_name}/explore/values/*.json {site_path}/navigation/values/ 2>/dev/null || true
Before analyzing detail pages, independently determine the HTML variant navigation pages need. Run extract_links.py on all navigation pages using both variants:
mkdir -p .scrape/.work/{site_name}/analyze-nav
uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_links.py \
{site_path}/navigation/pages/*/raw.html \
--group --base-url-from-meta \
> .scrape/.work/{site_name}/analyze-nav/nav.raw.json 2>/dev/null || true
uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_links.py \
{site_path}/navigation/pages/*/rendered.html \
--group --base-url-from-meta \
> .scrape/.work/{site_name}/analyze-nav/nav.rendered.json 2>/dev/null || true
Read both output files and analyze the link groups. Determine which HTML variant provides better navigation coverage — consider the group structure, whether meaningful navigation links appear in each variant, and whether one variant reveals links the other misses. Prefer raw unless there is clear evidence that rendered provides materially better navigation coverage.
Store the result as nav_html_variant for use in Step 6.
Analyze all detail pages (including the one from Stage 1) with both HTML variants. Launch one Agent per (page x variant) combination — all in a single message for parallel execution.
Agent(description="analyze detail-1 raw", prompt="/scrape-analyze-page Extract data from {site_path}/{data_type}/pages/detail-1/raw.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-1.raw.json ")
Agent(description="analyze detail-1 rendered", prompt="Run /scrape-analyze-page {site_path}/{data_type}/pages/detail-1/rendered.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-1.rendered.json")
Agent(description="analyze detail-2 raw", prompt="Run /scrape-analyze-page {site_path}/{data_type}/pages/detail-2/raw.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-2.raw.json")
... (all in one message)
Skip variants whose HTML files don't exist. The schema_path gives analyze-page the approved field names, descriptions, and examples — so it extracts with the correct names and value formats.
Compare raw vs rendered results across all detail pages. Read all analysis files from .scrape/.work/{site_name}/analyze-page/.
For each page, compare {page_id}.raw.json and {page_id}.rendered.json:
Raw is preferred by default — it's faster, cheaper, and more reliable. Only use rendered if it finds schema fields that raw consistently misses, or if raw HTML is essentially empty (SPA site).
Present the comparison in the terminal:
HTML variant comparison:
Both variants found: name, price, brand, description, image_url
Only in rendered: reviews_count (3/3 pages), sale_price (2/3 pages)
Only in raw: (none)
Value differences:
price: raw="$29.99", rendered="$24.99" (detail-1) — rendered may show sale price
Recommendation: raw (all schema fields found)
Note: rendered also has reviews_count, sale_price — say "use rendered" if you need these.
If it's clear which variant to use, use it. If it's not, for example when the raw view contains
most of the fields but not all of them, ask the user via AskUserQuestion which variant to use,
providing details:
In this case use the variant that the user selects.
If the variant changes from what Stage 1 used, update {site_path}/{data_type}/spec.json with the new html_variant.
Use extract_values.py to build values from analysis files, filtered by the schema:
uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_values.py \
.scrape/.work/{site_name}/analyze-page/ \
{site_path}/{data_type}/spec.json \
--variant {html_variant} \
-O {site_path}/{data_type}/values/
This overwrites any existing values files (including the one from Stage 1) with fresh extractions from all detail pages. No --renames needed — Stage 2's analyze-page receives the full schema (names + descriptions + examples), so it extracts with the correct field names directly.
Write {site_path}/navigation/spec.json with the fixed navigation schema from ${CLAUDE_SKILL_DIR}/../scrape/references/extraction-spec.md, using the site URL and nav_html_variant (determined in Step 3).
Navigation values were already copied from explore-site output in Step 2.
Tell the user the extraction stats first ("Extracted values for {N} detail pages and {M} navigation pages."), then ask via AskUserQuestion:
Skip browser review — "Continue without opening the browser."Open browser review — "Review the values in the browser."If the user picks Skip review, go to step 9.
If the user picks Open browser review, invoke /scrape-review-schema with the data type spec:
/scrape-review-schema {site_path}/{data_type} .scrape/.work/{site_name} {schema_json} {html_variant}
Report the review dir path to the user before opening.
If the user reviews and provides feedback:
If feedback starts with APPROVED: apply any included schema changes (drops, renames, description edits, kept fields → change source to "requested") and skip to step 9. The user has signed off.
If feedback does NOT start with APPROVED (user clicked "Request changes"):
You corrected price to "$24.99" — the rendered variant already extracts this correctly (across 3/3 pages).
Switch to rendered? [Y/n]
If the user agrees, switch variant, update spec, and regenerate values.After re-analysis or variant switch, re-extract values (step 6) and offer review again. Pass a changes summary as the 5th argument to /scrape-review-schema:
/scrape-review-schema {site_path}/{data_type} .scrape/.work/{site_name} {schema_json} {html_variant} '["Re-analyzed price across all pages","Dropped field isbn"]'
Loop steps 7-8 until the user approves or skips review.
Update {site_path}/{data_type}/spec.json with any schema changes from the review.
Update {site_path}/spec.json — ensure data_types includes both the primary data type and "navigation".
Report:
Spec finalized at {site_path}/:
{data_type}: {N} detail pages, {F} fields
navigation: {M} pages
Ready for codegen: /scrape-codegen {site_path}/{data_type} ./{project_dir}
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-dataDownloads a single detail page, discovers fields, and iterates on a schema definition with user approval. Useful for quickly defining extraction specs for a website.
Extracts structured data from websites like product listings, tables, search results, or profiles, generating an executable Playwright script and JSON/CSV output.
Builds production-ready web scrapers for any site using Bright Data infrastructure. Guides site analysis, API selection, selector extraction, pagination, and implementation.