Skill

scrape-spec

Downloads diverse pages, compares HTML variants, extracts values, and optionally presents the results for browser review, in order to validate extraction specs.

developer-tools

automation

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/zyte-web-data:scrape-spec [site-path]

User invocable

Model invocable

Inline context

Default effort

Argument hint: [site-path]

Tool Access

This skill is limited to the following tools:

Agent, Skill, AskUserQuestion, Bash, Read, Write

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are expanding and validating an extraction spec that was drafted by `/scrape-define`. Download diverse detail and listing pages, compare HTML variants, extract values, and optionally present for browser-based review.

SKILL.md

206 lines · ~2.5k tokens

Stats

Language: Python

Stars: 18

Forks: 3

Maintenance: Excellent

Last Commit: Jun 24, 2026

Input

The raw argument string is $ARGUMENTS — a single value, used as-is:

site_path: path to the site spec folder, e.g. .scrape/books-toscrape

Step 1: Read existing spec

Read {site_path}/spec.json to get url and data_types.

For the primary data type (first in data_types that isn't "navigation"), read:

{site_path}/{data_type}/spec.json → schema, html_variant

Derive site_name from site_path (last component, e.g. books-toscrape).
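The lookup in this step can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function name `load_primary_spec` is hypothetical, and it assumes that `url`, `data_types`, `schema`, and `html_variant` are top-level keys of their respective `spec.json` files.

```python
import json
from pathlib import Path


def load_primary_spec(site_path: str) -> dict:
    """Read the site spec and the spec of the primary data type.

    Hypothetical helper; the JSON layout (top-level keys) is an assumption.
    """
    root = Path(site_path)
    site_spec = json.loads((root / "spec.json").read_text(encoding="utf-8"))

    # Primary data type: the first entry that is not "navigation".
    data_type = next(
        (t for t in site_spec["data_types"] if t != "navigation"), None
    )
    if data_type is None:
        raise ValueError("spec.json lists no data type other than 'navigation'")

    type_spec = json.loads(
        (root / data_type / "spec.json").read_text(encoding="utf-8")
    )
    return {
        "url": site_spec["url"],
        "data_type": data_type,
        "schema": type_spec["schema"],
        "html_variant": type_spec["html_variant"],
        # Last path component, e.g. ".scrape/books-toscrape" -> "books-toscrape".
        "site_name": root.name,
    }
```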

Step 2: Explore the site

Stage 1 (/scrape-define) stores only the schema and values, not the pages themselves. Stage 2 (this skill) therefore downloads all pages fresh by running /scrape-explore-site in a subagent.

Agent(description="explore-site", prompt="Run /scrape-explore-site {url} .scrape/.work/{site_name}/explore 3 2")

This downloads the start page + 3 detail pages + 2 list pages, classifies links, and generates navigation values. All output goes to .scrape/.work/{site_name}/explore/.

If the subagent reports that the site is blocked, invoke /scrape-zyte-login. After it returns, re-run the scrape-explore-site subagent above. Only proceed once exploration succeeds.

Then distribute pages to the right data-type subfolders:

mkdir -p {site_path}/{data_type}/pages {site_path}/navigation/pages {site_path}/navigation/values

# Detail pages → data type
for d in .scrape/.work/{site_name}/explore/pages/detail-*; do
  [ -d "$d" ] && cp -r "$d" {site_path}/{data_type}/pages/
done

# Start + list pages → navigation
for d in .scrape/.work/{site_name}/explore/pages/start-* .scrape/.work/{site_name}/explore/pages/list-*; do
  [ -d "$d" ] && cp -r "$d" {site_path}/navigation/pages/
done

# Navigation values (generated by explore-site)
cp .scrape/.work/{site_name}/explore/values/*.json {site_path}/navigation/values/ 2>/dev/null || true

Step 3: Determine navigation HTML variant

Before analyzing detail pages, independently determine the HTML variant navigation pages need. Run extract_links.py on all navigation pages using both variants:

mkdir -p .scrape/.work/{site_name}/analyze-nav

uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_links.py \
  {site_path}/navigation/pages/*/raw.html \
  --group --base-url-from-meta \
  > .scrape/.work/{site_name}/analyze-nav/nav.raw.json 2>/dev/null || true

uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_links.py \
  {site_path}/navigation/pages/*/rendered.html \
  --group --base-url-from-meta \
  > .scrape/.work/{site_name}/analyze-nav/nav.rendered.json 2>/dev/null || true

Read both output files and analyze the link groups. Determine which HTML variant provides better navigation coverage — consider the group structure, whether meaningful navigation links appear in each variant, and whether one variant reveals links the other misses. Prefer raw unless there is clear evidence that rendered provides materially better navigation coverage.
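One way to make the "prefer raw" rule concrete is sketched below. The output format of extract_links.py is not documented here, so this assumes each file parses into a mapping of group label to a list of link URLs. The function name `choose_nav_variant` and the `min_extra_links` threshold are hypothetical stand-ins for the judgment call described above, not behavior of the skill.

```python
def choose_nav_variant(raw_groups: dict, rendered_groups: dict,
                       min_extra_links: int = 5) -> str:
    """Pick "raw" unless rendered clearly exposes more navigation links.

    Assumption: each argument maps a group label to a list of URLs.
    The threshold is an arbitrary illustration of "materially better".
    """
    raw_links = {url for urls in raw_groups.values() for url in urls}
    rendered_links = {url for urls in rendered_groups.values() for url in urls}

    # Raw found nothing at all but rendered did: rendered is the only option.
    if not raw_links and rendered_links:
        return "rendered"

    # Links that only the rendered variant reveals.
    only_rendered = rendered_links - raw_links
    return "rendered" if len(only_rendered) >= min_extra_links else "raw"
```

A count of extra links is only a first signal. Whether those links are meaningful navigation (pagination, categories) still needs the manual look at the group structure described above.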

Store the result as nav_html_variant for use in Step 6.

Step 4: Analyze detail pages (both variants)

Analyze all detail pages (including the one from Stage 1) with both HTML variants. Launch one Agent per (page x variant) combination — all in a single message for parallel execution.

Agent(description="analyze detail-1 raw", prompt="Run /scrape-analyze-page {site_path}/{data_type}/pages/detail-1/raw.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-1.raw.json")
Agent(description="analyze detail-1 rendered", prompt="Run /scrape-analyze-page {site_path}/{data_type}/pages/detail-1/rendered.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-1.rendered.json")
Agent(description="analyze detail-2 raw", prompt="Run /scrape-analyze-page {site_path}/{data_type}/pages/detail-2/raw.html using the schema in {site_path}/{data_type}/spec.json and save it into .scrape/.work/{site_name}/analyze-page/detail-2.raw.json")
... (all in one message)

Skip variants whose HTML files don't exist. The schema path ({site_path}/{data_type}/spec.json) gives analyze-page the approved field names, descriptions, and examples, so it extracts with the correct names and value formats.
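The (page × variant) fan-out, including the skip rule for missing files, could be enumerated as in the sketch below. The helper name `build_analysis_jobs` and the dictionary shape it returns are hypothetical; only the paths and prompt wording follow the Agent calls above.

```python
from pathlib import Path

VARIANTS = ("raw", "rendered")


def build_analysis_jobs(site_path: str, data_type: str, site_name: str) -> list:
    """Return one job per existing (detail page, HTML variant) pair."""
    pages_dir = Path(site_path) / data_type / "pages"
    schema_path = Path(site_path) / data_type / "spec.json"
    out_dir = Path(".scrape/.work") / site_name / "analyze-page"

    jobs = []
    for page_dir in sorted(pages_dir.glob("detail-*")):
        for variant in VARIANTS:
            html = page_dir / f"{variant}.html"
            if not html.is_file():
                continue  # skip variants whose HTML file doesn't exist
            out = out_dir / f"{page_dir.name}.{variant}.json"
            jobs.append({
                "description": f"analyze {page_dir.name} {variant}",
                "prompt": (
                    f"Run /scrape-analyze-page {html.as_posix()} "
                    f"using the schema in {schema_path.as_posix()} "
                    f"and save it into {out.as_posix()}"
                ),
            })
    return jobs
```

Each returned entry corresponds to one Agent call, and all of them would be issued in a single message so they run in parallel.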

Step 5: Choose HTML variant

Compare raw vs rendered results across all detail pages. Read all analysis files from .scrape/.work/{site_name}/analyze-page/.

For each page, compare {page_id}.raw.json and {page_id}.rendered.json:

Which variant found more of the schema fields?
Which fields are only in one variant?
Do values differ between variants for the same field?

Raw is preferred by default — it's faster, cheaper, and more reliable. Only use rendered if it finds schema fields that raw consistently misses, or if raw HTML is essentially empty (SPA site).
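The per-page comparison and the default-to-raw rule can be sketched as below. This assumes each analysis file parses into a flat mapping of field name to value, which is not confirmed by the source. Both function names are hypothetical, and "consistently misses" is interpreted here as "missing from raw on every page".

```python
def compare_variants(schema_fields, raw_values: dict, rendered_values: dict) -> dict:
    """Compare one page's raw and rendered extractions over the schema fields."""
    def found(values):
        # A field counts as found when it has a non-empty value.
        return {f for f in schema_fields if values.get(f) not in (None, "", [])}

    raw_found = found(raw_values)
    rendered_found = found(rendered_values)
    both = raw_found & rendered_found
    return {
        "both": sorted(both),
        "only_raw": sorted(raw_found - rendered_found),
        "only_rendered": sorted(rendered_found - raw_found),
        "differences": {
            f: (raw_values[f], rendered_values[f])
            for f in sorted(both)
            if raw_values[f] != rendered_values[f]
        },
    }


def recommend_variant(reports: list) -> str:
    """Prefer raw unless rendered finds a schema field raw misses on every page."""
    if not reports:
        return "raw"
    missed_everywhere = set(reports[0]["only_rendered"])
    for report in reports[1:]:
        missed_everywhere &= set(report["only_rendered"])
    return "rendered" if missed_everywhere else "raw"
```

Value differences are reported but do not change the recommendation on their own, since deciding which value is right (for example a list price versus a sale price) needs a human or a closer look at the page.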

Present the comparison in the terminal:

HTML variant comparison:
  Both variants found: name, price, brand, description, image_url
  Only in rendered: reviews_count (3/3 pages), sale_price (2/3 pages)
  Only in raw: (none)
  Value differences:
    price: raw="$29.99", rendered="$24.99" (detail-1) — rendered may show sale price

  Recommendation: raw (all schema fields found)
  Note: rendered also has reviews_count, sale_price — say "use rendered" if you need these.

If the comparison clearly favors one variant, use it. If it does not (for example, when raw contains most of the schema fields but not all of them), ask the user via AskUserQuestion which variant to use, and include the comparison details:

question: "Which HTML variant should I use for extraction?"
header: "HTML variant"
options: "Use raw" and "Use rendered", mark the recommended one as such.

When you ask, use the variant that the user selects.

If the variant changes from what Stage 1 used, update {site_path}/{data_type}/spec.json with the new html_variant.

Step 6: Extract values

Use extract_values.py to build values from analysis files, filtered by the schema:

uv run ${CLAUDE_SKILL_DIR}/../scrape-explore-site/scripts/extract_values.py \
  .scrape/.work/{site_name}/analyze-page/ \
  {site_path}/{data_type}/spec.json \
  --variant {html_variant} \
  -O {site_path}/{data_type}/values/

This overwrites any existing values files (including the one from Stage 1) with fresh extractions from all detail pages. No --renames needed — Stage 2's analyze-page receives the full schema (names + descriptions + examples), so it extracts with the correct field names directly.

Navigation

Write {site_path}/navigation/spec.json with the fixed navigation schema from ${CLAUDE_SKILL_DIR}/../scrape/references/extraction-spec.md, using the site URL and nav_html_variant (determined in Step 3).

Navigation values were already copied from explore-site output in Step 2.

Step 7: Optional browser review

Tell the user the extraction stats first ("Extracted values for {N} detail pages and {M} navigation pages."), then ask via AskUserQuestion:

question: "Open a browser review of the extracted values?"
header: "Browser review"
options:
- Skip browser review — "Continue without opening the browser."
- Open browser review — "Review the values in the browser."

If the user picks Skip browser review, go to Step 9.

If the user picks Open browser review, invoke /scrape-review-schema with the data type spec:

/scrape-review-schema {site_path}/{data_type} .scrape/.work/{site_name} {schema_json} {html_variant}

Report the review dir path to the user before opening.

Step 8: Apply feedback

If the user reviews and provides feedback:

If feedback starts with APPROVED: apply any included schema changes (drops, renames, description edits, kept fields → change source to "requested") and skip to step 9. The user has signed off.

If feedback does NOT start with APPROVED (user clicked "Request changes"):

- Apply schema changes (drops, renames, description edits) directly.
- For value corrections: check the other variant's analysis files first. If the other variant has the correct value, suggest switching variants:
```
You corrected price to "$24.99". The rendered variant already extracts this correctly (across 3/3 pages).
Switch to rendered? [Y/n]
```
  - If the user agrees, switch variant, update spec, and regenerate values.
  - If the other variant doesn't help: re-run analysis for corrected fields across ALL detail pages (using the chosen variant). Launch parallel Agents, one per page.
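The feedback branching above can be sketched as two small checks. Both helpers are hypothetical, and the sketch assumes the other variant's analysis results are available as a mapping of page id to a flat field-to-value mapping, which the source does not specify.

```python
def is_approved(feedback: str) -> bool:
    """True when the review feedback starts with APPROVED."""
    return feedback.lstrip().startswith("APPROVED")


def check_other_variant(field: str, page_id: str, corrected,
                        other_values_by_page: dict) -> dict:
    """Check whether the other variant already extracts the corrected value.

    Assumption: other_values_by_page maps page id -> {field: value}.
    """
    value_on_page = other_values_by_page.get(page_id, {}).get(field)
    pages_with_field = sum(
        1 for values in other_values_by_page.values()
        if values.get(field) not in (None, "")
    )
    return {
        # Does the other variant agree with the user's correction on that page?
        "matches": value_on_page == corrected,
        # Coverage figures for the "across N/M pages" part of the suggestion.
        "pages_with_field": pages_with_field,
        "total_pages": len(other_values_by_page),
    }
```

If `matches` is true, the switch suggestion applies. Otherwise the corrected fields are re-analyzed across all detail pages with the current variant.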

After re-analysis or variant switch, re-extract values (step 6) and offer review again. Pass a changes summary as the 5th argument to /scrape-review-schema:

/scrape-review-schema {site_path}/{data_type} .scrape/.work/{site_name} {schema_json} {html_variant} '["Re-analyzed price across all pages","Dropped field isbn"]'

Loop steps 7-8 until the user approves or skips review.

Step 9: Finalize

Update {site_path}/{data_type}/spec.json with any schema changes from the review.

Update {site_path}/spec.json — ensure data_types includes both the primary data type and "navigation".
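The data_types update is a small idempotent merge, sketched below. The helper name `ensure_data_types` is hypothetical. It returns a new dict rather than editing the loaded spec in place, which is a design choice of this sketch and not something the skill prescribes.

```python
def ensure_data_types(site_spec: dict, primary: str) -> dict:
    """Return a copy of the site spec whose data_types has both required entries."""
    data_types = list(site_spec.get("data_types", []))
    for required in (primary, "navigation"):
        if required not in data_types:
            data_types.append(required)  # keep existing order, append what is missing
    return {**site_spec, "data_types": data_types}
```

Running it twice gives the same result as running it once, so repeating the finalize step does not duplicate entries.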

Report:

Spec finalized at {site_path}/:
  {data_type}: {N} detail pages, {F} fields
  navigation: {M} pages

Ready for codegen: /scrape-codegen {site_path}/{data_type} ./{project_dir}
