Help us improve
Share bugs, ideas, or general feedback.
From zyte-web-data
Extracts all available structured fields (from HTML and JSON-LD) from a detail page, with optional schema-guided field mapping and strict extraction.
npx claudepluginhub zytedata/claude-skills-testHow this skill is triggered — by the user, by Claude, or both
Slash command
/zyte-web-data:scrape-analyze-page [page-html-path] [work-path] [schema-path] [strict][page-html-path] [work-path] [schema-path] [strict]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are extracting structured data from a detail page. Given saved HTML, identify all available fields and extract their values.
Mines projects and conversations into a searchable memory palace and retrieves past work via semantic search.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.
Share bugs, ideas, or general feedback.
You are extracting structured data from a detail page. Given saved HTML, identify all available fields and extract their values.
The raw argument string is $ARGUMENTS. Split it into up to 4 whitespace-separated positional arguments:
.scrape/spec/pages/detail-1/rendered.html.scrape/.work/specstrict. Only valid with schema_path. When set, extract only schema fields — no extras.The page directory (parent of the HTML file) also contains meta.json with the source URL and other metadata.
Derive the page ID from the directory name (e.g. detail-1 from .../detail-1/rendered.html).
Derive the html_variant from the filename (e.g. raw from raw.html, rendered from rendered.html).
Read meta.json from the same directory for the source URL.
Only process this one page. Do not read or compare with other pages' analysis files — the orchestrator handles cross-page concerns.
Clean the HTML and extract metadata, saving outputs to the work directory. Use {page_id}.{html_variant} as the filename base to avoid collisions:
mkdir -p {work_path}/analysis
uv run ${CLAUDE_SKILL_DIR}/scripts/clean_html.py PAGE.html -l1 -o {work_path}/analyze-page/{page_id}.{html_variant}.cleaned.html
uv run ${CLAUDE_SKILL_DIR}/scripts/extract_metadata.py PAGE.html -u PAGE_URL -o {work_path}/analyze-page/{page_id}.{html_variant}.metadata.json
Read only the cleaned HTML (never the original) and the metadata JSON. The metadata may be empty {} if the page has no structured data.
IMPORTANT: Never read the original HTML file (PAGE.html). Only use the cleaned HTML output from step 1 as your HTML source.
Use both the cleaned HTML and the metadata as data sources. Metadata (especially JSON-LD) often has cleaner, more complete values than what's visible in the HTML — e.g., structured price/priceCurrency vs rendered "$29.99", aggregateRating with review count, brand as a structured object. Some fields may only exist in metadata (e.g., sku, gtin, @type).
Examine both sources and extract all meaningful data fields. For each field, determine:
Three modes depending on arguments:
examples for formatting. Also extract additional fields not in the schema — they may reveal data the user didn't know about.For fields with large values (long text, HTML content, nested structures):
Save complete extraction to {work_path}/analyze-page/{page_id}.{html_variant}.json:
{
"url": "https://...",
"page_id": "detail-1",
"html_variant": "rendered",
"fields": {
"name": {"type": "str", "value": "Widget X"},
"price": {"type": "str", "value": "$29.99"},
"description": {"type": "str", "value": "Full long description..."}
}
}
Return a concise summary for the orchestrator. Truncate large values:
detail-1 (https://...):
name (str): "Widget X"
price (str): "$29.99"
description (str): "Premium widget with advanced..." (2340 chars)
rating (float): 4.5
Keep the summary compact — it will be loaded into the orchestrator's main context alongside summaries from other pages.