Help us improve
Share bugs, ideas, or general feedback.
From zyte-web-data
Analyzes an HTML page against a schema and expected values to produce detailed field extraction instructions, covering CSS selectors, JSON-LD paths, microdata, OpenGraph tags, and script data, for code generation.
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-dataHow this skill is triggered — by the user, by Claude, or both
Slash command
/zyte-web-data:scrape-codegen-analyze [page-html-path] [work-path] [spec-path] [values-path][page-html-path] [work-path] [spec-path] [values-path]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are analyzing a detail page to produce extraction instructions for a code generation system. Given an HTML page, a schema, and expected values, you determine WHERE and HOW each field can be extracted from the page.
Mines projects and conversations into a searchable memory palace and retrieves past work via semantic search.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.
Share bugs, ideas, or general feedback.
You are analyzing a detail page to produce extraction instructions for a code generation system. Given an HTML page, a schema, and expected values, you determine WHERE and HOW each field can be extracted from the page.
Your analysis will be read by a separate code-generation agent that does not have access to the HTML. It must be detailed enough for that agent to write correct web-poet extraction code.
The raw argument string is $ARGUMENTS. Split it into 4 whitespace-separated positional arguments:
.scrape/spec/pages/detail-1/raw.html.scrape/.work/spec.scrape/spec/spec.json.scrape/spec/values/detail-1.jsonPlus, taken from the surrounding prompt text (not from the argument string):
The page directory (parent of the HTML file) also contains meta.json with the source URL.
Derive the page_id from the directory name (e.g. detail-1 from .../detail-1/raw.html).
Read meta.json from the same directory for the source URL.
Read the schema from {spec_path} — use the properties object inside schema.
Read the expected values from {values_path} — use the values object (may be {}).
Clean the HTML and extract structured metadata. Use level 0 cleaning to preserve scripts (which may contain JSON-LD/embedded data):
mkdir -p {work_path}/codegen-analyze
uv run ${CLAUDE_SKILL_DIR}/../scrape-analyze-page/scripts/clean_html.py PAGE.html -l0 -o {work_path}/codegen-analyze/{page_id}.cleaned.html
uv run ${CLAUDE_SKILL_DIR}/../scrape-analyze-page/scripts/extract_metadata.py PAGE.html -u PAGE_URL -o {work_path}/codegen-analyze/{page_id}.metadata.json
Read only the cleaned HTML (not the original) and the metadata JSON.
For each field in the schema, produce a detailed analysis. Consider all possible data sources in the HTML:
<script type="application/ld+json"> blocks — note the JSON pathitemscope/itemprop attributes<meta property="og:..."> tags<script> tags (e.g. window.__DATA__ = {...})<meta name="..."> tagsdata-* attributesFor each source found, describe:
... to shorten long content)Then recommend the best extraction method and explain why.
For each field, determine the correct target extraction value:
nullSave to {work_path}/codegen-analyze/{page_id}.json:
{
"url": "https://example.com/product/widget-x",
"page_id": "detail-1",
"fields": {
"name": {
"target_value": "Widget X",
"analysis": "The product name appears in two places:\n\n1. **HTML element** `<h1 class=\"product-title\">Widget X</h1>`\n - Selector: `h1.product-title::text`\n - Clean text, no post-processing needed\n - Reliable: unique h1 on the page\n\n2. **JSON-LD** in `<script type=\"application/ld+json\">`:\n ```json\n {\"@type\": \"Product\", \"name\": \"Widget X\", ...}\n ```\n - Path: `name` on the Product object\n - Also reliable\n\nRecommended: CSS selector `h1.product-title::text` — simplest, most direct."
},
"price": {
"target_value": "$29.99",
"analysis": "..."
}
}
}
Return a compact summary for the orchestrator:
detail-1 (https://...):
name: "Widget X" — h1.product-title, also in JSON-LD
price: "$29.99" — span.price::text, JSON-LD offers.price
description: "A premium widget..." (2340 chars) — div.description
rating: null — not found in HTML