From zyte-web-data
Generates web-poet page object code from an extraction spec produced by /scrape-spec. Handles any data type by reading schema, HTML fixtures, and expected values.
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-data
Slash command
/zyte-web-data:scrape-codegen [spec-path] [project-dir] [fields]
You are generating a web-poet page object from an extraction spec. The spec contains a schema, saved HTML pages, and expected values. It may describe any data type — product details, navigation links, article content, etc. Codegen doesn't need to know the data type; it generates a PO that extracts according to the schema.
The spec was produced by /scrape-spec and the project by /scrape-ensure-project.
The raw argument string is $ARGUMENTS. Split it into up to 3 whitespace-separated positional arguments: spec_path, project_dir, and fields.
.scrape/books-toscrape
Read {spec_path}/spec.json to get:
schema.properties: the field definitions
html_variant: which HTML to use (raw or rendered)
url: the starting URL (used for domain name)
data_type: what's being extracted (used for class naming)
If fields is provided, filter schema.properties to only include those fields.
List page directories in {spec_path}/pages/ that have corresponding values in
{spec_path}/values/. Read expected values from each.
Derive site_name from the spec_path (parent directory name, e.g. books-toscrape from .scrape/books-toscrape/products).
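A minimal sketch of the page listing and site_name derivation, using only pathlib. The helper names are hypothetical; the values/{page_id}.json layout follows the paths used elsewhere in this skill.

```python
from pathlib import Path


def pages_with_values(spec_path: str) -> list[str]:
    """Return page IDs under pages/ that have a matching values/<page_id>.json."""
    root = Path(spec_path)
    pages_dir = root / "pages"
    if not pages_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in pages_dir.iterdir()
        if p.is_dir() and (root / "values" / f"{p.name}.json").is_file()
    )


def derive_site_name(spec_path: str) -> str:
    """Use the name of spec_path's parent directory as the site name."""
    return Path(spec_path).parent.name
```

For the example in the text, derive_site_name(".scrape/books-toscrape/products") gives books-toscrape.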
Detect the project name from {project_dir}.
Check {project_name}/items.py for an existing item class matching data_type.
If none exists, write one based on the schema (all fields optional, | None = None).
Add a page object stub:
/scrape-add-page-object {project_dir}/{project_name}/pages/{module_name}.py \
{ClassName} {domain} web_poet.WebPage {project_name}.items.{ItemClass}
Use web_poet.BrowserPage if html_variant is rendered.
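The stub command can be assembled from the spec values roughly as follows. This is a sketch under assumptions: stub_command is a hypothetical helper, and taking the domain from the URL's network location is one plausible reading of "used for domain name". The base-class names are those given in the text.

```python
from urllib.parse import urlparse


def stub_command(project_dir: str, project_name: str, module_name: str,
                 class_name: str, item_class: str, url: str,
                 html_variant: str) -> str:
    """Build the /scrape-add-page-object invocation as a single string."""
    # Rule from the text: rendered HTML uses the browser-based base class.
    base = "web_poet.BrowserPage" if html_variant == "rendered" else "web_poet.WebPage"
    # Assumption: the domain is the network location of the starting URL.
    domain = urlparse(url).netloc
    return (
        f"/scrape-add-page-object {project_dir}/{project_name}/pages/{module_name}.py "
        f"{class_name} {domain} {base} {project_name}.items.{item_class}"
    )
```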
Find the fixture class path from the project structure (e.g.,
{project_name}.pages.{module_name}.{ClassName}).
uv run ${CLAUDE_SKILL_DIR}/scripts/convert_fixtures.py \
{spec_path} {project_dir} {fixture_class_path}
mkdir -p .scrape/.work/{site_name}/codegen-analyze
Launch one Agent per page with values, all in a single message for parallel
execution. Each agent runs /scrape-codegen-analyze with all 4 arguments:
/scrape-codegen-analyze {spec_path}/pages/{page_id}/{html_variant}.html .scrape/.work/{site_name} {spec_path}/spec.json {spec_path}/values/{page_id}.json
Skip pages whose HTML file doesn't exist.
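The per-page fan-out, including the skip rule, can be sketched as a function that returns one command string per page. The name analyze_commands is hypothetical; how the agents are actually launched is outside this sketch.

```python
from pathlib import Path


def analyze_commands(spec_path: str, site_name: str, html_variant: str,
                     page_ids: list[str]) -> list[str]:
    """Return one /scrape-codegen-analyze command per page that has HTML."""
    commands = []
    for page_id in page_ids:
        html = Path(spec_path) / "pages" / page_id / f"{html_variant}.html"
        if not html.is_file():
            continue  # skip pages whose HTML file doesn't exist
        commands.append(
            f"/scrape-codegen-analyze {html.as_posix()} .scrape/.work/{site_name} "
            f"{spec_path}/spec.json {spec_path}/values/{page_id}.json"
        )
    return commands
```

All returned commands would then be issued in a single message so the agents run in parallel.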
After all analysis agents complete, launch a single Agent running
/scrape-codegen-generate with all 3 arguments:
/scrape-codegen-generate .scrape/.work/{site_name} {project_dir}/{project_name}/pages/{module_name}.py {spec_path}/spec.json
cd {project_dir} && uv run pytest fixtures/ -x -v
Report results. If tests fail, read errors and consider re-generating failed fields.
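If the test step is driven from a script rather than typed by hand, a thin wrapper along these lines could run it and capture the output for the report. run_fixture_tests is a hypothetical helper; the runner parameter exists only so the function can be exercised without uv installed.

```python
import subprocess


def run_fixture_tests(project_dir: str, runner=subprocess.run) -> tuple[bool, str]:
    """Run the fixture tests in project_dir; return (passed, combined output)."""
    result = runner(
        ["uv", "run", "pytest", "fixtures/", "-x", "-v"],
        cwd=project_dir,          # equivalent to `cd {project_dir}` first
        capture_output=True,
        text=True,
    )
    # pytest exits with 0 only when every collected test passed.
    return result.returncode == 0, result.stdout + result.stderr
```

The captured output is where the failing field names appear, which is what the re-generation decision depends on.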
Generated page object at {project_dir}/{project_name}/pages/{module_name}.py:
Class: {ClassName} (N fields)
Fixtures: N test cases
Tests: N/N passing
Follow the web-poet reference at references/web-poet-reference.md, plus:
Return None for missing data, never an empty string, False, or [].
Check for None before attribute access.
Do not catch a bare Exception; catch only specific exceptions.
Read metadata with extruct: the metadata format matches extract_metadata.py output from earlier stages, so the same access patterns work in the page object.