From zyte-web-data
Deploy Scrapy projects to Scrapy Cloud / Zyte Cloud, schedule spiders, list and stop jobs, and help inspect items and logs via the web UI.
Slash command: /zyte-web-data:scrape-scrapy-cloud [project-dir]
You are a general-purpose Scrapy Cloud assistant. You can deploy projects, schedule
spiders, manage jobs (list, stop), and direct users to the web UI to inspect items
and logs — all using shub and the Scrapy Cloud HTTP API.
Read python-environments.md and docs-access.md from ${CLAUDE_SKILL_DIR}/../scrape/references.
scrapinghub.yml, ~/.scrapinghub.yml, and environment variables).
scrapinghub.yml, use it to determine which project to deploy.
The raw argument string is $ARGUMENTS. Use it as-is and treat empty as "no argument given":
IMPORTANT: It's critical that API keys are not displayed or exposed to the agent or user during a session.
Do not use any tool that might print the key or its value (e.g. cat ~/.scrapinghub.yml) as an auth probe.
Navigate to project_dir (or the current directory if none was given). Confirm scrapy.cfg exists:
ls scrapy.cfg
If scrapy.cfg is not in the current directory, walk up the directory tree to find it. Once found, cd into that directory — all subsequent commands must run from there.
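The upward search described above can be sketched in Python. This is a minimal illustration, not part of the skill: `find_project_root` is a hypothetical helper name, and it assumes the project root is the nearest ancestor directory that contains `scrapy.cfg`.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains scrapy.cfg."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None
```

A caller would typically pass `Path.cwd()` or the given project directory and change into the returned path before running any further command.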
ls scrapinghub.yml 2>/dev/null || echo "(not found)"
If the file exists, skip step 3 and proceed directly to step 4.
Get the project ID
Check for a saved project ID from a previous scrape-zyte-login:
cat .scrape/.zyte/project-id 2>/dev/null || echo "(not set)"
If present, use it and skip asking.
Otherwise, open the Zyte projects page:
xdg-open "https://app.zyte.com/o/projects/" 2>/dev/null || open "https://app.zyte.com/o/projects/"
Ask the user:
Please open https://app.zyte.com/o/projects/ and provide the numeric project ID
for this spider (e.g. 12345). If you don't have a project yet, create one first.
Wait for the user to supply a numeric project ID.
Detect requirements file
Check which dependency file exists (in order of preference):
for f in requirements.txt pyproject.toml Pipfile; do [ -f "$f" ] && echo "$f" && break; done
Specify a Scrapy stack
Use the Docker Hub API or scrape the tags page to find the most recent stack tag for the scrapinghub/scrapinghub-stack-scrapy repository that matches the Scrapy version in the requirements file:
Docker Hub API: https://hub.docker.com/v2/repositories/scrapinghub/scrapinghub-stack-scrapy/tags
Tags page: https://hub.docker.com/r/scrapinghub/scrapinghub-stack-scrapy/tags
Tags follow the format {VERSION}-{YYYYMMDD} (e.g. 2.14-20260326). Select the tag with the highest version number and, among equal versions, the latest frozen date. Use that tag as the stack value in scrapinghub.yml, prefixed with scrapy: (e.g. scrapy:2.14-20260326).
If the requirements file is missing or doesn't specify a Scrapy version, use the latest tag overall.
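The selection rule above can be sketched as follows. This is an illustration under stated assumptions: `parse_tag` and `pick_stack` are hypothetical names, tags that do not match `{VERSION}-{YYYYMMDD}` are ignored, and a pinned Scrapy version is matched on its major.minor part only. Fetching the tag list is left out.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed tag shape: dotted numeric version, a dash, then an 8-digit date.
TAG_RE = re.compile(r"^(\d+(?:\.\d+)*)-(\d{8})$")


def parse_tag(tag: str) -> Optional[Tuple[Tuple[int, ...], str]]:
    """Split a tag into (version tuple, date string), or None if it does not match."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, match.group(2)


def pick_stack(tags: Iterable[str], scrapy_version: Optional[str] = None) -> Optional[str]:
    """Pick the highest version, then the latest date; optionally filter by major.minor."""
    wanted = tuple(int(p) for p in scrapy_version.split(".")[:2]) if scrapy_version else ()
    candidates = []
    for tag in tags:
        parsed = parse_tag(tag)
        if parsed is None:
            continue  # e.g. "latest" or any other non-conforming tag
        version, date = parsed
        if version[: len(wanted)] == wanted:
            candidates.append((version, date, tag))
    if not candidates:
        return None
    # Tuples compare by version first, then by date (YYYYMMDD sorts correctly as text).
    return "scrapy:" + max(candidates)[2]
```

Comparing versions as integer tuples rather than strings keeps 2.14 above 2.9, which a plain string sort would get wrong.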
Write scrapinghub.yml
Create the file. Use the detected requirements file, or omit the requirements block if none was found.
If SCRAPY_CLOUD_ENDPOINT is set, add an endpoint: key to target a non-production environment.
Example with requirements.txt:
project: 12345
stack: scrapy:2.14-20260326 # replace with the latest tag fetched above
requirements:
  file: requirements.txt
endpoint: https://app-staging.zyte.com/api/ # include when SCRAPY_CLOUD_ENDPOINT is set
Example with pyproject.toml (Poetry):
project: 12345
stack: scrapy:2.14-20260326 # replace with the latest tag fetched above
requirements:
  file: pyproject.toml
Example with no requirements file found:
project: 12345
stack: scrapy:2.14-20260326 # replace with the latest tag fetched above
Note: the requirements file should only list packages not already provided by the stack (e.g. do not include
scrapy itself).
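The three example files differ only in their optional keys, so generating the file can be sketched as one function. `render_scrapinghub_yml` is a hypothetical helper written for illustration. It emits the same keys shown in the examples and assumes the values need no YAML quoting.

```python
from typing import Optional


def render_scrapinghub_yml(
    project_id: int,
    stack: str,
    requirements_file: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> str:
    """Build scrapinghub.yml text, omitting the optional blocks when not needed."""
    lines = [f"project: {project_id}", f"stack: {stack}"]
    if requirements_file:
        # Nested key, indented two spaces under "requirements:".
        lines += ["requirements:", f"  file: {requirements_file}"]
    if endpoint:
        # Only present when SCRAPY_CLOUD_ENDPOINT is set.
        lines.append(f"endpoint: {endpoint}")
    return "\n".join(lines) + "\n"
```

A caller could pass `os.environ.get("SCRAPY_CLOUD_ENDPOINT")` as `endpoint`, so the key is omitted automatically when the variable is unset.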
Run the deploy from the project root with uvx, which fetches shub on
demand — no separate install step, and it works whether or not shub is
already on PATH:
uvx shub deploy
Stream and display the full output. A successful deploy looks like:
Packing version 3af023e-master
Deploying to Scrapy Cloud project "12345"
{"status": "ok", "project": 12345, "version": "3af023e-master", "spiders": 2}
Run your spiders at: https://app.zyte.com/p/12345/
On failure, diagnose the error and fix before retrying:
Error: Not logged in. Please run 'shub login' first. → not authenticated; invoke /scrape-zyte-login.
Error: Invalid value for target: Please specify target or configure a default target in scrapinghub.yml. → no project ID configured; ask the user for a project ID, update scrapinghub.yml and retry.
Authentication error / 403 → API key is wrong or missing; invoke /scrape-zyte-login.
Project N does not exist → wrong project ID; correct scrapinghub.yml and retry.
Could not find requirements file → the requirements.file path in scrapinghub.yml is wrong; fix the path and retry.
No module named scrapy or build errors → a dependency is missing or incompatible with the selected stack; update the requirements file and retry.
Tell the user the deploy succeeded, and include the version, spider count, and project link (https://app.zyte.com/p//).
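The failure cases above amount to a lookup from an error fragment to a next action. The sketch below shows one way to express that. The table `DEPLOY_ERROR_HINTS` and the function `diagnose_deploy_failure` are hypothetical, and plain substring matching is an assumption that may need tightening against real shub output.

```python
from typing import Optional

# (fragment expected in the deploy output, suggested next action)
DEPLOY_ERROR_HINTS = [
    ("Not logged in", "not authenticated; invoke /scrape-zyte-login"),
    ("Invalid value for target", "no project ID configured; ask for one and update scrapinghub.yml"),
    ("Authentication error", "API key is wrong or missing; invoke /scrape-zyte-login"),
    ("does not exist", "wrong project ID; correct scrapinghub.yml"),
    ("Could not find requirements file", "fix the requirements.file path in scrapinghub.yml"),
    ("No module named", "dependency missing or incompatible with the stack; update the requirements file"),
]


def diagnose_deploy_failure(output: str) -> Optional[str]:
    """Return the first matching hint for a failed deploy, or None if nothing matches."""
    for fragment, hint in DEPLOY_ERROR_HINTS:
        if fragment in output:
            return hint
    return None
```

An unmatched failure returns `None`, which signals that the full output needs to be read rather than guessed at.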
Use uvx shub schedule to start a job that runs a spider on Scrapy Cloud
without redeploying.
Tag jobs with skill.
uvx shub schedule SPIDER --tag skill
On success, shub prints the job ID and convenience links:
Spider myspider scheduled, job ID: 12345/2/15
Watch the log on the command line:
shub log -f 2/15
or watch it running in Zyte's web interface:
https://app.zyte.com/p/12345/job/2/15
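The job ID printed on success has the form PROJECT_ID/SPIDER_ID/JOB_ID, which the later API calls need. A minimal sketch of extracting it from the output shown above follows. `JobKey` and `parse_job_key` are hypothetical names, and the regular expression assumes the exact "job ID: N/N/N" wording of that sample.

```python
import re
from typing import NamedTuple, Optional

JOB_ID_RE = re.compile(r"job ID: (\d+)/(\d+)/(\d+)")


class JobKey(NamedTuple):
    project_id: int
    spider_id: int
    job_id: int


def parse_job_key(shub_output: str) -> Optional[JobKey]:
    """Extract the PROJECT_ID/SPIDER_ID/JOB_ID triple from `shub schedule` output."""
    match = JOB_ID_RE.search(shub_output)
    if match is None:
        return None
    return JobKey(*(int(group) for group in match.groups()))
```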
Spider arguments (-a)
Repeat -a KEY=VALUE for each argument:
uvx shub schedule myspider -a ARG1=VALUE1 -a ARG2=VALUE2 --tag skill
Scrapy settings (-s)
uvx shub schedule myspider -s LOG_LEVEL=DEBUG -s CLOSESPIDER_PAGECOUNT=10 --tag skill
uvx shub schedule 33333/myspider --tag skill
| Flag | Description |
|---|---|
| -u | Number of Scrapy Cloud units to use (1–6) |
| -p | Job priority: 0 (lowest) to 4 (highest) |
uvx shub schedule myspider -p 3 -u 3 --tag skill
uvx shub schedule myspider \
-a start_url=https://example.com \
-s LOG_LEVEL=DEBUG \
-p 2 \
-u 1 \
--tag skill
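When the command is assembled programmatically, building an argument list avoids shell-quoting problems with URLs and values. The sketch below uses only the flags shown above (`-a`, `-s`, `-p`, `-u`, `--tag`). `build_schedule_command` is a hypothetical helper, and the range checks mirror the limits stated in the flag table.

```python
from typing import Dict, List, Optional


def build_schedule_command(
    spider: str,
    args: Optional[Dict[str, str]] = None,
    settings: Optional[Dict[str, str]] = None,
    priority: Optional[int] = None,
    units: Optional[int] = None,
    tag: str = "skill",
) -> List[str]:
    """Return an argv list for `uvx shub schedule`, suitable for subprocess.run."""
    cmd = ["uvx", "shub", "schedule", spider]
    for key, value in (args or {}).items():
        cmd += ["-a", f"{key}={value}"]  # spider arguments
    for key, value in (settings or {}).items():
        cmd += ["-s", f"{key}={value}"]  # per-job settings
    if priority is not None:
        if not 0 <= priority <= 4:
            raise ValueError("priority must be between 0 and 4")
        cmd += ["-p", str(priority)]
    if units is not None:
        if not 1 <= units <= 6:
            raise ValueError("units must be between 1 and 6")
        cmd += ["-u", str(units)]
    cmd += ["--tag", tag]
    return cmd
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the project root.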
Use the Scrapy Cloud Jobs HTTP API for listing and stopping jobs.
Base URL: ${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}
IMPORTANT: It is critical that API requests are made with the wrapper script at scripts/scrapy_cloud_api.py,
which handles authentication without leaking credentials to the agent. Do not make API requests with curl or other
tools that might expose credentials.
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py HTTP_METHOD API_URL [-q QUERY_ARG=VALUE]... [-b BODY_ARG=VALUE]...
With arbitrary query parameters (-q) and body parameters (-b) as needed per endpoint. See the script's help message for details.
# All running jobs
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py GET "${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}jobs/list.json" -q project=PROJECT_ID -q state=running
# Latest 3 finished jobs for a specific spider
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py GET "${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}jobs/list.json" -q project=PROJECT_ID -q spider=SPIDER_NAME -q state=finished -q count=3
# Jobs that lack the "consumed" tag
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py GET "${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}jobs/list.json" -q project=PROJECT_ID -q lacks_tag=consumed
Available state values: pending, running, finished, deleted.
Available filter parameters: job, spider, state, has_tag, lacks_tag, count.
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py POST "${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}jobs/stop.json" -b project=PROJECT_ID -b job=PROJECT_ID/SPIDER_ID/JOB_ID
Direct the user to the Scrapy Cloud web UI to inspect items and logs.
Derive the web UI base URL from SCRAPY_CLOUD_ENDPOINT by stripping the
trailing /api/ path (default: https://app.zyte.com).
${SCRAPY_CLOUD_ENDPOINT%api/}p/PROJECT_ID/SPIDER_ID/JOB_ID/items
Open with:
BASE_UI="${SCRAPY_CLOUD_ENDPOINT%api/}"
xdg-open "${BASE_UI:-https://app.zyte.com/}p/PROJECT_ID/SPIDER_ID/JOB_ID/items" 2>/dev/null \
|| open "${BASE_UI:-https://app.zyte.com/}p/PROJECT_ID/SPIDER_ID/JOB_ID/items"
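The same URL derivation can be written in Python, for example when a script needs to print the link instead of opening a browser. This is a sketch: `web_ui_base` and `job_page_url` are hypothetical names, and the suffix stripping mirrors the shell expansion `${SCRAPY_CLOUD_ENDPOINT%api/}`, which leaves the trailing slash in place.

```python
from typing import Optional

DEFAULT_ENDPOINT = "https://app.zyte.com/api/"


def web_ui_base(endpoint: Optional[str] = None) -> str:
    """Strip a trailing 'api/' from the endpoint, falling back to the default."""
    endpoint = endpoint or DEFAULT_ENDPOINT
    if endpoint.endswith("api/"):
        endpoint = endpoint[: -len("api/")]
    return endpoint


def job_page_url(endpoint: Optional[str], project_id: int, spider_id: int,
                 job_id: int, page: str = "items") -> str:
    """Build the web UI URL for a job's items or log page."""
    return f"{web_ui_base(endpoint)}p/{project_id}/{spider_id}/{job_id}/{page}"
```

Passing `os.environ.get("SCRAPY_CLOUD_ENDPOINT")` as `endpoint` reproduces the default-when-unset behaviour of the shell version.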
For a quick summary of item counts and field coverage without downloading all items, use the stats endpoint directly:
uv run ${CLAUDE_SKILL_DIR}/scripts/scrapy_cloud_api.py GET "${SCRAPY_CLOUD_ENDPOINT:-https://app.zyte.com/api/}items/PROJECT_ID/SPIDER_ID/JOB_ID/stats"
# Response: {"counts":{"field1":9350,"field2":514},"totals":{"input_bytes":14390294,"input_values":10000}}
Response fields:
| Field | Description |
|---|---|
| counts[field] | Number of times each field was populated. |
| totals.input_bytes | Total size of all items in bytes. |
| totals.input_values | Total number of items. |
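Field coverage follows from these fields by dividing each count by the total number of items. A small sketch, using the sample response above as input; `field_coverage` is a hypothetical helper name.

```python
import json
from typing import Dict


def field_coverage(stats: dict) -> Dict[str, float]:
    """Return the fraction of items in which each field was populated."""
    total = stats["totals"]["input_values"]
    if total == 0:
        # No items scraped: report zero coverage instead of dividing by zero.
        return {field: 0.0 for field in stats["counts"]}
    return {field: count / total for field, count in stats["counts"].items()}


sample = json.loads(
    '{"counts":{"field1":9350,"field2":514},'
    '"totals":{"input_bytes":14390294,"input_values":10000}}'
)
coverage = field_coverage(sample)
```

For the sample response, `field1` is populated in 93.5% of items and `field2` in 5.14%, which flags `field2` as a likely extraction problem.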
${SCRAPY_CLOUD_ENDPOINT%api/}p/PROJECT_ID/SPIDER_ID/JOB_ID/log
Open with:
BASE_UI="${SCRAPY_CLOUD_ENDPOINT%api/}"
xdg-open "${BASE_UI:-https://app.zyte.com/}p/PROJECT_ID/SPIDER_ID/JOB_ID/log" 2>/dev/null \
|| open "${BASE_UI:-https://app.zyte.com/}p/PROJECT_ID/SPIDER_ID/JOB_ID/log"
Logs returned by the HTTP API include a numeric level field:
| Value | Level |
|---|---|
| 10 | DEBUG |
| 20 | INFO |
| 30 | WARNING |
| 40 | ERROR |
| 50 | CRITICAL |
Each log entry is a JSON object with fields: time (Unix ms), level, and message.
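The numeric levels match the constants in Python's standard `logging` module, so an entry can be rendered without a custom lookup table. The sketch below assumes one JSON object per line with the `time`, `level`, and `message` fields described above. `format_log_entry` is a hypothetical helper name.

```python
import json
import logging
from datetime import datetime, timezone


def format_log_entry(line: str) -> str:
    """Render one JSON log entry as 'ISO-8601 timestamp LEVEL message'."""
    entry = json.loads(line)
    # `time` is in Unix milliseconds; convert to seconds for datetime.
    when = datetime.fromtimestamp(entry["time"] / 1000, tz=timezone.utc)
    level = logging.getLevelName(entry["level"])  # e.g. 40 -> "ERROR"
    return f"{when.isoformat()} {level} {entry['message']}"
```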
For bulk downloads, pagination, field filtering, and format options (JSON, JSON Lines), refer to the HTTP API documentation.
Invoke this skill when the user asks to:
Example phrases that should trigger this skill:
| Variable | Default | Description |
|---|---|---|
| SHUB_APIKEY | (none) | Scrapy Cloud API key; falls back to ~/.scrapinghub.yml |
| SCRAPY_CLOUD_ENDPOINT | https://app.zyte.com/api/ | Jobs API base URL (override for staging) |
| SCRAPY_CLOUD_STORAGE_ENDPOINT | https://storage.zyte.com/ | Items/logs storage base URL (override for staging) |
The web UI base URL is derived from SCRAPY_CLOUD_ENDPOINT by stripping the
trailing /api/ path. For example:
https://app.zyte.com/api/ → https://app.zyte.com
https://app-staging.zyte.com/api/ → https://app-staging.zyte.com

| Symptom | Cause | Fix |
|---|---|---|
| Error: No such command / shub not found | shub invoked directly but not installed | Invoke it as uvx shub …, which is fetched on demand with no install |
| Authentication error / 401 | API key missing or invalid | Run shub login or set $SHUB_APIKEY |
| 403 | API key lacks access to this project | Verify the project ID and key permissions |
| Spider not found | Spider name is wrong or project not deployed | Verify the spider name; deploy with uvx shub deploy first |
| Project N does not exist | Wrong project ID or alias | Check scrapinghub.yml or specify the correct ID |
| Could not find requirements file | Wrong path in scrapinghub.yml | Fix the requirements.file path and redeploy |
| No module named scrapy / build errors | Dependency missing or wrong stack | Update requirements file and redeploy |
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-data