From pyspider-dev
Production-grade Pyspider SOP with dual-mode workflows (new project vs refactor), strategy patterns A-E, strict engineering redlines, and best practices. Use when: (1) Creating new Pyspider crawlers with anti-scraping strategies, (2) Refactoring existing production crawlers, (3) Managing database operations for scraping projects, (4) Implementing BrightData V3, Cookie pools, SSR parsing, API forwarding, or dispatchers. Provides strict redlines, zero-field-loss principles, and automation scripts.
npx claudepluginhub within-7/minto-plugin-tools --plugin pyspider-dev

This skill uses the workspace's default tool permissions.
Comprehensive production-grade SOP for Pyspider crawler development with dual-mode workflows and strategy patterns.
Use when creating crawlers from scratch:
- Query ScrapingMongoQuery for the preset name and scrap_key
- Use Functions.get_dict_by_dot and raise Exception for error handling
- Generate strategy templates:
# Strategy A (BrightData V3)
python scripts/init_strategy_crawler.py TikTokCrawler A ./tiktok_crawler.py
# Strategy B (Cookie Pool)
python scripts/init_strategy_crawler.py FacebookCrawler B ./facebook_crawler.py
# Strategy C (SSR)
python scripts/init_strategy_crawler.py YoutubeCrawler C ./youtube_crawler.py
# Strategy D (API Forward)
python scripts/init_strategy_crawler.py DifyCrawler D ./dify_crawler.py
# Strategy E (Dispatcher)
python scripts/init_strategy_crawler.py MainDispatch E ./main_dispatch.py
See STRATEGY_DEEP_DIVE.md for strategy details and strategy_examples.md for example implementations.
Use when optimizing existing production crawlers:
1. Contract Lock (Pre-Audit)
Use read_file to lock the existing request contract (including Sec-* fields)
2. Shadow Preservation Principle
Result field names must be pixel-perfect matches with the old version
3. Transparent Auditing
See MASTER_SOP.md for complete refactoring guidelines.
- Never use logger.error + silent return. Always raise Exception to trigger FAILED status
- Script (on_message) + Git (pure .py) + DB (register_business.py) must be kept synchronized
- Never change the name field when updating database config (it breaks the scheduling mapping)
- Commit only .py files. Never commit ./skills, .md, or .json
See PROJECT_GUIDE.md for complete architecture rules.
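A minimal sketch of the first redline, with illustrative function and field names (not from the SOP): raise so the task goes red FAILED instead of logging and returning quietly.

```python
def parse_detail(response_json):
    """Extract the payload; raise on failure so Pyspider marks the task FAILED."""
    data = response_json.get("data")
    if data is None:
        # Redline: NOT `logger.error(...)` + silent `return` -- that would
        # leave the task green while the result is empty.
        raise Exception("PARSE_FAILED: response has no 'data' field")
    return data
```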
For top-tier anti-scraping:
Check the records field. If status is ready/done but records == 0, throw BD_EMPTY_DATA
For strong account binding (Facebook/Reddit/Mjjl):
Pass ispProxy_us_forced_cookies via the save parameter
For proxy breakthrough (Youtube/Amazon/Twitter):
For pure APIs and internal AI forwarding:
For scheduling and distribution:
Use on_message for fanout and task routing
See STRATEGY_DEEP_DIVE.md for complete strategy details.
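For Strategy A, the records check described above might look like this sketch; the function name and snapshot field layout are assumptions for illustration.

```python
def check_brightdata_snapshot(snapshot):
    """Validate a finished BrightData V3 snapshot before downloading it."""
    status = snapshot.get("status")
    records = snapshot.get("records", 0)
    if status in ("ready", "done") and records == 0:
        # Finished but empty: surface a hard failure, never a silent success
        raise Exception("BD_EMPTY_DATA: snapshot finished with 0 records")
    return records
```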
Register new crawler mappings in ScrapingMongoQuery:
Pipeline Rules:
- Use $match to match scrap_key
- Use $project to map fields
Tool Usage:
# Set environment variables (optional, defaults available)
export MONGO_URI='mongodb://user:pass@host:port/?tls=false'
export MONGO_DB='feishudb'
export MONGO_COLLECTION='ScrapingMongoQuery'
# Register business config
python3 scripts/register_business.py '<JSON_CONFIG>'
Environment Variables:
- MONGO_URI - MongoDB connection URI (default: production URI)
- MONGO_DB - Database name (default: feishudb)
- MONGO_COLLECTION - Collection name (default: ScrapingMongoQuery)
See DATABASE_OPS_GUIDE.md for details.
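The $match/$project pipeline rules above can be sketched as a plain aggregation pipeline; the projected field names here are placeholders, not the real contract.

```python
def build_pipeline(scrap_key):
    """$match filters by scrap_key; $project maps raw fields onto outputs."""
    return [
        {"$match": {"scrap_key": scrap_key}},
        {"$project": {"_id": 0,
                      "title": "$raw.title",   # placeholder mapping
                      "url": "$raw.link"}},    # placeholder mapping
    ]
```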
Delete crawler with proper cleanup:
1. rm [Script].py
2. git add . → git commit
3. python3 scripts/delete_project.py [project_name]
Environment Variables (same as registration):
- MONGO_URI - MongoDB connection URI (default: production URI)
- MONGO_DB - Database name (default: projectdb)
- MONGO_COLLECTION - Collection name (default: projectdb)
Error Handling Redlines:
- Never use except Exception as e: pass
- Use @catch_status_code_error for Pyspider handling
- Never raise directly in on_message or on_start. Pass an error marker via save, then throw the exception in the callback phase
Purpose: Ensure exceptions occur within the Pyspider task lifecycle, generating a red FAILED status and triggering the n8n Webhook.
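A plain-Python sketch of the save-marker rule, with illustrative names: validation outside the task lifecycle only attaches a marker; the callback raises inside the lifecycle so the FAILED status is recorded.

```python
def build_save(message):
    """Runs in on_message/on_start: must NOT raise, only mark errors."""
    save = {}
    if not message.get("url"):
        save["error_marker"] = "EMPTY_URL"
    return save

def detail_callback(save):
    """Runs inside the task lifecycle: raising here yields a red FAILED."""
    if save.get("error_marker"):
        raise Exception(save["error_marker"])
    return "OK"
```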
Principle: When optimizing production scripts, data structure (Payload/Result) takes priority over code elegance
Action: Never delete any "useless" API parameters (e.g., nodeIdPaths) unless confirmed as dirty data
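The zero-field-loss principle can be sketched as follows; `merge_result` is a hypothetical helper, not part of the SOP.

```python
def merge_result(old_payload, updates):
    """Start from a full copy of the legacy payload so nothing is lost."""
    result = dict(old_payload)  # every legacy field survives verbatim
    result.update(updates)      # refactored fields overwrite in place
    return result
```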
All scripts must implement:
def on_message(self, project, message):
    if project == self.project_name:
        return message
    # Parse message from Dispatcher
    url = message.get('url')
    if url:
        self.crawl(url, callback=self.index_page)
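The dispatcher side of this handshake can be simulated in plain Python; the real bus is Pyspider's send_message, so everything below is a stand-in with illustrative names.

```python
scheduled = []

def worker_on_message(message):
    """Stand-in for the on_message template above."""
    url = message.get("url")
    if url:
        scheduled.append(url)  # stands in for self.crawl(url, callback=...)

def dispatch(urls):
    """Dispatcher fanout: one message per target URL."""
    for url in urls:
        worker_on_message({"url": url})
```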
Generate production-ready strategy templates:
python scripts/init_strategy_crawler.py <CrawlerName> <StrategyType> <output_path>
Available strategies:
- A: BrightData V3
- B: Cookie Pool
- C: SSR
- D: API Forward
- E: Dispatcher
See strategy_examples.md for example implementations.
See REFERENCE_INDEX.json for example scripts organized by strategy:
See reference_map.json for detailed strategy breakdown with features.
| Task Type | Must Load | Do NOT Load |
|---|---|---|
| New crawler development | All references | None |
| Refactoring existing crawler | MASTER_SOP.md, STRATEGY_DEEP_DIVE.md | PROJECT_GUIDE.md |
| Database operations | DATABASE_OPS_GUIDE.md | Strategy deep dives |
| Strategy selection | STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json | Refactoring SOPs |
Load all references when developing a new crawler from scratch.
Load specific references when:
- Database operations: DATABASE_OPS_GUIDE.md
- Refactoring an existing crawler: MASTER_SOP.md, STRATEGY_DEEP_DIVE.md
- Strategy selection: STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json