Production-grade Pyspider SOP with dual-mode workflows (new project vs refactor), strategy patterns A-E, strict engineering redlines, and best practices. Use when: (1) Creating new Pyspider crawlers with anti-scraping strategies, (2) Refactoring existing production crawlers, (3) Managing database operations for scraping projects, (4) Implementing BrightData V3, Cookie pools, SSR parsing, API forwarding, or dispatchers. Provides strict redlines, zero-field-loss principles, and automation scripts.
Generates production-grade Pyspider crawlers with strategy patterns, refactoring redlines, and database operations for web scraping projects.
```bash
npx claudepluginhub within-7/minto-plugin-tools
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
- assets/examples/strategy_examples.md
- references/DATABASE_OPS_GUIDE.md
- references/MASTER_SOP.md
- references/PROJECT_GUIDE.md
- references/REFERENCE_INDEX.json
- references/STRATEGY_DEEP_DIVE.md
- references/reference_map.json
- scripts/delete_project.py
- scripts/init_strategy_crawler.py
- scripts/register_business.py

Comprehensive production-grade SOP for Pyspider crawler development with dual-mode workflows and strategy patterns.
Use when creating crawlers from scratch:
- Query ScrapingMongoQuery for the preset name and scrap_key
- Use Functions.get_dict_by_dot and raise Exception for error handling

Generate Strategy Templates:
```bash
# Strategy A (BrightData V3)
python scripts/init_strategy_crawler.py TikTokCrawler A ./tiktok_crawler.py

# Strategy B (Cookie Pool)
python scripts/init_strategy_crawler.py FacebookCrawler B ./facebook_crawler.py

# Strategy C (SSR)
python scripts/init_strategy_crawler.py YoutubeCrawler C ./youtube_crawler.py

# Strategy D (API Forward)
python scripts/init_strategy_crawler.py DifyCrawler D ./dify_crawler.py

# Strategy E (Dispatcher)
python scripts/init_strategy_crawler.py MainDispatch E ./main_dispatch.py
```
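As a rough sketch of the kind of handler skeleton init_strategy_crawler.py is expected to emit (class and method names here are illustrative, not the template's literal output; a real handler would subclass pyspider's BaseHandler), the on_message hook is the part every strategy shares under the SOP's dispatcher contract:

```python
class TikTokCrawler(object):
    """Illustrative skeleton; production code subclasses pyspider BaseHandler."""
    project_name = 'TikTokCrawler'

    def on_start(self):
        # Entry point; real templates would seed from ScrapingMongoQuery presets.
        pass

    def on_message(self, project, message):
        # Messages originating from this project are passed through unchanged.
        if project == self.project_name:
            return message
        # Parse the message from the Dispatcher and enqueue the crawl.
        url = message.get('url')
        if url:
            self.crawl(url, callback=self.index_page)

    def index_page(self, response):
        # Parse and return the result dict (zero-field-loss applies here).
        return {'url': response.url}
```

The self.crawl call follows pyspider's BaseHandler API; everything else is scaffolding the real templates fill in per strategy.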
See STRATEGY_DEEP_DIVE.md for strategy details and strategy_examples.md for example implementations.
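The Functions.get_dict_by_dot plus raise-Exception convention above can be sketched as follows. The real helper lives in the project's shared Functions module; this stand-in only illustrates the expected behavior, and raising on a missing key follows the SOP's error-handling redline:

```python
def get_dict_by_dot(data, dot_path):
    """Resolve 'a.b.c' against nested dicts, raising if any segment is missing."""
    current = data
    for key in dot_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            # Raise instead of returning None, so the Pyspider task goes FAILED.
            raise Exception('Missing field in response: %s' % dot_path)
        current = current[key]
    return current

payload = {'data': {'user': {'name': 'alice'}}}
print(get_dict_by_dot(payload, 'data.user.name'))  # alice
```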
Use when optimizing existing production crawlers:
1. Contract Lock (Pre-Audit): read_file the old script first and lock its request contract (including Sec-* fields)
2. Shadow Preservation Principle: result field names must be pixel-perfect with the old version
3. Transparent Auditing
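A minimal contract check in the spirit of the Shadow Preservation Principle: before shipping a refactor, diff the result keys produced by the old and new scripts. The function and variable names here are illustrative:

```python
def assert_shadow_preserved(old_result, new_result):
    """Raise if the refactored result's field names drift from the old contract."""
    missing = set(old_result) - set(new_result)
    extra = set(new_result) - set(old_result)
    if missing:
        raise Exception('Refactor dropped result fields: %s' % sorted(missing))
    if extra:
        raise Exception('Refactor added unexpected fields: %s' % sorted(extra))

old = {'title': 'a', 'nodeIdPaths': [], 'ts': 1}
new = {'title': 'b', 'nodeIdPaths': [], 'ts': 2}
assert_shadow_preserved(old, new)  # passes: same field names, pixel-perfect
```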
See MASTER_SOP.md for complete refactoring guidelines.
Engineering redlines:

- Never use logger.error + silent return. Always raise Exception to trigger FAILED status.
- Script (on_message) + Git (pure .py) + DB (register_business.py) must be synchronized.
- Never change the name field when updating database config (it breaks the scheduling mapping).
- Commit only .py files. Never commit ./skills, .md, or .json.

See PROJECT_GUIDE.md for complete architecture rules.
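The logger-versus-raise redline can be contrasted in a short sketch. A silent return leaves the task looking green, while raising propagates into Pyspider's task status. Function and field names are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def parse_bad(response_json):
    data = response_json.get('data')
    if data is None:
        logger.error('empty data')  # REDLINE: error is swallowed, task stays green
        return None
    return data

def parse_good(response_json):
    data = response_json.get('data')
    if data is None:
        # Raising propagates into Pyspider's task status: red FAILED.
        raise Exception('empty data for task')
    return data
```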
Strategy A (BrightData V3), for top-tier anti-scraping:

- Check the records field. If status is ready/done but records == 0, throw BD_EMPTY_DATA.

Strategy B (Cookie Pool), for strong account binding (Facebook/Reddit/Mjjl):

- Pass ispProxy_us_forced_cookies via the save parameter.

Strategy C (SSR), for proxy breakthrough (Youtube/Amazon/Twitter).

Strategy D (API Forward), for pure APIs and internal AI forwarding.

Strategy E (Dispatcher), for scheduling and distribution:

- Use on_message for fanout and task routing.

See STRATEGY_DEEP_DIVE.md for complete strategy details.
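The Strategy A snapshot guard above can be sketched as follows. The metadata shape (status, records, snapshot_id) is an assumption about the BrightData V3 response, not a documented schema:

```python
def check_snapshot(meta):
    """Treat a finished BrightData snapshot with zero records as a hard failure."""
    status = meta.get('status')
    records = meta.get('records', 0)
    if status in ('ready', 'done') and records == 0:
        # Raising BD_EMPTY_DATA pushes the task into the red FAILED state.
        raise Exception('BD_EMPTY_DATA: snapshot %s finished with 0 records'
                        % meta.get('snapshot_id'))
    return records
```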
Register new crawler mappings in ScrapingMongoQuery:
Pipeline Rules:
- Use $match to match scrap_key
- Use $project to map fields

Tool Usage:
```bash
# Set environment variables (optional, defaults available)
export MONGO_URI='mongodb://user:pass@host:port/?tls=false'
export MONGO_DB='feishudb'
export MONGO_COLLECTION='ScrapingMongoQuery'

# Register business config
python3 scripts/register_business.py '<JSON_CONFIG>'
```
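A hedged sketch of the `<JSON_CONFIG>` argument: the authoritative schema is defined in DATABASE_OPS_GUIDE.md, so every field below except scrap_key (used by the $match pipeline rule) and name (which must never change once registered) is an assumption:

```python
import json

config = {
    'name': 'TikTokCrawler',            # never change after registration
    'scrap_key': 'tiktok_user_videos',  # matched by the $match pipeline stage
    'pipeline': [
        {'$match': {'scrap_key': 'tiktok_user_videos'}},
    ],
}

# Shell-quoted invocation of the registration script:
print("python3 scripts/register_business.py '%s'" % json.dumps(config))
```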
Environment Variables:

- MONGO_URI: MongoDB connection URI (default: production URI)
- MONGO_DB: Database name (default: feishudb)
- MONGO_COLLECTION: Collection name (default: ScrapingMongoQuery)

See DATABASE_OPS_GUIDE.md for details.
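The pipeline rules above translate into an aggregation like the following. Field names other than scrap_key are illustrative; with pymongo this would run as `db['ScrapingMongoQuery'].aggregate(pipeline)`:

```python
# $match selects the business by scrap_key; $project maps raw fields
# onto the output contract.
pipeline = [
    {'$match': {'scrap_key': 'tiktok_user_videos'}},
    {'$project': {'_id': 0,
                  'title': '$raw.title',
                  'author': '$raw.author_name'}},
]
```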
Delete crawler with proper cleanup:
1. rm [Script].py
2. git add . → git commit
3. python3 scripts/delete_project.py [project_name]

Environment Variables (same as registration):

- MONGO_URI: MongoDB connection URI (default: production URI)
- MONGO_DB: Database name (default: projectdb)
- MONGO_COLLECTION: Collection name (default: projectdb)

Exception handling redlines:

- Never swallow errors with except Exception as e: pass.
- Do not rely on @catch_status_code_error alone for Pyspider handling.
- Never raise inside on_message or on_start. Pass an error marker via save and throw the exception in the callback phase.

Purpose: Ensure exceptions occur within the Pyspider task lifecycle, generating a red FAILED status and triggering the n8n Webhook.
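The save-marker rule can be sketched as follows: on_message only records the problem in save, and the callback raises so the failure lands inside the task lifecycle. The self.crawl signature follows pyspider's BaseHandler API; the class and error text are illustrative:

```python
class Handler(object):
    def on_message(self, project, message):
        url = message.get('url')
        if not url:
            # No raise here: mark the problem and let the callback fail instead.
            self.crawl('data:,missing', callback=self.index_page,
                       save={'error': 'dispatcher message missing url'})
            return
        self.crawl(url, callback=self.index_page, save={})

    def index_page(self, response):
        error = response.save.get('error')
        if error:
            # Raising inside a callback yields the red FAILED status
            # and fires the n8n webhook.
            raise Exception(error)
        return {'url': response.url}
```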
Principle: When optimizing production scripts, data-structure stability (Payload/Result) takes priority over code elegance
Action: Never delete any "useless" API parameters (e.g., nodeIdPaths) unless confirmed as dirty data
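The zero-field-loss action above can be sketched as a payload rebuild that carries over every original key (nodeIdPaths included) and only overrides the ones you intend to change. Keys here are illustrative:

```python
def rebuild_payload(old_payload, **overrides):
    """Copy every original key, then apply intentional overrides only."""
    payload = dict(old_payload)   # keep "useless" params like nodeIdPaths
    payload.update(overrides)
    return payload

old = {'query': 'a', 'nodeIdPaths': ['1.2.3'], 'legacyFlag': True}
new = rebuild_payload(old, query='b')
# nodeIdPaths and legacyFlag survive untouched
```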
All scripts must implement:

```python
def on_message(self, project, message):
    # Messages originating from this project are passed through unchanged
    if project == self.project_name:
        return message
    # Parse message from Dispatcher
    url = message.get('url')
    if url:
        self.crawl(url, callback=self.index_page)
```
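The dispatcher side of this contract, sketched below, uses send_message (in production this is pyspider BaseHandler.send_message; here it is stubbed to record calls so the sketch is self-contained). The target project name and message shape are illustrative:

```python
class MainDispatch(object):
    def __init__(self):
        self.sent = []

    def send_message(self, project, msg):
        # Stand-in for pyspider's BaseHandler.send_message(project, msg).
        self.sent.append((project, msg))

    def dispatch(self, urls):
        # Fan each URL out to the worker project's on_message hook.
        for url in urls:
            self.send_message('TikTokCrawler', {'url': url})
```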
Generate production-ready strategy templates:
```bash
python scripts/init_strategy_crawler.py <CrawlerName> <StrategyType> <output_path>
```
Available strategies:

- A: BrightData V3
- B: Cookie Pool
- C: SSR
- D: API Forward
- E: Dispatcher
See strategy_examples.md for example implementations of each strategy.
See REFERENCE_INDEX.json for example scripts organized by strategy:
See reference_map.json for detailed strategy breakdown with features.
| Task Type | Must Load | Do NOT Load |
|---|---|---|
| New crawler development | All references | None |
| Refactoring existing crawler | MASTER_SOP.md, STRATEGY_DEEP_DIVE.md | PROJECT_GUIDE.md |
| Database operations | DATABASE_OPS_GUIDE.md | Strategy deep dives |
| Strategy selection | STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json | Refactoring SOPs |
Load all references when:

- Developing a new crawler from scratch

Load specific references when:

- Database operations: DATABASE_OPS_GUIDE.md
- Refactoring an existing crawler: MASTER_SOP.md, STRATEGY_DEEP_DIVE.md
- Strategy selection: STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json