From pyspider-dev
Production-grade Pyspider SOP with dual-mode workflows (new project vs refactor), strategy patterns A-E, strict engineering redlines, and best practices. Use when: (1) Creating new Pyspider crawlers with anti-scraping strategies, (2) Refactoring existing production crawlers, (3) Managing database operations for scraping projects, (4) Implementing BrightData V3, Cookie pools, SSR parsing, API forwarding, or dispatchers. Provides strict redlines, zero-field-loss principles, and automation scripts.
npx claudepluginhub within-7/minto-plugin-tools --plugin pyspider-dev

This skill uses the workspace's default tool permissions.
Comprehensive production-grade SOP for Pyspider crawler development with dual-mode workflows and strategy patterns.
Use when creating crawlers from scratch:
- Query ScrapingMongoQuery for the preset name and scrap_key
- Use Functions.get_dict_by_dot and raise Exception for error handling
- Generate strategy templates:
# Strategy A (BrightData V3)
python scripts/init_strategy_crawler.py TikTokCrawler A ./tiktok_crawler.py
# Strategy B (Cookie Pool)
python scripts/init_strategy_crawler.py FacebookCrawler B ./facebook_crawler.py
# Strategy C (SSR)
python scripts/init_strategy_crawler.py YoutubeCrawler C ./youtube_crawler.py
# Strategy D (API Forward)
python scripts/init_strategy_crawler.py DifyCrawler D ./dify_crawler.py
# Strategy E (Dispatcher)
python scripts/init_strategy_crawler.py MainDispatch E ./main_dispatch.py
See STRATEGY_DEEP_DIVE.md for strategy details and strategy_examples.md for example implementations.
Use when optimizing existing production crawlers:
1. Contract Lock (Pre-Audit)
Use read_file to lock the existing request contract (including Sec-* fields)
2. Shadow Preservation Principle
Result field names must be pixel-perfect matches with the old version
3. Transparent Auditing
See MASTER_SOP.md for complete refactoring guidelines.
- Never use logger.error + silent return. Always raise Exception to trigger FAILED status
- Script (on_message) + Git (pure .py) + DB (register_business.py) must be kept synchronized
- Never change the name field when updating database config (it breaks the scheduling mapping)
- Commit only .py files. Never commit ./skills, .md, or .json
See PROJECT_GUIDE.md for complete architecture rules.
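A minimal sketch of the first redline, with illustrative function and field names (not from the SOP): raise so the task goes red FAILED instead of logging and returning quietly.

```python
def parse_detail(response_json):
    """Extract the payload; raise on failure so Pyspider marks the task FAILED."""
    data = response_json.get("data")
    if data is None:
        # Redline: NOT `logger.error(...)` + silent `return` -- that would
        # leave the task green while the result is empty.
        raise Exception("PARSE_FAILED: response has no 'data' field")
    return data
```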
For top-tier anti-scraping:
Check the records field. If status is ready/done but records == 0, throw BD_EMPTY_DATA
For strong account binding (Facebook/Reddit/Mjjl):
Pass ispProxy_us_forced_cookies via the save parameter
For proxy breakthrough (Youtube/Amazon/Twitter):
For pure APIs and internal AI forwarding:
For scheduling and distribution:
Use on_message for fanout and task routing
See STRATEGY_DEEP_DIVE.md for complete strategy details.
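For Strategy A, the records check described above might look like this sketch; the function name and snapshot field layout are assumptions for illustration.

```python
def check_brightdata_snapshot(snapshot):
    """Validate a finished BrightData V3 snapshot before downloading it."""
    status = snapshot.get("status")
    records = snapshot.get("records", 0)
    if status in ("ready", "done") and records == 0:
        # Finished but empty: surface a hard failure, never a silent success
        raise Exception("BD_EMPTY_DATA: snapshot finished with 0 records")
    return records
```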
Register new crawler mappings in ScrapingMongoQuery:
Pipeline Rules:
- Use $match to match scrap_key
- Use $project to map fields
Tool Usage:
# Set environment variables (optional, defaults available)
export MONGO_URI='mongodb://user:pass@host:port/?tls=false'
export MONGO_DB='feishudb'
export MONGO_COLLECTION='ScrapingMongoQuery'
# Register business config
python3 scripts/register_business.py '<JSON_CONFIG>'
Environment Variables:
- MONGO_URI - MongoDB connection URI (default: production URI)
- MONGO_DB - Database name (default: feishudb)
- MONGO_COLLECTION - Collection name (default: ScrapingMongoQuery)
See DATABASE_OPS_GUIDE.md for details.
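The $match/$project pipeline rules above can be sketched as a plain aggregation pipeline; the projected field names here are placeholders, not the real contract.

```python
def build_pipeline(scrap_key):
    """$match filters by scrap_key; $project maps raw fields onto outputs."""
    return [
        {"$match": {"scrap_key": scrap_key}},
        {"$project": {"_id": 0,
                      "title": "$raw.title",   # placeholder mapping
                      "url": "$raw.link"}},    # placeholder mapping
    ]
```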
Delete crawler with proper cleanup:
1. rm [Script].py
2. git add . → git commit
3. python3 scripts/delete_project.py [project_name]
Environment Variables (same as registration):
- MONGO_URI - MongoDB connection URI (default: production URI)
- MONGO_DB - Database name (default: projectdb)
- MONGO_COLLECTION - Collection name (default: projectdb)
Error Handling Redlines:
- Never use except Exception as e: pass
- Use @catch_status_code_error for Pyspider handling
- Never raise directly in on_message or on_start. Pass an error marker via save, then throw the exception in the callback phase
Purpose: Ensure exceptions occur within the Pyspider task lifecycle, generating a red FAILED status and triggering the n8n Webhook.
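A plain-Python sketch of the save-marker rule, with illustrative names: validation outside the task lifecycle only attaches a marker; the callback raises inside the lifecycle so the FAILED status is recorded.

```python
def build_save(message):
    """Runs in on_message/on_start: must NOT raise, only mark errors."""
    save = {}
    if not message.get("url"):
        save["error_marker"] = "EMPTY_URL"
    return save

def detail_callback(save):
    """Runs inside the task lifecycle: raising here yields a red FAILED."""
    if save.get("error_marker"):
        raise Exception(save["error_marker"])
    return "OK"
```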
Principle: When optimizing production scripts, data structure (Payload/Result) takes priority over code elegance
Action: Never delete any "useless" API parameters (e.g., nodeIdPaths) unless confirmed as dirty data
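The zero-field-loss principle can be sketched as follows; `merge_result` is a hypothetical helper, not part of the SOP.

```python
def merge_result(old_payload, updates):
    """Start from a full copy of the legacy payload so nothing is lost."""
    result = dict(old_payload)  # every legacy field survives verbatim
    result.update(updates)      # refactored fields overwrite in place
    return result
```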
All scripts must implement:
def on_message(self, project, message):
    if project == self.project_name:
        return message
    # Parse message from Dispatcher
    url = message.get('url')
    if url:
        self.crawl(url, callback=self.index_page)
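The dispatcher side of this handshake can be simulated in plain Python; the real bus is Pyspider's send_message, so everything below is a stand-in with illustrative names.

```python
scheduled = []

def worker_on_message(message):
    """Stand-in for the on_message template above."""
    url = message.get("url")
    if url:
        scheduled.append(url)  # stands in for self.crawl(url, callback=...)

def dispatch(urls):
    """Dispatcher fanout: one message per target URL."""
    for url in urls:
        worker_on_message({"url": url})
```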
Generate production-ready strategy templates:
python scripts/init_strategy_crawler.py <CrawlerName> <StrategyType> <output_path>
Available strategies:
- A: BrightData V3
- B: Cookie Pool
- C: SSR
- D: API Forward
- E: Dispatcher
See strategy_examples.md for example implementations.
See REFERENCE_INDEX.json for example scripts organized by strategy:
See reference_map.json for detailed strategy breakdown with features.
| Task Type | Must Load | Do NOT Load |
|---|---|---|
| New crawler development | All references | None |
| Refactoring existing crawler | MASTER_SOP.md, STRATEGY_DEEP_DIVE.md | PROJECT_GUIDE.md |
| Database operations | DATABASE_OPS_GUIDE.md | Strategy deep dives |
| Strategy selection | STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json | Refactoring SOPs |
Load all references when developing a new crawler from scratch.
Load specific references when:
- Database operations: DATABASE_OPS_GUIDE.md
- Refactoring an existing crawler: MASTER_SOP.md, STRATEGY_DEEP_DIVE.md
- Strategy selection: STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json