Configures, audits, and optimizes robots.txt files for path-level access control of search engine and AI crawlers, distinguishing crawl control from indexing control.
npx claudepluginhub joshuarweaver/cascade-data-analytics --plugin kostja94-marketing-skills-5

This skill uses the workspace's default tool permissions.
Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.
Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.
Identify: the site URL (e.g., https://example.com).

| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |

| Tool | Controls | Prevents indexing? |
|---|---|---|
| robots.txt | Crawl (path-level) | No—blocked URLs may still appear in SERP |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No—does not control indexing |
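For reference, the page-level noindex this table points to takes two standard forms: an HTML meta tag, or an X-Robots-Tag: noindex HTTP response header (useful for non-HTML resources such as PDFs). The meta-tag form:

```html
<!-- Page-level noindex: allows crawling but removes the page from the index -->
<meta name="robots" content="noindex">
```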
| Use | Tool | Example |
|---|---|---|
| Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex—crawlers must access the page to read the directive |
Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
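A minimal sketch of that split (example.com and the paths are illustrative):

```
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
# /login/, /signup/, /thank-you/ are deliberately NOT disallowed:
# they carry noindex, and crawlers can only read that directive
# if they are allowed to fetch the page.
```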
| Item | Requirement |
|---|---|
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |

| Directive | Purpose | Example |
|---|---|---|
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block path prefix | Disallow: /admin/ |
| Allow: | Allow path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query params (Yandex) | See below |
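Putting the directives together, a sketch of a complete file (hostname and paths are illustrative; per RFC 9309, the most specific matching rule wins, so the longer Allow overrides the Disallow):

```
User-agent: Googlebot
Disallow: /admin/
# Longest match wins: /admin/public/ stays crawlable for Googlebot
Allow: /admin/public/

User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
```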
| Do not block | Reason |
|---|---|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| /_next/ (Next.js) | Blocking it breaks CSS/JS loading; static assets showing as "Crawled - not indexed" in GSC is expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.
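To audit the result, one option is Python's standard-library robots.txt parser; a minimal sketch with an illustrative URL and paths (note that urllib.robotparser does basic prefix matching and does not implement every engine's wildcard extensions):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Render-critical assets must stay fetchable; admin paths should not be.
checks = {
    "https://example.com/_next/static/app.css": True,   # expect allowed
    "https://example.com/login/": True,                 # allowed: page uses noindex instead
    "https://example.com/admin/": False,                # expect blocked
}
for url, expected in checks.items():
    allowed = rp.can_fetch("Googlebot", url)
    status = "OK" if allowed == expected else "MISCONFIGURED"
    print(f"{status}: Googlebot {'may' if allowed else 'may not'} fetch {url}")
```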
robots.txt is effective for all measured AI crawlers (Vercel/MERJ study, 2024). Set rules per user-agent; check each vendor's docs for current tokens.

| User-agent | Purpose | Typical policy |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl (LLM training) | Disallow |
| Bytespider | ByteDance | Disallow |
| Meta-ExternalAgent | Meta | Disallow |
| AppleBot | Apple (Siri, Spotlight); renders JS | Allow for indexing |
Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See site-crawlability for AI crawler optimization (SSR, URL management).
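A sketch applying that split; the tokens are taken from the table above, but verify current tokens against each vendor's documentation before deploying:

```
# Allow AI search crawlers (citation/referral traffic)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow:

# Block training-only crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```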
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
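Only Yandex honors Clean-param; other crawlers ignore it. A minimal sketch of the directive in file context:

```
User-agent: Yandex
Disallow:
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```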