Main crawling and indexing logic for documentation sites; handles sitemap parsing, content extraction, and storage.
Crawls documentation sites via sitemap, extracts clean content, and creates searchable indexes. Used when `/fetch-docs` is called to fetch and index documentation from any supported framework (Docusaurus, VitePress, Nextra, GitBook, Mintlify, ReadTheDocs).
Installation:
```
/plugin marketplace add squirrelsoft-dev/doc-fetcher
/plugin install doc-fetcher@squirrelsoft-dev-tools
```
This skill inherits all available tools. When active, it can use any tool Claude has access to.
I orchestrate the crawling, parsing, and indexing of documentation from web sources. I'm the core engine behind the /fetch-docs command.
I handle the complete documentation fetching workflow:
### Phase 1: Discovery
First, I determine how to discover all documentation pages:
```
1. Check for llms.txt (delegate to llms-txt-finder)
   ✗ Not found → Continue

2. Check for sitemap.xml
   ✓ Found at https://example.com/sitemap.xml

3. Parse sitemap
   Found 234 URLs matching /docs/ pattern

4. Filter and validate URLs
   Kept 234 documentation URLs
   Excluded 45 blog/marketing URLs
```
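A minimal sketch of this discovery step (the helper name, the `/docs/` filter, and the blog exclusion are illustrative, not the exact implementation):

```ts
// Sketch: discover documentation URLs from sitemap.xml.
async function discoverFromSitemap(baseUrl: string): Promise<string[]> {
  const res = await fetch(new URL("/sitemap.xml", baseUrl));
  if (!res.ok) throw new Error(`No sitemap.xml (HTTP ${res.status})`);
  const xml = await res.text();

  // Pull every <loc> entry out of the sitemap.
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1].trim());

  // Keep documentation pages, drop blog/marketing URLs.
  return urls.filter((u) => u.includes("/docs/") && !u.includes("/blog/"));
}
```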
### Phase 2: Framework Detection
I detect the documentation framework to optimize extraction:
Supported Frameworks:
- **Docusaurus** - Meta's React-based doc framework. Detected via `docusaurus.config.js` or `<meta name="generator" content="Docusaurus">`.
- **VitePress** - Vue-powered static site generator. Detected via `.vitepress/config.js` or `<meta name="generator" content="VitePress">`; content lives in `.vp-doc` blocks, with `config.js` providing sidebar structure.
- **Nextra** - Next.js-based docs framework (used by Next.js itself). Detected via `_meta.json` files or Next.js patterns; the `_meta.json` hierarchy provides structure.
- **GitBook** - Popular documentation platform. Detected via `SUMMARY.md` or GitBook meta tags; `SUMMARY.md` provides structure.
- **Mintlify** - Modern documentation platform (used by Supabase).
- **ReadTheDocs** - Sphinx-based documentation. Content in `.document` or `.body`; `searchindex.js` provides the page list.
- **Custom/Static** - Fallback for custom sites.
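A rough sketch of how detection could work against the markers above (the heuristics and helper name are illustrative; real detection also inspects config files):

```ts
type Framework =
  | "docusaurus" | "vitepress" | "nextra" | "gitbook" | "mintlify" | "readthedocs" | "custom";

// Sketch: guess the framework from the generator meta tag and page markers.
function detectFramework(html: string): Framework {
  const generator = /<meta\s+name="generator"\s+content="([^"]+)"/i.exec(html)?.[1] ?? "";

  if (/docusaurus/i.test(generator)) return "docusaurus";
  if (/vitepress/i.test(generator)) return "vitepress";
  if (/gitbook/i.test(generator)) return "gitbook";
  if (/sphinx/i.test(generator)) return "readthedocs";
  if (html.includes("__NEXT_DATA__")) return "nextra"; // Assumed Next.js marker
  if (html.includes("mintlify")) return "mintlify";    // Assumed marker
  return "custom"; // Fallback for custom/static sites
}
```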
### Phase 3: Crawling
Systematic fetching with safeguards:
```js
// Conceptual crawling algorithm
const crawlConfig = {
  maxPages: 500,          // Configurable limit
  delayMs: 1000,          // Rate limiting
  timeout: 30000,         // 30-second timeout per page
  retries: 3,             // Retry failed pages
  respectRobotsTxt: true, // Honor robots.txt
  userAgent: "Claude Code Doc Fetcher/1.0"
};

let currentPage = 0;
for (const url of discoveredUrls) {
  await delay(crawlConfig.delayMs);                              // Rate limiting
  const html = await fetchWithTimeout(url, crawlConfig.timeout); // Fetch with timeout
  const content = await extractContent(html, framework);         // Extract content
  await saveToCache(url, content);                               // Save to cache
  updateProgress(++currentPage, discoveredUrls.length);          // Progress update
}
```
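The `delay` and `fetchWithTimeout` helpers referenced above are not defined here; a plausible sketch using AbortController (names and user-agent string from this document, the rest assumed):

```ts
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: fetch a page, aborting if it takes longer than timeoutMs.
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      signal: controller.signal,
      headers: { "User-Agent": "Claude Code Doc Fetcher/1.0" },
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.text();
  } finally {
    clearTimeout(timer);
  }
}
```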
### Phase 4: Content Extraction
I extract clean, AI-friendly content:
What I Keep: headings, body text, code blocks (with language hints), tables, and image references.
What I Remove: navigation, sidebars, footers, scripts, and other page chrome.
Example Extraction:
Input (HTML):
```html
<!DOCTYPE html>
<html>
  <head>...</head>
  <body>
    <nav><!-- Navigation --></nav>
    <main>
      <h1>Server Actions</h1>
      <p>Server Actions are asynchronous functions...</p>
      <pre><code class="language-tsx">
export async function createUser(formData: FormData) {
  'use server'
  // ...
}
      </code></pre>
    </main>
    <footer><!-- Footer --></footer>
  </body>
</html>
```
Output (Markdown):
# Server Actions
Server Actions are asynchronous functions...
```tsx
export async function createUser(formData: FormData) {
  'use server'
  // ...
}
```
### Phase 5: Storage
I organize documentation in a structured format:
```
.claude/docs/
└── nextjs/
    └── 15.0.3/
        ├── index.json       # Metadata
        ├── sitemap.json     # Page hierarchy
        └── pages/
            ├── getting-started.md
            ├── routing/
            │   ├── introduction.md
            │   ├── defining-routes.md
            │   ├── pages-and-layouts.md
            │   └── ...
            ├── data-fetching/
            │   ├── fetching.md
            │   ├── caching.md
            │   └── ...
            └── api-reference/
                └── ...
```
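A sketch of how a crawled URL could map onto this layout (the slug logic is illustrative):

```ts
import * as path from "node:path";

// Sketch: map a documentation URL to its on-disk Markdown file.
function pagePath(library: string, version: string, url: string): string {
  const { pathname } = new URL(url);
  // e.g. /docs/routing/defining-routes -> routing/defining-routes.md
  const slug = pathname.replace(/^\/docs\//, "").replace(/\/$/, "") || "index";
  return path.join(".claude/docs", library, version, "pages", `${slug}.md`);
}

// pagePath("nextjs", "15.0.3", "https://nextjs.org/docs/routing/defining-routes")
//   -> ".claude/docs/nextjs/15.0.3/pages/routing/defining-routes.md"
```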
**index.json** - Complete metadata:
```json
{
"library": "nextjs",
"version": "15.0.3",
"source_url": "https://nextjs.org/docs",
"fetched_at": "2025-01-17T10:30:00Z",
"framework": "nextra",
"page_count": 234,
"total_size_bytes": 5242880,
"llms_txt_url": null,
"sitemap_url": "https://nextjs.org/sitemap.xml",
"skill_generated": true,
"skill_path": "skills/nextjs-15-expert",
"extraction_strategy": "nextra-optimized",
"crawl_duration_seconds": 245,
"errors": 0,
"warnings": 3
}
```
**sitemap.json** - Page hierarchy:
```json
{
"structure": [
{
"title": "Getting Started",
"path": "getting-started.md",
"children": []
},
{
"title": "Routing",
"path": "routing/",
"children": [
{
"title": "Introduction",
"path": "routing/introduction.md"
},
{
"title": "Defining Routes",
"path": "routing/defining-routes.md"
}
]
}
]
}
```
### Phase 6: Search Index
Create a searchable index for fast lookups:
```json
{
"index_version": "1.0",
"created_at": "2025-01-17T10:35:00Z",
"pages": [
{
"path": "routing/defining-routes.md",
"title": "Defining Routes",
"headings": [
"Creating Routes",
"File Conventions",
"Dynamic Routes"
],
"keywords": ["routing", "file-system", "app-router", "dynamic"],
"code_languages": ["tsx", "jsx"],
"word_count": 1234
}
]
}
```
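A minimal sketch of how a lookup against this index might rank pages (field names follow the example above; the scoring weights are arbitrary):

```ts
interface IndexedPage {
  path: string;
  title: string;
  headings: string[];
  keywords: string[];
}

// Sketch: rank indexed pages for a query by title, heading, and keyword matches.
function searchIndex(pages: IndexedPage[], query: string): IndexedPage[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const score = (page: IndexedPage) =>
    terms.reduce((sum, term) => {
      if (page.title.toLowerCase().includes(term)) sum += 3;
      if (page.headings.some((h) => h.toLowerCase().includes(term))) sum += 2;
      if (page.keywords.includes(term)) sum += 1;
      return sum;
    }, 0);

  return pages
    .map((page) => ({ page, relevance: score(page) }))
    .filter((entry) => entry.relevance > 0)
    .sort((a, b) => b.relevance - a.relevance)
    .map((entry) => entry.page);
}
```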
Framework-specific extraction strategies:
**Docusaurus**
Strategy:
1. Parse docusaurus.config.js for sidebar structure
2. Extract from <article> or .markdown elements
3. Preserve admonitions (:::note, :::tip, etc.)
4. Handle versioned docs (if multiple versions)
Example:
```html
<article class="markdown">
  <h1>Title</h1>
  <p>Content...</p>
</article>
```
→
```markdown
# Title
Content...
```
**VitePress**
Strategy:
1. Parse .vitepress/config.js for navigation
2. Extract from .vp-doc or .content divs
3. Handle frontmatter in markdown source
4. Preserve custom containers
Example selectors:
- .vp-doc
- .content
- main article
**Nextra**
Strategy:
1. Parse _meta.json files for structure
2. Extract main content from MDX
3. Handle Next.js-specific components
4. Follow nested _meta.json for hierarchy
Example:
`_meta.json`:
```json
{
  "index": "Introduction",
  "routing": "Routing",
  "data-fetching": "Data Fetching"
}
```
**GitBook**
Strategy:
1. Parse SUMMARY.md for table of contents
2. Extract from .page-inner or article elements
3. Handle GitBook-specific markdown extensions
4. Follow chapter structure
Example SUMMARY.md:
```markdown
# Summary

* [Introduction](README.md)
* [Chapter 1](chapter1.md)
  * [Section 1.1](chapter1/section1.md)
```
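A sketch of the SUMMARY.md parsing step (the regex and two-space indent assumption are illustrative):

```ts
interface TocEntry { title: string; file: string; depth: number }

// Sketch: turn GitBook's SUMMARY.md bullet list into a flat table of contents.
function parseSummary(markdown: string): TocEntry[] {
  const entries: TocEntry[] = [];
  for (const line of markdown.split("\n")) {
    const match = /^(\s*)\*\s+\[([^\]]+)\]\(([^)]+)\)/.exec(line);
    if (!match) continue;
    entries.push({
      depth: Math.floor(match[1].length / 2), // Two-space indent per nesting level
      title: match[2],
      file: match[3],
    });
  }
  return entries;
}
```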
**Mintlify**
Strategy:
1. Use Mintlify API if available
2. Extract from main content containers
3. Handle MDX components
4. Parse mint.json for navigation
Example selectors:
- .docs-content
- main article
**ReadTheDocs**
Strategy:
1. Parse searchindex.js for page list
2. Extract from .document or .body
3. Handle Sphinx directives
4. Preserve code-block languages
Example selectors:
- div.document
- div.body
- section[role="main"]
**Rate Limiting**
Default: 1 second between requests
Configurable: 100ms to 5000ms
Behavior:
- Respect server delays
- Exponential backoff on errors
- Throttle on rate limit responses (429)
**Robots.txt Compliance**
1. Fetch robots.txt before crawling
2. Parse User-agent rules
3. Respect Disallow directives
4. Honor Crawl-delay if specified
Example robots.txt:
```
User-agent: *
Disallow: /admin
Crawl-delay: 1
```
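A simplified sketch of the robots.txt check; it only handles User-agent, Disallow, and Crawl-delay groups, while real parsers cover more edge cases:

```ts
interface RobotsRules { disallow: string[]; crawlDelayMs?: number }

// Sketch: extract Disallow and Crawl-delay rules that apply to our user-agent.
function parseRobots(body: string, agent = "Claude Code Doc Fetcher"): RobotsRules {
  const rules: RobotsRules = { disallow: [] };
  let applies = false;
  for (const raw of body.split("\n")) {
    const [key, ...rest] = raw.split(":");
    const value = rest.join(":").trim();
    switch (key.trim().toLowerCase()) {
      case "user-agent":
        applies = value === "*" || agent.toLowerCase().startsWith(value.toLowerCase());
        break;
      case "disallow":
        if (applies && value) rules.disallow.push(value);
        break;
      case "crawl-delay":
        if (applies) rules.crawlDelayMs = Number(value) * 1000;
        break;
    }
  }
  return rules;
}
```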
**Error Handling**
- 404 Not Found → Skip page, log warning, continue
- 500 Server Error → Retry with exponential backoff (3 attempts)
- Rate Limit (429) → Increase delay, retry after the specified time
- Timeout → Retry with a longer timeout
- Network Error → Retry, offer to pause/resume
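A sketch of this retry policy, with exponential backoff and basic Retry-After handling on 429 responses (the exact delays and the error class are illustrative):

```ts
class SkipPageError extends Error {}
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: retry a fetch with exponential backoff; honor Retry-After on 429.
async function fetchWithRetry(url: string, retries = 3): Promise<string> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status === 429) {
        const retryAfter = Number(res.headers.get("retry-after") ?? "5");
        await delay(retryAfter * 1000); // Wait as instructed by the server, then retry
        continue;
      }
      if (res.status === 404) throw new SkipPageError(url); // Skip page, log, continue
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (err instanceof SkipPageError || attempt === retries) throw err;
      await delay(1000 * 2 ** attempt); // Exponential backoff: 1s, 2s, 4s, ...
    }
  }
  throw new Error(`Failed after ${retries} retries: ${url}`);
}
```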
Progress display during crawling:
```
Fetching Next.js documentation...
[████████████░░░░░░░░] 156/234 pages (67%)
Current: /docs/app/building-your-application/routing/defining-routes
Speed: 2.5 pages/sec
Elapsed: 1m 2s
Remaining: ~30s
Errors: 0
```
After crawling, I validate the content:
```
✓ All pages fetched (234/234)
✓ No missing dependencies
✓ Code blocks properly formatted
✓ Images referenced correctly
✓ Internal links valid
⚠ 3 external links broken (logged)

Quality Score: 98/100
```
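A sketch of the internal-link portion of this validation pass (assumes extracted pages are already on disk; path resolution is simplified):

```ts
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: verify that relative Markdown links point at files we actually saved.
function findBrokenInternalLinks(pagesDir: string, file: string): string[] {
  const markdown = fs.readFileSync(path.join(pagesDir, file), "utf8");
  const broken: string[] = [];
  for (const [, , target] of markdown.matchAll(/\[([^\]]*)\]\(([^)]+)\)/g)) {
    if (/^(https?:)?\/\//.test(target) || target.startsWith("#")) continue; // external/anchor
    const resolved = path.join(pagesDir, path.dirname(file), target.split("#")[0]);
    if (!fs.existsSync(resolved)) broken.push(target);
  }
  return broken;
}
```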
When updating existing documentation:
1. Compare current index with new sitemap
2. Identify:
- New pages (added)
- Modified pages (changed)
- Removed pages (deleted)
3. Fetch only changed pages
4. Update index
5. Preserve unchanged content
Example:
```
Checking for changes...
New: 5 pages
Modified: 12 pages
Removed: 2 pages
Unchanged: 215 pages

Fetching only 17 changed pages...
(Saves ~90% of time vs full re-crawl)
```
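A sketch of the change-detection step behind those numbers, diffing stored pages against the fresh sitemap (using lastmod to flag modifications is an assumption; the real check may compare content hashes):

```ts
interface PageRecord { url: string; lastmod?: string }

// Sketch: diff previously indexed pages against the latest sitemap entries.
function diffPages(previous: PageRecord[], latest: PageRecord[]) {
  const before = new Map(previous.map((p) => [p.url, p] as const));
  const after = new Map(latest.map((p) => [p.url, p] as const));

  return {
    added: latest.filter((p) => !before.has(p.url)),
    removed: previous.filter((p) => !after.has(p.url)),
    modified: latest.filter((p) => {
      const old = before.get(p.url);
      return old !== undefined && old.lastmod !== p.lastmod;
    }),
  };
}
```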
Configure my behavior in `doc-fetcher-config.json`:
```json
{
"indexer": {
"crawl_delay_ms": 1000,
"max_pages_per_fetch": 500,
"timeout_ms": 30000,
"max_retries": 3,
"respect_robots_txt": true,
"user_agent": "Claude Code Doc Fetcher/1.0",
"frameworks": {
"auto_detect": true,
"priority": ["docusaurus", "vitepress", "nextra", "gitbook"]
},
"extraction": {
"remove_navigation": true,
"preserve_code_blocks": true,
"extract_images": true,
"follow_redirects": true
},
"validation": {
"check_links": true,
"validate_code": false,
"min_content_length": 100
}
}
}
```
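A sketch of how the `indexer` block of this file could be loaded and merged with defaults (the defaults mirror the values shown above; the helper name is illustrative):

```ts
import * as fs from "node:fs";

interface IndexerConfig {
  crawl_delay_ms: number;
  max_pages_per_fetch: number;
  timeout_ms: number;
  max_retries: number;
  respect_robots_txt: boolean;
}

const DEFAULTS: IndexerConfig = {
  crawl_delay_ms: 1000,
  max_pages_per_fetch: 500,
  timeout_ms: 30000,
  max_retries: 3,
  respect_robots_txt: true,
};

// Sketch: load doc-fetcher-config.json and fall back to defaults for missing keys.
function loadIndexerConfig(file = "doc-fetcher-config.json"): IndexerConfig {
  if (!fs.existsSync(file)) return DEFAULTS;
  const parsed = JSON.parse(fs.readFileSync(file, "utf8"));
  return { ...DEFAULTS, ...(parsed.indexer ?? {}) };
}
```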
I leverage the tools Claude Code makes available for fetching pages and writing files.
Example workflow:
```
/fetch-docs nextjs
→ Invoking doc-indexer skill...
[1/7] Discovery
✓ Checked for llms.txt (not found)
✓ Found sitemap.xml
✓ Parsed 234 documentation URLs
[2/7] Framework Detection
✓ Detected: Nextra
✓ Optimized extraction strategy loaded
[3/7] Robots.txt Check
✓ Fetched robots.txt
✓ No restrictions for our user-agent
✓ Crawl-delay: 1 second (using default)
[4/7] Crawling
[████████████████████] 234/234 (100%)
Duration: 4m 12s
Errors: 0
[5/7] Content Extraction
✓ Extracted clean markdown from all pages
✓ Preserved 156 code blocks
✓ Removed navigation/footer from all pages
[6/7] Storage
✓ Saved to .claude/docs/nextjs/15.0.3/
✓ Created index.json
✓ Created sitemap.json
Total size: 5.2 MB
[7/7] Validation
✓ All pages valid
✓ Code blocks properly formatted
✓ Links checked (3 external broken - logged)
Quality: 98/100
✓ Documentation indexed successfully!
Next: Generating skill (auto-enabled)
Run: /generate-doc-skill nextjs
```
"Framework detection failed"
"Too many pages (>500)"
"Rate limited by server"
"Content extraction poor quality"
Related components:
- llms-txt-finder skill - Checks for AI-optimized docs first
- doc-crawler agent - Advanced crawling for difficult sites
- /fetch-docs command - Main entry point
- /update-docs command - Incremental updates