Help us improve
Share bugs, ideas, or general feedback.
From scraper-generator
Systematic approach for analyzing API documentation site structure, discovering HTML patterns, framework signatures, and extraction strategies to inform scraper code generation.
npx claudepluginhub grailautomation/claude-plugins --plugin scraper-generatorHow this skill is triggered — by the user, by Claude, or both
Slash command
/scraper-generator:doc-site-analysisThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
When analyzing an API documentation page to understand its structure and extraction patterns, follow this systematic approach.
Crawls web documentation into context for up-to-date coding assistance. Quick mode fetches 1-15 pages; deep mode indexes 20-100 pages for search and selective retrieval.
Ingests documentation site URLs, discovers pages via sitemap or nav crawl, extracts markdown, and generates Claude Code skill packages with SKILL.md indexes and references.
Analyzes web page content, structure, layout, metadata, and detects frameworks like React, Next.js, Vue, Angular, Svelte using browser automation.
Share bugs, ideas, or general feedback.
When analyzing an API documentation page to understand its structure and extraction patterns, follow this systematic approach.
Your goal is to discover how a documentation site organizes its API reference content so you can later write code to extract that content deterministically. You're not extracting data now—you're discovering the patterns that will inform scraper code generation.
Fetch the target URL and observe the raw HTML structure. Look for:
Document your observations before proceeding. The goal is to understand the "shape" of the documentation.
API documentation typically provides an index of available endpoints. Common patterns:
Look for tables with columns like Method/Resource/Description or similar. These are gold—they give you a complete list of endpoints in structured form.
<table>
<tr><th>Type</th><th>Resource</th><th>Description</th></tr>
<tr>
<td>GET</td>
<td><a href="#get-connection">connections/:connection_id</a></td>
<td>Get connection details</td>
</tr>
</table>
Key signals:
#heading-id) pointing to detail sectionsSome sites list endpoints in the sidebar. Look for:
When no explicit index exists, the headings themselves form the index:
Once you know where endpoints are listed, understand how detail sections are organized:
For heading-based sections (most common), note:
Within each endpoint section, identify where to find:
Usually in tables with columns: Name, Type, Required, Description
<table>
<tr><th>Name</th><th>Type</th><th>Required</th><th>Description</th></tr>
<tr><td>id</td><td>integer</td><td>yes</td><td>User ID</td></tr>
</table>
Look for section subheadings like "Request Parameters", "Query Parameters", "Body Parameters"
Code blocks with language hints:
<pre><code class="language-json">{"name": "example"}</code></pre>
<pre><code class="language-curl">curl -X GET ...</code></pre>
Or syntax-highlighted divs:
<div class="highlight-json"><pre>...</pre></div>
Prose paragraphs between the heading and first table/code block. May contain important context about authentication, rate limits, or special behavior.
Real documentation has inconsistencies. Look for:
Document any patterns that differ from the main structure.
After analysis, produce a structured document containing:
site:
name: "Workato API Documentation"
base_url: "https://docs.workato.com"
framework: "VuePress" # or Docusaurus, ReadMe, custom, unknown
index_pattern:
type: "quick_reference_table" # or sidebar, headings, list
location: "top of page"
columns: ["Type", "Resource", "Description"]
link_column: "Resource"
anchor_format: "#heading-id"
section_pattern:
type: "heading_based"
heading_level: "h2"
id_source: "explicit" # or generated
text_format: "{description}" # e.g., "Get connection details"
content_elements:
parameters:
type: "table"
columns: ["Name", "Type", "Required", "Description"]
request_examples:
type: "code_block"
languages: ["curl", "json"]
container: "pre > code"
response_examples:
type: "code_block"
languages: ["json"]
container: "div.highlight-json pre"
edge_cases:
- "First endpoint in some pages has broken anchor link"
- "Some tables use 'Required?' instead of 'Required'"
This structured output becomes the input for scraper code generation.
For detailed information on specific topics:
references/html-patterns.md - Common HTML structures for tables, code blocks, navigationreferences/framework-signatures.md - How to identify VuePress, Docusaurus, ReadMe, etc.references/extraction-strategies.md - Parsing approaches for different patternsFor a worked example:
examples/workato-analysis.md - Complete analysis of Workato API documentation