Systematic approach for analyzing API documentation site structure, discovering HTML patterns, framework signatures, and extraction strategies to inform scraper code generation.
From scraper-generator
npx claudepluginhub grailautomation/claude-plugins --plugin scraper-generator
This skill uses the workspace's default tool permissions.
When analyzing an API documentation page to understand its structure and extraction patterns, follow this systematic approach.
Your goal is to discover how a documentation site organizes its API reference content so you can later write code to extract that content deterministically. You're not extracting data now; you're discovering the patterns that will inform scraper code generation.
Fetch the target URL and observe the raw HTML structure. Look for:
Document your observations before proceeding. The goal is to understand the "shape" of the documentation.
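As a rough illustration of this first pass, the sketch below counts tags and lists heading ids to get a feel for the page's shape. It uses only the Python standard library; `fetch_html`, `TagCensus`, and `summarize` are hypothetical names, and the fetch helper assumes a plain GET with no authentication, retries, or JavaScript rendering.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TagCensus(HTMLParser):
    """Counts start tags and records heading ids to show a page's rough shape."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.heading_ids = []  # (tag, id or None) in document order

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.heading_ids.append((tag, dict(attrs).get("id")))


def fetch_html(url, timeout=30):
    """Download a page as text (hypothetical helper, not called in this demo)."""
    request = Request(url, headers={"User-Agent": "structure-analysis-sketch"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def summarize(html):
    """Parse an HTML string and return the filled-in census."""
    census = TagCensus()
    census.feed(html)
    census.close()
    return census


# Inline sample so the sketch runs without network access.
sample = "<h1 id='top'>API</h1><table><tr><td>GET</td></tr></table><h2>Untitled</h2>"
summary = summarize(sample)
print(summary.tag_counts.most_common(5))
print(summary.heading_ids)
```

A page with many `table` tags and id-bearing headings suggests a table index with anchor targets; a page with almost no content tags may be rendered client-side, which this sketch cannot see.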
API documentation typically provides an index of available endpoints. Common patterns:
Look for tables with columns like Method/Resource/Description or similar. These are gold: they give you a complete list of endpoints in structured form.
<table>
<tr><th>Type</th><th>Resource</th><th>Description</th></tr>
<tr>
<td>GET</td>
<td><a href="#get-connection">connections/:connection_id</a></td>
<td>Get connection details</td>
</tr>
</table>
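To check that a table like the one above really can serve as an index, it can help to parse it into records. The following is a minimal sketch using the standard-library `html.parser`; `QuickReferenceParser` and `parse_endpoint_index` are hypothetical names, and the code assumes a single table whose first row is the header.

```python
from html.parser import HTMLParser


class QuickReferenceParser(HTMLParser):
    """Collects table rows as cell text plus the first link href in each row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row: {"cells": [...], "href": str or None}
        self._row = None   # row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "href": None}
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None and self._row["href"] is None:
            self._row["href"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row["cells"].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_endpoint_index(html):
    """Turn a header row plus body rows into a list of dicts keyed by column."""
    parser = QuickReferenceParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    keys = [name.lower() for name in header["cells"]]
    endpoints = []
    for row in body:
        entry = dict(zip(keys, row["cells"]))
        entry["anchor"] = row["href"]  # e.g. "#get-connection", or None
        endpoints.append(entry)
    return endpoints


html = """
<table>
<tr><th>Type</th><th>Resource</th><th>Description</th></tr>
<tr>
<td>GET</td>
<td><a href="#get-connection">connections/:connection_id</a></td>
<td>Get connection details</td>
</tr>
</table>
"""
endpoints = parse_endpoint_index(html)
print(endpoints)
```

If every body row yields a method, a path, and an anchor, the table is a strong candidate for `index_pattern`; rows with a missing anchor are worth recording as edge cases.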
Key signals:
Links with anchors (#heading-id) pointing to detail sections
Some sites list endpoints in the sidebar. Look for:
When no explicit index exists, the headings themselves form the index:
Once you know where endpoints are listed, understand how detail sections are organized:
For heading-based sections (most common), note:
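For heading-based layouts, a quick survey of the headings at one level shows whether ids are explicit and what the heading text looks like. This is a sketch under assumptions: `HeadingCollector` and `describe_headings` are hypothetical names, and the level to inspect (`h2` here) is something you would pick after looking at the page.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the id and visible text of every heading at one level."""

    def __init__(self, level="h2"):
        super().__init__()
        self.level = level
        self.headings = []    # list of {"id": ..., "text": ...}
        self._current = None  # heading currently being read

    def handle_starttag(self, tag, attrs):
        if tag == self.level:
            self._current = {"id": dict(attrs).get("id"), "text": []}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == self.level and self._current is not None:
            # Join inline fragments and collapse runs of whitespace.
            text = " ".join("".join(self._current["text"]).split())
            self.headings.append({"id": self._current["id"], "text": text})
            self._current = None


def describe_headings(html, level="h2"):
    """Summarize headings at `level` and whether all carry an explicit id."""
    collector = HeadingCollector(level)
    collector.feed(html)
    collector.close()
    with_id = sum(1 for heading in collector.headings if heading["id"])
    return {
        "heading_level": level,
        "count": len(collector.headings),
        "all_have_explicit_ids": with_id == len(collector.headings),
        "headings": collector.headings,
    }


sample = (
    '<h1>Connections</h1>'
    '<h2 id="get-connection">Get connection details</h2><p>Details</p>'
    '<h2>List <code>connections</code></h2>'
)
report = describe_headings(sample)
print(report)
```

A mix of headings with and without ids, as in the sample, hints that ids are generated for some sections and should be noted in the analysis.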
Within each endpoint section, identify where to find:
Parameters are usually in tables with columns: Name, Type, Required, Description
<table>
<tr><th>Name</th><th>Type</th><th>Required</th><th>Description</th></tr>
<tr><td>id</td><td>integer</td><td>yes</td><td>User ID</td></tr>
</table>
Look for section subheadings like "Request Parameters", "Query Parameters", "Body Parameters"
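A parameter table of this shape can be read into dictionaries to confirm that its columns are consistent. The sketch below is one possible approach, not a fixed method: `TableRows`, `normalize_header`, and `parse_parameters` are hypothetical names, and the set of strings treated as "required" is an assumption to adjust per site.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects every <tr> as a list of stripped cell strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def normalize_header(name):
    """Map header variants such as 'Required?' and 'Required' to one key."""
    return name.strip().lower().rstrip("?:").replace(" ", "_")


# Assumption: cell values that mean "this parameter is required".
TRUTHY = {"yes", "true", "required", "y"}


def parse_parameters(table_html):
    """Return one dict per body row, with 'required' converted to a bool."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    if not parser.rows:
        return []
    keys = [normalize_header(header) for header in parser.rows[0]]
    params = []
    for cells in parser.rows[1:]:
        param = dict(zip(keys, cells))
        if "required" in param:
            param["required"] = param["required"].strip().lower() in TRUTHY
        params.append(param)
    return params


table = """
<table>
<tr><th>Name</th><th>Type</th><th>Required?</th><th>Description</th></tr>
<tr><td>id</td><td>integer</td><td>yes</td><td>User ID</td></tr>
</table>
"""
params = parse_parameters(table)
print(params)
```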
Request and response examples usually appear as code blocks with language hints:
<pre><code class="language-json">{"name": "example"}</code></pre>
<pre><code class="language-curl">curl -X GET ...</code></pre>
Or syntax-highlighted divs:
<div class="highlight-json"><pre>...</pre></div>
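Both container styles above can be surveyed in one pass to see which language hints a site actually uses. This is a minimal sketch: `CodeBlockCollector` is a hypothetical name, and the `language-` and `highlight-` class prefixes are taken from the examples here; other sites may use different conventions.

```python
import re
from html.parser import HTMLParser

# Class prefixes seen in the examples above; extend as needed for other sites.
LANG_CLASS = re.compile(r"(?:language|highlight)-([\w+#-]+)")


class CodeBlockCollector(HTMLParser):
    """Collects each <pre> block with a language hint from pre, code, or a parent div."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of {"language": ..., "code": ...}
        self._div_langs = []  # language hint (or None) for each open <div>
        self._lang = None
        self._buffer = None   # text of the <pre> currently being read

    @staticmethod
    def _hint(attrs):
        match = LANG_CLASS.search(dict(attrs).get("class") or "")
        return match.group(1) if match else None

    def handle_starttag(self, tag, attrs):
        hint = self._hint(attrs)
        if tag == "div":
            self._div_langs.append(hint)
        elif tag == "pre":
            # Fall back to the nearest enclosing div that carries a hint.
            inherited = next((lang for lang in reversed(self._div_langs) if lang), None)
            self._lang = hint or inherited
            self._buffer = []
        elif tag == "code" and self._buffer is not None and hint:
            self._lang = hint

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_langs:
            self._div_langs.pop()
        elif tag == "pre" and self._buffer is not None:
            self.blocks.append({"language": self._lang, "code": "".join(self._buffer)})
            self._buffer, self._lang = None, None


sample = (
    '<pre><code class="language-json">{"name": "example"}</code></pre>'
    '<div class="highlight-json"><pre>{"id": 1}</pre></div>'
    '<pre><code>no hint here</code></pre>'
)
collector = CodeBlockCollector()
collector.feed(sample)
collector.close()
blocks = collector.blocks
print(blocks)
```

Blocks that come back with `language` set to `None` are worth noting, since a scraper would need another signal (such as a preceding subheading) to classify them.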
Descriptions are the prose paragraphs between the heading and the first table or code block. They may contain important context about authentication, rate limits, or special behavior.
Real documentation has inconsistencies. Look for:
Document any patterns that differ from the main structure.
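One inconsistency that is easy to detect mechanically is an in-page link whose target id does not exist. The sketch below illustrates that single check; `AnchorAudit` and `find_broken_anchors` are hypothetical names, and it assumes targets are marked with `id` attributes (legacy `name` anchors and percent-encoded fragments are not handled).

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Records every element id and every same-page fragment link."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.fragment_links = []  # fragments without the leading '#'

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        href = attributes.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragment_links.append(href[1:])


def find_broken_anchors(html):
    """Return fragments that are linked to but never defined as an id."""
    audit = AnchorAudit()
    audit.feed(html)
    audit.close()
    return [fragment for fragment in audit.fragment_links if fragment not in audit.ids]


sample = (
    '<a href="#get-connection">Get</a>'
    '<a href="#list-connections">List</a>'
    '<h2 id="get-connection">Get connection details</h2>'
)
broken = find_broken_anchors(sample)
print(broken)
```

Each fragment reported here is a candidate entry for the edge cases you document, since a scraper that follows index links would fail to find that section.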
After analysis, produce a structured document containing:
site:
  name: "Workato API Documentation"
  base_url: "https://docs.workato.com"
  framework: "VuePress"  # or Docusaurus, ReadMe, custom, unknown
index_pattern:
  type: "quick_reference_table"  # or sidebar, headings, list
  location: "top of page"
  columns: ["Type", "Resource", "Description"]
  link_column: "Resource"
  anchor_format: "#heading-id"
section_pattern:
  type: "heading_based"
  heading_level: "h2"
  id_source: "explicit"  # or generated
  text_format: "{description}"  # e.g., "Get connection details"
content_elements:
  parameters:
    type: "table"
    columns: ["Name", "Type", "Required", "Description"]
  request_examples:
    type: "code_block"
    languages: ["curl", "json"]
    container: "pre > code"
  response_examples:
    type: "code_block"
    languages: ["json"]
    container: "div.highlight-json pre"
edge_cases:
  - "First endpoint in some pages has broken anchor link"
  - "Some tables use 'Required?' instead of 'Required'"
This structured output becomes the input for scraper code generation.
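Before handing the analysis to a code generator, a small completeness check can catch missing sections. The sketch below assumes the analysis has already been loaded into a Python dict (for example by a YAML parser, not shown). Which fields count as required is an assumption made for illustration, and `REQUIRED_KEYS` and `missing_fields` are hypothetical names.

```python
# Assumption: the minimum fields a generator would need from each section.
REQUIRED_KEYS = {
    "site": {"name", "base_url", "framework"},
    "index_pattern": {"type"},
    "section_pattern": {"type"},
    "content_elements": set(),
    "edge_cases": set(),
}


def missing_fields(analysis):
    """Return dotted names of required sections or fields that are absent."""
    problems = []
    for section, fields in REQUIRED_KEYS.items():
        if section not in analysis:
            problems.append(section)
            continue
        value = analysis[section]
        present = set(value) if isinstance(value, dict) else set()
        problems.extend(f"{section}.{field}" for field in sorted(fields - present))
    return problems


# A deliberately incomplete analysis: no framework, no edge_cases.
analysis = {
    "site": {"name": "Workato API Documentation", "base_url": "https://docs.workato.com"},
    "index_pattern": {"type": "quick_reference_table"},
    "section_pattern": {"type": "heading_based", "heading_level": "h2"},
    "content_elements": {},
}
problems = missing_fields(analysis)
print(problems)
```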
For detailed information on specific topics:
references/html-patterns.md - Common HTML structures for tables, code blocks, navigation
references/framework-signatures.md - How to identify VuePress, Docusaurus, ReadMe, etc.
references/extraction-strategies.md - Parsing approaches for different patterns
For a worked example:
examples/workato-analysis.md - Complete analysis of Workato API documentation