From antigravity-awesome-skills
Extracts structured data like tables, lists, and prices from web pages via multi-strategy scraping. Supports pagination, browser automation, validation, transforms, and CSV/JSON exports.
npx claudepluginhub absjaded/antigravity-awesome-skills

This skill uses the workspace's default tool permissions.
Execute phases in strict order. Each phase feeds the next.
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.
Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.
Establish extraction parameters before touching any URL.
| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |
| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
When user has a topic but no specific URL:
Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract
Analyze the target page before extraction.
Use WebFetch to retrieve and analyze the page structure:
WebFetch(
url = TARGET_URL,
prompt = "Analyze this page structure and report:
1. Page type: article, product listing, search results, data table,
directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
accordion/collapsible sections, tabs
3. Approximate number of distinct data items visible
4. JavaScript rendering indicators: empty containers, loading spinners,
SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
5. Pagination: next/prev links, page numbers, load-more buttons,
infinite scroll indicators, total results count
6. Data density: how much structured, extractable data exists
7. List the main data fields/columns available for extraction
8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
9. Available download links: CSV, Excel, PDF, API endpoints"
)
| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
Classify into an extraction mode:
| Mode | Indicators | Examples |
|---|---|---|
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.
Choose the extraction approach based on recon results.
Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
|
+-- YES: Need precise element targeting?
| |
| +-- NO --> STRATEGY A: WebFetch + AI extraction
| +-- YES --> STRATEGY B: Browser automation
|
+-- NO: JavaScript rendering detected?
|
+-- YES --> STRATEGY B: Browser automation
+-- NO: API/JSON/XML endpoint or download link?
|
+-- YES --> STRATEGY C: Bash (curl + jq)
+-- NO --> Report access issue to user
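The decision tree above can be sketched as a small routing function. The recon signal names here are illustrative assumptions, not part of the skill's tool API:

```python
def choose_strategy(recon: dict) -> str:
    """Map Phase 2 recon signals to an extraction strategy.

    The keys of `recon` are hypothetical names for the signals
    the decision tree checks; rename to match your recon output.
    """
    if recon.get("has_structured_data"):      # JSON-LD / microdata covers the target
        return "E"                            # extract structured data directly
    if recon.get("content_visible"):          # WebFetch can see the content
        return "B" if recon.get("needs_precise_targeting") else "A"
    if recon.get("js_rendered"):              # empty containers, SPA markers
        return "B"                            # browser automation
    if recon.get("api_or_download"):          # JSON/XML endpoint or file link
        return "C"                            # bash: curl + jq
    return "blocked"                          # report access issue to user
```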
Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
WebFetch(
url = URL,
prompt = "Extract [DATA_TARGET] from this page.
Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
Rules:
- If a value is missing or unclear, use 'N/A'
- Do not include navigation, ads, footers, or unrelated content
- Preserve original values exactly (numbers, currencies, dates)
- Include ALL matching items, not just the first few
- For each item, also extract the URL/link if available"
)
Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.
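A minimal sketch of that escalation check. The 50%-of-expected rule comes from the text above; reading "mostly empty fields" as more than half of all cells empty is an assumption:

```python
def should_escalate(extracted: int, expected: int, empty_ratio: float) -> bool:
    """Decide whether to escalate from WebFetch (A) to Browser (B).

    empty_ratio: fraction of extracted cells that are empty/N-A.
    """
    if expected and extracted < 0.5 * expected:
        return True                       # suspiciously few items vs recon estimate
    return empty_ratio > 0.5              # "mostly empty" cutoff (assumption)
```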
Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
1. tabs_context_mcp(createIfEmpty=true) -> get tabId
2. navigate(url=TARGET_URL, tabId=TAB)
3. computer(action="wait", duration=3, tabId=TAB)
4. find(query="cookie consent or accept button", tabId=TAB)
5. read_page(tabId=TAB) or get_page_text(tabId=TAB)
6. find(query="[DESCRIPTION]", tabId=TAB)
7. javascript_tool with an extraction snippet:

// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
const cells = row.querySelectorAll('td, th');
return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
computer(action="scroll", scroll_direction="down", tabId=TAB)
then computer(action="wait", duration=2, tabId=TAB)

Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.
```shell
# JSON API
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

# CSV download
curl -s "CSV_URL" -o /tmp/scraped_data.csv

# XML parsing
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
```
When a single strategy is insufficient, combine:
When JSON-LD, microdata, or OpenGraph is present:
Use javascript_tool to extract the structured data:

const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
When pagination is detected and user wants multiple pages:
Page-number pagination (any strategy):
- Increment the page parameter in the URL (?page=N, /page/N, &offset=N)

Infinite scroll (Browser only):
- computer(action="scroll", scroll_direction="down", tabId=TAB)
- computer(action="wait", duration=2, tabId=TAB)

"Load More" button (Browser only):
- find(query="load more button", tabId=TAB)
- computer(action="left_click", ref=REF, tabId=TAB)

Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.
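The page-number loop can be sketched as follows; `fetch_page` is a hypothetical stand-in for whichever strategy (WebFetch, browser, curl) performs the actual per-page extraction:

```python
def paginate(base_url: str, fetch_page, max_pages: int = 1) -> list:
    """Collect records across page-number pagination (?page=N).

    Stops at max_pages (the Phase 1 limit) or at the first empty page,
    whichever comes first.
    """
    records = []
    for n in range(1, max_pages + 1):
        page = fetch_page(f"{base_url}?page={n}")
        if not page:
            break                      # walked past the last page
        records.extend(page)
    return records
```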
WebFetch prompt:
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."
WebFetch prompt:
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."
WebFetch prompt:
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."
WebFetch prompt:
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
availability, description (first 200 chars), rating, reviewCount,
specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
Also check for JSON-LD Product schema (Strategy E) first.
WebFetch prompt:
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."
WebFetch prompt:
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."
WebFetch prompt:
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."
WebFetch prompt:
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."
WebFetch prompt:
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."
When user provides specific selectors or field descriptions:
- Use javascript_tool with the user's CSS selectors

When extracting from multiple URLs:
- Add a source column/field to every record with the origin URL

Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.
| Transform | Action |
|---|---|
| Whitespace cleanup | Trim, collapse multiple spaces, remove \n in cells |
| HTML entity decode | `&amp;` -> &, `&lt;` -> <, `&#39;` -> ' |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | "" -> null (for JSON), "" -> N/A (for tables) |
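The core transforms above, as one sketch using Python's standard library:

```python
import html
import re
import unicodedata

def clean_cell(value: str):
    """Apply the core cleanup transforms: entities, unicode, whitespace, empties."""
    value = html.unescape(value)                   # &amp; -> &, &lt; -> <
    value = unicodedata.normalize("NFKC", value)   # consistent character forms
    value = re.sub(r"\s+", " ", value).strip()     # collapse spaces, drop \n
    return value or None                           # "" -> null for JSON output
```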
| Transform | When | Action |
|---|---|---|
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
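Hedged sketches of three of the conditional transforms; real pages need more date formats and currency symbols than shown here:

```python
import json
import re
from datetime import datetime

def normalize_price(raw: str):
    """Split "$1,299.00" into (1299.0, "$"). Symbol set is an assumption."""
    symbol = next((s for s in "$€£¥" if s in raw), None)
    digits = re.sub(r"[^\d.]", "", raw.replace(",", ""))
    return (float(digits) if digits else None, symbol)

def normalize_date(raw: str, fmt: str = "%d %b %Y") -> str:
    """Normalize one known input format to ISO-8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, fmt).date().isoformat()

def dedupe(rows: list) -> list:
    """Drop exact duplicate rows while preserving source order."""
    seen, out = set(), []
    for row in rows:
        key = json.dumps(row, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```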
| Enrichment | When | Action |
|---|---|---|
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
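Two of the enrichment steps, sketched with the standard library:

```python
from datetime import date
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Derive a domain column from a full URL."""
    return urlparse(url).netloc

def days_ago(iso_date: str, today: date) -> str:
    """Render an ISO-8601 date as an "X days ago" column."""
    delta = (today - date.fromisoformat(iso_date)).days
    return f"{delta} days ago"
```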
When combining data from multiple pages or URLs:
Verify extraction quality before delivering results.
| Check | Action |
|---|---|
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
Assign to every extraction:
| Rating | Criteria |
|---|---|
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |
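One way to mechanize the rating table; reading "count slightly differs" as within 10% of the expected count is an assumption, not part of the skill:

```python
def rate_confidence(extracted: int, expected: int, empty_ratio: float) -> str:
    """Apply the HIGH/MEDIUM/LOW criteria from the table above.

    empty_ratio: fraction of all cells that are empty/N-A.
    expected == 0 means recon produced no count estimate.
    """
    count_ok = expected == 0 or extracted == expected
    count_close = expected == 0 or abs(extracted - expected) <= 0.1 * expected
    if count_ok and empty_ratio == 0:
        return "HIGH"
    if count_close and empty_ratio < 0.10:
        return "MEDIUM"
    return "LOW"
```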
Always report confidence with specifics:
Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
| Issue | Auto-Recovery Action |
|---|---|
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |
Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.
Structure results according to user preference. See references/output-templates.md for complete formatting templates.
ALWAYS wrap results with this metadata header:
## Extraction Results
**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV
---
[DATA HERE]
---
**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
Markdown table rules:
- Left-align text columns (:---), right-align numbers (---:)
- Truncate long cell values with a ... indicator
- Use N/A for missing values, never leave cells empty
- Use camelCase field names (productName, unitPrice)

JSON template:

{
"metadata": {
"source": "URL",
"title": "Page Title",
"extractedAt": "ISO-8601",
"itemCount": 47,
"fieldCount": 6,
"confidence": "HIGH",
"strategy": "A",
"transforms": ["deduplication", "priceNormalization"],
"notes": []
},
"data": [ ... ]
}
CSV rules:
- Use `,` as delimiter (standard)
- Include a comment header line: `# Source: URL`

When user requests file save:
- Markdown table -> .md extension
- JSON -> .json extension
- CSV -> .csv extension

When comparing data across multiple sources:
- Add Source as the first column/field

When user requests change detection (diff mode):
- Mark added items [NEW], removed items [REMOVED], changed values [WAS: old_value]

When extraction fails or is blocked:
- Report the specific blocker (login wall, CAPTCHA, 403/401) to the user instead of returning partial or guessed data
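Diff mode can be sketched as follows, assuming records share a unique key field (the `key` parameter is a hypothetical choice the user supplies):

```python
def diff_records(previous: list, current: list, key: str) -> list:
    """Return (marker, record) pairs using [NEW], [REMOVED], [WAS: old_value].

    Unchanged and changed records get an empty marker; changed fields
    carry the old value inline.
    """
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    out = []
    for k, row in curr.items():
        if k not in prev:
            out.append(("[NEW]", row))
        else:
            changed = {f: f"{v} [WAS: {prev[k][f]}]"
                       for f, v in row.items() if prev[k].get(f) != v}
            out.append(("", {**row, **changed}))
    for k, row in prev.items():
        if k not in curr:
            out.append(("[REMOVED]", row))
    return out
```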
| User Says... | Mode | Strategy | Output Default |
|---|---|---|---|
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |
Extraction patterns: references/extraction-patterns.md CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.
Output templates: references/output-templates.md Markdown, JSON, CSV templates with complete examples.
Data transforms: references/data-transforms.md Cleaning, normalization, deduplication, enrichment patterns.