Scrape data from websites by inspecting and calling their frontend APIs. Use when asked to "scrape", "fetch data from", "extract data from", "get all X from" a website URL. Automatically discovers API endpoints, fetches data, and outputs JSON or CSV.
Scrapes website data by reverse-engineering frontend API calls and generating Python fetchers.
```
/plugin marketplace add syftdata/gtm-toolkit
/plugin install syftdata-gtm-toolkit@syftdata/gtm-toolkit
```

This skill is limited to using the following tools:
- scripts/fetch_api_data.py

Scrape data from websites by reverse-engineering their frontend API calls.
Requires: Chrome DevTools MCP (mcp__chrome-devtools__*)
Open chrome://inspect/#remote-debugging and enable remote debugging, then add the MCP server config:

```json
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest", "--autoConnect"]
    }
  }
}
```
Note: this uses your existing browser session, so you stay logged in to all your sites.
**Open the target page**

1. Call mcp__chrome-devtools__new_page with the target URL
2. Wait for the page to load (requests are captured automatically)
**Identify API calls**

1. Call mcp__chrome-devtools__list_network_requests with resourceTypes: ["fetch", "xhr"]
   - This filters to API calls only, excluding static assets
2. Look for data API calls by URL pattern:
   - /api/, /v1/, /v2/, /graphql
   - algolia.net, search, query
   - POST requests returning JSON
3. Note the reqid of interesting requests
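Once the request list is in hand, the URL patterns above can be applied programmatically to shortlist candidates. A minimal sketch — the pattern list mirrors the bullets above and the sample URLs are illustrative, not taken from any particular site:

```python
import re

# Heuristic URL patterns (from the list above) suggesting a data API call;
# illustrative, not exhaustive -- adjust per site.
API_PATTERNS = [r"/api/", r"/v\d+/", r"/graphql", r"algolia\.net", r"search", r"query"]

def looks_like_data_api(url: str) -> bool:
    """Return True if the URL matches any known data-API pattern."""
    return any(re.search(p, url) for p in API_PATTERNS)

# Hypothetical captured request URLs
urls = [
    "https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries",
    "https://example.com/static/app.css",
    "https://example.com/api/v2/companies",
]
print([u for u in urls if looks_like_data_api(u)])
```

In practice you would run this filter over the URLs returned by list_network_requests rather than a hardcoded list.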
**Inspect the request**

For each interesting request, call mcp__chrome-devtools__get_network_request with the reqid.
This returns the full details. Extract:

- the request URL and HTTP method
- headers (including any auth or API-key headers)
- the exact request body/payload format
- the response structure

**Generate a fetcher script**

Create a Python script using the exact request format discovered:
```python
#!/usr/bin/env python3
import json

import requests

# API configuration extracted from network inspection
API_URL = "extracted_url"
HEADERS = {
    "Content-Type": "application/json",
    # Add auth headers exactly as seen in the request
}

def fetch_data():
    # Use the exact payload format from the captured request body
    payload = {"requests": [...]}
    response = requests.post(API_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = fetch_data()
    print(json.dumps(data, indent=2))
```
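The skill advertises JSON or CSV output, while the template above only prints JSON. A small helper can render a list of flat result dicts as CSV text; this is a sketch, and the record field names (`name`, `batch`) are hypothetical:

```python
import csv
import io

def records_to_csv(records: list[dict]) -> str:
    """Render flat dicts as CSV text; the first record's keys become the header."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Hypothetical flattened hits from a fetch
hits = [
    {"name": "Acme", "batch": "W21"},
    {"name": "Globex", "batch": "S22"},
]
print(records_to_csv(hits))
```

Nested API responses would need to be flattened to one level before being passed to this helper.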
**Save the output**

Save the fetched data as {descriptive_name}.json.

**Available MCP tools**

| Tool | Purpose |
|---|---|
| list_pages | See open browser pages |
| new_page | Open a URL in a new page |
| select_page | Switch to a page |
| navigate_page | Navigate the current page |
| list_network_requests | List captured requests (filter by type) |
| get_network_request | Get full request/response details |
| evaluate_script | Run JavaScript in the page |
| take_snapshot | Get a DOM snapshot |
Use the resourceTypes parameter to filter:

```
["fetch", "xhr"]   # API calls only (recommended)
["document"]       # HTML pages
["script"]         # JavaScript files
["stylesheet"]     # CSS files
```
**Example walkthrough: YC companies**

1. mcp__chrome-devtools__new_page
   - url: "https://www.ycombinator.com/companies"
2. mcp__chrome-devtools__list_network_requests
   - resourceTypes: ["fetch", "xhr"]
   - Result: reqid=229 POST https://45bwzj1sgc-dsn.algolia.net/...
3. mcp__chrome-devtools__get_network_request
   - reqid: 229
   - Request Body: {"requests":[{"indexName":"YCCompany_production",...}]}
   - Response Body: {"results":[{"nbHits":5611,"hits":[...],...}]}
4. Generate a Python script with the exact API format
5. Output: yc_companies.json
**Pattern: Algolia search API**

```python
ALGOLIA_URL = "https://{app_id}-dsn.algolia.net/1/indexes/*/queries"
headers = {
    "Content-Type": "application/json",
}
# API key often in URL params: x-algolia-api-key=...
payload = {
    "requests": [{
        "indexName": "YCCompany_production",
        "params": "query=&hitsPerPage=1000"
    }]
}
```
**Pattern: page-based pagination**

```python
all_items = []
page = 0
while True:
    response = requests.get(f"{API_URL}?page={page}&limit=100")
    data = response.json()
    if not data["items"]:
        break
    all_items.extend(data["items"])
    page += 1
```
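The loop above is tied to one specific `requests.get` call. The same pattern can be written with the page fetcher injected, which makes it reusable and testable without a network; this sketch assumes the API wraps results in an "items" list as above:

```python
from typing import Callable

def fetch_all_pages(fetch_page: Callable[[int], dict]) -> list:
    """Collect "items" across pages until an empty page is returned.

    fetch_page(page) should return parsed JSON, e.g.:
    lambda page: requests.get(f"{API_URL}?page={page}&limit=100").json()
    """
    all_items = []
    page = 0
    while True:
        data = fetch_page(page)
        if not data["items"]:
            break
        all_items.extend(data["items"])
        page += 1
    return all_items

# Simulated API: two pages of data, then an empty page
pages = [{"items": [1, 2]}, {"items": [3]}, {"items": []}]
print(fetch_all_pages(lambda p: pages[p]))
```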
**Pattern: GraphQL cursor pagination**

```python
query = """
query GetItems($first: Int, $after: String) {
  items(first: $first, after: $after) {
    edges { node { id, name } }
    pageInfo { hasNextPage, endCursor }
  }
}
"""
response = requests.post(API_URL, json={"query": query, "variables": {...}})
```
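To page through the whole connection, the `endCursor` from each response feeds the next request's `after` variable. A sketch with the transport injected so it can run without a server; the Relay-style `items` connection shape is assumed from the query above:

```python
def fetch_all_graphql(execute, query: str, page_size: int = 100) -> list:
    """Walk a Relay-style connection by following pageInfo cursors.

    execute(query, variables) should return the response's `data` dict, e.g.:
    lambda q, v: requests.post(API_URL, json={"query": q, "variables": v}).json()["data"]
    """
    nodes, after = [], None
    while True:
        conn = execute(query, {"first": page_size, "after": after})["items"]
        nodes.extend(edge["node"] for edge in conn["edges"])
        if not conn["pageInfo"]["hasNextPage"]:
            return nodes
        after = conn["pageInfo"]["endCursor"]

# Simulated server returning two pages
responses = iter([
    {"items": {"edges": [{"node": {"id": 1}}],
               "pageInfo": {"hasNextPage": True, "endCursor": "c1"}}},
    {"items": {"edges": [{"node": {"id": 2}}],
               "pageInfo": {"hasNextPage": False, "endCursor": None}}},
])
print(fetch_all_graphql(lambda q, v: next(responses), "query { ... }"))
```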
**Fallback: no API found**

If no API is found (server-rendered pages like WordPress):

- mcp__chrome-devtools__evaluate_script to extract data from the DOM
- mcp__chrome-devtools__click to interact with the page
- mcp__chrome-devtools__take_snapshot to get the page structure

```javascript
// Example: extract data from the DOM
document.querySelectorAll('.card').forEach(card => {
  const name = card.querySelector('h3')?.textContent;
  const url = card.querySelector('a')?.href;
  // ...
});
```
["fetch", "xhr"] to see only API callsUse when working with Payload CMS projects (payload.config.ts, collections, fields, hooks, access control, Payload API). Use when debugging validation errors, security issues, relationship queries, transactions, or hook behavior.