This skill should be used when agents need to extract data from websites, APIs, or individual web pages. Use it when scraping content, fetching structured data, or processing web responses. Key principle: always use programmatic extraction with uv scripts instead of reading thousands of lines into context.
```
/plugin marketplace add ClementWalter/rookie-marketplace
/plugin install chief-of-staff@rookie-marketplace
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Extract data from websites efficiently using uv scripts. Never bloat context by reading raw HTML or JSON into the conversation.
| Approach | Tokens | Quality | Speed |
|---|---|---|---|
| Read raw HTML into context | 50,000+ | Poor | Slow |
| uv script + structured output | ~200 | Excellent | Fast |
Before writing code, specify exactly what you need:
**Target**: LinkedIn profile data
**Fields needed**: name, headline, recent posts (last 5)
**Output format**: JSON
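If it helps, the field list can be pinned down as a TypedDict so the script's output contract is explicit. A minimal sketch for the spec above (the `ProfileData` name and field types are illustrative, not part of the skill):

```python
from typing import TypedDict


class ProfileData(TypedDict):
    """Illustrative output contract matching the spec above."""
    name: str | None
    headline: str | None
    recent_posts: list[str]  # last 5 posts, newest first
```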
Create a single-file Python script using uv inline metadata:
```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests", "beautifulsoup4", "lxml"]
# ///
"""
Extract specific data from [target].
Output: Structured JSON to stdout.
"""
import json
import sys

import requests
from bs4 import BeautifulSoup


def extract_data(url: str) -> dict:
    """Extract target fields from URL."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; research)"},
        timeout=30,  # never hang forever on a slow site
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    # Extract ONLY the fields you need
    title = soup.find("h1")
    description = soup.find("meta", {"name": "description"})
    return {
        "title": title.get_text(strip=True) if title else None,
        "description": description.get("content") if description else None,
        # Add only needed fields
    }


if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else None
    if not url:
        print("Usage: script.py <url>", file=sys.stderr)
        sys.exit(1)
    result = extract_data(url)
    print(json.dumps(result, indent=2, ensure_ascii=False))
```
Run the script and redirect its output to a file:

```bash
uv run script.py "https://example.com/page" > output.json
```

Then read only the small JSON output (~200 tokens) instead of the full page (~50,000 tokens).
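For the metadata extractor above, output.json would look something like this (values are illustrative):

```json
{
  "title": "Example Domain",
  "description": "An illustrative page description."
}
```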
The same pattern works for APIs: fetch the full response, then return only the fields the task needs:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests"]
# ///
import json
import os
import sys

import requests


def fetch_from_api(endpoint: str) -> dict:
    token = os.environ.get("API_TOKEN")  # assumes the token is set in the environment
    response = requests.get(
        endpoint,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    # Extract only needed fields from potentially large response
    data = response.json()
    return {
        "items": [
            {"id": item["id"], "title": item["title"]}
            for item in data.get("items", [])[:10]  # Limit
        ],
        "total": data.get("total_count", 0),
    }


if __name__ == "__main__":
    print(json.dumps(fetch_from_api(sys.argv[1]), indent=2))
```
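Assuming the script above is saved as `fetch_api.py` (the filename and endpoint are illustrative), run it with the token in the environment:

```bash
API_TOKEN="..." uv run fetch_api.py "https://api.example.com/v1/items" > items.json
```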
For paginated listings, cap the total and stop as soon as a page fails or comes back empty:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests", "beautifulsoup4"]
# ///
import json
import sys

import requests
from bs4 import BeautifulSoup


def scrape_list_page(url: str) -> list[dict]:
    """Scrape paginated list, return structured data."""
    results = []
    page = 1
    while len(results) < 50:  # Hard limit
        response = requests.get(f"{url}?page={page}", timeout=30)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select(".item-class")  # Adjust selector
        if not items:
            break
        for item in items:
            title = item.select_one(".title")
            link = item.select_one("a")
            results.append({
                "title": title.get_text(strip=True) if title else None,
                "link": link["href"] if link else None,
            })
        page += 1
    return results


if __name__ == "__main__":
    print(json.dumps(scrape_list_page(sys.argv[1]), indent=2))
```
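When a loop like this walks many pages, pausing between requests keeps the scraper polite and avoids rate-limit bans. A minimal helper sketch (the one-second default is an assumption; tune it per site):

```python
import time

import requests


def polite_get(url: str, delay: float = 1.0) -> requests.Response:
    """GET with a fixed pause so pagination loops don't hammer the server."""
    response = requests.get(url, timeout=30)
    time.sleep(delay)  # assumed default; adjust to the target site's limits
    return response
```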
For JavaScript-rendered content:
```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["playwright"]
# ///
import json
import sys

from playwright.sync_api import sync_playwright


def extract_dynamic_content(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(".content-loaded")
        # Extract only what you need
        result = {
            "title": page.title(),
            # text_content() can return None, so fall back to ""
            "content": (page.locator(".main-content").text_content() or "")[:1000],
        }
        browser.close()
        return result


if __name__ == "__main__":
    print(json.dumps(extract_dynamic_content(sys.argv[1]), indent=2))
```
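Note that Playwright also needs a browser binary the first time it runs; if the launch fails, install one with Playwright's own CLI (shown here through uv's `--with` flag; plain `playwright install chromium` works in any environment where the package is installed):

```bash
uv run --with playwright playwright install chromium
```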
Cap result sizes in every script (`[:10]` slices, < 50 items).

Reusable scraping scripts are in this skill's scripts directory:
```
chief-of-staff/skills/efficient-scraping/scripts/
└── web-extract.py   # Generic web content extractor
```
```bash
# Basic metadata extraction
uv run scripts/web-extract.py "https://example.com"

# Extract specific elements
uv run scripts/web-extract.py "https://example.com" --selector ".article" --fields "text,href"
```
Output is always minimal JSON: no raw HTML in context.
Always wrap scraping in try/except and return structured errors:

```python
try:
    result = extract_data(url)
    print(json.dumps(result, indent=2))
except requests.exceptions.HTTPError as e:
    print(json.dumps({"error": f"HTTP {e.response.status_code}", "url": url}))
    sys.exit(1)
except Exception as e:
    print(json.dumps({"error": str(e), "type": type(e).__name__}))
    sys.exit(1)
```
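Since both success and failure come back as JSON on stdout, the calling side can branch on a single key. A sketch of consuming either outcome (the script name and URL are illustrative):

```python
import json
import subprocess

proc = subprocess.run(
    ["uv", "run", "script.py", "https://example.com/page"],  # illustrative invocation
    capture_output=True,
    text=True,
)
data = json.loads(proc.stdout)
if "error" in data:
    print(f"Extraction failed: {data['error']}")
else:
    print(f"Got fields: {sorted(data)}")
```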
| Scenario | Without Script | With Script |
|---|---|---|
| LinkedIn profile page | ~45,000 tokens | ~150 tokens |
| Twitter feed (20 tweets) | ~30,000 tokens | ~800 tokens |
| API response (100 items) | ~20,000 tokens | ~500 tokens |
10-100x token reduction = faster, cheaper, better context for reasoning.
## Scraping Checklist
- [ ] Define exactly what fields are needed
- [ ] Write uv script with inline dependencies
- [ ] Extract ONLY needed fields
- [ ] Output structured JSON
- [ ] Run script, read JSON output (not raw HTML)
- [ ] Handle errors gracefully