Fetches web pages as clean Markdown using local trafilatura, with Exa MCP fallback for JS-rendered or anti-bot sites. Use for reading, scraping, summarizing, or quoting URLs.
Fetch any web URL and get clean, readable Markdown — main content only, no navigation/footer/ads. Local + free by default; smart fallback to Exa MCP when the page can't be extracted locally.
Try trafilatura first:

```
python3 ~/.claude/skills/fetch-url-as-markdown/scripts/fetch_url.py "<URL>"
```
If exit code is 1 or 2 → fall back to Exa MCP with the same URL:
```
mcp__exa__web_search_advanced_exa(
    query="<URL>",
    includeDomains=["<host of URL>"],
    numResults=1,
    textMaxCharacters=50000,
    type="auto"
)
```
(mcp__exa__crawling works too if the server exposes it; the web_search_advanced_exa
call above is the always-available variant — pin the host with includeDomains and
use the URL itself as the query.)
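The host for includeDomains can be derived from the URL itself. A minimal sketch (the `exa_args` helper is hypothetical, for illustration only — it just builds the argument set shown above):

```python
from urllib.parse import urlsplit

def exa_args(url: str) -> dict:
    """Build the fallback call's arguments: the URL itself as the
    query, pinned to its own host via includeDomains."""
    host = urlsplit(url).hostname
    return {
        "query": url,
        "includeDomains": [host],
        "numResults": 1,
        "textMaxCharacters": 50000,
        "type": "auto",
    }
```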
Exit code 3 means trafilatura is not installed — install once:

```
python3 -m pip install --break-system-packages trafilatura
```
| Code | Meaning | Action |
|---|---|---|
| 0 | Markdown printed to stdout | done |
| 1 | DownloadError — network/HTTP/timeout/anti-bot block at fetch | fall back to Exa |
| 2 | ExtractionError — empty extract, JS/Cloudflare wall, or stub body (<200 chars) | fall back to Exa |
| 3 | trafilatura missing | install (see above), then retry |
| 4 | UnsupportedContentTypeError — URL is binary (PDF, image, archive) | don't fall back to Exa; use the right specialized skill (e.g. pdf for PDFs) |
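The dispatch implied by the table can be sketched as a small helper (`route()` is hypothetical, for illustration only; it mirrors the exit codes above):

```python
def route(exit_code: int) -> str:
    """Map fetch_url.py exit codes to the next action per the table."""
    if exit_code == 0:
        return "done"                 # Markdown already on stdout
    if exit_code in (1, 2):
        return "exa_fallback"         # download or extraction failure
    if exit_code == 3:
        return "install_trafilatura"  # pip install once, then retry
    if exit_code == 4:
        return "specialized_skill"    # binary content: PDF, image, archive
    return "unknown"
```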
Extraction defaults (all live in scripts/settings.cfg):

- output_format="markdown", include_formatting=True — keeps headings/lists/code structure where the source HTML uses real <h1..h6> etc.
- include_links=True, include_tables=True
- with_metadata=True → emits a YAML frontmatter (title, author, date, url, hostname)
- favor_recall=True, deduplicate=True — readable but trims duplicates
- Content-Type other than text/html|application/xhtml+xml|text/plain|application/xml|text/xml → exit 4
- Stub body below the minimum (<200 chars by default; --min-body N, 0 to disable) → exit 2

Flags:

```
... fetch_url.py "<URL>" --no-links     # strip hyperlinks
... fetch_url.py "<URL>" --no-tables    # strip tables
... fetch_url.py "<URL>" --no-metadata  # omit YAML header
... fetch_url.py "<URL>" --comments     # include user comments (off by default — usually noise)
... fetch_url.py "<URL>" --images       # include image refs (experimental)
... fetch_url.py "<URL>" --precision    # terser output, drops borderline content
```
| Situation | Tool |
|---|---|
| Article, blog post, docs, README, wiki | trafilatura (default) — local, free |
| JS-heavy SPA, login-walled, Cloudflare | Exa fallback (the script will signal exit 2) |
| Bulk / many URLs | trafilatura — no quota, no API key |
| Already failed twice on a domain | Exa directly |