Fetches web pages as clean Markdown using local trafilatura, with Exa MCP fallback for JS-rendered or anti-bot sites. Use for reading, scraping, summarizing, or quoting URLs.
Fetch any web URL and get clean, readable Markdown — main content only, no navigation/footer/ads. Local + free by default; smart fallback to Exa MCP when the page can't be extracted locally.
Try trafilatura first:

```
python3 ~/.claude/skills/fetch-url-as-markdown/scripts/fetch_url.py "<URL>"
```
If exit code is 1 or 2 → fall back to Exa MCP with the same URL:
```
mcp__exa__web_search_advanced_exa(
    query="<URL>",
    includeDomains=["<host of URL>"],
    numResults=1,
    textMaxCharacters=50000,
    type="auto"
)
```
(mcp__exa__crawling works too if the server exposes it; the web_search_advanced_exa
call above is the always-available variant — pin the host with includeDomains and
use the URL itself as the query.)
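The host for includeDomains can be derived from the URL itself. A minimal sketch (the `exa_args` helper is hypothetical, for illustration only — it just builds the argument set shown above):

```python
from urllib.parse import urlsplit

def exa_args(url: str) -> dict:
    """Build the fallback call's arguments: the URL itself as the
    query, pinned to its own host via includeDomains."""
    host = urlsplit(url).hostname
    return {
        "query": url,
        "includeDomains": [host],
        "numResults": 1,
        "textMaxCharacters": 50000,
        "type": "auto",
    }
```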
Exit code 3 means trafilatura is not installed — install once:

```
python3 -m pip install --break-system-packages trafilatura
```
| Code | Meaning | Action |
|---|---|---|
| 0 | Markdown printed to stdout | done |
| 1 | DownloadError — network/HTTP/timeout/anti-bot block at fetch | fall back to Exa |
| 2 | ExtractionError — empty extract, JS/Cloudflare wall, or stub body (<200 chars) | fall back to Exa |
| 3 | trafilatura missing | install (see above), then retry |
| 4 | UnsupportedContentTypeError — URL is binary (PDF, image, archive) | don't fall back to Exa; use the right specialized skill (e.g. pdf for PDFs) |
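The dispatch implied by the table can be sketched as a small helper (`route()` is hypothetical, for illustration only; it mirrors the exit codes above):

```python
def route(exit_code: int) -> str:
    """Map fetch_url.py exit codes to the next action per the table."""
    if exit_code == 0:
        return "done"                 # Markdown already on stdout
    if exit_code in (1, 2):
        return "exa_fallback"         # download or extraction failure
    if exit_code == 3:
        return "install_trafilatura"  # pip install once, then retry
    if exit_code == 4:
        return "specialized_skill"    # binary content: PDF, image, archive
    return "unknown"
```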
Extraction defaults (all live in scripts/settings.cfg):

- output_format="markdown", include_formatting=True — keeps headings/lists/code structure where the source HTML uses real <h1..h6> etc.
- include_links=True, include_tables=True
- with_metadata=True → emits a YAML frontmatter (title, author, date, url, hostname)
- favor_recall=True, deduplicate=True — readable but trims duplicates
- Content-Type other than text/html|application/xhtml+xml|text/plain|application/xml|text/xml → exit 4
- Stub body below the minimum (<200 chars by default; --min-body N, 0 to disable) → exit 2

Flags:

```
... fetch_url.py "<URL>" --no-links     # strip hyperlinks
... fetch_url.py "<URL>" --no-tables    # strip tables
... fetch_url.py "<URL>" --no-metadata  # omit YAML header
... fetch_url.py "<URL>" --comments     # include user comments (off by default — usually noise)
... fetch_url.py "<URL>" --images       # include image refs (experimental)
... fetch_url.py "<URL>" --precision    # terser output, drops borderline content
```
| Situation | Tool |
|---|---|
| Article, blog post, docs, README, wiki | trafilatura (default) — local, free |
| JS-heavy SPA, login-walled, Cloudflare | Exa fallback (the script will signal exit 2) |
| Bulk / many URLs | trafilatura — no quota, no API key |
| Already failed twice on a domain | Exa directly |