Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base.
/plugin marketplace add Uzay-G/whorl
/plugin install uzay-g-whorl@Uzay-G/whorl
Crawl websites and ingest content into your whorl knowledge base.
Install trafilatura if not already available:
pip install trafilatura
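Markdown output is a relatively recent addition to trafilatura (around version 1.9; treat the exact cutoff as an assumption and check the changelog), so it is worth confirming what is installed:

import trafilatura

# Markdown output needs a recent release; print the installed version
print(trafilatura.__version__)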
Extract a single page and save to whorl docs:
# Extract content as markdown
trafilatura -u "https://example.com/page" --markdown > ~/.whorl/docs/page-name.md
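The same extraction works from Python; a minimal sketch using trafilatura's fetch_url and extract functions:

import trafilatura

# Download the page, then pull out the main content as markdown
downloaded = trafilatura.fetch_url("https://example.com/page")
if downloaded:
    print(trafilatura.extract(downloaded, output_format="markdown"))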
Or with metadata in frontmatter:
URL="https://example.com/page"
SLUG=$(echo "$URL" | sed -E 's|https?://||; s|/|_|g; s|_$||')
OUTPUT=~/.whorl/docs/"$SLUG".md
# Fetch and extract
CONTENT=$(trafilatura -u "$URL" --markdown)
TITLE=$(trafilatura -u "$URL" --json | python3 -c "import sys,json; print(json.load(sys.stdin).get('title','Untitled'))" 2>/dev/null || echo "Untitled")
# Write with frontmatter
cat > "$OUTPUT" << EOF
---
title: "$TITLE"
source_url: $URL
fetched_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
---
$CONTENT
EOF
echo "Saved to $OUTPUT"
Crawl up to 30 pages from a site:
trafilatura --crawl "https://example.com" --markdown -o ~/.whorl/docs/site-name/
Or with a sitemap:
trafilatura --sitemap "https://example.com/sitemap.xml" --markdown -o ~/.whorl/docs/site-name/
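To preview what a sitemap run would ingest before writing any files, the URL list can be collected on its own; a sketch using trafilatura's sitemap_search helper:

from trafilatura.sitemaps import sitemap_search

# Gather page URLs from the site's sitemaps without downloading the pages
urls = sitemap_search("https://example.com")
print(f"{len(urls)} pages discovered")
for url in urls[:10]:
    print(url)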
For more control, use Python:
from pathlib import Path
from datetime import datetime, timezone

import trafilatura
from trafilatura.spider import focused_crawler

WHORL_DOCS = Path.home() / ".whorl" / "docs"
site_dir = WHORL_DOCS / "my-site"
site_dir.mkdir(parents=True, exist_ok=True)

# focused_crawler discovers URLs (here visiting up to 50 pages) but does
# not return page content; it returns the to-visit list and the known URL set
to_visit, known_urls = focused_crawler("https://example.com", max_seen_urls=50)

for url in sorted(known_urls):
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        continue
    content = trafilatura.extract(downloaded, output_format="markdown")
    if not content:
        continue
    metadata = trafilatura.extract_metadata(downloaded)
    # Generate a filename from the URL
    slug = url.split("//")[-1].replace("/", "_").rstrip("_")[:80]
    filepath = site_dir / f"{slug}.md"
    # Write with frontmatter
    title = metadata.title if metadata and metadata.title else "Untitled"
    frontmatter = f"""---
title: "{title}"
source_url: {url}
fetched_at: {datetime.now(timezone.utc).isoformat()}
---

"""
    filepath.write_text(frontmatter + content)
    print(f"+ {filepath.name}")
Run whorl sync to process new documents with ingestion agents:
whorl sync
Or if running locally without auth:
curl -X POST http://localhost:8000/api/sync
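The same endpoint can also be triggered from Python when scripting the whole pipeline; a stdlib-only sketch, assuming the local no-auth setup above:

import urllib.request

# POST to the local whorl sync endpoint (no auth, as above)
req = urllib.request.Request("http://localhost:8000/api/sync", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())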
Use max_seen_urls to limit crawl scope, or target specific sitemaps.