Fetch web URLs into raw/ using Firecrawl CLI/REST or stdlib fallback. Invoked by wiki-research for general web pages.
```shell
npx claudepluginhub skinnnyjay/wiki-llm --plugin llm-wiki
```

This skill uses the workspace's default tool permissions.
Fetches one or more URLs into `raw/` using the best available adapter. Use this sub-skill when the input is a direct URL (not arXiv, not social, not a feed).
For single-URL fetch without wiki merge, use wiki-fetch instead.
Compliance: access-sources-disclaimer.md
Priority 0 — Read the config first:
```shell
llm-wiki integrations status
```
Open llm-wiki/config.json → integrations. Only consider adapters where enabled: true AND their api_key_env is present in the environment. This is the authoritative source — skip any adapter the user has disabled, regardless of whether the key exists.
```jsonc
// llm-wiki/config.json (reference — do not edit here)
"integrations": {
  "firecrawl":  { "enabled": true, "api_key_env": "FIRECRAWL_API_KEY" },
  "brave":      { "enabled": true, "api_key_env": "BRAVE_SEARCH_API_KEY" },
  "perplexity": { "enabled": true, "api_key_env": "PERPLEXITY_API_KEY" }
}
```
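The Priority 0 config check can be sketched as follows. This is a minimal sketch: the config path and key names match the snippet above, but the helper name is illustrative.

```python
import json
import os

def enabled_adapters(config_path="llm-wiki/config.json"):
    """Return adapter names that are enabled AND have their API key in the env."""
    with open(config_path) as f:
        integrations = json.load(f)["integrations"]
    ready = []
    for name, cfg in integrations.items():
        # Disabled in config wins, even if the key exists in the environment.
        if cfg.get("enabled") and os.environ.get(cfg.get("api_key_env", "")):
            ready.append(name)
    return ready
```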
Then use the first ready and enabled adapter from this order:
| Priority | Adapter | Config key | Condition |
|---|---|---|---|
| 0 | config check | — | Read llm-wiki/config.json integrations; build enabled list |
| 1 | Brave Search | integrations.brave | enabled: true + BRAVE_SEARCH_API_KEY set — best for extractable web content |
| 2 | Firecrawl CLI | integrations.firecrawl | enabled: true + which firecrawl succeeds — cleanest markdown, JS-rendered pages |
| 3 | Firecrawl REST | integrations.firecrawl | enabled: true + FIRECRAWL_API_KEY set |
| 4 | stdlib url | — | Always available (no JS rendering) |
If the page requires JavaScript rendering and only stdlib is available, tell the user and offer to configure Firecrawl or Brave (llm-wiki integrations wizard).
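The priority order above can be sketched as a dispatch function. Function and return names are illustrative; `shutil.which("firecrawl")` stands in for the `which firecrawl` check.

```python
import os
import shutil

def pick_adapter(enabled):
    """Pick the first ready adapter, following the priority table.
    `enabled` is the set of adapter names enabled in config.json."""
    if "brave" in enabled and os.environ.get("BRAVE_SEARCH_API_KEY"):
        return "brave"
    if "firecrawl" in enabled and shutil.which("firecrawl"):
        return "firecrawl"       # Firecrawl CLI
    if "firecrawl" in enabled and os.environ.get("FIRECRAWL_API_KEY"):
        return "firecrawl_rest"  # Firecrawl REST
    return "stdlib_url"          # always available, no JS rendering
```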
For each URL, derive a slug:
```
https://news.ycombinator.com/item?id=43012345 → hn-43012345
```

Output path: `raw/research/<topic-slug>/<page-slug>.md`
If no topic context is available, use raw/research/misc/<page-slug>.md.
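A slug derivation might look like the sketch below. The `hn-<id>` rule comes from the example above; the generic fallback (last path segment, sanitized) is an assumption, not the skill's documented rule.

```python
import re
from urllib.parse import urlparse, parse_qs

def page_slug(url):
    """Derive a filesystem-safe slug from a URL (illustrative heuristic)."""
    parts = urlparse(url)
    if parts.hostname and parts.hostname.endswith("news.ycombinator.com"):
        return "hn-" + parse_qs(parts.query).get("id", ["unknown"])[0]
    # Generic fallback: last meaningful path segment, lowercased and sanitized.
    segment = parts.path.rstrip("/").rsplit("/", 1)[-1] or parts.hostname or "page"
    return re.sub(r"[^a-z0-9]+", "-", segment.lower()).strip("-")
```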
```shell
# Firecrawl CLI (preferred)
llm-wiki ingest firecrawl <URL> --out research/<topic>/<slug>.md

# Firecrawl REST
llm-wiki ingest firecrawl <URL> --out research/<topic>/<slug>.md

# stdlib fallback
llm-wiki ingest url <URL> --out research/<topic>/<slug>.md
```
For multiple URLs, fetch them one at a time. Do not batch in a single command if the URLs are independent topics — one file per source ensures stable citations.
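One-file-per-source fetching reduces to a loop over single invocations. A sketch, assuming the `llm-wiki ingest` commands shown above (the wrapper function itself is hypothetical; the stdlib adapter is used for illustration):

```python
import subprocess

def fetch_all(urls, topic, slugs):
    """Fetch each URL into its own file — never batch independent sources."""
    for url, slug in zip(urls, slugs):
        out = f"research/{topic}/{slug}.md"
        # stdlib fallback shown; swap in `ingest firecrawl` when available.
        subprocess.run(["llm-wiki", "ingest", "url", url, "--out", out], check=True)
```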
After each fetch:
- Product/listing URLs belong in wiki-extract-ecommerce: Amazon (`amazon.com/dp/` or `/gp/product/`), eBay (`ebay.com/itm/` or sold searches), Etsy (`etsy.com/listing/`).
- See skills/wiki-extract-paywall/SKILL.md for the full bypass priority sequence.
- If `llm_wiki_security.prompt_injection: suspected` is flagged, follow skills/wiki-research/references/source-eval.md before proceeding.
- Watch for signs of a paywall in the fetched content (see the troubleshooting table below).
Add or confirm these frontmatter fields:
```yaml
source_url: https://...
fetched_date: YYYY-MM-DD
adapter: brave | firecrawl | firecrawl_rest | stdlib_url
source_type: web
paywall: none | soft | hard          # fill in if bypass was attempted
bypass_method: freedium | archive.ph | wayback | removepaywall | 12ft | googlebot  # if bypassed
```
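Adding or confirming these fields can be sketched with a minimal helper. This is illustrative only and assumes simple `key: value` YAML frontmatter; it prepends a block when none exists rather than merging into an existing one.

```python
import datetime

def ensure_frontmatter(text, source_url, adapter):
    """Prepend required frontmatter if the file has none (sketch only)."""
    if text.startswith("---\n"):
        return text  # frontmatter present; confirm fields manually
    fields = {
        "source_url": source_url,
        "fetched_date": datetime.date.today().isoformat(),
        "adapter": adapter,
        "source_type": "web",
        "paywall": "none",
    }
    block = "\n".join(f"{k}: {v}" for k, v in fields.items())
    return f"---\n{block}\n---\n\n{text}"
```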
Run each fetched file through the scoring checklist in skills/wiki-research/references/source-eval.md. Discard or quarantine files that fail the minimum quality bar.
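A cheap pre-filter before the full checklist might look like this. The 200-word threshold comes from the troubleshooting table below; the paywall phrases are illustrative, and source-eval.md remains the authoritative bar.

```python
PAYWALL_HINTS = ("subscribe to continue", "sign in to read", "create a free account")

def passes_min_quality(text, min_words=200):
    """Cheap pre-filter before the full source-eval checklist."""
    # Skip past frontmatter, if any, so only the body is measured.
    body = text.split("---", 2)[-1] if text.startswith("---") else text
    if len(body.split()) < min_words:
        return False
    return not any(hint in body.lower() for hint in PAYWALL_HINTS)
```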
Return to wiki-research Step 3 (post-process). Do not run wiki-ingest directly — the orchestrator handles that.
Success: a `raw/` file passing minimum length/quality checks; failures are noted with fallback attempts.

| Symptom | Fix |
|---|---|
| Empty body from stdlib | Page needs JS; configure Firecrawl |
| Body < 200 words / subscription prompt | Paywall detected; invoke wiki-extract-paywall |
| Medium article paywalled | Use Freedium: https://freedium.cfd/<URL> (see wiki-extract-paywall) |
| `firecrawl: command not found` | `npm install -g firecrawl-cli` then `firecrawl login --browser` |
| Missing `FIRECRAWL_API_KEY` | `export FIRECRAWL_API_KEY=fc-...` or run `llm-wiki integrations wizard` |
| 403 / paywalled page | Invoke wiki-extract-paywall for bypass sequence |
| YouTube URL | Use wiki-extract-youtube instead — this skill does not handle video |
| Amazon / eBay / Etsy / auction URL | Use wiki-extract-ecommerce — product pages need structured extraction |
| GitHub repo URL | Use wiki-extract-github — handles README, releases, issues properly |
| Substack / Beehiiv / Ghost URL | Use wiki-extract-newsletter — newsletter-specific bypass patterns |
| Wikipedia article URL | Use wiki-extract-wikipedia — MediaWiki API gives cleaner output |
| Crunchbase / LinkedIn URL | Use wiki-extract-crunchbase or wiki-extract-linkedin respectively |
| Patent URL (Google Patents, Espacenet) | Use wiki-extract-patents — structured patent extraction |
| Firecrawl rate limit | Wait 30s; use --delay 2 flag if supported |
Run `llm-wiki integrations status` and re-run any `llm-wiki` line from Step 1 of this skill (from the vault root).