From workflow-tools
Retrieve content from a URL using a four-tier escalation strategy: WebFetch → curl → Playwright (JS rendering) → BrightData (anti-bot bypass). Use when standard WebFetch fails or when the target URL is known to require JS rendering or bot mitigation bypass.
Install with:

```bash
npx claudepluginhub hpsgd/turtlestack --plugin workflow-tools
```
Retrieve the content at $ARGUMENTS using the appropriate retrieval method.
Before attempting retrieval, classify the target:
| Signal | Likely tier needed |
|---|---|
| Standard website, no login required | Tier 1 (WebFetch) |
| JavaScript-rendered SPA (React, Vue, Angular) | Tier 3 (Playwright) |
| Known anti-bot protection (Cloudflare, Datadome, PerimeterX) | Tier 4 (BrightData) |
| Previously failed WebFetch with 403/429 | Start at Tier 2 |
| News article, blog, documentation | Tier 1 first |
If classification is uncertain, start at Tier 1 and escalate on failure.
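The classification table above can be sketched as a small decision function. This is illustrative only; the signal names are assumptions for the sketch, not part of any real API:

```javascript
// Map pre-retrieval signals to a starting tier, mirroring the table above.
// Signal names (knownAntiBot, jsRendered, previouslyBlocked) are hypothetical.
function startingTier({ knownAntiBot = false, jsRendered = false, previouslyBlocked = false } = {}) {
  if (knownAntiBot) return 4;       // Cloudflare, Datadome, PerimeterX
  if (jsRendered) return 3;         // SPA shells need a real browser
  if (previouslyBlocked) return 2;  // prior 403/429 -> browser-like headers
  return 1;                         // default: plain WebFetch
}

console.log(startingTier({ jsRendered: true }));  // 3
console.log(startingTier());                      // 1
```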
### Tier 1: WebFetch

Use the WebFetch tool. This works for the majority of public content.
Success: content returned with meaningful text — proceed.
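One way to judge "meaningful text" programmatically is a visible-text heuristic. This is a sketch; the 200-character threshold is an assumption, not part of this skill:

```javascript
// Heuristic: does the fetched HTML look like an empty JS shell?
// Strips scripts, styles, and tags, then checks how much visible text remains.
function looksLikeJsShell(html) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return text.length < 200;  // assumed threshold for "no meaningful text"
}

console.log(looksLikeJsShell('<div id="root"></div>'));  // true
```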
Escalate to Tier 2 if:
- WebFetch is blocked or returns an error (e.g., 403, 429)
- The response is an empty JS shell (e.g., `<div id="root"></div>` with no text)

### Tier 2: curl with browser headers

Simulate a real browser request by adding common headers:
```bash
curl -s -L --compressed \
  -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  -H "Accept-Language: en-AU,en;q=0.9" \
  "[URL]"
```
Success: HTML returned with content — extract the relevant text and proceed.
Escalate to Tier 3 if:
- The returned HTML is still an empty JS shell (the content is rendered client-side)

### Tier 3: Playwright (JS rendering)

For JavaScript-rendered content, use a Playwright script to render the page fully before extracting content:
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // Wait until the network is quiet so client-side rendering has finished
  await page.goto('[URL]', { waitUntil: 'networkidle' });
  const content = await page.content();
  console.log(content);
  await browser.close();
})();
```
Prerequisites: Playwright must be installed (`npm install playwright` or `pip install playwright && playwright install chromium`). Check first; don't assume it's available.
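The availability check can be done programmatically before committing to this tier. A sketch for the Node case only (`require.resolve` throws when a package is not installed):

```javascript
// Check whether the Playwright package is resolvable from this runtime
// before attempting a Tier 3 retrieval.
let hasPlaywright = true;
try {
  require.resolve('playwright');  // throws if the package is not installed
} catch {
  hasPlaywright = false;
}

console.log(hasPlaywright
  ? 'playwright available'
  : 'playwright missing: run npm install playwright');
```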
Success: fully rendered HTML returned with content — extract and proceed.
Escalate to Tier 4 if:
- Playwright receives a block or challenge page (e.g., CAPTCHA, Cloudflare interstitial) instead of content

### Tier 4: BrightData Scraping Browser

For sites with aggressive anti-bot protection that defeats all standard approaches: BrightData's Scraping Browser routes requests through residential proxies with full browser fingerprinting. This is a paid service; confirm it's available in the environment before attempting.
```javascript
const { chromium } = require('playwright-core');

(async () => {
  // Credentials come from the environment; never hardcode them
  const endpoint =
    `wss://brd-customer-${process.env.BRIGHTDATA_CUSTOMER_ID}` +
    `-zone-scraping_browser:${process.env.BRIGHTDATA_PASSWORD}` +
    '@brd.superproxy.io:9222';
  const browser = await chromium.connectOverCDP(endpoint);
  const page = await browser.newPage();
  await page.goto('[URL]', { waitUntil: 'domcontentloaded' });
  const content = await page.content();
  console.log(content);
  await browser.close();
})();
```
Credentials are loaded from environment variables (`BRIGHTDATA_CUSTOMER_ID`, `BRIGHTDATA_PASSWORD`). Never hardcode credentials.
If Tier 4 fails: the content is not retrievable by automated means. Report the failure with the specific error, note what was attempted, and suggest manual retrieval or an alternative source.
Once raw HTML is retrieved (any tier), extract the meaningful content: strip navigation, ads, scripts, and other boilerplate, keeping the main text with its heading structure intact. For structured data (tables, JSON-LD, microdata), extract the structure rather than the raw HTML.
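As a sketch of structured extraction, JSON-LD blocks can be pulled straight out of retrieved HTML. This is a minimal regex-based approach; a real HTML parser is more robust:

```javascript
// Extract and parse all JSON-LD <script> blocks from an HTML string.
// Malformed JSON blocks are skipped rather than failing the whole extraction.
function extractJsonLd(html) {
  const blocks = [];
  const re = /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(m[1]));
    } catch {
      // skip malformed JSON rather than aborting
    }
  }
  return blocks;
}
```

Usage: `extractJsonLd(html)` returns an array of parsed objects, one per valid JSON-LD block, which is usually far cleaner than scraping the rendered markup for the same fields.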
When retrieving content from a domain with a known Apify actor that produces cleaner output than raw HTML scraping, prefer the actor over raw retrieval. Check the Apify store with a targeted search before writing a custom extraction:
apify.com/store?search=[domain name]
Existing actors for common targets (LinkedIn, Amazon, social platforms) produce better-structured output than HTML parsing. Use them when available — don't build custom scrapers for already-solved problems.
Respect `robots.txt` for Tiers 1 and 2. Tiers 3 and 4 bypass these protections; use them only when there is a legitimate purpose and the requester has confirmed compliance with the target site's terms of service.

Report the result in this format:

### Content retrieval: [URL]
**Date:** [today]
**Tier used:** [1 / 2 / 3 / 4]
**Escalation path:** [e.g., Tier 1 failed (403) → Tier 2 succeeded]
### Retrieved content
[Extracted text content with structure preserved]
### Metadata
- **Title:** —
- **Publication date:** —
- **Author:** —
- **Word count (approximate):** —
### Notes
[Any content quality issues, partial retrieval, or access limitations]