From docs-tools
Downloads HTML from websites and extracts article content from <article> tags (for example, those marked aria-live='polite'), removing bloat. Outputs clean HTML, Markdown, or text for documentation sites.
npx claudepluginhub redhat-documentation/redhat-docs-agent-tools --plugin docs-tools
This skill is limited to using the following tools:
This skill downloads HTML from websites and extracts the article content, removing unnecessary HTML bloat. It's particularly useful for documentation websites that have large amounts of navigation, styling, and other non-content HTML.
The skill uses a Python script that downloads HTML and parses it for content inside <article> tags.
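The extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the code in scripts/article_extractor.py: the names ArticleExtractor and extract_article are hypothetical, and html.parser stands in for the BeautifulSoup-based parsing the skill lists as a dependency. It keeps only the inner HTML of the first <article> whose attributes match.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class ArticleExtractor(HTMLParser):
    """Collect the inner HTML of the first matching <article> element."""

    def __init__(self, required_attrs: Optional[Dict[str, str]] = None):
        # convert_charrefs=False keeps entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        self.required_attrs = required_attrs or {}
        self.depth = 0      # nesting depth of <article> while capturing
        self.found = False  # True once a matching <article> was opened
        self.done = False   # True once that <article> was closed
        self.parts = []

    def _capturing(self) -> bool:
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "article":
                self.depth += 1
        elif tag == "article":
            given = dict(attrs)
            if all(given.get(k) == v for k, v in self.required_attrs.items()):
                self.depth = 1
                self.found = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        if tag == "article":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")


def extract_article(html: str, required_attrs: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the inner HTML of the first matching <article>, or None."""
    parser = ArticleExtractor(required_attrs)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.found else None


page = (
    "<html><body><nav>Menu</nav>"
    '<article aria-live="polite"><h1>Title</h1><p>Body &amp; more</p></article>'
    "<footer>Foot</footer></body></html>"
)
print(extract_article(page, {"aria-live": "polite"}))
# prints: <h1>Title</h1><p>Body &amp; more</p>
```

The navigation and footer never reach the output because capture only starts at the matching <article> start tag and stops at its end tag.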
Extract article from a URL:
python3 scripts/article_extractor.py --url "https://example.com/page"
Extract with specific output format:
python3 scripts/article_extractor.py --url "https://example.com/page" --format markdown
Save to file:
python3 scripts/article_extractor.py --url "https://example.com/page" --output article.md
Extract with custom article selector:
python3 scripts/article_extractor.py --url "https://example.com/page" --selector "article.main-content"
--url URL: The URL to fetch HTML from (required)
--format {html,markdown,text}: Output format (default: markdown)
--output FILE: Save output to file instead of stdout
--selector SELECTOR: CSS selector for article content (default: article[aria-live="polite"])
--pretty: Pretty-print HTML output with indentation
--strip-links: Remove all hyperlinks from output
HTML (default): Extracts the article HTML content with all tags preserved but removes surrounding bloat.
Markdown: Converts the article content to Markdown format for easy reading and documentation.
Plain Text: Strips all HTML tags and returns plain text content.
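The options listed above map naturally onto an argparse parser. The sketch below shows one way they could be declared; build_parser is a hypothetical name, the help strings are copied from the option list, and the markdown default follows that list. The actual script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="article_extractor.py")
    parser.add_argument("--url", required=True,
                        help="The URL to fetch HTML from")
    parser.add_argument("--format", choices=["html", "markdown", "text"],
                        default="markdown", help="Output format")
    parser.add_argument("--output", metavar="FILE",
                        help="Save output to file instead of stdout")
    parser.add_argument("--selector", default='article[aria-live="polite"]',
                        help="CSS selector for article content")
    parser.add_argument("--pretty", action="store_true",
                        help="Pretty-print HTML output with indentation")
    parser.add_argument("--strip-links", action="store_true",
                        help="Remove all hyperlinks from output")
    return parser


args = build_parser().parse_args(
    ["--url", "https://example.com/page", "--format", "text"]
)
print(args.format, args.selector, args.strip_links)
# prints: text article[aria-live="polite"] False
```

Note that argparse exposes --strip-links as args.strip_links, with the hyphen converted to an underscore.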
# Extract from Red Hat OpenShift Lightspeed documentation
python3 scripts/article_extractor.py \
--url "https://docs.redhat.com/en/documentation/red_hat_openshift_lightspeed/1.0/html/install/ols-installing-lightspeed" \
--format markdown \
--output openshift-lightspeed-install.md
# Extract from any site with article tags
python3 scripts/article_extractor.py \
--url "https://example.com/docs/guide" \
--selector "article.documentation" \
--format text
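The text format used in the last example strips all tags and keeps only the readable content. A rough sketch of that idea follows, using only the standard library. The names TextExtractor and html_to_text are hypothetical, and the choice of which tags end a line is an assumption, not the script's documented behavior.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Drop tags, skip script/style bodies, break lines after block tags."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()  # entities are decoded into plain characters
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)


sample = "<h1>Install</h1><p>Run <code>make</code> now.</p><style>p{color:red}</style>"
print(html_to_text(sample))
# prints:
# Install
# Run make now.
```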
This skill requires the following Python packages:
requests: For downloading HTML content
beautifulsoup4: For parsing and extracting HTML
html2text: For converting HTML to Markdown (optional, for markdown format)
Install dependencies:
python3 -m pip install requests beautifulsoup4 html2text
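Because html2text is described as optional, it can help to check which packages are importable before running the script. The snippet below is one possible check, not part of the skill. missing_modules is a hypothetical helper, and bs4 is the import name of the beautifulsoup4 package.

```python
import importlib.util
from typing import Iterable, List

REQUIRED = ("requests", "bs4")  # bs4 is installed by the beautifulsoup4 package
OPTIONAL = ("html2text",)       # only needed for Markdown output


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = "all present" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```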
The skill downloads and processes HTML efficiently, looking for content in <article> tags or similar semantic HTML.