html-to-markdown is a high-performance HTML→Markdown converter with a Rust core and 12 native language bindings. It converts HTML to CommonMark Markdown, Djot, or plain text in a single pass, optionally extracting metadata, tables, inline images, and a structured document tree.

Supporting Files

references/cli-reference.mdreferences/configuration.mdreferences/other-bindings.mdreferences/python-api.mdreferences/rust-api.mdreferences/typescript-api.md

SKILL.md

236 lines · ~2.6k tokens

Stats

LanguageShell

Parent stars5

MaintenanceExcellent

Last CommitJun 24, 2026

Actions

View Source View Plugin View on GitHub View README

html-to-markdown

Use this skill when writing code that:

Converts HTML strings, files, or live URLs to Markdown, Djot, or plain text
Extracts metadata (title, OG tags, JSON-LD/Microdata/RDFa, headers, links, images, language) from HTML
Extracts structured table data (GFM markdown + cell grids) from HTML
Extracts a structured document-structure tree
Extracts inline images (data URIs, SVGs) from HTML
Uses preprocessing to clean noisy HTML (ads, navigation, forms) before conversion

Capability map

Capability	CLI	SDKs
HTML→Markdown / Djot / plain text	`html-to-markdown FILE`	`convert(html, options)`
Read HTML from stdin / file / URL	`cat f \| …`, `FILE`, `--url URL`	`convert(htmlString, …)`
~30 config options (headings, code blocks, lists, escaping, wrapping…)	flags	`ConversionOptions`
Metadata extraction	`--json` (default-extracted)	`result.metadata`
Table extraction	`--json` → `tables[]`	`result.tables`
Document structure tree	`--json --include-structure`	`include_document_structure=true`
Inline image extraction	`--json --extract-inline-images`	`extract_images=true`
HTML preprocessing	`--preprocess [--preset …]`	`PreprocessingOptions`
Extraction-only (no Markdown body)	`--json --no-content`	options + read fields

Installation

CLI

# (Homebrew 6.0+ requires explicit trust for third-party taps)
brew trust xberg-io/tap
brew install xberg-io/tap/html-to-markdown
# or run without a persistent install (the CLI proxy package self-installs the binary):
npx @kreuzberg/html-to-markdown-cli --help
uvx --from html-to-markdown-cli html-to-markdown --help
# or download a prebuilt binary from the latest GitHub release:
#   https://github.com/xberg-io/html-to-markdown/releases/latest
# or build from source:
cargo install --git https://github.com/xberg-io/html-to-markdown html-to-markdown-cli

Language SDKs

pip install html-to-markdown                                # Python
npm install @kreuzberg/html-to-markdown                      # TypeScript / Node.js
cargo add html-to-markdown-rs                                # Rust (features: metadata default; full = all)
gem install html-to-markdown                                 # Ruby
composer require xberg-io/html-to-markdown              # PHP
go get github.com/xberg-io/html-to-markdown/packages/go/v3   # Go
dotnet add package KreuzbergDev.HtmlToMarkdown               # C#
npm install @kreuzberg/html-to-markdown-wasm                 # WASM

Java (Maven): dev.kreuzberg:html-to-markdown
Elixir: {:html_to_markdown, "~> 3.6"} in mix.exs
R: install.packages("htmltomarkdown", repos = "https://xberg-io.r-universe.dev")
C (FFI): pre-built .so / .dll / .dylib from GitHub releases

CLI vs SDK — which to use

CLI — one-shot conversions, shell pipelines, fetching a single URL, ad-hoc metadata/table extraction via --json | jq. Flags only, no subcommands. FILE is positional; omit or use - for stdin.
SDK — embedding conversion in application code, batch processing, custom element conversion (visitor pattern, Rust), and tight loops where process spawn overhead matters.
MCP server — html-to-markdown mcp exposes convert/extract as agent tools, so an MCP client can convert HTML directly with no shell-out. This plugin auto-registers it; see the using-the-mcp-server skill.

Both share the same ConversionResult shape, so output is interchangeable.

When to use html-to-markdown vs kreuzberg vs kreuzcrawl

html-to-markdown — you already have HTML (a string, a file, or a single URL) and want clean Markdown plus structured metadata/tables. No OCR, no document parsing, no crawling.
kreuzberg — you have documents (PDF, Office, images, email, archives) and need full text/table/metadata extraction with optional OCR. Use it when the input is not already HTML.
kreuzcrawl — you need to crawl or scrape many pages, follow links, and handle JS-rendered sites with a headless-Chrome fallback. It uses html-to-markdown internally for the HTML→Markdown step.

Rule of thumb: single HTML in → Markdown out = html-to-markdown. Many URLs / a site = kreuzcrawl. Non-HTML documents = kreuzberg.

CLI quick start

# Convert a file to stdout
html-to-markdown input.html

# Convert and save
html-to-markdown input.html -o output.md

# Read from stdin
cat page.html | html-to-markdown

# Fetch and convert a URL
html-to-markdown --url https://example.com > out.md

# Full ConversionResult as JSON (content, tables, metadata, images, warnings)
html-to-markdown --json input.html

# JSON with document structure tree
html-to-markdown --json --include-structure input.html

# Extraction-only (no Markdown body)
html-to-markdown --json --no-content input.html

# Aggressive web-page cleanup
html-to-markdown input.html --preprocess --preset aggressive

SDK quick start

Rust

use html_to_markdown_rs::convert;

let result = convert("<h1>Hello World</h1><p>A paragraph.</p>", None)?;
println!("{}", result.content.unwrap_or_default());

Python

from html_to_markdown import convert

result = convert("<h1>Hello World</h1><p>A paragraph.</p>")
print(result.content)   # # Hello World\n\nA paragraph.
print(result.metadata)  # title, links, headers, …

TypeScript / Node.js

import { convert } from "@kreuzberg/html-to-markdown";

// Node's convert() returns a JSON string — always JSON.parse() it.
const result = JSON.parse(convert("<h1>Hello World</h1><p>A paragraph.</p>"));
console.log(result.content);

ConversionResult fields

All languages return the same structure (dict, object, or struct).

Field	Description
`content`	Converted text (Markdown/Djot/plain). `null` only in extraction-only mode.
`metadata`	Title, OG, headers, links, images, structured data.
`tables`	Tables with `grid` (structured cells) and `markdown` fields.
`images`	Extracted inline images (requires inline-image extraction).
`document`	Structured document tree when structure extraction is enabled.
`warnings`	Non-fatal processing warnings (`message`, `kind`).

Configuration

All languages expose the same ~30 options. See references/configuration.md for the complete table. Common ones:

Option	Values	Default
`heading_style`	`atx`, `underlined`, `atx-closed`	`atx`
`code_block_style`	`backticks`, `indented`, `tildes`	`backticks`
`output_format`	`markdown`, `djot`	`markdown`
`wrap` / `wrap_width`	bool / 20–500	off / `80`
`autolinks` (SDK) / `--no-autolinks` (CLI)	bool / flag	`true` (on); disable in CLI with `--no-autolinks`
preprocessing	`minimal` / `standard` / `aggressive`	off

Rust (builder)

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle, OutputFormat};

let options = ConversionOptions::builder()
    .heading_style(HeadingStyle::Atx)
    .output_format(OutputFormat::Markdown)
    .wrap(true)
    .wrap_width(100)
    .build();
let result = convert(html, Some(options))?;

Python (dataclass)

from html_to_markdown import convert, ConversionOptions, PreprocessingOptions

result = convert(
    html,
    ConversionOptions(heading_style="atx", wrap=True, wrap_width=100),
    PreprocessingOptions(enabled=True, preset="aggressive"),
)

Metadata extraction

Metadata is extracted by default; in the CLI it appears under metadata in --json output. Fields include document (title, description, language, charset, open_graph), headers, links (with link_type), images, and structured_data (JSON-LD/Microdata/RDFa). See the extracting-metadata skill for details.

Table extraction

Tables appear in result.tables, each with a pre-rendered markdown string and a structured cell grid. Markdown tables also appear inline in content. See the extracting-tables skill.

Document structure extraction

Enable structure extraction (--include-structure on the CLI, include_document_structure=true in SDKs) to get a semantic node tree under document. Node types include heading, paragraph, list, list_item, table, image, code, quote, group, metadata_block.

Common pitfalls

convert() returns a result object, not a string. Access .content for the Markdown text.
Node.js convert() returns a JSON string. Always JSON.parse(convert(html)) — NAPI-RS serializes the result for performance.
--json outputs JSON, not Markdown. Omit --json for plain Markdown.
--include-structure, --extract-inline-images, and --no-content require --json.
CLI has no subcommands. Everything is a flag; FILE is positional.
--preset, --keep-navigation, --keep-forms require --preprocess.

Additional resources

CLI Reference — every flag, JSON shape, exit codes
Configuration Reference — all 30+ options with defaults
Rust API Reference — signatures, builder, feature flags
Python API Reference — functions, dataclasses, type hints
TypeScript API Reference — functions, interfaces, Buffer support
Other Bindings — Go, Ruby, PHP, Java, C#, Elixir, R, WASM, C FFI

GitHub: https://github.com/xberg-io/html-to-markdown

html-to-markdown

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

html-to-markdown

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

html-to-markdown

Capability map

Installation

CLI

Language SDKs

CLI vs SDK — which to use

When to use html-to-markdown vs kreuzberg vs kreuzcrawl

CLI quick start

SDK quick start

Rust

Python

TypeScript / Node.js

ConversionResult fields

Configuration

Rust (builder)

Python (dataclass)

Metadata extraction

Table extraction

Document structure extraction

Common pitfalls

Additional resources

Similar Skills

html-to-markdown

Capability map

Installation

CLI

Language SDKs

CLI vs SDK — which to use

When to use html-to-markdown vs kreuzberg vs kreuzcrawl

CLI quick start

SDK quick start

Rust

Python

TypeScript / Node.js

ConversionResult fields

Configuration

Rust (builder)

Python (dataclass)

Metadata extraction

Table extraction

Document structure extraction

Common pitfalls

Additional resources

Similar Skills