Converts PDF, DOCX, PPTX, XLSX, images (OCR), audio (transcription), HTML, CSV, JSON, XML, ZIP, EPUB, and YouTube transcripts to clean Markdown using Microsoft MarkItDown. Useful for preparing documents for LLM ingestion or batch conversion.
npx claudepluginhub alterlab-ieu/alterlab-academic-skills --plugin alterlab-bioinformatics
Slash command: /alterlab-writing-tools:alterlab-markitdown
MarkItDown is a Python tool developed by Microsoft for converting various file formats to Markdown. It's particularly useful for converting documents into LLM-friendly text format, as Markdown is token-efficient and well-understood by modern language models.
Key Benefits:
| Format | Description | Notes |
|---|---|---|
| PDF | Portable Document Format | Full text extraction |
| DOCX | Microsoft Word | Tables, formatting preserved |
| PPTX | PowerPoint | Slides with notes |
| XLSX | Excel spreadsheets | Tables and data |
| Images | JPEG, PNG, GIF, WebP | EXIF metadata + OCR |
| Audio | WAV, MP3 | Metadata + transcription |
| HTML | Web pages | Clean conversion |
| CSV | Comma-separated values | Table format |
| JSON | JSON data | Structured representation |
| XML | XML documents | Structured format |
| ZIP | Archive files | Iterates contents |
| EPUB | E-books | Full text extraction |
| YouTube | Video URLs | Fetch transcriptions |
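As a rough illustration of how the table might be used in a pipeline, the sketch below checks a file's extension before handing it to the converter. `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not part of MarkItDown, and the extension set is only an approximation of the formats listed in the table.

```python
from pathlib import Path

# Hypothetical lookup derived from the table; not part of MarkItDown itself.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx",
    ".jpg", ".jpeg", ".png", ".gif", ".webp",
    ".wav", ".mp3",
    ".html", ".htm", ".csv", ".json", ".xml", ".zip", ".epub",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only filters by file name; whether a given file actually converts still depends on which optional dependencies are installed.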
# Install with all features
pip install 'markitdown[all]'
# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
# Basic conversion
markitdown document.pdf > output.md
# Specify output file
markitdown document.pdf -o output.md
# Pipe content
cat document.pdf | markitdown > output.md
# Enable plugins
markitdown --list-plugins # List available plugins
markitdown --use-plugins document.pdf -o output.md
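When the CLI needs to be driven from a script, one option is to assemble the argument list in Python and run it with `subprocess`. The sketch below uses only the `-o` and `--use-plugins` flags shown in the commands above; the helper names `build_markitdown_command` and `run_markitdown` are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_markitdown_command(
    input_path: str,
    output_path: Optional[str] = None,
    use_plugins: bool = False,
) -> List[str]:
    """Assemble the argument list for the markitdown CLI."""
    cmd = ["markitdown"]
    if use_plugins:
        cmd.append("--use-plugins")
    cmd.append(input_path)
    if output_path is not None:
        cmd += ["-o", output_path]
    return cmd


def run_markitdown(input_path: str, output_path: str) -> None:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_markitdown_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.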
from markitdown import MarkItDown
# Basic usage
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
# Convert from stream
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
print(result.text_content)
Use LLMs via OpenRouter to generate detailed image descriptions (for PPTX and image files):
from markitdown import MarkItDown
from openai import OpenAI
# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)
result = md.convert("presentation.pptx")
print(result.text_content)
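The example above uses a placeholder string for the API key. In practice it is usually safer to read the key from the environment. The sketch below assumes an environment variable named `OPENROUTER_API_KEY`; that name and the `openrouter_client_kwargs` helper are illustrative choices, not something MarkItDown or OpenRouter requires.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_client_kwargs(env_var: str = "OPENROUTER_API_KEY") -> dict:
    """Build keyword arguments for OpenAI(...) from an environment variable.

    The variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before creating the client")
    return {"api_key": api_key, "base_url": OPENROUTER_BASE_URL}


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**openrouter_client_kwargs())
```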
For enhanced PDF conversion with Microsoft Document Intelligence:
# Command line
markitdown document.pdf -o output.md -d -e "<document_intelligence_endpoint>"
# Python API
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("complex_document.pdf")
print(result.text_content)
MarkItDown supports third-party plugins that extend its functionality:
# List installed plugins
markitdown --list-plugins
# Enable plugins
markitdown --use-plugins file.pdf -o output.md
Find plugins on GitHub with hashtag: #markitdown-plugin
Install only the optional dependencies for the file formats you need:
# Install specific formats
pip install 'markitdown[pdf, docx, pptx]'
# All available options:
# [all] - All optional dependencies
# [pptx] - PowerPoint files
# [docx] - Word documents
# [xlsx] - Excel spreadsheets
# [xls] - Older Excel files
# [pdf] - PDF documents
# [outlook] - Outlook messages
# [az-doc-intel] - Azure Document Intelligence
# [audio-transcription] - WAV and MP3 transcription
# [youtube-transcription] - YouTube video transcription
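If the set of formats is chosen programmatically (for example in a setup script), the requirement string can be built from the extras listed above. This is a minimal sketch; `KNOWN_EXTRAS` and `pip_requirement` are hypothetical helpers that simply mirror the list in this document.

```python
from typing import Iterable

# Extras named in the list above.
KNOWN_EXTRAS = {
    "all", "pptx", "docx", "xlsx", "xls", "pdf", "outlook",
    "az-doc-intel", "audio-transcription", "youtube-transcription",
}


def pip_requirement(extras: Iterable[str]) -> str:
    """Return a requirement string such as 'markitdown[docx,pdf]'."""
    chosen = sorted(set(extras))
    unknown = [e for e in chosen if e not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"Unknown extras: {unknown}")
    if not chosen:
        return "markitdown"
    return f"markitdown[{','.join(chosen)}]"
```

The resulting string can be passed to `pip install` as a single quoted argument.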
from markitdown import MarkItDown
md = MarkItDown()
# Convert PDF paper
result = md.convert("research_paper.pdf")
with open("paper.md", "w") as f:
    f.write(result.text_content)
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
# Result will be in Markdown table format
print(result.text_content)
from markitdown import MarkItDown
from pathlib import Path
md = MarkItDown()
# Process all PDFs in a directory
pdf_dir = Path("papers/")
output_dir = Path("markdown_output/")
output_dir.mkdir(exist_ok=True)
for pdf_file in pdf_dir.glob("*.pdf"):
    result = md.convert(str(pdf_file))
    output_file = output_dir / f"{pdf_file.stem}.md"
    output_file.write_text(result.text_content)
    print(f"Converted: {pdf_file.name}")
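For repeated runs over a growing folder, it can help to skip files that already have a Markdown output. The sketch below is one possible variation on the loop above. `convert_new_pdfs` is a hypothetical helper, and `converter` is assumed to be any object with a `convert(path)` method whose result exposes `text_content`, such as a `MarkItDown()` instance.

```python
from pathlib import Path
from typing import List


def convert_new_pdfs(converter, pdf_dir: Path, output_dir: Path) -> List[Path]:
    """Convert PDFs that have no Markdown output yet; return the files written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_file in sorted(pdf_dir.glob("*.pdf")):
        output_file = output_dir / f"{pdf_file.stem}.md"
        if output_file.exists():
            continue  # already converted on an earlier run
        result = converter.convert(str(pdf_file))
        output_file.write_text(result.text_content, encoding="utf-8")
        written.append(output_file)
    return written
```

Taking the converter as a parameter keeps the function easy to test with a stand-in object and lets one `MarkItDown()` instance be reused across calls.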
from markitdown import MarkItDown
from openai import OpenAI
# Use OpenRouter for access to multiple AI models
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",  # recommended for presentations
    llm_prompt="Describe this slide image in detail, focusing on key visual elements and data"
)
result = md.convert("presentation.pptx")
with open("presentation.md", "w") as f:
    f.write(result.text_content)
from markitdown import MarkItDown
from pathlib import Path
md = MarkItDown()
# Files to convert
files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx"
]
for file in files:
    try:
        result = md.convert(file)
        output = Path(file).stem + ".md"
        with open(output, "w") as f:
            f.write(result.text_content)
        print(f"✓ Converted {file}")
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
from markitdown import MarkItDown
md = MarkItDown()
# Convert YouTube video to transcript
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
# Build image
docker build -t markitdown:latest .
# Run conversion
docker run --rm -i markitdown:latest < ~/document.pdf > output.md
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Conversion error: {e}")
from markitdown import MarkItDown
md = MarkItDown()
# For large files, use streaming
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
# Process in chunks or save directly
with open("output.md", "w") as out:
    out.write(result.text_content)
Markdown output is already token-efficient, but you can trim it further, for example by collapsing extra blank lines:
from markitdown import MarkItDown
import re
md = MarkItDown()
result = md.convert("document.pdf")
# Clean up extra whitespace
clean_text = re.sub(r'\n{3,}', '\n\n', result.text_content)
clean_text = clean_text.strip()
print(clean_text)
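The same idea can be wrapped in a small reusable function that also removes trailing spaces and tabs. This is a sketch rather than a recommended default: `tidy_markdown` is a hypothetical helper, and because two trailing spaces mean a hard line break in Markdown, and the function does not special-case code blocks, it may change rendering in documents that rely on either.

```python
import re


def tidy_markdown(text: str) -> str:
    """Strip trailing spaces/tabs and collapse runs of blank lines to one."""
    # Remove whitespace at the end of every line (also empties blank-looking lines).
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse three or more consecutive newlines into exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```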
from markitdown import MarkItDown
from pathlib import Path
md = MarkItDown()
# Convert all papers in literature folder
papers_dir = Path("literature/pdfs")
output_dir = Path("literature/markdown")
output_dir.mkdir(exist_ok=True)
for paper in papers_dir.glob("*.pdf"):
    result = md.convert(str(paper))
    # Save with metadata
    output_file = output_dir / f"{paper.stem}.md"
    content = f"# {paper.stem}\n\n"
    content += f"**Source**: {paper.name}\n\n"
    content += "---\n\n"
    content += result.text_content
    output_file.write_text(content)
# For AI-enhanced conversion with figures
from openai import OpenAI
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
md_ai = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",
    llm_prompt="Describe scientific figures with technical precision"
)
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data_tables.xlsx")
# Markdown tables can be parsed or used directly
print(result.text_content)
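To illustrate what "parsed" could look like, the sketch below turns a simple pipe table into a list of dictionaries keyed by the header row. `parse_markdown_table` is a hypothetical helper. It assumes every table line starts with `|`, that the second line is the separator row, and that cells contain no escaped pipes, which may not hold for every converted spreadsheet.

```python
from typing import Dict, List


def parse_markdown_table(table: str) -> List[Dict[str, str]]:
    """Parse a simple pipe table into a list of row dictionaries."""

    def split_row(line: str) -> List[str]:
        s = line.strip()
        # Drop exactly one leading and one trailing pipe, then split cells.
        if s.startswith("|"):
            s = s[1:]
        if s.endswith("|"):
            s = s[:-1]
        return [cell.strip() for cell in s.split("|")]

    lines = [ln for ln in table.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row and is skipped.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]
```

For heavier analysis it is usually simpler to load the original spreadsheet directly; parsing the Markdown is mainly useful when only the converted text is available.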
Missing dependencies: Install feature-specific packages
pip install 'markitdown[pdf]' # For PDF support
Binary file errors: Ensure files are opened in binary mode
with open("file.pdf", "rb") as f:  # Note the "rb"
    result = md.convert_stream(f, file_extension=".pdf")
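When the data is already in memory (for example, downloaded bytes), the same binary-stream requirement can be met with `io.BytesIO`. The sketch below assumes `converter` is a `MarkItDown()` instance or any object with a compatible `convert_stream(stream, file_extension=...)` method, as used in the snippet above; `convert_bytes` itself is a hypothetical helper.

```python
import io


def convert_bytes(converter, data: bytes, file_extension: str) -> str:
    """Convert in-memory bytes by wrapping them in a binary stream."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("data must be bytes; open files with mode 'rb'")
    stream = io.BytesIO(data)
    result = converter.convert_stream(stream, file_extension=file_extension)
    return result.text_content
```

The explicit type check turns the common mistake of passing text instead of bytes into an immediate, readable error.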
OCR not working: Install tesseract
# macOS
brew install tesseract
# Ubuntu
sudo apt-get install tesseract-ocr
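A quick way to confirm the install from Python is to look for the executable on `PATH`. This is only a presence check under the assumption that the binary is named `tesseract`; `tool_available` is a hypothetical helper and does not verify that OCR actually works.

```python
import shutil


def tool_available(name: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None
```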
references/api_reference.md for complete API documentation
references/file_formats.md for format-specific details
scripts/batch_convert.py for automation examples
scripts/convert_with_ai.py for AI-enhanced conversions
packages/markitdown-sample-plugin