This skill should be used when the user says "process documents", "extract text from PDF", "OCR this document", "convert PDF to markdown", "extract emails from documents", "parse document", "document conversion", "batch OCR", "extract structured data from PDF", "read PDF", "extract tables from PDF", "convert Word document", "convert docx to markdown", or wants to extract, convert, or process documents and scanned images.
npx claudepluginhub neuromechanist/research-skills --plugin project
Slash command: /project:document-processing
Extract, convert, and structure content from PDFs, images, and other document formats. Handles OCR, text extraction, markdown conversion, email extraction, and structured data output.
Determine the processing approach:
| Input | Method | Tool |
|---|---|---|
| Native PDF (has text layer) | Direct extraction | pdftotext, pymupdf |
| Scanned PDF (images only) | OCR | Mistral OCR API, tesseract |
| Image files (PNG, JPG, TIFF) | OCR | Mistral OCR API, tesseract |
| Word documents (.docx) | Conversion | python-docx, pandoc |
| HTML | Conversion | pandoc, beautifulsoup4 |
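The table above can be read as a routing rule keyed on file type. The following is a minimal sketch, assuming the file extension is a reliable indicator of the input type; `choose_method` and `ROUTES` are illustrative names, not part of the skill. A PDF still needs the text-layer check to decide between direct extraction and OCR.

```python
from pathlib import Path

# Mirrors the table above; keys are lowercase file extensions.
OCR = ("OCR", ["Mistral OCR API", "tesseract"])
ROUTES = {
    ".png": OCR,
    ".jpg": OCR,
    ".jpeg": OCR,
    ".tif": OCR,
    ".tiff": OCR,
    ".docx": ("Conversion", ["python-docx", "pandoc"]),
    ".html": ("Conversion", ["pandoc", "beautifulsoup4"]),
    ".htm": ("Conversion", ["pandoc", "beautifulsoup4"]),
}


def choose_method(path: str) -> tuple[str, list[str]]:
    """Return (method, candidate tools) for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Native vs. scanned cannot be told from the name alone.
        return ("Direct extraction or OCR (check for a text layer)",
                ["pdftotext", "pymupdf", "Mistral OCR API", "tesseract"])
    if suffix in ROUTES:
        return ROUTES[suffix]
    raise ValueError(f"Unsupported input type: {suffix or path}")


print(choose_method("scan.TIFF"))
```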
Detection:
# Check if PDF has text content
pdftotext input.pdf - | head -20
# If output is empty or garbled, it's a scanned PDF -> use OCR
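The "empty or garbled" check can also be automated. The sketch below is one possible heuristic, not the skill's own logic: it treats output as unusable when it is very short or mostly non-alphanumeric. The function names and both thresholds (`min_chars`, `min_alnum_ratio`) are assumptions to tune for your documents.

```python
import subprocess


def looks_scanned(sample: str, min_chars: int = 50,
                  min_alnum_ratio: float = 0.5) -> bool:
    """Return True when extracted text is too short or too noisy to trust."""
    stripped = "".join(sample.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


def pdf_needs_ocr(pdf_path: str, pages: int = 3) -> bool:
    """Run pdftotext on the first few pages and apply the heuristic."""
    result = subprocess.run(
        ["pdftotext", "-l", str(pages), pdf_path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return looks_scanned(result.stdout)
```

Calling `pdf_needs_ocr("input.pdf")` requires `pdftotext` on the PATH; `looks_scanned` can be applied to text from any extractor.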
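For native PDFs, extract the text layer directly with pymupdf:
import pymupdf

doc = pymupdf.open("input.pdf")
for page in doc:
    # Valid get_text() formats include "text", "html", and "xhtml"; "markdown" is not one of them.
    # For Markdown output, the separate pymupdf4llm package is one option.
    text = page.get_text("text")
    print(text)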
Mistral OCR requires the MISTRAL_API_KEY environment variable. If the key is not set, fall back to tesseract for offline processing.
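import base64
import httpx

def ocr_page(image_path: str, api_key: str) -> str:
    """OCR one PNG image and return the extracted text as markdown."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    # Mistral also documents a dedicated OCR endpoint (/v1/ocr) for mistral-ocr-latest;
    # check the current API reference if this chat-style request is rejected.
    response = httpx.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "mistral-ocr-latest",
            "messages": [{
                "role": "user",
                "content": [
                    # The data URL assumes a PNG; adjust the MIME type for JPG or TIFF input.
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                    {"type": "text", "text": "Extract all text from this image. Preserve formatting, tables, and structure. Output as markdown."}
                ]
            }]
        },
        timeout=120.0,  # OCR can exceed httpx's 5-second default
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["choices"][0]["message"]["content"]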
# Single page
tesseract input.png output -l eng --oem 3 --psm 6
# PDF to text via tesseract
pdftoppm input.pdf page -png
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done
cat page-*.txt > output.txt
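The fallback rule stated earlier (Mistral OCR when the API key is set, tesseract otherwise) can be sketched as a small selector. This is an illustration under assumptions: `pick_ocr_backend` and its return labels are hypothetical, and the `env` and `which` parameters exist only so the logic can be exercised without changing the real environment.

```python
import os
import shutil


def pick_ocr_backend(env=None, which=shutil.which) -> str:
    """Return "mistral" if an API key is set, else "tesseract" if installed."""
    env = os.environ if env is None else env
    if env.get("MISTRAL_API_KEY"):
        return "mistral"
    if which("tesseract"):
        return "tesseract"
    raise RuntimeError(
        "No OCR backend available: set MISTRAL_API_KEY or install tesseract"
    )
```

Calling `pick_ocr_backend()` with no arguments inspects the real environment and PATH.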
Convert extracted text to structured formats:
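import re

# `text` is the string produced by the extraction or OCR step.

# Extract emails (the domain part must end in a label, so a sentence-final period is not captured)
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
# Extract dates (numeric forms such as 3/15/2024 or 15-03-24)
dates = re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text)
# Extract phone numbers (10-digit, US-style only)
phones = re.findall(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text)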
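To turn these matches into structured output, the patterns can be grouped and the results serialized. The sketch below is one way to do it; `PATTERNS` and `extract_fields` are illustrative names, and the email pattern is a slightly stricter variant chosen so that a trailing period is not captured. Duplicates are dropped while preserving first-seen order.

```python
import json
import re

PATTERNS = {
    "emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "dates": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "phones": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}


def extract_fields(text: str) -> dict[str, list[str]]:
    """Return unique matches per field, in order of first appearance."""
    return {
        name: list(dict.fromkeys(pattern.findall(text)))
        for name, pattern in PATTERNS.items()
    }


sample = ("Contact jane.doe@example.org or jane.doe@example.org "
          "by 03/15/2024. Call 555-123-4567.")
print(json.dumps(extract_fields(sample), indent=2))
```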
Save results in the requested format:
- Markdown (.md) - default for text content
- Plain text (.txt) - for simple text extraction

For document collections:
# Process all PDFs in a directory
for pdf in /path/to/docs/*.pdf; do
    name=$(basename "$pdf" .pdf)
    pdftotext "$pdf" "/path/to/output/${name}.txt"
done
For large collections, track progress:
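One way to track progress is a small manifest file that lists finished inputs, so an interrupted run can resume where it stopped. This is a minimal sketch under assumptions: the manifest format (a JSON list of file names), the function names, and the `convert` callback are all hypothetical.

```python
import json
import subprocess
from pathlib import Path


def load_done(manifest: Path) -> set[str]:
    """Read the set of already-processed file names (empty if no manifest)."""
    if manifest.exists():
        return set(json.loads(manifest.read_text()))
    return set()


def process_collection(src: Path, dst: Path, manifest: Path, convert) -> list[str]:
    """Convert every PDF in src not yet in the manifest; return failed names."""
    dst.mkdir(parents=True, exist_ok=True)
    done = load_done(manifest)
    pdfs = sorted(src.glob("*.pdf"))
    failed = []
    for i, pdf in enumerate(pdfs, start=1):
        if pdf.name in done:
            continue  # finished in an earlier run
        try:
            convert(pdf, dst / f"{pdf.stem}.txt")
        except Exception as exc:  # keep going; report failures at the end
            failed.append(pdf.name)
            print(f"[{i}/{len(pdfs)}] FAILED {pdf.name}: {exc}")
            continue
        done.add(pdf.name)
        manifest.write_text(json.dumps(sorted(done)))  # checkpoint per file
        print(f"[{i}/{len(pdfs)}] ok {pdf.name}")
    return failed


def pdftotext_convert(pdf: Path, out: Path) -> None:
    """Example converter: shell out to pdftotext."""
    subprocess.run(["pdftotext", str(pdf), str(out)], check=True)
```

A run might look like `process_collection(Path("docs"), Path("output"), Path("done.json"), pdftotext_convert)`; running it again skips files already recorded in the manifest.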
After extraction, verify: