document-processing | project | ClaudePluginHub

Skill

document-processing

This skill should be used when the user says "process documents", "extract text from PDF", "OCR this document", "convert PDF to markdown", "extract emails from documents", "parse document", "document conversion", "batch OCR", "extract structured data from PDF", "read PDF", "extract tables from PDF", "convert Word document", "convert docx to markdown", or wants to extract, convert, or process documents and scanned images.

$ npx claudepluginhub neuromechanist/research-skills --plugin project

Popularity

Parent stars

26

Parent forks

5

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/project:document-processing

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Extract, convert, and structure content from PDFs, images, and other document formats. Handles OCR, text extraction, markdown conversion, email extraction, and structured data output.

Supporting Files

references/ocr-configuration.md
references/pdf-tools.md

SKILL.md

155 lines · ~1.2k tokens


Stats

Language: Python

Parent stars: 26

Parent forks: 5

Maintenance: Excellent

Last Commit: Apr 4, 2026


Document Processing

Extract, convert, and structure content from PDFs, images, and other document formats. Handles OCR, text extraction, markdown conversion, email extraction, and structured data output.

When to Use

Converting scanned documents to searchable text
Extracting text from PDFs (native or scanned)
Converting documents to markdown for further processing
Extracting emails, addresses, or other structured data from documents
Batch processing document collections

Processing Pipeline

Step 1: Identify Document Type

Determine the processing approach:

| Input | Method | Tool |
|---|---|---|
| Native PDF (has text layer) | Direct extraction | `pdftotext`, `pymupdf` |
| Scanned PDF (images only) | OCR | Mistral OCR API, `tesseract` |
| Image files (PNG, JPG, TIFF) | OCR | Mistral OCR API, `tesseract` |
| Word documents (.docx) | Conversion | `python-docx`, `pandoc` |
| HTML | Conversion | `pandoc`, `beautifulsoup4` |
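
The table can be read as a lookup from input type to method and tools. The following is a minimal sketch, assuming the file extension is a reliable hint and that the caller already knows whether a PDF has a text layer. `ROUTES`, `IMAGE_EXTS`, and `choose_method` are hypothetical names used for illustration, not part of the skill.

```python
from pathlib import Path

# Mirrors the table: input type -> (method, candidate tools).
ROUTES = {
    "native_pdf": ("Direct extraction", ["pdftotext", "pymupdf"]),
    "scanned_pdf": ("OCR", ["Mistral OCR API", "tesseract"]),
    "image": ("OCR", ["Mistral OCR API", "tesseract"]),
    "docx": ("Conversion", ["python-docx", "pandoc"]),
    "html": ("Conversion", ["pandoc", "beautifulsoup4"]),
}

# Assumption: these extensions cover the PNG, JPG, and TIFF inputs in the table.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def choose_method(path: str, has_text_layer: bool = True) -> tuple[str, list[str]]:
    """Return (method, tools) for a file; has_text_layer only matters for PDFs."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return ROUTES["native_pdf" if has_text_layer else "scanned_pdf"]
    if ext in IMAGE_EXTS:
        return ROUTES["image"]
    if ext == ".docx":
        return ROUTES["docx"]
    if ext in {".html", ".htm"}:
        return ROUTES["html"]
    raise ValueError(f"Unsupported document type: {ext or path}")
```

Calling `choose_method("scan.pdf", has_text_layer=False)` selects the OCR row, while an unknown extension raises `ValueError` so unsupported files are reported rather than silently skipped.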

Detection:

# Check if PDF has text content
pdftotext input.pdf - | head -20
# If output is empty or garbled, it's a scanned PDF -> use OCR
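
The "empty or garbled" judgment can be approximated in code. The sketch below is a heuristic under stated assumptions: the function name `looks_scanned`, the thresholds (`min_chars=50`, `min_readable_ratio=0.7`), and the punctuation set are illustrative choices, not values from the skill, and should be tuned on real documents.

```python
def looks_scanned(text: str, min_chars: int = 50, min_readable_ratio: float = 0.7) -> bool:
    """Heuristic: True when extracted text is too short or too noisy to trust."""
    visible = [c for c in text if not c.isspace()]
    if len(visible) < min_chars:
        return True  # empty or nearly empty output: probably no text layer
    # Count letters, digits, and common punctuation as "readable".
    readable = sum(c.isalnum() or c in ".,;:!?'\"()-/%&@" for c in visible)
    return readable / len(visible) < min_readable_ratio
```

The input would be the output of `pdftotext input.pdf -`, for example captured with `subprocess.run`. Because `str.isalnum` accepts letters from any script, text decoded with the wrong encoding can still pass this check, so a manual look at the first page remains worthwhile.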

Step 2: Extract Content

Native PDF Extraction

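import pymupdf  # PyMuPDF; older releases import it as `fitz`

doc = pymupdf.open("input.pdf")
for page in doc:
    # get_text() accepts "text", "html", "xhtml", "blocks", "words", "dict", and similar;
    # it has no "markdown" mode (Markdown output comes from the separate pymupdf4llm package).
    text = page.get_text("text")
    print(text)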

OCR with Mistral (for scanned documents)

Requires the MISTRAL_API_KEY environment variable. If the key is not set, or when processing must run offline, fall back to tesseract.

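import base64
import mimetypes

import httpx

def ocr_page(image_path: str, api_key: str) -> str:
    """OCR one image with Mistral and return the recognized text as markdown."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    # Guess the MIME type from the file extension instead of assuming PNG.
    mime_type = mimetypes.guess_type(image_path)[0] or "image/png"

    # mistral-ocr-latest is served by the dedicated OCR endpoint, not chat completions.
    # Request and response shapes follow Mistral's OCR documentation; verify them
    # against the current API reference before relying on this.
    response = httpx.post(
        "https://api.mistral.ai/v1/ocr",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "mistral-ocr-latest",
            "document": {
                "type": "image_url",
                "image_url": f"data:{mime_type};base64,{image_data}",
            },
        },
        timeout=120.0,
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    # The OCR endpoint returns one markdown string per page.
    return "\n\n".join(page["markdown"] for page in response.json()["pages"])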

OCR with Tesseract (offline fallback)

# Single page
tesseract input.png output -l eng --oem 3 --psm 6

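# PDF to text via tesseract: render pages to PNG, OCR each page, then concatenate
# 300 DPI usually gives tesseract better input than the pdftoppm default of 150 DPI
pdftoppm -r 300 -png input.pdf page
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done
cat page-*.txt > output.txt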
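
The choice between the two OCR backends can be made explicit. This is a minimal sketch, assuming the only signal is whether `MISTRAL_API_KEY` is set; `select_ocr_backend`, `tesseract_command`, and `ocr_with_tesseract` are hypothetical helper names, and the tesseract flags are the ones from the single-page command above.

```python
import os
import subprocess
from typing import Mapping


def select_ocr_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return "mistral" when an API key is configured, otherwise "tesseract"."""
    return "mistral" if env.get("MISTRAL_API_KEY") else "tesseract"


def tesseract_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build the tesseract invocation; "stdout" prints text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", "3", "--psm", "6"]


def ocr_with_tesseract(image_path: str, lang: str = "eng") -> str:
    """Run tesseract locally and return the recognized text (requires tesseract on PATH)."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout
```

Passing the environment as a parameter keeps the selection logic easy to test without touching the real process environment.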

Step 3: Structure Output

Convert extracted text to structured formats:

Markdown cleanup

Fix OCR artifacts (broken words, spurious line breaks)
Reconstruct tables from aligned text
Identify headers from font size/weight changes
Preserve list formatting
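
Two of these cleanup steps, repairing broken words and spurious line breaks while leaving list items alone, can be sketched with the standard library. The function name `clean_ocr_text` and the list-marker pattern are assumptions for illustration. Table reconstruction and heading detection need layout information and are not attempted here.

```python
import re

# Assumption: list items start with -, *, + or a number followed by . or )
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def clean_ocr_text(text: str) -> str:
    """Rejoin hyphenated words and merge wrapped lines, keeping lists and paragraphs."""
    # "docu-\nment" -> "document". Caveat: a genuine compound split at a line
    # end ("well-\nknown") loses its hyphen too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    out: list[str] = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            out.append("")  # blank line marks a paragraph break
        elif out and out[-1] and not LIST_ITEM.match(line) and not LIST_ITEM.match(out[-1]):
            out[-1] += " " + stripped  # continuation of the current paragraph
        else:
            out.append(stripped)  # first line of a paragraph, or a list item
    return "\n".join(out)
```

For example, `clean_ocr_text("The docu-\nment was\nscanned.\n\n- item one\n- item two")` returns `"The document was scanned.\n\n- item one\n- item two"`.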

Structured data extraction

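import re

# `text` is the string extracted in Step 2.

# Extract emails. Each domain label must follow a dot, so a sentence-ending
# period after an address is not captured as part of it.
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)

# Extract numeric dates such as 3/14/2024 or 14-03-24 (day/month order is not disambiguated)
dates = re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text)

# Extract 10-digit phone numbers (US-style, without country code or parentheses)
phones = re.findall(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text)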

Step 4: Output

Save results in the requested format:

Markdown (.md) - default for text content
JSON - for structured data extraction
Plain text (.txt) - for simple text extraction
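
One way to honor the requested format is to dispatch on the output file extension. This is a minimal sketch, assuming the extension is how the format is requested; `save_result` is a hypothetical helper, not part of the skill.

```python
import json
from pathlib import Path


def save_result(content, out_path: str) -> Path:
    """Write content as JSON for .json paths, or as UTF-8 text for .md and .txt."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        path.write_text(json.dumps(content, indent=2, ensure_ascii=False), encoding="utf-8")
    elif suffix in {".md", ".txt"}:
        path.write_text(str(content), encoding="utf-8")
    else:
        raise ValueError(f"Unsupported output format: {suffix or out_path}")
    return path
```

A call such as `save_result({"emails": emails}, "out/result.json")` would store structured data, while `save_result(text, "out/result.md")` stores the extracted text.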

Batch Processing

For document collections:

# Process all PDFs in a directory
mkdir -p /path/to/output
for pdf in /path/to/docs/*.pdf; do
  [ -e "$pdf" ] || continue  # skip the literal pattern when no PDFs match
  name=$(basename "$pdf" .pdf)
  pdftotext "$pdf" "/path/to/output/${name}.txt"
done

For large collections, track progress:

Create a manifest of input files
Process each file, recording success/failure
Report summary (processed, failed, skipped)
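
The three tracking steps can be sketched as a single loop. This assumes a file counts as "skipped" when its output already exists, which is one possible convention and not something the skill specifies. `process_collection` and the `process_one` callback are hypothetical names, and `process_one` stands in for whichever extraction function is used.

```python
from pathlib import Path
from typing import Callable


def process_collection(inputs: list[str], out_dir: str,
                       process_one: Callable[[Path], str]) -> dict:
    """Process each input file and record success, failure, or skip per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest: dict[str, str] = {}  # input path -> status string
    for name in inputs:
        src = Path(name)
        dest = out / f"{src.stem}.txt"
        if dest.exists():
            manifest[name] = "skipped"  # assumption: an earlier run finished this file
            continue
        try:
            dest.write_text(process_one(src), encoding="utf-8")
            manifest[name] = "processed"
        except Exception as exc:  # keep going; one bad file should not stop the batch
            manifest[name] = f"failed: {exc}"
    summary = {"processed": 0, "failed": 0, "skipped": 0}
    for status in manifest.values():
        summary[status.split(":")[0]] += 1
    return {"manifest": manifest, "summary": summary}
```

The returned `manifest` can be written to disk as JSON so that an interrupted run can be inspected or resumed.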

Quality Checks

After extraction, verify:

Text is readable (no garbled encoding)
Tables preserve their structure
No pages were skipped
Special characters render correctly
Headings and sections are identified
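
Some of these checks can be automated per page. The sketch below assumes the extracted text is available as one string per page and covers only empty pages and undecodable characters. `quality_report` is a hypothetical helper. Table structure and heading detection still need a manual look.

```python
def quality_report(pages: list[str]) -> dict:
    """Flag pages that look skipped or badly decoded; pages[i] is the text of page i + 1."""
    # A page with no visible text may have been skipped or may need OCR.
    empty = [i + 1 for i, text in enumerate(pages) if not text.strip()]
    # U+FFFD is the replacement character decoders emit for bytes they cannot interpret.
    garbled = [i + 1 for i, text in enumerate(pages) if "\ufffd" in text]
    return {
        "page_count": len(pages),
        "empty_pages": empty,
        "garbled_pages": garbled,
        "ok": not empty and not garbled,
    }
```

Comparing `page_count` with the page count reported by the PDF tool gives a quick confirmation that no pages were dropped.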

Additional Resources