This skill should be used when the user asks to "extract text from PDF", "convert PDF to text", "parse PDF", "read PDF contents", "extract data from documents", "batch PDF extraction", "PDF to markdown", "OCR PDF", "get text from PDF files", "I have a PDF", "can you read this PDF", "what's in this PDF", "summarize this PDF", "open PDF file", "extract from [filename].pdf", or needs to process PDF documents for data extraction. Handles single-file extraction, batch processing, and OCR for scanned documents with automatic backend selection.
From pdf-extractor
npx claudepluginhub ahundt/autorun --plugin pdf-extractor
This skill uses the workspace's default tool permissions.
references/backends.md
Extract text and structured data from PDF documents using a multi-backend approach with automatic fallback.
This skill provides PDF text extraction with 9 different backends, automatic GPU detection, and intelligent backend selection. The extraction system tries backends in order until one succeeds, producing markdown output optimized for further processing.
To extract text from PDFs:
Single file extraction (installed CLI - recommended):
extract-pdfs /path/to/document.pdf
Output: Creates document.md in the same directory.
Batch extraction (directory):
extract-pdfs /path/to/pdfs/ /path/to/output/
Output: Creates .md files for all PDFs in output directory.
Custom output file:
extract-pdfs document.pdf output.md
Specific backends:
extract-pdfs document.pdf --backends markitdown pdfplumber
List available backends:
extract-pdfs --list-backends
Output: Shows available backends and GPU status.
If the extract-pdfs CLI isn't installed, install it first (recommended):
# Install as global UV tool (from repo root):
cd "${CLAUDE_PLUGIN_ROOT}/../.." && uv tool install --force --editable plugins/pdf-extractor
extract-pdfs --list-backends # verify
Or use these fallback methods without installing:
# uv run (recommended fallback — no install required):
uv run --project "${CLAUDE_PLUGIN_ROOT}" python -m pdf_extraction document.pdf
# Standalone script execution
python "${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/cli.py" document.pdf
Specify backends in any order with --backends. The system tries each in order, stopping on first success:
# Tables first, then general extraction
extract-pdfs document.pdf --backends pdfplumber markitdown pdfminer
# Scanned documents: vision-based first
extract-pdfs scanned.pdf --backends marker docling markitdown
# Most permissive fallback order (handles problematic PDFs)
extract-pdfs document.pdf --backends pdfminer pypdf2 markitdown
# Single backend only (no fallback)
extract-pdfs document.pdf --backends markitdown
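The try-in-order behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plugin's implementation: `try_backends`, the toy backend functions, and the rule that empty output counts as a failure are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a real backend: takes a PDF path and returns
# extracted text, or raises if it cannot handle the file.
Backend = Callable[[str], str]


def try_backends(
    pdf_path: str,
    order: List[str],
    registry: Dict[str, Backend],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each named backend in order and stop at the first success.

    Returns (backend_used, text, errors); the first two are None when
    every backend fails.
    """
    errors: List[str] = []
    for name in order:
        backend = registry.get(name)
        if backend is None:
            errors.append(f"{name}: not available")
            continue
        try:
            text = backend(pdf_path)
        except Exception as exc:  # record the failure and move on
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # assumption: empty output counts as a failure
            return name, text, errors
        errors.append(f"{name}: empty output")
    return None, None, errors


def _fails(path: str) -> str:
    raise ValueError("cannot parse")


def _works(path: str) -> str:
    return f"# Text from {path}"


# Toy registry: one backend that fails, one that succeeds.
toy_registry = {"pdfplumber": _fails, "markitdown": _works}
used, text, errors = try_backends(
    "document.pdf", ["pdfplumber", "markitdown", "pdfminer"], toy_registry
)
print(used)    # markitdown
print(errors)  # ['pdfplumber: cannot parse']
```

Because the loop returns on the first success, backends listed later (here `pdfminer`) are never invoked, which is why putting the most suitable backend first matters.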
For systems without GPU, the recommended backend order:
1. markitdown - Microsoft's lightweight converter (MIT, fast, no models)
2. pdfplumber - Excellent for tables (MIT)
3. pdfminer - Pure Python, reliable (MIT)
4. pypdf2 - Basic extraction, always available (BSD-3)

For systems with CUDA-enabled GPU:
1. docling - IBM layout analysis (MIT, ~500MB models)
2. marker - Vision-based, best for scanned docs (GPL-3.0, ~1GB models)

| Backend | License | Models | Best For | Speed |
|---|---|---|---|---|
| markitdown | MIT | None | General text, forms | Fast |
| pdfplumber | MIT | None | Tables, structured data | Fast |
| pdfminer | MIT | None | Simple text documents | Fast |
| pypdf2 | BSD-3 | None | Basic extraction | Fast |
| docling | MIT | ~500MB | Layout analysis | Medium |
| marker | GPL-3.0 | ~1GB | Scanned documents | Slow |
| pymupdf4llm | AGPL-3.0 | None | LLM-optimized output | Fast |
| pdfbox | Apache-2.0 | None | Tables (Java-based) | Medium |
| pdftotext | System | None | Simple text (CLI) | Fast |
| Document Type | Recommended Backend(s) | Why |
|---|---|---|
| Digital text PDF (default) | markitdown, pdfplumber | Fast, accurate |
| PDF with tables/invoices | pdfplumber, pdfbox | Best table structure |
| Complex layouts/columns | docling (GPU) | Layout analysis |
| Scanned documents/images | marker, docling (GPU) | OCR/vision required |
| Insurance policies/forms | markitdown, pdfplumber | Handles form fields |
| Academic papers | docling | Equations, figures |
| Maximum compatibility | pdfminer, pypdf2 | Fewest dependencies |
| Commercial use required | markitdown, pdfplumber | MIT license |
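The selection table above can be turned into a small lookup that builds a `--backends` argument. This is a rough sketch: the dictionary keys, the `backend_args` helper, and the choice to drop model-based backends when no GPU is present are illustrative assumptions, not behavior of the plugin.

```python
from typing import Dict, List

# Transcribed from the document-type table; the keys are hypothetical
# short names, not identifiers used by the plugin.
RECOMMENDED: Dict[str, List[str]] = {
    "digital_text": ["markitdown", "pdfplumber"],
    "tables": ["pdfplumber", "pdfbox"],
    "complex_layout": ["docling"],
    "scanned": ["marker", "docling"],
    "forms": ["markitdown", "pdfplumber"],
    "academic": ["docling"],
    "max_compat": ["pdfminer", "pypdf2"],
    "commercial": ["markitdown", "pdfplumber"],
}

# Backends the tables list with downloadable models (GPU-oriented).
GPU_BACKENDS = {"docling", "marker"}
# CPU order recommended earlier for systems without a GPU.
CPU_FALLBACK = ["markitdown", "pdfplumber", "pdfminer", "pypdf2"]


def backend_args(doc_type: str, has_gpu: bool) -> List[str]:
    """Build an ordered backend list for a document type.

    Assumption: without a GPU the model-based backends are dropped, and
    the CPU order is always appended as a fallback.
    """
    chosen = list(RECOMMENDED.get(doc_type, RECOMMENDED["digital_text"]))
    if not has_gpu:
        chosen = [b for b in chosen if b not in GPU_BACKENDS]
    for backend in CPU_FALLBACK:
        if backend not in chosen:
            chosen.append(backend)
    return chosen


print("extract-pdfs invoice.pdf --backends " + " ".join(backend_args("tables", has_gpu=True)))
# extract-pdfs invoice.pdf --backends pdfplumber pdfbox markitdown pdfminer pypdf2
```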
To use the extraction library directly in Python code:
from pdf_extraction import extract_single_pdf, pdf_to_txt, detect_gpu_availability
# Check available backends
gpu_info = detect_gpu_availability()
print(f"Recommended backends: {gpu_info['recommended_backends']}")
# Extract single file
result = extract_single_pdf(
    input_file='/path/to/document.pdf',
    output_file='/path/to/output.md',
    backends=['markitdown', 'pdfplumber']
)
if result['success']:
    print(f"Extracted with {result['backend_used']}")
    print(f"Quality metrics: {result['quality_metrics']}")
# Batch extract directory
output_files, metadata = pdf_to_txt(
    input_dir='/path/to/pdfs/',
    output_dir='/path/to/output/',
    resume=True,  # Skip already-extracted files
    return_metadata=True
)
Every extraction returns metadata for quality assessment:
{
    'success': True,
    'backend_used': 'markitdown',
    'extraction_time_seconds': 2.5,
    'output_size_bytes': 15234,
    'quality_metrics': {
        'char_count': 15234,
        'line_count': 450,
        'word_count': 2800,
        'table_markers': 12,   # Count of | (tables)
        'has_structure': True  # Has markdown structure
    },
    'encrypted': False,
    'error': None
}
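To make the `quality_metrics` fields concrete, here is one way such numbers could be computed from the markdown output. It is a sketch, not the code in `utils.py`: the `quality_metrics` function below is hypothetical, and in particular the definition of `has_structure` (a heading or at least one `|`) is an assumption.

```python
import re
from typing import Dict, Union


def quality_metrics(markdown: str) -> Dict[str, Union[int, bool]]:
    """Compute simple size and structure metrics for extracted markdown."""
    lines = markdown.splitlines()
    table_markers = markdown.count("|")  # every pipe character is counted
    # Assumption: "structure" means an ATX heading or any table marker.
    has_heading = any(re.match(r"#{1,6} ", line) for line in lines)
    return {
        "char_count": len(markdown),
        "line_count": len(lines),
        "word_count": len(markdown.split()),  # whitespace-separated tokens
        "table_markers": table_markers,
        "has_structure": has_heading or table_markers > 0,
    }


sample = "# Policy\n\n| Item | Limit |\n|---|---|\n| Fire | 100 |\n"
metrics = quality_metrics(sample)
print(metrics["line_count"])     # 5
print(metrics["table_markers"])  # 9
```

A very low `char_count` or `word_count` for a multi-page PDF is a useful hint that the file is scanned and needs an OCR-capable backend.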
The system detects encrypted PDFs and reports them:
if result['encrypted']:
    print("PDF is password-protected")
Encrypted PDFs cannot be extracted without the password.
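When processing many files, it can help to separate encrypted PDFs from ordinary failures so they can be handled differently. The sketch below assumes a mapping from file name to a metadata dict shaped like the one shown earlier; the `triage` helper and that mapping shape are hypothetical, since the exact structure returned by batch extraction is not documented here.

```python
from typing import Dict, List


def triage(results: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group file names into ok / encrypted / failed using metadata fields."""
    groups: Dict[str, List[str]] = {"ok": [], "encrypted": [], "failed": []}
    for name, meta in sorted(results.items()):
        if meta.get("encrypted"):
            groups["encrypted"].append(name)  # needs a password, do not retry
        elif meta.get("success"):
            groups["ok"].append(name)
        else:
            groups["failed"].append(name)  # candidate for another backend order
    return groups


example = {
    "a.pdf": {"success": True, "encrypted": False, "error": None},
    "b.pdf": {"success": False, "encrypted": True, "error": "encrypted"},
    "c.pdf": {"success": False, "encrypted": False, "error": "all backends failed"},
}
print(triage(example))
# {'ok': ['a.pdf'], 'encrypted': ['b.pdf'], 'failed': ['c.pdf']}
```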
When all backends fail:
Retry with --backends pdfminer pypdf2 (most permissive).
To continue interrupted batch extraction:
extract-pdfs /path/to/pdfs/ /path/to/output/
The resume=True default skips already-extracted files.
To force re-extraction:
extract-pdfs /path/to/pdfs/ --no-resume
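The resume behavior amounts to skipping inputs whose output already exists. A minimal sketch follows, assuming that "already extracted" means a non-empty `.md` file with the same stem is present in the output directory; the `pending_pdfs` helper is hypothetical and the plugin may use a different check.

```python
import tempfile
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: Path, output_dir: Path, resume: bool = True) -> List[Path]:
    """Return the PDFs that still need extraction.

    Assumption: a PDF counts as done when a non-empty .md file with the
    same stem exists in output_dir.
    """
    pdfs = sorted(input_dir.glob("*.pdf"))
    if not resume:
        return pdfs  # equivalent in spirit to --no-resume
    todo = []
    for pdf in pdfs:
        target = output_dir / f"{pdf.stem}.md"
        if target.exists() and target.stat().st_size > 0:
            continue
        todo.append(pdf)
    return todo


with tempfile.TemporaryDirectory() as tmp:
    src, out = Path(tmp, "pdfs"), Path(tmp, "output")
    src.mkdir()
    out.mkdir()
    for name in ("a.pdf", "b.pdf"):
        (src / name).write_bytes(b"%PDF-1.4")
    (out / "a.md").write_text("# already extracted\n")
    print([p.name for p in pending_pdfs(src, out)])                # ['b.pdf']
    print([p.name for p in pending_pdfs(src, out, resume=False)])  # ['a.pdf', 'b.pdf']
```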
For PDFs with tables, prioritize:
extract-pdfs document.pdf --backends pdfplumber markitdown
The output will contain markdown tables when detected:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Data | Data | Data |
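For downstream processing, tables like the one above can be read back into rows with plain string handling. The helper below is a hypothetical sketch: it handles simple pipe tables only and does not account for escaped pipes or empty edge cells.

```python
from typing import List


def parse_markdown_table(lines: List[str]) -> List[List[str]]:
    """Return table rows as lists of cell strings, skipping the separator row."""
    rows: List[List[str]] = []
    for line in lines:
        stripped = line.strip()
        if not (stripped.startswith("|") and stripped.endswith("|")):
            continue  # not a table line
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        # The separator row consists only of dashes and optional colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


table = [
    "| Column1 | Column2 | Column3 |",
    "|---------|---------|---------|",
    "| Data | Data | Data |",
]
rows = parse_markdown_table(table)
print(rows)
# [['Column1', 'Column2', 'Column3'], ['Data', 'Data', 'Data']]
```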
Location: ${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/
| File | Purpose |
|---|---|
| __init__.py | Package exports (extract_single_pdf, pdf_to_txt, etc.) |
| __main__.py | Support for python -m pdf_extraction |
| cli.py | CLI entry point with argparse |
| backends.py | BackendExtractor base class + 9 backend implementations |
| extractors.py | extract_single_pdf(), pdf_to_txt() functions |
| utils.py | GPU detection, quality metrics, encryption check |
| Component | Location | Purpose |
|---|---|---|
| BackendExtractor | backends.py:35-123 | Base class with Template Method pattern |
| DoclingExtractor | backends.py:130-142 | IBM Docling backend (MIT, GPU) |
| MarkerExtractor | backends.py:145-158 | Vision-based marker backend (GPL-3.0, GPU) |
| MarkItDownExtractor | backends.py:161-173 | Microsoft MarkItDown (MIT, CPU) |
| PdfplumberExtractor | backends.py:244-253 | Table-focused extraction (MIT) |
| PdfminerExtractor | backends.py:219-226 | Pure Python fallback (MIT) |
| Pypdf2Extractor | backends.py:229-241 | Basic extraction, always available (BSD-3) |
| BACKEND_REGISTRY | backends.py:279-292 | Dict mapping backend names to factories |
| detect_gpu_availability() | utils.py:9-40 | Auto-detect GPU and recommend backends |
| extract_single_pdf() | extractors.py:13-80 | Extract one PDF with backend fallback |
| pdf_to_txt() | extractors.py:83-170 | Batch extract directory with resume |
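The table mentions a Template Method base class and a registry that maps backend names to factories. The following sketch shows how those two ideas fit together, with the converter built only on first use. All names here (`SketchExtractor`, `UppercaseExtractor`, `SKETCH_REGISTRY`) are hypothetical and the real `BackendExtractor` may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict


class SketchExtractor(ABC):
    """Hypothetical base class: extract() is the template method."""

    def __init__(self) -> None:
        self._converter: Any = None  # created lazily on first use

    @abstractmethod
    def _create_converter(self) -> Any:
        """Build the (possibly expensive) backend object."""

    @abstractmethod
    def _convert(self, converter: Any, pdf_path: str) -> str:
        """Run the backend on one file and return its text."""

    def extract(self, pdf_path: str) -> str:
        # Fixed skeleton: lazy setup, then the backend-specific step.
        if self._converter is None:
            self._converter = self._create_converter()
        return self._convert(self._converter, pdf_path).strip()


class UppercaseExtractor(SketchExtractor):
    """Toy backend used only to exercise the pattern."""

    created = 0  # counts how many converters were built

    def _create_converter(self) -> Callable[[str], str]:
        UppercaseExtractor.created += 1
        return str.upper

    def _convert(self, converter: Any, pdf_path: str) -> str:
        return converter(f" text of {pdf_path} ")


# Registry of factories: calling the value yields a fresh extractor.
SKETCH_REGISTRY: Dict[str, Callable[[], SketchExtractor]] = {
    "uppercase": UppercaseExtractor,
}

extractor = SKETCH_REGISTRY["uppercase"]()
print(extractor.extract("a.pdf"))  # TEXT OF A.PDF
extractor.extract("b.pdf")
print(UppercaseExtractor.created)  # 1
```

Deferring converter creation keeps startup cheap when a heavy backend is listed but an earlier one succeeds, since its models are never loaded.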
Key implementation details:
- extractors.py:55-78 - Tries each backend in order, stops on first success
- backends.py:77-79 - Converters created only when first used
- utils.py:43-76 - Calculates char/word/table counts
For detailed backend documentation and advanced patterns:
- references/backends.md - Detailed backend comparison and selection guide
Working examples in the insurance analysis that prompted this skill:
The extraction system handles errors gracefully:
All errors are captured in metadata rather than raising exceptions.
Core dependencies (always available):
- pdfminer.six - Pure Python PDF parser
- pdfplumber - Table-aware extraction
- PyPDF2 - Basic PDF operations
- tqdm - Progress bars
Optional dependencies:
- markitdown - Microsoft multi-format converter
- docling - IBM document processor (GPU-accelerated)
- marker-pdf - Vision-based extraction (GPU-accelerated)
- pymupdf4llm - LLM-optimized output
- pdfbox - Java-based extraction
Install all dependencies:
uv pip install "markitdown>=0.1.0" "pdfplumber>=0.10.0" "pdfminer.six>=20221105" "PyPDF2>=3.0.0" tqdm
For GPU backends:
uv pip install docling marker-pdf
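After installing, a quick way to see which Python backends are importable in the current environment is to probe for their modules without importing them. The mapping below is an assumption about import names (for example, `pdfminer.six` installs the `pdfminer` package and `marker-pdf` is assumed to install `marker`); `importable` is a hypothetical helper, and `extract-pdfs --list-backends` remains the authoritative check.

```python
import importlib.util
from typing import Dict

# Assumed import names per backend; these can differ from pip package names.
BACKEND_MODULES: Dict[str, str] = {
    "markitdown": "markitdown",
    "pdfplumber": "pdfplumber",
    "pdfminer": "pdfminer",  # installed as pdfminer.six
    "pypdf2": "PyPDF2",
    "docling": "docling",
    "marker": "marker",      # installed as marker-pdf (assumed import name)
}


def importable(modules: Dict[str, str]) -> Dict[str, bool]:
    """Report whether each top-level module can be found, without importing it."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


for backend, ok in importable(BACKEND_MODULES).items():
    print(f"{backend}: {'available' if ok else 'missing'}")
```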
extract-pdfs: command not found
# Install as global UV tool from repo root:
cd plugins/pdf-extractor && uv tool install --force --editable . && cd ../..
extract-pdfs --list-backends # verify
ModuleNotFoundError: No module named 'pdf_extraction' (or 'markitdown', 'pdfplumber')
# Re-install with all base dependencies:
cd plugins/pdf-extractor && uv tool install --force --editable . && cd ../..
# Or install explicitly:
uv pip install "markitdown>=0.1.0" "pdfplumber>=0.10.0" "pdfminer.six>=20221105" "PyPDF2>=3.0.0" tqdm
# Requires PyTorch; install GPU extras:
cd plugins/pdf-extractor && uv tool install --force --editable ".[gpu]" && cd ../..
extract-pdfs --list-backends # verify gpu backends appear
# Note: docling downloads ~500MB models on first use; marker downloads ~1GB
# Scanned PDFs require OCR (GPU backends):
extract-pdfs scanned.pdf --backends marker docling
# If GPU unavailable, try pdftotext (system tool; it reads an embedded text layer only and does not perform OCR):
brew install poppler # macOS
# apt install poppler-utils # Ubuntu/Debian
extract-pdfs scanned.pdf --backends pdftotext
# Install correct package (name has .six suffix):
uv pip install "pdfminer.six>=20221105"
# Import is still: from pdfminer.high_level import extract_text (no .six)
# API changed significantly in 0.1.0; ensure correct version:
uv pip install "markitdown>=0.1.0"