Use this skill when the user wants to convert PDF files to Markdown using Docling. This includes converting single PDFs, batch-converting directories of PDFs, extracting structured content (headings, tables, lists) from academic papers, converting scanned documents with OCR, or configuring Docling pipeline options for table detection and image extraction.
npx claudepluginhub bauhaus-infau/infau-skill-base --plugin docs
This skill uses the workspace's default tool permissions.
Convert PDF documents to structured Markdown using the [Docling](https://docling-project.github.io/docling/) library. Docling uses ML models to understand document layout, preserving headings, tables, lists, and reading order. For advanced pipeline configuration, OCR settings, and performance tuning, see reference.md.
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown)
Install: pip install docling
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("paper.pdf")
# Export as Markdown
md = result.document.export_to_markdown()
# Page count
print(f"{len(result.document.pages)} pages")
from pathlib import Path
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
output_dir = Path("markdown_outputs")
output_dir.mkdir(exist_ok=True)
for pdf in Path("papers/").glob("*.pdf"):
    result = converter.convert(str(pdf))
    md = result.document.export_to_markdown()
    (output_dir / f"{pdf.stem}.md").write_text(md, encoding="utf-8")
    print(f"Converted {pdf.name}")
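The loop above stops at the first PDF that raises an exception. A minimal sketch of a more forgiving batch run follows; it is an illustration, not part of the skill's scripts. The names `batch_convert` and `convert_one` are hypothetical: `convert_one` stands for any callable that takes a PDF path and returns Markdown text.

```python
from pathlib import Path
from typing import Callable, Iterable


def batch_convert(
    pdf_paths: Iterable[Path],
    convert_one: Callable[[Path], str],  # hypothetical: returns Markdown for one PDF
    output_dir: Path,
) -> tuple[list[Path], dict[Path, str]]:
    """Convert each PDF, collecting failures instead of aborting the batch."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    failed: dict[Path, str] = {}
    for pdf in sorted(pdf_paths):
        try:
            markdown = convert_one(pdf)
        except Exception as exc:  # broad on purpose: one bad file should not stop the run
            failed[pdf] = f"{type(exc).__name__}: {exc}"
            continue
        target = output_dir / f"{pdf.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failed
```

With Docling, `convert_one` could be `lambda p: converter.convert(str(p)).document.export_to_markdown()`, reusing the `converter` and the calls from the loop above.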
from pathlib import Path
from docling.document_converter import DocumentConverter
def convert_pdf(pdf_path, output_path=None):
    pdf_path = Path(pdf_path)
    if output_path is None:
        output_path = pdf_path.with_suffix(".md")
    converter = DocumentConverter()
    result = converter.convert(str(pdf_path))
    Path(output_path).write_text(
        result.document.export_to_markdown(), encoding="utf-8"
    )
    return output_path
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
# Use accurate table detection (slower but better for complex tables)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("report.pdf")
from pathlib import Path
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("paper.pdf")
image_dir = Path("images")
image_dir.mkdir(exist_ok=True)
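for element, _level in result.document.iterate_items():
    if hasattr(element, 'image') and element.image is not None:
        # self_ref is a JSON pointer such as "#/pictures/0"; flatten it into a safe file name
        safe_name = element.self_ref.lstrip("#/").replace("/", "_")
        img_path = image_dir / f"{safe_name}.png"
        element.image.pil_image.save(str(img_path))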
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("scanned_document.pdf")
md = result.document.export_to_markdown()
# Single file
python scripts/pdf_to_markdown.py report.pdf
# Single file with custom output
python scripts/pdf_to_markdown.py report.pdf output.md
# Batch convert a directory
python scripts/pdf_to_markdown.py ./papers/
# Batch with custom output directory
python scripts/pdf_to_markdown.py ./papers/ ./converted/
# With image extraction
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --with-images ./images
# With OCR for scanned documents
python scripts/pdf_to_markdown_advanced.py scanned.pdf output.md --ocr
# With accurate table detection
python scripts/pdf_to_markdown_advanced.py report.pdf report.md --accurate-tables
# Combine options
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --ocr --accurate-tables --with-images ./img
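The source of `pdf_to_markdown_advanced.py` is not shown here, so the following is only a sketch of how the flags used above could be parsed with `argparse`. The function name `build_parser`, the help texts, and the attribute names are assumptions; only the flag spellings come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the command lines shown above."""
    parser = argparse.ArgumentParser(description="Convert a PDF to Markdown (sketch).")
    parser.add_argument("input", help="PDF file to convert")
    parser.add_argument("output", nargs="?", default=None, help="Markdown file to write")
    parser.add_argument("--ocr", action="store_true", help="enable OCR for scanned documents")
    parser.add_argument("--accurate-tables", action="store_true",
                        help="use the slower, more accurate table mode")
    parser.add_argument("--with-images", metavar="DIR", default=None,
                        help="directory for extracted images")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["paper.pdf", "paper.md", "--ocr", "--with-images", "./img"]
)
```

Each flag would then map onto one of the pipeline settings shown earlier: `args.ocr` to `do_ocr`, `args.accurate_tables` to `TableFormerMode.ACCURATE`, and `args.with_images` to `generate_picture_images`.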
Academic papers with sections, references, figures, and tables:
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("paper.pdf")
md = result.document.export_to_markdown()
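Slides often have sparse text with images, so enable image extraction:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)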
result = converter.convert("slides.pdf")
For documents where table accuracy is critical:
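from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
# Convert and check tables
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)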
result = converter.convert("data_report.pdf")
md = result.document.export_to_markdown()
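The snippet above mentions checking tables but does not show a check. One possible heuristic, offered as an assumption and not as a Docling feature, is to scan the exported Markdown for table rows whose cell count differs from the first row of the same table. The helper names `count_cells` and `ragged_table_rows` are hypothetical, and cells containing escaped pipes (`\|`) would be miscounted.

```python
def count_cells(line: str) -> int:
    """Count cells in a pipe-delimited Markdown table row."""
    row = line.strip().removeprefix("|").removesuffix("|")
    return len(row.split("|"))


def ragged_table_rows(markdown: str) -> list[int]:
    """Return 1-based line numbers of table rows whose width differs from the table's first row."""
    problems: list[int] = []
    expected = None
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("|"):
            cells = count_cells(line)
            if expected is None:
                expected = cells  # the first row of a table sets the expected width
            elif cells != expected:
                problems.append(number)
        else:
            expected = None  # any non-table line ends the current table
    return problems
```

Calling `ragged_table_rows(md)` on the exported text gives a quick list of lines worth inspecting by eye; an empty list means only that the widths are consistent, not that the content is right.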
For scanned documents or image-based PDFs:
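from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)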
result = converter.convert("scanned.pdf")
| Task | Code |
|---|---|
| Basic conversion | DocumentConverter().convert("file.pdf") |
| Export to Markdown | result.document.export_to_markdown() |
| Page count | len(result.document.pages) |
| Batch convert | Loop over Path("dir").glob("*.pdf") |
| Accurate tables | TableFormerMode.ACCURATE in pipeline options |
| Enable OCR | pipeline_options.do_ocr = True |
| Extract images | pipeline_options.generate_picture_images = True |
| CLI single file | python scripts/pdf_to_markdown.py file.pdf |
| CLI batch | python scripts/pdf_to_markdown.py ./dir/ |
| CLI advanced | python scripts/pdf_to_markdown_advanced.py in.pdf out.md --ocr |
| Error | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: docling | Docling not installed | pip install docling |
| File not found | Invalid path | Check file path exists |
| Invalid PDF | Corrupted or non-PDF file | Verify file is a valid PDF |
| Write permission error | Cannot write output | Use a different output directory |
| Slow first conversion | Model download on first use | Wait for ~500 MB model download; models are cached after |
| GPU not detected | CUDA/PyTorch not configured | Falls back to CPU automatically; install torch with CUDA for GPU |
| UnicodeEncodeError | Platform default encoding (for example cp1252 on Windows) cannot represent a character in the output | Pass encoding='utf-8' when writing files |
| MemoryError | PDF too large or complex | Process fewer pages at a time; close other applications |
| Table detection issues | Complex or borderless tables | Use TableFormerMode.ACCURATE |
| Poor OCR quality | Low-resolution scan | Use higher-DPI source; try EasyOCR backend (see reference.md) |
| Windows symlink warnings | HuggingFace cache issue | Set HF_HUB_DISABLE_SYMLINKS_WARNING=1 (scripts do this automatically) |
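Several rows in the table above (missing file, non-PDF input, unwritable output) can be caught before any model is loaded. The following is a minimal sketch under stated assumptions: the function name `preflight` is hypothetical, and the `%PDF-` header test is a cheap heuristic that does not prove the file is well formed.

```python
import os
from pathlib import Path

# Silence the Windows symlink warning from the HuggingFace cache; set before models load.
os.environ.setdefault("HF_HUB_DISABLE_SYMLINKS_WARNING", "1")


def preflight(pdf_path: str, output_path: str) -> list[str]:
    """Return the problems found before conversion; an empty list means none detected."""
    problems: list[str] = []

    pdf = Path(pdf_path)
    if not pdf.is_file():
        problems.append(f"File not found: {pdf}")
    else:
        with pdf.open("rb") as handle:
            # PDF files carry a "%PDF-" header near the start of the file.
            if b"%PDF-" not in handle.read(1024):
                problems.append(f"Not a PDF (no %PDF- header): {pdf}")

    out_dir = Path(output_path).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"Output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {out_dir}")

    return problems
```

Running `preflight("report.pdf", "report.md")` before calling the converter turns the most common failures into readable messages instead of stack traces.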