Skill

ocrmypdf

Adds searchable OCR text layer to scanned PDFs using OCRmyPDF and Tesseract. Supports 100+ languages. Use for OCRing PDFs, converting images to searchable PDFs, or extracting text from scans.

Python

Bash

automation

developer-tools

Install

npx claudepluginhub partme-ai/full-stack-skills --plugin t2ui-skills

Tool Access

This skill uses the workspace's default tool permissions.

Stats

Stars: 328

Forks: 62

Last Commit: Mar 23, 2026


OCRmyPDF — Core OCR Guide

Overview

[OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) adds an OCR text layer to scanned PDF files so that they can be searched and their text copied. It uses the Tesseract OCR engine, supports 100+ languages, produces PDF/A output by default, and by default distributes work across all available CPU cores.

For image processing (deskew, rotate, clean), see the ocrmypdf-image skill. For optimization and PDF/A options, see ocrmypdf-optimize. For batch/Docker/scripting, see ocrmypdf-batch. For Python API and plugins, see ocrmypdf-api.

Installation

One-liner installs (recommended)

OS	Command
Debian / Ubuntu	`apt install ocrmypdf`
Fedora	`dnf install ocrmypdf tesseract-osd`
macOS (Homebrew)	`brew install ocrmypdf`
macOS (MacPorts)	`port install ocrmypdf`
FreeBSD	`pkg install py-ocrmypdf`
Snap	`snap install ocrmypdf`

pip install (latest version)

# After installing system dependencies (Tesseract, Ghostscript)
pip install ocrmypdf

Verify

ocrmypdf --version
ocrmypdf --help

Requirements

Python 3.11+
Tesseract 4.1.1+ (OCR engine)
Ghostscript 9.54+ or pypdfium2 (PDF rasterization)
Optional: jbig2enc (compression), pngquant (image optimization), unpaper (cleaning)
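The requirements above can be checked before a first run. The following is a minimal Python sketch, not part of OCRmyPDF itself, that reports which helper programs are missing from PATH. The executable names (`tesseract`, `gs`, `jbig2`, `pngquant`, `unpaper`) are assumptions about a typical Linux or macOS install and may differ on your platform.

```python
import shutil

# Executable names are assumptions for a typical Linux/macOS install.
# "gs" (Ghostscript) may be absent if pypdfium2 handles rasterization.
CORE = ["tesseract", "gs"]
OPTIONAL = ["jbig2", "pngquant", "unpaper"]


def missing_tools(names, which=shutil.which):
    """Return the names in `names` that `which` cannot find on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, names in (("core", CORE), ("optional", OPTIONAL)):
        missing = missing_tools(names)
        print(f"missing {label} tools: {', '.join(missing) or 'none'}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.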

Quick Start

# Basic OCR — input scanned PDF, output searchable PDF/A
ocrmypdf input.pdf output.pdf

# OCR an image file directly
ocrmypdf --image-dpi 300 scan.png output.pdf

# OCR in place (only overwrites on success)
ocrmypdf myfile.pdf myfile.pdf
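When these commands are driven from a script, the argument list can be built in Python and passed to `subprocess.run`. The sketch below is a hypothetical helper, not an OCRmyPDF API: it adds `--image-dpi` when the input looks like an image, based on a set of file extensions that is an assumption of this example. The flag mainly matters for images that carry no DPI metadata.

```python
from pathlib import Path

# Extensions treated as images here are an assumption for this sketch.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def quick_command(input_path, output_path, image_dpi=300):
    """Build an ocrmypdf argument list, adding --image-dpi for image inputs."""
    cmd = ["ocrmypdf"]
    if Path(input_path).suffix.lower() in IMAGE_SUFFIXES:
        cmd += ["--image-dpi", str(image_dpi)]
    cmd += [str(input_path), str(output_path)]
    return cmd
```

A call such as `subprocess.run(quick_command("scan.png", "output.pdf"), check=True)` would then run the same command as the image example above.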

Language Support

OCRmyPDF uses Tesseract language packs. Install them for your OS:

# Debian / Ubuntu
apt-cache search tesseract-ocr          # List all language packs
apt install tesseract-ocr-chi-sim       # Chinese Simplified
apt install tesseract-ocr-fra           # French

# macOS (Homebrew)
brew install tesseract-lang             # All languages

# Fedora
dnf search tesseract-langpack
dnf install tesseract-langpack-ita      # Italian

Using languages

# Single language
ocrmypdf -l fra document.pdf output.pdf

# Multiple languages
ocrmypdf -l eng+fra bilingual.pdf output.pdf

# Chinese Simplified + English
ocrmypdf -l chi_sim+eng chinese-doc.pdf output.pdf

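Note: Use Tesseract's three-letter language codes (generally ISO 639-3, for example `fra` rather than `fr`). Some packs add a suffix, as in `chi_sim`. Run `tesseract --list-langs` to see which codes are installed.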
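The `-l` value is a list of language codes joined with `+`. As a rough illustration, the hypothetical helper below builds that value and rejects codes that do not look like the ones shown above. The validation pattern is an assumption of this sketch, not Tesseract's own rule, and it does not check whether a pack is actually installed.

```python
import re

# Three lowercase letters, optionally followed by a suffix (as in chi_sim).
# This pattern is an assumption based on the codes shown in this guide.
_CODE = re.compile(r"^[a-z]{3}(_[a-z_]+)?$")


def language_arg(codes):
    """Join language codes with '+' for `ocrmypdf -l`."""
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        if not _CODE.match(code):
            raise ValueError(f"unexpected language code: {code!r}")
    return "+".join(codes)
```

For example, `language_arg(["chi_sim", "eng"])` yields the `chi_sim+eng` value used above, while a two-letter code such as `fr` raises `ValueError`.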

OCR Modes

Default mode (stop on existing text)

# No mode flag: OCR the scanned pages. If any page already contains text,
# OCRmyPDF stops with an error; choose --force-ocr, --redo-ocr, or --skip-text.
ocrmypdf input.pdf output.pdf

Force OCR (`--force-ocr` or `-m force`)

# Rasterize and OCR all pages, even those with existing text
ocrmypdf --force-ocr input.pdf output.pdf
# v17+ short form:
ocrmypdf -m force input.pdf output.pdf

Redo OCR (`--redo-ocr` or `-m redo`)

# Replace existing OCR without rasterizing (preserves quality)
ocrmypdf --redo-ocr input.pdf output.pdf
# v17+ short form:
ocrmypdf -m redo input.pdf output.pdf

Skip text (`--skip-text` or `-m skip`)

# Skip pages with any text, only OCR blank/image pages
ocrmypdf --skip-text input.pdf output.pdf
# v17+ short form:
ocrmypdf -m skip input.pdf output.pdf

No OCR (image processing only)

# Apply image processing / PDF/A conversion without OCR
ocrmypdf --ocr-engine none input.pdf output.pdf
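The mode flags are mutually exclusive choices, so a wrapper script can treat the mode as a single setting. The sketch below is a hypothetical helper that maps a mode name to the long-form flags listed in this section. The mode names mirror the `-m` values above, and `default` adds no flag.

```python
# Long-form flags as listed in this guide; "default" adds no flag.
MODE_FLAGS = {
    "default": [],
    "force": ["--force-ocr"],
    "redo": ["--redo-ocr"],
    "skip": ["--skip-text"],
}


def ocr_command(input_path, output_path, mode="default"):
    """Build an ocrmypdf argument list for one of the OCR modes."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", *MODE_FLAGS[mode], input_path, output_path]
```

Using the long-form flags rather than `-m` keeps the command usable on versions older than v17.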

Page Selection

# OCR only specific pages
ocrmypdf --pages 1,3,5-10 input.pdf output.pdf

# OCR only the first page, minimal changes elsewhere
ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
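To make the `--pages` syntax concrete, here is a small Python sketch that expands a spec such as `1,3,5-10` into the page numbers it selects. It is an illustration of the syntax shown above, not OCRmyPDF's own parser, and it handles only single pages and closed ranges.

```python
def expand_pages(spec):
    """Expand a spec such as '1,3,5-10' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1,3,5-10"))  # prints [1, 3, 5, 6, 7, 8, 9, 10]
```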

Output Types

# PDF/A (default) — for archival
ocrmypdf --output-type pdfa input.pdf output.pdf

# Standard PDF
ocrmypdf --output-type pdf input.pdf output.pdf

# Auto (v17+) — speculative PDF/A, falls back to standard PDF
ocrmypdf --output-type auto input.pdf output.pdf

# No output PDF — only produce sidecar text
ocrmypdf --output-type none --sidecar text.txt input.pdf -

Sidecar Text File

# Produce a companion text file with OCR text
ocrmypdf --sidecar output.txt input.pdf output.pdf
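Once a sidecar file exists, it can be searched per page. The sketch below assumes that pages in the sidecar text are separated by form-feed characters (`\f`), which is how Tesseract's plain-text output marks page ends as far as I know. Verify this against a sidecar from your own version before relying on the page numbers. Both function names are hypothetical.

```python
def split_sidecar_pages(text):
    """Split sidecar text into per-page strings, assuming form-feed separators."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final element; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def pages_containing(text, needle):
    """Return 1-based page numbers whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [
        number
        for number, page in enumerate(split_sidecar_pages(text), start=1)
        if needle in page.lower()
    ]
```

In practice the text would come from the sidecar file, for example `Path("output.txt").read_text(encoding="utf-8")`.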

Metadata

# Set output PDF metadata
ocrmypdf --title "My Document" --author "Author Name" --subject "Subject" input.pdf output.pdf
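Metadata values often contain spaces or quotes, which makes shell quoting error-prone. One way around that, sketched here with a hypothetical helper, is to build the flags as a Python argument list and pass it to `subprocess.run` without a shell. Only the three flags shown above are covered.

```python
def metadata_args(title=None, author=None, subject=None):
    """Return ocrmypdf metadata flags for the values that were provided."""
    args = []
    for flag, value in (("--title", title), ("--author", author), ("--subject", subject)):
        if value:
            args += [flag, value]
    return args


command = ["ocrmypdf", *metadata_args(title="My Document", author="Author Name"),
           "input.pdf", "output.pdf"]
```

Because each value is its own list element, no extra quoting is needed when the list is passed to `subprocess.run(command, check=True)`.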

Parallel Processing

# Use 4 CPU cores (default: all available)
ocrmypdf --jobs 4 input.pdf output.pdf

# Single-threaded
ocrmypdf --jobs 1 input.pdf output.pdf
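On a shared machine it can be useful to leave some cores free rather than use all of them. The following sketch computes a `--jobs` value from the detected core count. The helper name and the idea of reserving one core are assumptions of this example, not OCRmyPDF behavior.

```python
import os


def pick_jobs(reserve=1, cpu_count=None):
    """Return a --jobs value that leaves `reserve` cores free, never below 1."""
    total = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, total - reserve)


command = ["ocrmypdf", "--jobs", str(pick_jobs()), "input.pdf", "output.pdf"]
```

The `cpu_count` parameter only overrides detection, which is handy for testing or for containers where the reported count is misleading.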

Common Recipes

Make a scanned PDF searchable

ocrmypdf scanned.pdf searchable.pdf

Convert image to searchable PDF

ocrmypdf --image-dpi 300 scan.jpg output.pdf

OCR a multilingual document

ocrmypdf -l eng+deu+fra multilingual.pdf output.pdf

Re-OCR with newer Tesseract

ocrmypdf --redo-ocr old-ocr.pdf updated.pdf

Strip all text/OCR from a PDF

ocrmypdf --ocr-engine none --force-ocr input.pdf stripped.pdf

Quick Reference

Task	Command
Basic OCR	`ocrmypdf input.pdf output.pdf`
Specify language	`ocrmypdf -l fra input.pdf output.pdf`
Multiple languages	`ocrmypdf -l eng+fra input.pdf output.pdf`
Force re-OCR all pages	`ocrmypdf --force-ocr input.pdf output.pdf`
Replace existing OCR	`ocrmypdf --redo-ocr input.pdf output.pdf`
Skip pages with text	`ocrmypdf --skip-text input.pdf output.pdf`
Specific pages only	`ocrmypdf --pages 1,3,5-10 input.pdf output.pdf`
Output standard PDF	`ocrmypdf --output-type pdf input.pdf output.pdf`
Extract text sidecar	`ocrmypdf --sidecar text.txt input.pdf output.pdf`
Image to PDF	`ocrmypdf --image-dpi 300 image.png output.pdf`
In-place OCR	`ocrmypdf myfile.pdf myfile.pdf`
Set metadata	`ocrmypdf --title "Title" input.pdf output.pdf`
Parallel jobs	`ocrmypdf --jobs 4 input.pdf output.pdf`

Troubleshooting

"Tesseract not found": Install Tesseract and ensure it's on PATH.
Poor OCR quality: Check language packs (-l), try --deskew (see ocrmypdf-image), or --oversample 300.
"Input file has text": Use --force-ocr, --redo-ocr, or --skip-text as appropriate.
Large output files: See ocrmypdf-optimize for --optimize levels and JBIG2.
Signed PDFs: Use --invalidate-digital-signatures to override (signatures will be invalidated).
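In scripts, the process exit code can be turned into a hint that mirrors the list above. The mapping below covers only a few codes and reflects my understanding of OCRmyPDF's documented exit codes. Treat the numbers as assumptions and confirm them against the documentation for your installed version.

```python
# Subset of exit codes as I understand them from OCRmyPDF's documentation.
# These numbers are assumptions to verify against your installed version.
EXIT_HINTS = {
    0: "success",
    2: "input file problem: check the path and that it is a valid PDF or image",
    3: "missing dependency: install Tesseract (and Ghostscript) and check PATH",
    6: "page already has text: choose --force-ocr, --redo-ocr, or --skip-text",
    8: "encrypted PDF: remove the encryption before running OCR",
}


def explain_exit(code):
    """Return a short troubleshooting hint for an ocrmypdf exit code."""
    return EXIT_HINTS.get(code, f"exit code {code}: check the ocrmypdf log output")
```

A typical use would be `print(explain_exit(subprocess.run(command).returncode))` after running a command built elsewhere.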
