Extract text from scanned PDFs and images using OCR. Uses PaddleOCR (primary, highest accuracy) with pytesseract fallback. Includes image preprocessing (deskewing, contrast enhancement, noise reduction), confidence scoring, and multi-language support. Use when: (1) a user has scanned PDFs or images that need text extraction, (2) a user says 'OCR this document', 'extract text from this scan', 'read this scanned PDF', or 'process this image', (3) document-summarizer reports empty extraction and suggests OCR, (4) a user has a directory of scanned legal documents to batch-process, (5) a user needs to make scanned documents searchable.
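The preprocessing steps named above (deskewing, contrast enhancement, noise reduction) happen inside the skill's bundled scripts. As an illustrative sketch of what such a pipeline typically looks like with OpenCV, not the plugin's actual code:

```python
import cv2
import numpy as np

img = cv2.imread("page.png")  # placeholder scan
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Contrast enhancement via adaptive histogram equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
gray = clahe.apply(gray)

# Noise reduction
gray = cv2.fastNlMeansDenoising(gray, None, 10)

# Deskew: estimate the dominant text angle from the ink pixels
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
angle = -(90 + angle) if angle < -45 else -angle  # convention varies by OpenCV version
h, w = gray.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("page_clean.png", deskewed)
```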
Install:
npx claudepluginhub jdrodriguez/legal-toolkit --plugin legal-toolkit
This skill uses the workspace's default tool permissions.
You are a legal document processing specialist.
Extract text from scanned PDFs and images using high-accuracy OCR with confidence scoring.
Supported formats: .pdf (scanned), .png, .jpg, .jpeg, .tiff, .tif, .bmp
Input modes: single file OR a directory of scanned documents
Scripts are in the scripts/ subdirectory of this skill's directory.
Resolve SKILL_DIR as the absolute path of this SKILL.md file's parent directory. Use SKILL_DIR in all script paths below.
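If a helper ever needs the same path programmatically, the resolution is one line of pathlib (a hypothetical illustration; the agent normally resolves this itself):

```python
from pathlib import Path

skill_md = Path("/path/to/legal-toolkit/skills/SKILL.md")  # placeholder location
SKILL_DIR = skill_md.resolve().parent
print(SKILL_DIR / "scripts" / "ocr_process.py")
```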
For directories with 10+ files, delegate result analysis to subagents to avoid context overflow from summarizing many OCR outputs simultaneously; a batching sketch follows the prompt template below.
Spawn one agent per batch (subagent_type: "general-purpose"). Substitute the resolved $OUTPUT_DIR path literally into each agent's prompt; do not pass shell variable names. Each agent's prompt:
"Read the OCR output files for these documents: {list of files}. For each, note: pages processed, average confidence, any low-confidence pages. Write a summary to
$OUTPUT_DIR/batch_{N}_summary.md."
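A minimal sketch of grouping files into batches before spawning agents (the batch size, glob pattern, and output naming are assumptions, not the plugin's actual logic):

```python
from pathlib import Path

OUTPUT_DIR = Path("/resolved/output/dir")  # substitute the real resolved path
BATCH_SIZE = 5                             # assumed; keep each agent's load small

ocr_outputs = sorted(OUTPUT_DIR.glob("*.json"))  # assumed per-document naming
batches = [ocr_outputs[i:i + BATCH_SIZE]
           for i in range(0, len(ocr_outputs), BATCH_SIZE)]

for n, batch in enumerate(batches, start=1):
    files = ", ".join(f.name for f in batch)
    # Each prompt is handed to a general-purpose subagent via the Task tool.
    prompt = (
        f"Read the OCR output files for these documents: {files}. "
        "For each, note: pages processed, average confidence, any "
        f"low-confidence pages. Write a summary to {OUTPUT_DIR}/batch_{n}_summary.md."
    )
    print(prompt)
```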
Validate that the input is a supported format (.pdf, .png, .jpg, .jpeg, .tiff, .tif, .bmp), then check that the OCR dependencies are installed:
python3 "$SKILL_DIR/scripts/check_dependencies.py"
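check_dependencies.py ships with the skill and is not reproduced here; a rough sketch of the kind of check it plausibly performs (the exact checks and output format are assumptions):

```python
import importlib.util
import json
import shutil

# Hypothetical recreation: report which OCR backends are available.
status = {
    "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    "tesseract_binary": shutil.which("tesseract") is not None,
}
print(json.dumps(status, indent=2))
```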
Determine the output directory:
Single file: OUTPUT_DIR="{parent_dir}/{filename_without_ext}_ocr_output"
Directory input: OUTPUT_DIR="{directory_path}/_ocr_output"
Then create it: mkdir -p "$OUTPUT_DIR"
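The same derivation in Python terms, as a sketch (the agent normally substitutes these paths directly):

```python
from pathlib import Path

def output_dir_for(input_path: str) -> Path:
    """Derive the OCR output directory for a file or directory input."""
    p = Path(input_path)
    out = p / "_ocr_output" if p.is_dir() else p.parent / f"{p.stem}_ocr_output"
    out.mkdir(parents=True, exist_ok=True)
    return out
```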
python3 "$SKILL_DIR/scripts/ocr_process.py" \
--input "<file_or_directory_path>" \
--output-dir "$OUTPUT_DIR" \
--engine paddleocr \
--language en \
--dpi 300
The script prints JSON to stdout with the processing results. Progress messages go to stderr.
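A sketch of how the invocation and stream handling might look from Python (the paths are placeholders; read the JSON schema from the actual output rather than assuming it):

```python
import json
import subprocess

SKILL_DIR = "/resolved/skill/dir"   # resolved earlier from SKILL.md
input_path = "/path/to/scan.pdf"    # file or directory
output_dir = "/path/to/scan_ocr_output"

result = subprocess.run(
    ["python3", f"{SKILL_DIR}/scripts/ocr_process.py",
     "--input", input_path, "--output-dir", output_dir,
     "--engine", "paddleocr", "--language", "en", "--dpi", "300"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)  # structured processing results
print(result.stderr)                # progress messages, kept separate
```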
Engine options:
paddleocr (default): highest accuracy, best for legal documents
tesseract: lighter-weight fallback; requires the tesseract system package
Read the script's JSON output. Present to the user:
Read $OUTPUT_DIR/extraction_report.txt and present the key findings.
If any pages have confidence below 0.70, warn the user:
"Pages X, Y, Z had low OCR confidence. The extracted text may contain errors. Consider re-scanning these pages at higher resolution."
Present these options to the user:
"Summarize the extracted text with /legal-toolkit:doc-summary."
If the user wants a .docx report, use the npm docx package to generate a Word document containing:
Output file: {OUTPUT_DIR}/{original_filename}_ocr_report.docx
Anti-hallucination rules (include in ALL subagent prompts):
Unverifiable claims → [VERIFY], unknown authority → [CASE LAW RESEARCH NEEDED], open questions → [NEEDS INVESTIGATION].
QA review: After completing all work but BEFORE presenting to the user, invoke /legal-toolkit:qa-check on the work/output directory. Do not skip this step.
To confirm the bundled scripts are present, run: ls "$SKILL_DIR/scripts/"