Extract text from scanned PDFs and images using OCR. Uses PaddleOCR (primary, highest accuracy) with pytesseract fallback. Includes image preprocessing (deskewing, contrast enhancement, noise reduction), confidence scoring, and multi-language support. Use when: (1) a user has scanned PDFs or images that need text extraction, (2) a user says 'OCR this document', 'extract text from this scan', 'read this scanned PDF', or 'process this image', (3) document-summarizer reports empty extraction and suggests OCR, (4) a user has a directory of scanned legal documents to batch-process, (5) a user needs to make scanned documents searchable.
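The preprocessing steps named above (deskewing, contrast enhancement, noise reduction) happen inside the skill's bundled scripts. As an illustrative sketch of what such a pipeline typically looks like with OpenCV, not the plugin's actual code:

```python
import cv2
import numpy as np

img = cv2.imread("page.png")  # placeholder scan
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Contrast enhancement via adaptive histogram equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
gray = clahe.apply(gray)

# Noise reduction
gray = cv2.fastNlMeansDenoising(gray, None, 10)

# Deskew: estimate the dominant text angle from the ink pixels
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
angle = -(90 + angle) if angle < -45 else -angle  # convention varies by OpenCV version
h, w = gray.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("page_clean.png", deskewed)
```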
Install:
npx claudepluginhub jdrodriguez/legal-toolkit --plugin legal-toolkit
This skill uses the workspace's default tool permissions.
You are a legal document processing specialist.
Extract text from scanned PDFs and images using high-accuracy OCR with confidence scoring.
Supported formats: .pdf (scanned), .png, .jpg, .jpeg, .tiff, .tif, .bmp
Input modes: single file OR a directory of scanned documents
Scripts are in the scripts/ subdirectory of this skill's directory.
Resolve SKILL_DIR as the absolute path of this SKILL.md file's parent directory. Use SKILL_DIR in all script paths below.
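If a helper ever needs the same path programmatically, the resolution is one line of pathlib (a hypothetical illustration; the agent normally resolves this itself):

```python
from pathlib import Path

skill_md = Path("/path/to/legal-toolkit/skills/SKILL.md")  # placeholder location
SKILL_DIR = skill_md.resolve().parent
print(SKILL_DIR / "scripts" / "ocr_process.py")
```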
For directories with 10+ files, delegate result analysis to subagents to avoid context overflow from summarizing many OCR outputs simultaneously; a batching sketch follows the prompt template below.
Spawn one agent per batch (subagent_type: "general-purpose"). Substitute the resolved $OUTPUT_DIR path literally into each agent's prompt; do not pass shell variable names. Each agent's prompt:
"Read the OCR output files for these documents: {list of files}. For each, note: pages processed, average confidence, any low-confidence pages. Write a summary to
$OUTPUT_DIR/batch_{N}_summary.md."
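A minimal sketch of grouping files into batches before spawning agents (the batch size, glob pattern, and output naming are assumptions, not the plugin's actual logic):

```python
from pathlib import Path

OUTPUT_DIR = Path("/resolved/output/dir")  # substitute the real resolved path
BATCH_SIZE = 5                             # assumed; keep each agent's load small

ocr_outputs = sorted(OUTPUT_DIR.glob("*.json"))  # assumed per-document naming
batches = [ocr_outputs[i:i + BATCH_SIZE]
           for i in range(0, len(ocr_outputs), BATCH_SIZE)]

for n, batch in enumerate(batches, start=1):
    files = ", ".join(f.name for f in batch)
    # Each prompt is handed to a general-purpose subagent via the Task tool.
    prompt = (
        f"Read the OCR output files for these documents: {files}. "
        "For each, note: pages processed, average confidence, any "
        f"low-confidence pages. Write a summary to {OUTPUT_DIR}/batch_{n}_summary.md."
    )
    print(prompt)
```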
Validate that the input is a supported format (.pdf, .png, .jpg, .jpeg, .tiff, .tif, .bmp), then check that the OCR dependencies are installed:
python3 "$SKILL_DIR/scripts/check_dependencies.py"
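check_dependencies.py ships with the skill and is not reproduced here; a rough sketch of the kind of check it plausibly performs (the exact checks and output format are assumptions):

```python
import importlib.util
import json
import shutil

# Hypothetical recreation: report which OCR backends are available.
status = {
    "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    "tesseract_binary": shutil.which("tesseract") is not None,
}
print(json.dumps(status, indent=2))
```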
Determine the output directory:
Single file: OUTPUT_DIR="{parent_dir}/{filename_without_ext}_ocr_output"
Directory input: OUTPUT_DIR="{directory_path}/_ocr_output"
Then create it: mkdir -p "$OUTPUT_DIR"
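The same derivation in Python terms, as a sketch (the agent normally substitutes these paths directly):

```python
from pathlib import Path

def output_dir_for(input_path: str) -> Path:
    """Derive the OCR output directory for a file or directory input."""
    p = Path(input_path)
    out = p / "_ocr_output" if p.is_dir() else p.parent / f"{p.stem}_ocr_output"
    out.mkdir(parents=True, exist_ok=True)
    return out
```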
python3 "$SKILL_DIR/scripts/ocr_process.py" \
--input "<file_or_directory_path>" \
--output-dir "$OUTPUT_DIR" \
--engine paddleocr \
--language en \
--dpi 300
The script prints JSON to stdout with the processing results. Progress messages go to stderr.
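A sketch of how the invocation and stream handling might look from Python (the paths are placeholders; read the JSON schema from the actual output rather than assuming it):

```python
import json
import subprocess

SKILL_DIR = "/resolved/skill/dir"   # resolved earlier from SKILL.md
input_path = "/path/to/scan.pdf"    # file or directory
output_dir = "/path/to/scan_ocr_output"

result = subprocess.run(
    ["python3", f"{SKILL_DIR}/scripts/ocr_process.py",
     "--input", input_path, "--output-dir", output_dir,
     "--engine", "paddleocr", "--language", "en", "--dpi", "300"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)  # structured processing results
print(result.stderr)                # progress messages, kept separate
```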
Engine options:
paddleocr (default): highest accuracy, best for legal documents
tesseract: lighter-weight fallback; requires the tesseract system package
Read the script's JSON output. Present to the user:
Read $OUTPUT_DIR/extraction_report.txt and present the key findings.
If any pages have confidence below 0.70, warn the user:
"Pages X, Y, Z had low OCR confidence. The extracted text may contain errors. Consider re-scanning these pages at higher resolution."
Present these options to the user:
"Summarize the extracted text with /legal-toolkit:doc-summary."
If the user wants a .docx report, use the npm docx package to generate a Word document containing:
Output file: {OUTPUT_DIR}/{original_filename}_ocr_report.docx
Anti-hallucination rules (include in ALL subagent prompts):
Unverifiable claims → [VERIFY], unknown authority → [CASE LAW RESEARCH NEEDED], open questions → [NEEDS INVESTIGATION].
QA review: After completing all work but BEFORE presenting to the user, invoke /legal-toolkit:qa-check on the work/output directory. Do not skip this step.
To confirm the bundled scripts are present, run: ls "$SKILL_DIR/scripts/"