Skill

vision-ocr

Transcribes hand-written or scanned answer PDFs to markdown. Supports three OCR tiers: Claude native vision, local Ollama Qwen3-VL, and pytesseract fallback. Engine selectable via .course-meta or per-call override.

automation

cli-tools

npx claudepluginhub taewooopark/paideia --plugin paideia

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/paideia:vision-ocr

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- `/grade` needs to convert `answers/*.pdf` → `answers/converted/*.md`

SKILL.md

134 lines · ~1.8k tokens

Similar Skills

answer-processing

Converts hand-written/scanned answer PDFs to markdown via OCR, then grades against reference solutions using strategy-based comparison.

paideia

pdf-ocr

Converts PDF files to markdown using local GLM-OCR via Ollama. Renders each page to image, runs OCR, assembles clean text output. Use for extracting text from PDFs.

1 file4 tools

ac-document-gen

phd-deepread

Processes academic PDFs into structured Obsidian literature notes and 9-node critical-thinking canvases. Useful for researchers who want to deeply read, summarize, or critique papers and save results in Obsidian.

6 tools

phd-deepread

Stats

LanguagePython

Parent stars52

Parent forks2

MaintenanceExcellent

Last CommitMay 24, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Vision-OCR

When to load

/grade needs to convert answers/*.pdf → answers/converted/*.md
Any hand-written / scanned document whose previous tesseract pass was garbled
answer-processing skill's step-2 conversion

Engine choice

.course-meta holds a single line OCR_ENGINE: <engine> written by /paideia:init-course. The grade command reads it and dispatches. Users can override per-call with /paideia:grade --ocr=<engine> [path].

Engine	Default?	How it runs	When to pick it
`claude`	Yes	`pdftoppm` → Claude reads each PNG via the Read tool → synthesizes markdown inline. No external model. No subprocess.	The out-of-the-box path. Nothing to install. Highest fidelity on messy handwriting because Claude vision handles mixed-script (English/Korean) prose with LaTeX well.
`ollama`	opt-in	`python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=ollama <pdf> <md>` — local Qwen3-VL 8B, with an automatic tesseract fall-back if ollama is unreachable. Reads `INTERFACE_LANG` from `.course-meta` to set the prose-language rule.	You want the PDF to never leave the machine and you don't want to burn Claude tokens on OCR. Requires one-time `ollama pull qwen3-vl:8b` (~6 GB).
`tesseract`	opt-in	`python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=tesseract <pdf> <md>` — pytesseract (`eng` for en, `eng+kor` for ko, derived from `.course-meta`).	Zero cloud + no GPU/VRAM budget. Lowest fidelity on handwriting; fine for typed scans.

All three emit answers/converted/<stem>.md with a  /  header comment that lets /grade caveat the confidence.

Tier 0 — Claude native vision (default)

Pipeline (driven by the /grade command, not this script):

answers/<stem>.pdf
  ↓ pdftoppm -r 200 -png <pdf> <tmpdir>/page   # rasterize to PNG per page
  ↓ Claude reads <tmpdir>/page-1.png, page-2.png, ... via the Read tool
  ↓ Claude synthesizes clean MD following the prompt contract below
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, claude-vision (native), N pages -->

The grade command handles the orchestration — rasterize, Read each page, synthesize into one markdown file in a single pass. No standalone driver script is required.

Tier 1 — Ollama Qwen3-VL 8B (opt-in)

answers/<stem>.pdf
  ↓ pdf2image @ 300dpi
  ↓ resize to ≤1200px wide (VLMs dislike huge inputs)
  ↓ base64 JPEG per page
  ↓ [Tier 1a] ollama qwen3-vl:8b
  ↓    (on timeout / ollama down)
  ↓ [Tier 1b] pytesseract (eng or eng+kor, from .course-meta INTERFACE_LANG)  ← auto-fallback inside the same script
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, qwen3-vl:8b @ 300dpi, N pages -->
                <!-- TIER: tesseract fallback -->  (only when 1a bombed)

Entrypoint:

python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=ollama <input.pdf> <output.md>

${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py is the single source of truth. It:

Warms up the VLM with a 1-token /api/generate so the first real page isn't stalled by model load.
Sends each page as a JPEG-encoded base64 image with a language-aware, LaTeX-first prompt (prose-language rule selected by INTERFACE_LANG).
Sets keep_alive: "15m" so the model stays in memory across pages within a session.
On any exception (timeout, connection refused) falls back to pytesseract and marks the file.

Tier 2 — Tesseract explicit

python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=tesseract <input.pdf> <output.md>

Skips ollama entirely. Header: .

Prompt contract (must preserve across Tier 0 + Tier 1)

Whether synthesized by Claude inline (Tier 0) or by Qwen3-VL through this script (Tier 1), the transcription prompt must:

Keep prose in its original language (English, Korean, etc.) — do not translate
Emit math as $...$ / $$...$$
Preserve problem numbering (P1, (1), (a), ...)
NOT grade or interpret — just transcribe
Write [?] for ambiguous glyphs instead of guessing
Skip crossed-out work
Return markdown only, no <think>, no commentary

If you edit the prompt, keep these six clauses — they're what separates useful transcription from hallucination.

Dependencies

All engines need:

poppler binaries (pdftoppm, used by pdf2image). brew install poppler / apt-get install poppler-utils.

Tier 0 (claude): nothing beyond Claude Code itself.

Tier 1 (ollama) extras:

ollama CLI + model qwen3-vl:8b (~6.1 GB). brew install ollama && ollama serve & && ollama pull qwen3-vl:8b.
Python: pdf2image, pytesseract, pillow.

Tier 2 (tesseract) extras:

tesseract + tesseract-lang (or tesseract-ocr-kor on Debian). Python: pdf2image, pytesseract, pillow.

Performance notes

Tier 0 (claude): depends on Claude's per-image processing; typically a few seconds per page with no model-load stall.
Tier 1 (ollama) on Mac M-series: first page ~2–5 min (model load + decode); subsequent pages ~20–60 s.
A 2-page hand-written answer typically takes 3–7 min total on Tier 1, a few seconds on Tier 0, and <10 s on Tier 2 (but fidelity falls off a cliff).
300dpi input is downscaled to 1200px before encoding — keeps the base64 payload under ~500 KB on Tier 1.

Failure modes + fixes

Symptom	Cause	Fix
Tier 0 produces garbage	Scan too dim / skewed / low-res	Re-scan at 300dpi with the page flat, re-run
Tier 1 `timed out` on page 1	first-load stall on cold ollama	re-run; warmup + `keep_alive` should help on 2nd try
Tier 1 empty response / `<think>...` leaks	prompt contract violated	re-check prompt; add "Return ONLY markdown, no "
Tier 1 base64 error / 413	image too large	drop `MAX_IMG_WIDTH` from 1200 → 1000
Tier 1 ollama 404	`qwen3-vl:8b` not pulled	`ollama pull qwen3-vl:8b`
Tier 1 tesseract fallback kept firing	ollama server not running	`ollama serve &`

Anti-patterns

❌ Don't pass base64 via curl -d <arg> — ARG_MAX overflow. Use stdlib urllib with POST body.
❌ Don't send PNG to Qwen3-VL — JPEG q=90 is 5–10× smaller with no impact on VLM accuracy.
❌ Don't ask any tier to grade or solve. That's /grade's job; OCR must stay pure transcription.
❌ Don't trust Tier 1b (silent tesseract fallback) without reading the header — the file comment tells /grade to caveat its verdict.

Integration

Called by /grade via answer-processing skill step 2
Called by /ingest (future) for hand-written lecture notes, if any appear in materials/
Writes to answers/converted/ only; does not modify originals in answers/

vision-ocr

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

vision-ocr

Popularity

Invocation

Context Preview

SKILL.md

Vision-OCR

When to load

Engine choice

Tier 0 — Claude native vision (default)

Tier 1 — Ollama Qwen3-VL 8B (opt-in)

Tier 2 — Tesseract explicit

Prompt contract (must preserve across Tier 0 + Tier 1)

Dependencies

Performance notes

Failure modes + fixes

Anti-patterns

Integration

Similar Skills

Help us improve

Vision-OCR

When to load

Engine choice

Tier 0 — Claude native vision (default)

Tier 1 — Ollama Qwen3-VL 8B (opt-in)

Tier 2 — Tesseract explicit

Prompt contract (must preserve across Tier 0 + Tier 1)

Dependencies

Performance notes

Failure modes + fixes

Anti-patterns

Integration