From paideia
Reads, extracts, converts, merges, splits, and creates PDFs. Includes OCR for scanned/hand-written PDFs using pytesseract and pdf2image.
npx claudepluginhub taewooopark/paideia --plugin paideia
How this skill is triggered (by the user, by Claude, or both):
Slash command
/paideia:pdf
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Load this skill whenever the workflow involves PDF input or output. In the paideia context specifically:
materials/**/*.pdf to markdown in converted/**/*.md (via /ingest)
answers/*.pdf to markdown in answers/converted/*.md (via /grade)
What kind of PDF?
├─ Course material (materials/**/*.pdf) → VISION pipeline (see VISION.md)
│      pdfplumber is unreliable on course content: even "prose-heavy"
│      textbook pages mix in equations, figures, and multi-column layouts
│      that break digital extraction silently. We route everything through
│      vision instead of maintaining a per-category heuristic.
├─ Hand-written answer PDF → vision-ocr skill (see vision-ocr/)
└─ Arbitrary outside-the-plugin PDF → pdfplumber / pypdf / pytesseract,
       per the sections below, case-by-case
Within this plugin, /paideia:ingest routes all materials/**/*.pdf through the vision pipeline. The pdfplumber / pypdf / pytesseract blocks below remain for reference and for ad-hoc PDF work outside the ingest flow (e.g., quick text dumps, PDF merge/split, producing the cheatsheet PDF).
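The routing above can be sketched as a small path-based dispatcher. This is only an illustration, not the plugin's actual code: the function name `route_pdf` and the returned labels are hypothetical, and it assumes hand-written answers are the PDFs sitting directly under answers/.

```python
from pathlib import PurePosixPath


def route_pdf(path: str) -> str:
    """Pick a processing route for a PDF from its project-relative path.

    The labels "vision", "vision-ocr", and "library" are hypothetical names
    for the three branches of the decision tree.
    """
    p = PurePosixPath(path)
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {path}")
    parts = p.parts
    # materials/**/*.pdf: any depth below materials/ goes to the vision pipeline
    if len(parts) >= 2 and parts[0] == "materials":
        return "vision"
    # answers/*.pdf: assumed to be hand-written answer PDFs, one level deep only
    if len(parts) == 2 and parts[0] == "answers":
        return "vision-ocr"
    # everything else: ad-hoc work with pdfplumber / pypdf / pytesseract
    return "library"
```

Matching on path components rather than on file contents keeps the decision cheap and predictable, which fits the stated preference for uniform routing over per-category heuristics.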
import pdfplumber
# Extract text page by page; `or ""` guards against pages that yield no text
with pdfplumber.open("input.pdf") as pdf:
    text_by_page = []
    for page in pdf.pages:
        text_by_page.append(page.extract_text() or "")
full_text = "\n\n---\n\n".join(text_by_page)
Simpler alternative using pypdf:
from pypdf import PdfReader
reader = PdfReader("input.pdf")
full_text = "\n\n".join(p.extract_text() or "" for p in reader.pages)
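When extraction comes back empty for a page, that page is most likely scanned and needs OCR instead. A minimal sketch of that check follows; the helper name `pages_needing_ocr` and the `min_chars` threshold are assumptions for illustration, not part of the skill.

```python
def pages_needing_ocr(text_by_page, min_chars=1):
    """Return 1-based page numbers whose extracted text is too short to trust.

    text_by_page is the per-page list built by the extraction loop; entries
    may be None or whitespace-only when a page has no text layer.
    """
    return [
        page_number
        for page_number, text in enumerate(text_by_page, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

The returned page numbers are 1-based so they can be handed to pdf2image's `convert_from_path(..., first_page=n, last_page=n)` to rasterize only the pages that need OCR. Raising `min_chars` also catches pages where extraction produced only a stray header or page number.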
Install deps once:
pip install --break-system-packages pytesseract pdf2image
# Also needs system tesseract: apt-get install tesseract-ocr poppler-utils
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path("scanned.pdf", dpi=200)
text = ""
for i, image in enumerate(images):
    text += f"\n\n## Page {i+1}\n\n"
    text += pytesseract.image_to_string(image, lang="eng+kor")  # multi-lang
For best OCR quality on math/physics hand-writing, use dpi=300 and consider preprocessing (deskew, binarize) with opencv before OCR.
# Requires: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt
from pypdf import PdfReader, PdfWriter
# Merge
writer = PdfWriter()
for f in ["chap1.pdf", "chap2.pdf"]:
    for page in PdfReader(f).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as out:
    writer.write(out)
# Split into single pages
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as out:
        w.write(out)
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [Paragraph("Title", styles['Title']), Spacer(1, 12)]
# Use <sub> and <super> tags, NEVER Unicode subscripts (they render as black boxes)
story.append(Paragraph("H<sub>2</sub>O and E = mc<super>2</super>", styles['Normal']))
doc.build(story)
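If the source text already contains Unicode subscript or superscript digits, it can be rewritten into the tag form before it reaches Paragraph. The helper below is a hedged sketch: `unicode_scripts_to_markup` is a hypothetical name, and it only handles the digits 0-9, not letters or signs.

```python
import re

# Unicode subscript digits are contiguous: U+2080..U+2089
_SUB = {chr(0x2080 + d): str(d) for d in range(10)}
# Superscript digits are not contiguous: 1, 2, 3 live in Latin-1
_SUP = {"\u2070": "0", "\u00b9": "1", "\u00b2": "2", "\u00b3": "3"}
_SUP.update({chr(0x2070 + d): str(d) for d in range(4, 10)})


def _retag(text, table, tag):
    """Replace each run of characters from `table` with one <tag>...</tag> span."""
    pattern = re.compile("[" + "".join(table) + "]+")

    def repl(match):
        digits = "".join(table[ch] for ch in match.group())
        return f"<{tag}>{digits}</{tag}>"

    return pattern.sub(repl, text)


def unicode_scripts_to_markup(text):
    """Convert Unicode sub/superscript digits to <sub>/<super> markup."""
    return _retag(_retag(text, _SUB, "sub"), _SUP, "super")
```

For example, `Paragraph(unicode_scripts_to_markup(raw), styles['Normal'])` would pass the tagged string to ReportLab. Consecutive digits are grouped into a single tag so that a two-digit exponent stays one span.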
When converting PDF materials to markdown for this project:
Preserve structure. Section headers (##), numbered lists, tables. Do NOT reflow paragraphs — keep line breaks roughly aligned with source for verifiability.
Math formatting. Convert inline math to $...$, display math to $$...$$. If extraction produces garbled LaTeX, mark with [?] and move on — don't guess.
Name convention. materials/lectures/chapter03.pdf → converted/lectures/chapter03.md. Preserve subfolder structure.
Provenance markers. Prepend the output file with a source comment tagging the extraction method:
<!-- SOURCE: materials/<cat>/<stem>.pdf, extracted <YYYY-MM-DD>, method: pdfplumber|vision|ocr -->
For OCR specifically, append: accuracy may vary. Verify math expressions manually.
Idempotence. If converted/X.md already exists and is newer than materials/X.pdf, skip (unless user passes --force).
Default route for all materials/**/*.pdf is the vision pipeline (see VISION.md). pdfplumber was tried as a fast path for prose-heavy material and proved unreliable in practice — even textbook pages silently word-salad when they mix equations, multi-column layouts, or figure captions. Uniform vision routing is simpler and more reliable than per-category heuristics with fallbacks.
Hand-written answer PDFs. Output to answers/converted/<name>.md. Expect garbled math; the grading step handles ambiguity via strategy-matching, not exact algebra.
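The naming, provenance, and idempotence conventions above can be sketched with the standard library alone. The function names here are hypothetical, and the sketch assumes paths are given relative to the project root; it is not the plugin's own implementation.

```python
from datetime import date
from pathlib import Path


def converted_path(pdf_path: Path) -> Path:
    """Map materials/<sub>/<stem>.pdf to converted/<sub>/<stem>.md."""
    relative = pdf_path.relative_to("materials")  # raises ValueError if outside materials/
    return (Path("converted") / relative).with_suffix(".md")


def provenance_comment(pdf_path: Path, method: str, today=None) -> str:
    """Build the source comment that is prepended to the output file."""
    today = today or date.today()
    return (
        f"<!-- SOURCE: {pdf_path.as_posix()}, "
        f"extracted {today.isoformat()}, method: {method} -->"
    )


def needs_conversion(pdf_path: Path, md_path: Path, force: bool = False) -> bool:
    """Skip only when the markdown exists and is newer than the PDF."""
    if force or not md_path.exists():
        return True
    return md_path.stat().st_mtime <= pdf_path.stat().st_mtime
```

Comparing modification times keeps the check stateless: no manifest file is needed, and touching or replacing a PDF is enough to trigger reconversion on the next run.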
Empty text (page.extract_text() returns "") → it's scanned. Fall through to OCR.
Unicode subscripts/superscripts in ReportLab render as black boxes. Use <sub>/<super> XML tags instead.
Encrypted PDFs: run qpdf --password=... --decrypt in.pdf out.pdf first.
Multi-column layouts: try page.extract_text(layout=True) or crop bboxes per column.
Large PDFs: convert_from_path uses a lot of memory. Set dpi=150 for first pass, re-run at 300 only if OCR quality is poor.
Standard install for paideia use:
pip install --break-system-packages pypdf pdfplumber pytesseract pdf2image reportlab
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kor
The Korean language pack (tesseract-ocr-kor) is needed if the user writes solutions in Korean/Hangul.
Full skill at https://github.com/anthropics/skills/tree/main/skills/pdf with REFERENCE.md covering pypdfium2, JavaScript libraries, and FORMS.md covering PDF form filling.