Skill

pdf

Reads, extracts, converts, merges, splits, and creates PDFs. Includes OCR for scanned/hand-written PDFs using pytesseract and pdf2image.

Python

documentation

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/paideia:pdf

User invocable

Model invocable

Inline context

Default effort

Supporting Files

VISION.md

SKILL.md

165 lines · ~1.8k tokens

Stats

Language: Python

Parent stars: 91

Parent forks: 3

Maintenance: Excellent

Last Commit: Jul 11, 2026

PDF Processing Guide

When to use this skill

Load this skill whenever the workflow involves PDF input or output. In the paideia context specifically:

Converting materials/**/*.pdf to markdown in converted/**/*.md (via /ingest)
Converting hand-written answer PDFs in answers/*.pdf to markdown in answers/converted/*.md (via /grade)
OCR for scanned lecture notes, textbook chapters, or hand-written work

Quick decision tree

What kind of PDF?
├─ Course material (materials/**/*.pdf)  → VISION pipeline (see VISION.md)
│                                          pdfplumber is unreliable on course
│                                          content — even "prose-heavy"
│                                          textbook pages mix in equations,
│                                          figures, and multi-column layouts
│                                          that break digital extraction
│                                          silently. We route everything
│                                          through vision instead of
│                                          maintaining a per-category heuristic.
├─ Hand-written answer PDF              → vision-ocr skill (see vision-ocr/)
└─ Arbitrary outside-the-plugin PDF     → pdfplumber / pypdf / pytesseract
                                          per the sections below, case-by-case

Within this plugin, /paideia:ingest routes all materials/**/*.pdf through the vision pipeline. The pdfplumber / pypdf / pytesseract blocks below remain for reference and for ad-hoc PDF work outside the ingest flow (e.g., quick text dumps, PDF merge/split, producing the cheatsheet PDF).
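The routing in the decision tree can be sketched as a small path classifier. This is only an illustration: the function name `choose_route` and the route labels are hypothetical, and it assumes paths are given relative to the project root with forward slashes.

```python
from pathlib import PurePosixPath


def choose_route(pdf_path: str) -> str:
    """Return a route label for a PDF path, mirroring the decision tree.

    The labels ("vision", "vision-ocr", "generic") are illustrative names,
    not identifiers used by the plugin itself.
    """
    path = PurePosixPath(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {pdf_path}")
    parts = path.parts
    if parts[0] == "materials":
        # Course material (materials/**/*.pdf): always the vision pipeline.
        return "vision"
    if len(parts) == 2 and parts[0] == "answers":
        # Hand-written answers (answers/*.pdf): the vision-ocr skill.
        return "vision-ocr"
    # Anything else: pdfplumber / pypdf / pytesseract, case by case.
    return "generic"
```

For example, `choose_route("materials/lectures/chapter03.pdf")` selects the vision route, while a PDF outside both folders falls through to the generic tools.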

Core operations

Text extraction (digital PDF)

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    text_by_page = []
    for page in pdf.pages:
        text_by_page.append(page.extract_text() or "")
full_text = "\n\n---\n\n".join(text_by_page)

Simpler alternative using pypdf:

from pypdf import PdfReader
reader = PdfReader("input.pdf")
full_text = "\n\n".join(p.extract_text() or "" for p in reader.pages)

OCR (scanned or hand-written PDF)

Install deps once:

pip install --break-system-packages pytesseract pdf2image
# Also needs system tesseract: apt-get install tesseract-ocr poppler-utils

import pytesseract
from pdf2image import convert_from_path

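# Render each PDF page to a PIL image (pdf2image calls poppler under the hood)
images = convert_from_path("scanned.pdf", dpi=200)
sections = []
for i, image in enumerate(images, start=1):
    # "eng+kor" requires the Korean language pack (tesseract-ocr-kor)
    page_text = pytesseract.image_to_string(image, lang="eng+kor")  # multi-lang
    sections.append(f"## Page {i}\n\n{page_text}")
text = "\n\n".join(sections)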

For best OCR quality on math/physics hand-writing, use dpi=300 and consider preprocessing (deskew, binarize) with opencv before OCR.

Command-line text extraction (fast path)

# Requires: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt

Merge / split

from pypdf import PdfReader, PdfWriter

# Merge
writer = PdfWriter()
for f in ["chap1.pdf", "chap2.pdf"]:
    for page in PdfReader(f).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as out:
    writer.write(out)

# Split: write each page to its own single-page PDF
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as out:
        w.write(out)

PDF creation (for producing clean cheatsheets)

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [Paragraph("Title", styles['Title']), Spacer(1, 12)]
# Use <sub> and <super> tags, NEVER Unicode subscripts (they render as black boxes)
story.append(Paragraph("H<sub>2</sub>O and E = mc<super>2</super>", styles['Normal']))
doc.build(story)

Course-cram specific conventions

When converting PDF materials to markdown for this project:

Preserve structure. Section headers (##), numbered lists, tables. Do NOT reflow paragraphs — keep line breaks roughly aligned with source for verifiability.
Math formatting. Convert inline math to $...$ , display math to $$...$$. If extraction produces garbled LaTeX, mark with [?] and move on — don't guess.
Name convention. materials/lectures/chapter03.pdf → converted/lectures/chapter03.md. Preserve subfolder structure.
Provenance markers. Prepend the output file with a source comment tagging the extraction method:
```

```
For OCR specifically, append: accuracy may vary. Verify math expressions manually.
Idempotence. If converted/X.md already exists and is newer than materials/X.pdf, skip (unless user passes --force).
Default route for all materials/**/*.pdf is the vision pipeline (see VISION.md). pdfplumber was tried as a fast path for prose-heavy material and proved unreliable in practice — even textbook pages silently word-salad when they mix equations, multi-column layouts, or figure captions. Uniform vision routing is simpler and more reliable than per-category heuristics with fallbacks.
Hand-written answer PDFs. Output to answers/converted/<name>.md. Expect garbled math; the grading step handles ambiguity via strategy-matching, not exact algebra.
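The name convention and the idempotence rule above can be sketched with the standard library alone. The helper names `converted_path` and `needs_conversion` are hypothetical, and the `force` argument stands in for the user's --force flag; how the plugin actually wires this up is not shown in this guide.

```python
from pathlib import Path


def converted_path(pdf_path: Path,
                   materials_root: Path = Path("materials"),
                   converted_root: Path = Path("converted")) -> Path:
    """Map materials/<sub>/<name>.pdf to converted/<sub>/<name>.md."""
    # relative_to raises ValueError if pdf_path is not under materials_root.
    relative = pdf_path.relative_to(materials_root)
    return converted_root / relative.with_suffix(".md")


def needs_conversion(pdf_path: Path, md_path: Path, force: bool = False) -> bool:
    """Return True unless md_path exists and is newer than pdf_path."""
    if force or not md_path.exists():
        return True
    # Skip only when the markdown is strictly newer than the source PDF.
    return md_path.stat().st_mtime <= pdf_path.stat().st_mtime
```

Comparing modification times is a simple heuristic; it will re-convert after a PDF is touched even if its content did not change.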

Error patterns to watch for

Empty extracted text (page.extract_text() returns "" or None) → the page is most likely scanned (image-only). Fall through to OCR.
Unicode subscript/superscript in reportlab → renders as solid black boxes. Use <sub>/<super> XML tags instead.
Protected PDFs → qpdf --password=... --decrypt in.pdf out.pdf first.
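Multi-column academic PDFs → pdfplumber's default extraction interleaves columns. Crop a bbox per column with page.crop(bbox) and extract each crop separately; page.extract_text(layout=True) preserves horizontal spacing, which helps tell the columns apart but does not reorder them.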
Image-heavy scans → convert_from_path uses a lot of memory. Set dpi=150 for first pass, re-run at 300 only if OCR quality is poor.
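Two of the patterns above, the empty-text fallback and per-column cropping, reduce to small pure helpers. This is a sketch under stated assumptions: the names `pages_needing_ocr` and `column_bboxes` are hypothetical, and the column split assumes equal-width columns, which real layouts may not have.

```python
from typing import List, Optional, Sequence, Tuple

# pdfplumber-style bounding box: (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def pages_needing_ocr(text_by_page: Sequence[Optional[str]]) -> List[int]:
    """Return 0-based indices of pages whose extracted text is empty."""
    # Treat None, "" and whitespace-only text as "nothing extracted".
    return [i for i, text in enumerate(text_by_page) if not (text or "").strip()]


def column_bboxes(width: float, height: float,
                  n_columns: int = 2, gutter: float = 0.0) -> List[BBox]:
    """Split a page into equal-width column boxes, trimming a central gutter."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    column_width = width / n_columns
    half_gutter = gutter / 2
    boxes = []
    for i in range(n_columns):
        # Trim half the gutter from every inner edge, never from the page edge.
        x0 = i * column_width + (half_gutter if i > 0 else 0.0)
        x1 = (i + 1) * column_width - (half_gutter if i < n_columns - 1 else 0.0)
        boxes.append((x0, 0.0, x1, height))
    return boxes
```

With pdfplumber, each box from `column_bboxes(page.width, page.height)` could be passed to `page.crop(bbox)` and the crops extracted in order, left to right.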

Dependencies

Standard install for paideia use:

pip install --break-system-packages pypdf pdfplumber pytesseract pdf2image reportlab
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kor

The Korean language pack (tesseract-ocr-kor) is needed if the user writes solutions in Korean/Hangul.
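A quick way to confirm the install is to check for the Python modules and system binaries before running any conversion. The sketch below is an assumption about what is worth checking, not part of the plugin: `missing_dependencies` is a hypothetical helper, and the binary list (pdftotext and pdftoppm from poppler-utils, plus tesseract) is inferred from the packages named above.

```python
import importlib.util
import shutil
from typing import Dict, List, Sequence

# Import names of the pip packages listed above.
PYTHON_MODULES = ["pypdf", "pdfplumber", "pytesseract", "pdf2image", "reportlab"]
# Executables provided by poppler-utils and tesseract-ocr.
SYSTEM_BINARIES = ["pdftotext", "pdftoppm", "tesseract"]


def missing_dependencies(modules: Sequence[str] = PYTHON_MODULES,
                         binaries: Sequence[str] = SYSTEM_BINARIES) -> Dict[str, List[str]]:
    """Return the modules that cannot be found and the binaries not on PATH."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return {"modules": missing_modules, "binaries": missing_binaries}
```

This does not detect a missing language pack; running `tesseract --list-langs` shows whether kor is installed.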

Reference

Full skill at https://github.com/anthropics/skills/tree/main/skills/pdf, with REFERENCE.md covering pypdfium2 and JavaScript libraries, and FORMS.md covering PDF form filling.
