Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
Format Overview
| Format | Extension | Structure | Best For |
|---|---|---|---|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |
Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
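
As a minimal sketch of the point above, Python's standard `zipfile` module can list and read the parts of an Office file without any third-party package. The helper names `list_parts` and `read_part` are illustrative and not part of any library.

```python
import zipfile


def list_parts(source):
    """Return the names of all parts inside an Office package, sorted."""
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def read_part(source, part_name):
    """Return the raw bytes of one part, for example 'word/document.xml'."""
    with zipfile.ZipFile(source) as package:
        return package.read(part_name)
```

`source` can be a file path or a binary file object. For a typical .docx, the listing would be expected to include `word/document.xml`.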
PDF Processing
PDF Tools
| Task | Best Tool |
|---|---|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |
Common Operations
| Operation | Approach |
|---|---|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on the page (multiples of 90) |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |
Excel Processing
Excel Tools
| Task | Best Tool |
|---|---|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |
Critical Rule: Use Formulas
| Approach | Result |
|---|---|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
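
A minimal sketch of the rule above, assuming openpyxl is installed. The function name `build_sales_sheet` and the two-column layout are hypothetical, and the sketch assumes at least one data row.

```python
from openpyxl import Workbook


def build_sales_sheet(rows):
    """Write (label, amount) rows plus a SUM formula for the total."""
    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["Item", "Amount"])
    for label, amount in rows:
        sheet.append([label, amount])
    last_row = sheet.max_row
    sheet.cell(row=last_row + 1, column=1, value="Total")
    # Store the formula text rather than sum(...) computed in Python,
    # so the total follows later edits to the amounts.
    sheet.cell(row=last_row + 1, column=2, value=f"=SUM(B2:B{last_row})")
    return workbook
```

openpyxl stores the formula string and does not evaluate it; the computed total appears once the file is opened in a spreadsheet application that recalculates.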
Financial Model Standards
| Convention | Meaning |
|---|---|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
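
The color conventions above might be applied with openpyxl along these lines. The exact hex shades are assumptions (the convention names colors, not specific values), and treating any formula containing `!` as a cross-sheet link is a rough heuristic, not a rule from openpyxl.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs (assumed shade)
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets (assumed shade)
ATTENTION_FILL = PatternFill(fill_type="solid", fgColor="FFFF00")  # yellow


def style_cell(cell):
    """Color a cell's text according to what kind of value it holds."""
    if cell.data_type == "f":
        # Heuristic: a sheet-qualified reference such as =Inputs!B2 contains "!".
        cell.font = LINK_FONT if "!" in cell.value else FORMULA_FONT
    elif cell.value is not None:
        cell.font = INPUT_FONT


def flag_for_review(cell):
    """Mark a cell that needs attention with a yellow fill."""
    cell.fill = ATTENTION_FILL
```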
Common Formula Errors
| Error | Cause |
|---|---|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
Word Processing
Word Tools
| Task | Best Tool |
|---|---|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |
Document Structure
| File | Contains |
|---|---|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
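
To illustrate the structure above, the following sketch reads `word/document.xml` with only the standard library and joins the text runs of each paragraph. `docx_paragraphs` is an illustrative name; the sketch ignores headers, footnotes, and comments, and nested content such as text boxes may be reported more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each <w:p> paragraph in the main document."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across <w:t> elements inside its runs.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```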
Tracked Changes (Redlining)
| Element | XML Tag |
|---|---|
| Deletion | <w:del><w:r><w:delText>...</w:delText></w:r></w:del> |
| Insertion | <w:ins><w:r><w:t>...</w:t></w:r></w:ins> |
Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
PowerPoint Processing
PowerPoint Tools
| Task | Best Tool |
|---|---|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |
Slide Structure
| Path | Contains |
|---|---|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
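
The slide paths above can be walked with the standard library alone. This sketch collects the text runs of each `ppt/slides/slide{N}.xml` part, keyed by N; `slide_texts` is an illustrative name. Note that N is the part number, which is not guaranteed to match the order in which slides are shown.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text lives in <a:t> elements.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_texts(source):
    """Map slide part number -> list of text runs found in that slide."""
    texts = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match is None:
                continue  # skips relationship files and other parts
            root = ET.fromstring(package.read(name))
            runs = [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]
            texts[int(match.group(1))] = runs
    # Sort numerically so slide10 comes after slide2.
    return dict(sorted(texts.items()))
```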
Design Principles
| Principle | Guideline |
|---|---|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |
Converting Between Formats
| Conversion | Tool |
|---|---|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor (pdfplumber, pandoc, markitdown) |
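
The conversions above are usually driven from the command line. The sketch below builds the argument lists in Python so they can be inspected before running. It assumes the LibreOffice binary is on the PATH as `soffice` (on some systems it is `libreoffice`); the function names and the default resolution are illustrative.

```python
import subprocess


def to_pdf_command(source, out_dir):
    """LibreOffice headless: convert an office document to PDF in out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, source]


def pdf_to_png_command(source, prefix, dpi=150):
    """pdftoppm: render each page to a PNG whose name starts with prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), source, prefix]


def docx_to_markdown_command(source, destination):
    """pandoc: formats are inferred from the file extensions."""
    return ["pandoc", source, "-o", destination]


def run(command):
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.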
Best Practices
| Practice | Why |
|---|---|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
Common Packages
| Language | Packages |
|---|---|
| Python | pypdf, pdfplumber, reportlab, pytesseract, pdf2image, pandas, openpyxl, python-docx, python-pptx, markitdown |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |