From journalism-tools
Extract structured data from documents that resist standard parsing, such as redacted records, scanned forms, inconsistent tables, and OCR artifacts. Use this skill when a journalist needs to transform messy PDFs or images into structured JSON with full provenance tracking. Triggers on requests involving FOIA documents, court records, financial disclosures, government forms, leaked documents, or any document described as "hard to parse," "scanned," "redacted," or "inconsistent."
npx claudepluginhub nhagar/claude-plugins-journalism --plugin journalism-tools
How this skill is triggered (by the user, by Claude, or both):
Slash command
/journalism-tools:difficult-document-extraction
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Extract structured data from messy documents while maintaining provenance and human oversight.
Run the conversion script:
uv run --with pdf2image --with pillow scripts/convert_to_images.py input.pdf --output-dir ./pages --dpi 300
Output: pages/page_001.png, pages/page_002.png, etc.
For image-based documents (TIFF, scanned images), copy directly to the pages directory with sequential naming.
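The copy step for image-based documents can be sketched as follows. This is a minimal illustration, not part of the skill's scripts: the function name `copy_with_sequential_names`, the accepted suffix list, and the choice to order scans by filename are all assumptions.

```python
import shutil
from pathlib import Path

# Assumed set of scan formats; extend as needed.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def copy_with_sequential_names(source_dir, pages_dir):
    """Copy scanned images into pages_dir as page_001.<ext>, page_002.<ext>, ..."""
    source = Path(source_dir)
    target = Path(pages_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Assumption: sorting by filename reproduces the page order of the scan.
    images = sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

    copied = []
    for number, image in enumerate(images, start=1):
        destination = target / f"page_{number:03d}{image.suffix.lower()}"
        shutil.copy2(image, destination)  # originals are left untouched
        copied.append(destination)
    return copied
```

If the scanner's filenames do not sort into page order, reorder the list by hand before copying; the numbering is what later steps rely on.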
NOTE: For large files (more than 50 pages), use automated tooling rather than reading every page yourself; see references/automated-extraction.md for that workflow. Otherwise, read and transcribe EVERY page yourself. Don't skip any.
Read the image file for each page in parallel. For each page image, output a markdown file preserving:
- Redactions: [REDACTED]
- Illegible text: [ILLEGIBLE] or [UNCLEAR: partial text?]
- Handwriting: [HANDWRITTEN: transcription] or [HANDWRITTEN: ILLEGIBLE]
- Checkboxes: [X] for checked, [ ] for unchecked
- Stamps and signatures: [STAMP: text] or [SIGNATURE]
<!-- Page N of document: filename.pdf -->
<!-- Document type: [form/letter/table/mixed] -->
<!-- Quality notes: [any OCR issues, damage, etc.] -->
[Transcribed content here, preserving structure]
<!-- Page 1 of document: foia_response_2024.pdf -->
<!-- Document type: form -->
<!-- Quality notes: Slight skew, stamp partially cut off -->
# FREEDOM OF INFORMATION ACT REQUEST RESPONSE
**Date:** March 15, 2024
**Case Number:** FOIA-2024-00142
**Requester:** [REDACTED: ~2 words]
## Responsive Documents
| Doc ID | Date | Description | Pages | Disposition |
|--------|------|-------------|-------|-------------|
| A-001 | 2023-01-15 | Email correspondence | 3 | Released in full |
| A-002 | 2023-02-20 | [REDACTED] | 7 | Withheld (b)(6) |
| A-003 | [ILLEGIBLE] | Meeting notes | 2 | Released with redactions |
[STAMP: APPROVED FOR RELEASE - partially visible]
[SIGNATURE]
Save each transcription as transcripts/page_001.md, transcripts/page_002.md, etc.
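Once the page transcripts exist, a quick tally of the bracketed markers can help flag pages that deserve a second look. The sketch below is one possible way to do that; `count_markers` and `MARKER_PATTERN` are hypothetical names, and the pattern only covers the marker types used in the transcription conventions above.

```python
import re
from collections import Counter

# Matches markers such as [REDACTED], [REDACTED: ~2 words], [STAMP: text].
# Checkbox markers ([X] and [ ]) are deliberately not counted.
MARKER_PATTERN = re.compile(
    r"\[(REDACTED|ILLEGIBLE|UNCLEAR|HANDWRITTEN|STAMP|SIGNATURE)(?::[^\]]*)?\]"
)


def count_markers(transcript_text):
    """Return a Counter of marker types found in one transcript."""
    return Counter(
        match.group(1) for match in MARKER_PATTERN.finditer(transcript_text)
    )
```

A marker like [HANDWRITTEN: ILLEGIBLE] is counted once, as HANDWRITTEN, because the pattern only starts matching at an opening bracket.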
Combine all page transcripts into a single file:
# Full Document Transcript
**Source:** filename.pdf
**Total Pages:** N
**Processed:** YYYY-MM-DD
---
[Contents of page_001.md]
---
<!-- PAGE BREAK: 1 → 2 -->
---
[Contents of page_002.md]
...
Save as full_transcript.md.
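The stitching step above is mechanical, so it can be sketched in a few lines. This is an illustrative helper, not one of the skill's scripts: `stitch_transcripts` and its parameters are assumed names, and it relies on the zero-padded page_NNN.md naming so that a plain sort gives page order (up to 999 pages).

```python
from datetime import date
from pathlib import Path


def stitch_transcripts(transcripts_dir, source_name, processed=None):
    """Combine page_*.md files into one transcript string with page-break comments."""
    pages = sorted(Path(transcripts_dir).glob("page_*.md"))
    processed = processed or date.today().isoformat()

    parts = [
        "# Full Document Transcript",
        f"**Source:** {source_name}",
        f"**Total Pages:** {len(pages)}",
        f"**Processed:** {processed}",
        "---",
    ]
    for index, page in enumerate(pages, start=1):
        if index > 1:
            # Separator between consecutive pages, mirroring the template.
            parts += ["---", f"<!-- PAGE BREAK: {index - 1} → {index} -->", "---"]
        parts.append(page.read_text(encoding="utf-8").strip())
    return "\n".join(parts) + "\n"
```

Writing the result is then a single call such as `Path("full_transcript.md").write_text(text, encoding="utf-8")`.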
Analyze the transcript and propose one or more schemas. Present to journalist for review.
## Proposed Extraction Schema(s)
### Schema 1: [Name]
**Applies to:** Pages X-Y (or "all pages," "pages containing tables," etc.)
**Purpose:** [What this schema captures]
| Field | Type | Description | Required | Example |
|-------|------|-------------|----------|---------|
| field_name | string/number/date/boolean/array | What it represents | Yes/No | "example value" |
### Schema 2: [Name]
...
## Open Questions for Review
1. [Question about ambiguous data]
2. [Question about handling edge cases]
3. [Question about field naming preferences]
## Notes
- [Any patterns observed]
- [Potential data quality issues]
- [Recommendations]
See references/schema-patterns.md for detailed guidance. Key principles:
- Provenance on every record: source_page, source_document
- Missing or redacted values: null, not empty strings
STOP HERE - Present schema to journalist and await approval before proceeding.
After journalist approval, transform the markdown transcript to JSON. Do this YOURSELF, not with a script.
{
  "extraction_metadata": {
    "source_document": "filename.pdf",
    "extraction_date": "2024-03-15",
    "schema_version": "1.0",
    "total_records": 42,
    "notes": ["Any extraction notes"]
  },
  "records": [
    {
      "source_page": 1,
      "field1": "value1",
      "field2": "value2"
    }
  ]
}
{
  "date": "2024-03-15",
  "date_raw": "3/15/24",
  "date_confidence": "high"
}
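Because the transformation itself is done by hand, a small cross-check on normalized dates can catch transposition slips afterwards. The following is only a sketch under stated assumptions: `date_matches_raw` is a hypothetical helper, and the format list assumes US month-first dates, which may not hold for every document.

```python
from datetime import datetime

# Assumed raw formats (US month-first); adjust for the document at hand.
RAW_FORMATS = ("%m/%d/%y", "%m/%d/%Y", "%B %d, %Y")


def date_matches_raw(entry):
    """Return True if entry["date"] agrees with entry["date_raw"] under an assumed format."""
    raw = entry.get("date_raw")
    normalized = entry.get("date")
    if raw is None or normalized is None:
        return False
    for fmt in RAW_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # raw value does not fit this format, try the next one
        return parsed.date().isoformat() == normalized
    return False  # raw value fits none of the assumed formats
```

A False result is a prompt to re-read the source page, not proof of an error; unusual formats simply fall outside the assumed list.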
{
  "name": null,
  "name_note": "REDACTED in source",
  "amount": 1500,
  "amount_note": "Partially illegible, interpreted from context"
}
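Before saving, the finished JSON can be checked against the schema principles (provenance fields present, null rather than empty strings). The sketch below is one hedged way to do it; `find_record_issues` is a hypothetical name, and treating the metadata-level source_document as a fallback for individual records is an assumption rather than something the skill specifies.

```python
def find_record_issues(extraction):
    """Return a list of human-readable problems found in an extraction dict."""
    issues = []
    metadata = extraction.get("extraction_metadata", {})
    records = extraction.get("records", [])

    # Assumption: a document-level source_document covers records that omit it.
    default_document = metadata.get("source_document")

    declared = metadata.get("total_records")
    if declared is not None and declared != len(records):
        issues.append(f"total_records is {declared} but {len(records)} records found")

    for index, record in enumerate(records):
        if not isinstance(record.get("source_page"), int):
            issues.append(f"record {index}: missing or non-integer source_page")
        if not (record.get("source_document") or default_document):
            issues.append(f"record {index}: no source_document available")
        for field, value in record.items():
            if value == "":
                issues.append(f"record {index}: '{field}' is an empty string, use null")
    return issues
```

An empty list means the checks passed; any entries are candidates for the extraction report.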
Save to output/ directory:
- output/[schema_name].json - Extracted data
- output/extraction_report.md - Summary of extraction with any issues
working_directory/
├── input.pdf                  # Original document
├── pages/                     # Page images
│   ├── page_001.png
│   └── ...
├── transcripts/               # Individual page transcripts
│   ├── page_001.md
│   └── ...
├── full_transcript.md         # Stitched transcript
├── schema_proposal.md         # Schema for journalist review
└── output/
    ├── [schema_name].json     # Final extracted data
    ├── extraction_report.md   # Extraction summary
    └── review_[document].html # Interactive review interface
After extraction, generate a self-contained HTML review interface:
uv run scripts/generate_review_interface.py ./pages output/extracted.json \
--output output/review_document.html \
--document-name "FOIA Response 2024-001"
This creates a single HTML file the journalist can open in any browser—no server, no installation, no technical setup required.