doc-to-everything | document-to-markdown

Doc to Everything

Orchestrates a complete document extraction pipeline: converts a PDF to Markdown, chunks the output, extracts tables, applies OCR if needed, and produces a manifest summarizing the entire workflow.

When to use

Converting a complex PDF document into a complete structured workspace
Extracting text, chunks, tables, and metadata in one step
Setting up a reusable document archive

Inputs to gather

PDF file path (required)
OCR flag: --ocr to force OCR on the PDF without asking (optional)
Workspace directory: --workspace (optional; defaults to the directory containing the PDF)
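
As an illustration of how these inputs fit together, here is a minimal sketch that parses them and resolves the output location. It assumes the skill is driven like a command-line tool; the function names parse_args and resolve_workspace, and the program name, are hypothetical and not part of the skill itself.

```python
import argparse
from pathlib import Path
from typing import Optional


def parse_args(argv=None):
    """Parse the three inputs listed above (hypothetical CLI wrapper)."""
    parser = argparse.ArgumentParser(prog="doc-to-everything")
    parser.add_argument("pdf", type=Path, help="path to the source PDF (required)")
    parser.add_argument("--ocr", action="store_true", help="force OCR")
    parser.add_argument(
        "--workspace",
        type=Path,
        default=None,
        help="base directory for output (defaults to the PDF's directory)",
    )
    return parser.parse_args(argv)


def resolve_workspace(pdf: Path, workspace: Optional[Path]) -> Path:
    """Return <workspace>/<stem>/, falling back to the PDF's own directory."""
    base = workspace if workspace is not None else pdf.parent
    return base / pdf.stem  # stem = filename without its extension
```

For example, resolve_workspace(Path("reports/q3.pdf"), None) yields reports/q3, while passing Path("out") as the workspace yields out/q3.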

Procedure

Create the output workspace directory at <workspace>/<stem>/, where <stem> is the PDF filename without extension. If --workspace is not specified, use the directory containing the PDF.
Copy the source PDF into the workspace as source.pdf.
Check whether the PDF has a text layer. If it is missing or sparse and --ocr was not passed, ask the user whether to run OCR. If --ocr was passed or the user confirms, run ocr-scanned-pdf on the PDF and work from the OCR'd version in the steps that follow.
Run pdf-to-markdown on the (possibly OCR'd) PDF. Save output as <workspace>/<stem>/full.md.
Run chunk-markdown on full.md. Output chunks go to <workspace>/<stem>/chunks/.
Run extract-tables on the (possibly OCR'd) PDF. Output tables go to <workspace>/<stem>/tables/.
After all steps, create <workspace>/<stem>/manifest.toon with fields:
- source: relative path to source.pdf
- extractor: which extractor was used for PDF-to-markdown (marker, docling, etc.)
- ocr_applied: boolean, true if OCR was run
- chunk_count: number of chunks created
- table_count: number of tables extracted
- generated_at: ISO 8601 timestamp
- tool_versions: dict with versions of key tools (ocrmypdf, marker, etc.)
Print a summary to the console showing the workspace path, file counts, and manifest location.
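
The manifest fields from the procedure can be gathered along these lines. This is a sketch under stated assumptions: build_manifest is a hypothetical helper, chunks are assumed to be *.md files and tables *.csv files (adjust the patterns if index files share those extensions), and serializing the record to manifest.toon is left to whichever encoder is in use.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(workspace: Path, extractor: str, ocr_applied: bool,
                   tool_versions: dict) -> dict:
    """Collect the manifest fields as a plain dict (serialization not shown)."""
    chunks_dir = workspace / "chunks"
    tables_dir = workspace / "tables"
    # Assumption: one *.md file per chunk and one *.csv file per table.
    chunk_count = len(list(chunks_dir.glob("*.md"))) if chunks_dir.is_dir() else 0
    table_count = len(list(tables_dir.glob("*.csv"))) if tables_dir.is_dir() else 0
    return {
        "source": "source.pdf",  # path relative to the workspace
        "extractor": extractor,
        "ocr_applied": ocr_applied,
        "chunk_count": chunk_count,
        "table_count": table_count,
        # ISO 8601 timestamp in UTC, e.g. 2024-01-01T12:00:00+00:00
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "tool_versions": dict(tool_versions),
    }
```

Counting from the output directories rather than from the steps' return values keeps the manifest accurate even when a step partially fails.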

Output / side effects

Workspace directory structure created at <workspace>/<stem>/
source.pdf copied into workspace
full.md with complete document text
chunks/ directory with chunked Markdown and index
tables/ directory with extracted CSVs and index
manifest.toon with extraction metadata
If OCR was applied, the OCR'd PDF is also written to the workspace; keeping it is optional, and it can be deleted afterward

Safety / constraints

Failures in individual steps (e.g., table extraction) should not abort the pipeline; report the failure and continue with other steps
Workspace directory must be writable; fail clearly if it is not
Do not overwrite existing workspaces without asking the user
Original source PDF is never modified; only copied
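
The constraints above could be enforced roughly as follows. This is a sketch, not the skill's actual implementation: prepare_workspace, run_steps, and the confirm_overwrite callback are hypothetical names, and each pipeline step is modeled as a zero-argument callable.

```python
import os
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def prepare_workspace(workspace: Path,
                      confirm_overwrite: Callable[[Path], bool]) -> None:
    """Create the workspace, asking before reusing an existing one."""
    if workspace.exists() and not confirm_overwrite(workspace):
        raise FileExistsError(f"workspace already exists: {workspace}")
    workspace.mkdir(parents=True, exist_ok=True)
    if not os.access(workspace, os.W_OK):
        # Fail clearly instead of letting a later step hit the error.
        raise PermissionError(f"workspace is not writable: {workspace}")


def run_steps(steps: Iterable[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run every step; record failures instead of aborting the pipeline."""
    failures: Dict[str, str] = {}
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # report the failure and keep going
            failures[name] = str(exc)
            print(f"[warn] {name} failed: {exc}")
    return failures
```

The returned mapping of step name to error message can be included in the console summary so that partial failures remain visible.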