From crucible
Converts PDF, DOCX, PPTX, XLSX and 10+ formats to token-efficient Markdown/CSV digests with structural compression. Use for feeding documents to Claude without excessive context costs.
npx claudepluginhub raddue/crucibleThis skill uses the workspace's default tool permissions.
<!-- CANONICAL: shared/dispatch-convention.md -->
Converts PDF, DOCX, PPTX, XLSX, HTML, images to structured Markdown or JSON using Docling. Preserves tables, layout, hierarchy for RAG pipelines and knowledge bases.
Converts PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs to structured Markdown for LLM processing.
Converts PDFs, DOCX, PPTX, XLSX, HTML, images, URLs, CSV, JSON, and more to markdown via tiered fallbacks: MCP markitdown, native tools, or user notice. For ingesting non-plain-text files.
Share bugs, ideas, or general feedback.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md — a structurally-aware compression at 20-30% of token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
/distill <path> [path2 ...]
/distill <directory>
Examples:
/distill docs/report.pdf — convert one file/distill docs/report.pdf data/sheet.xlsx slides/deck.pptx — convert multiple files/distill docs/ — convert all supported files in directory (single-level, not recursive)Mixed mode is supported: /distill docs/ extra/report.pdf
Execute phases in this order. Each phase completes for all files before the next begins.
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | which pandoc | "pandoc not found. Install: apt install pandoc (Debian/Ubuntu) or brew install pandoc (macOS). Tier 1 formats will be skipped." |
| Tier 2 | which pdftotext | "pdftotext not found. Install: apt install poppler-utils (Debian/Ubuntu) or brew install poppler (macOS). PDF conversion will be skipped." |
| Tier 3 | which python3 | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | which unzip | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | which pdfdetach | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob: .pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
.docx | 1 | docx |
.rtf | 1 | rtf |
.html | 1 | html |
.htm | 1 | html |
.odt | 1 | odt |
.epub | 1 | epub |
.rst | 1 | rst |
.org | 1 | org |
.tex | 1 | latex |
.ipynb | 1 | ipynb |
.pdf | 2 | — |
.pptx | 3 | — |
.xlsx | 3 | — |
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
.xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
.ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
.odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
.key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
.numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
.pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
For PDF files, if pdfdetach is available:
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -c "^[0-9]")
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8, attempt re-encoding: iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH". If re-encoding fails, report and skip.
Process files sequentially. For each file:
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2" # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
Error handling:
Idempotency: Overwrites existing output files without warning.
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters). If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"
# Health check
if [ -d "$VENV" ]; then
"$VENV/bin/python3" -c "import sys" 2>/dev/null || rm -rf "$VENV"
fi
# Create if missing
if [ ! -d "$VENV" ]; then
echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
python3 -m venv "$VENV"
"$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5
if [ $? -ne 0 ]; then
echo "Failed to install Python dependencies."
echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
echo "PPTX and XLSX conversion will be skipped."
# Route remaining Tier 3 files to unsupported
return
fi
fi
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheetnames sanitized (spaces → hyphens, special chars stripped).
After all conversions complete, run the digest pass on eligible files.
Eligibility:
.md (not .csv)Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
Quality check: After the digest agent returns, count words in the digest:
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
After all conversions and digests complete, output:
## Distill Summary
| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |
**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.
Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words) expressed as percentage.
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
# WRONG — never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
"$VAR", never bare $VAR$() or backtick interpolation of paths| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
Standalone usage:
/distill <path> — convert one or more files/distill <directory> — convert all supported files in directoryCalled by:
Dispatches:
skills/distill/pdf-structurer-prompt.mdskills/distill/digest-prompt.mdDoes not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill — it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).