Parses complex documents with PaddleOCR, extracting text, tables, formulas, charts, and layout. For invoices, reports, academic papers, multi-column layouts.
npx claudepluginhub freestylefly/canghe-skills --plugin utility-skills
This skill uses the workspace's default tool permissions.
Use Document Parsing for:
Use Text Recognition instead for:
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
python scripts/vl_caller.py
If the script execution fails (API not configured, network error, etc.):
Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
Or for local files:
python scripts/vl_caller.py --file-path "file path" --pretty
Optional: explicitly set file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
- --file-type 0: PDF
- --file-type 1: image
Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the saved path: Result saved to: /absolute/path/...
- Use --stdout only when you explicitly want to skip file persistence
The output JSON contains COMPLETE content with all document data:
Input type note:
Extract what the user needs from the output JSON using these fields:
- text
- result[n].markdown
- result[n].prunedResult
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- text field
What this means:
- text, result[n].markdown, and result[n].prunedResult
Example - Correct:
User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
The output JSON uses an envelope wrapping the raw API result:
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... }, // raw provider response
  "error": null
}
Key fields:
- text - extracted markdown text from all pages (use this for quick text display)
- result - raw provider response object
- result[n].prunedResult - structured parsing output for each page (layout/content/confidence and related metadata)
- result[n].markdown - full rendered page output in markdown/HTML
Raw result location (default): the temp-file path printed by the script on stderr
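The fields above can be read from the saved file with a few lines of Python. This is a minimal sketch, not part of the skill: the helper names (`find_saved_path`, `load_envelope`, `page_markdown`) are hypothetical, and it assumes that the script reports the file as `Result saved to: <path>` on stderr and that `result` is a list of page objects, as the `result[n]` notation suggests.

```python
import json
import re
from pathlib import Path

# Assumption: the saved file is announced on stderr as
# "Result saved to: <absolute path>".
SAVED_PATH_RE = re.compile(r"^Result saved to:\s*(.+)$", re.MULTILINE)


def find_saved_path(stderr_text):
    """Return the result path announced on stderr, or None if absent."""
    match = SAVED_PATH_RE.search(stderr_text)
    return match.group(1).strip() if match else None


def load_envelope(path):
    """Read the saved envelope JSON from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def page_markdown(envelope):
    """Collect result[n].markdown for every page, in page order.

    Assumption: `result` is a list of page objects. Any other shape
    yields an empty list instead of raising.
    """
    pages = envelope.get("result") or []
    if not isinstance(pages, list):
        return []
    return [page.get("markdown") for page in pages if isinstance(page, dict)]
```

Use the top-level `text` field when the user wants the whole document, and `page_markdown` only when page-level output is needed.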
Example 1: Extract Full Document Text
python scripts/vl_caller.py \
--file-url "https://example.com/paper.pdf" \
--pretty
Then use:
- text for quick full-text output
- result[n].markdown when page-level output is needed
Example 2: Extract Structured Page Data
python scripts/vl_caller.py \
--file-path "./financial_report.pdf" \
--pretty
Then use:
- result[n].prunedResult for structured parsing data (layout/content/confidence)
- result[n].markdown for rendered page content
Example 3: Print JSON Without Saving
python scripts/vl_caller.py \
--file-url "URL" \
--stdout \
--pretty
Then return:
- text when the user asks for full document content
- result[n].prunedResult and result[n].markdown when the user needs complete structured page data
When API is not configured:
The error will show:
PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com
Configuration workflow:
Show the exact error message to the user (including the URL).
Guide the user to configure securely:
- PADDLEOCR_DOC_PARSING_API_URL
- PADDLEOCR_ACCESS_TOKEN
- Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
If the user provides credentials in chat anyway (accept any reasonable format):
- PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
- Here's my API: https://xxx and token: abc123
Parse and validate the values:
- Extract PADDLEOCR_DOC_PARSING_API_URL (look for URLs with paddleocr.com or similar)
- Confirm PADDLEOCR_DOC_PARSING_API_URL is a full endpoint ending with /layout-parsing
- Extract PADDLEOCR_ACCESS_TOKEN (long alphanumeric string, usually 40+ chars)
Ask the user to confirm the environment is configured:
- configure.py or create a local .env file by default if the skill is installed under a host application directory (for example, ~/.claude/skills)
Retry only after confirmation:
IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
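The validation step in the configuration workflow above can be sketched as follows. This is an illustrative helper, not part of the skill: `validate_credentials` is a hypothetical name, and the checks are heuristics taken from the rules above (a full endpoint ending with /layout-parsing, a long alphanumeric token). A real token may legitimately differ, so treat the results as warnings rather than hard failures.

```python
from urllib.parse import urlparse


def validate_credentials(api_url, access_token):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []

    parsed = urlparse(api_url.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("API URL is not an absolute http(s) URL")
    elif not parsed.path.rstrip("/").endswith("/layout-parsing"):
        problems.append("API URL should be the full endpoint ending with /layout-parsing")

    token = access_token.strip()
    if not token.isalnum():
        # Heuristic only: the text above describes tokens as alphanumeric.
        problems.append("access token should be alphanumeric")
    elif len(token) < 40:
        # Heuristic only: tokens are "usually 40+ chars", not always.
        problems.append("access token is shorter than the usual 40+ characters")

    return problems
```

The function never echoes the token back, so its output is safe to show in chat.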
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
If you only need certain pages from a large PDF, extract them first:
# Extract pages 1-5
python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"
# Mixed ranges are supported
python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"
# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
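For a PDF longer than the 100-page limit, the same splitting approach can be applied in batches. The sketch below only computes the `--pages` range strings. `page_batches` is a hypothetical helper, and it assumes the total page count is already known and that a single page number such as "101" is an accepted range, as the mixed-range example above suggests.

```python
def page_batches(total_pages, max_pages=100):
    """Return --pages range strings that cover pages 1..total_pages in order.

    Each range spans at most `max_pages` pages, matching the per-request
    limit stated above.
    """
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")

    ranges = []
    for start in range(1, total_pages + 1, max_pages):
        end = min(start + max_pages - 1, total_pages)
        # A one-page batch is written as a single number, e.g. "101".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_batches(250))  # ['1-100', '101-200', '201-250']
```

Each returned string can be passed to split_pdf.py via --pages, and the resulting file to vl_caller.py via --file-path.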
Authentication failed (403):
error: Authentication failed
→ Token is invalid, reconfigure with correct credentials
API quota exceeded (429):
error: API quota exceeded
→ Daily API quota exhausted, inform user to wait or upgrade
Unsupported format:
error: Unsupported file format
→ File format not supported, convert to PDF/PNG/JPG
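The three cases above can be mapped to follow-up actions with a small lookup. This is a sketch under assumptions: `guidance_for` is a hypothetical helper, and the shape of the `error` field on failure is not specified here, so the code accepts either a plain string or an object with a `message` key.

```python
# Substrings taken from the error messages listed above, paired with
# the suggested follow-up for each.
ERROR_GUIDANCE = [
    ("Authentication failed", "Token is invalid; reconfigure with correct credentials."),
    ("API quota exceeded", "Daily API quota exhausted; wait or upgrade."),
    ("Unsupported file format", "Convert the file to PDF, PNG, or JPG and retry."),
]


def guidance_for(envelope):
    """Return advice for a failed envelope, or None when ok is true."""
    if envelope.get("ok"):
        return None

    error = envelope.get("error")
    # Assumption: `error` is either a string or a dict with a "message" key.
    if isinstance(error, dict):
        message = str(error.get("message") or "")
    else:
        message = str(error or "")

    for needle, advice in ERROR_GUIDANCE:
        if needle.lower() in message.lower():
            return advice
    return "Unrecognized error; show the message to the user unchanged."
```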
- references/output_schema.md - Output format specification
Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).
Load these reference documents into context when:
To verify the skill is working properly:
python scripts/smoke_test.py
This tests configuration and optionally API connectivity.