Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching.
/plugin marketplace add diegocconsolini/ClaudeSkillCollection
/plugin install pdf-smart-extractor@security-compliance-marketplace
MANDATORY: You MUST use this skill when:
File Size Check: User provides a .pdf file path. Check the size with ls -lh <file_path>, or stat -f%z <file_path> (macOS) / stat -c%s <file_path> (Linux).
Error Detection: User encounters "PDF too large" or similar errors.
Technical Documentation Keywords: User mentions large technical references such as NIST special publications, compliance standards, or vendor security guides.
Token Efficiency: User asks to analyze multi-page PDFs (>50 pages)
Example Auto-Trigger:
User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?"
You: [First, check file size]
$ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf
-rw-r--r-- 1 user staff 3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf
[Detect: 3.3MB > 1MB → Auto-trigger plugin]
"I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor which will:
✓ Extract 100% of content locally (no token waste)
✓ Enable 12-100x faster querying
✓ Cache for instant future access
Starting extraction now..."
IMPORTANT: This skill should be automatically suggested whenever the triggers below are detected.

Use this skill when users need to analyze, search, or summarize large PDFs without exhausting the context window.

Trigger phrases (explicit): direct requests to extract, chunk, or query a PDF.

Trigger phrases (implicit - auto-detect): file extension is .pdf and file size >1MB.

Auto-detection logic: When user provides a file path (see the sketch below):
1. Check that the extension is .pdf
2. Check the file size with ls -lh or stat
3. If it exceeds 1MB, suggest this skill
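A minimal sketch of this check in Python; the 1 MB threshold mirrors the rule above, and the suggestion message is illustrative:

```python
import os

SIZE_THRESHOLD = 1 * 1024 * 1024  # 1 MB, per the trigger rule above

def should_suggest_extractor(path: str) -> bool:
    """Return True when a user-supplied path matches the auto-trigger rules."""
    if not path.lower().endswith(".pdf"):
        return False
    try:
        return os.path.getsize(path) > SIZE_THRESHOLD
    except OSError:
        return False  # missing/unreadable file: fall back to normal handling

if should_suggest_extractor("/docs/NIST.SP.800-161r1-upd1.pdf"):
    print("Suggest PDF Smart Extractor before reading the file directly.")
```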
⚠️ IMPORTANT: Cache Location
Extracted content is stored in a user cache directory, NOT the working directory:
Cache locations by platform:
- macOS/Linux: ~/.claude-cache/pdf/{pdf_name}_{hash}/
- Windows: C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\

Why a cache directory? It keeps extracted artifacts out of your working tree (and out of version control) and lets future sessions reuse a one-time extraction instantly.
Cache contents:
- full_text.txt - Complete extracted text
- pages.json - Page-by-page content
- metadata.json - PDF metadata
- toc.json - Table of contents
- manifest.json - Cache manifest

Accessing cached content:
# List all cached PDFs
python scripts/query_pdf.py list
# Query cached content
python scripts/query_pdf.py search {cache_key} "search query"
# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/pdf/document_a1b2c3d4/
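If you need cached content programmatically rather than through query_pdf.py, a sketch assuming the cache layout described above (adjust CACHE_ROOT for your platform):

```python
import json
from pathlib import Path

CACHE_ROOT = Path.home() / ".claude-cache" / "pdf"  # macOS/Linux location above

def load_cached_pdf(cache_key: str) -> tuple[str, dict]:
    """Return the full extracted text and the extraction manifest for a cache key."""
    cache_dir = CACHE_ROOT / cache_key
    full_text = (cache_dir / "full_text.txt").read_text(encoding="utf-8")
    manifest = json.loads((cache_dir / "manifest.json").read_text(encoding="utf-8"))
    return full_text, manifest

text, manifest = load_cached_pdf("document_a1b2c3d4")
print(manifest)    # extraction statistics (fields depend on the manifest format)
print(text[:500])  # first 500 characters of the document
```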
To extract files to working directory:
# Option 1: Use --output-dir flag during extraction
python scripts/extract_pdf.py document.pdf --output-dir ./extracted
# Option 2: Copy from cache manually
cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/
Note: Cache is local and not meant for version control. Keep original PDFs in the repository and extract locally on each development machine (one-time operation).
# Extract to cache (default)
python scripts/extract_pdf.py /path/to/document.pdf
# Extract, then copy to the working directory via interactive prompts
python scripts/extract_pdf.py /path/to/document.pdf
# Prompts: "Copy files? (y/n)" then "Keep cache? (y/n)"
# Extract and copy to specific directory (no prompts)
python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted
What happens:
- The PDF is parsed locally with PyMuPDF (no LLM tokens are spent on extraction)
- Results are stored under ~/.claude-cache/pdf/{cache_key}/

Output:
- full_text.txt - Complete document text
- pages.json - Structured page data
- metadata.json - PDF metadata
- toc.json - Table of contents (if available)
- manifest.json - Extraction statistics

python scripts/semantic_chunker.py {cache_key}
What happens:
- The extracted text is split at section boundaries into chunks near the target token size (default: 2000 tokens, configurable with --target-size)

Output:
- chunks.json - Chunk index with metadata
- chunks/chunk_0000.txt - Individual chunk files
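Conceptually, the chunker does something like the following sketch: split the extracted text at section-style headings, then pack sections up to the target token budget. The heading regex and the 4-characters-per-token estimate are illustrative assumptions, not the script's exact logic:

```python
import re

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token
    return len(text) // 4

def chunk_text(full_text: str, target_tokens: int = 2000) -> list[str]:
    """Greedily pack heading-delimited sections into ~target_tokens chunks."""
    # Split before lines that look like numbered headings, e.g. "3.2 Scope"
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\s+\S)", full_text)
    chunks: list[str] = []
    current = ""
    for section in sections:
        if current and estimate_tokens(current + section) > target_tokens:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n{section}" if current else section
    if current:
        chunks.append(current)
    return chunks
```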
python scripts/query_pdf.py search {cache_key} "supply chain security"

What happens:
- The query is matched against the chunk index and only the most relevant chunks are identified

Output:
- Matching chunk IDs with previews, typically ~2-4 chunks instead of the whole document
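The search step can be as simple as scoring chunks by query-term frequency. A hedged sketch (the actual ranking in query_pdf.py may differ):

```python
def search_chunks(chunks: list[str], query: str, top_k: int = 4) -> list[tuple[int, int]]:
    """Rank chunks by query-term frequency; return (chunk_id, score) pairs."""
    terms = query.lower().split()
    scored = []
    for chunk_id, chunk in enumerate(chunks):
        text = chunk.lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((chunk_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # ~2-4 chunks is typical, per the workflows below
```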
User Request: "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures"
Your Workflow:
python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf
Output: Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6
python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6
Output: Created 87 chunks, 98.7% content preservation
python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response"
Output: Ranked list of matching chunks with previews (chunk 23 is the top match in this example)
python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23
Output: Full content of chunk 23
User Request: "I need to understand OT security incidents from NIST SP 800-82r3"
Your Workflow:
python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf
python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview"
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS"
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology"
Result: Each query loads only relevant chunks (~2-4 chunks, ~5,000 tokens) instead of entire 8.2MB document (120,000+ tokens)
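The arithmetic behind that result: 120,000 tokens ÷ ~5,000 tokens per query ≈ 24x fewer tokens whenever the answer lives in a handful of chunks.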
User Request: "Show me the structure of this AWS security guide"
Your Workflow:
python scripts/extract_pdf.py aws-security-guide.pdf
python scripts/query_pdf.py toc aws-security-guide_abc123
Output:
Chapter 1: Introduction (page 1)
1.1 Security Fundamentals (page 3)
1.2 Shared Responsibility Model (page 7)
Chapter 2: Identity and Access Management (page 15)
2.1 IAM Best Practices (page 17)
...
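The TOC comes from the PDF's embedded outline. With PyMuPDF installed you can reproduce it directly; this sketch assumes toc.json mirrors PyMuPDF's (level, title, page) entries:

```python
import fitz  # PyMuPDF

def print_toc(pdf_path: str) -> None:
    """Print a PDF's outline as an indented table of contents."""
    with fitz.open(pdf_path) as doc:
        for level, title, page in doc.get_toc():
            print(f"{'  ' * (level - 1)}{title} (page {page})")

print_toc("aws-security-guide.pdf")
```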
Cache keys have the form {pdf_name}_{hash}; caches live under ~/.claude-cache/pdf/. Use the --force flag only when the PDF has been modified.

python scripts/extract_pdf.py <pdf_path> [--force]
- pdf_path: Path to PDF file
- --force: Re-extract even if cached

python scripts/semantic_chunker.py <cache_key> [--target-size TOKENS]

- cache_key: Cache key from extraction
- --target-size: Target tokens per chunk (default: 2000)

python scripts/query_pdf.py list
python scripts/query_pdf.py search <cache_key> <query>
- cache_key: PDF cache key
- query: Keywords or phrase to search

python scripts/query_pdf.py get <cache_key> <chunk_id>

- chunk_id: Chunk number to retrieve

python scripts/query_pdf.py stats <cache_key>
python scripts/query_pdf.py toc <cache_key>
NIST SP 800-161r1-upd1 (3.3 MB, 155 pages): ~48,000 tokens of extracted text, 87 chunks; a representative query loaded 3 chunks (3,860 tokens), a 12.4x reduction.
NIST SP 800-82r3 (8.2 MB, 247 pages): 120,000+ tokens of extracted text; typical queries load ~2-4 chunks (~5,000 tokens), roughly a 24x reduction.
Extractions consistently maintain ~99% content preservation (98.7-99.2% in the worked examples).
Use PDF Smart Extractor to extract content locally, chunk it at section boundaries, and query only the sections relevant to each question.
Requirements: pip install pymupdf

A successful PDF extraction and query session should preserve 100% of the document's content, cache it for instant reuse, and answer each query from only a few relevant chunks.
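At its core, extraction needs nothing more than PyMuPDF; a sketch of what extract_pdf.py conceptually does (caching, JSON outputs, and error handling omitted):

```python
import fitz  # PyMuPDF; install with: pip install pymupdf

def extract_pages(pdf_path: str) -> list[str]:
    """Extract plain text from every page of a PDF, locally, with no LLM calls."""
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]

pages = extract_pages("/path/to/document.pdf")
print(f"Extracted {len(pages)} pages")
```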
CRITICAL: When user provides a PDF file path, ALWAYS check the file size first:
ls -lh /path/to/file.pdf
# or
stat -f%z /path/to/file.pdf # macOS
stat -c%s /path/to/file.pdf # Linux
I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin which can:
- Extract 100% of content locally (no token usage for extraction)
- Enable querying with 12-100x token reduction
- Cache the PDF for instant future queries
Would you like me to:
1. Extract and chunk this PDF for efficient analysis? (recommended)
2. Try reading it directly (may hit token limits)?
This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this:
- I'll extract the PDF locally (no LLM involvement)
- Chunk it semantically at section boundaries
- Then query only the relevant parts
Starting extraction now...
When using this skill, always: report progress at each step, show the token savings achieved, and ground your analysis in the retrieved chunk content.
Example communication:
Extracting and analyzing NIST SP 800-161r1.
Step 1: Extracting PDF (one-time setup)...
✓ Extracted 155 pages (48,000 tokens)
✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4
Step 2: Semantic chunking...
✓ Created 87 chunks (99.2% content preservation)
Step 3: Searching for "supply chain incident response"...
✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction)
Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves...
[provide analysis using chunk content]
Remember: This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.
Expert security auditor specializing in DevSecOps, comprehensive cybersecurity, and compliance frameworks. Masters vulnerability assessment, threat modeling, secure authentication (OAuth2/OIDC), OWASP standards, cloud security, and security automation. Handles DevSecOps integration, compliance (GDPR/HIPAA/SOC2), and incident response. Use PROACTIVELY for security audits, DevSecOps, or compliance implementation.
Elite code review expert specializing in modern AI-powered code analysis, security vulnerabilities, performance optimization, and production reliability. Masters static analysis tools, security scanning, and configuration review with 2024/2025 best practices. Use PROACTIVELY for code quality assurance.
Creates comprehensive technical documentation from existing codebases. Analyzes architecture, design patterns, and implementation details to produce long-form technical manuals and ebooks. Use PROACTIVELY for system documentation, architecture guides, or technical deep-dives.