Extract and analyze Word documents (1MB-50MB+) with minimal token usage through local extraction, semantic chunking by headings, and intelligent caching.
/plugin marketplace add diegocconsolini/ClaudeSkillCollection
/plugin install docx-smart-extractor@security-compliance-marketplace
The DOCX Smart Extractor enables efficient analysis of Word documents through local extraction, semantic chunking, and intelligent caching. Extract once, query forever.
Use this plugin for:
Extract document
# Extract to cache (default)
python scripts/extract_docx.py /path/to/document.docx
# Extract and copy to working directory (interactive prompt)
python scripts/extract_docx.py /path/to/document.docx
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"
# Extract and copy to specific directory (no prompts)
python scripts/extract_docx.py /path/to/document.docx --output-dir ./extracted
Output: Cache key (e.g., policy_document_a8f9e2c1)
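The cache key pairs the document name with a short hash. This document does not say how the hash is computed, so the following is only a minimal sketch of one plausible scheme: it assumes a SHA-256 digest of the file bytes truncated to 8 hex characters, and `make_cache_key` is a hypothetical name, not a function from the plugin.

```python
import hashlib
import re
from pathlib import Path


def make_cache_key(docx_path: str, content: bytes, hash_len: int = 8) -> str:
    """Build a key of the form {document_name}_{hash}.

    Assumption: the hash is a truncated SHA-256 of the file bytes.
    The plugin's real hashing scheme may differ.
    """
    stem = Path(docx_path).stem
    # Replace anything outside [A-Za-z0-9_-] so the key is a safe directory name.
    safe_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem)
    digest = hashlib.sha256(content).hexdigest()[:hash_len]
    return f"{safe_stem}_{digest}"


if __name__ == "__main__":
    # In practice the bytes would come from Path(docx_path).read_bytes().
    print(make_cache_key("/path/to/policy_document.docx", b"example bytes"))
```

Hashing the content rather than the path means an edited document gets a new key, so a stale cache entry is never silently reused.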
Chunk content
python scripts/semantic_chunker.py policy_document_a8f9e2c1
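Chunking by headings can be pictured as follows. This is an illustrative sketch, not the code in `semantic_chunker.py`: it assumes each paragraph has already been extracted as a hypothetical `{"style", "text"}` record, and it treats any style name starting with "Heading" (Word's built-in English style names) as a chunk boundary.

```python
from typing import Dict, List


def chunk_by_headings(paragraphs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group paragraphs into chunks, starting a new chunk at each heading."""
    chunks: List[Dict[str, object]] = []
    # Text that appears before the first heading goes into a heading-less chunk.
    current: Dict[str, object] = {"heading": None, "paragraphs": []}
    for para in paragraphs:
        if para["style"].startswith("Heading"):
            # Close the chunk in progress unless it is still empty.
            if current["heading"] is not None or current["paragraphs"]:
                chunks.append(current)
            current = {"heading": para["text"], "paragraphs": []}
        else:
            current["paragraphs"].append(para["text"])
    if current["heading"] is not None or current["paragraphs"]:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = [
        {"style": "Heading 1", "text": "Security Controls"},
        {"style": "Normal", "text": "All accounts require MFA."},
    ]
    for chunk in chunk_by_headings(sample):
        print(chunk["heading"], len(chunk["paragraphs"]))
```

Splitting at headings keeps each chunk aligned with a section the author intended, so a query can return one coherent section instead of an arbitrary window of text.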
Query content
# Search for keyword
python scripts/query_docx.py search policy_document_a8f9e2c1 "data retention"
# Get specific heading
python scripts/query_docx.py heading policy_document_a8f9e2c1 "Security Controls"
# Get summary
python scripts/query_docx.py summary policy_document_a8f9e2c1
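To show why querying a chunked cache is cheaper than reading the whole document, here is a minimal keyword-search sketch. It is not the logic of `query_docx.py`; the chunk layout (`heading` plus a list of `paragraphs`) and the name `search_chunks` are assumptions made for the example.

```python
from typing import Dict, List


def search_chunks(chunks: List[Dict[str, object]], query: str) -> List[Dict[str, object]]:
    """Return only the chunks whose heading or body contains the query."""
    needle = query.casefold()  # case-insensitive comparison
    hits = []
    for chunk in chunks:
        parts = [chunk.get("heading") or "", *chunk.get("paragraphs", [])]
        # Join with newlines so a match cannot span two separate paragraphs.
        if needle in "\n".join(parts).casefold():
            hits.append(chunk)
    return hits


if __name__ == "__main__":
    chunks = [
        {"heading": "Data Retention", "paragraphs": ["Logs are kept for 90 days."]},
        {"heading": "Security Controls", "paragraphs": ["All accounts require MFA."]},
    ]
    for hit in search_chunks(chunks, "data retention"):
        print(hit["heading"])
```

Only the matching chunks are returned, so the amount of text passed on for analysis scales with the number of hits rather than with the size of the document.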
Typical reductions:
⚠️ IMPORTANT: Cache Location
Extracted content is stored in a user cache directory, NOT the working directory:
Cache locations by platform:
macOS/Linux: ~/.claude-cache/docx/{document_name}_{hash}/
Windows: C:\Users\{username}\.claude-cache\docx\{document_name}_{hash}\
Why cache directory?
Cache contents:
full_document.json - Complete document text with formatting
headings.json - Document heading structure
tables.json - Extracted tables
metadata.json - Document metadata
manifest.json - Cache manifest
Accessing cached content:
# List all cached documents
python scripts/query_docx.py list
# Query cached content
python scripts/query_docx.py search {cache_key} "search query"
# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/docx/policy_document_a1b2c3d4/
To extract files to working directory:
# Option 1: Use --output-dir flag during extraction
python scripts/extract_docx.py document.docx --output-dir ./extracted
# Option 2: Copy from cache manually
cp -r ~/.claude-cache/docx/{cache_key}/* ./extracted_content/
Note: Cache is local and not meant for version control. Keep original Word files in the repository and extract locally on each development machine (one-time operation).
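Option 2 above uses `cp`, which is not available in a default Windows shell. A portable alternative can be sketched in Python. `copy_cache` is a hypothetical helper, not part of the plugin; it assumes the cache root is `.claude-cache/docx` under the user's home directory, matching the locations listed earlier.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def copy_cache(cache_key: str, dest: str, cache_root: Optional[Path] = None) -> List[str]:
    """Copy every file of one cache entry into dest; return the copied names."""
    # Assumed default root: ~/.claude-cache/docx (resolves per platform).
    root = cache_root if cache_root is not None else Path.home() / ".claude-cache" / "docx"
    src = root / cache_key
    if not src.is_dir():
        raise FileNotFoundError(f"No cache entry at {src}")
    # dirs_exist_ok=True (Python 3.8+) lets dest exist already, like `cp -r`.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return sorted(p.name for p in Path(dest).iterdir())


if __name__ == "__main__":
    try:
        print(copy_cache("policy_document_a1b2c3d4", "./extracted_content"))
    except FileNotFoundError as err:
        print(err)
```

Passing `cache_root` explicitly makes the helper easy to point at a test directory or a non-default cache location.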
# Extract
python scripts/extract_docx.py InfoSecPolicy.docx
# Chunk
python scripts/semantic_chunker.py InfoSecPolicy_a8f9e2
# Find password policy section
python scripts/query_docx.py search InfoSecPolicy_a8f9e2 "password requirements"
# Extract
python scripts/extract_docx.py Vendor_Contract.docx
# Get specific section
python scripts/query_docx.py heading Vendor_Contract_f3a8c1 "Termination Clause"
# Extract large spec document
python scripts/extract_docx.py API_Specification.docx
# Search for endpoint details
python scripts/query_docx.py search API_Specification_b9d2e1 "authentication endpoint"
All output is UTF-8 encoded JSON, structured for easy parsing and analysis.
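Because the files are UTF-8 encoded JSON, they can be read with the standard library alone. The sketch below uses a hypothetical `load_cached_json` helper and makes no assumption about the schema inside each file; it only parses whatever JSON the file contains.

```python
import json
from pathlib import Path
from typing import Any


def load_cached_json(cache_dir: Path, name: str) -> Any:
    """Parse one JSON file (e.g. headings.json) from a cache entry."""
    # Pass encoding explicitly: the platform default is not UTF-8 everywhere.
    with open(Path(cache_dir) / name, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    cache_dir = Path.home() / ".claude-cache" / "docx" / "policy_document_a1b2c3d4"
    if (cache_dir / "headings.json").exists():
        print(load_cached_json(cache_dir, "headings.json"))
```

Stating the encoding matters most on Windows, where opening a file without it can fall back to a legacy code page and corrupt non-ASCII text.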
This plugin does NOT:
What it DOES: