Document chunking implementations and benchmarking tools for RAG pipelines including fixed-size, semantic, recursive, and sentence-based strategies. Use when implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or when user mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
Installation:

/plugin marketplace add vanman2024/ai-dev-marketplace
/plugin install rag-pipeline@ai-dev-marketplace
Files:

- README.md
- SKILL_SUMMARY.md
- examples/chunk-code.py
- examples/chunk-markdown.py
- examples/chunk-pdf.py
- scripts/benchmark-chunking.py
- scripts/chunk-fixed-size.py
- scripts/chunk-recursive.py
- scripts/chunk-semantic.py
- templates/chunking-config.yaml
- templates/custom-splitter.py

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:

- Implementing document processing for a RAG pipeline
- Optimizing chunk sizes or comparing chunking approaches
- Benchmarking retrieval performance
- User mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation
Key Resources:
- scripts/chunk-fixed-size.py - Fixed-size chunking implementation
- scripts/chunk-semantic.py - Semantic chunking with paragraph preservation
- scripts/chunk-recursive.py - Recursive chunking for hierarchical documents
- scripts/benchmark-chunking.py - Benchmark and compare chunking strategies
- templates/chunking-config.yaml - Chunking configuration template
- templates/custom-splitter.py - Template for custom chunking logic
- examples/chunk-markdown.py - Markdown-specific chunking
- examples/chunk-code.py - Source code chunking
- examples/chunk-pdf.py - PDF document chunking

Chunking Strategies:

- Fixed-Size Chunking: split text into equal-sized chunks by character count, with configurable overlap
- Semantic Chunking: group text by paragraph and section boundaries up to a maximum chunk size
- Recursive Chunking: split hierarchically using an ordered list of separators (headers, blank lines, sentences, words)
- Sentence-Based Chunking: split on sentence boundaries; suited to short-form content such as Q&A and FAQs
Fixed-Size Chunking:

Script: scripts/chunk-fixed-size.py
Usage:
python scripts/chunk-fixed-size.py \
--input document.txt \
--chunk-size 1000 \
--overlap 200 \
--output chunks.json
Parameters:
- chunk-size: Number of characters per chunk (default: 1000)
- overlap: Character overlap between chunks (default: 200)
- split-on: Split on sentences, words, or characters (default: sentences)

Best Practices:

- Keep overlap at roughly 10-20% of the chunk size (see the overlap guidelines below)
- Prefer sentence-level splitting so chunks do not end mid-sentence
- Benchmark chunk sizes against your own documents with scripts/benchmark-chunking.py
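For orientation, here is a minimal sketch of character-based fixed-size chunking with overlap; the bundled script layers sentence-aware splitting and JSON output on top of this and may differ in detail:

```python
# Minimal sketch: character-based fixed-size chunking with overlap.
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
    return chunks


print(len(fixed_size_chunks("lorem ipsum dolor sit amet " * 200)))
```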
Semantic Chunking:

Script: scripts/chunk-semantic.py
Usage:
python scripts/chunk-semantic.py \
--input document.txt \
--max-chunk-size 1500 \
--output chunks.json
How it works:

- Splits the document on paragraph boundaries (blank lines)
- Merges consecutive paragraphs until the next one would exceed max-chunk-size
- Keeps paragraphs intact rather than cutting mid-sentence, and can prepend section headers to each chunk (see add_headers in the configuration template)
Best for: Articles, blog posts, documentation, books
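As a rough illustration of the approach (the bundled script may differ in detail), paragraph-preserving chunking can be sketched as merging consecutive paragraphs until a maximum size is reached:

```python
# Minimal sketch: merge consecutive paragraphs until the next one
# would push the chunk past max_chunk_size.
import re
from typing import List


def semantic_chunks(text: str, max_chunk_size: int = 1500) -> List[str]:
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized single paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```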
Recursive Chunking:

Script: scripts/chunk-recursive.py
Usage:
python scripts/chunk-recursive.py \
--input document.md \
--chunk-size 1000 \
--separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
--output chunks.json
How it works:

- Tries to split on the first separator in the list
- Any piece still larger than chunk-size is split again with the next separator, and so on down the hierarchy
- Falls back to word- or character-level splits only when coarser separators are not enough
Separator hierarchy examples:

- Markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
- Code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
- Plain text: ["\\n\\n", "\\n", ". ", " "]

Best for: Structured documents, source code, technical manuals
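A minimal sketch of the recursive idea, written with literal newline characters rather than the escaped separators shown above (scripts/chunk-recursive.py may differ in detail):

```python
# Minimal sketch: split on the coarsest separator first, then re-split any
# piece that is still too large using the remaining, finer separators.
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int = 1000) -> List[str]:
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    chunks: List[str] = []
    for i, piece in enumerate(pieces):
        part = sep + piece if i > 0 else piece  # keep the separator with its section
        if len(part) > chunk_size:
            chunks.extend(recursive_split(part, rest, chunk_size))
        else:
            chunks.append(part)
    return chunks


markdown_separators = ["\n## ", "\n### ", "\n\n", "\n", " "]
```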
Benchmarking:

Script: scripts/benchmark-chunking.py
Usage:
python scripts/benchmark-chunking.py \
--input document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500 \
--output benchmark-results.json
Metrics Evaluated:

- Processing time (time_ms)
- Chunk count
- Average chunk size (avg_size)
- Chunk size variance (size_variance)
- Context preservation score (context_score)
Output:
{
"fixed-1000": {
"time_ms": 45,
"chunk_count": 127,
"avg_size": 982,
"size_variance": 12.3,
"context_score": 0.72
},
"semantic-1000": {
"time_ms": 156,
"chunk_count": 114,
"avg_size": 1087,
"size_variance": 234.5,
"context_score": 0.91
}
}
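Assuming the output format shown above, a small snippet can pick the configuration with the best context score (the file name is taken from the example command):

```python
# Load benchmark results and report the configuration with the highest context_score.
import json

with open("benchmark-results.json") as f:
    results = json.load(f)

best_name, best_metrics = max(results.items(), key=lambda item: item[1]["context_score"])
print(f"Best configuration: {best_name} (context_score={best_metrics['context_score']})")
```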
Template: templates/chunking-config.yaml
Complete configuration:
chunking:
# Global defaults
default_strategy: semantic
default_chunk_size: 1000
default_overlap: 200
# Strategy-specific configs
strategies:
fixed_size:
chunk_size: 1000
overlap: 200
split_on: sentence # sentence, word, character
semantic:
max_chunk_size: 1500
min_chunk_size: 200
preserve_paragraphs: true
add_headers: true # Include section headers
recursive:
chunk_size: 1000
overlap: 100
separators:
markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
text: ["\\n\\n", ". ", " "]
# Document type mappings
document_types:
".md": semantic
".py": recursive
".txt": fixed_size
".pdf": semantic
Template: templates/custom-splitter.py
Create your own chunking logic:
from typing import Dict, List, Optional
import re


class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Optional[Dict] = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        metadata = metadata or {}
        chunks = []

        # Your custom chunking logic here
        # Example: split on a custom pattern
        sections = self._split_sections(text)

        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })

        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic here.
        # Default example: split on blank lines so the template runs as-is.
        return [s for s in re.split(r"\n\s*\n", text) if s.strip()]
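Illustrative usage of the template (the file name and metadata below are placeholders):

```python
chunker = CustomChunker(chunk_size=800, overlap=100)
with open("document.txt") as f:
    chunks = chunker.chunk(f.read(), metadata={"source": "document.txt"})
print(len(chunks), chunks[0]["metadata"])
```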
Example: examples/chunk-markdown.py
Features:
Usage:
python examples/chunk-markdown.py README.md --output readme-chunks.json
Example: examples/chunk-code.py
Features:
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
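As a sketch of code-aware chunking for Python sources using the standard-library ast module (the bundled example may use a different mechanism, such as the recursive separators shown earlier):

```python
# Sketch: one chunk per top-level function or class in a Python file.
import ast
from typing import List


def split_python(source: str) -> List[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


with open("src/main.py") as f:  # path taken from the usage example above
    print(len(split_python(f.read())), "top-level definitions")
```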
Example: examples/chunk-pdf.py
Features:
Dependencies: pypdf, pdfminer.six
Usage:
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
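A minimal sketch of the extraction step with pypdf (one of the listed dependencies); the extracted text can then be fed to any of the chunking scripts. The bundled example may instead rely on pdfminer.six for layout-aware extraction:

```python
# Sketch: extract text page by page with pypdf, then hand it to a chunker.
from pypdf import PdfReader

reader = PdfReader("research-paper.pdf")  # file name taken from the usage example
text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
print(f"{len(reader.pages)} pages, {len(text)} characters extracted")
```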
General recommendations:
| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Q&A / FAQs | 200-400 | 50 | Sentence |
| Articles | 500-1000 | 100-200 | Semantic |
| Documentation | 1000-1500 | 200-300 | Recursive |
| Books | 1000-2000 | 300-400 | Semantic |
| Source code | 500-1000 | 100 | Recursive |
Test with your data: Use benchmark-chunking.py to find optimal settings
Why overlap matters:

Overlap repeats the tail of one chunk at the start of the next, so sentences and ideas that fall on a chunk boundary remain retrievable in full from at least one chunk. Without overlap, boundary content loses its surrounding context and retrieval quality suffers.
Overlap guidelines:

- Use roughly 10-20% of the chunk size (for example, 100-200 characters of overlap for 1000-character chunks), in line with the table above
- Increase overlap if answers frequently span chunk boundaries; reduce it if storage or embedding cost is a concern
Fast chunking (large documents):
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename "$file" .txt).json"
done
Run the benchmark on a representative document:

python scripts/benchmark-chunking.py \
--input sample-document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500
Review metrics:

- Check chunk_count, avg_size, and size_variance for each strategy
- Compare time_ms if processing speed matters for your corpus
- Favor configurations with a higher context_score
Compare retrieval quality:

- Embed the chunks produced by each strategy and run a set of representative queries against them
- Keep the strategy and chunk size that return the most relevant chunks for your queries
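As a rough, self-contained proxy for such a comparison (using scikit-learn from the benchmarking dependencies; real evaluations should use your embedding model, and the chunk sets below are placeholders):

```python
# Rough proxy: score each strategy's chunks against a sample query with
# TF-IDF cosine similarity. Chunk sets and the query are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_score(query: str, chunks):
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(chunks))
    return float(sims.max())


candidates = {
    "semantic": ["Overlap preserves context across chunk boundaries.",
                 "Chunk size should match the content type."],
    "fixed": ["Overlap preserves context acro", "ss chunk boundaries. Chunk si"],
}
query = "Why does chunk overlap matter?"
for name, chunks in candidates.items():
    print(name, round(top_score(query, chunks), 3))
```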
Use configuration file:
import yaml
from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)  # document_text: the text to be chunked
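The document_types mapping in the configuration template can also drive per-extension strategy selection. A sketch follows; the helper function and config path are illustrative, not part of the bundled chunking_strategies module:

```python
# Hypothetical helper: choose a chunking strategy from the document_types
# mapping, falling back to default_strategy for unknown extensions.
from pathlib import Path

import yaml


def strategy_for(path: str, config: dict) -> str:
    chunking = config["chunking"]
    ext = Path(path).suffix.lower()
    return chunking["document_types"].get(ext, chunking["default_strategy"])


config = yaml.safe_load(open("templates/chunking-config.yaml"))  # path is an assumption
print(strategy_for("docs/guide.md", config))   # -> semantic
print(strategy_for("src/app.py", config))      # -> recursive
```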
Issue: Chunks too small or too large
- Adjust the chunk_size parameter (or max-chunk-size / min-chunk-size for semantic chunking)

Issue: Lost context at boundaries
- Increase the overlap between chunks (see the overlap guidelines above)

Issue: Slow processing
- Switch to fixed-size chunking for large documents, or batch-process files as shown in the performance tips

Issue: Poor retrieval quality
- Run scripts/benchmark-chunking.py against your own documents and prefer the strategy with the best context score and retrieval results
Core libraries:
pip install tiktoken # Token counting
pip install nltk # Sentence splitting
pip install spacy # Advanced NLP (optional)
For PDF support:
pip install pypdf pdfminer.six
For benchmarking:
pip install pandas numpy scikit-learn
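Chunk sizes in this skill are measured in characters; if your embedding model has a token limit, tiktoken (listed above) can be used to check token counts per chunk. The encoding name below is an assumption; pick the one matching your model:

```python
# Count tokens per chunk with tiktoken; "cl100k_base" is an assumption,
# so use the encoding that matches your embedding model.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
chunks = ["first chunk of text ...", "second chunk of text ..."]  # placeholders
print([len(encoding.encode(chunk)) for chunk in chunks])
```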
Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0