Document chunking implementations and benchmarking tools for RAG pipelines including fixed-size, semantic, recursive, and sentence-based strategies. Use when implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or when user mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
Installation:

/plugin marketplace add vanman2024/ai-dev-marketplace
/plugin install rag-pipeline@ai-dev-marketplace
Files:

- README.md
- SKILL_SUMMARY.md
- examples/chunk-code.py
- examples/chunk-markdown.py
- examples/chunk-pdf.py
- scripts/benchmark-chunking.py
- scripts/chunk-fixed-size.py
- scripts/chunk-recursive.py
- scripts/chunk-semantic.py
- templates/chunking-config.yaml
- templates/custom-splitter.py

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:

- Implementing document processing for a RAG pipeline
- Optimizing chunk sizes or comparing chunking approaches
- Benchmarking retrieval performance
- User mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation
Key Resources:
- scripts/chunk-fixed-size.py - Fixed-size chunking implementation
- scripts/chunk-semantic.py - Semantic chunking with paragraph preservation
- scripts/chunk-recursive.py - Recursive chunking for hierarchical documents
- scripts/benchmark-chunking.py - Benchmark and compare chunking strategies
- templates/chunking-config.yaml - Chunking configuration template
- templates/custom-splitter.py - Template for custom chunking logic
- examples/chunk-markdown.py - Markdown-specific chunking
- examples/chunk-code.py - Source code chunking
- examples/chunk-pdf.py - PDF document chunking

Chunking Strategies:

- Fixed-Size Chunking: split text into equal-sized chunks by character count, with configurable overlap
- Semantic Chunking: group text by paragraph and section boundaries up to a maximum chunk size
- Recursive Chunking: split hierarchically using an ordered list of separators (headers, blank lines, sentences, words)
- Sentence-Based Chunking: split on sentence boundaries; suited to short-form content such as Q&A and FAQs
Fixed-Size Chunking:

Script: scripts/chunk-fixed-size.py
Usage:
python scripts/chunk-fixed-size.py \
--input document.txt \
--chunk-size 1000 \
--overlap 200 \
--output chunks.json
Parameters:
- chunk-size: Number of characters per chunk (default: 1000)
- overlap: Character overlap between chunks (default: 200)
- split-on: Split on sentences, words, or characters (default: sentences)

Best Practices:

- Keep overlap at roughly 10-20% of the chunk size (see the overlap guidelines below)
- Prefer sentence-level splitting so chunks do not end mid-sentence
- Benchmark chunk sizes against your own documents with scripts/benchmark-chunking.py
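For orientation, here is a minimal sketch of character-based fixed-size chunking with overlap; the bundled script layers sentence-aware splitting and JSON output on top of this and may differ in detail:

```python
# Minimal sketch: character-based fixed-size chunking with overlap.
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
    return chunks


print(len(fixed_size_chunks("lorem ipsum dolor sit amet " * 200)))
```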
Semantic Chunking:

Script: scripts/chunk-semantic.py
Usage:
python scripts/chunk-semantic.py \
--input document.txt \
--max-chunk-size 1500 \
--output chunks.json
How it works:

- Splits the document on paragraph boundaries (blank lines)
- Merges consecutive paragraphs until the next one would exceed max-chunk-size
- Keeps paragraphs intact rather than cutting mid-sentence, and can prepend section headers to each chunk (see add_headers in the configuration template)
Best for: Articles, blog posts, documentation, books
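As a rough illustration of the approach (the bundled script may differ in detail), paragraph-preserving chunking can be sketched as merging consecutive paragraphs until a maximum size is reached:

```python
# Minimal sketch: merge consecutive paragraphs until the next one
# would push the chunk past max_chunk_size.
import re
from typing import List


def semantic_chunks(text: str, max_chunk_size: int = 1500) -> List[str]:
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized single paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```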
Recursive Chunking:

Script: scripts/chunk-recursive.py
Usage:
python scripts/chunk-recursive.py \
--input document.md \
--chunk-size 1000 \
--separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
--output chunks.json
How it works:

- Tries to split on the first separator in the list
- Any piece still larger than chunk-size is split again with the next separator, and so on down the hierarchy
- Falls back to word- or character-level splits only when coarser separators are not enough
Separator hierarchy examples:

- Markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
- Code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
- Plain text: ["\\n\\n", "\\n", ". ", " "]

Best for: Structured documents, source code, technical manuals
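A minimal sketch of the recursive idea, written with literal newline characters rather than the escaped separators shown above (scripts/chunk-recursive.py may differ in detail):

```python
# Minimal sketch: split on the coarsest separator first, then re-split any
# piece that is still too large using the remaining, finer separators.
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int = 1000) -> List[str]:
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    chunks: List[str] = []
    for i, piece in enumerate(pieces):
        part = sep + piece if i > 0 else piece  # keep the separator with its section
        if len(part) > chunk_size:
            chunks.extend(recursive_split(part, rest, chunk_size))
        else:
            chunks.append(part)
    return chunks


markdown_separators = ["\n## ", "\n### ", "\n\n", "\n", " "]
```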
Benchmarking:

Script: scripts/benchmark-chunking.py
Usage:
python scripts/benchmark-chunking.py \
--input document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500 \
--output benchmark-results.json
Metrics Evaluated:

- Processing time (time_ms)
- Chunk count
- Average chunk size (avg_size)
- Chunk size variance (size_variance)
- Context preservation score (context_score)
Output:
{
"fixed-1000": {
"time_ms": 45,
"chunk_count": 127,
"avg_size": 982,
"size_variance": 12.3,
"context_score": 0.72
},
"semantic-1000": {
"time_ms": 156,
"chunk_count": 114,
"avg_size": 1087,
"size_variance": 234.5,
"context_score": 0.91
}
}
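Assuming the output format shown above, a small snippet can pick the configuration with the best context score (the file name is taken from the example command):

```python
# Load benchmark results and report the configuration with the highest context_score.
import json

with open("benchmark-results.json") as f:
    results = json.load(f)

best_name, best_metrics = max(results.items(), key=lambda item: item[1]["context_score"])
print(f"Best configuration: {best_name} (context_score={best_metrics['context_score']})")
```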
Template: templates/chunking-config.yaml
Complete configuration:
chunking:
# Global defaults
default_strategy: semantic
default_chunk_size: 1000
default_overlap: 200
# Strategy-specific configs
strategies:
fixed_size:
chunk_size: 1000
overlap: 200
split_on: sentence # sentence, word, character
semantic:
max_chunk_size: 1500
min_chunk_size: 200
preserve_paragraphs: true
add_headers: true # Include section headers
recursive:
chunk_size: 1000
overlap: 100
separators:
markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
text: ["\\n\\n", ". ", " "]
# Document type mappings
document_types:
".md": semantic
".py": recursive
".txt": fixed_size
".pdf": semantic
Template: templates/custom-splitter.py
Create your own chunking logic:
from typing import Dict, List, Optional
import re


class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Optional[Dict] = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        metadata = metadata or {}
        chunks = []

        # Your custom chunking logic here
        # Example: split on a custom pattern
        sections = self._split_sections(text)

        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })

        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic here.
        # Default example: split on blank lines so the template runs as-is.
        return [s for s in re.split(r"\n\s*\n", text) if s.strip()]
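Illustrative usage of the template (the file name and metadata below are placeholders):

```python
chunker = CustomChunker(chunk_size=800, overlap=100)
with open("document.txt") as f:
    chunks = chunker.chunk(f.read(), metadata={"source": "document.txt"})
print(len(chunks), chunks[0]["metadata"])
```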
Example: examples/chunk-markdown.py
Features:
Usage:
python examples/chunk-markdown.py README.md --output readme-chunks.json
Example: examples/chunk-code.py
Features:
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
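As a sketch of code-aware chunking for Python sources using the standard-library ast module (the bundled example may use a different mechanism, such as the recursive separators shown earlier):

```python
# Sketch: one chunk per top-level function or class in a Python file.
import ast
from typing import List


def split_python(source: str) -> List[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


with open("src/main.py") as f:  # path taken from the usage example above
    print(len(split_python(f.read())), "top-level definitions")
```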
Example: examples/chunk-pdf.py
Features:
Dependencies: pypdf, pdfminer.six
Usage:
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
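A minimal sketch of the extraction step with pypdf (one of the listed dependencies); the extracted text can then be fed to any of the chunking scripts. The bundled example may instead rely on pdfminer.six for layout-aware extraction:

```python
# Sketch: extract text page by page with pypdf, then hand it to a chunker.
from pypdf import PdfReader

reader = PdfReader("research-paper.pdf")  # file name taken from the usage example
text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
print(f"{len(reader.pages)} pages, {len(text)} characters extracted")
```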
General recommendations:
| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Q&A / FAQs | 200-400 | 50 | Sentence |
| Articles | 500-1000 | 100-200 | Semantic |
| Documentation | 1000-1500 | 200-300 | Recursive |
| Books | 1000-2000 | 300-400 | Semantic |
| Source code | 500-1000 | 100 | Recursive |
Test with your data: Use benchmark-chunking.py to find optimal settings
Why overlap matters:

Overlap repeats the tail of one chunk at the start of the next, so sentences and ideas that fall on a chunk boundary remain retrievable in full from at least one chunk. Without overlap, boundary content loses its surrounding context and retrieval quality suffers.
Overlap guidelines:

- Use roughly 10-20% of the chunk size (for example, 100-200 characters of overlap for 1000-character chunks), in line with the table above
- Increase overlap if answers frequently span chunk boundaries; reduce it if storage or embedding cost is a concern
Fast chunking (large documents):
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename "$file" .txt).json"
done
Run the benchmark on a representative document:

python scripts/benchmark-chunking.py \
--input sample-document.txt \
--strategies fixed,semantic,recursive \
--chunk-sizes 500,1000,1500
Review metrics:

- Check chunk_count, avg_size, and size_variance for each strategy
- Compare time_ms if processing speed matters for your corpus
- Favor configurations with a higher context_score
Compare retrieval quality:

- Embed the chunks produced by each strategy and run a set of representative queries against them
- Keep the strategy and chunk size that return the most relevant chunks for your queries
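As a rough, self-contained proxy for such a comparison (using scikit-learn from the benchmarking dependencies; real evaluations should use your embedding model, and the chunk sets below are placeholders):

```python
# Rough proxy: score each strategy's chunks against a sample query with
# TF-IDF cosine similarity. Chunk sets and the query are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_score(query: str, chunks):
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(chunks))
    return float(sims.max())


candidates = {
    "semantic": ["Overlap preserves context across chunk boundaries.",
                 "Chunk size should match the content type."],
    "fixed": ["Overlap preserves context acro", "ss chunk boundaries. Chunk si"],
}
query = "Why does chunk overlap matter?"
for name, chunks in candidates.items():
    print(name, round(top_score(query, chunks), 3))
```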
Use configuration file:
import yaml
from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)  # document_text: the text to be chunked
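The document_types mapping in the configuration template can also drive per-extension strategy selection. A sketch follows; the helper function and config path are illustrative, not part of the bundled chunking_strategies module:

```python
# Hypothetical helper: choose a chunking strategy from the document_types
# mapping, falling back to default_strategy for unknown extensions.
from pathlib import Path

import yaml


def strategy_for(path: str, config: dict) -> str:
    chunking = config["chunking"]
    ext = Path(path).suffix.lower()
    return chunking["document_types"].get(ext, chunking["default_strategy"])


config = yaml.safe_load(open("templates/chunking-config.yaml"))  # path is an assumption
print(strategy_for("docs/guide.md", config))   # -> semantic
print(strategy_for("src/app.py", config))      # -> recursive
```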
Issue: Chunks too small or too large
- Adjust the chunk_size parameter (or max-chunk-size / min-chunk-size for semantic chunking)

Issue: Lost context at boundaries
- Increase the overlap between chunks (see the overlap guidelines above)

Issue: Slow processing
- Switch to fixed-size chunking for large documents, or batch-process files as shown in the performance tips

Issue: Poor retrieval quality
- Run scripts/benchmark-chunking.py against your own documents and prefer the strategy with the best context score and retrieval results
Core libraries:
pip install tiktoken # Token counting
pip install nltk # Sentence splitting
pip install spacy # Advanced NLP (optional)
For PDF support:
pip install pypdf pdfminer.six
For benchmarking:
pip install pandas numpy scikit-learn
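Chunk sizes in this skill are measured in characters; if your embedding model has a token limit, tiktoken (listed above) can be used to check token counts per chunk. The encoding name below is an assumption; pick the one matching your model:

```python
# Count tokens per chunk with tiktoken; "cl100k_base" is an assumption,
# so use the encoding that matches your embedding model.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
chunks = ["first chunk of text ...", "second chunk of text ..."]  # placeholders
print([len(encoding.encode(chunk)) for chunk in chunks])
```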
Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0