Skill

chunking

Splits extracted text into chunks for LLM context windows or RAG ingestion. Supports character, markdown, YAML, and semantic chunking with configurable size and overlap.

ai-ml

data-engineering

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/kreuzberg:chunking

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this when feeding documents into an LLM context window or a vector

SKILL.md

151 lines · ~1.4k tokens

Stats

LanguageShell

Parent stars5

MaintenanceExcellent

Last CommitJun 25, 2026

Actions

View Source View Plugin View on GitHub View README

Chunking

Use this when feeding documents into an LLM context window or a vector store. Kreuzberg chunks two ways: inline during extraction (chunks land on result.chunks), or standalone via the chunk command for text you already have. Sizing is character-based by default, or token-based when a tokenizer model is supplied.

Inline during extraction

Turn on chunking with --chunk and the chunks appear on the structured result under chunks:

# 1000-char chunks, 200-char overlap (defaults when --chunk is on)
kreuzberg extract report.pdf --chunk --format json | jq '.chunks | length'

# Explicit size + overlap
kreuzberg extract report.pdf --chunk --chunk-size 1500 --chunk-overlap 300 --format json

Overlap must be smaller than chunk size — the CLI rejects --chunk-overlap >= --chunk-size. When you set only --chunk-overlap against an existing config, an overlap that exceeds the size is clamped to chunk_size / 4.

Standalone `chunk` command

Chunk text you already have, from --text or stdin. Output defaults to JSON:

# From a flag
kreuzberg chunk --text "long document text ..." --chunk-size 800 --chunk-overlap 100

# From stdin (pipe extracted content straight in)
kreuzberg extract notes.md | kreuzberg chunk --chunk-size 500 --format json

JSON output carries chunks (array of strings), chunk_count, the resolved config (max_characters, overlap, chunker_type), and input_size_bytes. Use --format text for a human-readable dump with --- chunk N --- separators.

Note: in the JSON output, chunker_type is rendered capitalized ("Text", "Markdown", "Yaml", "Semantic") because it is emitted via Rust's Debug formatting, whereas the --chunker-type input flag is lowercase (text, markdown, yaml, semantic). Lowercase the value before comparing if you parse it back.

Chunker types

--chunker-type selects the splitting strategy (standalone chunk command):

Type	Behavior
`text`	Default. Plain character-window splitting with overlap.
`markdown`	Markdown-aware — splits on structure (headings, blocks) where possible.
`yaml`	YAML-aware splitting for structured config/data documents.
`semantic`	Topic-boundary splitting driven by `--topic-threshold` (0.0–1.0, default 0.75).

# Markdown-aware chunking keeps headings and blocks intact
kreuzberg chunk --text "$(cat README.md)" --chunker-type markdown

# Semantic chunking — lower threshold = more, smaller topic chunks
kreuzberg chunk --text "$(cat transcript.txt)" --chunker-type semantic --topic-threshold 0.6

Token-based sizing

By default --chunk-size counts characters. To size chunks by tokens for a specific model, pass --chunking-tokenizer with a HuggingFace tokenizer id. On the extract command this implicitly enables chunking. Requires the chunking-tokenizers feature (present in the default CLI build).

# Size chunks by GPT-4o tokens during extraction
kreuzberg extract report.pdf --chunking-tokenizer Xenova/gpt-4o --format json

# Or on the standalone command
kreuzberg chunk --text "$(cat doc.txt)" --chunking-tokenizer Xenova/gpt-4o --chunk-size 512

With a tokenizer set, --chunk-size is interpreted in tokens, not characters.

Config file alternative

Field names in config files are snake_case under [chunking]:

[chunking]
max_characters = 1000
overlap = 200
chunker_type = "markdown"

kreuzberg extract report.pdf --config kreuzberg.toml --format json

CLI flags map to config fields as --chunk-size → max_characters and --chunk-overlap → overlap. In config files use the snake_case names.

Programmatic access

From Python, enable chunking on the config and read result.chunks:

from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
)
result = extract_file_sync("report.pdf", config=config)
for chunk in result.chunks:
    print(len(chunk))

Python ChunkingConfig uses max_chars / max_overlap. Rust uses max_characters / overlap. See references/python-api.md and references/rust-api.md in the sibling kreuzberg skill.

Picking parameters

RAG / vector store — 500–1000 chars (or 256–512 tokens) with 10–20% overlap. Use markdown chunking for docs to keep sections whole.
LLM summarization — larger chunks (1500–4000 chars) with small overlap; size by tokens to stay under the model window.
Topic segmentation — semantic chunker; tune --topic-threshold down for finer splits, up for coarser ones.

Common pitfalls

Overlap ≥ size — rejected on extract; clamped to size / 4 when only overlap is changed against an existing config.
Tokenizer without the feature — --chunking-tokenizer errors if the CLI was built without chunking-tokenizers. The default build includes it.
Empty input — the standalone chunk command bails on empty text; provide --text or pipe non-empty stdin.

See references/configuration.md for the full [chunking] schema and references/cli-reference.md for every chunk flag.

chunking

Popularity

Invocation

Context Preview

SKILL.md

chunking

Popularity

Invocation

Context Preview

SKILL.md

Chunking

Inline during extraction

Standalone `chunk` command

Chunker types

Token-based sizing

Config file alternative

Programmatic access

Picking parameters

Common pitfalls

Similar Skills

Chunking

Inline during extraction

Standalone `chunk` command

Chunker types

Token-based sizing

Config file alternative

Programmatic access

Picking parameters

Common pitfalls

Similar Skills

chunking

Popularity

Invocation

Context Preview

SKILL.md

chunking

Popularity

Invocation

Context Preview

SKILL.md

Chunking

Inline during extraction

Standalone chunk command

Chunker types

Token-based sizing

Config file alternative

Programmatic access

Picking parameters

Common pitfalls

Similar Skills

Chunking

Inline during extraction

Standalone chunk command

Chunker types

Token-based sizing

Config file alternative

Programmatic access

Picking parameters

Common pitfalls

Similar Skills

Standalone `chunk` command

Standalone `chunk` command