From kreuzberg
Splits extracted text into chunks for LLM context windows or RAG ingestion. Supports character, markdown, YAML, and semantic chunking with configurable size and overlap.
How this skill is triggered — by the user, by Claude, or both
Slash command
/kreuzberg:chunkingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this when feeding documents into an LLM context window or a vector
Use this when feeding documents into an LLM context window or a vector
store. Kreuzberg chunks two ways: inline during extraction (chunks land on
result.chunks), or standalone via the chunk command for text you
already have. Sizing is character-based by default, or token-based when a
tokenizer model is supplied.
Turn on chunking with --chunk and the chunks appear on the structured
result under chunks:
# 1000-char chunks, 200-char overlap (defaults when --chunk is on)
kreuzberg extract report.pdf --chunk --format json | jq '.chunks | length'
# Explicit size + overlap
kreuzberg extract report.pdf --chunk --chunk-size 1500 --chunk-overlap 300 --format json
Overlap must be smaller than chunk size — the CLI rejects
--chunk-overlap >= --chunk-size. When you set only --chunk-overlap
against an existing config, an overlap that exceeds the size is clamped to
chunk_size / 4.
chunk commandChunk text you already have, from --text or stdin. Output defaults to
JSON:
# From a flag
kreuzberg chunk --text "long document text ..." --chunk-size 800 --chunk-overlap 100
# From stdin (pipe extracted content straight in)
kreuzberg extract notes.md | kreuzberg chunk --chunk-size 500 --format json
JSON output carries chunks (array of strings), chunk_count, the
resolved config (max_characters, overlap, chunker_type), and
input_size_bytes. Use --format text for a human-readable dump with
--- chunk N --- separators.
Note: in the JSON output,
chunker_typeis rendered capitalized ("Text","Markdown","Yaml","Semantic") because it is emitted via Rust's Debug formatting, whereas the--chunker-typeinput flag is lowercase (text,markdown,yaml,semantic). Lowercase the value before comparing if you parse it back.
--chunker-type selects the splitting strategy (standalone chunk
command):
| Type | Behavior |
|---|---|
text | Default. Plain character-window splitting with overlap. |
markdown | Markdown-aware — splits on structure (headings, blocks) where possible. |
yaml | YAML-aware splitting for structured config/data documents. |
semantic | Topic-boundary splitting driven by --topic-threshold (0.0–1.0, default 0.75). |
# Markdown-aware chunking keeps headings and blocks intact
kreuzberg chunk --text "$(cat README.md)" --chunker-type markdown
# Semantic chunking — lower threshold = more, smaller topic chunks
kreuzberg chunk --text "$(cat transcript.txt)" --chunker-type semantic --topic-threshold 0.6
By default --chunk-size counts characters. To size chunks by tokens for
a specific model, pass --chunking-tokenizer with a HuggingFace tokenizer
id. On the extract command this implicitly enables chunking. Requires the
chunking-tokenizers feature (present in the default CLI build).
# Size chunks by GPT-4o tokens during extraction
kreuzberg extract report.pdf --chunking-tokenizer Xenova/gpt-4o --format json
# Or on the standalone command
kreuzberg chunk --text "$(cat doc.txt)" --chunking-tokenizer Xenova/gpt-4o --chunk-size 512
With a tokenizer set, --chunk-size is interpreted in tokens, not
characters.
Field names in config files are snake_case under [chunking]:
[chunking]
max_characters = 1000
overlap = 200
chunker_type = "markdown"
kreuzberg extract report.pdf --config kreuzberg.toml --format json
CLI flags map to config fields as
--chunk-size→max_charactersand--chunk-overlap→overlap. In config files use the snake_case names.
From Python, enable chunking on the config and read result.chunks:
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig
config = ExtractionConfig(
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
)
result = extract_file_sync("report.pdf", config=config)
for chunk in result.chunks:
print(len(chunk))
Python
ChunkingConfigusesmax_chars/max_overlap. Rust usesmax_characters/overlap. Seereferences/python-api.mdandreferences/rust-api.mdin the siblingkreuzbergskill.
markdown chunking for docs to keep sections whole.semantic chunker; tune --topic-threshold
down for finer splits, up for coarser ones.extract; clamped to size / 4 when
only overlap is changed against an existing config.--chunking-tokenizer errors if the
CLI was built without chunking-tokenizers. The default build includes it.chunk command bails on empty text;
provide --text or pipe non-empty stdin.See references/configuration.md for the full [chunking] schema and
references/cli-reference.md for every chunk flag.
npx claudepluginhub xberg-io/plugins --plugin kreuzbergMines projects and conversations into a searchable memory palace. Activates on queries about MemPalace, memory palace, mining, searching, palace setup, wings, rooms, drawers, or recalling past work.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.