From kreuzberg
Extracts structured tabular data from PDFs, spreadsheets, and images using layout-aware detection and configurable table models. Outputs markdown or JSON cells.
Slash command: /kreuzberg:extracting-tables
Use this when the user wants structured tabular data — financial statements, scientific tables, invoices, spreadsheet-style PDFs. Kreuzberg detects tables via a layout model (RT-DETR v2) and reconstructs cell structure with a configurable table model.
# Markdown tables embedded in the content stream
kreuzberg extract report.pdf --layout --content-format markdown
# Structured JSON output, tables appear under result.tables
kreuzberg extract report.pdf --layout --format json
--layout turns on layout-aware extraction; without it, tables fall back
to plain text reflow and you lose cell boundaries.
Two surfaces, picked via --format (CLI shape) and --content-format
(content rendering):
- content: --content-format markdown. Tables appear inline as
  | col | col | blocks. Good for LLM ingestion.
- tables array: --format json. Each entry has cells[][] (rows × cols),
  markdown (pre-rendered), page_index, bbox. Use this when downstream
  code needs exact cell access.

Both are populated at once when --layout is on. The tables array is
always structured; the content stream switches representation.
kreuzberg extract financials.pdf --layout --format json \
| jq '.tables[] | {page: .page_index, rows: (.cells | length)}'
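The same summary can be produced in Python when jq is not available. This is a minimal sketch, assuming the JSON document has a top-level tables array whose entries carry cells and page_index as described above. The sample payload and the summarize_tables helper are hypothetical.

```python
import json


def summarize_tables(raw_json: str) -> list[dict]:
    """Return one {page, rows, cols} summary per table in the JSON output."""
    doc = json.loads(raw_json)
    summaries = []
    for table in doc.get("tables", []):
        cells = table.get("cells", [])
        summaries.append({
            "page": table.get("page_index"),
            "rows": len(cells),
            # Rows may be ragged, so report the widest one.
            "cols": max((len(row) for row in cells), default=0),
        })
    return summaries


# Hypothetical payload shaped like the fields listed above.
sample = json.dumps({
    "tables": [
        {"page_index": 0, "cells": [["Item", "Q1"], ["Revenue", "10"]]},
        {"page_index": 2, "cells": [["A", "B", "C"]]},
    ]
})
print(summarize_tables(sample))
```

In practice raw_json would be the captured stdout of a kreuzberg extract run with --layout --format json.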
--layout-table-model picks the reconstruction backend:
| Model | Best for | Notes |
|---|---|---|
| tatr | dense complex tables (academic, financial) | Default. Heaviest, highest accuracy. |
| slanet_auto | dispatches per-table to wired/wireless | Good when table styles are mixed. |
| slanet_wired | tables with visible borders | Faster than tatr. |
| slanet_wireless | tables without borders (whitespace-separated) | For invoices, simple grids. |
| slanet_plus | hybrid wired / wireless | Lighter than slanet_auto. |
| disabled | layout detection only, no table structure | Use to skip table model cost. |
kreuzberg extract bank-statement.pdf \
--layout --layout-table-model tatr --content-format markdown
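The model table above can be read as a small decision rule. The sketch below encodes one possible reading of it. pick_table_model and its parameters are hypothetical helpers, not part of kreuzberg; only the returned model names come from the table.

```python
def pick_table_model(bordered=None, mixed_styles=False,
                     need_max_accuracy=False, structure_needed=True):
    """Map rough table traits to a --layout-table-model value.

    bordered: True for visible grid lines, False for whitespace-separated
    tables, None when the style is unknown.
    """
    if not structure_needed:
        return "disabled"        # layout detection only
    if need_max_accuracy:
        return "tatr"            # heaviest, highest accuracy
    if mixed_styles:
        return "slanet_auto"     # dispatches per table
    if bordered is True:
        return "slanet_wired"
    if bordered is False:
        return "slanet_wireless"
    return "slanet_plus"         # hybrid choice when the style is unknown


model = pick_table_model(bordered=False)
print(["--layout", "--layout-table-model", model])
```

The ordering of the checks is a judgment call: accuracy is tested first so that a caller who asks for it always gets tatr, whatever the border style.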
Lower --layout-confidence when the layout model misses tables (default
threshold ~0.5):
kreuzberg extract noisy-scan.pdf --layout --layout-confidence 0.3
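When it is unclear how far to lower the threshold, one option is to step it down until tables appear. This is a rough sketch of that loop, assuming a caller-supplied extract(path, confidence) callable that returns the list of detected tables. The callable, the step values, and the fake extractor used for the demo are all hypothetical.

```python
from typing import Callable, Sequence


def extract_with_fallback(
    path: str,
    extract: Callable[[str, float], list],
    thresholds: Sequence[float] = (0.5, 0.4, 0.3),
):
    """Try each threshold in order; return (threshold, tables) for the first
    attempt that finds any table, or (last threshold, []) if none do."""
    if not thresholds:
        raise ValueError("thresholds must not be empty")
    tables = []
    used = thresholds[0]
    for used in thresholds:
        tables = extract(path, used)
        if tables:
            break
    return used, tables


# Stand-in extractor for demonstration: pretends the only table in the
# document is detected once the threshold is 0.35 or lower.
def fake_extract(path: str, confidence: float) -> list:
    return ["table"] if confidence <= 0.35 else []


print(extract_with_fallback("noisy-scan.pdf", fake_extract))
```

A real extract callable could shell out to the CLI with --layout-confidence, or call the Python API if its layout config accepts the same confidence_threshold key as the config file (an assumption worth checking against the API reference).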
.xlsx, .ods, .csv, .tsv are extracted by dedicated parsers — no
layout model needed. Each sheet becomes a markdown table (or structured
table) automatically:
kreuzberg extract workbook.xlsx --content-format markdown
kreuzberg extract data.csv --format json
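The split between layout-driven and parser-driven formats can be captured when building commands programmatically. This is a minimal sketch using only flags that appear in this document. build_extract_command is a hypothetical helper.

```python
from pathlib import Path

# Formats the text above says are handled by dedicated parsers.
SPREADSHEET_SUFFIXES = {".xlsx", ".ods", ".csv", ".tsv"}


def build_extract_command(path: str, as_json: bool = False) -> list[str]:
    """Build a kreuzberg CLI argv, adding --layout only where it matters."""
    cmd = ["kreuzberg", "extract", path]
    if Path(path).suffix.lower() not in SPREADSHEET_SUFFIXES:
        cmd.append("--layout")  # non-spreadsheet inputs rely on layout detection
    if as_json:
        cmd += ["--format", "json"]
    else:
        cmd += ["--content-format", "markdown"]
    return cmd


print(build_extract_command("workbook.xlsx"))
print(build_extract_command("report.pdf", as_json=True))
```

The resulting list can be handed to subprocess.run(cmd, check=True) unchanged.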
Pass --no-cache=true only when iterating on the same file with different
configs.
# `output_format` in config files equals `--content-format` on the CLI.
output_format = "markdown"
[layout_detection]
enabled = true
confidence_threshold = 0.5
table_model = "tatr"
Then:
kreuzberg extract report.pdf --format json
From Python, structured tables live on result.tables:
from kreuzberg import extract_file_sync, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout_detection=LayoutDetectionConfig(enabled=True, table_model="tatr"),
    output_format="markdown",
)
result = extract_file_sync("report.pdf", config=config)

for table in result.tables:
    print(table.markdown)     # rendered markdown
    print(table.cells[0][0])  # cell access
Node.js mirrors this (extractFile, result.tables, camelCase fields).
See references/python-api.md and references/nodejs-api.md in the
sibling kreuzberg skill for full type signatures.
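Once cells is in hand, a common next step is turning the grid into records keyed by the header row. This is a minimal sketch on plain lists, assuming cells is a list of rows of strings and that the first row holds the column headers, which will not be true for every table. cells_to_records is a hypothetical helper.

```python
def cells_to_records(cells: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as headers and return one dict per data row."""
    if not cells:
        return []
    headers = [h.strip() for h in cells[0]]
    records = []
    for row in cells[1:]:
        # Pad short rows so every record has every header.
        padded = list(row) + [""] * (len(headers) - len(row))
        records.append(dict(zip(headers, (c.strip() for c in padded))))
    return records


# Sample grid standing in for table.cells.
cells = [["Item", "Amount"], ["Rent", " 1200 "], ["Power"]]
print(cells_to_records(cells))
```

Rows longer than the header are truncated by zip; a stricter version might raise instead.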
- Use --ocr-auto-rotate true for image-based PDFs before extraction.
- A table that continues across pages yields more than one tables[]
  entry. Stitch by matching column headers if needed.
- Empty tables with --layout on: confidence threshold too high or
  table model mismatched. Lower --layout-confidence to 0.3, try
  --layout-table-model tatr.
- Set --layout-table-model to slanet_wired for bordered grids or
  slanet_wireless for invoices.
- tatr is heavy. Use slanet_auto or slanet_plus as a default; reach for
  tatr only when accuracy matters.

See references/cli-reference.md for the full layout flag set and
references/advanced-features.md for the layout pipeline internals.
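Stitching by column headers can be sketched as follows. This assumes each entry is a dict with cells and page_index as in the JSON output, that a continuation repeats the header row of the table it continues, and that it sits on the immediately following page. stitch_tables is a hypothetical helper, and these heuristics will not fit every document.

```python
def stitch_tables(tables: list[dict]) -> list[dict]:
    """Merge entries whose header row matches the previous entry on the
    preceding page. Returns new dicts with 'pages' and combined 'cells'."""
    stitched: list[dict] = []
    for table in sorted(tables, key=lambda t: t["page_index"]):
        cells = table["cells"]
        prev = stitched[-1] if stitched else None
        if (
            prev is not None
            and cells
            and prev["cells"]
            and cells[0] == prev["cells"][0]
            and table["page_index"] == prev["pages"][-1] + 1
        ):
            prev["cells"].extend(cells[1:])  # skip the repeated header
            prev["pages"].append(table["page_index"])
        else:
            stitched.append({"pages": [table["page_index"]],
                             "cells": [list(r) for r in cells]})
    return stitched


parts = [
    {"page_index": 0, "cells": [["Date", "Amount"], ["01-02", "10"]]},
    {"page_index": 1, "cells": [["Date", "Amount"], ["01-03", "20"]]},
    {"page_index": 3, "cells": [["Name", "Qty"], ["Bolt", "4"]]},
]
print(stitch_tables(parts))
```

Only the most recent stitched table is considered as a merge target, so pages that interleave several tables need a more careful match.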
npx claudepluginhub xberg-io/plugins --plugin kreuzberg