Skill

extracting-tables

Extracts structured tabular data from PDFs, spreadsheets, and images using layout-aware detection and configurable table models. Outputs markdown or JSON cells.

Python

data-engineering

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/kreuzberg:extracting-tables

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this when the user wants structured tabular data — financial

SKILL.md

149 lines · ~1.4k tokens

Stats

LanguageShell

Parent stars5

MaintenanceExcellent

Last CommitJun 25, 2026

Actions

View Source View Plugin View on GitHub View README

Extracting tables

Use this when the user wants structured tabular data — financial statements, scientific tables, invoices, spreadsheet-style PDFs. Kreuzberg detects tables via a layout model (RT-DETR v2) and reconstructs cell structure with a configurable table model.

Basic usage

# Markdown tables embedded in the content stream
kreuzberg extract report.pdf --layout --content-format markdown

# Structured JSON output, tables appear under result.tables
kreuzberg extract report.pdf --layout --format json

--layout turns on layout-aware extraction; without it, tables fall back to plain text reflow and you lose cell boundaries.

Output shapes

Two surfaces, picked via --format (CLI shape) and --content-format (content rendering):

Markdown tables in content — --content-format markdown. Tables appear inline as | col | col | blocks. Good for LLM ingestion.
Structured tables array — --format json. Each entry has cells[][] (rows × cols), markdown (pre-rendered), page_index, bbox. Use this when downstream code needs exact cell access.

Both are populated at once when --layout is on. The tables array is always structured; the content stream switches representation.

kreuzberg extract financials.pdf --layout --format json \
  | jq '.tables[] | {page: .page_index, rows: (.cells | length)}'

Table models

--layout-table-model picks the reconstruction backend:

Model	Best for	Notes
`tatr`	dense complex tables (academic, financial)	Default. Heaviest, highest accuracy.
`slanet_auto`	dispatches per-table to wired/wireless	Good when table styles are mixed.
`slanet_wired`	tables with visible borders	Faster than tatr.
`slanet_wireless`	tables without borders (whitespace-separated)	For invoices, simple grids.
`slanet_plus`	hybrid wired / wireless	Lighter than `slanet_auto`.
`disabled`	layout detection only, no table structure	Use to skip table model cost.

kreuzberg extract bank-statement.pdf \
  --layout --layout-table-model tatr --content-format markdown

Drop --layout-confidence when the layout model misses tables (default threshold ~0.5):

kreuzberg extract noisy-scan.pdf --layout --layout-confidence 0.3

Spreadsheets

.xlsx, .ods, .csv, .tsv are extracted by dedicated parsers — no layout model needed. Each sheet becomes a markdown table (or structured table) automatically:

kreuzberg extract workbook.xlsx --content-format markdown
kreuzberg extract data.csv --format json

Pass --no-cache=true only when iterating on the same file with different configs.

Config file alternative

# `output_format` in config files equals `--content-format` on the CLI.
output_format = "markdown"

[layout_detection]
enabled = true
confidence_threshold = 0.5
table_model = "tatr"

Then:

kreuzberg extract report.pdf --format json

Programmatic access

From Python, structured tables live on result.tables:

from kreuzberg import extract_file_sync, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout_detection=LayoutDetectionConfig(enabled=True, table_model="tatr"),
    output_format="markdown",
)
result = extract_file_sync("report.pdf", config=config)
for table in result.tables:
    print(table.markdown)        # rendered markdown
    print(table.cells[0][0])     # cell access

Node.js mirrors this (extractFile, result.tables, camelCase fields). See references/python-api.md and references/nodejs-api.md in the sibling kreuzberg skill for full type signatures.

Known limitations

Merged cells — reconstructed as repeated values across the spanned region; the merge is not preserved as metadata in v0.1.
Rotated tables — enable --ocr-auto-rotate true for image-based PDFs before extraction.
Nested tables — flattened. Detection succeeds; structural nesting is lost.
Multi-page tables — each page yields a separate tables[] entry. Stitch by matching column headers if needed.
ONNX Runtime required — layout and table models are unavailable in WASM builds and on the Android x86_64 emulator; native targets ship full support.

Common failure modes

Empty tables with --layout on — confidence threshold too high or table model mismatched. Drop --layout-confidence to 0.3, try --layout-table-model tatr.
Markdown tables look ragged — switch --layout-table-model to slanet_wired for bordered grids or slanet_wireless for invoices.
Slow extraction — tatr is heavy. Use slanet_auto or slanet_plus as a default; reach for tatr only when accuracy matters.

See references/cli-reference.md for the full layout flag set and references/advanced-features.md for the layout pipeline internals.

extracting-tables

Popularity

Invocation

Context Preview

SKILL.md

extracting-tables

Popularity

Invocation

Context Preview

SKILL.md

Extracting tables

Basic usage

Output shapes

Table models

Spreadsheets

Config file alternative

Programmatic access

Known limitations

Common failure modes

Similar Skills

Extracting tables

Basic usage

Output shapes

Table models

Spreadsheets

Config file alternative

Programmatic access

Known limitations

Common failure modes

Similar Skills