Skill

langchain-data-handling

Load and chunk documents for LangChain 1.0 RAG pipelines correctly — language-aware splitters, table-safe PDF loaders, Cloudflare-compatible web loaders, chunk-boundary strategies that survive real-world structure. Use when building a RAG pipeline, diagnosing why retrieval misquotes a table, or debugging a crawler returning blank content. Trigger with "langchain document loader", "text splitter", "chunking strategy", "pdf loader", "markdown splitter", "webbaseloader".

npx claudepluginhub flight505/skill-forge --plugin langchain-py-pack

Tool Access

This skill is limited to using the following tools:

ReadWriteEditBash(python:*)Bash(pip:*)

Preview

You have a RAG system over a Python docs site. A user asks "what does

Supporting Assets

references/crawler-hygiene.mdreferences/language-aware-splitters.mdreferences/loader-selection-matrix.mdreferences/one-pager.mdreferences/table-preservation.md

SKILL.md

Similar Skills

cache-components

139.3k

Guides Next.js Cache Components and Partial Prerendering (PPR): 'use cache' directives, cacheLife(), cacheTag(), revalidateTag() for caching, invalidation, static/dynamic optimization. Auto-activates on cacheComponents: true.

cache-components

mcp-builder

124.2k

Guides building MCP servers enabling LLMs to interact with external services via tools. Covers best practices, TypeScript/Node (MCP SDK), Python (FastMCP).

9 files

anthropics-skills-13

Stats

Parent Repo Stars0

Parent Repo Forks0

Last CommitApr 30, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

LangChain Data Handling — Loaders and Splitters (Python)

Overview

You have a RAG system over a Python docs site. A user asks "what does trim_messages do?" and the retriever returns this chunk:

### `trim_messages(strategy="last", include_system=True)`

Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system

...and that's it. The chunk ends there. The code example showing the function body — the actual thing the user wanted — is in a different chunk, retrieved with a lower similarity score and dropped before the LLM sees it. The model then hallucinates the function's behavior from the signature alone.

This is pain-catalog entry P13. RecursiveCharacterTextSplitter's default separators are ["\n\n", "\n", " ", ""]. It splits on any blank line — including inside triple-backtick code fences in Markdown. The fix is a one-line swap to RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN), which treats the fence as an atomic unit, but you have to know the bug exists.

The sibling failures this skill prevents:

P49 — PyPDFLoader splits by page. A 5-row financial table that spans a page break gets torn in half; rows 1-3 go in one chunk, rows 4-5 in another with no header. A RAG answer sourced from the second chunk misquotes the numbers because the column meanings are in the first chunk. Fix: use PyMuPDFLoader or UnstructuredPDFLoader, which detect tables and emit them as distinct structured elements.
P50 — WebBaseLoader's default User-Agent is python-requests/2.x. Cloudflare-protected sites flag this as a bot and return a 403 interstitial HTML page ("Checking your browser...") instead of real content. The crawler indexes the challenge page. You notice weeks later when every retrieval from that source returns the same Cloudflare text. Fix: set a realistic header_template={"User-Agent": "Mozilla/5.0 ..."}, respect robots.txt, and rate-limit per-host to 1 req/sec.

Pinned versions: langchain-core 1.0.x, langchain-community 1.0.x, langchain-text-splitters 1.0.x, pymupdf, unstructured. Pain-catalog anchors: P13, P49, P50, P15.

This skill is the upstream half of the RAG pipeline — load and chunk. For the downstream half (embedding, scoring, reranking) see the pair skill langchain-embeddings-search, which covers score semantics (P12), dim guards (P14), and reranker filtering (P15). Do not re-implement chunking there.

Prerequisites

Python 3.10+
langchain-core >= 1.0, < 2.0 and langchain-community >= 1.0, < 2.0
langchain-text-splitters >= 1.0, < 2.0
PDF support: pip install pymupdf unstructured[pdf]
Web loading: pip install beautifulsoup4 requests
For corpus dedup (optional): pip install datasketch

Instructions

Step 1 — Choose a loader by source format

Loader selection is the first decision — get it wrong and no amount of splitter tuning will recover. Use the decision table:

Source	Use	NOT	Why
PDF with tables	`PyMuPDFLoader` or `UnstructuredPDFLoader`	`PyPDFLoader`	Tables torn by page splits (P49)
PDF text-only	`PyPDFLoader`	—	Simple, fast, OK when no tables
Web page	`WebBaseLoader(header_template=...)`	Default UA	Cloudflare 403 (P50)
Markdown docs	`UnstructuredMarkdownLoader`	Plain text read	Preserves heading structure
HTML long-form	`WebBaseLoader` + `HTMLHeaderTextSplitter`	Plain text	Keeps `<h1>`/`<h2>` context
Code repo	`GenericLoader` with language parser	`DirectoryLoader` as text	Language-aware chunking
Corpus (1000+ docs)	`DirectoryLoader` + `glob` filter	One-by-one	Parallel load, progress

from langchain_community.document_loaders import (
    PyMuPDFLoader,            # table-aware PDF
    WebBaseLoader,            # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables — P49 fix
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page — P50 fix
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

# Corpus
corpus = DirectoryLoader(
    "./docs", glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
).load()

Hard limit: keep single-PDF ingestion under 5 MB per call. Larger files should be pre-split with pdftk / qpdf to avoid OOM on PyMuPDFLoader's full-document parse.

See Loader Selection Matrix for the full per-format table with cost and accuracy notes.

Step 2 — Pick a splitter by content type

Content	Splitter	chunk_size	chunk_overlap	Why
Prose (docs, articles)	`RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`	1000	100	Preserves code fences (P13)
Python source	`RecursiveCharacterTextSplitter.from_language(Language.PYTHON)`	1500	150	Splits at `def`/`class`
FAQ / Q&A	`RecursiveCharacterTextSplitter` with `separators=["\n\n"]`	500	50	One chunk per Q-A pair
HTML long-form	`HTMLHeaderTextSplitter`	—	—	Headers become metadata
Generic text	`RecursiveCharacterTextSplitter`	1000	100	Safe default

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD — P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD — Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD — HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD — breaks inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

See Language-Aware Splitters for the full list of Language.* enum values, custom separator patterns, and the code-fence-detection regex for when you need a custom splitter.

Step 3 — Tune chunk_size and overlap

Defaults from the table work for most corpora. Tune when:

Retrieval misses context: increase chunk_size (1000 → 1500) or chunk_overlap (100 → 200). Overlap is what bridges a concept that crosses chunk boundaries.
Retrieval too broad, answers wander: decrease chunk_size (1000 → 500). Smaller chunks = more precise retrieval but more chunks to index.
Tables / structured data: do NOT tune — index them separately (step 4).

A 1% overlap-to-size ratio is too low (200/20000); 20% is the sweet spot for most prose. Code needs less overlap (10%) because function boundaries are natural splits.

Step 4 — Detect and index tables as structured records

Tables are not text. If your corpus has financial filings, product specs, or any tabular data, index tables as separate records with column metadata:

import fitz  # pymupdf directly for table detection

def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per row."""
    doc = fitz.open(pdf_path)
    records = []
    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table in tables:
            rows = table.extract()
            if not rows:
                continue
            headers = rows[0]
            for row_idx, row in enumerate(rows[1:], start=1):
                record = {
                    "page": page_num,
                    "table_idx": tables.tables.index(table),
                    "row_idx": row_idx,
                    "content": " | ".join(f"{h}: {v}" for h, v in zip(headers, row)),
                    "metadata": dict(zip(headers, row)),
                }
                records.append(record)
    return records

Now a question like "what was Q3 revenue?" retrieves a single row with its column headers attached, not half a table missing the column meanings. See Table Preservation for the full pattern including hybrid retrieval (prose + table records).

Step 5 — Preserve metadata through the pipeline

The loader attaches metadata (source, page, heading); the splitter propagates it. Front-matter in Markdown, PDF page numbers, and web URLs should all end up in doc.metadata so retrieval results are citable:

for doc in md_docs:
    # Markdown front-matter (if loader extracted it)
    print(doc.metadata.get("title"), doc.metadata.get("date"))

# Splitter-preserved metadata
chunks = md_splitter.split_documents(md_docs)
assert chunks[0].metadata == md_docs[0].metadata  # preserved

Custom metadata (tenant_id, version, confidence) should be added before splitting so every chunk inherits it.

Step 6 — Deduplicate noisy corpora

Web crawls and scraped docs often contain near-duplicate pages (nav chrome, footer boilerplate, syndicated posts). MinHash-based dedup at the chunk level keeps the index clean:

from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.9, num_perm=128)
kept = []
for i, chunk in enumerate(all_chunks):
    mh = MinHash(num_perm=128)
    for tok in chunk.page_content.lower().split():
        mh.update(tok.encode())
    if not list(lsh.query(mh)):
        lsh.insert(str(i), mh)
        kept.append(chunk)

A threshold of 0.9 catches near-duplicates (minor wording differences) without eating legitimate paraphrases.

Step 7 — Compose the pipeline

# Multi-stage: load → split → dedup → index
def build_rag_index(source_dir: str, store):
    # 1. Load
    docs = DirectoryLoader(
        source_dir, glob="**/*.md",
        loader_cls=UnstructuredMarkdownLoader,
    ).load()

    # 2. Clean (empty-content filter)
    docs = [d for d in docs if d.page_content.strip()]

    # 3. Split (language-aware)
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
    )
    chunks = splitter.split_documents(docs)

    # 4. Dedup (optional for noisy corpora)
    # chunks = dedup_minhash(chunks, threshold=0.9)

    # 5. Index — handoff to langchain-embeddings-search
    store.add_documents(chunks)
    return store

For the embedding + indexing + retrieval steps, see langchain-embeddings-search.

Output

Loader chosen from the selection matrix matching source format and table needs
Splitter chosen from the decision tree matching content type
Chunk size + overlap tuned from the defaults (1000/100 prose, 1500/150 code, 500/50 FAQ)
Tables extracted as structured records with column metadata (not text chunks)
Web loaders configured with realistic User-Agent and robots.txt respect
Metadata preserved through loader → splitter → index
Optional MinHash dedup (threshold 0.9) for noisy corpora

Error Handling

Error / symptom	Cause	Fix
RAG retrieves function signature without body	`RecursiveCharacterTextSplitter` broke inside code fence (P13)	Use `from_language(Language.MARKDOWN)` or add "```" as first separator
Table rows misquoted in RAG answer	`PyPDFLoader` tore table by page (P49)	Switch to `PyMuPDFLoader`; index tables as structured records
`WebBaseLoader` returns 403 / blank content	Default UA flagged by Cloudflare (P50)	Set `header_template={"User-Agent": "Mozilla/5.0 ..."}`; respect robots.txt
`ValueError: expected str, NoneType found` during split	Empty `page_content`	Filter `[d for d in docs if d.page_content.strip()]` before splitting
`MemoryError` loading PDF	PDF > 5 MB ingested in one call	Pre-split with `pdftk` / `qpdf`; process chunks separately
Chunks missing metadata after split	Custom metadata added after loading but before splitting was lost	Add metadata before `split_documents()`; verify `chunks[0].metadata` preserved
Retrieval quality low on FAQ corpus	Chunks too large, one chunk holds multiple Q-A pairs	Drop to `chunk_size=500, chunk_overlap=50` with `separators=["\n\n"]`
Web crawl indexes Cloudflare challenge page	No check for HTTP status / response length	Assert `len(doc.page_content) > 500` and reject pages containing `"Checking your browser"`
Duplicate chunks eat retrieval slots	Syndicated content, nav chrome not stripped	MinHash dedup at threshold 0.9 before indexing
Reranker scores inconsistent across chunks	Chunks of wildly different size change score distribution (P15)	Normalize chunk size within a corpus; target ±20% of `chunk_size`

Examples

Ingesting a Markdown docs site with code examples

Markdown docs with Python code fences require Language.MARKDOWN to keep fence boundaries intact. Chunk size 1000 with 100 overlap preserves one function-sized example per chunk. Front-matter fields (title, date, author) are attached as metadata for citation. See Language-Aware Splitters.

Ingesting a PDF filing with financial tables

10-Q filings have dozens of multi-row tables. Use PyMuPDFLoader for the prose and a direct fitz.find_tables() pass to extract tables as structured records. Index prose with chunk_size=1000 and tables as one-row-per-record with the header row concatenated. Questions like "what was Q3 revenue?" hit a single row with column meanings attached. See Table Preservation.

Crawling a documentation site behind Cloudflare

Set a realistic User-Agent, fetch robots.txt first and respect Disallow rules, rate-limit to 1 req/sec per host, and prefer the site's sitemap or RSS feed when available. Assert response length > 500 chars and reject known interstitial patterns. See Crawler Hygiene.

Ingesting a Python code repo for code RAG

GenericLoader with LanguageParser(language=Language.PYTHON) preserves function and class boundaries. Chunk size 1500 with 150 overlap gives enough context for typical function-level queries. Imports and module docstrings end up in their own chunks — tag them with metadata for higher precision retrieval on "where is X imported from" queries.

Resources

LangChain Python: Text splitters
LangChain Python: Document loaders
LangChain Python: Retrievers
PyMuPDF (fitz) docs — table detection via page.find_tables()
Unstructured.io — table-aware PDF parsing
Cloudflare bot management — why default UAs fail
robots.txt spec — crawler obligations
Pair skill: langchain-embeddings-search (embed/score/rerank — downstream of this skill)
Pack pain catalog: docs/pain-catalog.md (entries P13, P49, P50, P15)