Converts local files (PDF, CSV, TXT, MD) into Lightningrod seeds and manages FileSets for large-scale document ingestion with metadata filtering and temporal ordering.
```python
# `lr` below is an initialized Lightningrod client; `pipeline` is a transform
# pipeline defined elsewhere.
from lightningrod import preprocessing

# Glob pattern — supports .txt, .md, .pdf, .csv
samples = preprocessing.files_to_samples(
    "data/*.pdf",
    chunk_size=1000,
    chunk_overlap=100,
)

# Single file
samples = preprocessing.file_to_samples("report.pdf")

# CSV with explicit columns
samples = preprocessing.files_to_samples(
    "data.csv",
    csv_text_column="body",
    csv_label_column="outcome",  # optional — embeds label in sample
)

# Raw string chunks (`chunks` is any list of strings)
samples = preprocessing.chunks_to_samples(chunks, metadata={"source": "internal"})

input_dataset = lr.datasets.create_from_samples(samples, batch_size=1000)

# Pass to lr.transforms.run():
dataset = lr.transforms.run(pipeline, input_dataset=input_dataset, max_questions=10)
```
Prefer `preprocessing.files_to_samples()` for small, one-shot collections that only need to become seeds. Reach for a FileSet when you need large-scale ingestion, per-file metadata filtering, or temporal ordering across documents.

```python
from lightningrod import (
    FileSetMetadataSchemaInput, MetadataFieldDefinitionInput, MetadataFieldType,
)

# Metadata schema is optional — include it only if you plan to filter on these fields later
schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="ticker", field_type=MetadataFieldType.STRING, required=True,
        description="Company ticker",
    ),
])

fs = lr.filesets.create(
    name="quarterly-reports",
    description="Investor reports",
    metadata_schema=schema,  # omit for unstructured collections
)
```
```python
from datetime import datetime

# Per-file metadata is a dict keyed by filename. file_date powers temporal ordering.
result = lr.filesets.upload_files(
    fs.id,
    file_paths=["report_q1.pdf", "report_q2.pdf"],
    metadata={
        "report_q1.pdf": {"ticker": "APEX", "file_date": datetime(2024, 3, 31)},
        "report_q2.pdf": {"ticker": "APEX", "file_date": datetime(2024, 6, 30)},
    },
)
print(result.succeeded, result.failed, result.errors)
```
```python
# Scale path — uses parallel GCS transfer. Requires the [transfer] extra:
#   pip install "lightningrod-ai[transfer]"
result = lr.filesets.upload_directory(
    fs.id, "./docs", pattern="*.pdf", max_workers=100, show_progress=True,
)
```
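When file dates are encoded in the filenames themselves, the per-file metadata dict can be built programmatically rather than by hand. A minimal sketch, assuming a hypothetical naming scheme like `report_2024-03-31.pdf` (the regex helper is illustrative, not part of the SDK):

```python
import re
from datetime import datetime
from pathlib import Path

def metadata_from_filenames(paths, ticker):
    """Build the per-file metadata dict, deriving file_date from a
    YYYY-MM-DD token in each filename (hypothetical naming scheme)."""
    metadata = {}
    for path in paths:
        name = Path(path).name  # upload_files keys metadata by filename
        match = re.search(r"(\d{4})-(\d{2})-(\d{2})", name)
        if match is None:
            continue  # no date token: leave this file's metadata unset
        year, month, day = map(int, match.groups())
        metadata[name] = {"ticker": ticker, "file_date": datetime(year, month, day)}
    return metadata

paths = [str(p) for p in Path("./docs").glob("*.pdf")]
result = lr.filesets.upload_files(
    fs.id,
    file_paths=paths,
    metadata=metadata_from_filenames(paths, "APEX"),
)
```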
After upload, choose the transform pattern that matches how the documents relate to your questions.
### FileSetSeedGenerator

Use when documents provide the material questions are about, but labels come from elsewhere (web search, news events, a later-resolved outcome).
```python
from lightningrod import FileSetSeedGenerator

seed_gen = FileSetSeedGenerator(
    file_set_id=fs.id,
    chunk_size=2000,
    chunk_overlap=200,
    metadata_filters=["ticker='APEX'"],  # SQL-style, optional
)
```
### FileSetDocumentContextGenerator / FileSetDocumentLabeler

No embeddings, no vector search. Picks a single document by chronological relationship to the seed and passes its full text to the LLM. Right when:

- You need an exact chronological relationship to the seed (EQUAL / NEXT_DOCUMENT / PREVIOUS_DOCUMENT) — Qdrant can't express this

```python
from lightningrod import (
    FileSetDocumentContextGenerator, FileSetDocumentLabeler, TemporalConstraint,
    BinaryAnswerType,
)

# TemporalConstraint values: EQUAL, NEXT_DOCUMENT, PREVIOUS_DOCUMENT, BEFORE, AFTER
context = FileSetDocumentContextGenerator(
    file_set_id=fs.id,
    temporal_constraint=TemporalConstraint.EQUAL,  # same doc as seed
    metadata_filter_keys=["ticker"],  # match seed's ticker
    system_instruction="Extract sections relevant to forecasting.",
    max_document_chars=200_000,  # optional
)

labeler = FileSetDocumentLabeler(
    file_set_id=fs.id,
    temporal_constraint=TemporalConstraint.NEXT_DOCUMENT,  # resolve from next report
    metadata_filter_keys=["ticker"],
    confidence_threshold=0.7,
    answer_type=BinaryAnswerType(
        labeler_instruction="Resolve Yes/No only when explicitly addressed.",
    ),
)
```
### QdrantContextGenerator / QdrantRAGLabeler

Builds a vector index over the FileSet (BAAI/bge-small-en-v1.5, index_chunk_size=1500). At runtime, embeds the question and retrieves top_k chunks across the whole corpus. Right when relevant passages may sit anywhere in the corpus and you need semantic search rather than a single chronologically chosen document.
```python
from lightningrod import QdrantContextGenerator, QdrantRAGLabeler

context = QdrantContextGenerator(
    file_set_id=fs.id,
    top_k=5,
    # Maps Qdrant payload key -> sample metadata key. Restricts retrieval to
    # chunks whose `ticker` payload equals the sample's `ticker`.
    payload_filters={"ticker": "ticker"},
    temporal_direction="before",  # soft timestamp filter: "before" | "after"
)

labeler = QdrantRAGLabeler(
    file_set_id=fs.id,
    payload_filters={"ticker": "ticker"},
    temporal_direction="after",  # forward-looking questions resolved by later docs
    confidence_threshold=0.7,
    answer_type=BinaryAnswerType(),
)
```
| Dimension | `Qdrant*` | `FileSetDocument*` |
|---|---|---|
| Retrieval | Vector search, `top_k` chunks | Single whole document, picked chronologically |
| Index | Builds embeddings on first use | None |
| Temporal param | `temporal_direction="before"/"after"` | `temporal_constraint=TemporalConstraint.{EQUAL, NEXT_DOCUMENT, PREVIOUS_DOCUMENT, BEFORE, AFTER}` |
| Metadata filter | `payload_filters={"qdrant_key": "sample_key"}` | `metadata_filter_keys=["key1", "key2"]` |
| Best for | Knowledge-base search | Periodic reports that resolve each other |
Rule of thumb: FileSetDocument = periodic reports that resolve each other. Qdrant = searching a knowledge base.
Before building a pipeline, check that the data is suitable:
| Check | How | Minimum bar |
|---|---|---|
| Volume | `len(samples)` | ≥ 50 samples for a meaningful demo |
| Date coverage | Check `sample.date` fields | Dates present for temporal split; span ≥ 30 days for forecasting |
| Text quality | Spot-check `sample.text` values | Readable prose, not garbled OCR or empty strings |
| Label availability | Check `sample.label` if using `QuestionAndLabelGenerator` | Labels present and non-null |
If the data fails a check, explain the issue clearly and stop — do not build a pipeline on bad inputs.
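A minimal pre-flight sketch of these checks, assuming samples expose the `date`, `text`, and `label` attributes referenced in the table (the helper itself is illustrative, not an SDK function):

```python
import random
from datetime import timedelta

def check_samples(samples, need_labels=False):
    """Run the suitability checks above and return a list of failure messages."""
    problems = []
    if len(samples) < 50:
        problems.append(f"only {len(samples)} samples; need >= 50 for a meaningful demo")

    dates = [s.date for s in samples if getattr(s, "date", None) is not None]
    if not dates:
        problems.append("no dates present; temporal split is impossible")
    elif max(dates) - min(dates) < timedelta(days=30):
        problems.append("date span under 30 days; too short for forecasting")

    # Spot-check a handful of texts for garbled OCR or empty strings
    for s in random.sample(list(samples), min(5, len(samples))):
        if not (s.text or "").strip():
            problems.append("empty or whitespace-only text found in spot-check")

    if need_labels and any(getattr(s, "label", None) is None for s in samples):
        problems.append("missing labels; required for QuestionAndLabelGenerator")
    return problems

issues = check_samples(samples, need_labels=True)
if issues:
    raise SystemExit("Data not suitable:\n- " + "\n- ".join(issues))
```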
Chunking guidance: `chunk_size=1000, chunk_overlap=100` works for most documents; use smaller chunks (`chunk_size=500`) for short or dense documents, and larger chunks (`chunk_size=1500`) for long ones. For worked examples, see `notebooks/getting_started/02_custom_documents_datasource.ipynb` and `notebooks/custom_filesets/`.