From lightningrod
Provides production examples for content-learning (SFT) training: TopicTree + WebSearch for domain knowledge, and FileSet + QuestionAndLabel for document-based Q&A.

`npx claudepluginhub lightning-rod-labs/lightningrod-python-sdk`
---
From documents: Documents → chunk → QuestionAndLabelGenerator (extracts Q and A) → SFT. Use QuestionAndLabelGenerator, not WebSearchLabeler — the answers are in the documents.
From a topic/domain (no documents): Domain → TopicTreeSeedGenerator → questions → WebSearchLabeler (finds answers from the web) → SFT.
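The decision rule above hinges on where the answers live. A trivial helper (illustrative only, not part of the SDK) makes it explicit:

```python
def choose_labeling_strategy(has_source_documents: bool) -> str:
    """Pick the labeling component based on where the answers live.

    Illustrative helper only; not part of the lightningrod SDK.
    """
    if has_source_documents:
        # Answers are in the documents: extract Q&A pairs directly.
        return "QuestionAndLabelGenerator"
    # No documents: generate questions from a topic tree, answer from the web.
    return "WebSearchLabeler"
```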
Goal: Train a model to give step-by-step survival instructions for grid-down emergencies.
TopicTreeSeedGenerator decomposes broad domains into specific leaf seeds for coverage, then WebSearchLabeler finds authoritative answers from the web.
Source: `lightningrod-python-sdk/notebooks/fine_tuning/03_survival_llm.ipynb`
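Before the code, it helps to check why depth 2 and degree 5 yield 400 seeds: each root expands into `degree ** depth` leaves, and leaf counts sum across roots. A back-of-the-envelope sketch:

```python
def leaf_seed_count(num_roots: int, tree_degree: int, tree_depth: int) -> int:
    """Leaf seeds produced by full recursive expansion:
    each node fans out into tree_degree subtopics per level."""
    return num_roots * tree_degree ** tree_depth

# 16 root topics, 5 subtopics per node, 2 levels of expansion:
print(leaf_seed_count(16, 5, 2))  # → 400
```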
```python
from lightningrod import (
    LightningRod, QuestionPipeline,
    QuestionGenerator, FreeResponseAnswerType, WebSearchLabeler,
)

# TopicTreeSeedGenerator is coming soon — not yet available in the SDK.
# When released, import it from lightningrod and use as shown below.
from lightningrod import TopicTreeSeedGenerator  # available soon

lr = LightningRod(api_key=api_key)

answer_type = FreeResponseAnswerType(
    labeler_instruction=(
        "You are a survival expert giving emergency field instructions. "
        "Direct, numbered steps. No introductions or disclaimers. "
        "Specific measurements and techniques."
    ),
    answer_format_instruction=(
        "Direct, step-by-step answer. Start with step 1, no introduction."
    ),
)
```
```python
pipeline = QuestionPipeline(
    # TopicTreeSeedGenerator decomposes each root topic into degree^depth leaf seeds.
    # 16 roots × 5^2 = 400 specific seeds like
    # "Field medicine → improvising supplies → makeshift tourniquets"
    seed_generator=TopicTreeSeedGenerator(
        topic=[
            "Field medicine and trauma care in austere environments",
            "Water purification and safe water sourcing without electricity",
            "Food preservation, canning, and long-term storage without refrigeration",
            "Ham radio and emergency communications setup and operation",
            "Land navigation using map, compass, and natural indicators",
            "Growing food: gardening, permaculture, and seed saving",
            "Herbal medicine and natural remedies from wild plants",
            "Construction, structural repair, and improvised building",
            "Welding, metalworking, and tool fabrication",
            "Vehicle repair and mechanical troubleshooting without a shop",
            "Fire starting, fire management, and fuel sourcing",
            "Emergency shelter building from natural and salvaged materials",
            "Hunting, trapping, fishing, and wild game processing",
            "Knot tying, rope work, and cordage making",
            "Weather reading and natural forecasting without instruments",
            "Perimeter security, self-defense, and community safety planning",
        ],
        tree_depth=2,   # levels of recursive expansion
        tree_degree=5,  # subtopics per node
        model_name="google/gemini-3-flash-preview",
        model_system_prompt=(
            "You are an expert in survival and self-reliance. "
            "Generate specific, practical subtopics useful in a grid-down emergency."
        ),
    ),
    question_generator=QuestionGenerator(
        answer_type=answer_type,
        questions_per_seed=10,  # high — topic seeds are conceptual, not dense text
        instructions=(
            "Generate practical survival questions for grid-down emergencies. "
            "Specific, scenario-based, ask HOW to do something with limited tools. "
            "Each must cover a DISTINCT technique."
        ),
        examples=[
            "How do I purify water using only sand, gravel, and charcoal?",
            "How do I perform a needle decompression for tension pneumothorax in the field?",
            "How do I build a Dakota fire hole to minimize smoke and maximize heat?",
        ],
        bad_examples=[
            "What is survival? (too vague)",
            "Tell me about water purification. (not actionable)",
            "How does a ham radio work? (theoretical, not how-to)",
        ],
    ),
    labeler=WebSearchLabeler(answer_type=answer_type, confidence_threshold=0.8),
)

dataset = lr.transforms.run(pipeline, name="SurvivalLLM")
```
After `dataset = lr.transforms.run(...)`, prepare a train split and run hosted SFT on Lightning Rod (the same service used for GRPO training):
```python
from lightningrod import (
    prepare_for_training, FilterParams, SplitParams, SFTTrainingConfig,
    display_lint_overview, get_lint_affected_sample_ids,
)

# Lint the full dataset before splitting
lint_result = lr.datasets.linter.run(dataset.id)
display_lint_overview(lint_result)

train_dataset, test_dataset = prepare_for_training(
    dataset,
    filter=FilterParams(),
    split=SplitParams(test_size=0.2),
)

BASE_MODEL = "Qwen/Qwen3-8B-Instruct"
training_config = SFTTrainingConfig(
    base_model_id=BASE_MODEL,
    training_steps=50,
    epochs=3,
    learning_rate=2e-4,
)

cost = lr.training.estimate_cost(training_config, dataset=train_dataset)
job = lr.training.run(training_config, dataset=train_dataset, name="survival-sft-v1")
# job.model_id — your LoRA checkpoint for inference via lr.predict(...)
```
For low-level local training loops (e.g. direct Tinker ServiceClient), use the Tinker SDK separately; the snippet above is the recommended path when your data already lives in Lightning Rod samples.
Goal: Train a model to answer clinical nutrition questions using knowledge from medical textbooks.
QuestionAndLabelGenerator extracts Q&A pairs directly from document chunks — no labeler needed since the answers are in the text.
Source: `llm_forecasting/notebooks/client_work/takeoff41/dataset_generation.ipynb`
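The seed generator below chunks documents with `chunk_size=4000` and `chunk_overlap=200`. A simple character-based chunker illustrates what those two parameters mean (a sketch only; the SDK's actual splitting strategy may differ):

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters, each overlapping
    the previous window by chunk_overlap characters. Illustrative only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is fully covered by the previous chunk's overlap.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap keeps a fact that straddles a chunk boundary visible in both chunks, so no Q&A-worthy passage is split in half.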
```python
import json
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

from lightningrod import (
    LightningRod, FileSetMetadataSchemaInput,
    MetadataFieldDefinitionInput, MetadataFieldType,
)

lr = LightningRod(api_key=api_key)

schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="book_title", field_type=MetadataFieldType.STRING, required=True,
        description="Title of the textbook",
    ),
])

fileset = lr.filesets.create(
    name="Medical Nutrition Textbooks",
    description="Clinical nutrition textbooks for SFT training data",
    metadata_schema=schema,
)

# textbooks is a list of (pdf_path, title) tuples
file_names = [pdf_path.name for pdf_path, _ in textbooks]
upload_response = lr.filesets.upload_folder(fileset.id, file_names)

# Upload PDFs in parallel
def upload_file(pdf_path, title):
    url = upload_response.upload_urls.additional_properties[pdf_path.name]
    with open(pdf_path, "rb") as f:
        requests.put(url, data=f.read()).raise_for_status()

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(upload_file, pdf_path, title)
               for pdf_path, title in textbooks]
    for future in as_completed(futures):
        future.result()  # re-raise any upload exception instead of failing silently

# Upload metadata manifest
manifest = {pdf_path.name: {"book_title": title} for pdf_path, title in textbooks}
manifest_url = upload_response.upload_urls.additional_properties["_manifest.json"]
requests.put(manifest_url, data=json.dumps(manifest).encode("utf-8")).raise_for_status()
```
The vector index is built automatically when the FileSet is first used in a pipeline.
```python
from lightningrod import (
    QuestionPipeline, FileSetSeedGenerator,
    QuestionAndLabelGenerator, FreeResponseAnswerType,
)

pipeline = QuestionPipeline(
    seed_generator=FileSetSeedGenerator(
        file_set_id=fileset.id,
        chunk_size=4000,    # larger chunks = more context per Q&A
        chunk_overlap=200,
    ),
    question_generator=QuestionAndLabelGenerator(
        answer_type=FreeResponseAnswerType(),
        questions_per_seed=3,  # 3 Q&A pairs per chunk — dense medical text
        instructions=(
            "Generate questions testing understanding of clinical nutrition concepts, "
            "medical procedures, and evidence-based practices. Specific, proper terminology. "
            "Answers should cite specific values/ranges when mentioned."
        ),
    ),
)

dataset = lr.transforms.run(pipeline, max_seeds=4000, name="Medical nutrition Q&A")
```
```python
# Convert valid samples to chat-format SFT records
sft_data = []
for s in dataset.download():
    if not s.is_valid:
        continue
    q, a = s.question.question_text, s.label.label
    if not q or not a or a == "undetermined":
        continue
    sft_data.append({"messages": [
        {"role": "system", "content": "You are a clinical nutrition expert."},
        {"role": "user", "content": q},
        {"role": "assistant", "content": a},
    ]})
```
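Most SFT trainers consume this chat format as JSONL, one message list per line. A short serialization step (the filename and helper are illustrative, not part of the SDK):

```python
import json

def write_sft_jsonl(samples: list[dict], path: str) -> int:
    """Write chat-format samples to a JSONL file, one JSON object per line.
    Returns the number of records written."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return len(samples)

# e.g. write_sft_jsonl(sft_data, "medical_nutrition_sft.jsonl")
```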
| Book | Q&A Pairs |
|---|---|
| ASPEN Parenteral Nutrition | 1,504 |
| ASPEN Fluids & Electrolytes | 1,127 |
| ASPEN Pediatric Nutrition | 3,787 |
| Handbook | 1,347 |
| NBNSC Book | 908 |
| Pediatric Nutrition | 1,666 |
| Total | 10,339 |
- From documents: use `QuestionAndLabelGenerator`, not `WebSearchLabeler` — the answers are in the documents.
- From a topic/domain: `WebSearchLabeler` is correct — the web provides answers for topic-generated questions.
- Filter for quality with `FilterCriteria(min_score=0.7)`, score cutoffs, or agreement checks.
- Match `questions_per_seed` to seed density: topic tree nodes → 10, doc chunks (4000) → 3, doc chunks (2000) → 2, short text → 1.