tweaktune is a Rust-powered, Python-facing library for synthesizing datasets used to train and fine-tune AI models, especially language models.
Build data pipelines that generate synthetic text, structured JSON, and function-calling datasets through LLM APIs, producing high-quality training data for model fine-tuning.

Documentation
Features
Flexible Data Sources
Load data from multiple sources:
- Files: Parquet, CSV, JSONL, JSON
- Databases: PostgreSQL, MySQL, SQLite (via ConnectorX)
- HuggingFace: Direct integration with datasets
- Arrow: PyArrow datasets and record batches
- Python: Dictionaries, functions, Pydantic models
- APIs: OpenAPI specifications for function calling
- SQL: Filter and transform with SQL queries
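To illustrate the SQL filtering idea, here is a sketch in plain Python using the stdlib `sqlite3` module. This is not tweaktune's own API (which routes SQL through its data-source layer); the table and column names are made up for the example:

```python
import sqlite3

# Hypothetical seed table; in tweaktune the filtering would happen inside a
# SQL-backed data source, but the concept is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT, difficulty TEXT)")
conn.executemany(
    "INSERT INTO topics VALUES (?, ?)",
    [("recursion", "hard"), ("loops", "easy"), ("monads", "hard")],
)

# Keep only the rows we want to seed the pipeline with
rows = conn.execute(
    "SELECT name FROM topics WHERE difficulty = 'hard' ORDER BY name"
).fetchall()
seeds = [name for (name,) in rows]
print(seeds)  # → ['monads', 'recursion']
```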
LLM Integration
Connect to any LLM provider:
- OpenAI: GPT-4, GPT-3.5, and compatible APIs
- Azure OpenAI: Enterprise deployments
- Local Models: Unsloth, MistralRS support
- Custom APIs: Any OpenAI-compatible endpoint
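All of these providers speak the same request shape. As an illustration (an assumption about what `with_llm_openai` builds under the hood, not its documented internals), this is the chat-completion payload any OpenAI-compatible endpoint accepts:

```python
import json

# Request body shape shared by OpenAI, Azure OpenAI, and compatible local
# servers; field values here are made up for illustration.
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are an expert educator."},
        {"role": "user", "content": "Generate a question about: recursion"},
    ],
    "temperature": 0.7,
}
body = json.dumps(payload)
print(json.loads(body)["messages"][0]["role"])  # → system
```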
Powerful Pipeline Features
- Parallel Processing: Multi-worker execution for speed
- Dynamic Templates: Jinja2 templating with custom filters
- Data Validation: JSON schema, language detection, custom validators
- Deduplication: Exact hash, fuzzy simhash, semantic embeddings
- Quality Checks: Built-in and custom quality filters
- Conditional Logic: If-else branching in pipelines
- Custom Steps: Extend with Python classes
- Metadata Tracking: Track runs, items, and deduplication state
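The exact-hash deduplication strategy from the list above can be sketched in a few lines of stdlib Python. This is a conceptual sketch, not tweaktune's implementation (which tracks deduplication state in its metadata layer):

```python
import hashlib

def exact_dedup(records):
    """Drop records whose normalized text hashes to an already-seen digest."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = ["What is recursion?", "what is recursion?  ", "What is a loop?"]
print(exact_dedup(records))  # → ['What is recursion?', 'What is a loop?']
```

Fuzzy simhash and semantic-embedding deduplication follow the same pattern but compare near-duplicate signatures instead of exact digests.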
Dataset Generation
Create datasets for:
- Question-Answer pairs: Synthetic Q&A for training
- Function calling: Tool use and API interaction datasets
- Conversations: Multi-turn dialogue datasets
- Structured output: JSON conforming to schemas
- Chat formatting: Model-specific conversation formatting
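As an example of what model-specific chat formatting means, here is a minimal renderer for ChatML-style markup, one common conversation format. This is an illustration of the concept, not tweaktune's own formatter:

```python
def to_chatml(messages):
    """Render a conversation in ChatML-style markup (one common chat format)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

conv = [
    {"role": "user", "content": "What is recursion?"},
    {"role": "assistant", "content": "A function calling itself."},
]
print(to_chatml(conv))
```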
Quick Start
Installation
pip install tweaktune
Simple Example
Generate synthetic data in minutes:
from tweaktune import Pipeline
import os

# Create a Q&A dataset
(Pipeline()
    .with_workers(3)
    .with_llm_openai(
        name="gpt4",
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini"
    )
    .with_template("system", "You are an expert educator.")
    .with_template("question", "Generate a question about: {{topic}}")
    .with_template("answer", "Answer this question: {{question}}")
    .with_template("output", """{"topic": "{{topic}}", "question": "{{question}}", "answer": "{{answer}}"}""")
    .iter_range(100)
    .add_column("topic", lambda data: f"Topic {data['index']}")
    .generate_text(
        template="question",
        llm="gpt4",
        output="question",
        system_template="system"
    )
    .generate_text(
        template="answer",
        llm="gpt4",
        output="answer",
        system_template="system"
    )
    .write_jsonl(path="qa_dataset.jsonl", template="output")
    .run())
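Each pipeline iteration renders the `output` template and appends one JSON line to `qa_dataset.jsonl`. With made-up values, a resulting line parses like this:

```python
import json

# One line of qa_dataset.jsonl, as the `output` template above would render
# it (the field values here are invented for illustration).
line = '{"topic": "Topic 7", "question": "What is recursion?", "answer": "A function calling itself."}'
record = json.loads(line)
print(sorted(record))  # → ['answer', 'question', 'topic']
```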
Function Calling Dataset
Create datasets for training models on tool use:
from tweaktune import Pipeline
from pydantic import Field
import os

def search_products(
    query: str = Field(..., description="Search query"),
    category: str = Field(..., description="Product category")
):
    """Search for products in the catalog."""
    pass

(Pipeline()
    .with_workers(5)
    .with_llm_openai("gpt4", os.environ["OPENAI_API_KEY"], "gpt-4o-mini")
    .with_tools_dataset("tools", [search_products])
    .iter_range(50)
    .sample_tools("tools", 1, "tool")
    # Generate user question, tool call, and response
    # ... (see examples/08_function_calling.py for complete code)
    .render_conversation(
        conversation="@user:question|@assistant:tool_calls([call])|@tool:result|@assistant:answer",
        tools="tool",
        output="conversation"
    )
    .write_jsonl(path="function_calling.jsonl", value="conversation")
    .run())
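Conceptually, a tools dataset turns each annotated Python function into an OpenAI-style tool schema. The sketch below derives such a schema from a plain function signature using stdlib `inspect`; it is an assumption about the general shape, not tweaktune's exact output (which also uses the Pydantic `Field` descriptions), and `to_tool_schema` is a hypothetical helper:

```python
import inspect

def search_products(query: str, category: str):
    """Search for products in the catalog."""

def to_tool_schema(fn):
    """Hypothetical helper: build an OpenAI-style tool schema from a
    function's name, docstring, and parameter names (types simplified
    to strings for this sketch)."""
    params = inspect.signature(fn).parameters
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn),
            "parameters": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in params},
                "required": list(params),
            },
        },
    }

schema = to_tool_schema(search_products)
print(schema["function"]["name"])  # → search_products
```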
More examples in the examples directory.
Learn More
Documentation