tweaktune is a Rust-powered, Python-facing library for synthesizing datasets used to train and fine-tune AI models, especially language models.
Build data pipelines that generate synthetic text, structured JSON, and function-calling datasets through LLM APIs, producing high-quality training data for model fine-tuning.

Documentation
Features
Flexible Data Sources
Load data from multiple sources:
- Files: Parquet, CSV, JSONL, JSON
- Databases: PostgreSQL, MySQL, SQLite (via ConnectorX)
- HuggingFace: Direct integration with datasets
- Arrow: PyArrow datasets and record batches
- Python: Dictionaries, functions, Pydantic models
- APIs: OpenAPI specifications for function calling
- SQL: Filter and transform with SQL queries
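To illustrate the SQL filtering idea, here is a sketch in plain Python using the stdlib `sqlite3` module. This is not tweaktune's own API (which routes SQL through its data-source layer); the table and column names are made up for the example:

```python
import sqlite3

# Hypothetical seed table; in tweaktune the filtering would happen inside a
# SQL-backed data source, but the concept is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT, difficulty TEXT)")
conn.executemany(
    "INSERT INTO topics VALUES (?, ?)",
    [("recursion", "hard"), ("loops", "easy"), ("monads", "hard")],
)

# Keep only the rows we want to seed the pipeline with
rows = conn.execute(
    "SELECT name FROM topics WHERE difficulty = 'hard' ORDER BY name"
).fetchall()
seeds = [name for (name,) in rows]
print(seeds)  # → ['monads', 'recursion']
```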
LLM Integration
Connect to any LLM provider:
- OpenAI: GPT-4, GPT-3.5, and compatible APIs
- Azure OpenAI: Enterprise deployments
- Local Models: Unsloth, MistralRS support
- Custom APIs: Any OpenAI-compatible endpoint
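All of these providers speak the same request shape. As an illustration (an assumption about what `with_llm_openai` builds under the hood, not its documented internals), this is the chat-completion payload any OpenAI-compatible endpoint accepts:

```python
import json

# Request body shape shared by OpenAI, Azure OpenAI, and compatible local
# servers; field values here are made up for illustration.
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are an expert educator."},
        {"role": "user", "content": "Generate a question about: recursion"},
    ],
    "temperature": 0.7,
}
body = json.dumps(payload)
print(json.loads(body)["messages"][0]["role"])  # → system
```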
Powerful Pipeline Features
- Parallel Processing: Multi-worker execution for speed
- Dynamic Templates: Jinja2 templating with custom filters
- Data Validation: JSON schema, language detection, custom validators
- Deduplication: Exact hash, fuzzy simhash, semantic embeddings
- Quality Checks: Built-in and custom quality filters
- Conditional Logic: If-else branching in pipelines
- Custom Steps: Extend with Python classes
- Metadata Tracking: Track runs, items, and deduplication state
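The exact-hash deduplication strategy from the list above can be sketched in a few lines of stdlib Python. This is a conceptual sketch, not tweaktune's implementation (which tracks deduplication state in its metadata layer):

```python
import hashlib

def exact_dedup(records):
    """Drop records whose normalized text hashes to an already-seen digest."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = ["What is recursion?", "what is recursion?  ", "What is a loop?"]
print(exact_dedup(records))  # → ['What is recursion?', 'What is a loop?']
```

Fuzzy simhash and semantic-embedding deduplication follow the same pattern but compare near-duplicate signatures instead of exact digests.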
Dataset Generation
Create datasets for:
- Question-Answer pairs: Synthetic Q&A for training
- Function calling: Tool use and API interaction datasets
- Conversations: Multi-turn dialogue datasets
- Structured output: JSON conforming to schemas
- Chat formatting: Model-specific conversation formatting
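As an example of what model-specific chat formatting means, here is a minimal renderer for ChatML-style markup, one common conversation format. This is an illustration of the concept, not tweaktune's own formatter:

```python
def to_chatml(messages):
    """Render a conversation in ChatML-style markup (one common chat format)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

conv = [
    {"role": "user", "content": "What is recursion?"},
    {"role": "assistant", "content": "A function calling itself."},
]
print(to_chatml(conv))
```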
Quick Start
Installation
pip install tweaktune
Simple Example
Generate synthetic data in minutes:
from tweaktune import Pipeline
import os

# Create a Q&A dataset
(Pipeline()
    .with_workers(3)
    .with_llm_openai(
        name="gpt4",
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini"
    )
    .with_template("system", "You are an expert educator.")
    .with_template("question", "Generate a question about: {{topic}}")
    .with_template("answer", "Answer this question: {{question}}")
    .with_template("output", """{"topic": "{{topic}}", "question": "{{question}}", "answer": "{{answer}}"}""")
    .iter_range(100)
    .add_column("topic", lambda data: f"Topic {data['index']}")
    .generate_text(
        template="question",
        llm="gpt4",
        output="question",
        system_template="system"
    )
    .generate_text(
        template="answer",
        llm="gpt4",
        output="answer",
        system_template="system"
    )
    .write_jsonl(path="qa_dataset.jsonl", template="output")
    .run())
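Each pipeline iteration renders the `output` template and appends one JSON line to `qa_dataset.jsonl`. With made-up values, a resulting line parses like this:

```python
import json

# One line of qa_dataset.jsonl, as the `output` template above would render
# it (the field values here are invented for illustration).
line = '{"topic": "Topic 7", "question": "What is recursion?", "answer": "A function calling itself."}'
record = json.loads(line)
print(sorted(record))  # → ['answer', 'question', 'topic']
```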
Function Calling Dataset
Create datasets for training models on tool use:
from tweaktune import Pipeline
from pydantic import Field
import os

def search_products(
    query: str = Field(..., description="Search query"),
    category: str = Field(..., description="Product category")
):
    """Search for products in the catalog."""
    pass

(Pipeline()
    .with_workers(5)
    .with_llm_openai("gpt4", os.environ["OPENAI_API_KEY"], "gpt-4o-mini")
    .with_tools_dataset("tools", [search_products])
    .iter_range(50)
    .sample_tools("tools", 1, "tool")
    # Generate user question, tool call, and response
    # ... (see examples/08_function_calling.py for complete code)
    .render_conversation(
        conversation="@user:question|@assistant:tool_calls([call])|@tool:result|@assistant:answer",
        tools="tool",
        output="conversation"
    )
    .write_jsonl(path="function_calling.jsonl", value="conversation")
    .run())
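Conceptually, a tools dataset turns each annotated Python function into an OpenAI-style tool schema. The sketch below derives such a schema from a plain function signature using stdlib `inspect`; it is an assumption about the general shape, not tweaktune's exact output (which also uses the Pydantic `Field` descriptions), and `to_tool_schema` is a hypothetical helper:

```python
import inspect

def search_products(query: str, category: str):
    """Search for products in the catalog."""

def to_tool_schema(fn):
    """Hypothetical helper: build an OpenAI-style tool schema from a
    function's name, docstring, and parameter names (types simplified
    to strings for this sketch)."""
    params = inspect.signature(fn).parameters
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn),
            "parameters": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in params},
                "required": list(params),
            },
        },
    }

schema = to_tool_schema(search_products)
print(schema["function"]["name"])  # → search_products
```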
More examples in the examples directory.
Learn More
Documentation