Skill

synthetic-data-generation

Generates synthetic data using sdg_hub with composable blocks and YAML flows. Supports pre-built flows, custom scripts, agent frameworks, and 100+ LLM providers.

Synthetic Data Generation with SDG Hub

Generate synthetic data using composable blocks and flows. Blocks are processing units that transform datasets; flows chain blocks into pipelines defined in YAML.

Core concept: dataset -> Block_1 -> Block_2 -> Block_3 -> enriched_dataset
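As a rough mental model only, the chain above can be mimicked in plain Python, with each "block" written as a function that takes rows and returns rows with extra columns. The names `add_prompt`, `fake_llm`, and `run_pipeline` are hypothetical and are not part of sdg_hub; real blocks operate on DataFrames and typically call an LLM.

```python
from functools import reduce

def add_prompt(rows):
    # Hypothetical block: derive a "prompt" column from "document".
    return [{**r, "prompt": f"Summarize: {r['document']}"} for r in rows]

def fake_llm(rows):
    # Hypothetical stand-in for an LLM call: upper-case the prompt.
    return [{**r, "response": r["prompt"].upper()} for r in rows]

def run_pipeline(rows, blocks):
    # Feed the output of each block into the next one, in order.
    return reduce(lambda data, block: block(data), blocks, rows)

dataset = [{"document": "sdg_hub chains blocks."}]
enriched = run_pipeline(dataset, [add_prompt, fake_llm])
print(sorted(enriched[0].keys()))
```

Each step only adds columns, so the original `document` column is still present in the enriched rows.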

Choose Your Approach

| Approach | When to Use |
|---|---|
| Pre-built flow | Standard pipeline exists for your task (QA generation, text analysis, red-teaming, RAG eval, MCP distillation) |
| Custom Python | Quick experiments, ad-hoc generation, custom logic |
| Custom YAML flow | Reusable pipeline, team sharing, complex multi-block workflows |
| Agent-based | Need external agent frameworks (Langflow, LangGraph) or MCP tool-use in your pipeline |

Approach A: Pre-Built Flows

Step 1: Discover flows

# play.py
from sdg_hub import FlowRegistry

# List all flows
for f in FlowRegistry.list_flows():
    print(f"- {f['name']} (tags: {f.get('tags', [])})")

# Search by tag
FlowRegistry.search_flows(tag="qa-generation")

Consult references/pre_built_flows.md for the full catalog with descriptions and required inputs.

Step 2: Load and inspect

from sdg_hub import Flow, FlowRegistry

path = FlowRegistry.get_flow_path("Flow Name or ID")
flow = Flow.from_yaml(path)
flow.print_info()

# Check what dataset columns are needed
reqs = flow.get_dataset_requirements()
if reqs:
    print(f"Required columns: {reqs.required_columns}")

Step 3: Configure model

import os

flow.set_model_config(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("OPENAI_API_KEY")
)

# For local models (vLLM, Ollama)
flow.set_model_config(
    model="meta-llama/Llama-3.3-70B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY"
)

See references/model_configs.md for all supported providers (OpenAI, Anthropic, Azure, vLLM, Ollama, Together, Groq, Bedrock, etc.).

Step 4: Prepare data and dry run

import pandas as pd

df = pd.DataFrame({"document": ["Your text here..."]})

# Validate dataset against flow requirements
errors = flow.validate_dataset(df)
if errors:
    print(f"Fix these: {errors}")

# Dry run with 2 samples -- do this before every full run
dry = flow.dry_run(df, sample_size=2)
print(f"Success: {dry['execution_successful']}")
for block in dry['blocks_executed']:
    print(f"  {block['block_name']}: {block['execution_time_seconds']:.2f}s")

Step 5: Generate and save

# Full run with checkpointing for large datasets
result = flow.generate(
    df,
    checkpoint_dir="./checkpoints",
    save_freq=100,
    max_concurrency=5
)

result.to_parquet("output.parquet")

Approach B: Custom Python Scripts

Use blocks directly for ad-hoc experiments.

Basic: Single block

# play.py
from sdg_hub.core.blocks import LLMChatBlock
import pandas as pd

block = LLMChatBlock(
    block_name="gen",
    input_cols="messages",
    output_cols="response",
    model="openai/gpt-4o-mini",
    api_key="sk-...",
    temperature=0.7
)

df = pd.DataFrame({
    "messages": [[
        {"role": "system", "content": "You generate QA pairs."},
        {"role": "user", "content": "Generate a fun fact about Python."}
    ]]
})

result = block(df)
print(result["response"].iloc[0])

Chain: Multiple blocks

from sdg_hub.core.blocks import LLMChatBlock, TagParserBlock
import pandas as pd

# Step 1: Generate
llm = LLMChatBlock(
    block_name="gen",
    input_cols="messages",
    output_cols="response",
    model="openai/gpt-4o-mini",
    api_key="sk-..."
)

# Step 2: Parse with tags
parser = TagParserBlock(
    block_name="parse",
    input_cols="response",
    output_cols=["question", "answer"],
    start_tags=["<question>", "<answer>"],
    end_tags=["</question>", "</answer>"]
)

df = pd.DataFrame({
    "messages": [[
        {"role": "user", "content": "Generate a QA pair. Use <question>...</question> and <answer>...</answer> tags."}
    ]]
})

result = parser(llm(df))
print(result[["question", "answer"]])
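To make the tag-parsing step concrete, here is a minimal sketch of pulling tagged values out of a raw response with a regular expression. `parse_tags` is a hypothetical helper written for illustration. It is not TagParserBlock's implementation, and the real block may handle multiple matches or malformed tags differently.

```python
import re

def parse_tags(text, start_tags, end_tags):
    # Hypothetical helper: return the first match per tag pair, or None.
    values = []
    for start, end in zip(start_tags, end_tags):
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, text, flags=re.DOTALL)
        values.append(match.group(1).strip() if match else None)
    return values

raw = "<question>Who created Python?</question>\n<answer>Guido van Rossum</answer>"
question, answer = parse_tags(
    raw, ["<question>", "<answer>"], ["</question>", "</answer>"]
)
print(question, "|", answer)
```

Start and end tags are paired by position, which is why the two tag lists in the block config need the same order and length as the output columns.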

Batch processing for large datasets

import pandas as pd
from tqdm import tqdm

def process_in_batches(df, block, batch_size=50):
    # Run the block on consecutive slices of df and stitch the results together
    results = []
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df.iloc[i:i+batch_size].copy()
        results.append(block(batch))
    return pd.concat(results, ignore_index=True)
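For intuition about how the loop above slices the data, this small sketch lists the row ranges it would visit. `batch_bounds` is a hypothetical helper, not an sdg_hub function; it only mirrors the `range(0, len(df), batch_size)` stepping.

```python
def batch_bounds(n_rows, batch_size=50):
    # Same stepping as range(0, len(df), batch_size); the last batch may be short.
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

print(batch_bounds(120))  # [(0, 50), (50, 100), (100, 120)]
```

The final batch is simply smaller when the row count is not a multiple of the batch size, so no rows are dropped.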

See references/block_reference.md for all 20+ available blocks and their configurations.

Approach C: Authoring Custom Flow YAMLs

Build incrementally -- start with one block, test, add the next.

Step 1: Define the data contract

# play.py - Clarify inputs and outputs first
import pandas as pd

input_df = pd.DataFrame({
    "document": ["Climate change is accelerating..."],
    "domain": ["environment"]
})
print("Input columns:", list(input_df.columns))

expected_outputs = ["document", "domain", "question", "response"]
print("Expected output:", expected_outputs)

Step 2: Minimal YAML

# flow.yaml
metadata:
  name: "My QA Flow"
  version: "0.1.0"
  author: "Your Name"
  description: "Generate QA pairs from documents"
  dataset_requirements:
    required_columns: ["document"]

blocks:
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_prompt"
      input_cols: ["document"]
      output_cols: "messages"
      prompt_config_path: "prompts/qa.yaml"

  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate"
      input_cols: "messages"
      output_cols: "raw_response"
      temperature: 0.7
      async_mode: true

  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse"
      input_cols: "raw_response"
      output_cols: ["question", "response"]
      start_tags: ["<question>", "<answer>"]
      end_tags: ["</question>", "</answer>"]

Step 3: Create prompt template

# prompts/qa.yaml (relative to flow.yaml)
- role: system
  content: |
    You generate question-answer pairs from documents.

- role: user
  content: |
    Generate one question and answer from this document.
    Use <question>...</question> and <answer>...</answer> tags.

    {document}

Step 4: Test incrementally

# play.py
from sdg_hub import Flow
import pandas as pd

flow = Flow.from_yaml("flow.yaml")
flow.set_model_config(model="openai/gpt-4o-mini", api_key="sk-...")

df = pd.DataFrame({"document": ["Python was created by Guido van Rossum in 1991."]})

# Dry run first
dry = flow.dry_run(df, sample_size=1)
print(f"Success: {dry['execution_successful']}")

# Full run
if dry['execution_successful']:
    result = flow.generate(df)
    print(result[["document", "question", "response"]])

See references/yaml_schema.md for the complete YAML structure and references/flow_patterns.md for common patterns (quality filtering, parallel paths, multi-step extraction).

Approach D: Agent and MCP Pipelines

Agent frameworks (Langflow, LangGraph)

Use AgentBlock to call external agent frameworks as pipeline steps:

from sdg_hub.core.blocks.agent import AgentBlock

block = AgentBlock(
    block_name="my_agent",
    agent_framework="langflow",       # or "langgraph"
    agent_url="http://localhost:7860/api/v1/run/my-flow",
    agent_api_key="your-key",
    input_cols=["question"],
    output_cols=["agent_response"],
    extract_response=True
)

# `dataset` is assumed to be your input data with a "question" column
result = block.generate(dataset)

In YAML flows, configure agent blocks with set_agent_config():

from sdg_hub import Flow

flow = Flow.from_yaml("flow.yaml")
if flow.is_agent_config_required():
    flow.set_agent_config(
        agent_framework="langgraph",
        agent_url="http://localhost:8123",
        agent_api_key="your-key"
    )

MCP tool-use distillation

MCPAgentBlock connects an LLM to a remote MCP server for agentic tool-use. The LLM calls tools in a loop, producing full traces for training data:

- block_type: "MCPAgentBlock"
  block_config:
    block_name: "mcp_agent"
    input_cols: "messages"
    output_cols: "agent_trace"
    mcp_server_url: "http://localhost:3000/mcp"
    max_iterations: 10
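The "tools in a loop" idea can be sketched without any MCP server. In the example below the model and the tool registry are both stubs (`fake_llm`, `TOOLS`, and `run_agent` are hypothetical names), so it shows only the control flow and the shape of a trace, not how MCPAgentBlock or the MCP protocol actually work.

```python
def fake_llm(messages):
    # Hypothetical stand-in for a model: request one tool call, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant",
                "tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"role": "assistant",
            "content": f"The sum is {messages[-1]['content']}."}

TOOLS = {"add": lambda a, b: a + b}  # hypothetical tool registry

def run_agent(messages, max_iterations=10):
    trace = list(messages)
    for _ in range(max_iterations):
        reply = fake_llm(trace)
        trace.append(reply)
        call = reply.get("tool_call")
        if call is None:
            break  # the model produced a final answer
        result = TOOLS[call["name"]](**call["args"])
        trace.append({"role": "tool", "content": str(result)})
    return trace

trace = run_agent([{"role": "user", "content": "What is 2 + 3?"}])
print([m["role"] for m in trace])
```

The iteration cap bounds the loop even if the model keeps requesting tools, which is the role `max_iterations` plays in the config above.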

See the pre-built MCP Server Distillation flow in references/pre_built_flows.md for a complete pipeline.

Flow Methods Quick Reference

flow = Flow.from_yaml("flow.yaml")

# Model configuration
flow.set_model_config(model="...", api_key="...", blocks=["specific_block"])
flow.is_model_config_required()
flow.get_default_model()
flow.get_model_recommendations()

# Agent configuration
flow.set_agent_config(agent_framework="...", agent_url="...", agent_api_key="...")
flow.is_agent_config_required()

# Dataset validation
flow.validate_dataset(df)
flow.get_dataset_requirements()

# Execution
flow.dry_run(df, sample_size=2)
flow.generate(df, checkpoint_dir="./ckpt", save_freq=100, max_concurrency=5)

# Inspection
flow.print_info()
flow.to_yaml("output_flow.yaml")

Block Discovery

from sdg_hub.core.blocks import BlockRegistry

BlockRegistry.discover_blocks()                    # Rich table of all blocks
BlockRegistry.list_blocks(category="llm")          # By category
BlockRegistry.list_blocks(grouped=True)            # Grouped by category
BlockRegistry.categories()                         # All categories

Data I/O

import pandas as pd

# Load
df = pd.read_csv("input.csv")
df = pd.read_parquet("input.parquet")
df = pd.read_json("input.jsonl", lines=True)

# From HuggingFace
from datasets import load_dataset
df = load_dataset("your_dataset", split="train").to_pandas()

# Save
result.to_parquet("output.parquet")
result.to_csv("output.csv", index=False)
result.to_json("output.jsonl", orient="records", lines=True)

# Push to HuggingFace Hub
from datasets import Dataset
Dataset.from_pandas(result).push_to_hub("username/dataset")

Quality Checklist

Before using generated data:

Dry run succeeded with sample_size=2?
Output columns are correct?
Sample outputs look reasonable (spot-check 5-10)?
No excessive nulls or empty values?
Data saved to durable storage?
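The null and empty-value check in the list above can be automated. The sketch below works on plain lists of dicts so it runs without extra dependencies; `empty_rate` is a hypothetical helper, and what counts as "excessive" is left to you.

```python
def empty_rate(rows, column):
    # Fraction of rows where the column is missing, None, or blank.
    def is_empty(value):
        return value is None or (isinstance(value, str) and not value.strip())
    if not rows:
        return 0.0
    return sum(is_empty(r.get(column)) for r in rows) / len(rows)

rows = [
    {"question": "Who created Python?", "response": "Guido van Rossum"},
    {"question": "", "response": None},
    {"question": "When?", "response": "1991"},
    {"question": "Why?", "response": "   "},
]
for col in ("question", "response"):
    print(f"{col}: {empty_rate(rows, col):.0%} empty")
```

With a pandas DataFrame, `result.isna().mean()` gives a comparable per-column null rate, though it does not flag empty or whitespace-only strings.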

Common Issues

"Column X not found" -- Input data is missing a required column. Run flow.get_dataset_requirements() to see what the flow expects, then check your DataFrame columns.
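A quick way to see exactly which columns are absent is to compare the two lists directly. `missing_columns` is a hypothetical helper; the inputs stand in for `list(df.columns)` and the flow's `required_columns`.

```python
def missing_columns(available, required):
    # Keep the order of `required` so the message is easy to read.
    present = set(available)
    return [col for col in required if col not in present]

available = ["text", "domain"]       # e.g. list(df.columns)
required = ["document", "domain"]    # e.g. reqs.required_columns
missing = missing_columns(available, required)
if missing:
    print(f"Add or rename these columns: {missing}")
```

In this example the fix would be renaming `text` to `document` before running the flow.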

Empty or null outputs -- The LLM response didn't match the parser pattern. Check the raw LLM output before parsing, and adjust your prompt template or parser config.
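When inspecting raw output, it helps to list which expected tags never appeared. The helper below is a hypothetical diagnostic, not part of sdg_hub, and only checks for literal tag presence.

```python
def missing_tags(raw, start_tags, end_tags):
    # List every expected tag that does not appear in the raw response.
    return [tag for tag in [*start_tags, *end_tags] if tag not in raw]

raw = "Q: Who created Python?\n<answer>Guido van Rossum</answer>"
print(missing_tags(raw, ["<question>", "<answer>"], ["</question>", "</answer>"]))
```

If the same tags are missing across many rows, the prompt is the likelier culprit; if only a closing tag is missing, the response may have been cut off.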

Rate limit errors -- Reduce max_concurrency in flow.generate() or add timeout and num_retries to set_model_config().
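If you call a provider from your own code outside a flow, a generic retry-with-exponential-backoff wrapper is one common way to ride out rate limits. This is a plain-Python pattern under assumed names (`call_with_backoff`, `flaky`), not a description of what `num_retries` does inside sdg_hub.

```python
import time

def call_with_backoff(fn, num_retries=3, base_delay=1.0, sleep=time.sleep):
    # Retry fn() on any exception, doubling the wait after each failure.
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == num_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

attempts = []

def flaky():
    # Hypothetical call that fails twice with a rate-limit error, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

delays = []  # record the waits instead of actually sleeping
print(call_with_backoff(flaky, sleep=delays.append), delays)
```

Passing `sleep` as a parameter keeps the example instant to run; in real use the default `time.sleep` applies the delays.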

Slow generation -- Use async_mode: true on LLMChatBlock or increase max_concurrency. For long runs, enable checkpointing so an interrupted run can resume instead of starting over.

Model not responding -- Verify your model config works with a single-sample test:

import pandas as pd
from sdg_hub.core.blocks import LLMChatBlock

# One block, one row: isolates model/config problems from flow problems
block = LLMChatBlock(block_name="test", input_cols="messages", output_cols="r", model="...", api_key="...")
block(pd.DataFrame({"messages": [[{"role": "user", "content": "hello"}]]}))

Reference Files

Detailed documentation for specific topics:

references/block_reference.md -- All 20+ blocks with YAML configs and usage examples
references/pre_built_flows.md -- Catalog of pre-built flows with inputs, outputs, and usage
references/model_configs.md -- LLM provider configurations (OpenAI, Anthropic, vLLM, Ollama, etc.)
references/yaml_schema.md -- Complete flow YAML structure and validation rules
references/flow_patterns.md -- Common composition patterns (LLM chain, quality filtering, parallel paths, agent integration)
