From sdg-hub
Generates synthetic data using sdg_hub with composable blocks and YAML flows. Supports pre-built flows, custom scripts, agent frameworks, and 100+ LLM providers.
npx claudepluginhub red-hat-ai-innovation-team/sdg_hub --plugin sdg-hub
Slash command: /sdg-hub:synthetic-data-generation
Generate synthetic data using composable blocks and flows. Blocks are processing units that transform datasets; flows chain blocks into pipelines defined in YAML.
Core concept: dataset -> Block_1 -> Block_2 -> Block_3 -> enriched_dataset
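The chaining idea above can be sketched without sdg_hub at all. This is a minimal illustration, assuming a dataset is a list of dicts and a block is any callable that takes rows and returns rows with extra columns; `add_length`, `add_size`, and `run_pipeline` are hypothetical names, not sdg_hub APIs.

```python
from functools import reduce

def add_length(rows):
    """Hypothetical block: adds a 'length' column computed from 'document'."""
    return [{**row, "length": len(row["document"])} for row in rows]

def add_size(rows):
    """Hypothetical block: adds a 'size' column derived from 'length'."""
    return [{**row, "size": "long" if row["length"] > 20 else "short"} for row in rows]

def run_pipeline(rows, blocks):
    """Apply each block in order; the output of one block feeds the next."""
    return reduce(lambda data, block: block(data), blocks, rows)

dataset = [{"document": "Python was created by Guido van Rossum."}, {"document": "Hi"}]
enriched = run_pipeline(dataset, [add_length, add_size])
print(enriched[1])
```

Each block only adds columns and leaves the input rows untouched, which is what makes the blocks easy to reorder and test one at a time.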
| Approach | When to Use |
|---|---|
| Pre-built flow | Standard pipeline exists for your task (QA generation, text analysis, red-teaming, RAG eval, MCP distillation) |
| Custom Python | Quick experiments, ad-hoc generation, custom logic |
| Custom YAML flow | Reusable pipeline, team sharing, complex multi-block workflows |
| Agent-based | Need external agent frameworks (Langflow, LangGraph) or MCP tool-use in your pipeline |
# play.py
from sdg_hub import FlowRegistry
# List all flows
for f in FlowRegistry.list_flows():
    print(f"- {f['name']} (tags: {f.get('tags', [])})")
# Search by tag
FlowRegistry.search_flows(tag="qa-generation")
Consult references/pre_built_flows.md for the full catalog with descriptions and required inputs.
from sdg_hub import Flow, FlowRegistry
path = FlowRegistry.get_flow_path("Flow Name or ID")
flow = Flow.from_yaml(path)
flow.print_info()
# Check what dataset columns are needed
reqs = flow.get_dataset_requirements()
if reqs:
    print(f"Required columns: {reqs.required_columns}")
import os
flow.set_model_config(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("OPENAI_API_KEY")
)
# For local models (vLLM, Ollama)
flow.set_model_config(
    model="meta-llama/Llama-3.3-70B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY"
)
See references/model_configs.md for all supported providers (OpenAI, Anthropic, Azure, vLLM, Ollama, Together, Groq, Bedrock, etc.).
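Since `os.environ.get` returns `None` silently when a variable is unset, it can help to fail early with a clear message instead of at the first LLM call. A minimal sketch, assuming keys live in environment variables; `require_env` is a hypothetical helper, not part of sdg_hub.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (assumes OPENAI_API_KEY is exported in your shell):
# flow.set_model_config(model="openai/gpt-4o-mini", api_key=require_env("OPENAI_API_KEY"))
```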
import pandas as pd
df = pd.DataFrame({"document": ["Your text here..."]})
# Validate dataset against flow requirements
errors = flow.validate_dataset(df)
if errors:
    print(f"Fix these: {errors}")
# Dry run with 2 samples -- do this before every full run
dry = flow.dry_run(df, sample_size=2)
print(f"Success: {dry['execution_successful']}")
for block in dry['blocks_executed']:
    print(f"  {block['block_name']}: {block['execution_time_seconds']:.2f}s")
# Full run with checkpointing for large datasets
result = flow.generate(
    df,
    checkpoint_dir="./checkpoints",
    save_freq=100,
    max_concurrency=5
)
result.to_parquet("output.parquet")
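Checkpointing lets an interrupted run resume instead of starting over. The sketch below illustrates the general idea only, assuming one JSON file per completed chunk; it is not how sdg_hub implements `checkpoint_dir` or `save_freq`, and `process_with_checkpoints` and `work` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path

def process_with_checkpoints(rows, work, checkpoint_dir, save_freq):
    """Process rows in chunks of save_freq, writing one JSON file per chunk.

    Chunks whose file already exists are loaded instead of recomputed.
    """
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    results = []
    for start in range(0, len(rows), save_freq):
        path = ckpt / f"chunk_{start:08d}.json"
        if path.exists():
            chunk = json.loads(path.read_text())  # resume: reuse saved output
        else:
            chunk = [work(row) for row in rows[start:start + save_freq]]
            path.write_text(json.dumps(chunk))
        results.extend(chunk)
    return results

calls = []  # records which rows were actually processed

def work(row):
    calls.append(row["id"])
    return {**row, "done": True}

rows = [{"id": i} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    first = process_with_checkpoints(rows, work, tmp, save_freq=2)
    second = process_with_checkpoints(rows, work, tmp, save_freq=2)  # reuses every chunk
```

In this sketch the second call does no new work, because every chunk file already exists in the checkpoint directory.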
Use blocks directly for ad-hoc experiments.
# play.py
from sdg_hub.core.blocks import LLMChatBlock
import pandas as pd
block = LLMChatBlock(
    block_name="gen",
    input_cols="messages",
    output_cols="response",
    model="openai/gpt-4o-mini",
    api_key="sk-...",
    temperature=0.7
)
df = pd.DataFrame({
    "messages": [[
        {"role": "system", "content": "You generate QA pairs."},
        {"role": "user", "content": "Generate a fun fact about Python."}
    ]]
})
result = block(df)
print(result["response"].iloc[0])
from sdg_hub.core.blocks import LLMChatBlock, TagParserBlock
import pandas as pd
# Step 1: Generate
llm = LLMChatBlock(
    block_name="gen",
    input_cols="messages",
    output_cols="response",
    model="openai/gpt-4o-mini",
    api_key="sk-..."
)
# Step 2: Parse with tags
parser = TagParserBlock(
    block_name="parse",
    input_cols="response",
    output_cols=["question", "answer"],
    start_tags=["<question>", "<answer>"],
    end_tags=["</question>", "</answer>"]
)
df = pd.DataFrame({
    "messages": [[
        {"role": "user", "content": "Generate a QA pair. Use <question>...</question> and <answer>...</answer> tags."}
    ]]
})
result = parser(llm(df))
print(result[["question", "answer"]])
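To make the parsing step concrete, here is a standalone sketch of tag-based extraction using only the standard library. It illustrates the general technique of matching a start and end tag; it is not TagParserBlock's actual implementation, and `extract_tagged` is a hypothetical helper.

```python
import re

def extract_tagged(text, start_tag, end_tag):
    """Return the stripped text between the first start_tag/end_tag pair, or None."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

raw = "<question>Who created Python?</question>\n<answer>Guido van Rossum</answer>"
question = extract_tagged(raw, "<question>", "</question>")
answer = extract_tagged(raw, "<answer>", "</answer>")
missing = extract_tagged("no tags here", "<answer>", "</answer>")
print(question, "|", answer, "|", missing)
```

A `None` result is the kind of case that surfaces as an empty column when the LLM ignores the requested tags.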
import pandas as pd
from tqdm import tqdm

# Run a block over the DataFrame in fixed-size batches and concatenate the results
def process_in_batches(df, block, batch_size=50):
    results = []
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df.iloc[i:i+batch_size].copy()
        results.append(block(batch))
    return pd.concat(results, ignore_index=True)
See references/block_reference.md for all 20+ available blocks and their configurations.
Build incrementally -- start with one block, test, add the next.
# play.py - Clarify inputs and outputs first
import pandas as pd
input_df = pd.DataFrame({
    "document": ["Climate change is accelerating..."],
    "domain": ["environment"]
})
print("Input columns:", list(input_df.columns))
expected_outputs = ["document", "domain", "question", "response"]
print("Expected output:", expected_outputs)
# flow.yaml
metadata:
  name: "My QA Flow"
  version: "0.1.0"
  author: "Your Name"
  description: "Generate QA pairs from documents"
  dataset_requirements:
    required_columns: ["document"]
blocks:
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_prompt"
      input_cols: ["document"]
      output_cols: "messages"
      prompt_config_path: "prompts/qa.yaml"
  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate"
      input_cols: "messages"
      output_cols: "raw_response"
      temperature: 0.7
      async_mode: true
  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse"
      input_cols: "raw_response"
      output_cols: ["question", "response"]
      start_tags: ["<question>", "<answer>"]
      end_tags: ["</question>", "</answer>"]
# prompts/qa.yaml (relative to flow.yaml)
- role: system
  content: |
    You generate question-answer pairs from documents.
- role: user
  content: |
    Generate one question and answer from this document.
    Use <question>...</question> and <answer>...</answer> tags.
    {document}
# play.py
from sdg_hub import Flow
import pandas as pd
flow = Flow.from_yaml("flow.yaml")
flow.set_model_config(model="openai/gpt-4o-mini", api_key="sk-...")
df = pd.DataFrame({"document": ["Python was created by Guido van Rossum in 1991."]})
# Dry run first
dry = flow.dry_run(df, sample_size=1)
print(f"Success: {dry['execution_successful']}")
# Full run
if dry['execution_successful']:
    result = flow.generate(df)
    print(result[["document", "question", "response"]])
See references/yaml_schema.md for the complete YAML structure and references/flow_patterns.md for common patterns (quality filtering, parallel paths, multi-step extraction).
Use AgentBlock to call external agent frameworks as pipeline steps:
from sdg_hub.core.blocks.agent import AgentBlock
block = AgentBlock(
    block_name="my_agent",
    agent_framework="langflow",  # or "langgraph"
    agent_url="http://localhost:7860/api/v1/run/my-flow",
    agent_api_key="your-key",
    input_cols=["question"],
    output_cols=["agent_response"],
    extract_response=True
)
result = block.generate(dataset)
In YAML flows, configure agent blocks with set_agent_config():
flow = Flow.from_yaml("flow.yaml")
if flow.is_agent_config_required():
    flow.set_agent_config(
        agent_framework="langgraph",
        agent_url="http://localhost:8123",
        agent_api_key="your-key"
    )
MCPAgentBlock connects an LLM to a remote MCP server for agentic tool-use. The LLM calls tools in a loop, producing full traces for training data:
- block_type: "MCPAgentBlock"
  block_config:
    block_name: "mcp_agent"
    input_cols: "messages"
    output_cols: "agent_trace"
    mcp_server_url: "http://localhost:3000/mcp"
    max_iterations: 10
See the pre-built MCP Server Distillation flow in references/pre_built_flows.md for a complete pipeline.
flow = Flow.from_yaml("flow.yaml")
# Model configuration
flow.set_model_config(model="...", api_key="...", blocks=["specific_block"])
flow.is_model_config_required()
flow.get_default_model()
flow.get_model_recommendations()
# Agent configuration
flow.set_agent_config(agent_framework="...", agent_url="...", agent_api_key="...")
flow.is_agent_config_required()
# Dataset validation
flow.validate_dataset(df)
flow.get_dataset_requirements()
# Execution
flow.dry_run(df, sample_size=2)
flow.generate(df, checkpoint_dir="./ckpt", save_freq=100, max_concurrency=5)
# Inspection
flow.print_info()
flow.to_yaml("output_flow.yaml")
from sdg_hub.core.blocks import BlockRegistry
BlockRegistry.discover_blocks() # Rich table of all blocks
BlockRegistry.list_blocks(category="llm") # By category
BlockRegistry.list_blocks(grouped=True) # Grouped by category
BlockRegistry.categories() # All categories
import pandas as pd
# Load
df = pd.read_csv("input.csv")
df = pd.read_parquet("input.parquet")
df = pd.read_json("input.jsonl", lines=True)
# From HuggingFace
from datasets import load_dataset
df = load_dataset("your_dataset", split="train").to_pandas()
# Save
result.to_parquet("output.parquet")
result.to_csv("output.csv", index=False)
result.to_json("output.jsonl", orient="records", lines=True)
# Push to HuggingFace Hub
from datasets import Dataset
Dataset.from_pandas(result).push_to_hub("username/dataset")
Before using generated data, confirm that a dry run with sample_size=2 succeeded.
"Column X not found" -- Input data is missing a required column. Run flow.get_dataset_requirements() to see what the flow expects, then check your DataFrame columns.
Empty or null outputs -- The LLM response didn't match the parser pattern. Check the raw LLM output before parsing, and adjust your prompt template or parser config.
Rate limit errors -- Reduce max_concurrency in flow.generate() or add timeout and num_retries to set_model_config().
Slow generation -- Use async_mode: true on LLMChatBlock, increase max_concurrency, or use checkpointing to resume interrupted runs.
Model not responding -- Verify your model config works with a single-sample test:
import pandas as pd
from sdg_hub.core.blocks import LLMChatBlock
block = LLMChatBlock(block_name="test", input_cols="messages", output_cols="r", model="...", api_key="...")
block(pd.DataFrame({"messages": [[{"role": "user", "content": "hello"}]]}))
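For the empty-or-null-outputs case above, one quick way to gauge the damage is to measure how many rows lack a value in each output column. A minimal sketch on plain dicts; `empty_rate` is a hypothetical helper, not an sdg_hub function. With a pandas DataFrame you could pass `result.to_dict("records")`.

```python
import math

def empty_rate(rows, columns):
    """Fraction of rows where each column is missing, None, NaN, or blank text."""
    def is_empty(value):
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return isinstance(value, str) and not value.strip()

    total = len(rows)
    return {
        col: (sum(is_empty(row.get(col)) for row in rows) / total if total else 0.0)
        for col in columns
    }

rows = [
    {"question": "Who created Python?", "answer": "Guido van Rossum"},
    {"question": "", "answer": None},
    {"question": "What is YAML?"},
    {"question": "What is a block?", "answer": "A processing unit"},
]
rates = empty_rate(rows, ["question", "answer"])
print(rates)
```

A high rate in one column usually points at the prompt or parser configuration for the block that produces it.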
Detailed documentation for specific topics:
references/block_reference.md -- All 20+ blocks with YAML configs and usage examples
references/pre_built_flows.md -- Catalog of pre-built flows with inputs, outputs, and usage
references/model_configs.md -- LLM provider configurations (OpenAI, Anthropic, vLLM, Ollama, etc.)
references/yaml_schema.md -- Complete flow YAML structure and validation rules
references/flow_patterns.md -- Common composition patterns (LLM chain, quality filtering, parallel paths, agent integration)