synthetic-data
Fit a generative model on real tabular data and sample synthetic rows preserving correlations and distributions.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-data

This skill uses the workspace's default tool permissions.
Learn a probabilistic model from a real dataset, then sample synthetic rows that preserve statistical properties (marginals, correlations, constraints).
Outputs are written to ./synthetic-data-workspace/outputs/. Choose a synthesizer: GaussianCopula (fast, general), CTGAN (high-fidelity, slower), or TVAE.

Install SDV and SDMetrics:
pip install sdv sdmetrics
Load real data and detect metadata:
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
# Load real data
real_data = pd.read_csv('real_data.csv')
print(real_data.head())
print(real_data.info())
# Auto-detect metadata (column types, constraints)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.validate()
print(metadata)
Fit the synthesizer:
# Option A: Fast general-purpose (GaussianCopula)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Option B: High-fidelity mixed-type (CTGAN, slower)
# from sdv.single_table import CTGANSynthesizer
# synthesizer = CTGANSynthesizer(metadata, epochs=100)
# synthesizer.fit(real_data)
Sample synthetic data:
synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())
# Save to CSV or Parquet
synthetic_data.to_csv('synthetic_data.csv', index=False)
# synthetic_data.to_parquet('synthetic_data.parquet', index=False)
Optional: Multi-table synthesis (HMA):
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer
# If you have multiple related tables:
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data={
    'customers': customers_df,
    'orders': orders_df
})
# Link orders.order_customer_id -> customers.customer_id
metadata.add_relationship(
    parent_table_name='customers',
    child_table_name='orders',
    parent_primary_key='customer_id',
    child_foreign_key='order_customer_id'
)
synthesizer = HMASynthesizer(metadata)
synthesizer.fit({'customers': customers_df, 'orders': orders_df})
synthetic_tables = synthesizer.sample()
Notes:
- Sample any number of rows via num_rows.
- A fitted model can be saved (synthesizer.save('model.pkl')) and reused later.
- Run evaluate-quality to check the fidelity of the synthetic data.
- For differential privacy, pass an epsilon budget (synthesizer_kwargs={'epsilon': 1.0}) or use Synthcity instead.