From synthetic-data
Reference card of open-source synthetic data tools, when to use each, install commands, and design patterns.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-dataThis skill uses the workspace's default tool permissions.
A curated reference of open-source and cloud synthetic data tools. Use this skill when the user asks which tool to pick for their use case, or wants to understand the landscape.
Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.
Share bugs, ideas, or general feedback.
A curated reference of open-source and cloud synthetic data tools. Use this skill when the user asks which tool to pick for their use case, or wants to understand the landscape.
What it does: Learns probability distribution from real data, then samples synthetic rows that preserve marginals, correlations, and constraints.
When to use:
Pros: Mature, multi-table support, handles mixed types, row/column constraints, reversible transformations.
Cons: Slower for very high-dimensional data; assumes stationarity.
Key models:
GaussianCopulaSynthesizer — fast, general, works well for continuous/categorical mixCTGANSynthesizer — neural-network-based, high fidelity on mixed types, slower to fitTVAESynthesizer — variational autoencoder, good balance of speed/fidelityHMASynthesizer — multi-table, learns foreign-key relationshipsInstall: pip install sdv
Quality eval: pip install sdmetrics
What it does: Plugin-based generative synthesis with support for differential privacy, fairness constraints, and time-series.
When to use:
Pros: DP guarantees, plugin architecture, time-series module.
Cons: Steeper learning curve; fewer pre-packaged models than SDV.
Install: pip install synthcity
What it does: Differentially-private tabular synthesis using Gaussian copula or uniform sampling.
When to use:
Pros: Formal privacy guarantees, simple API, fast.
Cons: Limited to DP copula approach; less flexible than SDV.
Install: pip install DataSynthesizer
What it does: GAN-based tabular and time-series synthesis.
When to use:
Pros: Strong empirical fidelity; time-series support.
Cons: Training can be unstable; hyperparameter-sensitive.
Install: pip install ydata-synthetic
What it does: Generates realistic fake values for common fields (names, emails, phone numbers, addresses, dates, credit cards, SSNs, IP addresses, etc.) with locale support.
When to use:
Pros: Simple, fast, extremely customizable, 50+ locales.
Cons: No correlation learning; purely random per-field.
Install: pip install faker
Example:
from faker import Faker
fake = Faker('de_DE')
print(fake.name(), fake.email(), fake.phone_number())
What it does: Similar to Faker — generates localized fake data with additional structure (e.g. geography, person relationships).
When to use:
Pros: Compact, fast, good geographic support.
Cons: Smaller ecosystem than Faker.
Install: pip install mimesis
What it does: LLM-based probabilistic time-series forecasting and generation.
When to use:
Pros: Pre-trained, handles multiple series lengths.
Cons: Requires external service or local inference.
Install: Via Hugging Face, huggingface_hub
What it does: Generative adversarial network for time-series synthesis.
When to use:
Pros: Good temporal fidelity.
Cons: Training instability; requires careful hypertuning.
Install: Via research implementations (e.g. GitHub repos)
What it does: Use Claude to generate or transform text records based on prompts describing persona, tone, intent, schema.
When to use:
Pros: Flexible, human-readable output, custom logic via prompts.
Cons: API costs; slower than generative models; requires API key.
Install: pip install anthropic
What it does: Similar to Claude but via OpenAI models.
When to use:
Install: pip install openai
What it does: Client for Gretel's cloud-based synthetic data platform. Uploads data, trains models in the cloud, downloads synthetic data.
When to use:
Pros: Managed service, no local compute, audit trail.
Cons: Costs; data leaves local network.
Install: pip install gretel-client
| Use case | Recommended | Alternatives |
|---|---|---|
| Real tabular → synthetic | SDV (GaussianCopula) | Synthcity, ydata-synthetic |
| Mixed-type tabular, high fidelity | SDV (CTGAN) | ydata-synthetic (GAN), Gretel |
| Privacy-sensitive tabular | Synthcity, DataSynthesizer | SDV + manual DP |
| Multi-table relational | SDV (HMA) | Synthcity with custom |
| Schema-based tabular (fake) | Faker + Mimesis | custom generation |
| PII replacement on real data | Faker + deterministic mapping | ydata-synthetic (PII aware) |
| Synthetic text records | Claude/GPT API | ydata-synthetic (text models) |
| Real text → synthetic (paraphrase) | Claude/GPT API with prompts | fine-tuned LLMs |
| Time-series forecasting | Chronos | SDV (multivariate time) |
| Time-series GAN-based | TimeGAN, ydata-synthetic | SDV (univariate) |
Next: Choose your use case, then refer to the appropriate skill (tabular-from-schema, tabular-from-real, text-records-llm, etc.) for step-by-step execution.