synthetic-data
Generate synthetic text records (support tickets, reviews, medical notes, etc.) using Claude based on a persona pool and schema.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-data

This skill uses the workspace's default tool permissions.
Create synthetic text records (customer support tickets, product reviews, medical notes, chat logs, etc.) using Claude. User specifies the record type, field schema, persona pool, count, and tone. The skill batch-generates records with optional deduplication by semantic similarity.
The user specifies:
- Field schema (e.g., ["subject", "body", "priority"] for a ticket)
- Persona pool (e.g., ["angry customer", "confused user", "power user", "first-time buyer"])
- Output directory (e.g., ./synthetic-data-workspace/outputs/)

Install the Anthropic SDK and dependencies:
pip install anthropic pandas scikit-learn
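The client constructed later reads its credentials from the environment; a typical shell setup looks like this (the key value is a placeholder):

```shell
# The Anthropic client picks this up automatically at construction time.
export ANTHROPIC_API_KEY="sk-ant-..."
```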
Draft a generation prompt template:
You are generating synthetic {record_type} records for testing/training purposes.
Each record must be realistic, following this schema:
{schema_json}
Generate one record as valid JSON (no markdown, just raw JSON).
Assume the persona: {persona}
Tone/style: {tone}
Language: {language}
Ensure each field is filled with realistic, natural text.
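As a sanity check, the template can be rendered locally with str.format before any API calls are made; the example values below are illustrative, not part of the skill:

```python
# Render the generation prompt template locally to inspect it before calling the API.
template = (
    "You are generating synthetic {record_type} records for testing/training purposes.\n"
    "Each record must be realistic, following this schema:\n"
    "{schema_json}\n"
    "Generate one record as valid JSON (no markdown, just raw JSON).\n"
    "Assume the persona: {persona}\n"
    "Tone/style: {tone}\n"
    "Language: {language}\n"
)

prompt = template.format(
    record_type="customer support ticket",
    schema_json='{"subject": "<subject value>", "body": "<body value>"}',
    persona="an angry customer whose order arrived late",
    tone="realistic and varied",
    language="English",
)
print(prompt)
```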
Write a batch generation script:
import json
import anthropic
from itertools import cycle


def generate_synthetic_records(record_type, schema_fields, personas, count,
                               tone, language, output_path):
    client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var
    schema_json = json.dumps({f: f"<{f} value>" for f in schema_fields})
    records = []
    persona_cycle = cycle(personas)

    for i in range(count):
        persona = next(persona_cycle)
        prompt = f"""You are generating synthetic {record_type} records for testing.
Schema (as JSON):
{schema_json}
Generate ONE record. Return only valid JSON, no markdown.
Persona: {persona}
Tone: {tone}
Language: {language}"""
        try:
            message = client.messages.create(
                model="claude-3-5-haiku-20241022",  # Fast, economical
                max_tokens=500,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            response_text = message.content[0].text.strip()
            # Try to parse as JSON; if markdown-wrapped, extract the fenced body
            if response_text.startswith('```'):
                response_text = response_text.split('```')[1].lstrip('json').strip()
            record = json.loads(response_text)
            records.append(record)
            if (i + 1) % 10 == 0:
                print(f"Generated {i + 1}/{count} records...")
        except json.JSONDecodeError as e:
            print(f"Warning: failed to parse record {i+1}: {e}")
            continue
        except anthropic.APIError as e:
            print(f"API error on record {i+1}: {e}")
            continue

    # Save to JSONL
    with open(output_path, 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')
    print(f"Generated {len(records)} records to {output_path}")
    return records


if __name__ == '__main__':
    schema = ["subject", "body", "priority", "category"]
    personas = [
        "An angry customer whose order arrived late",
        "A confused user having technical issues",
        "A power user reporting a feature request",
        "A first-time buyer with basic questions"
    ]
    generate_synthetic_records(
        record_type="customer support ticket",
        schema_fields=schema,
        personas=personas,
        count=50,
        tone="realistic and varied",
        language="English",
        output_path="synthetic_tickets.jsonl"
    )
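Before deduplicating, it can help to confirm that every generated record actually contains all schema fields; validate_records below is a hypothetical helper sketched for this purpose, not part of the skill:

```python
def validate_records(records, schema_fields):
    """Partition records into (valid, invalid); valid means every schema field is present."""
    valid, invalid = [], []
    for r in records:
        (valid if all(f in r for f in schema_fields) else invalid).append(r)
    return valid, invalid

schema = ["subject", "body", "priority", "category"]
records = [
    {"subject": "Late order", "body": "...", "priority": "high", "category": "shipping"},
    {"subject": "Login issue", "body": "..."},  # missing priority and category
]
valid, invalid = validate_records(records, schema)
print(f"{len(valid)} valid, {len(invalid)} invalid")
```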
Optional: Deduplication by semantic similarity:
import json
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate_by_similarity(jsonl_path, threshold=0.85):
    """Remove records whose cosine similarity to an earlier record meets the threshold."""
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f]
    if not records:
        return []

    # Concatenate fields for the similarity check
    texts = [' '.join(str(v) for v in r.values()) for r in records]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)
    similarities = cosine_similarity(tfidf)

    unique_records = [records[0]]
    for i in range(1, len(records)):
        # Keep this record only if it is not too similar to any earlier record
        max_sim = max(similarities[i, :i])
        if max_sim < threshold:
            unique_records.append(records[i])

    print(f"Kept {len(unique_records)} unique records (removed {len(records) - len(unique_records)})")
    return unique_records
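To see the threshold behavior without a file on disk, here is a self-contained sketch of the same TF-IDF plus cosine-similarity idea; the sample records are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    {"subject": "Order arrived late", "body": "My package was three days late."},
    {"subject": "Order arrived late", "body": "My package was three days late."},  # exact duplicate
    {"subject": "Cannot log in", "body": "Password reset email never arrives."},
]
texts = [" ".join(str(v) for v in r.values()) for r in records]
sims = cosine_similarity(TfidfVectorizer().fit_transform(texts))

threshold = 0.85
unique = [records[0]]
for i in range(1, len(records)):
    # Drop any record too similar to an earlier one
    if max(sims[i, :i]) < threshold:
        unique.append(records[i])
print(f"Kept {len(unique)} of {len(records)}")
```

The duplicate ticket scores similarity 1.0 against the first record and is dropped, while the unrelated login ticket survives.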
Run the script and verify:
python generate_text_records.py
head -n 1 synthetic_tickets.jsonl | python -m json.tool
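Since pandas is installed above, the JSONL output can also be loaded into a DataFrame for inspection. This snippet writes a tiny sample file so it runs standalone; in practice, point read_json at the synthetic_tickets.jsonl produced by the run:

```python
import pandas as pd

# Two sample records stand in for real generator output here.
sample = "synthetic_tickets_sample.jsonl"
with open(sample, "w") as f:
    f.write('{"subject": "Late order", "priority": "high"}\n')
    f.write('{"subject": "Login issue", "priority": "low"}\n')

# lines=True tells pandas to parse one JSON object per line (JSONL).
df = pd.read_json(sample, lines=True)
print(df.shape)
print(df["priority"].value_counts())
```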