From synthetic-data
Anonymise a real dataset by replacing PII columns with realistic Faker values, preserving structure and referential integrity.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-data

This skill uses the workspace's default tool permissions.
Anonymise a real dataset in place by replacing specified PII columns (name, email, phone, address, SSN, IP, DoB) with realistic fake values, while preserving non-PII columns and row count. Uses deterministic mapping to ensure repeated values (e.g., the same customer name appearing 5 times) map to the same fake name.
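The deterministic mapping works by hashing each real value into a seed and seeding the generator with it, so repeated inputs always produce the same fake output. A minimal dependency-free sketch of the idea, using `random` as a stand-in for a Faker provider and an illustrative name pool:

```python
import hashlib
import random

# Illustrative name pool standing in for a Faker provider
NAMES = ["Maria Kunze", "Jan Vogel", "Lena Braun", "Tim Hartmann"]

def deterministic_fake(real_value: str) -> str:
    # Same real value -> same hash -> same seed -> same fake value
    seed = int(hashlib.md5(real_value.encode()).hexdigest(), 16) % (2**31)
    return random.Random(seed).choice(NAMES)

# Repeated inputs map to the same fake value
assert deterministic_fake("Alice Smith") == deterministic_fake("Alice Smith")
```

The full script below applies the same hash-to-seed trick per column, with Faker supplying realistic values instead of a fixed pool.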
Parameters:
- pii_columns: mapping of column names to Faker providers (e.g. {"name": "name", "email": "email", "phone": "phone_number"})
- locale: Faker locale (default en_US; e.g., de_DE, fr_FR, ja_JP)
- output directory: ./synthetic-data-workspace/outputs/

Install Faker:
pip install faker
Write a PII replacement script:
import pandas as pd
import hashlib
from faker import Faker

def anonymise_pii(input_path, output_path, pii_columns, locale='en_US'):
    """
    Replace PII columns with Faker-generated values.

    pii_columns: dict of {col_name: faker_provider},
    e.g. {'name': 'name', 'email': 'email', 'phone': 'phone_number',
          'address': 'address', 'ssn': 'ssn', 'dob': 'date_of_birth'}
    """
    fake = Faker(locale)
    df = pd.read_csv(input_path)

    # Deterministic mapping: hash(real_value) → seed → fake_value.
    # Ensures the same real PII maps to the same fake PII across rows.
    mapping = {}
    for col_name, provider in pii_columns.items():
        if col_name not in df.columns:
            print(f"Warning: column {col_name} not found, skipping")
            continue

        # Build the real → fake mapping for this column
        col_mapping = {}
        for real_value in df[col_name].dropna().unique():
            # Derive a deterministic seed from the real value's hash
            hash_int = int(hashlib.md5(str(real_value).encode()).hexdigest(), 16)
            fake.seed_instance(hash_int % (2**31))
            col_mapping[real_value] = getattr(fake, provider)()

        # Apply the mapping; unmapped (NaN) values stay NaN
        df[col_name] = df[col_name].map(col_mapping)
        mapping[col_name] = col_mapping
        print(f"Replaced {col_name}: {len(col_mapping)} unique values")

    # Save the anonymised data
    df.to_csv(output_path, index=False)
    print(f"Anonymised data saved to {output_path}")
    print(df.head())
    return df, mapping

if __name__ == '__main__':
    pii_cols = {
        'customer_name': 'name',
        'email': 'email',
        'phone': 'phone_number',
        'address': 'address',
    }
    anonymise_pii('real_data.csv', 'anonymised_data.csv', pii_cols, locale='de_DE')
Run the script:
python replace_pii.py
Verify output:
head -5 anonymised_data.csv
# Check that PII columns are replaced and non-PII columns unchanged
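The non-PII check can also be done programmatically: compare every non-PII column between the original and anonymised frames. A sketch with inline sample data standing in for the two CSV files (column names are illustrative):

```python
import pandas as pd

# Illustrative frames standing in for real_data.csv and anonymised_data.csv
original = pd.DataFrame({"customer_name": ["Ann Lee", "Bo Chan"],
                         "order_total": [19.99, 42.50]})
anonymised = pd.DataFrame({"customer_name": ["Maria Kunze", "Jan Vogel"],
                           "order_total": [19.99, 42.50]})

pii_cols = ["customer_name"]
non_pii = [c for c in original.columns if c not in pii_cols]

# Non-PII columns must be identical and the row count preserved
assert len(original) == len(anonymised)
assert original[non_pii].equals(anonymised[non_pii])
print("non-PII columns unchanged, row count preserved")
```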
Optional: Check referential integrity:

# Verify that the same original PII value always maps to the same fake value.
# Determinism guarantees this, but a spot-check is cheap:
import pandas as pd

original_df = pd.read_csv('real_data.csv')
anon_df = pd.read_csv('anonymised_data.csv')

for col in pii_cols:
    pairs = pd.DataFrame({'real': original_df[col], 'fake': anon_df[col]}).dropna()
    consistent = (pairs.groupby('real')['fake'].nunique() == 1).all()
    print(f"{col}: consistent mapping = {consistent}")