Audits AI/ML models for privacy leakage via training data extraction, membership inference attacks, model inversion, and attribute inference, using ML Privacy Meter, ART, and Foolbox.
npx claudepluginhub mukul975/privacy-data-protection-skills --plugin privacy-skills-complete

This skill uses the workspace's default tool permissions.
AI model privacy auditing is the systematic assessment of whether trained ML models leak information about their training data. Models can memorize individual training records, enabling adversaries to extract personal data, determine dataset membership, reconstruct input features, or infer sensitive attributes. This skill implements a comprehensive model privacy audit methodology using established attack techniques and tools (ML Privacy Meter, ART, Foolbox) to quantify privacy leakage before deployment and periodically during operation. The audit results feed directly into the AI DPIA risk assessment and inform mitigation measure selection.
Objective: Extract verbatim or near-verbatim records from the model's training data.
| Attack Vector | Description | Target Models |
|---|---|---|
| Prompt-based extraction | Craft prompts that cause LLMs to regurgitate training data | Language models, generative models |
| Canary extraction | Insert known canary strings into training data and test if model reproduces them | Any model (testing methodology) |
| Gradient-based extraction | Use model gradients to reconstruct training inputs | Models with accessible gradients |
| Generative reconstruction | Use the model as an oracle to iteratively reconstruct training samples | GANs, VAEs, diffusion models |
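The canary extraction row above can be turned into a concrete audit step. Below is a minimal, hedged sketch of that methodology: plant unique canary strings in the training corpus, then probe the trained model and count how often the planted secrets are reproduced verbatim. The `model_generate` callable and the prompt wording are hypothetical placeholders for the model's actual generation API.

```python
import secrets

def make_canary(prefix: str = "The secret code is") -> str:
    """Generate a unique canary string to plant in the training corpus."""
    return f"{prefix} {secrets.token_hex(8)}"

def extraction_rate(model_generate, canaries, prompt_prefix="The secret code is"):
    """Query the trained model and count how many planted canaries it reproduces.

    `model_generate(prompt)` is a hypothetical callable wrapping the model's
    text-generation API; it returns a generated string.
    """
    extracted = 0
    for canary in canaries:
        completion = model_generate(prompt_prefix)
        # Treat the canary as extracted if its secret suffix appears verbatim.
        secret = canary[len(prompt_prefix):].strip()
        if secret and secret in completion:
            extracted += 1
    return extracted / len(canaries)

# Usage sketch: plant canaries before training, audit after.
canaries = [make_canary() for _ in range(100)]
# ... add `canaries` to the training corpus, train the model, then:
# rate = extraction_rate(my_model.generate, canaries)
# print(f"Canary extraction rate: {rate:.2%}")
```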
Risk Factors Increasing Extraction Likelihood:
Testing Methodology:
Objective: Determine whether a specific record was in the model's training set.
| Attack Type | Method | Computational Cost |
|---|---|---|
| Shadow model attack | Train shadow models on similar data, build a binary classifier on model outputs | High — requires training multiple shadow models |
| Metric-based attack | Use model confidence, loss, or entropy to distinguish members from non-members | Low — single model query per sample |
| Label-only attack | Use predicted labels (no confidence scores) to infer membership | Medium — requires multiple queries |
| Likelihood ratio attack (LiRA) | Compare per-sample loss to reference distributions | High — most accurate, requires multiple models |
ML Privacy Meter Implementation:
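ML Privacy Meter automates shadow-model and metric-based membership inference audits; because its API differs between releases, the sketch below hand-rolls the low-cost metric-based attack from the table rather than calling the library directly. It assumes a scikit-learn-style classifier exposing `predict_proba` and integer class labels; these are illustrative assumptions, not ML Privacy Meter's API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_sample_loss(model, X, y):
    """Per-sample cross-entropy loss for a classifier exposing predict_proba.

    `y` is assumed to contain integer class labels.
    """
    probs = model.predict_proba(X)
    eps = 1e-12
    return -np.log(np.clip(probs[np.arange(len(y)), y], eps, None))

def metric_based_mia(model, X_train, y_train, X_holdout, y_holdout):
    """Score a metric-based membership inference attack.

    Members (training samples) tend to have lower loss than non-members;
    the attack AUC measures how separable the two loss distributions are.
    An AUC near 0.5 indicates little membership leakage.
    """
    loss_members = per_sample_loss(model, X_train, y_train)
    loss_nonmembers = per_sample_loss(model, X_holdout, y_holdout)
    scores = np.concatenate([-loss_members, -loss_nonmembers])  # higher = "member"
    labels = np.concatenate([np.ones_like(loss_members),
                             np.zeros_like(loss_nonmembers)])
    return roc_auc_score(labels, scores)
```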
Testing Methodology:
Objective: Reconstruct input features from model outputs.
| Attack Type | Method | Target |
|---|---|---|
| Confidence-based inversion | Iteratively optimise input to maximise model confidence for a known label | Classification models |
| Gradient-based inversion | Use model gradients to reconstruct inputs from outputs | White-box models |
| GAN-based inversion | Train a GAN to invert model outputs to input space | Face recognition, image classifiers |
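As an illustration of the confidence-based inversion row, the following sketch performs gradient ascent on a synthetic input so that a white-box PyTorch classifier assigns maximum probability to a chosen class. The `model` argument, input shape, and hyperparameters are assumptions for the example, not fixed parts of the methodology.

```python
import torch
import torch.nn.functional as F

def invert_class(model, target_class, input_shape=(1, 3, 64, 64),
                 steps=500, lr=0.05):
    """Confidence-based model inversion: optimise an input so the model
    assigns maximum probability to `target_class`.

    `model` is assumed to be a white-box torch.nn.Module classifier. The
    reconstruction can then be compared (e.g. via SSIM) against real
    training images of that class to quantify leakage.
    """
    model.eval()
    x = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        # Maximise log-probability of the target class, with a small L2
        # penalty to keep the reconstruction in a plausible range.
        loss = -F.log_softmax(logits, dim=1)[0, target_class] + 1e-4 * x.pow(2).sum()
        loss.backward()
        optimizer.step()
    return x.detach()
```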
Testing Methodology:
Objective: Infer sensitive attributes not present in the model's output.
| Attack Type | Description |
|---|---|
| Correlation exploitation | Use correlated features to infer sensitive attributes from model behaviour |
| Partial knowledge attack | Attacker knows some attributes and uses model to infer remaining sensitive ones |
| Group inference | Determine statistical properties of training subgroups |
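A minimal sketch of the correlation-exploitation and partial-knowledge attacks above: the attacker combines the attributes they already know with the target model's output scores and trains an auxiliary classifier to predict the sensitive attribute. The `target_model`, feature matrix, and integer-encoded sensitive attribute are hypothetical; the result is reported as accuracy gain over the majority-class baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def attribute_inference_gain(target_model, X_known, sensitive_attr):
    """Attribute inference audit: can the target model's outputs be used to
    recover a sensitive attribute the attacker does not observe?

    X_known        - features the attacker is assumed to know
    sensitive_attr - integer-encoded sensitive attribute to infer
    Returns attack accuracy minus the majority-class baseline.
    """
    # Attacker's feature set: known attributes plus the target model's scores.
    model_outputs = target_model.predict_proba(X_known)
    attack_features = np.hstack([X_known, model_outputs])

    X_tr, X_te, y_tr, y_te = train_test_split(
        attack_features, sensitive_attr, test_size=0.3, random_state=0)
    attacker = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    accuracy = attacker.score(X_te, y_te)
    baseline = np.bincount(y_te).max() / len(y_te)  # majority-class guess
    return accuracy - baseline
```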
Testing Methodology:
For each selected attack:
| Metric | Acceptable | Elevated | Unacceptable |
|---|---|---|---|
| Membership inference TPR@1%FPR | < 5% | 5-15% | > 15% |
| Training data extraction rate | < 0.1% | 0.1-1% | > 1% |
| Model inversion SSIM | < 0.3 | 0.3-0.6 | > 0.6 |
| Attribute inference accuracy above baseline | < 10% | 10-25% | > 25% |
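The membership inference threshold in the table is the true-positive rate at a fixed 1% false-positive rate. A short sketch of computing it from attack scores (for instance, the scores produced by the metric-based attack sketch above), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """TPR at a fixed FPR (e.g. 1%) from membership-attack scores.

    labels: 1 = member of the training set, 0 = non-member
    scores: attack score, higher = more likely to be a member
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    # Largest TPR achievable without exceeding the target FPR.
    below = fpr <= target_fpr
    return float(tpr[below].max()) if np.any(below) else 0.0

# Read against the thresholds above:
# < 0.05 acceptable, 0.05-0.15 elevated, > 0.15 unacceptable.
```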
| Mitigation | Attacks Mitigated | Trade-off |
|---|---|---|
| Differential privacy (DP-SGD) | All — provides mathematical guarantee | Model accuracy reduction (calibrate epsilon) |
| Training data deduplication | Extraction, membership inference | One-time preprocessing cost |
| Regularisation (dropout, weight decay) | Membership inference, overfitting-related leakage | May affect model performance |
| Output perturbation | Model inversion, attribute inference | Reduces output precision |
| Confidence score rounding | Metric-based membership inference | Minor output precision loss |
| Model distillation | Extraction, membership inference | Requires additional training |
| Rate limiting | All query-based attacks | Affects legitimate use |
| Input/output PII filtering | Extraction of PII from generative models | May affect model utility |
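Of the mitigations above, only DP-SGD carries a formal guarantee. A minimal sketch of enabling it in PyTorch via Opacus (listed in the tools table below) follows; the model, data, and hyperparameters are placeholders, and the exact `make_private` signature may vary between Opacus versions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Placeholder model and data; substitute the real training pipeline.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))
data_loader = DataLoader(dataset, batch_size=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,   # more noise = stronger privacy, lower accuracy
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

for epoch in range(5):
    for x, y in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Track the accumulated privacy budget for the DPIA.
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Trained with (epsilon = {epsilon:.2f}, delta = 1e-5)-DP")
```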
| Tool | Purpose | Source |
|---|---|---|
| ML Privacy Meter | Membership inference auditing | github.com/privacytrustlab/ml_privacy_meter |
| IBM ART | Adversarial robustness and privacy testing | github.com/Trusted-AI/adversarial-robustness-toolbox |
| TensorFlow Privacy | Differential privacy training | github.com/tensorflow/privacy |
| Opacus | PyTorch differential privacy | github.com/pytorch/opacus |
| Google DP Library | Differential privacy algorithms | github.com/google/differential-privacy |
| Foolbox | Adversarial attack library | github.com/bethgelab/foolbox |
Model privacy auditing is not explicitly required by the GDPR or AI Act, but is effectively mandated through: