Implements federated learning architecture patterns for GDPR-compliant distributed ML training, including secure aggregation protocols, differential privacy, and communication-efficiency techniques.
```
npx claudepluginhub mukul975/privacy-data-protection-skills --plugin ai-privacy-governance-skills
```

This skill uses the workspace's default tool permissions.
Federated learning (FL) is a distributed machine learning approach that trains models across multiple data holders without centralising personal data. Instead of collecting training data into a central repository, federated learning sends the model to the data, computes local updates on each participant's device or server, and aggregates only model updates (gradients or weights) at a central coordinator. This architecture directly addresses GDPR data minimisation (Art. 5(1)(c)) and data protection by design (Art. 25) principles by eliminating the need to transfer and centralise personal data for AI training. However, federated learning is not a privacy silver bullet — it introduces its own privacy risks that must be managed through complementary techniques.
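To make the round structure concrete, here is a minimal, hypothetical sketch of one federated averaging (FedAvg) round in Python, assuming model weights are NumPy arrays and local training is a single illustrative gradient step; all function and variable names are illustrative. Real deployments layer the secure aggregation, differential privacy, and compression techniques described below on top of this loop.

```python
# Minimal sketch of one FedAvg round (hypothetical names throughout):
# the coordinator sends the global model to each participant, each
# participant trains locally on its private data, and only the
# size-weighted average of the updated weights returns to the coordinator.
import numpy as np

def local_train(global_weights: np.ndarray, local_data: np.ndarray) -> np.ndarray:
    # Stand-in for local SGD on the participant's private data: one
    # illustrative gradient step toward the local data mean.
    gradient = global_weights - local_data.mean(axis=0)
    return global_weights - 0.1 * gradient

def fedavg_round(global_weights: np.ndarray, silos: list) -> np.ndarray:
    # Updates are weighted by local dataset size, as in standard FedAvg.
    updates = [local_train(global_weights, data) for data in silos]
    sizes = np.array([len(data) for data in silos], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
weights = rng.normal(size=5)
silos = [rng.normal(loc=i, size=(50, 5)) for i in range(3)]  # private data stays local
weights = fedavg_round(weights, silos)
```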
Cross-device federated learning. Use case: Training on data from millions of user devices (smartphones, tablets, IoT).
| Component | Description |
|---|---|
| Participants | End-user devices (smartphones, tablets, wearables) |
| Scale | Thousands to millions of participants |
| Data | Small per-device, large aggregate (e.g., keyboard predictions, health metrics) |
| Coordination | Central server selects participants per round, distributes model, aggregates updates |
| Communication | Compressed gradient updates over mobile networks |
| Privacy risk | Individual gradient updates may leak information about device data |
GDPR Analysis: Raw data never leaves the device, supporting data minimisation (Art. 5(1)(c)); however, individual gradient updates may still qualify as personal data, so secure aggregation and differential privacy remain necessary safeguards.
Cross-silo federated learning. Use case: Training across organisational boundaries (hospitals, banks, subsidiaries).
| Component | Description |
|---|---|
| Participants | Organisational data silos (hospitals, branches, partner companies) |
| Scale | 2 to 100 participants |
| Data | Large per-silo, structured (e.g., medical records, financial transactions) |
| Coordination | Trusted aggregator or peer-to-peer protocol |
| Communication | Model updates over secure channels between organisations |
| Privacy risk | Gradient updates may reveal institutional data patterns |
GDPR Analysis: Data remains within each organisation's silo, but exchanged updates may reveal institutional data patterns; secure channels and secure aggregation are the primary safeguards, and each participant's local processing still needs its own documented purpose and legal basis.
Vertical federated learning. Use case: Different organisations hold different features for the same individuals.
| Component | Description |
|---|---|
| Participants | Organisations with complementary data (bank + retailer sharing customer features) |
| Scale | 2 to 10 participants |
| Data | Different features for overlapping individuals |
| Coordination | Secure multi-party computation for feature combination |
| Communication | Encrypted intermediate representations |
| Privacy risk | Feature linkage may reveal individual attributes across parties |
GDPR Analysis: Linking features for the same individuals across parties is itself processing of personal data, so entity alignment and feature combination should rely on privacy-preserving protocols such as secure multi-party computation and encrypted intermediate representations.
Secure aggregation via pairwise masking: participants mask their local updates with pairwise random masks that cancel out upon aggregation, so the aggregator receives the sum without seeing any individual update (a minimal sketch follows the property table below).
| Property | Value |
|---|---|
| Privacy guarantee | Individual updates not visible to aggregator or other participants |
| Computational cost | O(n^2) pairwise key agreement, O(n) masking per round |
| Communication cost | 2x baseline (masks + masked updates) |
| Dropout tolerance | Handles participant dropout if sufficient participants remain |
| Collusion resistance | Secure against aggregator + up to t-1 participant collusion |
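The following is a simplified, illustrative sketch of the masking idea, assuming pairwise seeds have already been agreed (in practice via Diffie-Hellman key exchange) and ignoring dropout handling, which real protocols address by secret-sharing the seeds; all names are hypothetical.

```python
# Minimal sketch of pairwise-masking secure aggregation: each pair (i, j)
# shares a seed; the lower-indexed participant adds the derived mask and
# the higher-indexed one subtracts it, so all masks cancel in the sum.
import numpy as np
from itertools import combinations

n, dim = 4, 8
rng = np.random.default_rng(0)
updates = [rng.normal(size=dim) for _ in range(n)]  # local model updates

# Pairwise shared seeds (in practice agreed via a key-exchange protocol).
pair_seed = {(i, j): rng.integers(2**32) for i, j in combinations(range(n), 2)}

def masked_update(i: int) -> np.ndarray:
    masked = updates[i].copy()
    for (a, b), seed in pair_seed.items():
        mask = np.random.default_rng(seed).normal(size=dim)
        if i == a:
            masked += mask   # lower-indexed party adds the mask
        elif i == b:
            masked -= mask   # higher-indexed party subtracts it
    return masked

# The aggregator sees only masked updates; their sum equals the true sum.
aggregate = sum(masked_update(i) for i in range(n))
assert np.allclose(aggregate, sum(updates))
```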
Homomorphic-encryption aggregation: participants encrypt their updates with an additively homomorphic encryption scheme, and the aggregator computes the sum on encrypted data without ever decrypting individual updates (a minimal sketch follows the property table below).
| Property | Value |
|---|---|
| Privacy guarantee | Computationally secure — updates encrypted throughout |
| Computational cost | 100-1000x overhead for encryption/decryption operations |
| Communication cost | 2-10x baseline (ciphertext expansion) |
| Dropout tolerance | Excellent — encrypted updates can be summed independently |
| Collusion resistance | Secure against aggregator (does not hold decryption key) |
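Below is a minimal sketch using the open-source python-paillier (`phe`) package, an additively homomorphic scheme. This illustrates the aggregation flow only, not a production design: it assumes a key holder separate from the aggregator performs decryption, and real systems quantise updates and pack many values per ciphertext to offset the overheads listed above.

```python
# Minimal sketch of additively homomorphic aggregation with python-paillier
# (pip install phe). The aggregator sums ciphertexts it cannot decrypt.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each participant encrypts one scalar of its update with the public key.
local_updates = [0.12, -0.05, 0.33]
ciphertexts = [public_key.encrypt(u) for u in local_updates]

# The aggregator adds ciphertexts without holding the decryption key.
encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = encrypted_sum + c

# Only the key holder (not the aggregator) can decrypt the aggregate.
print(private_key.decrypt(encrypted_sum))  # ~0.40
```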
Aggregation occurs within a hardware-protected enclave (Intel SGX, ARM TrustZone). Participants send updates to the TEE, which performs aggregation in isolated memory.
| Property | Value |
|---|---|
| Privacy guarantee | Hardware-based isolation — aggregator cannot inspect updates |
| Computational cost | Near-native (small overhead for enclave transitions) |
| Communication cost | Baseline (no encryption expansion for enclave-to-enclave) |
| Dropout tolerance | Excellent |
| Collusion resistance | Depends on hardware trust model; vulnerable to side-channel attacks |
Local differential privacy: each participant clips and noises its own gradient update before sending it to the aggregator, e.g. ĝᵢ = clip(gᵢ, C) + N(0, σ²C²I), so no un-noised individual update ever leaves the participant.
Central differential privacy: the trusted aggregator adds calibrated noise to the aggregated update before applying it to the global model, e.g. ḡ = (1/n)(Σᵢ clip(gᵢ, C) + N(0, σ²C²I)), which typically preserves more utility than local DP for the same privacy budget (a code sketch follows the parameter table below).
| Parameter | Description | Guidance |
|---|---|---|
| Epsilon (ε) | Privacy loss parameter; lower is more private | ε ≤ 1: strong privacy; 1 < ε ≤ 8: moderate; ε > 8: weak |
| Delta (δ) | Probability of privacy failure | δ < 1/N where N is dataset size |
| Rounds (T) | Number of federated training rounds | Privacy degrades with rounds — use composition theorems |
| Clip norm (C) | Maximum gradient norm per participant | Balance between privacy (lower C) and convergence (higher C) |
| Noise multiplier (σ) | Ratio of noise to sensitivity | Determined by ε, δ, C, and composition method |
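As a minimal, hypothetical sketch of the local-DP variant, the helper below clips a participant's update to norm C and adds Gaussian noise with standard deviation σC; choosing σ for a target (ε, δ) over T rounds is the job of a privacy accountant and is not shown here.

```python
# Minimal sketch of per-participant clipping and Gaussian noising
# (local DP variant), assuming clip norm C and noise multiplier sigma
# were chosen via a privacy accountant for a target (epsilon, delta).
import numpy as np

def privatize_update(grad: np.ndarray, clip_norm: float,
                     noise_multiplier: float,
                     rng: np.random.Generator) -> np.ndarray:
    # Clip the update to bound each participant's sensitivity at C.
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    # Add Gaussian noise with standard deviation sigma * C.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

rng = np.random.default_rng(42)
update = rng.normal(size=10)
print(privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```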
| GDPR Principle | FL Implementation | Compliance Status |
|---|---|---|
| Data minimisation (Art. 5(1)(c)) | Personal data stays local — only model updates transmitted | Strong compliance |
| Purpose limitation (Art. 5(1)(b)) | Local processing for specified training purpose | Requires per-participant purpose documentation |
| Storage limitation (Art. 5(1)(e)) | No central training data repository — data retained locally per participant's policy | Compliance depends on participant retention |
| Integrity and confidentiality (Art. 5(1)(f)) | Secure aggregation protects update confidentiality | Strong with SA + DP |
| Accuracy (Art. 5(1)(d)) | Model accuracy may differ from centralised training | Monitor and document accuracy trade-offs |
| Technique | Description | Privacy Impact |
|---|---|---|
| Gradient compression | Quantise or sparsify gradients before transmission | May interact with DP noise — careful calibration needed |
| Federated averaging (FedAvg) | Multiple local SGD steps before communication | Reduces communication rounds; may increase per-round privacy cost |
| Gradient selection | Send only top-k gradient components | Leaks which components are most significant — privacy concern |
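As the gradient-selection row above notes, sending only the top-k components shrinks payloads but leaks which coordinates are most significant. A minimal, hypothetical sketch of the mechanics:

```python
# Minimal sketch of top-k gradient sparsification before transmission
# (illustration only; revealing which coordinates are largest is itself
# a potential privacy leak, as noted in the table above).
import numpy as np

def top_k_sparsify(grad: np.ndarray, k: int):
    # Keep the k largest-magnitude components; transmit (indices, values).
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)
indices, values = top_k_sparsify(grad, k=10)  # ~100x smaller payload
```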
| Strategy | Description | Privacy Consideration |
|---|---|---|
| Random selection | Uniformly random participant sampling per round (see the sketch after this table) | Fair representation; privacy amplification through subsampling |
| Availability-based | Select participants with sufficient resources | May bias toward certain participant profiles |
| Contribution-based | Select participants whose data improves model most | Reveals information about data distribution — privacy risk |
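The privacy amplification mentioned for random selection comes from subsampling: if each client joins a round only with probability q, its data affects that round only with probability q, which tightens the effective per-round privacy loss. A minimal, hypothetical sketch of Poisson subsampling:

```python
# Minimal sketch of random participant selection per round; Poisson
# subsampling includes each client independently with probability q,
# which is the setting most privacy accountants assume for amplification.
import numpy as np

def select_participants(client_ids, q: float, rng: np.random.Generator):
    # Each client is included independently with probability q.
    mask = rng.random(len(client_ids)) < q
    return [cid for cid, included in zip(client_ids, mask) if included]

rng = np.random.default_rng(1)
clients = list(range(10_000))
round_cohort = select_participants(clients, q=0.01, rng=rng)  # ~100 clients
```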
While no enforcement action has specifically addressed federated learning, the technology is directly relevant to: