Classifies sensitive data in AI/ML training datasets, detects bias for Art. 9 categories, generates data cards, tracks provenance, verifies consent for GDPR/EU AI Act compliance.
npx claudepluginhub mukul975/privacy-data-protection-skills --plugin data-classification-skills
AI and machine learning models trained on personal data raise distinct classification challenges. Training data may contain direct personal data, inferred special categories, proxy variables for protected characteristics, and data whose consent scope does not extend to model training. The EU AI Act (Regulation (EU) 2024/1689) imposes additional requirements for high-risk AI systems, including data governance obligations under Art. 10 that intersect with GDPR classification requirements. This skill provides a framework for classifying training data, detecting bias-relevant features, documenting data provenance, and verifying consent coverage.
| GDPR Article | Application to AI Training |
|---|---|
| Art. 5(1)(b) — Purpose limitation | Training a model is a distinct processing purpose; if data was collected for customer service, using it for ML training requires a compatible purpose assessment or new lawful basis |
| Art. 5(1)(c) — Data minimisation | Training datasets must not include more personal data than necessary for the model objective |
| Art. 6 — Lawful basis | Model training requires its own lawful basis; legitimate interests (Art. 6(1)(f)) is the most common, but requires a documented legitimate interests assessment (LIA) |
| Art. 9 — Special categories | If training data contains or enables inference of special category data, Art. 9(2) condition required |
| Art. 22 — Automated decision-making | If the trained model makes decisions with legal or significant effects, additional safeguards apply |
| Art. 25 — Data protection by design | Classification of training data is a by-design measure enabling appropriate technical protections |
| Art. 35 — DPIA | High-risk AI processing (profiling, automated decision-making) requires DPIA |
The AI Act Art. 10 requires that training, validation, and testing datasets for high-risk AI systems:

- be subject to appropriate data governance and management practices, covering design choices, data collection, and data-preparation operations such as annotation, labelling, and cleaning (Art. 10(2))
- be examined for possible biases likely to affect health and safety, negatively impact fundamental rights, or lead to prohibited discrimination, with appropriate measures to detect, prevent, and mitigate them (Art. 10(2)(f) and (g))
- be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose (Art. 10(3))

Tier 1 classifies each dataset by the personal data content it carries:
| Classification | Description | Example |
|---|---|---|
| TRAINING_PII_DIRECT | Dataset contains direct identifiers | Customer names, email addresses in NLP training corpus |
| TRAINING_PII_INDIRECT | Dataset contains indirect identifiers | Customer IDs, transaction patterns enabling singling out |
| TRAINING_SPECIAL_CAT | Dataset contains Art. 9 special category data | Health records for medical diagnosis model |
| TRAINING_CRIMINAL | Dataset contains Art. 10 criminal data | Fraud transaction labels derived from criminal investigations |
| TRAINING_PSEUDONYMISED | Personal data replaced with tokens but re-identification key exists | Pseudonymised customer data with mapping held by data team |
| TRAINING_ANONYMISED | Data verified as anonymised per WP29 criteria | Aggregated population statistics with k ≥ 10 |
| TRAINING_SYNTHETIC | Artificially generated data with no real personal data | GAN-generated synthetic transaction data |
| TRAINING_NON_PERSONAL | No personal data content | Market price data, weather data, product specifications |
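
As a minimal sketch of how these Tier 1 labels might be represented in classification tooling, the `TrainingDataClass` enum and `classify_dataset` helper below are hypothetical names, assuming upstream scanners have already detected each data type:

```python
from enum import Enum

class TrainingDataClass(Enum):
    """Tier 1 content classification labels (hypothetical representation)."""
    TRAINING_PII_DIRECT = "direct identifiers present"
    TRAINING_PII_INDIRECT = "indirect identifiers enabling singling out"
    TRAINING_SPECIAL_CAT = "Art. 9 special category data present"
    TRAINING_CRIMINAL = "Art. 10 criminal offence data present"
    TRAINING_PSEUDONYMISED = "tokenised, re-identification key exists"
    TRAINING_ANONYMISED = "anonymised per WP29 criteria"
    TRAINING_SYNTHETIC = "artificially generated, no real personal data"
    TRAINING_NON_PERSONAL = "no personal data content"

def classify_dataset(has_direct_ids: bool, has_indirect_ids: bool,
                     has_special_cat: bool, has_criminal: bool,
                     is_pseudonymised: bool) -> TrainingDataClass:
    """Assign the most restrictive applicable label; ordering reflects severity."""
    if has_special_cat:
        return TrainingDataClass.TRAINING_SPECIAL_CAT
    if has_criminal:
        return TrainingDataClass.TRAINING_CRIMINAL
    if has_direct_ids:
        return TrainingDataClass.TRAINING_PII_DIRECT
    if has_indirect_ids:
        return TrainingDataClass.TRAINING_PII_INDIRECT
    if is_pseudonymised:
        return TrainingDataClass.TRAINING_PSEUDONYMISED
    return TrainingDataClass.TRAINING_NON_PERSONAL
```

Ordering matters: a dataset containing both direct identifiers and health data takes the special-category label, since Art. 9 obligations dominate.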
Even when a dataset does not directly contain Art. 9 special category data, it may contain proxy variables that correlate with protected characteristics:
| Proxy Variable | Correlated Protected Characteristic | Detection Method |
|---|---|---|
| Postcode/ZIP code | Racial/ethnic origin, socioeconomic status | Geographic demographic analysis |
| First name | Gender, ethnic origin, age cohort | Name demographics database lookup |
| Language preference | Ethnic origin, nationality | Statistical correlation analysis |
| Shopping patterns | Religious belief (halal/kosher purchases), health status | Purchase category analysis |
| Web browsing history | Political opinions, sexual orientation, health status | Topic modelling on browsing categories |
| Employment gap patterns | Gender (maternity), disability, health | Statistical pattern analysis |
| Credit score | Racial/ethnic origin (documented correlation in US/UK studies) | Disparate impact analysis |
Tier 3 classifies the consent and lawful-basis position of each training dataset:

| Classification | Description | Compliance Requirement |
|---|---|---|
| CONSENT_COVERS_TRAINING | Original consent explicitly covers AI/ML training | Document consent text and verify specificity |
| CONSENT_DOES_NOT_COVER | Original consent did not anticipate ML training | New consent required or alternative lawful basis needed |
| LEGITIMATE_INTEREST | ML training relies on legitimate interests (Art. 6(1)(f)) | Documented LIA required |
| CONTRACT_PERFORMANCE | ML training is necessary for contract performance | Narrow scope — must be genuinely necessary |
| PUBLIC_DATA | Data sourced from publicly available sources | Still requires lawful basis; public availability is not a lawful basis |
| RESEARCH_EXEMPTION | Processing under Art. 89(1) research exemption | Appropriate safeguards including pseudonymisation required |
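
A small sketch of how each Tier 3 label could be checked against the documentation the table calls for; the `REQUIRED_EVIDENCE` mapping and field names are illustrative assumptions, not a published schema:

```python
# Hypothetical Tier 3 validation: each lawful-basis label must be backed by
# the documentation the table above requires.
REQUIRED_EVIDENCE = {
    "CONSENT_COVERS_TRAINING": "consent_text_ref",      # consent text, verified specificity
    "CONSENT_DOES_NOT_COVER": "remediation_plan_ref",   # new consent or alternative basis
    "LEGITIMATE_INTEREST": "lia_ref",                   # documented LIA
    "CONTRACT_PERFORMANCE": "necessity_assessment_ref", # genuine-necessity assessment
    "PUBLIC_DATA": "lawful_basis_ref",                  # public availability is not a basis
    "RESEARCH_EXEMPTION": "safeguards_ref",             # Art. 89(1) safeguards
}

def missing_evidence(classification: str, evidence: dict) -> list[str]:
    """Return the evidence fields still needed for this Tier 3 label."""
    required = REQUIRED_EVIDENCE[classification]
    return [] if evidence.get(required) else [required]
```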
A data card is a structured document accompanying each training dataset, providing transparency about its contents, provenance, and limitations. It is modelled on the "Datasheets for Datasets" framework (Gebru et al., 2021) and adapted for GDPR compliance.
| Section | Fields |
|---|---|
| 1. Dataset Identity | Name, version, creation date, owner, purpose |
| 2. Personal Data Classification | Tier 1 classification, data elements present, classification labels |
| 3. Data Subjects | Categories of data subjects, volume, geographic scope |
| 4. Provenance | Original collection purpose, source systems, processing chain from collection to training set |
| 5. Consent/Lawful Basis | Tier 3 classification, consent text reference or LIA reference, purpose compatibility assessment |
| 6. Special Category Assessment | Whether Art. 9 data is present (direct or inferred), Art. 9(2) condition if applicable |
| 7. Bias Assessment | Proxy variables identified, disparate impact analysis results, demographic representation statistics |
| 8. De-identification | Technique applied (pseudonymisation, anonymisation, synthetic generation), assessment reference |
| 9. Retention | Training data retention period, model retention period, deletion schedule |
| 10. Access Controls | Who can access the training data, who can access the model, audit logging |
| 11. DPIA Reference | DPIA document reference if applicable |
| 12. Limitations | Known biases, geographic limitations, temporal limitations, data quality issues |
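
As an illustration, a data card skeleton can be generated from structured metadata. The `DataCard` dataclass below is a hypothetical sketch covering a subset of the twelve sections, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataCard:
    """Minimal data card skeleton following the 12-section layout above
    (field names are illustrative, not a published schema)."""
    name: str
    version: str
    owner: str
    purpose: str
    tier1_classification: str
    lawful_basis: str
    special_category_present: bool
    proxy_variables: list = field(default_factory=list)
    limitations: list = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the card as a markdown stub for review and versioning."""
        lines = [
            f"# Data Card: {self.name} (v{self.version})",
            f"- Owner: {self.owner}",
            f"- Purpose: {self.purpose}",
            f"- Tier 1 classification: {self.tier1_classification}",
            f"- Lawful basis: {self.lawful_basis}",
            f"- Art. 9 data present: {self.special_category_present}",
            f"- Proxy variables: {', '.join(self.proxy_variables) or 'none identified'}",
            f"- Known limitations: {'; '.join(self.limitations) or 'none recorded'}",
        ]
        return "\n".join(lines)
```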
For each training dataset, calculate representation statistics: the share of each demographic group in the dataset compared against a reference population, flagging groups that are materially under- or over-represented.
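
A minimal sketch, assuming group labels are available for each record and a reference distribution is supplied; the 20% relative-deviation threshold is an illustrative choice, not a regulatory figure:

```python
from collections import Counter

def representation_stats(groups: list[str], reference: dict[str, float],
                         tolerance: float = 0.2) -> dict[str, dict]:
    """Compare the dataset's demographic mix against a reference population.

    Flags groups whose observed share deviates from the reference share
    by more than `tolerance` (relative deviation).
    """
    counts = Counter(groups)
    n = len(groups)
    report = {}
    for group, expected in reference.items():
        observed = counts.get(group, 0) / n if n else 0.0
        deviation = (observed - expected) / expected if expected else float("inf")
        report[group] = {
            "observed_share": round(observed, 4),
            "expected_share": expected,
            "flagged": abs(deviation) > tolerance,
        }
    return report
```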
Scan all features for proxy correlation with Art. 9 protected characteristics, flagging any feature whose statistical association with a protected attribute exceeds a screening threshold for human review.
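
One way to implement this screen is Cramér's V between each candidate feature and each protected attribute. The sketch below assumes categorical features (continuous features would need binning first) and an illustrative threshold of 0.3; flagged features require human review, since association alone does not establish proxy use:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    r, k = table.shape
    if min(r, k) < 2:
        return 0.0  # a constant column carries no association signal
    chi2 = chi2_contingency(table)[0]
    n = table.values.sum()
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

def scan_for_proxies(df: pd.DataFrame, protected: pd.Series,
                     threshold: float = 0.3) -> dict[str, float]:
    """Return features whose association with the protected attribute
    exceeds the screening threshold; these need manual review."""
    return {col: v for col in df.columns
            if (v := cramers_v(df[col], protected)) >= threshold}
```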
For classification or scoring models, compute disparate impact metrics: compare positive-outcome (selection) rates across protected groups, using the four-fifths rule as a screening threshold for further investigation.
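
A minimal sketch of a selection-rate check; the 0.8 threshold reflects the four-fifths screening rule, and a flagged ratio warrants investigation rather than an automatic conclusion of discrimination:

```python
def disparate_impact(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Selection-rate ratio per group relative to the most-favoured group.

    `outcomes` maps group -> (positive_outcomes, total). Under the
    four-fifths screening rule, a ratio below 0.8 warrants investigation.
    """
    rates = {g: pos / total for g, (pos, total) in outcomes.items() if total}
    best = max(rates.values(), default=0.0)
    if best == 0.0:
        return {}
    return {g: rate / best for g, rate in rates.items()}
```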
If bias detection requires processing special category data: AI Act Art. 10(5) exceptionally permits providers of high-risk AI systems to process special categories of personal data where strictly necessary for bias detection and correction, subject to appropriate safeguards, including technical limitations on re-use, state-of-the-art security and privacy-preserving measures such as pseudonymisation, no transmission to third parties, and deletion once the bias has been corrected. Document the Art. 9(2) condition relied upon and record why less intrusive data, such as synthetic or anonymised data, was insufficient.