From data-classification-skills
Guides automated PII discovery and classification using Microsoft Purview, BigID, OneTrust DataDiscovery, and AWS Macie. Covers scanning configuration, accuracy tuning, false-positive handling, and integrations.

npx claudepluginhub mukul975/privacy-data-protection-skills --plugin data-classification-skills

This skill uses the workspace's default tool permissions.
Automated data discovery tools scan structured and unstructured data repositories to identify, classify, and catalogue personal data across the enterprise. Manual data inventories cannot keep pace with the volume, velocity, and variety of modern data processing. Automated discovery provides continuous visibility into where personal data resides, how it flows, and whether it is classified and protected according to policy. This skill covers implementation patterns for four leading platforms — Microsoft Purview, BigID, OneTrust DataDiscovery, and AWS Macie — with focus on scanning configuration, accuracy optimisation, and integration with privacy compliance workflows.
| Capability | Microsoft Purview | BigID | OneTrust DataDiscovery | AWS Macie |
|---|---|---|---|---|
| Structured data scanning | SQL Server, Azure SQL, Synapse, Cosmos DB, Oracle, PostgreSQL, MySQL, Teradata | 100+ connectors including all major RDBMS, NoSQL, data warehouses | 200+ connectors, pre-built integrations with SaaS applications | S3, DynamoDB, RDS (via Lambda) |
| Unstructured data scanning | SharePoint, OneDrive, Exchange, Azure Blob, Azure Files, AWS S3, GCP Storage | File shares, email, SharePoint, cloud storage, Slack, Teams, Confluence | File shares, email, cloud storage, collaboration platforms | S3 buckets (primary focus) |
| Classification method | 300+ built-in sensitive information types (SITs), trainable classifiers, exact data match (EDM), custom regex | ML-based NER, correlation analysis, pattern matching, custom classifiers | Pattern matching, NER, contextual analysis, custom rules | ML-based pattern matching, custom data identifiers, managed data identifiers |
| GDPR-specific classifiers | EU national ID formats, EU passport numbers, EU debit/credit card numbers, EU tax ID numbers per Member State | GDPR personal data taxonomy, Art. 9 special category detection, cross-regulation mapping | Pre-built GDPR data subject types, purpose mapping, lawful basis tagging | EU personal data identifiers (limited — primarily financial and identity patterns) |
| Accuracy tuning | Confidence levels (low/medium/high), custom keyword dictionaries, EDM for exact matching, document fingerprinting | ML model retraining, feedback loop, confidence thresholds, correlation rules | Confidence scoring, validation rules, exception management | Custom data identifiers with regex and keyword proximity, severity scoring |
| Deployment model | SaaS (Microsoft 365/Azure), hybrid with Purview governance | SaaS, on-premises, hybrid | SaaS, on-premises agent | AWS-native SaaS |
| Pricing model | Per information protection unit (Azure), per Microsoft 365 licence tier (E5 includes advanced) | Per data source connector, per TB scanned | Per data source module, per connector | Per S3 bucket evaluated, per GB scanned |
Data Sources                 Microsoft Purview
┌─────────────┐              ┌────────────────────────────┐
│ Azure SQL   │──scanner────►│ Data Map (metadata store)  │
│ SharePoint  │──scanner────►│ Data Catalog (search/tag)  │
│ AWS S3      │──scanner────►│ Data Estate Insights       │
│ On-prem SQL │─self-hosted─►│ Information Protection     │
│ Power BI    │──scanner────►│ Data Loss Prevention (DLP) │
└─────────────┘              └────────────────────────────┘
Step 1: Register Data Sources
Step 2: Configure Scanning Rules
Example custom patterns: the UK National Insurance number format `[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]`, plus organisation-specific identifiers such as `VFS-\d{10}`, `EMP-[A-Z]{2}\d{6}`, and `PF-\d{8}-[A-Z]{2}`.

Step 3: Set Scanning Schedule
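Custom regex classifiers like the ones in this guide (the NINO-style pattern plus the VFS-/EMP-/PF- identifiers) are easy to get wrong, so it is worth smoke-testing them offline before loading them into a scan rule set. A minimal sketch; the sample values are invented:

```python
import re

# Custom patterns from this guide's Step 2; anchored via fullmatch below.
PATTERNS = {
    "uk_nino": r"[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]",
    "vfs_ref": r"VFS-\d{10}",
    "employee_id": r"EMP-[A-Z]{2}\d{6}",
    "pf_ref": r"PF-\d{8}-[A-Z]{2}",
}

def classify(value: str) -> list[str]:
    """Return the names of all patterns matching the whole value."""
    return [name for name, rx in PATTERNS.items() if re.fullmatch(rx, value)]

# Invented sample values for a smoke test.
print(classify("AB123456C"))      # NINO-shaped
print(classify("EMP-GB123456"))   # employee-ID-shaped
print(classify("tel 0207 9460"))  # should match nothing
```

Running checks like this against a handful of known-good and known-bad values catches over-broad patterns before they flood the catalogue with false positives.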
Step 4: Configure Sensitivity Labels
Public → Purview label: Public
Internal → Purview label: General
Confidential → Purview label: Confidential
Restricted → Purview label: Highly Confidential (auto-applied to Art. 9/Art. 10 data)

Step 5: DLP Policy Integration
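The classification-to-label mapping is easiest to maintain as data rather than buried in policy rules. A hedged sketch: the mapping comes from this guide, but the function and the special-category escalation logic are illustrative:

```python
# Internal classification tier → Purview sensitivity label (per this guide).
LABEL_MAP = {
    "Public": "Public",
    "Internal": "General",
    "Confidential": "Confidential",
    "Restricted": "Highly Confidential",  # auto-applied to Art. 9/Art. 10 data
}

# Illustrative set of detections that force the Restricted tier.
SPECIAL_CATEGORY_SITS = {"health", "genetic", "biometric", "criminal_conviction"}

def purview_label(tier: str, detected_sits: set[str]) -> str:
    """Pick the Purview label, escalating when special-category data is found."""
    if detected_sits & SPECIAL_CATEGORY_SITS:
        tier = "Restricted"
    return LABEL_MAP[tier]

print(purview_label("Internal", set()))       # → General
print(purview_label("Internal", {"health"}))  # → Highly Confidential
```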
| Issue | Tuning Approach |
|---|---|
| False positive: UK phone numbers flagged as National Insurance numbers | Increase minimum confidence to HIGH for NINO SIT; add negative keyword list ("phone", "tel", "fax", "mobile") |
| False positive: Internal reference numbers flagged as account numbers | Create EDM schema for actual customer accounts; custom SIT with proximity to customer-related keywords |
| False negative: Health data in free-text email bodies | Enable trainable classifier for health content; train on sample of 50+ positive examples from occupational health correspondence |
| False negative: Genetic identifiers in research datasets | Create custom SIT for rs-number pattern (rs\d{4,12}), ICD-10 codes, and HUGO gene names |
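The genetic-identifier row above translates directly into testable regexes. A sketch of the rs-number pattern and a deliberately simplified ICD-10 shape (the full ICD-10 code system and HUGO gene-name matching would need a dictionary, omitted here):

```python
import re

# dbSNP reference SNP identifiers, e.g. rs123456 (guide specifies 4-12 digits).
RS_NUMBER = re.compile(r"\brs\d{4,12}\b")
# Simplified ICD-10 shape: letter, two digits, optional dotted extension.
ICD10 = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b")

def find_clinical_identifiers(text: str) -> dict[str, list[str]]:
    """Pull rs-numbers and ICD-10-shaped codes out of free text."""
    return {
        "rs_numbers": RS_NUMBER.findall(text),
        "icd10_codes": ICD10.findall(text),
    }

sample = "Variant rs334 vs rs123456; diagnosis E11.9 recorded."
print(find_clinical_identifiers(sample))
# Note: rs334 is skipped by the 4-digit minimum — a reminder that length
# bounds in custom SITs trade recall for precision.
```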
BigID uses a distributed scanning architecture with correlation-based discovery:
Data Sources              BigID Platform
┌─────────────┐           ┌────────────────────────────┐
│ Databases   │──scan────►│ Discovery Engine           │
│ File Shares │──scan────►│ Correlation Engine (ML)    │
│ Cloud       │──scan────►│ Classification Engine      │
│ SaaS Apps   │──API─────►│ Catalog & Inventory        │
│ Email       │──scan────►│ Privacy Rights Automation  │
└─────────────┘           └────────────────────────────┘
BigID's ML-based correlation engine identifies personal data by correlating data elements across sources to build identity profiles. This approach detects personal data that pattern matching alone would miss — for example, a customer ID in one system linked to a name in another.
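BigID's correlation models are proprietary, but the idea can be illustrated with a toy join: a bare `customer_id` column in one system becomes personal data once it links to a name held elsewhere. All record and field names below are invented:

```python
# Invented sample records from two systems.
crm = [
    {"customer_id": "C-1001", "name": "Jane Doe", "email": "jane@example.com"},
    {"customer_id": "C-1002", "name": "Ade Obi", "email": "ade@example.com"},
]
billing = [  # no direct identifiers, only the shared key
    {"customer_id": "C-1001", "last_invoice": 120.50},
    {"customer_id": "C-9999", "last_invoice": 75.00},
]

def correlate(identified, unidentified, key="customer_id"):
    """Flag rows in `unidentified` whose key links to a known identity."""
    identities = {row[key]: row["name"] for row in identified}
    return [
        {**row, "linked_identity": identities[row[key]]}
        for row in unidentified
        if row[key] in identities
    ]

print(correlate(crm, billing))
# Only C-1001's billing row is flagged: it is personal data by linkage,
# which pure pattern matching on the billing table would never detect.
```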
OneTrust integrates discovery with its broader privacy management platform:
Data Sources            OneTrust Platform
┌─────────────┐         ┌────────────────────────────┐
│ Cloud/SaaS  │──API───►│ DataDiscovery Module       │
│ Databases   │──agent─►│ Data Mapping (Art. 30)     │
│ File Shares │──agent─►│ Assessment Automation      │
│ Endpoints   │──agent─►│ Consent Management         │
└─────────────┘         │ DSAR Automation            │
                        └────────────────────────────┘
OneTrust's value proposition is tight integration between discovery results and privacy program management — discovered personal data feeds directly into Art. 30 records, DPIA assessments, and DSAR fulfilment workflows.
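The discovery-to-Art. 30 handoff can be pictured as a simple aggregation: group findings by processing system and collect the personal-data categories seen in each. This is an illustrative shape, not the OneTrust API; the findings and the keyword-based Art. 9 check are invented:

```python
from collections import defaultdict

# Invented discovery findings: (system, data category) pairs.
findings = [
    ("HR-System", "employee name"),
    ("HR-System", "health data"),
    ("CRM", "email address"),
    ("CRM", "employee name"),
]

def build_art30_entries(findings):
    """Aggregate findings into per-system Art. 30 record skeletons."""
    by_system = defaultdict(set)
    for system, category in findings:
        by_system[system].add(category)
    return [
        {
            "system": system,
            "data_categories": sorted(cats),
            # Naive keyword test standing in for real Art. 9 classification.
            "contains_special_category": any("health" in c for c in cats),
        }
        for system, cats in sorted(by_system.items())
    ]

for entry in build_art30_entries(findings):
    print(entry)
```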
Macie is purpose-built for S3 data discovery within AWS:
AWS Environment
┌─────────────────────────────────────┐
│  S3 Buckets ──scan──►  Macie        │
│                          │          │
│  EventBridge  ◄──alerts──┤          │
│  Security Hub ◄─findings─┤          │
│  CloudWatch   ◄──metrics─┘          │
└─────────────────────────────────────┘
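Macie work is typically driven through the `macie2` API. The sketch below only builds the request payloads; bucket name, account ID, and the identifier are invented, and the actual boto3 calls (`create_custom_data_identifier`, `create_classification_job` on a `macie2` client) are noted in comments so the example stays offline:

```python
# Custom data identifier payload: Macie custom identifiers pair a regex with
# nearby keywords within a maximum match distance.
custom_identifier = {
    "name": "employee-id",                  # invented name
    "regex": r"EMP-[A-Z]{2}\d{6}",
    "keywords": ["employee", "staff", "HR"],
    "maximumMatchDistance": 50,             # chars between keyword and match
}

# One-time sensitive-data discovery job scoped to selected buckets.
classification_job = {
    "name": "pre-dpia-hr-scan",             # invented name
    "jobType": "ONE_TIME",
    "s3JobDefinition": {
        "bucketDefinitions": [
            # Invented account ID and bucket.
            {"accountId": "111122223333", "buckets": ["hr-documents"]}
        ]
    },
    # IDs returned by create_custom_data_identifier would be listed here.
    "customDataIdentifierIds": [],
}

# With credentials configured, these would be passed as keyword arguments to
# boto3.client("macie2").create_custom_data_identifier(**custom_identifier)
# and .create_classification_job(**classification_job).
print(sorted(classification_job))
```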
| Scan Type | Frequency | Duration Window | Trigger |
|---|---|---|---|
| Full discovery scan | Monthly | Weekend maintenance window (8-12 hours) | Scheduled |
| Incremental scan | Weekly | Off-peak hours (2-4 hours) | Scheduled |
| New source onboarding scan | On registration | Within 48 hours of source registration | Event-driven |
| Post-incident scan | As needed | Immediate (targeted scope) | Incident response |
| Pre-DPIA scan | Before DPIA commencement | 1-2 weeks before DPIA start | Project-triggered |
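The cadence table above is effectively a dispatch rule from trigger to scan profile; a minimal sketch, where the profile strings mirror the table and the event shape is invented:

```python
def pick_scan(event: dict) -> str:
    """Return the scan profile for an incoming trigger event."""
    if event.get("type") == "incident":
        return "post-incident scan (targeted scope, immediate)"
    if event.get("type") == "source_registered":
        return "onboarding scan (within 48 hours)"
    if event.get("type") == "dpia_scheduled":
        return "pre-DPIA scan (1-2 weeks ahead)"
    if event.get("schedule") == "monthly":
        return "full discovery scan (weekend maintenance window)"
    if event.get("schedule") == "weekly":
        return "incremental scan (off-peak hours)"
    raise ValueError(f"no scan rule for event: {event}")

print(pick_scan({"type": "incident"}))
print(pick_scan({"schedule": "weekly"}))
```

Keeping the event-driven triggers ahead of the scheduled ones matters: an incident-response scan should never wait for the next maintenance window.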
| Metric | Target | Measurement Method |
|---|---|---|
| Precision (positive predictive value) | > 90% | Sample 100 classified items monthly; verify classification accuracy |
| Recall (detection rate) | > 85% | Plant known PII test data in scan scope; measure detection rate |
| False positive rate | < 10% | Count items classified as personal data that are not |
| False negative rate | < 15% | Count personal data items missed by the scanner |
| Classification consistency | > 95% | Same data element classified consistently across repeat scans |
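The four rates above come straight from a manually verified sample, so a small helper makes the monthly measurement repeatable. Note that the table defines false positive rate over classified items (the complement of precision), not the classical FP/(FP+TN); the counts below are invented:

```python
def accuracy_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision, recall, and error rates from a verified sample."""
    return {
        "precision": tp / (tp + fp),            # of items flagged, share truly PII
        "recall": tp / (tp + fn),               # of real PII, share detected
        "false_positive_rate": fp / (tp + fp),  # complement of precision, per table
        "false_negative_rate": fn / (tp + fn),  # complement of recall
    }

# Invented sample: 100 verified classifications plus 10 planted items missed.
m = accuracy_metrics(tp=92, fp=8, fn=10)
print({k: round(v, 2) for k, v in m.items()})
# precision 0.92 meets the >90% target; recall ~0.90 meets the >85% target.
```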
Month 1: Baseline scan → establish initial accuracy metrics
Month 2: Review false positives/negatives → tune rules and thresholds
Month 3: Re-scan → measure improvement
Month 4: Expand scope (new data sources) → re-baseline
Month 5: Review edge cases → create custom classifiers
Month 6: Accuracy audit by DPO → formal accuracy report
[Repeat cycle]