From data-classification-skills
Detects PII in unstructured data like emails, documents, images, and logs using spaCy NER, Microsoft Presidio, regex patterns, OCR, and confidence scoring.
npx claudepluginhub mukul975/privacy-data-protection-skills --plugin data-classification-skills
This skill uses the workspace's default tool permissions.
Unstructured data — emails, documents, images, chat logs, call transcripts, and system logs — accounts for an estimated 80% of enterprise data and presents the greatest challenge for privacy compliance. Unlike structured databases where personal data resides in known columns, unstructured data contains PII embedded in free text, attached files, scanned images, and metadata. This skill covers detection approaches using Named Entity Recognition (NER), pattern matching, OCR, and hybrid pipelines, with focus on Microsoft Presidio and spaCy as implementation frameworks.
| Source | Volume | PII Risk | Detection Challenge |
|---|---|---|---|
| Email (Exchange Online) | 2.1M messages/month | HIGH — names, account numbers, financial data in body and attachments | Mixed text and attachments; forwarded chains contain accumulated PII |
| SharePoint documents | 4.2TB across 1,200 sites | HIGH — contracts, KYC docs, customer correspondence | Multiple formats (docx, pdf, xlsx); embedded images |
| Teams chat | 890K messages/month | MEDIUM — casual references to customers, internal discussions | Short messages, abbreviations, context-dependent PII |
| Application logs | 50GB/day | MEDIUM — IP addresses, user IDs, error messages with PII | High volume, mixed with non-PII technical data |
| Scanned documents | 45K pages/month | HIGH — passport scans, signed contracts, medical certificates | Requires OCR; variable image quality |
| Call transcripts | 8K transcripts/month | HIGH — customers state names, account numbers, personal details | Speech-to-text errors, colloquial language |
| PDF reports | 12K documents/month | MEDIUM — financial reports may contain customer lists | Embedded tables, charts with PII labels |
Presidio is an open-source PII detection and anonymisation SDK developed by Microsoft, designed for integration with enterprise data pipelines.
Input Text/Document
│
▼
┌──────────────────┐
│ Pre-processing │ Format conversion, encoding normalisation,
│ (text extract) │ OCR for images/scanned PDFs
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Presidio │ Multiple recognisers run in parallel:
│ Analyzer │ - NER model (spaCy/transformers)
│ │ - Pattern recognisers (regex)
│ │ - Custom recognisers (org-specific)
│ │ - Context-aware enhancers
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Confidence │ Each detection assigned confidence score
│ Scoring & │ Threshold filtering applied
│ Filtering │ Context enhancement boosts/reduces scores
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Results │ PII locations, types, confidence scores
│ (structured) │ Ready for classification, redaction, or alerting
└──────────────────┘
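The flow in the diagram can be sketched with the standard library alone. This is not Presidio's API: `Finding`, `regex_recogniser`, and `analyse` are hypothetical names, the scores are illustrative, and the overlap rule (keep the highest-scoring span) is an assumption about how competing recognisers might be reconciled.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    score: float  # 0.0 to 1.0


Recogniser = Callable[[str], List[Finding]]


def regex_recogniser(entity_type: str, pattern: str, score: float) -> Recogniser:
    """Build a recogniser that reports every regex match with a fixed score."""
    compiled = re.compile(pattern)

    def recognise(text: str) -> List[Finding]:
        return [Finding(entity_type, m.start(), m.end(), score)
                for m in compiled.finditer(text)]

    return recognise


def analyse(text: str, recognisers: List[Recogniser], threshold: float = 0.5) -> List[Finding]:
    """Run all recognisers, drop low scores, and resolve overlapping spans."""
    candidates = [f for r in recognisers for f in r(text) if f.score >= threshold]
    kept: List[Finding] = []
    # Assumption: when spans overlap, the higher-scoring finding wins.
    for f in sorted(candidates, key=lambda f: (-f.score, f.start)):
        if all(f.end <= k.start or f.start >= k.end for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)


RECOGNISERS = [
    regex_recogniser("EMAIL_ADDRESS", r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", 0.9),
    regex_recogniser("IP_ADDRESS", r"\b(?:\d{1,3}\.){3}\d{1,3}\b", 0.6),
    regex_recogniser("DOMAIN", r"\b[a-z0-9-]+\.(?:com|org)\b", 0.55),  # deliberately broad
]
```

Calling `analyse(text, RECOGNISERS)` returns findings in document order; the broad `DOMAIN` recogniser loses to `EMAIL_ADDRESS` wherever the two overlap.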
NER Model (spaCy/Transformers):
- `en_core_web_trf` model (transformer-based) for English NER
Pattern Recognisers (Regex):
- `[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]`
- `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- `(?:0|\+44)\d{10,11}`
- `[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}`
- `\b(?:\d{4}[-\s]?){3}\d{4}\b` (with Luhn validation)
- `\b(?:\d{1,3}\.){3}\d{1,3}\b`
- `\b\d{2}[/-]\d{2}[/-]\d{4}\b`
- `VFS-\d{10}`
- `[A-Z]\d{2}\.\d{1,4}`
Context-Aware Enhancement:
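A pattern recogniser with checksum validation and a context boost might look roughly like the sketch below. It reuses the 16-digit card pattern and the Luhn check mentioned in this skill; the context word list, base score, boost size, and 40-character window are illustrative assumptions, and `find_cards` is a hypothetical helper rather than a Presidio function.

```python
import re

CARD_PATTERN = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")
# Assumption: illustrative context words; a real deployment would tune these.
CONTEXT_WORDS = ("card", "visa", "mastercard", "payment")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def find_cards(text: str, base_score: float = 0.6, boost: float = 0.3, window: int = 40):
    """Return (value, start, end, score) for Luhn-valid card-like numbers.

    The score is raised by `boost` when a context word appears within
    `window` characters of the match.
    """
    results = []
    for m in CARD_PATTERN.finditer(text):
        if not luhn_valid(m.group()):
            continue  # checksum failure: treat as a false positive
        nearby = text[max(0, m.start() - window): m.end() + window].lower()
        has_context = any(word in nearby for word in CONTEXT_WORDS)
        score = min(base_score + (boost if has_context else 0.0), 1.0)
        results.append((m.group(), m.start(), m.end(), round(score, 2)))
    return results
```

The checksum removes most random 16-digit strings (order references, timestamps), while the context boost separates a number quoted next to "card" from one appearing in isolation.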
Scanned Document / Image
│
▼
┌──────────────────┐
│ Pre-processing │ Deskew, denoise, contrast enhancement,
│ (image) │ resolution upscaling (if < 300 DPI)
└──────┬───────────┘
│
▼
┌──────────────────┐
│ OCR Engine │ Tesseract OCR (open-source) or
│ │ Azure AI Document Intelligence (cloud)
│ │ Output: extracted text with bounding boxes
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Presidio │ Standard NER + pattern detection
│ Analyzer │ on OCR-extracted text
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Confidence │ Adjust for OCR quality:
│ Adjustment │ OCR confidence < 80% → reduce PII confidence by 20%
│ │ OCR confidence > 95% → no adjustment
└──────────────────┘
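The adjustment step could be expressed as a small function. The two rules from the diagram are implemented directly; the source does not say what happens between 80% and 95%, so the linear taper in that band is an assumption, as is reading "reduce by 20%" as a multiplicative reduction rather than 20 percentage points.

```python
def adjust_for_ocr(pii_confidence: float, ocr_confidence: float) -> float:
    """Scale a PII confidence score (0-100) by OCR quality (0-100)."""
    if ocr_confidence < 80:
        penalty = 0.20  # from the pipeline: reduce PII confidence by 20%
    elif ocr_confidence > 95:
        penalty = 0.0   # from the pipeline: no adjustment
    else:
        # Assumption: taper the penalty linearly from 20% at 80 to 0% at 95.
        penalty = 0.20 * (95 - ocr_confidence) / 15
    return pii_confidence * (1 - penalty)
```

For example, a detection scored at 90 on a page with 70% OCR confidence drops to 72, which moves it from the auto-classify band into the human-review band.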
| Document Type | OCR Strategy | Expected PII |
|---|---|---|
| Passport scan | Azure AI Document Intelligence (ID document model) | Full name, DOB, nationality, passport number, photo (biometric) |
| Utility bill | General OCR + address pattern recognition | Full name, address, account number |
| Medical certificate | General OCR + health NER model | Name, diagnosis, doctor name, dates |
| Signed contract | General OCR + contract template matching | Names, addresses, financial terms, signatures |
| Cheque image | Banking-specific OCR model | Name, account number, sort code, amount |
| Component | Weight | Description |
|---|---|---|
| Pattern match confidence | 40% | Regex pattern specificity and validation (e.g., Luhn check for credit cards) |
| NER model confidence | 30% | Model probability score for entity classification |
| Context enhancement | 20% | Keyword proximity, section header, document type |
| Source quality | 10% | OCR quality score, document resolution, text extraction confidence |
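Combining the four components is a weighted sum. A minimal sketch, assuming each component is already expressed on a 0-100 scale; the key names and the `composite_confidence` helper are hypothetical.

```python
WEIGHTS = {
    "pattern": 0.40,         # pattern match confidence
    "ner": 0.30,             # NER model confidence
    "context": 0.20,         # context enhancement
    "source_quality": 0.10,  # OCR / extraction quality
}


def composite_confidence(components: dict) -> float:
    """Weighted sum of component scores, each on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

With `pattern=100`, `ner=80`, `context=50`, and `source_quality=90`, the composite is 40 + 24 + 10 + 9 = 83.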
| Confidence Level | Score Range | Action |
|---|---|---|
| HIGH | 85-100% | Auto-classify and auto-label; include in discovery report |
| MEDIUM | 70-84% | Queue for human review; include in discovery report as pending |
| LOW | 50-69% | Log for audit; do not auto-classify; available for bulk review |
| BELOW THRESHOLD | < 50% | Suppress; do not report unless specifically queried |
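The threshold table maps directly onto a lookup. In this sketch, `confidence_tier` is a hypothetical helper, and fractional scores such as 84.5 are assigned to the lower band, which the table does not specify.

```python
ACTIONS = {
    "HIGH": "auto-classify and auto-label; include in discovery report",
    "MEDIUM": "queue for human review; report as pending",
    "LOW": "log for audit; do not auto-classify",
    "BELOW THRESHOLD": "suppress unless specifically queried",
}


def confidence_tier(score: float) -> str:
    """Map a 0-100 confidence score to its tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        return "HIGH"
    if score >= 70:
        return "MEDIUM"
    if score >= 50:
        return "LOW"
    return "BELOW THRESHOLD"
```

`ACTIONS[confidence_tier(83)]` returns the human-review action, since 83 falls in the MEDIUM band.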
# Conceptual pipeline for Exchange Online email scanning
# 1. Microsoft Graph API retrieves email messages
# 2. Extract body text (HTML → plain text conversion)
# 3. Extract attachment text (document parsing)
# 4. Run Presidio analyzer on combined text
# 5. Map findings to email metadata (sender, recipients, date)
# 6. Apply classification labels via Microsoft Purview
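Steps 2 and 5 can be sketched without any network calls. The message dictionary below is a hypothetical shape, not the Microsoft Graph schema, and a single email-address regex stands in for the full analyser; retrieval (step 1), attachment parsing (step 3), and Purview labelling (step 6) are out of scope here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html: str) -> str:
    """Step 2: HTML body to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def scan_message(message: dict) -> list:
    """Step 5: attach each finding to the metadata of its source message."""
    text = html_to_text(message["body_html"])
    return [
        {
            "entity_type": "EMAIL_ADDRESS",
            "value": m.group(),
            "sender": message["sender"],
            "received": message["received"],
        }
        for m in EMAIL_PATTERN.finditer(text)
    ]
```

Keeping sender and date on every finding lets later stages report PII per mailbox or time window without re-reading the message.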
For application logs, specific patterns dominate:
Log scanning requires higher false-positive tolerance and volume-optimised processing.
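A streaming scanner keeps memory flat at high volume, and a cheap validation step trims the most obvious false positives. The sketch below handles IP addresses only; `scan_log_lines` is a hypothetical helper, and the octet check via `ipaddress` is one possible filter, not a documented part of this pipeline.

```python
import ipaddress
import re
from typing import Iterable, Iterator, Tuple

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def valid_ipv4(candidate: str) -> bool:
    """Reject dotted quads whose octets are out of range (e.g. 999.1.1.1)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def scan_log_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, entity_type, value) one line at a time."""
    for line_number, line in enumerate(lines, start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            if valid_ipv4(match.group()):
                yield line_number, "IP_ADDRESS", match.group()
```

Because the function accepts any iterable, an open file handle can be passed directly. Four-part version strings such as `1.2.3.4` still pass this check, which is one reason log scanning has to tolerate more false positives.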
Teams/Slack messages present unique challenges:
Strategy: scan message threads rather than individual messages to capture context.
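Thread-level scanning can be illustrated with a bare eight-digit number that only becomes meaningful next to a question about an account. In this sketch the message fields (`thread_id`, `text`), the context words, and the eight-digit account format are all assumptions for illustration.

```python
import re
from collections import defaultdict

ACCOUNT_CANDIDATE = re.compile(r"\b\d{8}\b")
# Assumption: illustrative context terms for account numbers.
ACCOUNT_CONTEXT = re.compile(r"\b(?:account|acct|a/c)\b", re.IGNORECASE)


def scan_threads(messages: list) -> dict:
    """Group messages by thread, then flag 8-digit numbers only when the
    thread as a whole mentions an account."""
    threads = defaultdict(list)
    for message in messages:  # assumed to be in chronological order
        threads[message["thread_id"]].append(message["text"])

    findings = {}
    for thread_id, texts in threads.items():
        joined = "\n".join(texts)
        if ACCOUNT_CONTEXT.search(joined):
            hits = ACCOUNT_CANDIDATE.findall(joined)
            if hits:
                findings[thread_id] = hits
    return findings
```

Scanned one message at a time, a reply consisting only of `40012345` carries no context; joined with the preceding question, it is flagged, while the same digit pattern in an unrelated thread is not.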
| Source | Precision Target | Recall Target | Key Challenges |
|---|---|---|---|
| Email body text | > 92% | > 88% | Forwarded chains, signatures, disclaimers |
| SharePoint documents (Office formats) | > 90% | > 85% | Embedded tables, headers/footers |
| Scanned documents (OCR) | > 85% | > 80% | OCR errors, handwriting, poor image quality |
| Application logs | > 88% | > 82% | IP address over-detection, reference number ambiguity |
| Chat messages | > 80% | > 75% | Short context, informal language, abbreviations |
| Call transcripts | > 82% | > 78% | Speech-to-text errors, overlapping speech, accents |
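Checking a detector against these targets requires a labelled sample. A minimal sketch, assuming findings are compared as exact-match tuples such as `(doc_id, start, end, entity_type)`; the dictionary keys and helper names are hypothetical, and partial-overlap matching is not handled.

```python
# (precision target, recall target) per source, taken from the table above
TARGETS = {
    "email_body": (0.92, 0.88),
    "sharepoint_documents": (0.90, 0.85),
    "scanned_documents": (0.85, 0.80),
    "application_logs": (0.88, 0.82),
    "chat_messages": (0.80, 0.75),
    "call_transcripts": (0.82, 0.78),
}


def precision_recall(predicted: set, expected: set) -> tuple:
    """Exact-match precision and recall over sets of findings."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def meets_target(source: str, predicted: set, expected: set) -> bool:
    """True when both metrics strictly exceed the source's targets."""
    precision, recall = precision_recall(predicted, expected)
    precision_target, recall_target = TARGETS[source]
    return precision > precision_target and recall > recall_target
```

A detector that finds 9 of 10 labelled entities with one spurious hit scores 0.90 on both metrics, which clears the chat target but not the email one.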