From cybersecurity-skills
Implements Cloud DLP using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data in cloud storage, databases, and pipelines.
npx claudepluginhub mukul975/anthropic-cybersecurity-skills --plugin cybersecurity-skills
This skill uses the workspace's default tool permissions.
- When compliance frameworks (GDPR, HIPAA, PCI DSS) require automated sensitive data discovery and protection
Do not use for endpoint DLP (use Microsoft Purview or Symantec DLP agents), for email DLP (use Microsoft 365 DLP or Google Workspace DLP), or for network-level data exfiltration prevention (use VPC endpoint policies and network firewalls).
Enable the required APIs first (on GCP: gcloud services enable dlp.googleapis.com). Then enable Macie and configure automated sensitive data discovery jobs for S3 buckets.
# Enable Amazon Macie
aws macie2 enable-macie
# List all S3 buckets Macie can scan
aws macie2 describe-buckets \
  --query 'buckets[*].[bucketName,classifiableSizeInBytes,unclassifiableObjectCount.total]' \
  --output table
# Create a classification job for specific buckets
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "weekly-pii-scan" \
  --schedule-frequency '{"weeklySchedule":{"dayOfWeek":"MONDAY"}}' \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "ACCOUNT_ID",
      "buckets": ["customer-data-bucket", "analytics-data-lake", "backup-bucket"]
    }],
    "scoping": {
      "includes": {
        "and": [{
          "simpleScopeTerm": {
            "key": "OBJECT_EXTENSION",
            "values": ["csv", "json", "parquet", "txt", "xlsx"],
            "comparator": "EQ"
          }
        }]
      }
    }
  }' \
  --managed-data-identifier-selector INCLUDE \
  --managed-data-identifier-ids '["USA_SOCIAL_SECURITY_NUMBER","CREDIT_CARD_NUMBER","EMAIL_ADDRESS","AWS_CREDENTIALS","PHONE_NUMBER"]'
# Create custom data identifier for internal employee IDs
aws macie2 create-custom-data-identifier \
  --name "EmployeeID" \
  --regex "EMP-[0-9]{6}" \
  --description "Internal employee ID format"
# Check job status and results
aws macie2 list-classification-jobs \
  --query 'items[*].[name,jobStatus,statistics.approximateNumberOfObjectsToProcess]' \
  --output table
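Before registering a custom data identifier, it helps to sanity-check the regex locally against representative strings; a minimal sketch (the sample text is hypothetical):

```python
# Local check of the EmployeeID pattern before registering it with Macie
import re

EMPLOYEE_ID_PATTERN = re.compile(r"EMP-[0-9]{6}")

def find_employee_ids(text):
    """Return all substrings matching the internal employee ID format."""
    return EMPLOYEE_ID_PATTERN.findall(text)

sample = "Ticket opened by EMP-004213; approver EMP-99821 (malformed) and EMP-998214."
print(find_employee_ids(sample))  # → ['EMP-004213', 'EMP-998214']
```

Note the pattern has no boundary anchors, so a seven-digit run like EMP-0042139 would still yield a six-digit match; add \b or keyword proximity rules if that causes false positives.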
Use Google Cloud DLP to inspect and de-identify sensitive data across GCP resources.
# Inspect text content for sensitive data via the DLP REST API
# (Cloud Storage and BigQuery are scanned with inspection jobs; see below)
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/content:inspect" \
  -d '{
    "item": {"value": "Contact: 555-0100, jane@example.com"},
    "inspectConfig": {
      "infoTypes": [
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"}
      ],
      "minLikelihood": "LIKELY"
    }
  }'
# Create an inspection job for BigQuery
cat > dlp-job.json << 'EOF'
{
  "inspectJob": {
    "storageConfig": {
      "bigQueryOptions": {
        "tableReference": {
          "projectId": "PROJECT_ID",
          "datasetId": "customer_data",
          "tableId": "transactions"
        },
        "sampleMethod": "RANDOM_START",
        "rowsLimit": 10000
      }
    },
    "inspectConfig": {
      "infoTypes": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "PERSON_NAME"}
      ],
      "minLikelihood": "LIKELY",
      "limits": {"maxFindingsPerRequest": 1000}
    },
    "actions": [{
      "saveFindings": {
        "outputConfig": {
          "table": {
            "projectId": "PROJECT_ID",
            "datasetId": "dlp_results",
            "tableId": "findings"
          }
        }
      }
    }]
  }
}
EOF
# Submit the inspection job via the DLP REST API
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs" \
  -d @dlp-job.json
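The same job configuration can be assembled programmatically, which keeps table names and info types in one place when many tables are scanned. A sketch of a plain config builder (snake_case field names as accepted as dicts by the Python client library are an assumption; verify against your client version):

```python
# Sketch: build the BigQuery inspect-job request as a Python dict,
# mirroring the JSON example above.
def build_bigquery_inspect_job(project_id, dataset_id, table_id,
                               info_types, rows_limit=10000):
    """Assemble an inspect job config that samples a BigQuery table."""
    return {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project_id,
                    "dataset_id": dataset_id,
                    "table_id": table_id,
                },
                "sample_method": "RANDOM_START",
                "rows_limit": rows_limit,
            }
        },
        "inspect_config": {
            "info_types": [{"name": t} for t in info_types],
            "min_likelihood": "LIKELY",
            "limits": {"max_findings_per_request": 1000},
        },
    }

job = build_bigquery_inspect_job(
    "PROJECT_ID", "customer_data", "transactions",
    ["CREDIT_CARD_NUMBER", "EMAIL_ADDRESS"],
)
```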
Configure de-identification transforms to mask, tokenize, or redact sensitive data.
# deidentify_pipeline.py - De-identify sensitive data using Google Cloud DLP
from google.cloud import dlp_v2

def deidentify_data(project_id, text):
    """De-identify PII in text using Cloud DLP."""
    client = dlp_v2.DlpServiceClient()
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ],
        "min_likelihood": dlp_v2.Likelihood.LIKELY,
    }
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                # Mask emails with "*", keeping punctuation for readability
                {
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            "number_to_mask": 0,  # 0 = mask all characters
                            "characters_to_ignore": [
                                {"common_characters_to_ignore": "PUNCTUATION"}
                            ],
                        }
                    },
                },
                # Tokenize card numbers with format-preserving encryption
                {
                    "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": {
                            "crypto_key": {
                                "kms_wrapped": {
                                    "wrapped_key": "WRAPPED_KEY_BASE64",
                                    "crypto_key_name": "projects/PROJECT/locations/global/keyRings/dlp/cryptoKeys/tokenization",
                                }
                            },
                            "common_alphabet": "NUMERIC",
                        }
                    },
                },
                # Redact SSNs entirely
                {
                    "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                    "primitive_transformation": {"redact_config": {}},
                },
            ]
        }
    }
    item = {"value": text}
    parent = f"projects/{project_id}/locations/global"
    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    return response.item.value
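The masking rule itself is applied server-side by the API, but its effect is easy to illustrate locally. A sketch of what the EMAIL_ADDRESS transform above produces (local approximation only, not the DLP API):

```python
# Local illustration of CharacterMaskConfig with PUNCTUATION ignored:
# every character is masked except punctuation, preserving the value's shape.
import string

def mask_preserving_punctuation(value, masking_character="*"):
    """Replace non-punctuation characters with the masking character."""
    return "".join(
        ch if ch in string.punctuation else masking_character
        for ch in value
    )

print(mask_preserving_punctuation("jane.doe@example.com"))  # → ****.***@*******.***
```

Preserving punctuation keeps masked emails recognizable as emails for debugging and analytics while removing the identifying content.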
Set up sensitivity labels and DLP policies in Microsoft Purview for Azure resources.
# Connect to Microsoft Purview compliance
Connect-IPPSSession
# Create sensitivity labels
New-Label -DisplayName "Confidential - PII" `
  -Name "Confidential-PII" `
  -Tooltip "Contains personally identifiable information" `
  -ContentType "File, Email"
New-Label -DisplayName "Highly Confidential - Financial" `
  -Name "HighlyConfidential-Financial" `
  -Tooltip "Contains financial data subject to PCI DSS" `
  -ContentType "File, Email"
# Create auto-labeling policy that applies the PII label
New-AutoSensitivityLabelPolicy -Name "Auto-Label-PII" `
  -ApplySensitivityLabel "Confidential-PII" `
  -ExchangeLocation All `
  -SharePointLocation All `
  -OneDriveLocation All `
  -Mode Enable
New-AutoSensitivityLabelRule -Policy "Auto-Label-PII" `
  -Name "Detect-SSN" `
  -ContentContainsSensitiveInformation @{
    Name = "U.S. Social Security Number (SSN)";
    MinCount = 1;
    MinConfidence = 85
  }
# Azure: Configure DLP policy for Storage accounts
az security assessment create \
--name "storage-sensitive-data" \
--assessed-resource-type "Microsoft.Storage/storageAccounts"
# Enable Microsoft Defender for Storage with sensitive data threat detection
az security pricing create --name StorageAccounts --tier standard \
--subplan DefenderForStorageV2 \
--extensions '[{"name":"SensitiveDataDiscovery","isEnabled":"True"}]'
Add DLP scanning to ETL and data pipeline workflows to prevent sensitive data leakage.
# pipeline_dlp_gate.py - DLP gate for data pipelines
import time

import boto3

macie_client = boto3.client('macie2')

def scan_pipeline_output(bucket, prefix):
    """Scan pipeline output data for sensitive content before promotion."""
    job_response = macie_client.create_classification_job(
        jobType='ONE_TIME',
        name=f'pipeline-scan-{prefix}',
        s3JobDefinition={
            'bucketDefinitions': [{
                'accountId': boto3.client('sts').get_caller_identity()['Account'],
                'buckets': [bucket]
            }],
            'scoping': {
                'includes': {
                    'and': [{
                        'simpleScopeTerm': {
                            'key': 'OBJECT_KEY',
                            'comparator': 'STARTS_WITH',
                            'values': [prefix]
                        }
                    }]
                }
            }
        },
        managedDataIdentifierSelector='ALL'
    )
    return job_response['jobId']

def wait_for_job(job_id, poll_seconds=60):
    """Classification jobs run asynchronously; poll until Macie finishes."""
    while True:
        status = macie_client.describe_classification_job(jobId=job_id)['jobStatus']
        if status in ('COMPLETE', 'CANCELLED'):
            return status
        time.sleep(poll_seconds)

def check_scan_results(job_id):
    """Check if the DLP scan found high-severity sensitive data."""
    response = macie_client.list_findings(
        findingCriteria={
            'criterion': {
                'classificationDetails.jobId': {'eq': [job_id]},
                'severity.description': {'eq': ['High', 'Critical']}
            }
        }
    )
    return len(response.get('findingIds', [])) > 0

def gate_decision(bucket, prefix):
    """DLP gate: block pipeline if sensitive data found."""
    job_id = scan_pipeline_output(bucket, prefix)
    wait_for_job(job_id)
    has_sensitive_data = check_scan_results(job_id)
    if has_sensitive_data:
        return {
            'decision': 'BLOCK',
            'reason': 'Sensitive data detected in pipeline output',
            'action': 'Apply de-identification before promoting to production'
        }
    return {'decision': 'ALLOW', 'reason': 'No sensitive data detected'}
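The gate's severity policy can also be factored out so it is unit-testable without calling Macie; a hypothetical standalone version of the decision rule:

```python
# Hypothetical pure-function version of the gate's severity policy,
# decoupled from the Macie API for local testing.
def evaluate_findings(severities, blocking=("High", "Critical")):
    """Return the gate decision for a list of finding severity labels."""
    blocked = any(s in blocking for s in severities)
    return "BLOCK" if blocked else "ALLOW"

print(evaluate_findings(["Low", "Medium"]))          # → ALLOW
print(evaluate_findings(["Medium", "High", "Low"]))  # → BLOCK
```

Keeping the threshold as a parameter makes it easy to tighten the gate (e.g. block on Medium as well) per environment.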
Aggregate DLP findings across cloud providers and generate compliance reports.
# Macie: Get finding statistics
aws macie2 get-finding-statistics \
  --group-by "severity.description" \
  --finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}'
# Macie: List findings by sensitivity type
aws macie2 list-findings \
  --finding-criteria '{
    "criterion": {
      "classificationDetails.result.sensitiveData.category": {"eq": ["PERSONAL_INFORMATION"]},
      "severity.description": {"eq": ["High"]}
    }
  }' \
  --sort-criteria '{"attributeName": "updatedAt", "orderBy": "DESC"}'
# GCP DLP: List completed inspection jobs via the REST API
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs?filter=state%3DDONE"
# Macie: Export classification results to S3 for compliance reporting
aws macie2 put-classification-export-configuration \
  --configuration '{"s3Destination":{"bucketName":"macie-findings-export","kmsKeyArn":"KMS_KEY_ARN"}}'
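Once each provider's findings are exported, merging them into one summary is straightforward. A hypothetical aggregation sketch (the per-provider counts are illustrative):

```python
# Hypothetical aggregation of per-provider severity counts into one
# cross-cloud summary for a compliance report.
from collections import Counter

def aggregate_findings(provider_findings):
    """Merge severity counts across cloud providers."""
    total = Counter()
    for counts in provider_findings.values():
        total.update(counts)
    return dict(total)

summary = aggregate_findings({
    "aws":   {"High": 120, "Medium": 300},
    "gcp":   {"High": 40, "Low": 15},
    "azure": {"Critical": 2, "Medium": 55},
})
```

Counter.update adds counts key-wise, so each severity total is the sum across providers regardless of which providers report it.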
| Term | Definition |
|---|---|
| Data Loss Prevention | Security controls and technologies that detect and prevent unauthorized disclosure of sensitive data from cloud environments |
| Amazon Macie | AWS service using machine learning to discover, classify, and protect sensitive data stored in S3 buckets |
| Google Cloud DLP | GCP API for inspecting, classifying, and de-identifying sensitive data across Cloud Storage, BigQuery, and Datastore |
| Data De-identification | Transforming sensitive data using masking, tokenization, encryption, or redaction to remove identifying characteristics while preserving utility |
| Sensitivity Label | Classification tag applied to data (Confidential, Highly Confidential) that triggers DLP policy enforcement and access controls |
| Custom Data Identifier | Organization-specific pattern (regex or keyword) added to DLP services to detect proprietary sensitive data formats |
Context: A compliance audit reveals that the analytics team's S3 data lake contains customer PII (names, emails, SSNs) in CSV files without encryption or access controls. The organization must classify all data and implement DLP controls.
Approach:
- Enable Macie and run a scoped classification job against the data lake buckets, targeting high-risk object types (CSV, JSON, Parquet).
- Review findings by severity and category to locate the buckets and prefixes holding PII.
- Apply de-identification (masking, tokenization, redaction) to datasets the analytics team still needs.
- Enforce encryption at rest and least-privilege access controls on the affected buckets.
- Add a DLP gate to the pipeline so new output is scanned before promotion to production.
Pitfalls: Macie charges per GB scanned. Large data lakes can generate significant costs. Use scoping rules to focus on high-risk object types (CSV, JSON, Parquet) and exclude known-safe formats (compressed archives, binary files). De-identification must preserve data utility for analytics while removing re-identification risk.
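A back-of-envelope cost estimate before launching a full scan can prevent surprises. A sketch (the per-GB rate is an assumed placeholder; check current Macie pricing for your region and tier):

```python
# Rough Macie scan cost estimate from bucket sizes.
# rate_per_gb is a placeholder assumption, not actual AWS pricing.
def estimate_scan_cost(bucket_sizes_gb, rate_per_gb=1.0):
    """Estimate object-scanning cost for a set of buckets, in dollars."""
    total_gb = sum(bucket_sizes_gb.values())
    return total_gb * rate_per_gb

cost = estimate_scan_cost(
    {"customer-data-bucket": 500, "analytics-data-lake": 4200},
    rate_per_gb=1.0,
)
```

Bucket sizes can come straight from the describe-buckets output shown earlier (classifiableSizeInBytes), which is also where scoping rules pay off: excluding known-safe formats shrinks the classifiable volume before any cost is incurred.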
Cloud DLP Compliance Report
==============================
Organization: Acme Corp
Scan Period: 2026-02-01 to 2026-02-23
Environments: AWS (12 buckets), GCP (3 datasets), Azure (5 storage accounts)
DATA DISCOVERY SUMMARY:
Total objects/records scanned: 2,847,000
Objects with sensitive data: 45,200 (1.6%)
Unique sensitivity categories: 8
SENSITIVE DATA FINDINGS:
PII (names, emails, phone): 23,400 objects
Financial (credit cards, bank): 8,700 objects
Health (PHI, medical records): 3,200 objects
Credentials (API keys, tokens): 1,400 objects
Government ID (SSN, passport): 5,800 objects
Custom (employee ID, account): 2,700 objects
FINDINGS BY SEVERITY:
Critical: 1,400 (exposed credentials)
High: 14,200 (unprotected PII/PHI)
Medium: 18,600 (standard PII)
Low: 11,000 (non-sensitive patterns)
PROTECTION STATUS:
Data with encryption at rest: 78%
Data with access controls: 65%
Data with sensitivity labels: 12%
Pipeline data with DLP gates: 30%
REMEDIATION ACTIONS:
Objects quarantined: 1,400
De-identification applied: 8,200
Access controls tightened: 14,200
Sensitivity labels applied: 45,200