Implements cloud DLP using Amazon Macie, Azure Information Protection, and the Google Cloud DLP API to discover, classify, and protect sensitive data in cloud storage, databases, and pipelines, supporting compliance frameworks such as GDPR, HIPAA, and PCI DSS.

npx claudepluginhub killvxk/cybersecurity-skills-zh

This skill uses the workspace's default tool permissions.
- When compliance frameworks (GDPR, HIPAA, PCI DSS) require automated sensitive data discovery and protection
Not applicable to: endpoint DLP (use Microsoft Purview or Symantec DLP agents), email DLP (use Microsoft 365 DLP or Google Workspace DLP), or network-level data exfiltration prevention (use VPC endpoint policies and network firewalls).
Enable Amazon Macie and configure automated sensitive data discovery jobs for S3 buckets.
# Enable Amazon Macie
aws macie2 enable-macie
# List all S3 buckets that Macie can scan
aws macie2 describe-buckets \
--query 'buckets[*].[bucketName,classifiableSizeInBytes,unclassifiableObjectCount.total]' \
--output table
# Create a classification job for specific buckets
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "weekly-pii-scan" \
  --schedule-frequency-details '{"weekly":{"dayOfWeek":"MONDAY"}}' \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "ACCOUNT_ID",
      "buckets": ["customer-data-bucket", "analytics-data-lake", "backup-bucket"]
    }],
    "scoping": {
      "includes": {
        "and": [{
          "simpleScopeTerm": {
            "key": "OBJECT_EXTENSION",
            "values": ["csv", "json", "parquet", "txt", "xlsx"],
            "comparator": "EQ"
          }
        }]
      }
    }
  }' \
  --managed-data-identifier-selector INCLUDE \
  --managed-data-identifier-ids '["SSN","CREDIT_CARD_NUMBER","EMAIL_ADDRESS","AWS_CREDENTIALS","PHONE_NUMBER"]'
# Create a custom data identifier for internal employee IDs
aws macie2 create-custom-data-identifier \
--name "EmployeeID" \
--regex "EMP-[0-9]{6}" \
--description "Internal employee ID format"
# Check job status and results
aws macie2 list-classification-jobs \
--query 'items[*].[name,jobStatus,statistics.approximateNumberOfObjectsToProcess]' \
--output table
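Before registering a custom data identifier, it can help to sanity-check the regex locally against sample text. A minimal sketch using plain Python `re`, mirroring the `EMP-[0-9]{6}` pattern above (the sample text is made up):

```python
import re

# Same pattern as the EmployeeID custom data identifier registered above
EMPLOYEE_ID_RE = re.compile(r"EMP-[0-9]{6}")

def find_employee_ids(text):
    """Return every substring matching the employee-ID pattern."""
    return EMPLOYEE_ID_RE.findall(text)

# "EMP-99" is ignored: the pattern requires exactly six digits
matches = find_employee_ids("Ping EMP-012345 and EMP-99 about EMP-678901.")
```

Checking the pattern locally first avoids paying for a Macie scan cycle just to discover the regex over- or under-matches.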
Use Google Cloud DLP to inspect and de-identify sensitive data in GCP resources. Enable the API first with `gcloud services enable dlp.googleapis.com`.
# Inspect a Cloud Storage bucket for sensitive data. Cloud DLP has no gcloud
# command group, so call the REST API directly.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs" \
  -d '{
    "inspectJob": {
      "storageConfig": {
        "cloudStorageOptions": {
          "fileSet": {"url": "gs://sensitive-data-bucket/data/*.csv"}
        }
      },
      "inspectConfig": {
        "infoTypes": [
          {"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"},
          {"name": "CREDIT_CARD_NUMBER"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}
        ],
        "minLikelihood": "LIKELY"
      }
    }
  }'
# Create an inspection job for BigQuery
cat > dlp-job.json << 'EOF'
{
  "inspectJob": {
    "storageConfig": {
      "bigQueryOptions": {
        "tableReference": {
          "projectId": "PROJECT_ID",
          "datasetId": "customer_data",
          "tableId": "transactions"
        },
        "sampleMethod": "RANDOM_START",
        "rowsLimit": 10000
      }
    },
    "inspectConfig": {
      "infoTypes": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "PERSON_NAME"}
      ],
      "minLikelihood": "LIKELY",
      "limits": {"maxFindingsPerRequest": 1000}
    },
    "actions": [{
      "saveFindings": {
        "outputConfig": {
          "table": {
            "projectId": "PROJECT_ID",
            "datasetId": "dlp_results",
            "tableId": "findings"
          }
        }
      }
    }]
  }
}
EOF
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @dlp-job.json \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs"
Configure de-identification transforms to mask, tokenize, or redact sensitive data.
# deidentify_pipeline.py - de-identify sensitive data with Google Cloud DLP
from google.cloud import dlp_v2


def deidentify_data(project_id, text):
    """De-identify PII in text using Cloud DLP."""
    client = dlp_v2.DlpServiceClient()

    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ],
        "min_likelihood": dlp_v2.Likelihood.LIKELY,
    }

    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    # Mask every character of an email except punctuation
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            "number_to_mask": 0,
                            "characters_to_ignore": [
                                {"common_characters_to_ignore": "PUNCTUATION"}
                            ],
                        }
                    },
                },
                {
                    # Tokenize card numbers with format-preserving encryption
                    "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": {
                            "crypto_key": {
                                "kms_wrapped": {
                                    "wrapped_key": "WRAPPED_KEY_BASE64",
                                    "crypto_key_name": "projects/PROJECT/locations/global/keyRings/dlp/cryptoKeys/tokenization",
                                }
                            },
                            "common_alphabet": "NUMERIC",
                        }
                    },
                },
                {
                    # Remove SSNs entirely
                    "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                    "primitive_transformation": {"redact_config": {}},
                },
            ]
        }
    }

    item = {"value": text}
    parent = f"projects/{project_id}/locations/global"
    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    return response.item.value
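To get a feel for what the CharacterMaskConfig above produces without calling the API, here is a local approximation (my own helper, not part of google-cloud-dlp): with number_to_mask=0 and PUNCTUATION in characters_to_ignore, every character is masked except punctuation.

```python
import string

def mask_like_character_mask_config(value, masking_character="*"):
    """Approximate CharacterMaskConfig (number_to_mask=0, ignore PUNCTUATION):
    replace every character with the mask except punctuation."""
    return "".join(
        ch if ch in string.punctuation else masking_character
        for ch in value
    )

# '@' and '.' survive; letters are replaced with '*'
masked = mask_like_character_mask_config("alice@example.com")
```

Keeping punctuation visible preserves the shape of the value (so an analyst can still see it was an email) while removing the identifying content.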
Set up sensitivity labels and DLP policies for Azure resources in Microsoft Purview.
# Connect to the Microsoft Purview compliance PowerShell session
Connect-IPPSSession

# Create sensitivity labels
New-Label -DisplayName "Confidential - PII" `
    -Name "Confidential-PII" `
    -Tooltip "Contains personally identifiable information" `
    -ContentType "File, Email"

New-Label -DisplayName "Highly Confidential - Financial" `
    -Name "HighlyConfidential-Financial" `
    -Tooltip "Contains financial data subject to PCI DSS" `
    -ContentType "File, Email"

# Create an auto-labeling policy
New-AutoSensitivityLabelPolicy -Name "Auto-Label-PII" `
    -ExchangeLocation All `
    -SharePointLocation All `
    -OneDriveLocation All `
    -Mode Enable

New-AutoSensitivityLabelRule -Policy "Auto-Label-PII" `
    -Name "Detect-SSN" `
    -ContentContainsSensitiveInformation @{
        Name = "U.S. Social Security Number (SSN)";
        MinCount = 1;
        MinConfidence = 85
    } `
    -ApplySensitivityLabel "Confidential-PII"

# Azure: configure a DLP assessment for storage accounts (az CLI, bash)
az security assessment create \
    --name "storage-sensitive-data" \
    --assessed-resource-type "Microsoft.Storage/storageAccounts"

# Enable Microsoft Defender for Storage with sensitive data threat detection
az security pricing create --name StorageAccounts --tier standard \
    --subplan DefenderForStorageV2 \
    --extensions '[{"name":"SensitiveDataDiscovery","isEnabled":"True"}]'
Add DLP scanning to ETL and data pipeline workflows to prevent sensitive data leakage.
# pipeline_dlp_gate.py - DLP gate for data pipelines
import time

import boto3

macie_client = boto3.client('macie2')


def scan_pipeline_output(bucket, prefix):
    """Scan pipeline output for sensitive content before the data is promoted."""
    account_id = boto3.client('sts').get_caller_identity()['Account']
    job_response = macie_client.create_classification_job(
        jobType='ONE_TIME',
        name=f'pipeline-scan-{prefix}',
        s3JobDefinition={
            'bucketDefinitions': [{
                'accountId': account_id,
                'buckets': [bucket]
            }],
            'scoping': {
                'includes': {
                    'and': [{
                        'simpleScopeTerm': {
                            'key': 'OBJECT_KEY',
                            'comparator': 'STARTS_WITH',
                            'values': [prefix]
                        }
                    }]
                }
            }
        },
        managedDataIdentifierSelector='ALL'
    )
    return job_response['jobId']


def wait_for_job(job_id, poll_seconds=30):
    """Poll until the classification job reaches a terminal state."""
    while True:
        status = macie_client.describe_classification_job(jobId=job_id)['jobStatus']
        if status in ('COMPLETE', 'CANCELLED', 'USER_PAUSED'):
            return status
        time.sleep(poll_seconds)


def check_scan_results(job_id):
    """Check whether the DLP scan produced high-severity findings."""
    response = macie_client.list_findings(
        findingCriteria={
            'criterion': {
                'classificationDetails.jobId': {'eq': [job_id]},
                'severity.description': {'eq': ['High', 'Critical']}
            }
        }
    )
    return len(response.get('findingIds', [])) > 0


def gate_decision(bucket, prefix):
    """DLP gate: block the pipeline if sensitive data is found."""
    job_id = scan_pipeline_output(bucket, prefix)
    wait_for_job(job_id)
    has_sensitive_data = check_scan_results(job_id)
    if has_sensitive_data:
        return {
            'decision': 'BLOCK',
            'reason': 'Sensitive data detected in pipeline output',
            'action': 'Apply de-identification before promoting to production'
        }
    return {'decision': 'ALLOW', 'reason': 'No sensitive data detected'}
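One way to wire the gate into an orchestrator step is to fail fast on a BLOCK decision. A small hypothetical helper, pure Python, assuming only the decision-dict shape shown above (the stubbed dicts stand in for a real Macie-backed result):

```python
def enforce_gate(result):
    """Raise if the DLP gate blocked promotion; otherwise pass the result through."""
    if result["decision"] == "BLOCK":
        raise RuntimeError(f"DLP gate: {result['reason']} ({result['action']})")
    return result

# Example with a stubbed ALLOW decision (no AWS call made)
outcome = enforce_gate({"decision": "ALLOW", "reason": "No sensitive data detected"})
```

Raising an exception lets most schedulers (Airflow, Step Functions wrappers, cron jobs) mark the promotion step failed without any gate-specific integration.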
Aggregate DLP findings across cloud providers and generate compliance reports.
# Macie: get finding statistics grouped by severity
aws macie2 get-finding-statistics \
--group-by "severity.description" \
--finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}'
# Macie: list findings by sensitivity category
aws macie2 list-findings \
--finding-criteria '{
"criterion": {
"classificationDetails.result.sensitiveData.category": {"eq": ["PERSONAL_INFORMATION"]},
"severity.description": {"eq": ["High"]}
}
}' \
--sort-criteria '{"attributeName": "updatedAt", "orderBy": "DESC"}'
# GCP DLP: list completed inspection jobs via the REST API
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs?type=INSPECT_JOB&filter=state%3DDONE"
# Configure export of Macie discovery results to S3 for compliance reporting
aws macie2 put-classification-export-configuration \
  --configuration '{"s3Destination":{"bucketName":"COMPLIANCE_REPORTS_BUCKET","keyPrefix":"macie/","kmsKeyArn":"KMS_KEY_ARN"}}'
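Once findings have been exported from each provider, a cross-cloud rollup can be built by normalizing records into (provider, severity) counts. A minimal sketch over already-fetched finding dicts (the field names here are my assumption for a normalized intermediate format, not any provider's schema):

```python
from collections import Counter

def summarize_findings(findings):
    """Count findings by (provider, severity) for a compliance rollup."""
    return Counter((f["provider"], f["severity"]) for f in findings)

# Illustrative normalized records from Macie and Cloud DLP exports
findings = [
    {"provider": "aws", "severity": "High"},
    {"provider": "aws", "severity": "High"},
    {"provider": "gcp", "severity": "Critical"},
]
summary = summarize_findings(findings)
```

The same Counter feeds directly into the per-severity totals shown in the sample compliance report below the glossary.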
| Term | Definition |
|---|---|
| Data Loss Prevention (DLP) | Security controls that detect and prevent unauthorized disclosure of sensitive data in cloud environments |
| Amazon Macie | AWS service that uses machine learning to discover, classify, and protect sensitive data in S3 buckets |
| Google Cloud DLP | GCP API for inspecting, classifying, and de-identifying sensitive data in Cloud Storage, BigQuery, and Datastore |
| Data de-identification | Transforming sensitive data via masking, tokenization, encryption, or redaction to remove identifying characteristics while preserving data utility |
| Sensitivity label | Classification tag (Confidential, Highly Confidential) applied to data that triggers DLP policy enforcement and access controls |
| Custom data identifier | Organization-specific pattern (regex or keywords) added to a DLP service to detect proprietary sensitive data formats |
Scenario: A compliance audit found that the analytics team's S3 data lake contains customer PII (names, emails, Social Security numbers) in CSV files with no encryption or access controls. The organization must classify all data and implement DLP controls.
Approach: enable Macie, run a scoped classification job against the data-lake buckets, quarantine objects with critical findings, apply de-identification before data is promoted, and enforce encryption and least-privilege access controls.
Common pitfall: Macie charges per GB scanned, so a large data lake can generate significant cost. Use scoping rules to focus on high-risk object types (CSV, JSON, Parquet) and exclude formats known to be safe (compressed archives, binaries). De-identification must remove re-identification risk while preserving the analytic utility of the data.
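The scan-cost pitfall can be budgeted before launching a job by summing `classifiableSizeInBytes` from the `describe-buckets` output shown earlier. A back-of-the-envelope helper (the per-GB rate is a placeholder assumption, not Macie's actual price; check current pricing):

```python
def estimate_macie_scan_cost(bucket_sizes_bytes, price_per_gb_usd=1.0):
    """Rough scan cost: total classifiable GiB times an assumed per-GB rate.

    price_per_gb_usd is a placeholder; substitute the current Macie rate.
    """
    total_gb = sum(bucket_sizes_bytes) / (1024 ** 3)
    return round(total_gb * price_per_gb_usd, 2)

# Two buckets: 500 GiB and 1.5 TiB of classifiable data
cost = estimate_macie_scan_cost([500 * 1024**3, 1536 * 1024**3])
```

Running this estimate per bucket makes it easy to see which scoping exclusions (archives, binaries) buy the biggest cost reduction.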
Cloud DLP Compliance Report
==============================
Organization: Acme Corp
Scan period: 2026-02-01 to 2026-02-23
Environments: AWS (12 buckets), GCP (3 datasets), Azure (5 storage accounts)

Data discovery summary:
  Total objects/records scanned: 2,847,000
  Objects containing sensitive data: 45,200 (1.6%)
  Unique sensitivity classifications: 8

Sensitive data findings:
  PII (names, emails, phones): 23,400 objects
  Financial (credit cards, banking): 8,700 objects
  Health (PHI, medical records): 3,200 objects
  Credentials (API keys, tokens): 1,400 objects
  Government IDs (SSNs, passports): 5,800 objects
  Custom (employee IDs, account numbers): 2,700 objects

Findings by severity:
  Critical: 1,400 (credential exposure)
  High: 14,200 (unprotected PII/PHI)
  Medium: 18,600 (standard PII)
  Low: 11,000 (non-sensitive patterns)

Protection status:
  Data encrypted at rest: 78%
  Data with access controls: 65%
  Data with sensitivity labels: 12%
  Pipeline data behind DLP gates: 30%

Remediation actions:
  Objects quarantined: 1,400
  De-identification applied: 8,200
  Access controls tightened: 14,200
  Sensitivity labels applied: 45,200