Reduces SOC alert fatigue via detection rule tuning, duplicate merging, risk-based alerting, and quality metrics measurement. For high alert volumes, false positives >70%, or analyst overload.
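Use this skill when alert volume is high, the false-positive rate exceeds 70%, or analysts are overloaded.
Do not use it to disable detection rules without analysis: reducing alerts must not create detection blind spots.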
Quantify the problem before making any changes:
--- Alert volume and disposition analysis (last 90 days)
index=notable earliest=-90d
| stats count AS total_alerts,
sum(eval(if(status_label="Resolved - True Positive", 1, 0))) AS true_positives,
sum(eval(if(status_label="Resolved - False Positive", 1, 0))) AS false_positives,
sum(eval(if(status_label="Resolved - Benign", 1, 0))) AS benign,
sum(eval(if(status_label="New" OR status_label="In Progress", 1, 0))) AS unresolved
by rule_name
| eval fp_rate = round(false_positives / total_alerts * 100, 1)
| eval tp_rate = round(true_positives / total_alerts * 100, 1)
| eval signal_to_noise = round(true_positives / (false_positives + 0.01), 2)
| sort - total_alerts
| table rule_name, total_alerts, true_positives, false_positives, benign, fp_rate, tp_rate, signal_to_noise
--- Top 10 noisiest rules (tuning candidates); append these lines to the query above
| search fp_rate > 70 OR total_alerts > 1000
| sort - false_positives
| head 10
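The arithmetic behind the baseline query can be checked offline. The following is a minimal Python sketch, assuming per-rule counts have already been exported from Splunk; `RuleStats`, `quality_metrics`, and `tuning_candidates` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Per-rule counts, as produced by the stats step of the query."""
    rule_name: str
    total_alerts: int
    true_positives: int
    false_positives: int


def quality_metrics(stats: RuleStats) -> dict:
    """Recompute fp_rate, tp_rate and signal_to_noise for one rule."""
    total = stats.total_alerts
    fp_rate = round(stats.false_positives / total * 100, 1) if total else 0.0
    tp_rate = round(stats.true_positives / total * 100, 1) if total else 0.0
    # Same 0.01 offset as the query: avoids dividing by zero when a rule
    # has no false positives.
    signal_to_noise = round(
        stats.true_positives / (stats.false_positives + 0.01), 2
    )
    return {
        "fp_rate": fp_rate,
        "tp_rate": tp_rate,
        "signal_to_noise": signal_to_noise,
    }


def tuning_candidates(rules, limit=10):
    """Rules with fp_rate > 70 or more than 1000 alerts, most false positives first."""
    flagged = [
        r for r in rules
        if quality_metrics(r)["fp_rate"] > 70 or r.total_alerts > 1000
    ]
    return sorted(flagged, key=lambda r: r.false_positives, reverse=True)[:limit]
```

With made-up counts, a rule with 200 alerts, 20 true positives, and 160 false positives has an 80.0% false-positive rate and would be flagged as a tuning candidate.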
Average daily alert load per analyst:
index=notable earliest=-30d
| bin _time span=1d
| stats count AS daily_alerts by _time
| stats avg(daily_alerts) AS avg_daily, max(daily_alerts) AS peak_daily,
stdev(daily_alerts) AS stdev_daily
--- Assumes 6 analysts per shift
| eval alerts_per_analyst = round(avg_daily / 6, 0)
| eval capacity_status = case(
alerts_per_analyst > 100, "Critical: exceeds analyst capacity",
alerts_per_analyst > 50, "Warning: approaching capacity limit",
1=1, "Healthy: within manageable range"
)
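The capacity thresholds are easy to sanity-check outside Splunk. This is a small Python sketch of the same classification, assuming the six-analyst shift from the query; `capacity_status` is a hypothetical helper and the short status labels are illustrative.

```python
def capacity_status(avg_daily_alerts: float, analysts_per_shift: int = 6) -> tuple:
    """Return (alerts_per_analyst, status) using the query's 100 / 50 thresholds."""
    alerts_per_analyst = round(avg_daily_alerts / analysts_per_shift)
    if alerts_per_analyst > 100:
        status = "critical"   # exceeds analyst capacity
    elif alerts_per_analyst > 50:
        status = "warning"    # approaching the capacity limit
    else:
        status = "healthy"    # within a manageable range
    return alerts_per_analyst, status
```

Changing `analysts_per_shift` shows how sensitive the status is to staffing, which helps when the shift size in the query is only an estimate.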
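Convert threshold-based alerts into risk scores in Splunk ES:
--- Instead of raising an alert for every failed logon, each detection contributes a risk score
--- Risk rule: authentication failures (contributes risk, generates no alert)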
index=wineventlog EventCode=4625
| stats count by src_ip, TargetUserName, ComputerName
| where count > 5
| eval risk_score = case(
count > 50, 40,
count > 20, 25,
count > 10, 15,
count > 5, 5
)
| eval risk_object = src_ip
| eval risk_object_type = "system"
| eval risk_message = count." failed logons from ".src_ip." targeting ".TargetUserName
| collect index=risk
--- Risk rule: successful logon after failures (adds further risk)
index=wineventlog EventCode=4624 Logon_Type=3
| lookup risk_scores src_ip AS src_ip OUTPUT total_risk
| where total_risk > 0
| eval risk_score = 30
| eval risk_message = "Successful logon from ".src_ip." after ".total_risk." accumulated risk points"
| collect index=risk
--- Risk threshold alert: fires only when cumulative risk exceeds the threshold
index=risk earliest=-24h
| stats sum(risk_score) AS total_risk, values(risk_message) AS risk_events,
dc(source) AS contributing_rules by risk_object
| where total_risk >= 75
| eval urgency = case(
total_risk >= 150, "critical",
total_risk >= 100, "high",
total_risk >= 75, "medium"
)
--- This single alert replaces 10+ separate threshold alerts
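The aggregation logic of the three searches above can be sketched in plain Python. This is an illustration under stated assumptions, not how Splunk ES implements risk: events are modelled as `(risk_object, risk_score, risk_message)` tuples, and `failed_logon_risk`, `urgency_for`, and `aggregate_risk` are hypothetical names. The score tiers and the 75 / 100 / 150 thresholds are taken from the queries.

```python
from collections import defaultdict


def failed_logon_risk(count: int) -> int:
    """Risk contribution for a number of failed logons (tiers from the risk rule)."""
    if count > 50:
        return 40
    if count > 20:
        return 25
    if count > 10:
        return 15
    if count > 5:
        return 5
    return 0  # at or below the threshold: no risk contribution


def urgency_for(total_risk: int):
    """Map cumulative risk to an urgency, or None when below the alert threshold."""
    if total_risk >= 150:
        return "critical"
    if total_risk >= 100:
        return "high"
    if total_risk >= 75:
        return "medium"
    return None


def aggregate_risk(events):
    """Sum risk per object and emit one alert per object that crosses the threshold."""
    totals = defaultdict(int)
    messages = defaultdict(list)
    for risk_object, risk_score, risk_message in events:
        totals[risk_object] += risk_score
        messages[risk_object].append(risk_message)

    alerts = {}
    for risk_object, total in totals.items():
        urgency = urgency_for(total)
        if urgency is not None:
            alerts[risk_object] = {
                "total_risk": total,
                "urgency": urgency,
                "risk_events": messages[risk_object],
            }
    return alerts
```

For example, a source with 60 failed logons (40 points), a later successful logon (30 points), and one further 15-point contribution totals 85 and produces a single medium-urgency alert, while a source with only 8 failed logons (5 points) produces none.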
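RBA before-and-after comparison:
Before RBA:
Rule: "Failed logons > 5" → 847 alerts/day (false-positive rate: 92%)
Rule: "Suspicious process" → 234 alerts/day (false-positive rate: 78%)
Rule: "Network anomaly" → 156 alerts/day (false-positive rate: 85%)
Total: 1,237 alerts/day
After RBA:
Risk-aggregated alert → 23 alerts/day (false-positive rate: 18%)
Each alert carries the full context from multiple risk contributions
Reduction: about 98% fewer alerts (1,237 to 23 per day), with a higher true-positive rate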
Systematically tune the noisiest rules:
--- Identify common false-positive patterns
index=notable rule_name="Suspicious PowerShell Execution" status_label="Resolved - False Positive"
earliest=-90d
| stats count by src, dest, user, CommandLine
| sort - count
| head 20
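--- Result: the SCCM client accounts for 80% of the false positives
Apply the tuning:
--- Original rule (generates false positives)
index=sysmon EventCode=1 Image="*\\powershell.exe"
(CommandLine="*-enc*" OR CommandLine="*-encodedcommand*" OR CommandLine="*invoke-expression*")
--- The count field only exists after a stats step, so aggregate before filtering on it
| stats count by host, User, ParentImage, CommandLine
| where count > 0
--- Tuned rule (excludes known legitimate sources)
index=sysmon EventCode=1 Image="*\\powershell.exe"
(CommandLine="*-enc*" OR CommandLine="*-encodedcommand*" OR CommandLine="*invoke-expression*")
--- Rename the lookup column so the subsearch filters on the CommandLine field
NOT [| inputlookup powershell_whitelist.csv | fields CommandLine_pattern | rename CommandLine_pattern AS CommandLine]
NOT (ParentImage="*\\ccmexec.exe" OR ParentImage="*\\sccm*")
NOT (User="SYSTEM" AND ParentImage="*\\services.exe" AND
CommandLine="*Microsoft\\ConfigMgr*")
| stats count by host, User, ParentImage, CommandLine
| where count > 0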
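Before deploying exclusions, it can help to replay exported events through the same logic and confirm that only the intended sources drop out. Below is a minimal Python sketch of the tuned rule, assuming events are dicts keyed by the Sysmon field names used in the query; `is_suspicious`, `is_excluded`, and `should_alert` are hypothetical names. Values are lowercased before matching to approximate Splunk's case-insensitive search, and the lookup-based whitelist is left out.

```python
from fnmatch import fnmatchcase

# Patterns copied from the rule, lowercased for case-insensitive matching.
SUSPICIOUS_PATTERNS = ("*-enc*", "*-encodedcommand*", "*invoke-expression*")
EXCLUDED_PARENTS = ("*\\ccmexec.exe", "*\\sccm*")


def is_suspicious(event: dict) -> bool:
    """True when powershell.exe runs with an encoded or invoke-expression command."""
    image = event.get("Image", "").lower()
    command = event.get("CommandLine", "").lower()
    return fnmatchcase(image, "*\\powershell.exe") and any(
        fnmatchcase(command, pattern) for pattern in SUSPICIOUS_PATTERNS
    )


def is_excluded(event: dict) -> bool:
    """True when the event matches one of the documented benign sources."""
    parent = event.get("ParentImage", "").lower()
    command = event.get("CommandLine", "").lower()
    if any(fnmatchcase(parent, pattern) for pattern in EXCLUDED_PARENTS):
        return True
    return (
        event.get("User") == "SYSTEM"
        and fnmatchcase(parent, "*\\services.exe")
        and fnmatchcase(command, "*microsoft\\configmgr*")
    )


def should_alert(event: dict) -> bool:
    """Tuned rule: suspicious PowerShell that no exclusion covers."""
    return is_suspicious(event) and not is_excluded(event)
```

Running known-malicious test events through `should_alert` alongside the benign SCCM samples is one way to check that an exclusion does not open a detection gap.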
Document the tuning decision:
rule_name: Suspicious PowerShell Execution
tuning_date: 2024-03-15
original_fp_rate: 78%
tuned_fp_rate: 22%
exclusions_added:
- ParentImage containing ccmexec.exe (SCCM client)
- User=SYSTEM with a CommandLine containing ConfigMgr
- Scheduled task: Windows Update PowerShell module
alerts_reduced: about 180 eliminated per day
detection_impact: none; exclusions validated against ATT&CK test cases
approved_by: detection_engineering_lead
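Group related alerts into a single incident:
--- Consolidate alerts by source IP within a time window
index=notable earliest=-1h
| sort _time
--- dedup has no span option, so bucket events into 5-minute windows first
| bin span=5m _time AS window
| dedup src, rule_name, window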
| stats count AS alert_count, values(rule_name) AS related_rules,
earliest(_time) AS first_alert, latest(_time) AS last_alert
by src
| where alert_count > 3
| eval consolidated_alert = src." triggered ".alert_count." related alerts: ".mvjoin(related_rules, ", ")
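The consolidation step amounts to a windowed dedup followed by a group-by. Here is a minimal Python sketch, assuming alerts are dicts with `src`, `rule_name`, and an epoch-second `_time`; `consolidate` is a hypothetical helper. It uses fixed 5-minute buckets rather than a sliding window, so two alerts a few seconds apart can still land in different buckets.

```python
from collections import defaultdict


def consolidate(alerts, window_seconds=300, min_alerts=3):
    """Drop repeats of (src, rule_name) within a time bucket, then group by src."""
    seen = set()
    grouped = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["_time"]):
        window = int(alert["_time"] // window_seconds)
        key = (alert["src"], alert["rule_name"], window)
        if key in seen:
            continue  # duplicate of an alert already kept for this bucket
        seen.add(key)
        grouped[alert["src"]].append(alert)

    incidents = {}
    for src, items in grouped.items():
        # Same cut-off as the query: keep sources with more than min_alerts alerts.
        if len(items) > min_alerts:
            incidents[src] = {
                "alert_count": len(items),
                "related_rules": sorted({a["rule_name"] for a in items}),
                "first_alert": items[0]["_time"],
                "last_alert": items[-1]["_time"],
            }
    return incidents
```

Each returned entry corresponds to one consolidated incident that an analyst would investigate as a unit.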
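Splunk ES notable event suppression:
--- Suppress repeated alerts for the same source/destination pair within 1 hour
--- `notable` is the ES macro that returns enriched notable events
`notable`
| bin span=1h _time AS window
| dedup src, dest, rule_name, window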
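Route alerts by confidence and severity:
Alert routing policy
━━━━━━━━━━━━━━━━━━━━━
Tier 1 (automated handling):
- Risk score < 30: auto-close and log the enrichment data
- Known false-positive patterns: auto-suppress (reviewed quarterly)
- Informational alerts: routed to a dashboard only (never enter the queue)
Tier 2 (analyst review):
- Risk score 30-75: standard triage queue
- Medium-confidence alerts: require an analyst decision
- Already enriched with automated context (VT, AbuseIPDB, asset information)
Tier 3 (priority investigation):
- Risk score > 75: investigate immediately
- Deception (honeypot) alerts: auto-escalate (zero false positives)
- Known malware detections: automated containment plus analyst review
Implementation in Splunk:
--- Assumes risk_score and fp_rate are already present on each notable event (for example, added from a per-rule lookup)
index=notable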
| eval routing = case(
urgency="critical" OR source="deception", "TIER3_IMMEDIATE",
urgency="high" AND risk_score > 75, "TIER3_IMMEDIATE",
urgency="high" OR urgency="medium", "TIER2_STANDARD",
urgency="low" AND fp_rate > 80, "TIER1_AUTO_CLOSE",
1=1, "TIER2_STANDARD"
)
--- Auto-closed alerts are removed from the queue
| where routing != "TIER1_AUTO_CLOSE"
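Because `case()` returns the first matching branch, the order of the conditions matters. The sketch below restates the routing logic in Python so the precedence can be unit-tested; `route` and `analyst_queue` are hypothetical names, and the field names follow the query.

```python
def route(urgency: str, source: str = "", risk_score: float = 0, fp_rate: float = 0) -> str:
    """Return the routing tier; conditions are evaluated in the same order as case()."""
    if urgency == "critical" or source == "deception":
        return "TIER3_IMMEDIATE"
    if urgency == "high" and risk_score > 75:
        return "TIER3_IMMEDIATE"
    if urgency in ("high", "medium"):
        return "TIER2_STANDARD"
    if urgency == "low" and fp_rate > 80:
        return "TIER1_AUTO_CLOSE"
    return "TIER2_STANDARD"  # default branch, same as 1=1 in the query


def analyst_queue(alerts):
    """Keep only alerts that still need a human, mirroring the final where clause."""
    return [a for a in alerts if route(**a) != "TIER1_AUTO_CLOSE"]
```

Note that a low-urgency alert from a rule with a false-positive rate of 80% or less still falls through to the standard queue, which keeps the auto-close path narrow.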
Track alert fatigue metrics over time:
--- Weekly alert quality trend
index=notable earliest=-90d
| bin _time span=1w
| stats count AS total,
sum(eval(if(status_label="Resolved - True Positive", 1, 0))) AS tp,
sum(eval(if(status_label="Resolved - False Positive", 1, 0))) AS fp
by _time
| eval tp_rate = round(tp / total * 100, 1)
| eval fp_rate = round(fp / total * 100, 1)
--- 42 = 6 analysts * 7 days, giving alerts per analyst per day
| eval alerts_per_analyst = round(total / 42, 0)
| table _time, total, tp, fp, tp_rate, fp_rate, alerts_per_analyst
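| Term | Definition |
|---|---|
| Alert Fatigue | Cognitive overload from excessive alerts, which leads analysts to ignore or dismiss valid ones |
| Risk-Based Alerting (RBA) | A detection approach that aggregates risk contributions from multiple events before generating a single high-context alert |
| Signal-to-Noise Ratio | Ratio of true-positive alerts to false positives; a higher ratio means better alert quality |
| False Positive Rate | Share of alerts classified as benign after investigation; the target for production rules is below 30% |
| Alert Consolidation | Grouping related alerts from the same source or activity into a single unit of investigation |
| Detection Tuning | The process of refining rule logic to exclude known benign patterns while preserving true-positive detection |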
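Alert Fatigue Reduction Report: Q1 2024
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before (January 2024):
Average daily alerts:            1,847
Alerts per analyst per shift:    154
False-positive rate:             82%
True-positive rate:              8%
Signal-to-noise ratio:           0.10
Analyst morale:                  low (2 departures in Q4)
After (March 2024):
Average daily alerts:            287 (-84%)
Alerts per analyst per shift:    24
False-positive rate:             23% (-72%)
True-positive rate:              41% (+413%)
Signal-to-noise ratio:           1.78
Changes implemented:
[1] Deployed risk-based alerting (15 rules converted)      1,200 fewer alerts per day
[2] Tuned the top 10 noisy rules (exclusion lists added)   280 fewer alerts per day
[3] Alert consolidation (5-minute dedup window)            80 fewer alerts per day
[4] Tier 1 auto-close for low-confidence alerts            N/A (removed from queue)
Detection coverage impact: none; ATT&CK coverage held at 67%
True-positive detections: up, with 12 additional true positives found per week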
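The derived figures in a report like this are worth recomputing from the raw numbers before it is circulated. A small Python sketch follows; `pct_change` and `signal_to_noise` are hypothetical helpers, and the inputs used in the example are the report's own values.

```python
import math


def pct_change(before: float, after: float) -> int:
    """Relative change in percent, rounded half up to a whole number."""
    change = (after - before) / before * 100
    # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even.
    return math.floor(change + 0.5)


def signal_to_noise(tp_rate: float, fp_rate: float) -> float:
    """True-positive rate divided by false-positive rate, to two decimals."""
    return round(tp_rate / fp_rate, 2)


# Values taken from the report.
daily_change = pct_change(1847, 287)
fp_change = pct_change(82, 23)
tp_change = pct_change(8, 41)
snr_before = signal_to_noise(8, 82)
snr_after = signal_to_noise(41, 23)
# Daily reductions attributed to the first three changes.
total_reduction = 1200 + 280 + 80
```

The three per-change reductions sum to 1,560 alerts per day, which equals the drop from 1,847 to 287, so the breakdown accounts for the whole improvement.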