Monitor SLO health and burn rates for para-obsidian operations
Monitors SLO health and burn rates for para-obsidian operations with dashboard, detail, and breach views.
Installation:
/plugin marketplace add nathanvale/side-quest-marketplace
/plugin install claude-code-docs@side-quest-marketplace
Arguments: slo-name|--breaches
Model: claude-haiku-4-5-20251001
Monitor Service Level Objectives for para-obsidian inbox processing operations.
SLO definitions come from src/inbox/shared/slos.ts. Usage:
/para-obsidian:slo # All SLOs dashboard
/para-obsidian:slo scan_latency # Specific SLO details
/para-obsidian:slo --breaches # Recent violations only
First, check if the SLO events file exists:
if [ -f ~/.claude/logs/slo-events.jsonl ]; then
wc -l ~/.claude/logs/slo-events.jsonl
else
echo "FILE_NOT_FOUND"
fi
If the output is FILE_NOT_FOUND, skip to the "Handle Missing File" section below.
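The queries that follow read the fields sloName, timestamp (epoch milliseconds), violated, value, and threshold from each line. As a quick sanity check of that shape, a sketch like the following counts events and violations per SLO on a small sample file. The temporary file and the event values are made up for illustration; real events live in ~/.claude/logs/slo-events.jsonl.

```shell
# Hypothetical sample events, written to a temp file so the real log is untouched.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{"sloName":"scan_latency","timestamp":1700000000000,"violated":false,"value":1200,"threshold":60000}
{"sloName":"scan_latency","timestamp":1700000060000,"violated":true,"value":75000,"threshold":60000}
{"sloName":"llm_latency","timestamp":1700000120000,"violated":false,"value":900,"threshold":10000}
EOF

# Slurp the lines into one array, group by SLO name, count totals and violations.
counts="$(jq -s -c '
  group_by(.sloName)
  | map({slo: .[0].sloName, total: length, violated: (map(select(.violated == true)) | length)})
' "$sample")"
echo "$counts"
rm -f "$sample"
```

If a field is missing or named differently in your log, the grouping and the violation counts in the later queries will be off in the same way, so this is a cheap first check.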
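Based on the argument $1:
(empty) → Dashboard mode (all SLOs)
slo-name → Detail mode (one specific SLO)
--breaches → Breaches mode (recent violations)
When $1 is empty, show all SLOs with health status (limited to the 5000 most recent events):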
tail -n 5000 ~/.claude/logs/slo-events.jsonl | jq -s '
# Define metadata mapping
{
scan_latency: {target: 0.95, threshold: 60000, unit: "ms", window: "30d", error_budget: 0.05},
execute_success: {target: 0.99, threshold: 99, unit: "percent", window: "7d", error_budget: 0.01},
llm_availability: {target: 0.8, threshold: 80, unit: "percent", window: "24h", error_budget: 0.2},
execute_latency: {target: 0.95, threshold: 30000, unit: "ms", window: "30d", error_budget: 0.05},
extraction_latency: {target: 0.95, threshold: 5000, unit: "ms", window: "7d", error_budget: 0.05},
enrichment_latency: {target: 0.95, threshold: 5000, unit: "ms", window: "7d", error_budget: 0.05},
llm_latency: {target: 0.9, threshold: 10000, unit: "ms", window: "24h", error_budget: 0.1}
} as $meta_map |
# Group by SLO name and calculate metrics
group_by(.sloName) |
map(
{
slo: .[0].sloName,
total: length,
violated: [.[] | select(.violated == true)] | length,
recent_24h: [.[] | select(.timestamp > (now - 86400) * 1000)] | length,
recent_violated_24h: [.[] | select(.violated == true and .timestamp > (now - 86400) * 1000)] | length,
last_violation: ([.[] | select(.violated == true)] | max_by(.timestamp) // null)
} |
. as $item |
($meta_map[.slo] // {target: 1, threshold: 0, unit: "unknown", window: "unknown", error_budget: 0.01}) as $meta |
((.total - .violated) / .total) as $compliance |
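# Fraction of the error budget still unspent (1 = untouched, 0 = exhausted, negative = overspent)
(1 - ((.violated / .total) / $meta.error_budget)) as $budget_remaining |
(if .total == 0 then 0 else ((.violated / .total) / $meta.error_budget) end) as $burn_rate |
# Breached when the target is missed, warning when under 30% of the budget is left, healthy otherwise
(if $compliance < $meta.target then "✗" elif $budget_remaining < 0.3 then "⚠️" else "✓" end) as $status |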
. + {
meta: $meta,
compliance: $compliance,
budget_remaining: $budget_remaining,
burn_rate: $burn_rate,
status: $status
}
)
' | jq -r '
def format_duration(ms):
if ms == null then "never"
else
((now * 1000 - ms) / 1000) |
if . < 60 then "\(. | floor)s ago"
elif . < 3600 then "\((. / 60) | floor)m ago"
elif . < 86400 then "\((. / 3600) | floor)h ago"
else "\((. / 86400) | floor)d ago"
end
end;
"=== SLO Health Dashboard ===",
"",
(.[] |
"\(.slo) (\((.meta.target * 100 | floor))% target, \(.meta.threshold)\(.meta.unit), \(.meta.window) window)",
" Status: \(.status) \(if .compliance >= .meta.target then "PASSING" else "BREACHED" end)",
" Compliance: \((.compliance * 100 * 100 | floor) / 100)% (target: \((.meta.target * 100 | floor))%)",
" Error Budget: \((.budget_remaining * 100 | floor))% remaining",
" Burn Rate: \((.burn_rate * 100 | floor) / 100)x \(if .burn_rate > 1 then "(fast burn, investigate!)" elif .burn_rate > 0.5 then "(moderate burn)" else "(slow burn, safe)" end)",
" Recent Events: \(.recent_24h) total, \(.recent_violated_24h) violated (\(if .recent_24h > 0 then ((.recent_violated_24h / .recent_24h * 100) | floor) else 0 end)%)",
" Last Breach: \(format_duration(.last_violation.timestamp))",
""
)
'
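The dashboard mixes two time units: jq's now is in epoch seconds, while event timestamps are in epoch milliseconds, which is why the 24-hour cutoff is written as (now - 86400) * 1000. A minimal sketch of that filter, using a fixed now_s value (an assumption, so the result is reproducible) and two made-up timestamps:

```shell
# Fixed "now" in epoch seconds; the dashboard query uses the jq builtin `now` instead.
now_s=1700000000

recent="$(jq -n -c --argjson now_s "$now_s" '
  [ {timestamp: 1699999000000},   # about 17 minutes before now_s: inside the window
    {timestamp: 1699900000000} ]  # about 28 hours before now_s: outside the window
  | map(select(.timestamp > ($now_s - 86400) * 1000))
  | length
')"
echo "$recent"
```

Only the first event passes the filter. Comparing a millisecond timestamp against a cutoff left in seconds would let every event through, so the multiplication by 1000 is the part worth double-checking when adapting the query.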
When $1 is an SLO name (e.g., scan_latency):
# Validate SLO name
case "$1" in
scan_latency|execute_success|llm_availability|execute_latency|extraction_latency|enrichment_latency|llm_latency)
# Valid SLO
;;
*)
echo "❌ Unknown SLO: $1"
echo ""
echo "Available SLOs:"
echo " - scan_latency"
echo " - execute_success"
echo " - llm_availability"
echo " - execute_latency"
echo " - extraction_latency"
echo " - enrichment_latency"
echo " - llm_latency"
echo ""
echo "Usage: /para-obsidian:slo [slo-name]"
exit 1
;;
esac
# Filter events for specific SLO (limit to 5000 most recent)
tail -n 5000 ~/.claude/logs/slo-events.jsonl | jq -s --arg slo "$1" '
[.[] | select(.sloName == $slo)] as $events |
# Define SLO metadata
(if $slo == "scan_latency" then
{name: "Scan Latency", target: 0.95, threshold: 60000, unit: "ms", window: "30d", error_budget: 0.05, window_ms: 2592000000}
elif $slo == "execute_success" then
{name: "Execute Success Rate", target: 0.99, threshold: 99, unit: "percent", window: "7d", error_budget: 0.01, window_ms: 604800000}
elif $slo == "llm_availability" then
{name: "LLM Availability", target: 0.8, threshold: 80, unit: "percent", window: "24h", error_budget: 0.2, window_ms: 86400000}
elif $slo == "execute_latency" then
{name: "Execute Latency", target: 0.95, threshold: 30000, unit: "ms", window: "30d", error_budget: 0.05, window_ms: 2592000000}
elif $slo == "extraction_latency" then
{name: "Extraction Latency", target: 0.95, threshold: 5000, unit: "ms", window: "7d", error_budget: 0.05, window_ms: 604800000}
elif $slo == "enrichment_latency" then
{name: "Enrichment Latency", target: 0.95, threshold: 5000, unit: "ms", window: "7d", error_budget: 0.05, window_ms: 604800000}
elif $slo == "llm_latency" then
{name: "LLM Latency", target: 0.9, threshold: 10000, unit: "ms", window: "24h", error_budget: 0.1, window_ms: 86400000}
else
null
end) as $meta |
# Filter to window
((now * 1000) - $meta.window_ms) as $cutoff |
[$events[] | select(.timestamp >= $cutoff)] as $recent |
# Calculate metrics
($recent | length) as $total |
([$recent[] | select(.violated == true)] | length) as $violations |
(if $total > 0 then ($total - $violations) / $total else 1 end) as $compliance |
(if $total > 0 then ($violations / $total) / $meta.error_budget else 0 end) as $burn_rate |
($meta.error_budget - (if $total > 0 then $violations / $total else 0 end)) as $budget_remaining |
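# Days left = remaining fraction of the budget * window length in days / burn rate
(if $budget_remaining <= 0 then 0 elif $burn_rate > 0 then (($budget_remaining / $meta.error_budget) * ($meta.window_ms / 86400000) / $burn_rate) else 999 end) as $days_until_exhausted |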
# Get the 10 most recent violations for display
([$recent[] | select(.violated == true)] | sort_by(.timestamp) | reverse | .[0:10]) as $recent_violations |
# Calculate percentiles (index floor(count * p) into the sorted values)
([$recent[] | .value] | sort) as $sorted_values |
($sorted_values | length) as $count |
{
p50: (if $count > 0 then $sorted_values[($count * 0.5) | floor] else 0 end),
p90: (if $count > 0 then $sorted_values[($count * 0.9) | floor] else 0 end),
p95: (if $count > 0 then $sorted_values[($count * 0.95) | floor] else 0 end),
p99: (if $count > 0 then $sorted_values[($count * 0.99) | floor] else 0 end)
} as $percentiles |
{
meta: $meta,
total: $total,
violations: $violations,
compliance: $compliance,
burn_rate: $burn_rate,
budget_remaining: $budget_remaining,
days_until_exhausted: $days_until_exhausted,
recent_violations: $recent_violations,
percentiles: $percentiles
}
' | jq -r '
def format_date(ms):
(ms / 1000) | strftime("%Y-%m-%d %H:%M");
"=== SLO: \(.meta.name) ===\n",
"Target: \((.meta.target * 100) | floor)% under \(.meta.threshold)\(.meta.unit) (\(.meta.window) window)",
"Error Budget: \((.meta.error_budget * 100) | floor)%\n",
"Current Status:",
" Compliance: \((.compliance * 100 * 100 | floor) / 100)% (\(.total - .violations)/\(.total) within SLO)",
" Violations: \(.violations) events exceeded threshold",
" Burn Rate: \((.burn_rate * 100 | floor) / 100)x (consuming budget at \(((.burn_rate * 100) | floor))% of allowed rate)\n",
"Error Budget Status:",
" Remaining: \(((.budget_remaining * 100) | floor))% (\(((.meta.error_budget - .budget_remaining) * 100) | floor)% of \((.meta.error_budget * 100) | floor)% budget used)",
" Consumption Rate: \((.burn_rate * 100 | floor) / 100)x sustainable rate",
" Days Until Exhausted: \(if .days_until_exhausted > 365 then "∞ (healthy)" else "\(.days_until_exhausted | floor) days (at current rate)" end)\n",
(if (.recent_violations | length) > 0 then
"Recent Violations (up to 10, \(.meta.window) window):",
(.meta.unit as $unit | .recent_violations[] |
" \(format_date(.timestamp)) → \(.value)\($unit) (threshold: \(.threshold)\($unit), +\(((.value - .threshold) / .threshold * 100) | floor)%)"
),
""
else
"No violations in window\n"
end),
"Percentiles (\(.meta.window)):",
" p50: \(.percentiles.p50)\(.meta.unit)",
" p90: \(.percentiles.p90)\(.meta.unit)",
" p95: \(.percentiles.p95)\(.meta.unit)",
" p99: \(.percentiles.p99)\(.meta.unit)\n",
"Recommendations:",
(if .compliance >= .meta.target then
" ✓ SLO is healthy (within target)"
else
" ❌ SLO is breached (below target)"
end),
(if .burn_rate > 1 then
" ⚠️ Fast burn rate detected - investigate root causes"
else
" ✓ Burn rate is sustainable"
end),
(if (.budget_remaining / .meta.error_budget) < 0.3 then
" ⚠️ Error budget below 30% - consider freeze on risky changes"
else
" ✓ Error budget is healthy"
end),
(if (.percentiles.p95 > .meta.threshold) then
" ℹ️ Consider investigating slow operations >\(.meta.threshold)\(.meta.unit)"
else ""
end)
'
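The percentile step in the detail query sorts the values and picks the element at index floor(count * p). A small sketch of that rule on ten made-up latency values (illustrative only, not taken from real logs):

```shell
# Ten hypothetical latency samples in ms; one outlier dominates the tail.
percentiles="$(jq -n -c '
  [120, 80, 300, 95, 150, 110, 2000, 130, 105, 90] | sort as $v
  | ($v | length) as $n
  | {p50: $v[($n * 0.5) | floor], p90: $v[($n * 0.9) | floor]}
')"
echo "$percentiles"
```

With only ten samples, p90 already lands on the largest value, so percentiles from a sparse window say more about individual outliers than about the distribution. Treat p95 and p99 with caution when the event count is small.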
When $1 is --breaches, show only recent SLO violations:
# Get violations in last 7 days (7 * 24 * 60 * 60 * 1000 = 604800000 ms)
# Limit to 5000 most recent events to prevent OOM
tail -n 5000 ~/.claude/logs/slo-events.jsonl | jq -s '
# Calculate cutoff time
((now * 1000) - 604800000) as $cutoff |
# Filter to violations in last 7 days
[.[] | select(.violated == true and .timestamp > $cutoff)] |
sort_by(.timestamp) | reverse |
# Group by SLO name
group_by(.sloName) |
map({
slo: .[0].sloName,
violations: .
})
' | jq -r '
def format_date(ms):
(ms / 1000) | strftime("%Y-%m-%d %H:%M");
"=== Recent SLO Breaches (7d) ===\n",
(if length == 0 then
"No SLO breaches in the last 7 days ✓"
else
(.[] |
"\n\(.slo):",
(.violations[] |
" \(format_date(.timestamp)) → \(.value) (threshold: \(.threshold), +\(((.value - .threshold) / .threshold * 100) | floor)%)"
)
),
"\n",
"Total Breaches: \([.[] | .violations | length] | add) across \(length) SLOs",
"Most Breached: \(sort_by(.violations | length) | reverse | .[0].slo) (\(sort_by(.violations | length) | reverse | .[0].violations | length) violations)"
end)
'
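Two calculations in the breaches report are easy to verify in isolation: the overage percentage, (value - threshold) / threshold * 100 rounded down, and the most-breached SLO, the group with the most violations. A minimal sketch on three made-up violation events (values are illustrative; max_by(length) is a shorter equivalent of the sort-and-reverse used in the report):

```shell
summary="$(jq -n -c '
  [ {sloName: "scan_latency", value: 75000, threshold: 60000},
    {sloName: "llm_latency",  value: 12000, threshold: 10000},
    {sloName: "llm_latency",  value: 15000, threshold: 10000} ]
  | {
      # How far each value exceeds its threshold, in whole percent
      overage_pct: map(((.value - .threshold) / .threshold * 100) | floor),
      # Name of the SLO whose group holds the most violations
      most_breached: (group_by(.sloName) | max_by(length) | .[0].sloName)
    }
')"
echo "$summary"
```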
If the SLO events file doesn't exist, display:
⚠️ No SLO events found at ~/.claude/logs/slo-events.jsonl
This means SLO recording has never run. Try:
1. Run: para scan
2. Run: para execute
3. Run: /para-obsidian:slo again
SLO events are recorded during scan and execute operations.
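Burn rate indicates how fast the error budget is being consumed. The dashboard labels it as follows:
- At or below 0.5x: slow burn, safe
- Above 0.5x, up to 1x: moderate burn
- Above 1x: fast burn, investigate
Formula: burnRate = (violations / totalEvents) / errorBudget
Example (illustrative numbers): 2 violations in 100 events against a 5% error budget gives (2 / 100) / 0.05 = 0.4x.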
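As a quick check of the formula, the following sketch computes the burn rate with jq for made-up numbers: 2 violations out of 100 events against a 5% error budget. The inputs are illustrative, not taken from real logs, and the result is rounded to two decimals because the raw floating-point quotient can come out a hair under the exact value.

```shell
burn_rate="$(jq -n '
  {violations: 2, total: 100, error_budget: 0.05}
  | (.violations / .total) / .error_budget   # share of events violated, relative to the budget
  | (. * 100 | round) / 100                  # round to two decimals
')"
echo "${burn_rate}x"
```

A result below 1 means violations are arriving more slowly than the budget allows. At exactly 1 the budget would be used up right at the end of the window.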
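Memory Safety:
- Run tail -n 5000 before jq -s to prevent OOM on large logs
- Prefer jq '.' over jq -s '.' where possible: without -s, jq processes one event at a time instead of loading all input into a single array
File Location:
- ~/.claude/logs/slo-events.jsonl
- tail reads from the end of the file, so the 5000 lines kept are the most recent events
Now execute the SLO analysis based on the argument $1.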