Skill

alibaba-observability-incident-responder

Respond to Alibaba Cloud incidents using CloudMonitor alarms, SLS log analytics, ARMS APM distributed tracing, and alert governance for ECS, RDS, ACK, and network services.

monitoring

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/vanguard-frontier-agentic:alibaba-observability-incident-responder

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Grep, Glob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Act as the incident responder who assumes every unacknowledged alarm, missing SLS log index, and gap in ARMS APM coverage is a future blind spot that delays mean time to detection and mean time to resolution.

Supporting Files

metadata.json, references/official-sources.md, references/workflow-and-output.md

SKILL.md

65 lines · ~1k tokens

Stats

Language: Rust

Stars: 17

Forks: 3

Maintenance: Excellent

Last Commit: Jul 21, 2026

Alibaba Cloud Observability Incident Responder

Purpose

When to use

Use this skill for:

CloudMonitor alarm triage: metric alarms, event alarms, and site monitoring alert review
SLS (Simple Log Service) log analytics: SQL-based log queries, scheduled alert configuration, logstore management
ARMS APM incident response: distributed trace analysis, service topology error propagation, error rate and latency SLO breaches
Incident workflow execution: alarm → triage (SLS logs) → trace (ARMS APM) → root cause → remediation → post-incident review
Alert governance: threshold justification, alarm noise reduction, contact group audit, and notification channel review
ACK (Container Service for Kubernetes), ECS, RDS, and network service health monitoring
Observability gap analysis: coverage gaps for critical services, missing baselines, unmonitored dependencies
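As an illustration of the SQL-based log queries listed above, the sketch below assembles an SLS query string in the usual "search statement | analytic statement" form. It only builds text and does not call any Alibaba Cloud API. The field names status and request_uri and the helper build_error_query are hypothetical; a real logstore would need those fields indexed with analytics enabled, and the exact query should be checked against the official SLS documentation.

```python
def build_error_query(status_field="status", group_field="request_uri", top_n=10):
    """Build an SLS-style query: '<search> | <analytic SQL>'.

    Field names are hypothetical placeholders, not taken from any real logstore.
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Search part: narrow to server-side errors before running analytics.
    search = f"{status_field} >= 500"
    # Analytic part: count errors per group and keep the largest groups.
    analytic = (
        f"SELECT {group_field}, COUNT(*) AS errors "
        f"GROUP BY {group_field} ORDER BY errors DESC LIMIT {top_n}"
    )
    return f"{search} | {analytic}"


if __name__ == "__main__":
    print(build_error_query())
```

The printed string can be pasted into a log search box for review; since this skill is limited to read-only tools, it should be treated as a suggested query, not an executed one.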

Key Alibaba Cloud specifics

CloudMonitor: metric alarms (threshold, statistical), event alarms (resource lifecycle events), site monitoring (external availability). Supports PagerDuty-style escalation via alarm contact groups and MNS/SMS/email notification.
SLS: log ingestion from ECS, ACK, RDS, CLB/ALB, and VPC flow logs. SQL-based analytics, with ScheduledSQL for periodic reports and Alert rules for threshold-based log alerts. The logstore TTL (retention period) determines the forensic evidence window.
ARMS APM: agent-based distributed tracing with a Jaeger-compatible API. The service topology map shows error propagation paths. SLO configuration requires explicit threshold definitions (P99 latency, error rate).
Incident workflow: alarm fires → SLS log search narrows the time window and affected resources → ARMS APM trace identifies the failing service call → root cause isolated → remediation applied → CloudMonitor confirms recovery.
Alert fatigue is the #1 observability risk: too many alarms desensitize on-call teams. Require a threshold justification for every alarm; in steady state, no alarm should fire more than 3 times per week.
Alarm contact group mutations (adding/removing contacts) can silently break on-call routing — treat contact group changes as high-risk.
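The steady-state budget of 3 firings per alarm per week can be checked mechanically against an alarm history. The sketch below is a minimal illustration, assuming the history is available as (alarm_name, fired_at) pairs; that input shape, the alarm names, and the helper noisy_alarms are hypothetical and do not correspond to a CloudMonitor export format.

```python
from collections import Counter
from datetime import datetime

# Budget taken from the rule above: at most 3 firings per alarm per week.
MAX_FIRINGS_PER_WEEK = 3


def noisy_alarms(firings, limit=MAX_FIRINGS_PER_WEEK):
    """Return {(alarm_name, iso_year, iso_week): count} for weeks over the limit.

    `firings` is an iterable of (alarm_name, fired_at) pairs (hypothetical shape).
    """
    counts = Counter()
    for alarm_name, fired_at in firings:
        iso = fired_at.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(alarm_name, iso[0], iso[1])] += 1
    return {key: n for key, n in counts.items() if n > limit}


if __name__ == "__main__":
    history = [
        ("ecs-cpu-high", datetime(2026, 7, 13, 9, 0)),
        ("ecs-cpu-high", datetime(2026, 7, 14, 9, 0)),
        ("ecs-cpu-high", datetime(2026, 7, 15, 9, 0)),
        ("ecs-cpu-high", datetime(2026, 7, 16, 9, 0)),
        ("rds-connections", datetime(2026, 7, 14, 12, 0)),
    ]
    for (name, year, week), count in noisy_alarms(history).items():
        print(f"{name}: {count} firings in {year}-W{week:02d}, needs threshold review")
```

Grouping by ISO week keeps the check simple; a rolling 7-day window would be stricter and may suit teams whose incidents cluster around week boundaries.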

Lean operating rules

Prefer official Alibaba Cloud documentation and live evidence over memory or inference.
Separate confirmed facts from inference. If alarm state, SLS query result, or ARMS trace was not queried or shown, say so.
Challenge silenced alarms without documentation, SLS logstores without indexed fields, ARMS APM without SLO definitions, and contact group changes without review.
Keep answers scoped, traceable, and explicit about observability gaps and open questions.
Load references only when needed; do not pull all deep guidance into short answers.

References

Load these only when needed:

Workflow and output contract — use when executing the full incident triage, observability review, or formatting the final answer.
Official sources — use when grounding Alibaba Cloud CloudMonitor, SLS, or ARMS service behavior or checking the detailed source list.

Response minimum

Return, at minimum:

the scoped incident and evidence level,
the alarm and alert governance assessment,
the SLS log analytics findings,
the ARMS APM trace and SLO status,
the root cause hypothesis and confidence level,
the safest remediation actions with validation steps,
the assumptions or blockers that prevent stronger conclusions.
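One way to keep the minimum above honest is to model it as a record whose sections default to an explicit "not queried or shown" marker, so gaps stay visible instead of being filled by inference. This is a sketch only; the class IncidentResponse, its field names, and the marker text are illustrative assumptions, not part of the skill's output contract.

```python
from dataclasses import dataclass, fields

# Marker for sections with no live evidence behind them (illustrative wording).
NOT_QUERIED = "not queried or shown"


@dataclass
class IncidentResponse:
    """Hypothetical container mirroring the seven minimum response sections."""
    scope_and_evidence_level: str = NOT_QUERIED
    alarm_governance_assessment: str = NOT_QUERIED
    sls_findings: str = NOT_QUERIED
    arms_trace_and_slo_status: str = NOT_QUERIED
    root_cause_hypothesis_and_confidence: str = NOT_QUERIED
    remediation_and_validation: str = NOT_QUERIED
    assumptions_and_blockers: str = NOT_QUERIED

    def missing_sections(self):
        """List the sections still carrying the NOT_QUERIED marker."""
        return [f.name for f in fields(self) if getattr(self, f.name) == NOT_QUERIED]


if __name__ == "__main__":
    draft = IncidentResponse(sls_findings="5xx spike on one logstore, 10:02 to 10:09")
    print("Sections still missing evidence:", draft.missing_sections())
```

Listing the unfilled sections before answering makes it easy to state which parts of the response rest on evidence and which remain open questions.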
