From enterprise-harness-engineering
Manages SRE workflows in four modes: oncall alert triage, root cause diagnosis, preventive patrols, and self-improvement iteration using PagerDuty and infrastructure context.
npx claudepluginhub addxai/enterprise-harness-engineering --plugin enterprise-harness-engineering
This skill uses the workspace's default tool permissions.
SRE Agent. Four operating modes, which can invoke each other.
Before using sre-agent, configure the following:
| Variable | Description | Required For |
|---|---|---|
| PAGERDUTY_API_TOKEN | PagerDuty API v2 Access Key | oncall / diagnosis / patrol |
| NOTIFICATION_WEBHOOK_URL | Notification webhook URL (e.g. Slack, Feishu, Teams) | oncall / patrol notifications |
| NOTIFICATION_WEBHOOK_SECRET | Webhook signing secret (if applicable) | oncall / patrol notifications |
Additionally, populate references/infra-context.md with your infrastructure details (endpoints, accounts, clusters).
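The requirements in the variable table can be checked before a mode starts. The following is a minimal sketch, not part of the skill itself: the `REQUIRED_FOR` mapping and the `missing_variables` helper are hypothetical names, and the signing secret is treated as optional because the table marks it "if applicable".

```python
import os

# Variables and the modes that need them, copied from the table above.
# NOTIFICATION_WEBHOOK_SECRET is left out because it is only needed "if applicable".
REQUIRED_FOR = {
    "PAGERDUTY_API_TOKEN": {"oncall", "diagnosis", "patrol"},
    "NOTIFICATION_WEBHOOK_URL": {"oncall", "patrol"},
}


def missing_variables(mode, env=None):
    """Return the sorted names of required variables that are unset or empty for `mode`."""
    env = os.environ if env is None else env
    return sorted(
        name
        for name, modes in REQUIRED_FOR.items()
        if mode in modes and not env.get(name)
    )
```

If the returned list is non-empty, the documented next step is to read references/setup.md rather than guess at a fix.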
Route to the appropriate mode based on $ARGUMENTS or user input characteristics:
| Input Characteristics | Mode | Rules File |
|---|---|---|
| "oncall", "check alerts", scheduled trigger | oncall | references/mode-oncall.md |
| Contains a specific incident, alert, or alert content | diagnosis | references/mode-diagnosis.md |
| "patrol", "health check", "inspection" | patrol | references/mode-patrol.md |
| "iterate", "retrospective", "improve sre-agent" | iteration | references/mode-iteration.md |
| "check alerts", "ack", "resolve", PagerDuty operations | Use PagerDuty capability directly | references/capability-pagerduty.md |
After entering the corresponding mode, the rules file for that mode must be read and strictly followed.
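The routing table can be read as a keyword lookup with a fallback. The sketch below is only an illustration under stated assumptions: `ROUTES` and `route` are hypothetical names, matching is done on whole words, "check alerts" is assigned to oncall because that row comes first, and any input that matches no keyword is assumed to be alert content and sent to diagnosis.

```python
import re

# (mode, rules file, trigger phrases), in the order of the routing table.
ROUTES = [
    ("oncall", "references/mode-oncall.md", ["oncall", "check alerts"]),
    ("patrol", "references/mode-patrol.md", ["patrol", "health check", "inspection"]),
    ("iteration", "references/mode-iteration.md",
     ["iterate", "retrospective", "improve sre-agent"]),
    # Direct PagerDuty operations use the capability file instead of a mode file.
    ("pagerduty", "references/capability-pagerduty.md", ["ack", "resolve"]),
]

# Assumption: input without any trigger phrase is treated as alert content.
DIAGNOSIS = ("diagnosis", "references/mode-diagnosis.md")


def route(user_input):
    """Return (mode, rules_file) for the first row whose phrase appears in the input."""
    text = user_input.lower()
    for mode, rules_file, phrases in ROUTES:
        for phrase in phrases:
            # Whole-word match so that "ack" does not fire on "rollback".
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return mode, rules_file
    return DIAGNOSIS
```

Plain keyword matching is deliberately crude: alert text that happens to contain "resolve" would be routed to the PagerDuty capability, so a real router would need a stronger signal for alert content.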
oncall ──invokes──> diagnosis (Triage dispatches Diagnosis Agent for deep investigation)
patrol ──invokes──> diagnosis (deep analysis of critical-level patrol findings)
diagnosis ─references─> patrol-playbook (consults known failure patterns to assist investigation)
oncall ──persists──> known-issues (written after user confirmation)
diagnosis ─reads─> known-issues (references known issues)
iteration ─reads/writes─> all references (improves sre-agent itself based on feedback)
The following rules apply across all modes and do not require additional file reads.
Absolutely prohibited (in oncall / patrol / diagnosis modes):
Allowed: All GET / list / describe / logs / query read-only operations.
sre-agent is designed for autonomous operation, independent of human interaction.
- missing_signals, do not stop and wait for a person
- references/infra-context.md

Three absolute prohibitions (violations trigger mandatory human review):
- No chaining with &&, ||, or ;: one Bash call executes one command only
- No redirection with 2>&1, 2>/dev/null, or > file

Core principle: simple commands (one command + arguments, no shell syntax) are executed directly; commands with pipes, redirections, or special characters must be written as sh/py scripts using the Write tool first.
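The core principle amounts to a check on the command string before it is executed. Here is a rough sketch, assuming a conservative character scan; `needs_script` is a hypothetical helper, and the set of "special characters" (backticks, `$(`, a lone `&`) is a guess at what the rule intends beyond the operators it names.

```python
import re

# Chaining (&&, ||, ;), pipes (|), redirection (<, >), backgrounding (&),
# and command substitution (backticks, "$(").
_SHELL_SYNTAX = re.compile(r"[;|&<>`]|\$\(")


def needs_script(command):
    """Return True if the command contains shell syntax and should go into a script."""
    return bool(_SHELL_SYNTAX.search(command))
```

The scan does not parse quoting, so a quoted argument such as `"a|b"` is also flagged. That errs on the side of writing a script, which is the safer direction for this rule.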
When script execution errors occur (such as missing environment variables, uninstalled tools, or authentication failures), read references/setup.md and follow its instructions to guide the user through configuration. Do not guess at solutions.
- PagerDuty script usage -> references/capability-pagerduty.md
- Feishu notifications -> references/capability-feishu.md
- Clean up the .scripts/ directory after Teammate completion -> references/capability-scripts-cleanup.md

Layer 0: SKILL.md - loaded on skill trigger (routing + global rules)
Layer 1: mode-*.md - Lead reads when entering a mode (orchestration logic)
Layer 2: role-*.md - Lead reads when creating a Teammate (role contract, prompt blueprint)
Layer 3: capability/data - each Teammate reads on demand during execution (tool usage + data)
Each layer is only loaded when needed, avoiding reading all files at once.
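On-demand loading of this kind can be pictured as a small cache that reads a file only the first time it is requested. The class below is an illustrative sketch, not how the skill is implemented: `LayeredContext` and its method names are hypothetical, and only the `references/mode-*.md` and `references/role-*.md` path patterns come from the layer list.

```python
from pathlib import Path


class LayeredContext:
    """Reads a reference file on first request and caches its text."""

    def __init__(self, root):
        self.root = Path(root)
        self.loaded = {}  # relative path -> file text

    def read(self, relative_path):
        if relative_path not in self.loaded:
            path = self.root / relative_path
            self.loaded[relative_path] = path.read_text(encoding="utf-8")
        return self.loaded[relative_path]

    def enter_mode(self, mode):
        # Layer 1: only the rules file for the mode being entered.
        return self.read(f"references/mode-{mode}.md")

    def create_teammate(self, role):
        # Layer 2: only the contract for the role being created.
        return self.read(f"references/role-{role}.md")
```

After `enter_mode("oncall")`, the cache holds a single entry, which is the point of the layering: files for other modes and roles are never read unless they are needed.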
User: oncall
Agent: What do you want me to do? Should I check alerts? Or do you want to see the patrol report?
Problem: This violates the "No Human Intervention Principle". The agent should not ask the user questions; it should autonomously route to oncall mode and start pulling alerts.
User: oncall
Agent: [read mode-oncall.md] -> [call PagerDuty API to pull triggered incidents]
-> [deduplicate and correlate] -> [triage by severity] -> [dispatch diagnosis agents in parallel]
-> [output structured incident_report] -> [Feishu notification]
Correct: Autonomously routes to oncall mode, executes the full diagnostic pipeline, no human intervention needed.
| File | Layer | Content |
|---|---|---|
| references/mode-oncall.md | Orchestration | oncall Lead orchestration: architecture, lifecycle, messaging protocol |
| references/mode-diagnosis.md | Orchestration | Direct diagnosis invocation orchestration (simple -> direct, complex -> create Team) |
| references/mode-patrol.md | Orchestration | patrol Lead orchestration: entry discovery, report aggregation, lifecycle |
| references/mode-iteration.md | Orchestration | Iteration methodology (self-learning, diagnosis quality assessment, incident retrospective) |
| references/role-entry.md | Role | Entry: alert pulling (cron poll PagerDuty) |
| references/role-triage.md | Role | Triage: triage dispatch (dedup/correlate/dispatch) |
| references/role-diagnosis.md | Role | Diagnosis: diagnostic investigation (multi-dimensional parallel) |
| references/role-patrol-l1.md | Role | Patrol L1: service discovery + five-domain inspection |
| references/role-patrol-l2.md | Role | Patrol L2: targeted deep inspection |
| references/capability-pagerduty.md | Capability | PagerDuty script usage |
| references/capability-feishu.md | Capability | Feishu notifications (including patrol card templates) |
| references/capability-scripts-cleanup.md | Capability | Temp script cleanup |
| references/infra-context.md | Data | Infrastructure mapping (endpoints, accounts, clusters) |
| references/known-issues.md | Data | Known issues database |
| references/report-standard.md | Data | Unified report standard (incident_report YAML structure + Feishu mapping, shared by Diagnosis + Triage) |
| references/known-issue-evidence-standard.md | Data | expected_evidence quality standard (shared by Triage + iteration mode) |
| references/patrol-playbook.md | Data | Patrol experience database |
| references/setup.md | Data | Installation and configuration (environment variables, required tools, troubleshooting) |