Create and maintain operational documentation including runbooks, deployment guides, infrastructure documentation, and incident response procedures.
/plugin marketplace add marcel-Ngan/ai-dev-team
/plugin install marcel-ngan-ai-dev-team@marcel-Ngan/ai-dev-team
This skill inherits all available tools. When active, it can use any tool Claude has access to.
| Tool | Purpose |
|---|---|
| `Atlassian:createConfluencePage` | Create new documentation |
| `Atlassian:updateConfluencePage` | Update procedures |
| `Atlassian:getConfluencePage` | Read documentation |
| `Atlassian:searchConfluenceUsingCql` | Find runbooks |
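To illustrate the "Find runbooks" row, here is a minimal sketch of how a CQL query string for `Atlassian:searchConfluenceUsingCql` might be assembled. The helper name `buildRunbookCql` and the space key `OPS` are hypothetical, and the sketch assumes runbooks follow the "Runbook: <service>" title convention used in this skill. Only the query string is built; how it is passed to the tool is not shown.

```javascript
// Hypothetical helper: builds a CQL query that looks for runbook pages in one space.
function buildRunbookCql(spaceKey, serviceName) {
  // Escape backslashes and double quotes so input cannot break out of the CQL string literal.
  const escapeCql = (s) => String(s).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  const title = "Runbook: " + serviceName;
  const clauses = [
    'space = "' + escapeCql(spaceKey) + '"',
    "type = page",
    'title ~ "' + escapeCql(title) + '"',
  ];
  return clauses.join(" AND ");
}

// "OPS" is a made-up space key for illustration.
console.log(buildRunbookCql("OPS", "auth-service"));
// prints: space = "OPS" AND type = page AND title ~ "Runbook: auth-service"
```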
{
  "cloudId": "{{confluence.cloudId}}",
  "spaceId": "{{confluence.spaceId}}",
  "spaceKey": "{{confluence.spaceKey}}",
  "parentPages": {
    "operations": "{{confluence.parentPages.operations}}"
  }
}
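The `{{confluence.*}}` placeholders above need to be filled in from project configuration before a tool call is made. The sketch below shows one way that substitution could work. The function `resolvePlaceholders` and the sample values are assumptions for illustration and are not part of the Atlassian tools.

```javascript
// Hypothetical resolver for {{dotted.path}} placeholders.
function resolvePlaceholders(template, config) {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, path) => {
    // Walk the dotted path one key at a time; yield undefined if any segment is missing.
    const value = path.split(".").reduce(
      (node, key) =>
        node != null &&
        typeof node === "object" &&
        Object.prototype.hasOwnProperty.call(node, key)
          ? node[key]
          : undefined,
      config
    );
    // Leave unknown placeholders untouched so gaps stay visible in the output.
    return value === undefined ? match : String(value);
  });
}

// Sample values are made up for illustration.
const config = {
  confluence: {
    cloudId: "example-cloud-id",
    spaceKey: "OPS",
    parentPages: { operations: "12345" },
  },
};

console.log(resolvePlaceholders("parentId: {{confluence.parentPages.operations}}", config));
// prints: parentId: 12345
console.log(resolvePlaceholders("service: {{serviceName}}", config));
// prints: service: {{serviceName}}
```

Leaving unresolved placeholders in place, rather than replacing them with an empty string, makes a missing configuration value easy to spot in the published page.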
# Runbook: {{serviceName}}
**Last Updated:** {{date}}
**Owner:** DevOps Engineer
**On-Call:** [Link to on-call schedule]
---
## Service Overview
**Service:** {{serviceName}}
**Purpose:** [What this service does]
**Criticality:** Critical | High | Medium | Low
### Dependencies
| Service | Type | Impact if Down |
|---------|------|----------------|
| [Service] | Upstream/Downstream | [Impact] |
### Endpoints
| Environment | URL | Health Check |
|-------------|-----|--------------|
| Production | | /health |
| Staging | | /health |
---
## Monitoring & Alerts
### Dashboards
- [Link to main dashboard]
- [Link to metrics dashboard]
### Key Metrics
| Metric | Normal Range | Alert Threshold |
|--------|--------------|-----------------|
| Response time | <200ms | >500ms |
| Error rate | <1% | >5% |
| CPU | <70% | >85% |
| Memory | <80% | >90% |
### Alert Runbooks
| Alert | Severity | Runbook Section |
|-------|----------|-----------------|
| High Error Rate | P1 | #high-error-rate |
| High Latency | P2 | #high-latency |
| Pod Crash Loop | P1 | #pod-crash |
---
## Common Procedures
### Restart Service
```bash
# Kubernetes
kubectl rollout restart deployment/{{service}} -n {{namespace}}
# Verify
kubectl get pods -n {{namespace}} -l app={{service}}
```
### Scale Service
```bash
# Scale up
kubectl scale deployment/{{service}} --replicas=5 -n {{namespace}}
# Scale down
kubectl scale deployment/{{service}} --replicas=2 -n {{namespace}}
```
### View Logs
```bash
# Recent logs
kubectl logs -l app={{service}} -n {{namespace}} --tail=100
# Stream logs
kubectl logs -l app={{service}} -n {{namespace}} -f
```
---
## Troubleshooting
### High Error Rate
**Symptoms:**
**Diagnosis:**
```bash
kubectl logs -l app={{service}} -n {{namespace}} | grep ERROR
```
**Resolution:**
### High Latency
**Symptoms:**
**Diagnosis:**
**Resolution:**
### Pod Crash
**Symptoms:**
**Diagnosis:**
```bash
kubectl describe pod {{pod-name}} -n {{namespace}}
kubectl logs {{pod-name}} -n {{namespace}} --previous
```
**Resolution:**
---
## Deployment
### Deploy New Version
```bash
# Update image
kubectl set image deployment/{{service}} {{service}}={{image}}:{{tag}} -n {{namespace}}
# Monitor rollout
kubectl rollout status deployment/{{service}} -n {{namespace}}
```
### Rollback
```bash
# Rollback to previous
kubectl rollout undo deployment/{{service}} -n {{namespace}}
# Rollback to specific revision
kubectl rollout undo deployment/{{service}} --to-revision={{revision}} -n {{namespace}}
```
---
## Escalation
| Role | Contact | When to Escalate |
|---|---|---|
| On-Call | [Rotation] | First responder |
| Service Owner | [Name] | Design decisions |
| Infra Team | [Channel] | Infrastructure issues |
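The Key Metrics table in the runbook template pairs each metric with a normal range and an alert threshold. As a rough sketch of how those two bounds could be applied to a reading, the code below classifies a value as normal, elevated, or alert. The metric key names and the "elevated" band between the two bounds are assumptions, since the table only defines the two ends.

```javascript
// Thresholds copied from the Key Metrics table; the key names are hypothetical.
const thresholds = {
  responseTimeMs: { normalBelow: 200, alertAbove: 500 },
  errorRatePct: { normalBelow: 1, alertAbove: 5 },
  cpuPct: { normalBelow: 70, alertAbove: 85 },
  memoryPct: { normalBelow: 80, alertAbove: 90 },
};

// Returns "normal", "elevated" (between the two bounds), or "alert".
function classifyMetric(name, value) {
  const t = thresholds[name];
  if (!t) throw new Error("Unknown metric: " + name);
  if (value > t.alertAbove) return "alert";
  if (value < t.normalBelow) return "normal";
  return "elevated";
}

console.log(classifyMetric("responseTimeMs", 120)); // prints: normal
console.log(classifyMetric("errorRatePct", 3));     // prints: elevated
console.log(classifyMetric("cpuPct", 92));          // prints: alert
```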
### Create Runbook Example
```javascript
Atlassian:createConfluencePage({
  cloudId: "{{confluence.cloudId}}",
  spaceId: "{{confluence.spaceId}}",
  parentId: "{{confluence.parentPages.operations}}",
  title: "Runbook: Authentication Service",
  body: `# Runbook: Authentication Service
**Last Updated:** 2025-01-06
**Owner:** DevOps Engineer Agent
**Criticality:** Critical
---
## Service Overview
**Service:** auth-service
**Purpose:** Handles user authentication, session management, and JWT token issuance
### Dependencies
| Service | Type | Impact if Down |
|---------|------|----------------|
| PostgreSQL | Downstream | Auth fails |
| Redis | Downstream | Sessions lost |
| API Gateway | Upstream | No traffic |
### Endpoints
| Environment | URL | Health Check |
|-------------|-----|--------------|
| Production | auth.example.com | /health |
| Staging | auth-staging.example.com | /health |
---
## Monitoring
### Key Metrics
| Metric | Normal | Alert |
|--------|--------|-------|
| Login latency | <100ms | >300ms |
| Failed logins | <5/min | >20/min |
| Token issuance | <50ms | >200ms |
---
## Common Procedures
### Restart Service
\`\`\`bash
kubectl rollout restart deployment/auth-service -n production
\`\`\`
### View Logs
\`\`\`bash
kubectl logs -l app=auth-service -n production --tail=100
\`\`\`
---
## Troubleshooting
### High Failed Login Rate
**Diagnosis:**
1. Check for a brute-force attack (repeated failures from the same IP)
2. Check for credential stuffing (failures spread across many IPs)
3. Check whether the failures are legitimate (users who need a password reset)
**Resolution:**
1. Enable rate limiting
2. Block suspicious IPs
3. Notify the security team if it is an attack
`,
  contentFormat: "markdown"
})
```
# Deployment Guide: {{serviceName}}
**Version:** {{version}}
**Last Updated:** {{date}}
---
## Prerequisites
- [ ] Access to {{environment}} cluster
- [ ] Required credentials/secrets
- [ ] Approval from [approver]
## Pre-Deployment Checklist
- [ ] All tests passing
- [ ] Code review approved
- [ ] Change request approved
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
## Deployment Steps
### 1. Prepare
```bash
# Verify current state
kubectl get deployment {{service}} -n {{namespace}}
```
### 2. Deploy
```bash
# Apply changes
kubectl apply -f deployment.yaml -n {{namespace}}
# Check rollout status
kubectl rollout status deployment/{{service}} -n {{namespace}}
```
### 3. Verify
```bash
# Verify pods healthy
kubectl get pods -l app={{service}} -n {{namespace}}
# Check logs for errors
kubectl logs -l app={{service}} -n {{namespace}} --tail=50
```
## Rollback
If issues are detected:
```bash
kubectl rollout undo deployment/{{service}} -n {{namespace}}
```
---
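The Pre-Deployment Checklist in the deployment guide only helps if every box is ticked before the rollout starts. A small gate like the one sketched here could report which items are still open. The function name `uncheckedItems` and the sample checklist state are assumptions for illustration.

```javascript
// Returns the text of every unchecked "- [ ]" item in a markdown checklist.
function uncheckedItems(markdown) {
  return markdown
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*[-*]\s+\[( |x|X)\]\s+(.*)$/))
    .filter((m) => m !== null && m[1] === " ")
    .map((m) => m[2].trim());
}

// Sample state: two items from the checklist are still open.
const checklist = [
  "- [x] All tests passing",
  "- [x] Code review approved",
  "- [ ] Change request approved",
  "- [x] Rollback plan documented",
  "- [ ] Monitoring dashboards ready",
].join("\n");

const pending = uncheckedItems(checklist);
console.log(pending.length === 0 ? "Ready to deploy" : "Blocked by: " + pending.join("; "));
// prints: Blocked by: Change request approved; Monitoring dashboards ready
```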
## Used By Agents
| Agent | Document Types |
|-------|---------------|
| **DevOps Engineer** | Runbooks, Deployment Guides, Infra Docs |
| **Senior Developer** | Service-specific procedures |
| **Orchestrator** | Reference during incidents |
## Error Handling
| Error | Cause | Resolution |
|-------|-------|------------|
| 400 Bad Request | Malformed code block in the page body | Use properly fenced markdown code blocks |
| 404 Not Found | Parent page does not exist | Create the Operations parent page first |
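Since the table attributes 400 errors to code block formatting, one possible safeguard is to check the markdown body for unclosed fences before calling `Atlassian:createConfluencePage`. The following is a simplified sketch under that assumption. `hasBalancedFences` is a hypothetical helper, it only recognises triple-backtick fences, and it ignores the fence-length rules of full CommonMark.

```javascript
// Returns true when every triple-backtick fence opened in the markdown is also closed.
// Simplified: handles backtick fences with up to 3 spaces of indent, nothing else.
function hasBalancedFences(markdown) {
  let open = false;
  for (const line of markdown.split(/\r?\n/)) {
    const fence = line.match(/^ {0,3}```(.*)$/);
    if (!fence) continue;
    if (!open) {
      open = true; // opening fence, with or without a language tag
    } else if (fence[1].trim() === "") {
      open = false; // a bare fence closes the current block
    }
    // A fence line with a language tag inside an open block is treated as content.
  }
  return !open;
}

// Build the fence marker at runtime so this example contains no literal fence lines.
const FENCE = "`".repeat(3);
const good = ["### Restart Service", FENCE + "bash", "kubectl get pods", FENCE].join("\n");
const bad = ["### Restart Service", FENCE + "bash", "kubectl get pods"].join("\n");

console.log(hasBalancedFences(good)); // prints: true
console.log(hasBalancedFences(bad));  // prints: false
```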