# Skill: Confluence Operational Documentation

## Purpose

Create and maintain operational documentation, including runbooks, deployment guides, infrastructure documentation, and incident response procedures.

## When to Use

- DevOps Engineer creates a deployment runbook
- DevOps Engineer documents infrastructure
- DevOps Engineer creates incident response procedures
- Senior Developer documents operational procedures

## MCP Tools Used

| Tool | Purpose |
|------|---------|
| `Atlassian:createConfluencePage` | Create new documentation |
| `Atlassian:updateConfluencePage` | Update procedures |
| `Atlassian:getConfluencePage` | Read documentation |
| `Atlassian:searchConfluenceUsingCql` | Find runbooks |
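
Finding existing runbooks before creating a new one avoids duplicates. The sketch below only builds the CQL string that would be handed to `Atlassian:searchConfluenceUsingCql`; it does not call the tool, since the tool's parameter names are not stated here. The function name `buildRunbookCql`, the space key `OPS`, and the "Runbook" title convention are assumptions for illustration.

```javascript
// Build a CQL query that finds runbook pages in one space.
// Assumption: runbook titles contain the word "Runbook", as in the templates in this skill.
function buildRunbookCql(spaceKey, serviceName) {
  // CQL string literals are double-quoted; escape backslashes and quotes.
  const quote = (value) =>
    `"${String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;

  const clauses = [`space = ${quote(spaceKey)}`, "type = page", 'title ~ "Runbook"'];
  if (serviceName) {
    // Narrow the search to one service when a name is given.
    clauses.push(`title ~ ${quote(serviceName)}`);
  }
  return clauses.join(" AND ");
}

// "OPS" is a hypothetical space key.
console.log(buildRunbookCql("OPS", "auth-service"));
// prints: space = "OPS" AND type = page AND title ~ "Runbook" AND title ~ "auth-service"
```

Quoting every user-supplied value keeps a service name containing spaces or quotes from breaking the query.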

## Configuration

```json
{
  "cloudId": "{{confluence.cloudId}}",
  "spaceId": "{{confluence.spaceId}}",
  "spaceKey": "{{confluence.spaceKey}}",
  "parentPages": {
    "operations": "{{confluence.parentPages.operations}}"
  }
}
```
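
The `{{dotted.path}}` placeholders in this configuration, and in the templates that follow, have to be filled in before anything is sent to Confluence. How this skill resolves them is not specified here; the following is a minimal sketch of one way to do it. The function name `resolvePlaceholders` and all values in `exampleSettings` are hypothetical.

```javascript
// Replace {{dotted.path}} placeholders with values from a settings object.
// Placeholders with no matching value are left untouched and reported,
// so a half-filled configuration is not sent on silently.
function resolvePlaceholders(template, settings) {
  const missing = [];
  const text = template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, path) => {
    // Walk the settings object one key at a time.
    const value = path.split(".").reduce(
      (node, key) => (node != null && typeof node === "object" ? node[key] : undefined),
      settings
    );
    if (value === undefined || value === null || typeof value === "object") {
      missing.push(path);
      return match;
    }
    return String(value);
  });
  return { text, missing };
}

// Hypothetical settings; real IDs come from your Confluence site.
const exampleSettings = {
  confluence: {
    cloudId: "example-cloud-id",
    spaceId: "12345",
    spaceKey: "OPS",
    parentPages: { operations: "67890" },
  },
};

const resolved = resolvePlaceholders(
  '{"cloudId": "{{confluence.cloudId}}", "team": "{{confluence.team}}"}',
  exampleSettings
);
console.log(resolved.text);    // prints: {"cloudId": "example-cloud-id", "team": "{{confluence.team}}"}
console.log(resolved.missing); // lists the one unresolved path, confluence.team
```

Checking that `missing` is empty before creating or updating a page is a cheap guard against publishing a runbook that still contains raw placeholders.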

## Runbook

### Runbook Template

# Runbook: {{serviceName}}

**Last Updated:** {{date}}
**Owner:** DevOps Engineer
**On-Call:** [Link to on-call schedule]

---

## Service Overview

**Service:** {{serviceName}}
**Purpose:** [What this service does]
**Criticality:** Critical | High | Medium | Low

### Dependencies
| Service | Type | Impact if Down |
|---------|------|----------------|
| [Service] | Upstream/Downstream | [Impact] |

### Endpoints
| Environment | URL | Health Check |
|-------------|-----|--------------|
| Production | | /health |
| Staging | | /health |

---

## Monitoring & Alerts

### Dashboards
- [Link to main dashboard]
- [Link to metrics dashboard]

### Key Metrics
| Metric | Normal Range | Alert Threshold |
|--------|--------------|-----------------|
| Response time | <200ms | >500ms |
| Error rate | <1% | >5% |
| CPU | <70% | >85% |
| Memory | <80% | >90% |

### Alert Runbooks
| Alert | Severity | Runbook Section |
|-------|----------|-----------------|
| High Error Rate | P1 | #high-error-rate |
| High Latency | P2 | #high-latency |
| Pod Crash Loop | P1 | #pod-crash-loop |

---

## Common Procedures

### Restart Service
```bash
# Kubernetes
kubectl rollout restart deployment/{{service}} -n {{namespace}}

# Verify
kubectl get pods -n {{namespace}} -l app={{service}}
```

### Scale Service
```bash
# Scale up
kubectl scale deployment/{{service}} --replicas=5 -n {{namespace}}

# Scale down
kubectl scale deployment/{{service}} --replicas=2 -n {{namespace}}
```

### View Logs
```bash
# Recent logs
kubectl logs -l app={{service}} -n {{namespace}} --tail=100

# Stream logs
kubectl logs -l app={{service}} -n {{namespace}} -f
```

---

## Troubleshooting

### High Error Rate

**Symptoms:**
- Error rate >5%
- User reports of failures

**Diagnosis:**
1. Check error logs: `kubectl logs -l app={{service}} -n {{namespace}} | grep ERROR`
2. Check the status of dependencies
3. Review recent deployments

**Resolution:**
1. If deployment-related: roll back (see Rollback)
2. If dependency-related: check the upstream service
3. If data-related: check database connections

### High Latency

**Symptoms:**
- P95 latency >500ms
- Slow user experience

**Diagnosis:**
1. Check CPU/memory utilization
2. Check database query times
3. Check external API response times

**Resolution:**
1. Scale up if resource-constrained
2. Optimize slow queries
3. Add caching if an external dependency is the bottleneck

### Pod Crash Loop

**Symptoms:**
- Pods restarting repeatedly
- CrashLoopBackOff status

**Diagnosis:**
```bash
kubectl describe pod {{pod-name}} -n {{namespace}}
kubectl logs {{pod-name}} -n {{namespace}} --previous
```

**Resolution:**
1. Check for OOM kills (increase the memory limit)
2. Check for config errors
3. Check for missing secrets/configmaps

---

## Deployment

### Deploy New Version
```bash
# Update image
kubectl set image deployment/{{service}} {{service}}={{image}}:{{tag}} -n {{namespace}}

# Monitor rollout
kubectl rollout status deployment/{{service}} -n {{namespace}}
```

### Rollback
```bash
# Roll back to the previous revision
kubectl rollout undo deployment/{{service}} -n {{namespace}}

# Roll back to a specific revision
kubectl rollout undo deployment/{{service}} --to-revision={{revision}} -n {{namespace}}
```

---

## Contacts

| Role | Contact | When to Escalate |
|------|---------|------------------|
| On-Call | [Rotation] | First responder |
| Service Owner | [Name] | Design decisions |
| Infra Team | [Channel] | Infrastructure issues |
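
The Key Metrics table in the template above pairs a normal range with an alert threshold for each metric. As a rough illustration of how those two columns relate, the sketch below classifies a sample against the template's default numbers. The metric keys, the function names, and the `elevated` label for values between the two limits are assumptions, not part of the template.

```javascript
// Thresholds copied from the Key Metrics table in the runbook template.
const KEY_METRICS = {
  responseTimeMs: { normalBelow: 200, alertAbove: 500 },
  errorRatePct: { normalBelow: 1, alertAbove: 5 },
  cpuPct: { normalBelow: 70, alertAbove: 85 },
  memoryPct: { normalBelow: 80, alertAbove: 90 },
};

// Classify one reading as "normal", "elevated", or "alert".
function classifyMetric(name, value) {
  const limits = KEY_METRICS[name];
  if (!limits) throw new Error(`Unknown metric: ${name}`);
  if (value > limits.alertAbove) return "alert";
  if (value < limits.normalBelow) return "normal";
  // Between the normal range and the alert threshold (label is an assumption).
  return "elevated";
}

// Classify every reading in a sample object.
function classifyAll(sample) {
  return Object.fromEntries(
    Object.entries(sample).map(([name, value]) => [name, classifyMetric(name, value)])
  );
}

const sampleStatus = classifyAll({
  responseTimeMs: 620,
  errorRatePct: 0.4,
  cpuPct: 78,
  memoryPct: 91,
});
console.log(sampleStatus);
```

For this sample, response time and memory classify as `alert`, error rate as `normal`, and CPU as `elevated`. The gap between the two columns is what makes the middle state possible, so a runbook may want to say what, if anything, should happen there.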


### Create Runbook Example
```javascript
Atlassian:createConfluencePage({
  cloudId: "{{confluence.cloudId}}",
  spaceId: "{{confluence.spaceId}}",
  parentId: "{{confluence.parentPages.operations}}",
  title: "Runbook: Authentication Service",
  body: `# Runbook: Authentication Service

**Last Updated:** 2025-01-06
**Owner:** DevOps Engineer Agent
**Criticality:** Critical

---

## Service Overview

**Service:** auth-service
**Purpose:** Handles user authentication, session management, and JWT token issuance

### Dependencies
| Service | Type | Impact if Down |
|---------|------|----------------|
| PostgreSQL | Downstream | Auth fails |
| Redis | Downstream | Sessions lost |
| API Gateway | Upstream | No traffic |

### Endpoints
| Environment | URL | Health Check |
|-------------|-----|--------------|
| Production | auth.example.com | /health |
| Staging | auth-staging.example.com | /health |

---

## Monitoring

### Key Metrics
| Metric | Normal | Alert |
|--------|--------|-------|
| Login latency | <100ms | >300ms |
| Failed logins | <5/min | >20/min |
| Token issuance | <50ms | >200ms |

---

## Common Procedures

### Restart Service
\`\`\`bash
kubectl rollout restart deployment/auth-service -n production
\`\`\`

### View Logs
\`\`\`bash
kubectl logs -l app=auth-service -n production --tail=100
\`\`\`

---

## Troubleshooting

### High Failed Login Rate

**Diagnosis:**
1. Check if brute force attack (same IP)
2. Check if credential stuffing (many IPs)
3. Check if legitimate (password reset needed)

**Resolution:**
1. Enable rate limiting
2. Block suspicious IPs
3. Notify security team if attack
`,
  contentFormat: "markdown"
})
```

## Deployment Guide

### Deployment Guide Template

# Deployment Guide: {{serviceName}}

**Version:** {{version}}
**Last Updated:** {{date}}

---

## Prerequisites

- [ ] Access to {{environment}} cluster
- [ ] Required credentials/secrets
- [ ] Approval from [approver]

## Pre-Deployment Checklist

- [ ] All tests passing
- [ ] Code review approved
- [ ] Change request approved
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready

## Deployment Steps

### 1. Prepare
```bash
# Verify current state
kubectl get deployment {{service}} -n {{namespace}}
```

### 2. Deploy
```bash
# Apply changes
kubectl apply -f deployment.yaml -n {{namespace}}
```

### 3. Verify
```bash
# Check rollout status
kubectl rollout status deployment/{{service}} -n {{namespace}}

# Verify pods healthy
kubectl get pods -l app={{service}} -n {{namespace}}

# Check logs for errors
kubectl logs -l app={{service}} -n {{namespace}} --tail=50
```

### 4. Smoke Test

- [ ] Health endpoint returns 200
- [ ] Key functionality working
- [ ] No error spikes in monitoring

## Rollback Procedure

If issues are detected:

```bash
kubectl rollout undo deployment/{{service}} -n {{namespace}}
```

## Post-Deployment

- [ ] Update deployment log
- [ ] Notify stakeholders
- [ ] Monitor for 30 minutes

---

## Used By Agents

| Agent | Document Types |
|-------|---------------|
| **DevOps Engineer** | Runbooks, Deployment Guides, Infra Docs |
| **Senior Developer** | Service-specific procedures |
| **Orchestrator** | Reference during incidents |

## Error Handling

| Error | Cause | Resolution |
|-------|-------|------------|
| 400 Bad Request | Code block formatting | Use proper markdown fencing |
| 404 Not Found | Parent page missing | Create Operations parent first |
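
The 400 Bad Request row above points at code block formatting. A simple pre-flight check, sketched below, is to confirm that every code fence in the page body is closed before calling `Atlassian:createConfluencePage`. This is a simplified reading of Markdown fence rules, not a full parser, and the function name `findUnclosedFence` is an assumption.

```javascript
// Return the 1-based line number of a code fence that is opened but never
// closed, or null when all fences are balanced.
function findUnclosedFence(markdown) {
  let open = null; // { marker, length, line } of the fence currently open
  markdown.split(/\r?\n/).forEach((text, index) => {
    const match = text.match(/^ {0,3}(`{3,}|~{3,})(.*)$/);
    if (!match) return;
    const [, run, rest] = match;
    if (open === null) {
      open = { marker: run[0], length: run.length, line: index + 1 };
    } else if (run[0] === open.marker && run.length >= open.length && rest.trim() === "") {
      // A closing fence uses the same character, is at least as long,
      // and carries no info string.
      open = null;
    }
  });
  return open === null ? null : open.line;
}

const FENCE = "`".repeat(3); // built this way to keep literal fences out of the example
const brokenBody = [
  "### Restart Service",
  `${FENCE}bash`,
  "kubectl get pods",
  "",
  "### Scale Service",
].join("\n");

console.log(findUnclosedFence(brokenBody)); // prints 2
```

Note that a second opening line such as a fence followed by `bash` does not close an earlier fence, because a closing fence cannot carry an info string; this is the usual way a template with several code blocks ends up malformed.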
