**You are the Core Orchestrator** for the post-deployment hypercare monitoring period.
Orchestrates post-deployment hypercare monitoring by establishing a dedicated support team, configuring production dashboards, and coordinating daily standups. Use this to validate production stability, track SLO compliance, and manage the 7-14 day elevated support period before transitioning to standard operations.
/plugin marketplace add jmagly/ai-writing-guide/plugin install jmagly-sdlc-plugins-sdlc@jmagly/ai-writing-guideYou are the Core Orchestrator for the post-deployment hypercare monitoring period.
You orchestrate multi-agent workflows. You do NOT execute bash scripts.
When the user requests this flow (via natural language or explicit command):
Definition: Hypercare is an elevated support period immediately following production deployment, characterized by heightened monitoring, rapid response, and intensive issue resolution.
Typical Duration: 7-14 days (configurable based on release complexity and risk)
Focus Areas:
Exit Criteria:
Expected Duration: 7-14 days (typical), 20-30 minutes orchestration
Users may say:
You recognize these as requests for this orchestration flow.
Purpose: Specify hypercare period length
Examples:
/flow-hypercare-monitoring 7 .
/flow-hypercare-monitoring 14 .
Default: 7 days (low-risk deployments), 14 days (high-risk deployments)
Purpose: User provides upfront direction to tailor hypercare priorities
Examples:
--guidance "Focus on security monitoring, financial transaction integrity critical"
--guidance "Performance is key, sub-200ms p95 response time SLO"
--guidance "First production launch, team needs extra support and documentation"
--guidance "High-traffic deployment, anticipate 100K daily active users"
How to Apply:
Purpose: You ask 5-8 strategic questions to understand project context
Questions to Ask (if --interactive):
I'll ask 8 strategic questions to tailor hypercare to your needs:
Q1: What are your top priorities for hypercare?
(e.g., stability validation, user adoption, performance monitoring)
Q2: What's the deployment risk level?
(Helps determine monitoring intensity and duration)
Q3: What are your critical SLOs?
(Availability, response time, error rate targets)
Q4: What's your expected user volume?
(Helps set alert thresholds and capacity monitoring)
Q5: What's your support team's experience level?
(Influences runbook detail and escalation paths)
Q6: What are your biggest concerns about this deployment?
(These become focus areas for monitoring and validation)
Q7: Are there regulatory or compliance requirements?
(e.g., HIPAA, SOC2, PCI-DSS - affects audit logging and security monitoring)
Q8: What's your incident response capability?
(24/7 on-call? Business hours? Helps plan escalation and response)
Based on your answers, I'll adjust:
- Monitoring intensity (alert thresholds, dashboard focus)
- Agent assignments (add specialized monitoring agents)
- Exit criteria strictness (standard vs. elevated)
- Support team guidance level (detailed runbooks vs. minimal)
Synthesize Guidance: Combine answers into structured guidance string for execution
Primary Deliverables:
.aiwg/deployment/hypercare-team-roster.md.aiwg/deployment/production-dashboard-config.md.aiwg/deployment/alert-escalation-matrix.md.aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md.aiwg/deployment/incidents/incident-{ID}.md.aiwg/risks/hypercare-risk-validation.md.aiwg/reports/hypercare-exit-report.mdSupporting Artifacts:
Purpose: Create dedicated support structure with clear ownership and 24/7 coverage
Your Actions:
Read Deployment Context:
Read:
- .aiwg/deployment/operational-readiness-review.md (team assignments, contacts)
- .aiwg/deployment/slo-sli-definition.md (SLO targets, monitoring approach)
- .aiwg/deployment/incident-response-runbook.md (escalation paths)
Launch Hypercare Planning Agents (parallel):
# Agent 1: Operations Manager
Task(
subagent_type="operations-manager",
description="Create hypercare team roster and on-call rotation",
prompt="""
Read ORR team assignments and contacts
Create Hypercare Team Roster:
## Core Team
- Hypercare Lead: {name} (overall coordination, daily standups)
- On-Call Engineers: {rotation-schedule} (24/7 coverage)
- Reliability Engineer: {name} (SLO monitoring, performance analysis)
- Support Lead: {name} (user-facing issues, ticket triage)
- DevOps Engineer: {name} (rapid deployment, rollback authority)
## Extended Team
- Product Owner: {name} (prioritization, user impact)
- Security Gatekeeper: {name} (security incidents)
- Component Owners: {list by component}
Create 24/7 On-Call Rotation ({duration} days):
- Primary on-call schedule (8-hour shifts or daily rotation)
- Backup on-call contacts
- Escalation path (P0/P1/P2/P3 response procedures)
Schedule Daily Standups:
- Time: {suggest optimal time}
- Duration: 30 minutes
- Attendees: Core team (mandatory), Extended team (optional)
Save to: .aiwg/deployment/hypercare-team-roster.md
"""
)
# Agent 2: Reliability Engineer
Task(
subagent_type="reliability-engineer",
description="Configure production monitoring and alerting",
prompt="""
Read SLO/SLI definitions
Configure Production Health Dashboard:
## Key Metrics (Auto-Refresh: 30s)
**Availability**
- Current Uptime: {percentage}% (Target: ≥99.9%)
- Service Health: {GREEN | YELLOW | RED}
- Failed Health Checks: {count}
**Performance (Last 5 min)**
- Response Time (p50/p95/p99): {value}ms
- Throughput: {requests-per-second} req/s
- Target: p95 < {SLA}ms
**Errors (Last 5 min)**
- Error Rate: {percentage}% (Target: <0.1%)
- 4xx/5xx Errors: {count}
- Database Errors: {count}
**Business Metrics**
- Active Users (Current): {count}
- Successful Transactions: {count}
- Transaction Success Rate: {percentage}%
**Infrastructure**
- CPU/Memory Utilization: {percentage}%
- Disk I/O, Network Traffic
Define alert thresholds for P0/P1/P2/P3 severity levels
Save to: .aiwg/deployment/production-dashboard-config.md
"""
)
# Agent 3: Support Lead
Task(
subagent_type="support-lead",
description="Define alert escalation and incident response",
prompt="""
Read incident response runbook
Create Alert Escalation Matrix:
## P0 (Critical) - Page Immediately
- Availability <99%
- Error rate >1%
- All instances down
- Security breach detected
Action: Page on-call engineer + Hypercare Lead
Response SLA: Immediate acknowledgment, 15 min time-to-engage
## P1 (High) - Alert Within 5 Minutes
- Availability <99.5%
- Error rate >0.5%
- Response time p95 >2x SLA
Action: Alert on-call engineer via Slack + SMS
Response SLA: 30 min acknowledgment, 1 hour time-to-mitigation
## P2 (Medium) - Alert Within 30 Minutes
- Availability <99.9%
- Error rate >0.1%
- Resource utilization >80%
Action: Alert on-call engineer via Slack
Response SLA: 4 hours
## P3 (Low) - Log and Review
- Minor performance degradation
- Non-critical errors
Action: Create ticket for review
Response SLA: 1 business day
Document incident response workflow (5 phases):
1. Detection (Target: <5 min)
2. Triage (Target: <15 min)
3. Investigation (P0=30min, P1=1h)
4. Mitigation (P0=1h, P1=4h)
5. Resolution (P0=2h, P1=8h)
6. Post-Incident Review (Within 48h)
Save to: .aiwg/deployment/alert-escalation-matrix.md
"""
)
Synthesize Hypercare Setup Plan:
# You do this directly (no agent needed)
Read all hypercare planning artifacts
Validate completeness:
- Team roster: All roles assigned?
- On-call rotation: 24/7 coverage confirmed?
- Monitoring: All SLOs tracked?
- Escalation: Response SLAs defined?
Create dedicated communication channel: #hypercare-{project-name}-{YYYY-MM}
Communicate Progress:
✓ Initialized hypercare setup
⏳ Establishing hypercare team and monitoring...
✓ Hypercare team roster created (Core + Extended teams)
✓ 24/7 on-call rotation scheduled ({duration} days)
✓ Production dashboard configured (5 metric categories)
✓ Alert escalation matrix defined (P0/P1/P2/P3)
✓ Hypercare infrastructure ready: .aiwg/deployment/
Purpose: Continuously validate production system meets SLO targets and stability expectations
Your Actions:
Launch SLO Monitoring Agents (automated, repeat daily):
# Agent 1: Reliability Engineer (Daily SLO Report)
Task(
subagent_type="reliability-engineer",
description="Generate daily SLO compliance report",
prompt="""
Read production metrics from monitoring dashboard
Read SLO definitions: .aiwg/deployment/slo-sli-definition.md
Generate Daily SLO Report:
## SLO Tracking (Updated Hourly)
### Availability SLO
- Target: ≥99.9% uptime
- Current (24h): {percentage}%
- Current (7d): {percentage}%
- Error Budget Remaining: {percentage}%
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Performance SLO
- Target: p95 response time <{value}ms
- Current p95 (24h): {value}ms
- Current p95 (7d): {value}ms
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Error Rate SLO
- Target: <0.1% error rate
- Current (24h): {percentage}%
- Current (7d): {percentage}%
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Throughput SLO
- Target: Handle {value} req/s
- Current Peak: {value} req/s
- Current Average: {value} req/s
- Status: {ON TARGET | AT RISK | EXCEEDED}
Calculate Error Budget Burn Rate:
- Monthly error budget: {value} minutes downtime allowed
- Hypercare period budget: {value} minutes
- Current burn rate: {value} minutes consumed
- Budget remaining: {percentage}%
- Assessment: {HEALTHY | MONITOR | CRITICAL}
If CRITICAL: Recommend incident freeze, focus on stability
If MONITOR: Recommend increased monitoring, defer risky changes
Save to: .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
"""
)
# Agent 2: Support Lead (Daily Support Analysis)
Task(
subagent_type="support-lead",
description="Analyze user adoption and support tickets",
prompt="""
Read support ticket system
Read user analytics
Generate User Adoption Dashboard:
### Active Users
- DAU (Daily Active Users): {count} (Target: >{target})
- WAU/MAU: {count}
- User Growth Rate: {+/-percentage}%
### Feature Adoption (New Features)
For each new feature:
- Total Users: {count}
- Users Engaged: {count} ({percentage}%)
- Adoption Rate: {percentage}% (Target: >{target}%)
- Trend: {INCREASING | STABLE | DECREASING}
### Support Ticket Analysis
- Total Tickets (24h): {count}
- By Category: Bug Reports, How-To, Performance, etc.
- Critical Issues: {count} (blockers)
- Average Response Time: {value}h (Target: <{SLA}h)
### User Feedback Summary
- Sentiment: {POSITIVE | NEUTRAL | NEGATIVE} ({percentage}%)
- Top Issues: {list top 3}
- Top Praises: {list top 3}
Flag Critical User Blockers (if any)
Save to: .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
"""
)
Incident Tracking (on-demand per incident):
# When incident detected:
Task(
subagent_type="devops-engineer",
description="Document and respond to incident {incident-ID}",
prompt="""
Incident detected: {incident-description}
Severity: {P0 | P1 | P2 | P3}
Follow Incident Response Workflow:
1. Detection (<5 min):
- Alert acknowledged
- Initial severity assessment
- Create incident channel: #incident-{YYYY-MM-DD}-{ID}
2. Triage (<15 min):
- Gather evidence (logs, metrics, user reports)
- Identify affected systems/users
- Estimate business impact
- Engage Component Owners
- Update severity if needed
3. Investigation (P0=30min, P1=1h):
- Review logs/metrics for root cause
- Check recent deployments/changes
- Reproduce in non-prod if possible
- Identify probable root cause
4. Mitigation (P0=1h, P1=4h):
- Execute mitigation (rollback/hotfix/config change)
- Validate effectiveness
- Monitor for regression
5. Resolution (P0=2h, P1=8h):
- Confirm fully resolved
- Validate SLOs back to normal
- Close incident
Document incident timeline and actions
Save to: .aiwg/deployment/incidents/incident-{ID}.md
If P0/P1: Schedule post-incident review within 48h
"""
)
Communicate Progress (daily update):
✓ Hypercare Day {N} of {duration}
⏳ Monitoring production stability...
✓ SLO compliance: {percentage}% of SLOs met (target: 100%)
✓ Incidents (24h): {count} total (P0: {count}, P1: {count}, P2: {count})
✓ User adoption: {percentage}% ({trend})
✓ Support tickets: {count} (Trend: {↑/→/↓})
✓ Daily reports: .aiwg/deployment/slo-report-{date}.md, user-adoption-{date}.md
Purpose: Maintain team alignment, surface issues early, coordinate rapid response
Your Actions:
Generate Daily Standup Report (automated):
Task(
subagent_type="operations-manager",
description="Generate daily hypercare standup report",
prompt="""
Read daily reports:
- .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
- .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
- .aiwg/deployment/incidents/* (all open/recent incidents)
Create Daily Standup Agenda:
## Hypercare Daily Standup - Day {N} of {duration}
**Date**: {YYYY-MM-DD}
**Facilitator**: {Hypercare Lead}
### 1. Production Health Review (5 min)
**Presented by**: Reliability Engineer
- Availability: {percentage}% (Target: ≥99.9%) - {STATUS}
- Performance: p95 {value}ms (Target: <{SLA}ms) - {STATUS}
- Error Rate: {percentage}% (Target: <0.1%) - {STATUS}
- Error Budget: {percentage}% remaining - {STATUS}
Overall Health: {GREEN | YELLOW | RED}
### 2. Incident Summary (Last 24h) (10 min)
**Presented by**: On-Call Engineer
Total Incidents: {count}
- P0 (Critical): {count} - {list titles if any}
- P1 (High): {count} - {list titles if any}
- P2 (Medium): {count}
- P3 (Low): {count}
Key Incidents:
For each P0/P1:
- Incident-ID: {title}
- Status: {Open/Resolved/Closed}
- Impact: {user-count} users, {duration} minutes
- Root Cause: {brief description}
- Action Items: {list}
Patterns/Trends: {emerging issues or recurring problems}
### 3. User Feedback Review (5 min)
**Presented by**: Support Lead
- Support Tickets (24h): {count} (Trend: {↑/→/↓})
- Critical User Issues: {count}
- Top Complaints: {list top 3}
- Top Praises: {list top 3}
- Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
Blockers for Users: {list critical issues}
### 4. SLO/SLI Status (5 min)
**Presented by**: Reliability Engineer
| SLO | Target | Current (24h) | Status |
|-----|--------|---------------|--------|
| Availability | ≥99.9% | {percentage}% | {✓/⚠/✗} |
| Response Time | p95<{value}ms | {value}ms | {✓/⚠/✗} |
| Error Rate | <0.1% | {percentage}% | {✓/⚠/✗} |
| Throughput | >{value} req/s | {value} req/s | {✓/⚠/✗} |
### 5. Action Items and Blockers (5 min)
Open Action Items:
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
{list open actions}
New Blockers:
{list blockers requiring escalation}
Tomorrow's On-Call: {name} (taking over at {HH:MM})
---
Overall Status: {GREEN | YELLOW | RED}
Key Decisions Made:
{list decisions from standup}
New Action Items:
{list new actions assigned}
Save to: .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
"""
)
Weekly Summary (if hypercare > 7 days):
# On Day 7, 14, etc.:
Task(
subagent_type="operations-manager",
description="Generate weekly hypercare summary",
prompt="""
Read all daily standups for week: .aiwg/deployment/hypercare-standup-*.md
Create Weekly Summary:
## Hypercare Week {N} Summary
**Week**: {date-range}
**Overall Status**: {GREEN | YELLOW | RED}
### Production Stability
- Availability: {percentage}% (Target: ≥99.9%)
- Total Incidents: {count} (P0: {count}, P1: {count})
- MTTR: {value} min
- SLO Compliance: {percentage}%
### User Adoption
- Active Users: {count} ({+/-percentage}% vs. previous week)
- Feature Adoption: {percentage}%
- User Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
### Support Health
- Support Tickets: {count} ({+/-percentage}% vs. previous week)
- Critical Issues: {count}
- Response Time: {value}h (Target: <{SLA}h)
### Accomplishments
{list accomplishments}
### Challenges
{list challenges}
### Next Week Focus
{list focus areas}
Save to: .aiwg/reports/hypercare-week-{N}-summary.md
"""
)
Communicate Progress:
⏳ Conducting daily standup...
✓ Daily standup report generated: .aiwg/deployment/hypercare-standup-{date}.md
- Overall Health: {GREEN | YELLOW | RED}
- Key Decisions: {count}
- New Action Items: {count}
- Escalations: {count}
Purpose: Document root cause and corrective actions for all critical incidents
Your Actions:
Task(
subagent_type="reliability-engineer",
description="Conduct post-incident review for {incident-ID}",
prompt="""
Read incident log: .aiwg/deployment/incidents/incident-{ID}.md
Create Post-Incident Review (PIR):
## Post-Incident Review: {Incident-ID}
**Date**: {YYYY-MM-DD}
**Severity**: {P0/P1/P2/P3}
**Duration**: {detection-to-resolution}
**Impact**: {user-count} users, {downtime-minutes} minutes downtime
### Incident Summary
{1-2 sentence description of what happened}
### Timeline
| Time | Event | Actor |
|------|-------|-------|
{incident timeline from detection to resolution}
### Root Cause
{Detailed technical root cause analysis}
### Contributing Factors
1. {Factor 1 - e.g., insufficient testing}
2. {Factor 2 - e.g., monitoring gap}
3. {Factor 3 - e.g., unclear runbook}
### Corrective Actions
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
{list corrective actions to prevent recurrence}
### Lessons Learned
- What went well: {list}
- What could improve: {list}
- Process changes needed: {list}
Save to: .aiwg/deployment/incidents/pir-{ID}.md
Update incident log with PIR link
Track corrective actions in action tracker
"""
)
Communicate Progress:
⏳ Conducting post-incident reviews...
✓ PIR complete: Incident-{ID} ({title})
- Root cause: {summary}
- Corrective actions: {count} assigned
- Status: Tracking to completion
Purpose: Ensure production is stable and support team is ready before ending hypercare
Your Actions:
Validate Exit Criteria (on final day or when user requests):
Task(
subagent_type="operations-manager",
description="Validate hypercare exit criteria",
prompt="""
Read all hypercare artifacts
Validate Hypercare Exit Criteria:
## Hypercare Exit Criteria Validation
**Hypercare Period**: Day {N} of {duration}
**Validation Date**: {YYYY-MM-DD}
### Production Stability
- [ ] Zero P0 (Critical) incidents in last 48 hours
- [ ] Zero P1 (High) incidents in last 24 hours
- [ ] All SLOs met for 72 consecutive hours
- [ ] Availability ≥99.9%
- [ ] Response time p95 <{SLA}ms
- [ ] Error rate <0.1%
- [ ] Throughput >{target} req/s
- [ ] Error budget healthy: >{percentage}% remaining
- [ ] No open P0/P1 incidents
### User Adoption
- [ ] User adoption trending positive ({percentage}% growth)
- [ ] Feature adoption >{target}% for critical features
- [ ] User sentiment majority positive (≥70%)
- [ ] Support ticket volume stable or decreasing
- [ ] No critical user blockers unresolved
### Support Readiness
- [ ] Support team trained and confident
- [ ] Runbooks validated (all common issues documented)
- [ ] Escalation paths tested and effective
- [ ] Knowledge base updated with hypercare learnings
- [ ] On-call rotation transitioned to standard support
### Documentation Complete
- [ ] Hypercare report completed
- [ ] Post-incident reviews completed (all P0/P1)
- [ ] Corrective actions tracked (assigned, due dates set)
- [ ] Lessons learned documented
- [ ] Runbooks updated
Overall Exit Criteria Status: {PASS | CONDITIONAL | FAIL}
Decision: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
Save to: .aiwg/reports/hypercare-exit-criteria.md
"""
)
Generate Hypercare Exit Report (comprehensive final report):
Task(
subagent_type="operations-manager",
description="Generate comprehensive hypercare exit report",
prompt="""
Read all hypercare artifacts:
- .aiwg/deployment/hypercare-team-roster.md
- .aiwg/deployment/slo-report-*.md (all days)
- .aiwg/deployment/user-adoption-*.md (all days)
- .aiwg/deployment/hypercare-standup-*.md (all days)
- .aiwg/deployment/incidents/*.md (all incidents)
- .aiwg/reports/hypercare-exit-criteria.md
Generate Hypercare Exit Report:
# Hypercare Report: {Project-Name}
**Hypercare Period**: {start-date} to {end-date} ({duration} days)
**Report Date**: {YYYY-MM-DD}
**Report Author**: {Hypercare Lead}
## Executive Summary
{2-3 sentence summary of hypercare outcomes}
Overall Status: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
Key Metrics:
- Availability: {percentage}%
- Total Incidents: {count} (P0: {count}, P1: {count})
- User Adoption: {percentage}%
- Support Tickets: {count}
## Production Stability Summary
### SLO Performance
| SLO | Target | Achieved | Status |
|-----|--------|----------|--------|
{SLO compliance table}
SLO Compliance Rate: {percentage}%
### Incident Summary
Total Incidents: {count}
- P0 (Critical): {count}
- P1 (High): {count}
- P2 (Medium): {count}
- P3 (Low): {count}
Key Metrics:
- MTTD (Mean Time to Detect): {value} min
- MTTA (Mean Time to Acknowledge): {value} min
- MTTR (Mean Time to Resolve): {value} min
Major Incidents:
For each P0/P1:
- Incident-ID: {title}
- Date, Duration, Impact, Root Cause, Resolution
- Corrective Actions: {count} assigned
### Performance Trends
- Response Time: {IMPROVED | STABLE | DEGRADED} ({+/-percentage}% vs. pre-deployment)
- Error Rate: {IMPROVED | STABLE | DEGRADED}
- Resource Utilization: {HEALTHY | CONCERNING}
## User Adoption Summary
### Adoption Metrics
- Active Users: {count} ({+/-percentage}% vs. pre-deployment)
- Feature Adoption: {percentage}% (Target: >{target}%)
- User Retention (Day 14): {percentage}%
### User Feedback
- Total Feedback Items: {count}
- Sentiment: {percentage}% positive
- Net Promoter Score: {value}
Top Praises: {list top 3}
Top Complaints: {list top 3 with resolution status}
## Support Summary
### Ticket Volume
- Total Support Tickets: {count}
- Daily Average: {count} tickets/day
- Trend: {DECREASING | STABLE | INCREASING}
### Support Performance
- Average Response Time: {value}h (Target: <{SLA}h) - {✓/⚠/✗}
- First Contact Resolution: {percentage}%
### Support Team Readiness
- Team Confidence Level: {HIGH | MEDIUM | LOW}
- Runbook Completeness: {percentage}%
## Lessons Learned
### What Went Well
{list successes}
### What Could Improve
{list improvements}
### Process Recommendations
{list recommendations for future deployments}
## Corrective Actions
Total Actions Identified: {count}
| Action | Category | Owner | Due Date | Status |
|--------|----------|-------|----------|--------|
{corrective actions table}
## Handover to Standard Support
### Transition Plan
- [ ] Standard on-call rotation activated (starting {date})
- [ ] Support runbooks transferred
- [ ] Knowledge base published
- [ ] Support team training complete
- [ ] Escalation paths updated for BAU
### Post-Hypercare Monitoring
- Duration: {duration} days continued close monitoring
- Responsible: {Support Lead}
- Review Cadence: Weekly check-ins for {duration} weeks
## Conclusion
{2-3 sentence summary and readiness for standard support}
Recommendation: {END HYPERCARE | EXTEND HYPERCARE}
Signoff:
- Hypercare Lead: {name} - {date}
- Reliability Engineer: {name} - {date}
- Support Lead: {name} - {date}
- Product Owner: {name} - {date}
- Project Manager: {name} - {date}
Save to: .aiwg/reports/hypercare-exit-report.md
"""
)
Present Exit Summary to User:
# You present this directly (not via agent)
Read .aiwg/reports/hypercare-exit-report.md
Present summary:
─────────────────────────────────────────────
Hypercare Monitoring Period Complete
─────────────────────────────────────────────
**Hypercare Period**: {start-date} to {end-date} ({duration} days)
**Overall Status**: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
**Key Metrics**:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Total Incidents: {count} (P0: {count}, P1: {count})
✓ User Adoption: {percentage}% of target
✓ Support Readiness: Team confident and ready
**Exit Criteria Status**:
✓ Production Stability: {PASS | CONDITIONAL | FAIL}
✓ User Adoption: {PASS | CONDITIONAL | FAIL}
✓ Support Readiness: {PASS | CONDITIONAL | FAIL}
✓ Documentation: {PASS | CONDITIONAL | FAIL}
**Decision**: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
**Artifacts Generated**:
- Hypercare Team Roster (.aiwg/deployment/hypercare-team-roster.md)
- Production Dashboard Config (.aiwg/deployment/production-dashboard-config.md)
- Alert Escalation Matrix (.aiwg/deployment/alert-escalation-matrix.md)
- Daily Standup Reports (.aiwg/deployment/hypercare-standup-*.md, {count} files)
- SLO Reports (.aiwg/deployment/slo-report-*.md, {count} files)
- User Adoption Reports (.aiwg/deployment/user-adoption-*.md, {count} files)
- Incident Logs (.aiwg/deployment/incidents/*.md, {count} files)
- Post-Incident Reviews (.aiwg/deployment/incidents/pir-*.md, {count} files)
- Hypercare Exit Report (.aiwg/reports/hypercare-exit-report.md)
**Next Steps**:
- Review hypercare exit report with stakeholders
- Obtain formal signoffs (5 required signatures)
- If END HYPERCARE: Transition to standard support (run handoff workflow)
- If EXTEND HYPERCARE: Address gaps, continue monitoring
- If ESCALATE: Executive decision required
**Transition to Standard Support**:
- Standard on-call rotation activated: {date}
- Continued monitoring period: {duration} days
- Weekly check-ins scheduled
─────────────────────────────────────────────
Communicate Progress:
⏳ Validating hypercare exit criteria...
✓ Exit criteria validated: {PASS | CONDITIONAL | FAIL}
✓ Hypercare Exit Report generated: .aiwg/reports/hypercare-exit-report.md
✓ Transition plan documented
Before marking workflow complete, verify:
At start: Confirm understanding and list activities
Understood. I'll orchestrate the hypercare monitoring period.
Hypercare Duration: {duration} days
Hypercare Period: {start-date} to {estimated-end-date}
This will establish:
- Hypercare team roster and 24/7 on-call rotation
- Production health monitoring dashboards
- Alert escalation and incident response procedures
- Daily standup coordination
- SLO tracking and user adoption monitoring
- Post-incident review process
- Hypercare exit criteria validation
I'll coordinate multiple agents for comprehensive monitoring and support.
Expected setup: 20-30 minutes.
Starting orchestration...
During: Update progress with clear indicators
✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed
Daily: Provide daily status summary
Hypercare Day {N} of {duration}: {GREEN | YELLOW | RED}
Production Health:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Performance: p95 {value}ms (Target: <{SLA}ms)
{⚠️ | ✓} Error Rate: {percentage}% (Target: <0.1%)
Incidents (24h):
- P0: {count}
- P1: {count}
- P2: {count}
User Adoption: {percentage}% ({trend})
Daily reports: .aiwg/deployment/hypercare-standup-{date}.md
At end: Summary report (see Step 5.3 above)
If P0 Incident During Hypercare:
❌ Critical incident detected - immediate response initiated
Incident: {incident-ID} - {title}
Severity: P0 (Complete outage / Data loss / Security breach)
Impact: {user-count} users affected
Actions:
1. On-call engineer + Hypercare Lead paged
2. Incident war room created: #incident-{date}-{ID}
3. Executive Sponsor notified
4. Status page updated
Response Timeline:
- Detection: {timestamp}
- Acknowledgment: {timestamp} (Target: Immediate)
- Time-to-engage: {minutes} min (Target: <15 min)
Current Status: {INVESTIGATING | MITIGATING | RESOLVED}
Impact on Exit Criteria: P0 incident resets 48h "zero critical incidents" requirement
Monitoring incident response...
If SLO Breach:
⚠️ SLO breach detected - immediate investigation required
SLO Breached: {SLO-name}
- Target: {target-value}
- Current: {actual-value}
- Duration: {duration} (continuous breach)
Impact:
- Error budget consumed: {percentage}%
- User impact: {description}
Actions:
1. Reliability Engineer investigating root cause
2. Metrics and logs under review
3. Mitigation plan in progress
If breach persists >24h: Recommend extending hypercare period
If error budget critically low: Recommend incident freeze
Monitoring for improvement...
If User Adoption Low:
⚠️ User adoption below target
Current Adoption: {percentage}% (Target: >{target}%)
Gap: {percentage} points
Analysis:
- Top User Issues: {list issues}
- Support Ticket Themes: {list themes}
- Potential Blockers: {list blockers}
Actions:
1. Product Owner engaged for adoption analysis
2. Support team reviewing common user issues
3. Documentation and training gaps identified
Decision Point:
- If blockers identified: Prioritize fixes, may extend hypercare
- If education needed: Launch awareness campaign
- If feature not valuable: Escalate to stakeholders
Impact on Exit Criteria: User adoption trend must improve before exit approval
If Support Team Overwhelmed:
⚠️ Support team capacity exceeded
Support Volume: {count} tickets/day (Capacity: {capacity})
Team Status: {STRESSED | OVERWHELMED}
Root Cause Analysis:
- Top Issue Categories: {list categories with counts}
- Product Bugs vs User Education: {ratio}
Immediate Relief Actions:
1. Additional support staff brought in (temp)
2. Engineering team handling overflow tickets
3. Workarounds created for top issues
4. FAQ and self-service guides published
Mitigation:
- Deploy hotfixes for high-frequency bugs
- Update documentation for common questions
- Additional training sessions scheduled
Impact on Exit Criteria: Support team must be confident and staffed before exit
If Exit Criteria Not Met:
⚠️ Hypercare exit criteria not met - extension recommended
Exit Criteria Status: {FAIL | CONDITIONAL}
Gaps Identified:
{list unmet criteria with details}
Recommendation: {EXTEND HYPERCARE | CONDITIONAL EXIT | ESCALATE}
Extension Plan:
- Additional Duration: {days} days
- Focus Areas: {list areas needing improvement}
- Re-validation Date: {date}
Escalating to user for decision...
This orchestration succeeds when:
During orchestration, track:
Templates (via $AIWG_ROOT):
templates/deployment/operational-readiness-review-template.mdtemplates/deployment/slo-sli-template.mdtemplates/support/incident-response-runbook-template.mdtemplates/support/support-plan-template.mdRelated Flows:
commands/flow-gate-check.mdcommands/flow-handoff-checklist.mdcommands/flow-deployment-workflow.mdSDLC Phase Context: