From datadops
Quick service health assessment providing a comprehensive overview of current service status using key metrics, active alerts, and recent events. Perfect for daily health checks, incident triage, or getting rapid service insights. Use when you need fast service status or as a starting point for deeper investigation.
npx claudepluginhub ahmidbbc/datadops --plugin datadops
This skill uses the workspace's default tool permissions.
Rapid service health assessment providing actionable insights in under 60 seconds.
When invoked directly with /datadops:service-health-overview, use $ARGUMENTS as the scope for the report.
If the scope is unclear, ask whether to focus on a service, a team, or an environment.
✅ Green (90-100%): All systems operational
🟡 Yellow (75-89%): Minor issues, monitoring required
🔴 Red (0-74%): Critical issues, immediate action needed
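The color bands above can be expressed as a small lookup. This is a minimal sketch rather than the plugin's actual implementation: the function name `classify_health` and the rejection of out-of-range values are assumptions. The cut-offs are parameters because the score profiles later in this document place the top band at 95 rather than 90.

```python
def classify_health(score: float, green_min: float = 90.0, yellow_min: float = 75.0) -> str:
    """Map a 0-100 health score to a color band (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= green_min:
        return "green"   # all systems operational
    if score >= yellow_min:
        return "yellow"  # minor issues, monitoring required
    return "red"         # critical issues, immediate action needed


if __name__ == "__main__":
    for sample in (92, 80, 40):  # illustrative scores, not real data
        print(sample, classify_health(sample))
```

Passing `green_min=95` applies the stricter cut-off used by the score profiles.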
Indicators:
- HTTP success rate
- Service uptime
- Critical endpoint availability
Metrics:
- Response time percentiles
- Throughput consistency
- Resource utilization
- Database performance
Thresholds:
- Excellent: All metrics within targets
- Good: Minor performance variations
- Poor: Significant performance degradation
Analysis:
- Error rate trends
- New error types
- Error severity distribution
- Customer impact assessment
Classifications:
- Low: Error rate < 1%, no critical errors
- Medium: Error rate 1-5%, some customer impact
- High: Error rate > 5%, significant customer impact
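The classification above could be sketched as follows. The function name `classify_error_impact` is hypothetical, and one case is an assumption: the list does not say how to treat an error rate below 1% when critical errors are present, so this sketch bumps that case to "medium".

```python
def classify_error_impact(error_rate_pct: float, has_critical_errors: bool = False) -> str:
    """Classify error impact from an error rate given in percent (hypothetical helper)."""
    if error_rate_pct < 0:
        raise ValueError(f"error rate cannot be negative, got {error_rate_pct}")
    if error_rate_pct > 5:
        return "high"    # significant customer impact
    if error_rate_pct >= 1 or has_critical_errors:
        # 1-5% per the list above; the critical-error bump is an assumption
        return "medium"
    return "low"         # below 1% and no critical errors


if __name__ == "__main__":
    print(classify_error_impact(0.3))
    print(classify_error_impact(2.5))
    print(classify_error_impact(7.0))
```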
Components:
- Host/container health
- Resource availability
- Network connectivity
- Storage performance
Assessment:
- Healthy: All infrastructure optimal
- Warning: Resource constraints detected
- Critical: Infrastructure failures present
Overall Score: 95-100
Characteristics:
- Success rate > 99.5%
- Latency within targets
- No active alerts
- Stable resource usage
- Recent deployments successful
Recommended Actions:
- Continue monitoring
- Review performance trends
- Plan capacity for growth
Overall Score: 75-94
Characteristics:
- Success rate 95-99.5%
- Latency slightly elevated
- Minor alerts present
- Increasing resource usage
- Some deployment issues
Recommended Actions:
- Investigate warning indicators
- Review recent changes
- Prepare mitigation plans
- Increase monitoring frequency
Overall Score: 0-74
Characteristics:
- Success rate < 95%
- High latency or timeouts
- Critical alerts firing
- Resource exhaustion
- Failed deployments
Recommended Actions:
- Immediate investigation required
- Escalate to on-call team
- Prepare rollback procedures
- Customer communication
"Give me a health overview of the payment service."
Expected Response:
"Check the health of all checkout-related services."
Expected Response:
"Prepare a service health summary for our team standup."
Expected Response:
{
  "payment_service": {
    "success_rate_target": 99.9,
    "latency_p95_target": 100,
    "error_rate_threshold": 0.1
  },
  "search_service": {
    "success_rate_target": 99.5,
    "latency_p95_target": 500,
    "error_rate_threshold": 0.5
  }
}
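One way such per-service targets might be applied is to compare observed metrics against them and list the breaches. The sketch below is illustrative only: the `find_violations` helper, the observed-metric keys, and the sample values are all hypothetical, and it assumes rates are percentages and latency is in milliseconds, which the configuration does not state.

```python
import json

# Same targets as the configuration above.
TARGETS_JSON = """
{
  "payment_service": {"success_rate_target": 99.9, "latency_p95_target": 100, "error_rate_threshold": 0.1},
  "search_service": {"success_rate_target": 99.5, "latency_p95_target": 500, "error_rate_threshold": 0.5}
}
"""


def find_violations(service: str, observed: dict, targets: dict) -> list[str]:
    """Return the names of observed metrics that breach the service's targets."""
    target = targets[service]
    violations = []
    if observed["success_rate"] < target["success_rate_target"]:
        violations.append("success_rate")
    if observed["latency_p95"] > target["latency_p95_target"]:
        violations.append("latency_p95")
    if observed["error_rate"] > target["error_rate_threshold"]:
        violations.append("error_rate")
    return violations


if __name__ == "__main__":
    targets = json.loads(TARGETS_JSON)
    # Made-up observation for illustration, not real monitoring data.
    observed = {"success_rate": 99.7, "latency_p95": 140, "error_rate": 0.05}
    for service in targets:
        print(service, find_violations(service, observed, targets))
```

With the sample observation, the stricter payment targets flag success rate and latency while the search targets flag nothing, which is the point of keeping thresholds per service.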
# Claude Code interactive invocation
/datadops:service-health-overview critical production services
# Prompt-based automation examples
0 9 * * * claude -p "Give me a health overview of all critical production services. Summarize health scores, active alerts, recent events, and the top recommended actions."
0 * * * * claude -p "Give me a health overview of critical services and mention only active alerts, regressions, and immediate actions."