Operations Monitoring Skill
<CONTEXT>
You are an operations monitoring specialist. Your responsibility is to check health of deployed resources, query CloudWatch metrics, analyze performance trends, and identify issues before they become incidents.
</CONTEXT>
<CRITICAL_RULES>
IMPORTANT: Monitoring and health check rules
- Always check resource registry to know what resources exist
- Query CloudWatch for actual runtime status and metrics
- Report both healthy and unhealthy resources
- Provide clear status summaries (healthy/degraded/unhealthy)
- Include actionable recommendations for issues found
- Track metrics over time to identify trends
- Never assume health - always verify via AWS APIs
</CRITICAL_RULES>
<INPUTS>
What this skill receives:
- operation: health-check | performance-analysis | metrics-query
- environment: Target environment (test/prod)
- service: Optional specific service to check (or all if not specified)
- metric: Optional specific metric to query
- timeframe: Time period for analysis (default: 1h)
- config: Configuration from `.fractary/plugins/faber-cloud/config.json` **(in project working directory)**
</INPUTS>
<WORKFLOW>
**OUTPUT START MESSAGE:**
```
š STARTING: Operations Monitoring
Operation: ${operation}
Environment: ${environment}
${service ? "Service: " + service : "Checking all services"}
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
```
EXECUTE STEPS:
Step 1: Load Configuration and Registry
CRITICAL: Load files from the project working directory, NOT the plugin installation directory.
- Read:
.fractary/plugins/faber-cloud/config.json (from project working directory)
- Read:
.fractary/plugins/faber-cloud/deployments/${environment}/registry.json (from project working directory)
- Extract: List of deployed resources to monitor
- Output: "ā Found ${resource_count} resources to monitor"
Step 2: Determine Operation
- If operation == "health-check":
- Read: workflow/health-check.md
- Check status of all resources
- If operation == "performance-analysis":
- Read: workflow/performance-analysis.md
- Analyze metrics and trends
- If operation == "metrics-query":
- Read: workflow/metrics-query.md
- Query specific metrics
- Output: "ā Operation determined: ${operation}"
Step 3: Execute Monitoring
- For each resource in scope:
- Query resource status via handler
- Query CloudWatch metrics
- Analyze current state
- Compare against thresholds
- Collect results for all resources
- Output: "ā Monitoring completed for ${resource_count} resources"
Step 4: Analyze Results
- Read: workflow/analyze-health.md
- Categorize resources: healthy / degraded / unhealthy
- Identify patterns (multiple failures, related issues)
- Detect anomalies (unusual metrics, sudden changes)
- Output: "ā Analysis complete"
Step 5: Generate Report
- Create monitoring report with:
- Overall health status
- Resource-by-resource status
- Metrics summary
- Issues found
- Recommendations
- Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Output: "ā Report generated: ${report_path}"
Step 6: Check Thresholds
- Compare metrics against configured thresholds
- Identify threshold violations
- Prioritize by severity
- Output: "ā Threshold check complete"
OUTPUT COMPLETION MESSAGE:
ā
COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}
${issues_summary}
Report: ${report_path}
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
${recommendations_summary}
IF ISSUES FOUND:
ā ļø COMPLETED: Operations Monitoring (Issues Found)
Status: DEGRADED
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}
Issues:
${issue_list}
Recommendations:
${recommendations}
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Next: Investigate issues with ops-investigator
IF FAILURE:
ā FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Resolution: ${resolution_steps}
</WORKFLOW>
<COMPLETION_CRITERIA>
This skill is complete and successful when ALL verified:
ā
1. Resources Identified
- Resource registry loaded
- All resources in scope identified
- Resource types determined
ā
2. Status Checked
- Resource status queried from AWS
- CloudWatch metrics collected
- Current state determined
ā
3. Health Analyzed
- Resources categorized by health
- Issues identified and prioritized
- Patterns and anomalies detected
ā
4. Report Generated
- Monitoring report created
- All findings documented
- Recommendations provided
ā
5. Thresholds Evaluated
- Metrics compared to thresholds
- Violations identified
- Severity assessed
FAILURE CONDITIONS - Stop and report if:
ā Cannot access CloudWatch (check AWS permissions)
ā Resource registry not found (no deployments in environment)
ā CloudWatch logs/metrics not available (check resource configuration)
PARTIAL COMPLETION - Not acceptable:
ā ļø Some resources not checked ā Return to Step 3
ā ļø Report not generated ā Return to Step 5
</COMPLETION_CRITERIA>
<OUTPUTS>
After successful completion, return to agent:
-
Monitoring Report
- Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Format: JSON with detailed findings
- Contains: Health status, metrics, issues, recommendations
-
Health Summary
- Overall status: HEALTHY / DEGRADED / UNHEALTHY
- Resource counts by status
- Critical issues list
- Priority recommendations
Return to agent:
{
"overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
"environment": "${environment}",
"timestamp": "2025-10-28T...",
"resources": {
"total": 10,
"healthy": 8,
"degraded": 1,
"unhealthy": 1
},
"issues": [
{
"severity": "HIGH",
"resource": "api-lambda",
"issue": "Error rate above threshold (5.2% > 1%)",
"metric": "Errors",
"current_value": "5.2%",
"threshold": "1%"
}
],
"metrics_summary": {
"api-lambda": {
"invocations": 1250,
"errors": 65,
"error_rate": "5.2%",
"duration_avg": "245ms",
"throttles": 0
}
},
"recommendations": [
"Investigate api-lambda errors (5.2% error rate)",
"Consider increasing Lambda memory (avg duration 245ms)",
"Review database connection pooling"
],
"report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}
</OUTPUTS>
<HANDLERS>
<HOSTING>
To check resource status and query metrics:
hosting_handler = config.handlers.hosting.active
**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}
</HOSTING>
</HANDLERS>
<DOCUMENTATION>
After monitoring operation:
- Save monitoring report
- Update monitoring history
- Track metric trends over time
Reports are stored in:
- .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Historical trends in monitoring-history.json
</DOCUMENTATION>
<ERROR_HANDLING>
<CLOUDWATCH_ACCESS_ERROR>
Pattern: AccessDenied for CloudWatch operations
Action:
1. Check if CloudWatch permissions granted
2. Suggest adding cloudwatch:GetMetricStatistics, logs:FilterLogEvents
3. Delegate to infra-permission-manager if needed
</CLOUDWATCH_ACCESS_ERROR>
<RESOURCE_NOT_FOUND>
Pattern: Resource doesn't exist in AWS
Action:
1. Check if resource listed in registry but deleted
2. Warn about registry drift
3. Suggest verifying deployment
</RESOURCE_NOT_FOUND>
<METRICS_NOT_AVAILABLE>
Pattern: No metrics data for resource
Action:
1. Check if resource recently created (metrics may lag)
2. Verify CloudWatch logging enabled
3. Report as "status unknown" rather than failing
</METRICS_NOT_AVAILABLE>
</ERROR_HANDLING>
<HEALTH_STATUS_CRITERIA>
Resources are classified as:
HEALTHY:
- Resource exists and is running
- All metrics within thresholds
- No errors or minimal error rate (<0.1%)
- Performance acceptable
DEGRADED:
- Resource exists and is running
- Some metrics approaching thresholds (>80%)
- Elevated error rate (0.1% - 1%)
- Performance slightly degraded
UNHEALTHY:
- Resource doesn't exist or is stopped
- Metrics exceed thresholds
- High error rate (>1%)
- Performance severely degraded
- Resource in failed state
UNKNOWN:
- Cannot determine status
- Metrics not available
- CloudWatch access issues
</HEALTH_STATUS_CRITERIA>
<METRICS_BY_RESOURCE_TYPE>
Lambda:
- Invocations (count)
- Errors (count)
- Duration (ms)
- Throttles (count)
- ConcurrentExecutions (count)
- Error rate = Errors / Invocations * 100
S3:
- BucketSizeBytes (bytes)
- NumberOfObjects (count)
- 4xxErrors (count)
- 5xxErrors (count)
RDS:
- CPUUtilization (percent)
- DatabaseConnections (count)
- FreeableMemory (bytes)
- ReadLatency (seconds)
- WriteLatency (seconds)
ECS:
- CPUUtilization (percent)
- MemoryUtilization (percent)
- RunningTaskCount (count)
- DesiredTaskCount (count)
API Gateway:
- Count (requests)
- 4XXError (count)
- 5XXError (count)
- Latency (ms)
- IntegrationLatency (ms)
</METRICS_BY_RESOURCE_TYPE>
<EXAMPLES>
<example>
Input: operation=health-check, environment=test
Start: "š STARTING: Operations Monitoring / Operation: health-check / Environment: test"
Process:
- Load registry: 5 resources found
- Check Lambda: healthy (0.1% errors, 150ms avg)
- Check S3: healthy (no errors)
- Check RDS: healthy (25% CPU, good latency)
- Check ECS: degraded (high CPU 85%)
- Check API Gateway: healthy
- Overall: DEGRADED (1 degraded resource)
Completion: "ā ļø COMPLETED: Operations Monitoring (Issues Found) / Status: DEGRADED / Unhealthy: 0 / Degraded: 1"
Output: {
overall_health: "DEGRADED",
resources: {healthy: 4, degraded: 1, unhealthy: 0},
issues: [{severity: "MEDIUM", resource: "ecs-service", issue: "High CPU utilization"}]
}
</example>
<example>
Input: operation=performance-analysis, environment=prod, service=api-lambda, timeframe=24h
Start: "š STARTING: Operations Monitoring / Operation: performance-analysis / Service: api-lambda"
Process:
- Load metrics for last 24 hours
- Analyze invocations trend: steady 1000/hour
- Analyze duration: increasing from 200ms to 300ms
- Analyze errors: spike from 0.5% to 2% at 2pm
- Identify anomaly: sudden error rate increase
- Correlate with duration increase
Completion: "ā
COMPLETED: Operations Monitoring / Anomaly detected: Error rate spike at 2pm"
Output: {
overall_health: "DEGRADED",
anomalies: [{time: "2pm", metric: "ErrorRate", change: "+150%"}],
recommendations: ["Investigate api-lambda errors at 2pm", "Check database performance"]
}
</example>
<example>
Input: operation=metrics-query, environment=test, service=api-lambda, metric=Duration
Start: "š STARTING: Operations Monitoring / Operation: metrics-query / Metric: Duration"
Process:
- Query CloudWatch for Lambda Duration metric
- Timeframe: last 1 hour
- Get statistics: avg, min, max, p95, p99
- Format results
Completion: "ā
COMPLETED: Operations Monitoring / Duration metrics retrieved"
Output: {
metric: "Duration",
statistics: {avg: 245, min: 120, max: 890, p95: 450, p99: 720},
unit: "milliseconds"
}
</example>
</EXAMPLES>