npx claudepluginhub arbazkhan971/godmodeThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
/godmode:observe, "add monitoring", "set up logging"| Pillar | Status | Coverage | Tools |
| Metrics | PARTIAL | 40% | Prometheus |
| Logging | BASIC | 60% | stdout |
| Tracing | NONE | 0% | — |
| Alerting | MINIMAL | 20% | PagerDuty |
# Detect observability libraries
grep -rE "prom-client|datadog|opentelemetry" \
package.json requirements.txt 2>/dev/null
RED (request-driven):
http_requests_total (counter: method, path, status)
http_request_duration_sec (histogram: method, path)
http_requests_in_flight (gauge)
USE (infrastructure):
CPU, memory, disk, network, connections
Business:
user_signups_total, orders_completed_total,
payment_failures_total, active_sessions
IF cardinality > 10K labels: remove high-cardinality (user_id, request_id go in logs, not metrics).
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "info",
"service": "api-gateway",
"trace_id": "abc123",
"request_id": "req-456",
"message": "request completed",
"duration_ms": 45
}
| Level | When | Alert? |
| FATAL | Cannot continue | Page on-call |
| ERROR | Operation failed | Alert channel |
| WARN | Unexpected but handled | Track trend |
| INFO | Business events | No |
| DEBUG | Diagnostic (dev only) | No |
NEVER log PII, tokens, passwords, or credit cards. ALWAYS include request_id in every log entry.
OpenTelemetry setup (recommended):
Auto-instrument: HTTP, DB, Redis, gRPC
Propagate: trace_id across all service calls
Backend: Jaeger|Tempo|Zipkin|Datadog
IF sampling rate too low (< 1%): miss rare errors. IF 100% sampling: too expensive at scale. Default: 10% sampling, 100% for errors.
| SLI | Target | Window | Error Budget |
| Availability | 99.9% | 30d | 43.2 min |
| Latency p99 | < 500ms | 30d | 0.1% slow |
| Error rate | < 0.1% | 30d | budget: 0.1% |
Burn rate alerts: Fast: 14.4x over 1h (consumes 2% budget) Slow: 6x over 6h (consumes 5% budget)
| Alert | Condition | Severity |
| HighErrorRate | 5xx > 1% for 5m | Critical |
| HighLatencyP99 | p99 > 2s for 5m | Critical |
| DiskSpaceLow | usage > 85% | Warning |
| PodCrashLoop | restarts > 3/5m | Critical |
EVERY alert MUST have for duration (min 2m critical,
5m warning). No flapping alerts.
Row 1: Request Rate, Error Rate, Latency, Saturation
Row 2: SLO status, error budget remaining
Row 3: Per-service breakdown
Row 4: Infrastructure (CPU, memory, disk, network)
for duration.Append .godmode/observe-results.tsv:
timestamp pillar tool items_configured coverage_pct status
KEEP if: pillar produces verified output AND
latency increase < 5%.
DISCARD if: no output OR perf degrades > 5%
OR cardinality explosion.
STOP when FIRST of:
- All 3 pillars verified (metrics/logs/traces)
- SLOs defined + alerts configured
- User requests stop
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Cardinality explosion | Remove labels, use histograms |
| Missing trace spans | Check sampling, propagation |
| Alert fatigue | Tune thresholds, add for: duration |
| Log volume exceeds budget | Drop DEBUG, sample paths |