From amg-toolkit
Fleet-wide PostgreSQL Flexible Server health check — scans CPU, memory, storage, IOPS, disk bandwidth, and connection metrics across all servers, then deep-dives into abnormal servers with resource logs and correlation analysis. Tracks known issues across sessions via persistent report. Uses AMG-MCP pulse check for Tier 1 triage, then batched Azure Monitor queries for Tier 2 investigation. On first run, auto-discovers datasource UID and prompts for subscription ID.
npx claudepluginhub azure/amg-skills --plugin amg-toolkit
Slash command: /amg-toolkit:amg-check-pg-flex
- Current UTC time: !`date -u +%Y-%m-%dT%H:%M:%SZ`
- Config: !`cat memory/amg-check-pg-flex/config.md 2>/dev/null || echo "NOT_CONFIGURED"`
- Report: !`[ -f memory/amg-check-pg-flex/report.md ] && echo "exists ($(grep -c '^### BUG-' memory/amg-check-pg-flex/report.md) bugs documented)" || echo "not found"`
Known Issues: Before presenting findings, cross-reference results against memory/amg-check-pg-flex/report.md.
- Use from/to for time ranges; NEVER use timespan (it causes errors).
- Use node -e "..." if installed; otherwise fall back to python -c "...", jq, or pwsh -Command "...". Bash permission for the chosen interpreter will be prompted on first use.
Update checkboxes as you complete each phase:
memory/amg-check-pg-flex/report.md
If Config shows NOT_CONFIGURED: Run First-Run Setup at the bottom of this file, then return here.
If Config is populated: Extract the datasource UID and subscription ID from the pre-loaded Runtime Context above and use them for all queries. Use $1 as the subscription override if provided.
- Datasource UID: `## Azure Monitor Datasource > UID`
- Subscription ID: `## Subscription` (or $1 if provided)
- Resource type: microsoft.dbforpostgresql/flexibleservers (lowercase)
- Resource ID: /subscriptions/{SUB}/resourceGroups/{RG}/providers/Microsoft.DBforPostgreSQL/flexibleServers/{name}
- Time range: Default: 7 days for metrics, 24 hours for logs. Override with $0 (e.g., 3d). Keep log queries to 1-2 days to avoid timeouts.
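The default time ranges and the $0 override described above can be turned into explicit from/to values along these lines. This is a minimal sketch, not the skill's own code: `resolve_window` is a hypothetical helper, support for an `h` suffix is an assumption (the skill only shows `3d`), and the ISO 8601 UTC output format is borrowed from the `date -u` command in the runtime context rather than confirmed as what the query tools expect.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: overrides look like "3d" or "24h"; only "3d" appears in the skill.
_UNITS = {"d": "days", "h": "hours"}


def resolve_window(override, default, now):
    """Return (from, to) ISO 8601 UTC strings for a window ending at `now`."""
    span = override or default
    match = re.fullmatch(r"(\d+)([dh])", span)
    if match is None:
        raise ValueError(f"unsupported time range: {span!r}")
    delta = timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (now - delta).strftime(fmt), now.strftime(fmt)


now = datetime(2024, 5, 10, 12, 0, 0, tzinfo=timezone.utc)
print(resolve_window("3d", "7d", now))   # metrics window with a 3d override
print(resolve_window(None, "24h", now))  # logs window using the default
```

Passing `now` in explicitly keeps the helper deterministic, so the same pre-loaded UTC timestamp can be reused for every query in a run.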
Call amgmcp_datasource_list (no parameters). Find entry with type == "grafana-azure-monitor-datasource".
If the discovered UID differs from the one in memory/amg-check-pg-flex/config.md, warn user, use new UID.
azureMonitorDatasourceUid: {DATASOURCE_UID}
query: |
  resources
  | where type == 'microsoft.dbforpostgresql/flexibleservers'
  | where subscriptionId == '{SUBSCRIPTION_ID}'
  | project name, resourceGroup, location, properties.state, sku.name, sku.tier
  | order by location asc, name asc
If multiple subscriptions are configured, query each separately and merge results. Derive region summary by counting servers per location. Flag servers not in "Ready" state. Stop if zero servers found.
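The post-processing described above (merge, count per location, flag non-Ready servers, stop on zero) could look roughly like the following. It is a sketch under stated assumptions: `summarize_fleet` is a hypothetical helper, and the row keys `name`, `location`, and `state` are an assumed normalization of the query output, whose real column names may differ. The sample servers are made up.

```python
from collections import Counter


def summarize_fleet(servers):
    """Count servers per location and list those not in the Ready state.

    `servers` is the merged list of rows from all subscription queries.
    """
    if not servers:
        # The workflow says to stop when no servers are found.
        raise ValueError("no PostgreSQL Flexible Servers found; stopping")
    by_region = Counter(row["location"] for row in servers)
    not_ready = [row["name"] for row in servers if row["state"] != "Ready"]
    return dict(by_region), not_ready


# Illustrative rows only; real names and states come from Resource Graph.
servers = [
    {"name": "pg-a", "location": "eastus", "state": "Ready"},
    {"name": "pg-b", "location": "eastus", "state": "Stopped"},
    {"name": "pg-c", "location": "westeurope", "state": "Ready"},
]
regions, not_ready = summarize_fleet(servers)
print(regions)
print(not_ready)
```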
If any servers are not in "Ready" state, query the activity log for up to 3 of them:
azureMonitorDatasourceUid: {DATASOURCE_UID}
scope: /subscriptions/{SUB}/resourceGroups/{RG}/providers/Microsoft.DBforPostgreSQL/flexibleServers/{name}
startTime: now-3d
endTime: now
select: eventTimestamp,operationName,status,caller,subStatus
If the response exceeds 500 KB, retry with startTime: now-6h. Summarize: operations performed, caller type (service principal vs human), success/in-progress status, likely cause.
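The size-based retry above can be sketched as a small wrapper. `query_activity_log` is a hypothetical callable standing in for whichever tool performs the activity log query; its signature is an assumption, as is reading "500 KB" as 500 * 1024 bytes of serialized JSON.

```python
import json

MAX_RESPONSE_BYTES = 500 * 1024  # assumption: "500 KB" means 500 * 1024 bytes


def fetch_activity_log(query_activity_log, scope):
    """Query 3 days of activity log, narrowing to 6 hours if the result is too large.

    `query_activity_log(scope, start_time, end_time)` is a hypothetical
    stand-in that returns JSON-serializable data.
    """
    response = query_activity_log(scope, "now-3d", "now")
    size = len(json.dumps(response).encode("utf-8"))
    if size > MAX_RESPONSE_BYTES:
        # Oversized response: retry once with the shorter window.
        response = query_activity_log(scope, "now-6h", "now")
    return response
```

Injecting the query function keeps the retry rule separate from the tool call, so the rule can be exercised with a fake before touching a real datasource.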
azureMonitorDatasourceUid: {DATASOURCE_UID}
pastDays: 7
scenarios: pg_flex
Scans all servers across 5 scenarios: pg_flex_cpu, pg_flex_memory, pg_flex_storage, pg_flex_disk_iops, pg_flex_disk_bandwidth.
Before moving to Phase 3, verify:
- scanSummary.totalResourcesScanned matches Phase 1 server count.
- Every scenario shows status: "completed" in scenarioResults.
- If errors is non-empty, retry affected scenarios individually (e.g., scenarios: pg_flex_cpu).
- Use amgmcp_query_resource_metric for unscanned servers.
Severity thresholds (findings array):
| Severity | CPU | Memory | Storage | Disk IOPS | Disk BW |
|---|---|---|---|---|---|
| Critical | >90% | >90% | >85% | >90% | >90% |
| Warning | >80% | >80% | >75% | >80% | >80% |
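To make the table concrete, here is a minimal sketch that maps a single utilization percentage to a Tier 1 severity. The function name and the metric keys are hypothetical; the numbers are copied from the table, and the comparisons are strict because the table uses `>`.

```python
# (warning, critical) percentage thresholds copied from the table above.
THRESHOLDS = {
    "cpu": (80, 90),
    "memory": (80, 90),
    "storage": (75, 85),
    "disk_iops": (80, 90),
    "disk_bw": (80, 90),
}


def tier1_severity(metric, percent):
    """Return "Critical", "Warning", or None for one metric reading."""
    warning, critical = THRESHOLDS[metric]
    if percent > critical:
        return "Critical"
    if percent > warning:
        return "Warning"
    return None


print(tier1_severity("storage", 80))  # above 75, not above 85
print(tier1_severity("cpu", 90))      # exactly 90 is not above the critical bound
```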
Read reference/phase3-deep-dive.md before starting Phase 3. It contains:
Read reference/phase4-resource-logs.md before starting Phase 4. It contains:
Present the report using the structure in reference/output-format.md.
Classification:
| Severity | Criteria |
|---|---|
| CRITICAL | is_db_alive sustained 0, OR CPU max >90%, OR Memory max >90%, OR Storage >85% |
| WARNING | CPU avg >80%, Memory avg >85%, sustained >60% for 6h+, spike >30pp in 1h, Storage >70%, connections_failed >0, deadlocks >0, disk IOPS/BW >80%, transactionIDs >1B, longest query >300s, replication delay >30s |
| HEALTHY | All metrics within normal ranges |
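A partial sketch of this classification follows. It covers only the criteria that reduce to a single number per server; the sustained-utilization, spike, disk IOPS/BW, and transaction ID rules are left out. The dictionary keys are hypothetical names, and `db_down_sustained` is assumed to be computed by the caller, since the table does not define how long "sustained" is.

```python
def classify_server(m):
    """Classify one server from a dict of metric summaries (subset of criteria)."""
    if (
        m.get("db_down_sustained", False)
        or m.get("cpu_max", 0) > 90
        or m.get("memory_max", 0) > 90
        or m.get("storage_pct", 0) > 85
    ):
        return "CRITICAL"
    if (
        m.get("cpu_avg", 0) > 80
        or m.get("memory_avg", 0) > 85
        or m.get("storage_pct", 0) > 70
        or m.get("connections_failed", 0) > 0
        or m.get("deadlocks", 0) > 0
        or m.get("longest_query_s", 0) > 300
        or m.get("replication_delay_s", 0) > 30
    ):
        return "WARNING"
    return "HEALTHY"


print(classify_server({"cpu_max": 95}))
print(classify_server({"storage_pct": 72, "cpu_max": 60}))
print(classify_server({"cpu_avg": 40, "memory_avg": 50}))
```

Checking the critical conditions first means a server that satisfies both rows is reported once, at the higher severity.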
After presenting findings, update memory/amg-check-pg-flex/report.md:
Only add genuine issues: sustained high utilization, crash patterns, connection storms, persistent errors. Skip transient single-hour spikes or expected maintenance windows.
See reference/error-handling.md for the full recovery table.
Run only when Config shows NOT_CONFIGURED. After completing, return to the Workflow above.
1. Discover Datasource UID: Call amgmcp_datasource_list. Filter type == "grafana-azure-monitor-datasource". Prefer uid == "azure-monitor-oob" if multiple match. Abort if zero match.
2. Discover Subscription ID: Run this Resource Graph query to list all subscriptions with PostgreSQL Flexible Servers, then present the results as a table and ask the user which subscription(s) to use:
resources
| where type == 'microsoft.dbforpostgresql/flexibleservers'
| join kind=inner (
    resourcecontainers
    | where type == 'microsoft.resources/subscriptions'
    | project subscriptionId, subscriptionName=name
) on subscriptionId
| summarize ServerCount=count() by subscriptionId, subscriptionName
| order by ServerCount desc
Present the results as a table with columns: Subscription Name, Subscription ID, Server Count. Then ask the user: "Which subscription ID(s) should I configure for this health check?"
3. Write config: Write memory/amg-check-pg-flex/config.md:
# amg-check-pg-flex Configuration
User-specific values for the PostgreSQL Flexible Server health check skill.
This file is auto-generated on first run and can be edited manually.
## Azure Monitor Datasource
- **UID**: {discovered_uid}
- **Name**: {discovered_name}
## Subscription
- {subscription_id}
4. Confirm: Show the resolved config and ask for confirmation before proceeding.
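Step 1 of the setup above (filter by type, prefer the `azure-monitor-oob` UID, abort on zero matches) can be sketched as follows. `pick_datasource` is a hypothetical helper, the input is assumed to be a list of dicts with `type`, `uid`, and `name` keys, and falling back to the first match when the preferred UID is absent is an assumption the skill does not spell out. The sample entries are made up.

```python
AZURE_MONITOR_TYPE = "grafana-azure-monitor-datasource"
PREFERRED_UID = "azure-monitor-oob"


def pick_datasource(datasources):
    """Pick the Azure Monitor datasource from a datasource listing."""
    matches = [d for d in datasources if d.get("type") == AZURE_MONITOR_TYPE]
    if not matches:
        raise LookupError("no Azure Monitor datasource found; aborting setup")
    for candidate in matches:
        if candidate.get("uid") == PREFERRED_UID:
            return candidate
    # Assumption: with no preferred UID present, take the first match.
    return matches[0]


# Illustrative listing only.
listing = [
    {"uid": "prom-1", "type": "prometheus", "name": "Prometheus"},
    {"uid": "abc123", "type": AZURE_MONITOR_TYPE, "name": "Azure Monitor (custom)"},
    {"uid": "azure-monitor-oob", "type": AZURE_MONITOR_TYPE, "name": "Azure Monitor"},
]
print(pick_datasource(listing)["uid"])
```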