From agent-almanac
Builds production-ready Grafana dashboards with panels, template variables, annotations, and provisioning for Prometheus/Loki metrics. Use for SRE operational views, SLO reporting, or version-controlled deployments.
npx claudepluginhub pjt222/agent-almanac
Design and deploy Grafana dashboards with best practices for maintainability, reusability, and version control.
Creates, modifies, and organizes Grafana dashboards including panels (time series, stat, table, gauge), variables, transformations, thresholds, and alerting.
See Extended Examples for complete configuration files and templates.
Plan dashboard layout and organization before building panels.
Create a dashboard specification document:
# Service Overview Dashboard
## Purpose
Real-time operational view for on-call engineers monitoring the API service.
## Rows
1. High-Level Metrics (collapsed by default)
- Request rate, error rate, latency (RED metrics)
- Service uptime, instance count
2. Detailed Metrics (expanded by default)
- Per-endpoint latency breakdown
- Error rate by status code
- Database connection pool status
3. Resource Utilization
- CPU, memory, disk usage per instance
- Network I/O rates
4. Logs (collapsed by default)
- Recent errors from Loki
- Alert firing history
## Variables
- `environment`: production, staging, development
- `instance`: all instances or specific instance selection
- `interval`: aggregation window (5m, 15m, 1h)
## Annotations
- Deployment events from CI/CD system
- Alert firing/resolving events
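The RED metrics planned for row 1 can be expressed as PromQL queries; a minimal sketch, assuming a standard `http_requests_total` counter and an `http_request_duration_seconds` histogram:

```promql
# Rate: requests per second across all instances
sum(rate(http_requests_total{job="api-service"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{job="api-service",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api-service"}[5m]))

# Duration: 95th percentile latency from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="api-service"}[5m])) by (le))
```

Writing these queries down at the planning stage confirms the metrics actually exist before any panels are built.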
Key design principles:
Expected: Clear dashboard structure documented, stakeholders aligned on metrics and layout priorities.
On failure:
Build the dashboard foundation with reusable variables for filtering.
Create dashboard JSON structure (or use UI, then export):
{
"dashboard": {
"title": "API Service Overview",
"uid": "api-service-overview",
"version": 1,
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"refresh": "30s",
"templating": {
"list": [
{
"name": "environment",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{job=\"api-service\"}, environment)",
"multi": false,
"includeAll": false,
"refresh": 1,
"sort": 1,
"current": {
"selected": false,
"text": "production",
"value": "production"
}
},
{
"name": "instance",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{job=\"api-service\",environment=\"$environment\"}, instance)",
"multi": true,
"includeAll": true,
"refresh": 1,
"allValue": ".*",
"current": {
"selected": true,
"text": "All",
"value": "$__all"
}
},
{
"name": "interval",
"type": "interval",
"options": [
{"text": "1m", "value": "1m"},
{"text": "5m", "value": "5m"},
{"text": "15m", "value": "15m"},
{"text": "1h", "value": "1h"}
],
"current": {
"text": "5m",
"value": "5m"
},
"auto": false
}
]
},
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"enable": true,
"expr": "changes(app_version{job=\"api-service\",environment=\"$environment\"}[5m]) > 0",
"step": "60s",
"iconColor": "rgba(0, 211, 255, 1)",
"tagKeys": "version"
}
]
}
}
}
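Before committing an exported dashboard, it helps to sanity-check the JSON. A minimal sketch; the required-key list here is an assumption for illustration, not an official Grafana schema:

```python
import json

# Keys this sketch treats as required; adjust for your own conventions.
REQUIRED_KEYS = {"title", "uid", "panels"}

def check_dashboard(raw: str) -> list[str]:
    """Return a list of problems found in a dashboard JSON string."""
    problems = []
    doc = json.loads(raw)  # raises ValueError on malformed JSON
    dash = doc.get("dashboard", doc)  # accept wrapped or bare dashboards
    for key in REQUIRED_KEYS - dash.keys():
        problems.append(f"missing key: {key}")
    return problems

raw = '{"dashboard": {"title": "API Service Overview", "uid": "api-service-overview"}}'
print(check_dashboard(raw))  # ['missing key: panels']
```

Checking `uid` in particular matters because provisioned dashboards are matched by uid across deployments.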
Variable types and use cases: query variables pull values from the data source (e.g. `label_values()`, `query_result()`); interval variables control aggregation windows; custom and constant variables hold fixed lists.
Expected: Variables populate correctly from data source, cascading filters work (environment filters instances), default selections appropriate.
On failure: Check the `allValue` field for multi-select variables.
Create panels for each metric with appropriate visualization types.
Time series panel (request rate):
{
"type": "timeseries",
"title": "Request Rate",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\",environment=\"$environment\",instance=~\"$instance\"}[$interval])) by (method)",
"legendFormat": "{{method}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"color": {
"mode": "palette-classic"
},
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 10,
"spanNulls": true
},
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 1000, "color": "yellow"},
{"value": 5000, "color": "red"}
]
}
}
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max", "last"]
}
}
}
Stat panel (error rate):
{
"type": "stat",
"title": "Error Rate",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"targets": [
{
# ... (see EXAMPLES.md for complete configuration)
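As an illustrative sketch only (the authoritative version lives in EXAMPLES.md), a complete error-rate stat panel might look like this; the 1% and 5% thresholds are assumptions to adapt to your SLOs:

```json
{
  "type": "stat",
  "title": "Error Rate",
  "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=\"api-service\",environment=\"$environment\",status=~\"5..\"}[$interval])) / sum(rate(http_requests_total{job=\"api-service\",environment=\"$environment\"}[$interval]))",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": null, "color": "green"},
          {"value": 0.01, "color": "yellow"},
          {"value": 0.05, "color": "red"}
        ]
      }
    }
  },
  "options": {"colorMode": "background", "graphMode": "area"}
}
```

The `percentunit` unit renders the 0-1 ratio as a percentage, so the query needs no `* 100`.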
Heatmap panel (latency distribution):
{
"type": "heatmap",
"title": "Request Duration Heatmap",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
# ... (see EXAMPLES.md for complete configuration)
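A hedged sketch of the heatmap targets, assuming the service exposes an `http_request_duration_seconds` Prometheus histogram:

```json
{
  "targets": [
    {
      "expr": "sum(increase(http_request_duration_seconds_bucket{job=\"api-service\",environment=\"$environment\"}[$interval])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}",
      "refId": "A"
    }
  ]
}
```

Setting `"format": "heatmap"` tells Grafana the series are cumulative histogram buckets keyed by the `le` label, so it can de-accumulate them into per-bucket counts.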
Panel selection guide:
Expected: Panels render correctly with data, visualizations match intended metric types, legends descriptive, thresholds highlight problems.
On failure:
Organize panels into collapsible rows for logical grouping.
{
"panels": [
{
"type": "row",
"title": "High-Level Metrics",
"collapsed": false,
# ... (see EXAMPLES.md for complete configuration)
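For reference, a collapsed row carries its child panels inside its own `panels` array (expanded rows list children at the top level instead); a minimal sketch:

```json
{
  "type": "row",
  "title": "Logs",
  "collapsed": true,
  "gridPos": {"h": 1, "w": 24, "x": 0, "y": 16},
  "panels": [
    {"type": "logs", "title": "Recent Errors", "gridPos": {"h": 8, "w": 24, "x": 0, "y": 17}}
  ]
}
```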
Layout best practices: position panels with `gridPos`, where `w` (width, out of a 24-column grid) and `h` (height) control size.
Expected: Dashboard layout organized logically, rows collapse/expand correctly, panels align visually without gaps.
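Since the grid is 24 units wide, two half-width panels sit side by side like this:

```json
[
  {"title": "Request Rate", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
  {"title": "Latency",      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}}
]
```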
On failure:
Create navigation paths between related dashboards.
Dashboard-level links in JSON:
{
"links": [
{
"title": "Service Details",
"type": "link",
"icon": "external link",
# ... (see EXAMPLES.md for complete configuration)
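A sketch of a complete dashboard link with variable and time-range propagation; the target uid `api-service-details` is hypothetical:

```json
{
  "links": [
    {
      "title": "Service Details",
      "type": "link",
      "icon": "external link",
      "url": "/d/api-service-details/api-service-details",
      "keepTime": true,
      "includeVars": true,
      "targetBlank": false
    }
  ]
}
```

`keepTime` carries the current time range to the target dashboard; `includeVars` appends the current variable selections as `var-*` URL parameters.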
Panel-level data links:
{
"fieldConfig": {
"defaults": {
"links": [
{
"title": "View Logs for ${__field.labels.instance}",
# ... (see EXAMPLES.md for complete configuration)
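A sketch of a data link that jumps to a logs dashboard for the clicked series; the `api-service-logs` uid and its `instance` variable are assumptions:

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View Logs for ${__field.labels.instance}",
          "url": "/d/api-service-logs/logs?var-instance=${__field.labels.instance}&from=${__from}&to=${__to}",
          "targetBlank": true
        }
      ]
    }
  }
}
```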
Link variables:
- `$service`, `$environment`: Dashboard template variables
- `${__field.labels.instance}`: Label value from clicked data point
- `${__from}`, `${__to}`: Current dashboard time range
- `$__url_time_range`: Encoded time range for URL
Expected: Clicking panel elements or dashboard links navigates to related views with context preserved (time range, variables).
On failure: Verify the `includeVars` and `keepTime` flags work as expected.
Version control dashboards as code for reproducible deployments.
Create provisioning directory structure:
mkdir -p /etc/grafana/provisioning/{dashboards,datasources}
Datasource provisioning (/etc/grafana/provisioning/datasources/prometheus.yml):
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
# ... (see EXAMPLES.md for complete configuration)
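A hedged sketch of the remaining datasource fields; the URL assumes Prometheus is reachable at hostname `prometheus` (e.g. on the same Docker network):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: 15s
```

Matching `timeInterval` to the Prometheus scrape interval prevents panels from rendering misleadingly smooth lines at sub-scrape resolutions.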
Dashboard provisioning (/etc/grafana/provisioning/dashboards/default.yml):
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'Services'
type: file
disableDeletion: false
updateIntervalSeconds: 30
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Store dashboard JSON files in /var/lib/grafana/dashboards/:
/var/lib/grafana/dashboards/
├── api-service/
│ ├── overview.json
│ └── details.json
├── database/
│ └── postgres.json
└── infrastructure/
├── nodes.json
└── kubernetes.json
Using Docker Compose:
version: '3.8'
services:
grafana:
image: grafana/grafana:10.2.0
ports:
- "3000:3000"
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
Expected: Dashboards automatically loaded on Grafana startup, changes to JSON files reflected after update interval, version control tracks dashboard changes.
On failure:
- Check provisioning logs: `docker logs grafana | grep -i provisioning`
- Validate JSON syntax: `python -m json.tool dashboard.json`
- Fix file permissions: `chmod 644 *.json`
- Set `allowUiUpdates: false` to prevent UI modifications
- Reload provisioned dashboards: `curl http://localhost:3000/api/admin/provisioning/dashboards/reload -X POST -H "Authorization: Bearer $GRAFANA_API_KEY"`
Tips:
- Use `$variable` syntax, not hardcoded values. Check variable refresh settings.
- Use `legendFormat` to show only relevant labels, not the full metric name. Example: `{{method}} - {{status}}` instead of the default.
- Set `allowUiUpdates: false` in production.
- Use `${__field.labels.labelname}` carefully; verify the label exists in the query result.
Related skills:
- setup-prometheus-monitoring - Configure Prometheus data sources that feed Grafana dashboards
- configure-log-aggregation - Set up Loki for log panel queries and log-based annotations
- define-slo-sli-sla - Visualize SLO compliance and error budgets with Grafana stat and gauge panels
- instrument-distributed-tracing - Add trace ID links from metrics panels to Tempo trace views