Guide to Grafana Cloud AI/ML setup: the Assistant for natural-language queries, dashboards, and incidents; Dynamic Alerting for ML forecasting and outlier detection; Sift and the Knowledge Graph for root cause analysis; and the LLM plugin for OpenAI and Anthropic integration.
> **Docs**: https://grafana.com/docs/grafana-cloud/alerting-and-irm/machine-learning/
## Grafana Assistant

Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.
Capabilities:

- **Assistant Investigations** (public preview): multi-agent autonomous incident analysis that launches multiple specialized agents in parallel to investigate different signals.

Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant

In the panel editor, click the magic wand ("Assistant") icon to get query suggestions and explanations.
## Dynamic Alerting

ML-based alerting without static thresholds.

### Forecasting

Trained on 90 days of history; learns daily and weekly seasonality patterns.
```shell
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cpu-forecast",
    "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
    "datasourceId": 1,
    "interval": 300,
    "trainingWindow": "90d",
    "forecastWindow": "7d",
    "algorithm": { "name": "prophet", "config": {} }
  }'
```
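The same request can be scripted so the payload is validated locally before it is sent. A minimal sketch; the `STACK` and `GRAFANA_TOKEN` variables and the token guard are conveniences introduced here, not part of the ML API:

```shell
# Build the forecast payload in a variable so it can be sanity-checked
# before POSTing. STACK and GRAFANA_TOKEN are illustrative names.
STACK="https://yourstack.grafana.net"
PAYLOAD='{
  "name": "cpu-forecast",
  "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
  "datasourceId": 1,
  "interval": 300,
  "trainingWindow": "90d",
  "forecastWindow": "7d",
  "algorithm": { "name": "prophet", "config": {} }
}'
# Validate the JSON locally (python3 assumed available)
echo "$PAYLOAD" | python3 -m json.tool >/dev/null && echo "payload: valid JSON"
# Only hit the API when a token is actually configured
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-ml-app/resources/ml/v1/forecast" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
fi
```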
Generated metric pairs for alert rules:

```promql
# Predicted value
ml_forecast{job="cpu-forecast"}

# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}

# Alert: actual > upper bound (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  > ml_forecast_upper{job="cpu-forecast"} * 1.1
```
### Outlier Detection

Detects when one series in a group deviates from its peers.
```shell
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "service-error-outliers",
    "metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
    "datasourceId": 1,
    "interval": 300,
    "algorithm": {
      "name": "dbscan",
      "sensitivity": 0.5,
      "config": { "epsilon": 0.5 }
    }
  }'
```
```promql
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
```
Example alert rules using the generated forecast bounds and outlier scores:

```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage significantly above forecast"
      - alert: ServiceErrorRateAnomaly
        expr: ml_outlier_score{job="service-error-outliers"} > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Anomalous error rate on {{ $labels.service }}"
```
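If the rule group is kept as a file, it can be syntax-checked before upload. A sketch using `promtool check rules`, which ships with Prometheus and understands this rule format; the filename and the install guard are illustrative:

```shell
# Write the rule group to a file and lint it. The guard skips the
# check when promtool is not installed locally.
cat > ml-alerts.yaml <<'EOF'
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
EOF
if command -v promtool >/dev/null 2>&1; then
  promtool check rules ml-alerts.yaml
else
  echo "promtool not found; skipping lint"
fi
echo "wrote ml-alerts.yaml"
```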
## Sift

Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.

8 Analysis Types:
| Analysis | What it checks |
|---|---|
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |
Trigger Sift from the command palette (`Cmd+K`) → "Start Sift investigation", or via the API:

```shell
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-latency-spike",
    "start": "2024-02-01T10:00:00Z",
    "end": "2024-02-01T10:30:00Z",
    "filters": { "service": "checkout", "namespace": "production" }
  }'
```
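Assuming the POST above returns an investigation `id` and that individual investigations can be fetched with a GET on the same collection (neither is confirmed by the docs excerpted here), progress could be polled like this:

```shell
# Hypothetical polling sketch: the GET-by-id endpoint and the "id"
# field are assumptions; verify against the Sift API for your stack.
STACK="https://yourstack.grafana.net"
BASE="$STACK/api/plugins/grafana-sift-app/resources/sift/v1/investigations"
INVESTIGATION_ID="${INVESTIGATION_ID:-demo-id}"   # normally parsed from the POST response
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s "$BASE/$INVESTIGATION_ID" -H "Authorization: Bearer $GRAFANA_TOKEN"
else
  echo "GET $BASE/$INVESTIGATION_ID"
fi
```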
## Knowledge Graph

Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.
Access: Observability → Entity graph
Search syntax:

```
Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123
```
RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.
## LLM Plugin

Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.
Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM
Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.
Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"
```yaml
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
  - type: grafana-llm-app
    jsonData:
      # OpenAI
      openAIUrl: https://api.openai.com
      openAIModel: gpt-4o
      # Or Anthropic:
      # provider: anthropic
      # anthropicModel: claude-sonnet-4-6
      # Or Azure OpenAI:
      # openAIUrl: https://your-resource.openai.azure.com
      # azureModelMapping: '[["gpt-4o","your-deployment-name"]]'
    secureJsonData:
      openAIKey: sk-your-openai-key
```
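Once enabled, panels and plugins call the provider through the plugin's resource API instead of holding their own keys. A sketch of a chat-completion call through the proxy; the exact resource path is an assumption to verify against your plugin version:

```shell
# Assumed proxy path: the grafana-llm-app exposes an OpenAI-compatible
# endpoint under its plugin resources (unverified here).
STACK="https://yourstack.grafana.net"
BODY='{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this alert"}]}'
# Validate the request body locally before sending
echo "$BODY" | python3 -m json.tool >/dev/null && echo "body: valid JSON"
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-llm-app/resources/openai/v1/chat/completions" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$BODY"
fi
```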
## Adaptive Metrics

Identifies unused metrics to reduce cardinality and storage costs.
```shell
# Get aggregation recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
  -H "Authorization: Bearer <token>"
```
Aggregation rule (drops high-cardinality labels):

```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
  # Drops: pod, container, instance
```
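To act on recommendations, a rules file like the one above would be uploaded. A sketch; the `.../v1/rules` endpoint mirrors the recommendations path but is an assumption, so verify it against the Adaptive Metrics API docs before use:

```shell
# Hypothetical rules upload: the /v1/rules path is unverified.
STACK="https://yourstack.grafana.net"
cat > aggregation-rules.yaml <<'EOF'
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
EOF
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-adaptive-metrics-app/resources/v1/rules" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    --data-binary @aggregation-rules.yaml
else
  echo "wrote aggregation-rules.yaml"
fi
```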