Guide to Grafana Cloud AI/ML setup: the Assistant for natural-language queries, dashboards, and incidents; Dynamic Alerting for ML forecasting and outlier detection; Sift and the Knowledge Graph for root cause analysis; and the LLM plugin for OpenAI and Anthropic integration.
> **Docs**: https://grafana.com/docs/grafana-cloud/alerting-and-irm/machine-learning/
## Grafana Assistant

Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.
Capabilities:

- **Assistant Investigations** (public preview): multi-agent autonomous incident analysis that launches multiple specialized agents in parallel to investigate different signals.

Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant

In the panel editor, click the magic wand ("Assistant") icon to get query suggestions and explanations.
## Dynamic Alerting

ML-based alerting without static thresholds.

### Forecasting

Trained on 90 days of history; learns daily and weekly seasonality patterns.
```shell
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cpu-forecast",
    "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
    "datasourceId": 1,
    "interval": 300,
    "trainingWindow": "90d",
    "forecastWindow": "7d",
    "algorithm": { "name": "prophet", "config": {} }
  }'
```
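The same request can be scripted so the payload is validated locally before it is sent. A minimal sketch; the `STACK` and `GRAFANA_TOKEN` variables and the token guard are conveniences introduced here, not part of the ML API:

```shell
# Build the forecast payload in a variable so it can be sanity-checked
# before POSTing. STACK and GRAFANA_TOKEN are illustrative names.
STACK="https://yourstack.grafana.net"
PAYLOAD='{
  "name": "cpu-forecast",
  "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
  "datasourceId": 1,
  "interval": 300,
  "trainingWindow": "90d",
  "forecastWindow": "7d",
  "algorithm": { "name": "prophet", "config": {} }
}'
# Validate the JSON locally (python3 assumed available)
echo "$PAYLOAD" | python3 -m json.tool >/dev/null && echo "payload: valid JSON"
# Only hit the API when a token is actually configured
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-ml-app/resources/ml/v1/forecast" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
fi
```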
Generated metric pairs for alert rules:

```promql
# Predicted value
ml_forecast{job="cpu-forecast"}

# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}

# Alert: actual > upper bound (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  > ml_forecast_upper{job="cpu-forecast"} * 1.1
```
### Outlier Detection

Detects when one series in a group deviates from its peers.
```shell
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "service-error-outliers",
    "metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
    "datasourceId": 1,
    "interval": 300,
    "algorithm": {
      "name": "dbscan",
      "sensitivity": 0.5,
      "config": { "epsilon": 0.5 }
    }
  }'
```
```promql
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
```
Example alert rules using the generated forecast bounds and outlier scores:

```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage significantly above forecast"
      - alert: ServiceErrorRateAnomaly
        expr: ml_outlier_score{job="service-error-outliers"} > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Anomalous error rate on {{ $labels.service }}"
```
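If the rule group is kept as a file, it can be syntax-checked before upload. A sketch using `promtool check rules`, which ships with Prometheus and understands this rule format; the filename and the install guard are illustrative:

```shell
# Write the rule group to a file and lint it. The guard skips the
# check when promtool is not installed locally.
cat > ml-alerts.yaml <<'EOF'
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
EOF
if command -v promtool >/dev/null 2>&1; then
  promtool check rules ml-alerts.yaml
else
  echo "promtool not found; skipping lint"
fi
echo "wrote ml-alerts.yaml"
```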
## Sift

Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.

8 Analysis Types:
| Analysis | What it checks |
|---|---|
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |
Trigger Sift from the command palette (`Cmd+K`) → "Start Sift investigation", or via the API:

```shell
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-latency-spike",
    "start": "2024-02-01T10:00:00Z",
    "end": "2024-02-01T10:30:00Z",
    "filters": { "service": "checkout", "namespace": "production" }
  }'
```
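Assuming the POST above returns an investigation `id` and that individual investigations can be fetched with a GET on the same collection (neither is confirmed by the docs excerpted here), progress could be polled like this:

```shell
# Hypothetical polling sketch: the GET-by-id endpoint and the "id"
# field are assumptions; verify against the Sift API for your stack.
STACK="https://yourstack.grafana.net"
BASE="$STACK/api/plugins/grafana-sift-app/resources/sift/v1/investigations"
INVESTIGATION_ID="${INVESTIGATION_ID:-demo-id}"   # normally parsed from the POST response
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s "$BASE/$INVESTIGATION_ID" -H "Authorization: Bearer $GRAFANA_TOKEN"
else
  echo "GET $BASE/$INVESTIGATION_ID"
fi
```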
## Knowledge Graph

Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.
Access: Observability → Entity graph
Search syntax:

```
Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123
```
RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.
## LLM Plugin

Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.
Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM
Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.
Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"
```yaml
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
  - type: grafana-llm-app
    jsonData:
      # OpenAI
      openAIUrl: https://api.openai.com
      openAIModel: gpt-4o
      # Or Anthropic:
      # provider: anthropic
      # anthropicModel: claude-sonnet-4-6
      # Or Azure OpenAI:
      # openAIUrl: https://your-resource.openai.azure.com
      # azureModelMapping: '[["gpt-4o","your-deployment-name"]]'
    secureJsonData:
      openAIKey: sk-your-openai-key
```
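Once enabled, panels and plugins call the provider through the plugin's resource API instead of holding their own keys. A sketch of a chat-completion call through the proxy; the exact resource path is an assumption to verify against your plugin version:

```shell
# Assumed proxy path: the grafana-llm-app exposes an OpenAI-compatible
# endpoint under its plugin resources (unverified here).
STACK="https://yourstack.grafana.net"
BODY='{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this alert"}]}'
# Validate the request body locally before sending
echo "$BODY" | python3 -m json.tool >/dev/null && echo "body: valid JSON"
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-llm-app/resources/openai/v1/chat/completions" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$BODY"
fi
```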
## Adaptive Metrics

Identifies unused metrics to reduce cardinality and storage costs.
```shell
# Get aggregation recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
  -H "Authorization: Bearer <token>"
```
Aggregation rule (drops high-cardinality labels):

```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
  # Drops: pod, container, instance
```
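To act on recommendations, a rules file like the one above would be uploaded. A sketch; the `.../v1/rules` endpoint mirrors the recommendations path but is an assumption, so verify it against the Adaptive Metrics API docs before use:

```shell
# Hypothetical rules upload: the /v1/rules path is unverified.
STACK="https://yourstack.grafana.net"
cat > aggregation-rules.yaml <<'EOF'
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
EOF
if [ -n "${GRAFANA_TOKEN:-}" ]; then
  curl -s -X POST "$STACK/api/plugins/grafana-adaptive-metrics-app/resources/v1/rules" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    --data-binary @aggregation-rules.yaml
else
  echo "wrote aggregation-rules.yaml"
fi
```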