From grafana-app-sdk
Audits and improves Grafana Loki label strategies using cardinality scoring, access-pattern alignment, and consistency checks to fix slow queries.
Install: npx claudepluginhub grafana/skills --plugin grafana-core
Slash command: /grafana-app-sdk:loki-label-analyzer
You are an expert in Grafana Loki label strategy. When asked to evaluate, audit, design, or improve a Loki label strategy — or when a user asks why their Loki queries are slow — use this guide to provide structured, actionable advice.
Streams are the fundamental unit in Loki. Each unique combination of label key-value pairs creates a new stream. Too many streams = performance problems. Too few = broad, slow queries.
Cardinality = the number of unique values a label can have. High-cardinality labels (like pod, user_id, request_id) dramatically increase stream count and hurt performance — especially when those labels are not specified in every query.
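The dual impact rule: High-cardinality labels hurt on both paths. On the write path they split data into many small streams and chunks and inflate the index; on the read path, any query that does not pin the label has to fetch and merge all of those streams.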
The key question for any dynamic label: "Will this label be used in 9 out of 10 queries?" If no → it should NOT be a label.
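Because every unique combination of label values is its own stream, the worst-case stream count is the product of the per-label cardinalities. The sketch below illustrates that arithmetic; the label names and counts are hypothetical, and real stream counts are usually lower because not every combination occurs.

```python
from math import prod


def max_streams(cardinalities: dict[str, int]) -> int:
    """Upper bound on stream count: product of per-label cardinalities."""
    return prod(cardinalities.values())


# Hypothetical label sets; the counts are illustrative, not measured.
base = {"env": 3, "level": 4, "namespace": 20}
with_pod = {**base, "pod": 1000}

print(max_streams(base))      # 240
print(max_streams(with_pod))  # 240000
```

Adding one high-cardinality label multiplies the bound rather than adding to it, which is why a single label such as pod can dominate the total.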
When auditing a label strategy, assess each label against these criteria.
| Label Example | Cardinality | Verdict |
|---|---|---|
| env (prod/staging/dev) | 2–5 values | ✅ Good |
| level (info/warn/error) | 3–6 values | ✅ Good |
| namespace (K8s) | Tens | ✅ Acceptable |
| instance / hostname | Hundreds–thousands | ⚠️ Evaluate access patterns |
| pod | Thousands + transient | ❌ Avoid as label |
| user_id, request_id | Unbounded | ❌ Never use as label |
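One way to apply the table together with the 9-out-of-10 question is a small decision helper. This is a sketch only: the function name, the cutoff of 100 values, and the wording of the verdicts are assumptions made for illustration, not thresholds defined by Loki.

```python
from typing import Optional


def label_verdict(cardinality: Optional[int], transient: bool,
                  query_fraction: float) -> str:
    """Rough verdict for a candidate label.

    cardinality=None means unbounded; query_fraction is the share of
    queries that specify the label (0.0 to 1.0).
    """
    if cardinality is None:
        return "never use as label"
    if transient:
        return "avoid as label"
    if cardinality < 100:  # assumed cutoff for "tens"
        return "good"
    # Hundreds to thousands: keep only if nearly every query uses the label.
    return "keep" if query_fraction >= 0.9 else "move out of labels"


print(label_verdict(3, False, 0.2))       # good
print(label_verdict(800, False, 0.3))     # move out of labels
print(label_verdict(None, False, 1.0))    # never use as label
```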
For each label, ask:
- Static labels (for example platform=linux, job=agent) add no cardinality cost relative to the query scope. Use freely for LBAC, exploration, and alert routing.
- Label names are case-sensitive (Level ≠ level)
- Normalize label values (INFO, info, Info should all become info)
- Use one naming convention (snake_case or camelCase; be consistent)

When auditing a label set, produce a report in this structure:
## Loki Label Strategy Audit
### Summary
[1-2 sentence overall assessment]
### Label Analysis
| Label | Cardinality | Used in Queries? | Verdict | Action |
|---|---|---|---|---|
| app | Low (tens) | Always | ✅ Keep | — |
| pod | Very High (transient)| Rarely | ❌ Remove | Move to structured metadata or embed in log line |
### Estimated Impact
- Stream count reduction: [X streams → Y streams]
- Query performance: [describe improvement]
- Storage impact: [if log line changes are involved]
### Recommended Label Set
[Final recommended labels]
### Migration Notes
[How to implement changes via Alloy/Agent pipeline stages]
Every log source should consider these base labels — all low cardinality, high query value:
| Label | Purpose |
|---|---|
| app / service | Identifying the generating application |
| env | Environment (prod, staging, dev) |
| cluster | Multi-cluster differentiation |
| region | Geographic region |
| level | Log severity; normalize to: info, warn, error, debug |
| job | Collector job name |
| team / squad | Ownership (also useful for LBAC) |
| source | Log origin type (file, k8s-events, journal, syslog, etc.) |
| classification | Data sensitivity level, for LBAC policies |
| Label | Description |
|---|---|
| namespace | K8s namespace; delineates isolation boundaries |
| container | Container name; low cardinality, differentiates log formats |
| service | K8s service generating logs |
| workload | {controller_kind}/{controller_name}, e.g. ReplicaSet/payment-api; strongly recommended |
Why workload beats app for K8s: workload is derived from {{controller_kind}}/{{controller_name}}, static values that do not churn the way pod names do. Unlike app (which may aggregate multiple workload types), workload is precise and predictable, so users always know exactly what value to query.
pod label ❌
- pod → 10 × N streams
- Use workload as the label; store pod in structured metadata or embed it in the log line

filename label (raw K8s path) ❌
- Raw path: /var/log/pods/{namespace}_{pod}_{pod_id}/{container}/{rotation}.log
- The pod_id component makes this unbounded
- Normalize to /var/log/pods/{namespace}/{controller_name}/{container}.log or drop entirely

// Normalize K8s filename to remove pod UID
stage.replace {
source = "filename"
expression = "/var/log/pods/([^/]+)_[^_]+_[^/]+/([^/]+)/\\d+\\.log"
replace = "/var/log/pods/$1/$2/current.log"
}
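To sanity-check the pattern before deploying it, the same expression can be exercised in plain Python. This sketch tests only what the regular expression matches and what the intended output looks like; it uses Python's re module rather than the collector's regex engine, Python's \1 and \2 stand in for $1 and $2, and the sample paths are hypothetical.

```python
import re

# Same pattern as the stage above, with config-string escaping removed.
PATTERN = re.compile(r"/var/log/pods/([^/]+)_[^_]+_[^/]+/([^/]+)/\d+\.log")


def normalize_k8s_filename(path: str) -> str:
    """Collapse namespace_pod_uid/container/N.log to namespace/container/current.log."""
    return PATTERN.sub(r"/var/log/pods/\1/\2/current.log", path)


# Two hypothetical pods of the same workload.
pod_a = ("/var/log/pods/default_payment-api-7f9d4b-xk2r9_"
         "0b1c2d3e-1111-2222-3333-444455556666/payment-api/0.log")
pod_b = ("/var/log/pods/default_payment-api-7f9d4b-q8w2z_"
         "9f8e7d6c-aaaa-bbbb-cccc-ddddeeeeffff/payment-api/3.log")

print(normalize_k8s_filename(pod_a))
# /var/log/pods/default/payment-api/current.log
```

Both sample paths normalize to the same value, so the label no longer creates a new stream per pod or per rotation.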
In addition to common labels, add:
| Label | Description | Notes |
|---|---|---|
| instance | Hostname of the machine | Cardinality = number of machines; acceptable for fixed infrastructure |
| filename | Full path to the file being tailed | Normalize rotating filenames: strip date suffixes |
// Remove date suffixes from rotating log file names
// /var/log/myapp/logfile-20230927.txt → /var/log/myapp/logfile.txt
stage.replace {
source = "filename"
expression = "-\\d{8}(\\.log|\\.txt)$"
replace = "$1"
}
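The date-suffix expression can be checked the same way. A minimal sketch, assuming Python's re module as a stand-in for the collector's regex engine and using the example path from the comment above:

```python
import re

# Same pattern as the stage above, with config-string escaping removed.
ROTATION_SUFFIX = re.compile(r"-\d{8}(\.log|\.txt)$")


def strip_date_suffix(path: str) -> str:
    """Remove a -YYYYMMDD suffix that precedes a .log or .txt extension."""
    return ROTATION_SUFFIX.sub(r"\1", path)


print(strip_date_suffix("/var/log/myapp/logfile-20230927.txt"))
# /var/log/myapp/logfile.txt
```

Paths without a date suffix pass through unchanged, so the stage is safe to apply to every file source.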
When collecting via loki.source.journal, many labels are auto-discovered under __journal__*:
boot_id, cap_effective, cmdline, comm, exe, gid, hostname, machine_id, pid, stream_id, systemd_cgroup, systemd_invocation_id, systemd_slice, systemd_unit, transport, uid
Almost all are high-cardinality. Keep only:
- instance: hostname where journal logs were collected
- unit: the systemd_unit name (e.g., nginx.service)

Drop everything else:
loki.process "journal_labels" {
forward_to = [...]
stage.label_keep {
values = ["instance", "unit", "env", "cluster"]
}
}
Structured metadata attaches key-value pairs to log entries without making them index labels. The ideal home for high-cardinality values users occasionally need.
Requires: Loki 2.9+, Grafana Agent/Alloy. Enable via limits_config:
limits_config:
  allow_structured_metadata: true
Good candidates for structured metadata (not labels):
- pod: K8s pod name
- node: K8s worker node
- version / image / tag
- trace_id / user_id
- process_id
- restarted: pod restart timestamp

Query structured metadata at query time without a parser:
{app="payment-api"} | pod="payment-api-7f9d4b-xk2r9"
When structured metadata isn't available, embed high-cardinality values into the log line rather than using them as labels.
loki.process "embed_pod" {
forward_to = [...]
// For JSON logs
stage.match {
selector = "{} |~ \"^\\s*\\{\""
stage.replace {
expression = "\\}$"
replace = ""
}
stage.template {
source = "log_line"
template = "{{ .Entry }},\"_pod\":\"{{ .pod }}\"}"
}
}
// For text logs
stage.match {
selector = "{} !~ \"^\\s*\\{\""
stage.template {
source = "log_line"
template = "{{ .Entry }} _pod={{ .pod }}"
}
}
stage.output { source = "log_line" }
}
Result: ts=... msg="..." _pod=agent-logs-cqhfk
Query by aggregate (normal use):
sum(count_over_time({workload="ReplicaSet/payment-api", level="error"}[1m]))
Query a specific pod (edge case debugging):
{workload="ReplicaSet/payment-api", level="error"} |= `_pod=payment-api-3`
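The embedding logic above can be sketched in Python to make the two branches explicit. This mirrors the intent of the pipeline (strip the closing brace of a JSON line and append a _pod field, or append _pod=... to a text line); the function name is hypothetical, and like the pipeline it does not handle an empty JSON object or trailing whitespace after the closing brace.

```python
import json
import re


def embed_pod(line: str, pod: str) -> str:
    """Append the pod name to a log line, keeping JSON lines valid JSON."""
    if re.match(r"^\s*\{", line):
        # JSON: drop the closing brace, append the field, close again.
        body = re.sub(r"\}$", "", line)
        return f'{body},"_pod":"{pod}"' + "}"
    # Plain text or logfmt: append a key=value pair.
    return f"{line} _pod={pod}"


json_line = embed_pod('{"msg":"ok"}', "agent-logs-cqhfk")
text_line = embed_pod('ts=1 msg="hi"', "agent-logs-cqhfk")

print(json_line)  # {"msg":"ok","_pod":"agent-logs-cqhfk"}
print(text_line)  # ts=1 msg="hi" _pod=agent-logs-cqhfk
```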
loki.process "pack_pod" {
forward_to = [...]
stage.pack {
labels = ["pod"]
ingest_timestamp = false
}
}
Packed result: {"_entry": "original log line", "pod": "agent-logs-cqhfk"}
Unpack at query time:
{workload="ReplicaSet/payment-api", level="error"}
|= `agent-logs-cqhfk`
| unpack
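Conceptually, packing wraps the original line and the chosen labels in one JSON object, and unpacking reverses that. The sketch below models the round trip in Python for illustration; the function names are hypothetical and it is not the Loki implementation.

```python
import json


def pack(line: str, labels: dict[str, str]) -> str:
    """Wrap the original line under _entry alongside the packed labels."""
    return json.dumps({"_entry": line, **labels})


def unpack(packed: str) -> tuple[str, dict[str, str]]:
    """Recover the original line and the packed labels."""
    fields = json.loads(packed)
    line = fields.pop("_entry")
    return line, fields


packed = pack("original log line", {"pod": "agent-logs-cqhfk"})
print(packed)          # {"_entry": "original log line", "pod": "agent-logs-cqhfk"}
print(unpack(packed))  # ('original log line', {'pod': 'agent-logs-cqhfk'})
```

The line filter in the query above runs before unpack so that only lines mentioning the pod are parsed.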
When a user reports slow queries, identify where time is spent using Querier metrics.go logs.
| Stage | Metric | High Value Means | Fix |
|---|---|---|---|
| Queue | queue_time | Not enough Queriers | Add Queriers or reduce parallelism |
| Index | chunk_refs_fetch_time | Need more Index Gateway instances | Scale index-gateways; check CPU |
| Storage | store_chunks_download_time | Chunks too small OR storage bottleneck | Check avg chunk size: total_bytes / cache_chunk_req |
| Execution | duration - chunk_refs_fetch_time - store_chunks_download_time | CPU-intensive regex, or too many tiny log lines | Reduce regex; add CPU; increase parallelism |
Ideally, the majority of time is spent in Execution. If not, that indicates infrastructure or label design problems.
avg chunk size = total_bytes / cache_chunk_req
If the result is a few hundred bytes or kilobytes (instead of megabytes), chunks are too small. This means labels are over-splitting data into too many streams. Revisit and reduce label cardinality.
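The breakdown in the table and the chunk-size formula can be worked through with a few lines of Python. This is a sketch under stated assumptions: the field names follow the table, the durations are assumed to be already converted to seconds, and the sample numbers are invented for illustration.

```python
def query_breakdown(stats: dict[str, float]) -> dict[str, float]:
    """Split a query's time by stage and compute the average chunk size."""
    execution = (stats["duration"]
                 - stats["chunk_refs_fetch_time"]
                 - stats["store_chunks_download_time"])
    return {
        "queue": stats["queue_time"],
        "index": stats["chunk_refs_fetch_time"],
        "storage": stats["store_chunks_download_time"],
        "execution": execution,
        "avg_chunk_bytes": stats["total_bytes"] / stats["cache_chunk_req"],
    }


# Hypothetical values for one slow query.
sample = {
    "duration": 12.0,
    "queue_time": 0.5,
    "chunk_refs_fetch_time": 1.0,
    "store_chunks_download_time": 8.0,
    "total_bytes": 50_000_000,
    "cache_chunk_req": 100_000,
}

result = query_breakdown(sample)
print(result["execution"])        # 3.0
print(result["avg_chunk_bytes"])  # 500.0
```

In this invented case most of the time goes to storage and the average chunk is about 500 bytes, which points at over-split streams rather than a slow query engine.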
Problem: Query scans too many streams

Problem: High post_filter_lines discard ratio (post_filter_lines << total_lines)
- Narrow the stream selector with more specific labels (level, workload, container)

Problem: Small chunks

- Add container or workload to narrow scope before line filters
- Add a level label and always use it in queries (filters out 94%+ of logs when searching for errors)
- Remove the pod label, which reduces stream count by ~5× in typical K8s deployments
- Replace regex filters (|~) with exact filters (|=) where possible

loki.process "normalize_level" {
forward_to = [...]
stage.replace { source = "level"; expression = "(?i)I(nfo)?"; replace = "info" }
stage.replace { source = "level"; expression = "(?i)W(arn(ing)?)?"; replace = "warn" }
stage.replace { source = "level"; expression = "(?i)E(rr(or)?)?"; replace = "error" }
stage.replace { source = "level"; expression = "(?i)D(ebug?)?"; replace = "debug" }
stage.labels { values = { level = "" } }
}
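The intended mapping of the normalization stages can be written out as a small Python function, which is handy for testing sample values before changing a pipeline. Note one deliberate difference: the patterns here are anchored with ^ and $ so that, for example, the letter e inside "debug" cannot match the error rule. That anchoring is an assumption of this sketch, not something taken from the config above.

```python
import re

# Anchored variants of the patterns used in the stages above (assumption).
_LEVEL_PATTERNS = [
    (re.compile(r"^i(nfo)?$", re.IGNORECASE), "info"),
    (re.compile(r"^w(arn(ing)?)?$", re.IGNORECASE), "warn"),
    (re.compile(r"^e(rr(or)?)?$", re.IGNORECASE), "error"),
    (re.compile(r"^d(ebug)?$", re.IGNORECASE), "debug"),
]


def normalize_level(raw: str) -> str:
    """Map common spellings of a severity to one canonical lowercase value."""
    value = raw.strip()
    for pattern, canonical in _LEVEL_PATTERNS:
        if pattern.match(value):
            return canonical
    # Unknown levels are lowercased so casing alone never creates a new stream.
    return value.lower()


print([normalize_level(v) for v in ["INFO", "Warning", "ERR", "D", "Trace"]])
# ['info', 'warn', 'error', 'debug', 'trace']
```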
// Only extract when the relevant field is present — avoids unnecessary cardinality
loki.process "conditional_extraction" {
forward_to = [...]
stage.match {
selector = "{app=\"loki\"} |= \"component\""
stage.logfmt { mapping = { "component" = "" } }
stage.labels { values = { component = "" } }
}
}
loki.process "enforce_labels" {
forward_to = [loki.write.default.receiver]
// ... other stages ...
stage.label_keep {
values = ["app", "env", "cluster", "level", "namespace", "workload", "container"]
}
}
stage.template {
source = "team"
template = "{{ if .Value }}{{ .Value }}{{ else }}unknown{{ end }}"
}
stage.labels { values = { team = "" } }
These reduce storage costs. Establish a cost-per-GB baseline before implementing.
Each log entry already has a metadata timestamp — the inline timestamp is redundant (~30–34 bytes each, ~6% of a typical log line).
loki.process "drop_timestamp" {
forward_to = [...]
// logfmt timestamps
stage.replace {
expression = "(?i)((?:time_?(?:stamp)?|ts|logdate|start_?time)=[^ \\n]+(?: |$))"
replace = " "
}
// JSON timestamps
stage.replace {
expression = "(\"@?(?:time_?(?:stamp)?|ts|logdate|start_?time)\"\\s*:\\s*\"[^\"]+\",?)"
replace = " "
}
// ISO-8601 at start of line
stage.replace {
expression = "^(\\d{4}-\\d{2}-\\d{2})T\\d{2}:\\d{2}(?::\\d{2}(?:\\.\\d{1,9})?Z?)?"
replace = ""
}
}
The original timestamp is still accessible at query time: | line_format `{{ __timestamp__ | date "2006-01-02T15:04:05Z" }}`
loki.process "decolorize" {
forward_to = [...]
stage.decolorize {}
}
Drop the inline level field (only when level is already a label):
stage.replace { expression = "(level=[^ ]+ )"; replace = "" }
// Remove null values
stage.replace {
expression = "(\\s*(\"[^\"]+\"\\s*:\\s*null)(?:\\s*,)?\\s*)"
replace = ""
}
// Remove placeholder values ("-", "undefined", "null" strings)
stage.replace {
expression = "(\\s*(\"[^\"]+\"\\s*:\\s*\"(?:-|null|undefined)\")(?:\\s*,)?\\s*)"
replace = ""
}
// Remove empty values ("", [], {})
stage.replace {
expression = "(\\s*,\\s*(\"[^\"]+\"\\s*:\\s*(\\[\\s*\\]|\\{\\s*\\}|\"\\s*\"))|(\"[^\"]+\"\\s*:\\s*(\\[\\s*\\]|\\{\\s*\\}|\"\\s*\"))\\s*,\\s*)"
replace = ""
}
Practical savings (Istio access log example): Starting at 753 bytes (minified) → after removing nulls, placeholders, unused fields, normalizing keys: 464 bytes — 38% reduction.
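For JSON logs, the null, placeholder, and empty-value removals can also be expressed as a parse-filter-serialize step, which is easier to reason about than the regular expressions. The sketch below is an illustration in Python, not a collector stage; it only looks at top-level fields, and the helper names are hypothetical. The last function reproduces the percentage quoted for the Istio example.

```python
import json

PLACEHOLDERS = {"-", "null", "undefined"}


def is_noise(value) -> bool:
    """True for null, placeholder strings, blank strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == "" or value in PLACEHOLDERS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def slim(line: str) -> str:
    """Drop noise fields from a JSON log line and re-serialize it minified."""
    fields = json.loads(line)
    kept = {k: v for k, v in fields.items() if not is_noise(v)}
    return json.dumps(kept, separators=(",", ":"))


def reduction_pct(before: int, after: int) -> float:
    return 100 * (before - after) / before


print(slim('{"a":null,"b":"-","c":"","d":[],"e":{},"f":"ok","g":0}'))
# {"f":"ok","g":0}
print(round(reduction_pct(753, 464)))  # 38
```

Values such as 0 and false are kept on purpose, since they carry information.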
Grafana Enterprise Logs (GEL) supports Label-Based Access Control (LBAC). Any label can serve as an access control selector.
Best labels for LBAC:
- classification: data sensitivity (public, restricted, confidential, top-secret)
- source: controls which teams can see which log origins
- team / squad: ownership-based access
- env: environment-level restrictions

Static aggregate labels like owner=sysadmins or category=database are particularly effective: one label value gates access to many log files, rather than requiring a long allowlist of filenames or streams.
The most impactful improvements almost always come from these four changes:
- Remove pod as a label: biggest stream reduction in K8s environments
- Add level as a label AND always specify it in queries: can eliminate 94%+ of scanned data when searching for errors
- Normalize or drop filename in K8s: highly variable paths inflate stream count significantly

Focus on these before anything else.
| Label | Why | Alternative |
|---|---|---|
| pod | Transient, unbounded | workload label + pod in structured metadata |
| user_id | Unbounded | Keep only in log content |
| request_id / trace_id | Unbounded | Structured metadata |
| filename (raw K8s path) | Contains pod UID | Normalize or drop |
| Unnormalized level | INFO/info/Info = 3 streams | Normalize at collection time |
| Any dynamically-named label key | Cannot be bounded | Use fixed keys with bounded values |