service-mesh-debug
Diagnose and fix flaky e2e tests and general connectivity issues in service mesh environments (Kuma, Istio, Linkerd, Consul). Trigger when: a user mentions intermittent test failures, "test is flaky", e2e failures in CI that don't reproduce locally, Ginkgo/Gomega test files that fail sometimes, 503/connection refused errors, mTLS handshake failures, pods not getting traffic, xDS NACKs or warming resources, cert delivery timing issues, or "works locally but not in cluster". Covers Kuma flakiness patterns (timing races, xDS propagation delays, Envoy circuit breakers, mTLS readiness, Gomega misuse) AND universal mesh debugging (control plane connectivity, proxy lifecycle, certificate problems, traffic routing/policy, service discovery).
npx claudepluginhub smykla-skalski/sai --plugin service-mesh-debug
Two modes: **Flaky E2E Fix** (Kuma/Ginkgo test files) and **Mesh Connectivity Debug** (live cluster issues).
Requirements: `test/framework/` helpers (Mode 1) or a live K8s cluster with mesh sidecar injection (Mode 2).

Kuma e2e tests use Ginkgo/Gomega with a custom framework in `test/framework/`. Flakiness almost
always traces to one of ~11 known root causes. The fix is usually a 1-3 line change.
Read the file the user provides (or grep for it). Focus on:
- `Eventually` / `Consistently` calls and their timeout strings
- `Expect(...)` calls inside `Eventually` blocks
- `time.Sleep` calls

Read references/root-causes.md in full before matching — it contains code examples and fix patterns for all 11 root causes. The most common causes, in order of frequency:
| # | Pattern | Fast signal |
|---|---|---|
| 1 | Short Eventually timeout | "30s" near gateway/policy/mTLS code |
| 2 | Missing xDS readiness gate | Traffic asserted immediately after policy apply |
| 3 | Bare Expect inside Eventually | Eventually(func() { Expect(...) }) — no g Gomega |
| 4 | Pod not available after create | WaitUntilNumPodsCreatedE without WaitUntilPodAvailableE |
| 5 | PodNameOfApp race after kill | Called immediately after KillAppPod |
| 6 | External component not awaited | SPIRE / cert-manager / Postgres creation without availability wait |
| 7 | xDS config diff before convergence | config_dump compare without Eventually wrapper |
| 8 | SDS secret timing | mTLS test fails with Secret is not supplied by SDS |
| 9 | Statistical tight margins | Traffic split % assertion with small N |
| 10 | Circuit breaker tripped | Concurrent test runs exhaust CB defaults |
| 11 | Outlier detection ejection | Fault-injection test leaves host ejected for 30s |
Read references/fix-patterns.md before applying any fix — it contains copy-paste templates matched to each root cause.
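Before reading the whole file, the fast signals above can be surfaced with a quick grep pass. A minimal sketch (the sample file and its contents are fabricated for illustration; point the greps at the real test file):

```shell
# Hypothetical sample: generate a tiny test file, then run the greps a
# reviewer would run against the real file the user provided.
TEST_FILE="$(mktemp)"
cat > "$TEST_FILE" <<'EOF'
Eventually(func() { Expect(resp.Code).To(Equal(200)) }, "30s", "1s")
time.Sleep(5 * time.Second)
EOF

# Root causes 1/7: Eventually/Consistently calls and their timeout strings
grep -n 'Eventually\|Consistently' "$TEST_FILE"

# Root cause 3: Expect( calls -- check each for a missing injected `g Gomega`
grep -n 'Expect(' "$TEST_FILE"

# Sleep-based waits: always candidates for replacement with Eventually
grep -n 'time\.Sleep' "$TEST_FILE"
```

Each hit then gets matched against the table above before any fix is applied.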
Key rules:
- Never use `FlakeAttempts(n)` as the primary fix, because it hides root causes and lets the underlying race regress silently.
- Replace `time.Sleep` with `Eventually`, because sleep-based waits are fragile under variable CI load and slow down the suite unnecessarily.
- Run `make format && make check`. If `make check` reports failures, return to Step 2 with the error output and re-diagnose — the original root cause may have been misidentified.
- Add `AfterEachFailure` if missing. If the test lacks a failure debug hook, suggest adding:
```go
AfterEachFailure(func() {
	DebugKube(KubeCluster, meshName, namespace)
})
```
This dumps CP logs, dataplane state, and pod info on failure — essential for future debugging.
If the failure involves connectivity, xDS config, or mTLS and you need to guide the user through live debugging, read references/envoy-debug.md before starting — it contains the full admin API reference and 7-step diagnostic workflow.
For live cluster debugging, suggest running the scripts in scripts/ directly against the pod:
```shell
# Full diagnostic snapshot (saves to ./envoy-snapshot-<pod>-<ts>/)
"${CLAUDE_SKILL_DIR}/scripts/envoy_snapshot.py" <pod> -n <namespace>

# xDS health: CP connected? NACKs? Warming resources? Specific cluster present?
"${CLAUDE_SKILL_DIR}/scripts/xds_check.py" <pod> -n <namespace> --cluster <cluster-name>

# mTLS health: warming secrets? cert expiry? TLS error stats?
"${CLAUDE_SKILL_DIR}/scripts/mtls_check.py" <pod> -n <namespace>
```
Scripts require only kubectl in PATH and Python 3.9+. No extra dependencies.
All scripts support --admin-port (default 9901) for non-Kuma meshes (Istio: 15000, Consul: 19000).
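When switching between meshes often, the defaults above can be captured in a small helper. A hypothetical shell sketch, not part of the shipped scripts (Linkerd is omitted because its proxy is not Envoy):

```shell
# Map a mesh name to its default Envoy admin port, per the defaults above.
# Hypothetical helper -- not part of the shipped scripts/ directory.
admin_port() {
  case "$1" in
    kuma)   echo 9901  ;;
    istio)  echo 15000 ;;
    consul) echo 19000 ;;
    *)      echo "unknown mesh: $1" >&2; return 1 ;;
  esac
}

admin_port istio   # prints 15000
```

It can then feed the scripts directly, e.g. `xds_check.py <pod> -n <ns> --admin-port "$(admin_port istio)"`.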
For general connectivity issues in any service mesh (503s, cert errors, traffic blocked, no healthy hosts, policy denials), follow the 7-phase workflow.
```shell
# 1. Auto-detect mesh and check control plane health
"${CLAUDE_SKILL_DIR}/scripts/mesh_health.py"

# 2. Run Envoy diagnostics on the affected pod (adjust --admin-port for your mesh)
"${CLAUDE_SKILL_DIR}/scripts/xds_check.py" <pod> -n <ns>                                             # Kuma
"${CLAUDE_SKILL_DIR}/scripts/xds_check.py" <pod> -n <ns> --admin-port 15000 --container istio-proxy  # Istio
"${CLAUDE_SKILL_DIR}/scripts/mtls_check.py" <pod> -n <ns>                                            # cert/mTLS issues
"${CLAUDE_SKILL_DIR}/scripts/envoy_snapshot.py" <pod> -n <ns>                                        # full snapshot for offline analysis
```
Read references/failure-taxonomy.md to classify the issue into one of 6 categories before spending time on mesh-specific commands.
| Symptom | Category |
|---|---|
| All pods broken simultaneously | 1 – Control Plane |
| Single pod, no sidecar | 2 – Proxy Lifecycle |
| Secret is not supplied by SDS | 3 – Certificates |
| TLS handshake failure | 3 – Certificates |
| 403 / UAEX flag | 4 – Policy |
| UH / no healthy hosts | 5 – Service Discovery |
| Works on some nodes, not others | 6 – Infrastructure |
| Flaky in CI, passes locally | 1 or 3 |
Read references/mesh-debug-workflow.md for the 7-phase workflow with mesh-specific commands for Kuma, Istio, Linkerd, and Consul.
| Helper | Location | Use for |
|---|---|---|
| WaitForMesh | test/framework/resources.go | Multi-zone mesh sync |
| WaitForResource | test/framework/resources.go | Any resource to appear |
| DebugKube / DebugUniversal | test/framework/debug.go | State dump on failure |
| AfterEachFailure | test/framework/ginkgo.go | Hook debug to failure only |
| ControlPlaneAssertions | test/framework/debug.go | Assert CP not crashed |
| CollectEchoResponse | test/framework/client/collect.go | HTTP connectivity check |
| CollectFailure | test/framework/client/collect.go | Assert expected conn failure |
| MustPassRepeatedly(n) | Gomega | Require N consecutive passes |
| Within(timeout, task) | pkg/test/within.go | Goroutine with timeout |
| Context | Timeout | Rationale |
|---|---|---|
| Pod creation + readiness | "30s" | Scheduler + kubelet startup |
| Policy propagation (simple) | "30s" | CP reconcile + xDS push |
| Gateway / ingress policies | "60s" | Extra reconcile cycles |
| mTLS / SVID / cross-zone | "2m" | Cert issuance + KDS sync |
| MustPassRepeatedly(5) | "2m" | Needs many attempts to confirm stable |
Polling interval: "1s" is standard. Use "500ms" only for fast local assertions.
Anti-patterns to flag:
- `Expect(x)` inside `Eventually(func() { ... })` — must be `Eventually(func(g Gomega) { g.Expect(x) })`
- `WaitUntilNumPodsCreatedE` alone — always follow with `WaitUntilPodAvailableE` per pod
- `time.Sleep(N * time.Second)` — replace with `Eventually`
- `FlakeAttempts(3)` as first resort