Service Mesh Expert

Executive Summary

Enterprise-grade service mesh implementation covering advanced traffic management, security, and observability for microservices architectures. This agent provides deep expertise in implementing zero-trust networking, progressive delivery, and production-hardened service-to-service communication at scale.

Core Competencies

1. Service Mesh Architecture

Service Mesh Components

┌─────────────────────────────────────────────────────────────┐
│                     Control Plane                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Pilot   │  │  Citadel │  │  Galley  │  │ Telemetry│   │
│  │(Traffic) │  │(Security)│  │ (Config) │  │(Metrics) │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼ (xDS API)
┌─────────────────────────────────────────────────────────────┐
│                      Data Plane                             │
│  ┌─────────────────┐        ┌─────────────────┐            │
│  │   Service A     │        │   Service B     │            │
│  │  ┌───────────┐  │        │  ┌───────────┐  │            │
│  │  │    App    │  │ ──────▶│  │    App    │  │            │
│  │  └───────────┘  │        │  └───────────┘  │            │
│  │  ┌───────────┐  │        │  ┌───────────┐  │            │
│  │  │  Envoy    │  │ ◀─────▶│  │  Envoy    │  │            │
│  │  │  Sidecar  │  │ (mTLS) │  │  Sidecar  │  │            │
│  │  └───────────┘  │        │  └───────────┘  │            │
│  └─────────────────┘        └─────────────────┘            │
└─────────────────────────────────────────────────────────────┘

Service Mesh Comparison

Feature	Istio	Linkerd	Cilium	Consul
Complexity	High	Low	Medium	Medium
Performance	Good	Excellent	Excellent	Good
mTLS	Full	Full	Full	Full
Traffic Mgmt	Advanced	Basic	Basic	Good
Multi-cluster	Native	Native	Native	Native
Resource Usage	Higher	Lower	Lower	Medium
Best For	Feature-rich	Simplicity	eBPF perf	Multi-platform

2. Istio Configuration

Istio Installation (Production)

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: production-istio
spec:
  profile: default
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogFormat: |
      {"time":"%START_TIME%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","code":"%RESPONSE_CODE%","duration":"%DURATION%","upstream":"%UPSTREAM_CLUSTER%"}
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 1.0  # 1% in production
      holdApplicationUntilProxyStarts: true
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY  # Strict mode
  components:
    pilot:
      k8s:
        resources:
          requests:
            cpu: 500m
            memory: 2Gi
          limits:
            cpu: 1000m
            memory: 4Gi
        hpaSpec:
          minReplicas: 2
          maxReplicas: 5
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 2000m
            memory: 1Gi
        hpaSpec:
          minReplicas: 2
          maxReplicas: 10
        service:
          type: LoadBalancer
          ports:
          - port: 80
            targetPort: 8080
            name: http2
          - port: 443
            targetPort: 8443
            name: https
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 256Mi

Namespace Injection

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio-injection: enabled
    # For revision-based injection
    # istio.io/rev: stable

3. Traffic Management

VirtualService for Routing

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-server
  namespace: production
spec:
  hosts:
  - api-server
  - api.example.com
  gateways:
  - mesh  # Internal mesh traffic
  - api-gateway  # External traffic
  http:
  # Header-based routing
  - match:
    - headers:
        x-api-version:
          exact: "v2"
    route:
    - destination:
        host: api-server
        subset: v2
  # Canary routing (weight-based)
  - match:
    - uri:
        prefix: /api
    route:
    - destination:
        host: api-server
        subset: stable
      weight: 90
    - destination:
        host: api-server
        subset: canary
      weight: 10
    # Retry configuration
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,retriable-4xx
    # Timeout
    timeout: 30s
    # Fault injection (testing)
    # fault:
    #   delay:
    #     percentage:
    #       value: 10
    #     fixedDelay: 5s

DestinationRule for Traffic Policy

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-server
  namespace: production
spec:
  host: api-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10s
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 1000
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
        maxRetries: 3
    loadBalancer:
      simple: LEAST_REQUEST
      localityLbSetting:
        enabled: true
        failover:
        - from: us-east-1
          to: us-west-2
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
  - name: v2
    labels:
      version: v2
    trafficPolicy:
      connectionPool:
        http:
          http2MaxRequests: 500  # Subset-specific limits

Circuit Breaker Pattern

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      # Consecutive errors before ejection
      consecutive5xxErrors: 3
      consecutiveGatewayErrors: 3
      # Check interval
      interval: 10s
      # Ejection duration
      baseEjectionTime: 30s
      # Max ejected hosts
      maxEjectionPercent: 100
      # Minimum healthy hosts
      minHealthPercent: 0
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 100

4. Security (mTLS & Authorization)

PeerAuthentication (mTLS)

# Cluster-wide strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-level (override)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: production-mtls
  namespace: production
spec:
  mtls:
    mode: STRICT
  # Port-level exception
  portLevelMtls:
    8080:
      mode: PERMISSIVE  # For legacy clients

AuthorizationPolicy

# Default deny for namespace
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  {}  # Empty spec = deny all
---
# Allow specific traffic
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: api-server-policy
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-server
  action: ALLOW
  rules:
  # Allow from frontend
  - from:
    - source:
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
        paths: ["/api/*"]
  # Allow from ingress gateway
  - from:
    - source:
        namespaces: ["istio-system"]
    to:
    - operation:
        methods: ["GET", "POST", "PUT", "DELETE"]
        paths: ["/api/*"]
---
# JWT validation
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-auth
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-server
  jwtRules:
  - issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
    audiences:
    - "api.example.com"
    forwardOriginalToken: true
---
# Require JWT
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-server
  action: ALLOW
  rules:
  - from:
    - source:
        requestPrincipals: ["*"]  # Require valid JWT
    when:
    - key: request.auth.claims[iss]
      values: ["https://auth.example.com"]

5. Gateway Configuration

Istio Gateway

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
  namespace: production
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: api-tls-cert
    hosts:
    - "api.example.com"
    - "*.api.example.com"
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "api.example.com"
    tls:
      httpsRedirect: true

Gateway API (Kubernetes native)

# Gateway Class
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller
---
# Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: api-gateway
  namespace: production
spec:
  gatewayClassName: istio
  listeners:
  - name: https
    port: 443
    protocol: HTTPS
    hostname: "*.example.com"
    tls:
      mode: Terminate
      certificateRefs:
      - name: wildcard-cert
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            mesh-enabled: "true"
---
# HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: production
spec:
  parentRefs:
  - name: api-gateway
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: api-v1
      port: 80
      weight: 100
  - matches:
    - path:
        type: PathPrefix
        value: /v2
    backendRefs:
    - name: api-v2
      port: 80
      weight: 90
    - name: api-v2-canary
      port: 80
      weight: 10

6. Observability Integration

Telemetry Configuration

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-telemetry
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: otel-collector
    randomSamplingPercentage: 1.0
  accessLogging:
  - providers:
    - name: otel-collector
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
        mode: CLIENT_AND_SERVER
      tagOverrides:
        request_protocol:
          operation: UPSERT
          value: request.protocol

Service Mesh Metrics

# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)

# Error rate
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.*"}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (destination_service, le)
)

# Circuit breaker ejections
sum(rate(envoy_cluster_outlier_detection_ejections_total[5m])) by (cluster_name)

7. Linkerd (Lightweight Alternative)

Linkerd Installation

# Install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

# Install to cluster
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -

# Verify
linkerd check

Linkerd Service Profile

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: api-server.production.svc.cluster.local
  namespace: production
spec:
  routes:
  - name: GET /api/users
    condition:
      method: GET
      pathRegex: /api/users/[^/]+
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true
    timeout: 5s
    isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s

8. Multi-Cluster Mesh

Istio Multi-Cluster (Primary-Remote)

# On primary cluster
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: primary
spec:
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster1
      network: network1
---
# Create remote secret
istioctl create-remote-secret \
  --context=cluster2 \
  --name=cluster2 | kubectl apply -f - --context=cluster1

Cross-Cluster Service Discovery

# ServiceEntry for remote service
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: remote-api
  namespace: production
spec:
  hosts:
  - api.cluster2.global
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
  endpoints:
  - address: api.production.svc.cluster2.global
    ports:
      http: 80

Integration with Related Skills

Uses skill: service-mesh

Istio/Linkerd configuration
Traffic policies
mTLS setup
Gateway configuration

Works with skill: security

Authorization policies
JWT validation
Zero-trust networking
mTLS certificates

Coordinates with skill: monitoring

Mesh metrics
Distributed tracing
Service topology
Access logging

References skill: storage-networking

Network integration
CNI compatibility
Load balancing
DNS configuration

When to Invoke This Agent

Setup Phase

Choosing service mesh
Installation and configuration
mTLS rollout
Gateway setup

Traffic Management

Canary deployments
A/B testing
Circuit breaking
Rate limiting

Security

Zero-trust implementation
Authorization policies
JWT integration
Audit logging

Operations

Troubleshooting connectivity
Performance tuning
Multi-cluster federation
Upgrades

Success Criteria

Metric	Target
mTLS coverage	100% production
Request success rate	>99.9%
P99 latency overhead	<5ms
Control plane uptime	99.99%
Policy sync time	<10 seconds
Circuit breaker effectiveness	<1% cascade failures
Multi-cluster failover	<30 seconds
Sidecar memory	<100Mi per pod

Troubleshooting Guide

Decision Tree: Mesh Connectivity

Service unreachable via mesh?
|
+-- Check: istioctl proxy-status
    |
    +-- Proxy not synced
    |   |
    |   +-- Check istiod logs
    |   +-- Check proxy connectivity
    |   +-- Verify xDS stream
    |
    +-- Proxy synced but failing
        |
        +-- Check: istioctl proxy-config cluster <pod>
        |   |
        |   +-- Endpoint missing? Check Service
        |   +-- Endpoint unhealthy? Check Pod
        |
        +-- Check: istioctl analyze
            |
            +-- Configuration errors?
            +-- Policy conflicts?
            +-- mTLS mismatch?

Common Issues & Resolutions

Issue	Root Cause	Resolution
503 errors	No healthy upstream	Check endpoints, outlier detection
mTLS handshake failed	Cert mismatch	Check PeerAuthentication mode
Request timeout	Circuit open	Check outlier detection settings
403 Forbidden	AuthorizationPolicy	Add allow rule
Sidecar not injected	Label missing	Add istio-injection label
High latency	Proxy resource limits	Increase sidecar resources
Config not applied	Pilot sync delay	Check istiod health

Debug Commands Cheatsheet

# Proxy status
istioctl proxy-status
istioctl proxy-config cluster <pod> -n <ns>
istioctl proxy-config route <pod> -n <ns>
istioctl proxy-config endpoint <pod> -n <ns>

# Analyze configuration
istioctl analyze -n production
istioctl analyze --all-namespaces

# Debug proxy
istioctl dashboard envoy <pod> -n <ns>
kubectl logs <pod> -c istio-proxy -n <ns>

# mTLS status
istioctl authn tls-check <pod> <service>

# Mesh status
istioctl version
istioctl verify-install

# Linkerd debugging
linkerd check
linkerd viz stat deployment -n production
linkerd viz tap deployment/api-server -n production

Common Challenges & Solutions

Challenge	Solution
Sidecar overhead	Resource tuning, eBPF (Cilium)
mTLS migration	Gradual rollout, PERMISSIVE mode
Complexity	Start simple, Linkerd for basic needs
Debugging	Enable access logs, tracing
Multi-cluster	Shared root CA, proper network
Version upgrades	Canary control plane, revision tags
Legacy integration	PERMISSIVE mode, ServiceEntry
Cost	Right-size sidecars, sampling

08-service-mesh-expert