Specialist in persistent storage, networking, service mesh, and cluster communication. Expert in Kubernetes network architecture, storage solutions, and data persistence strategies for enterprise-scale production deployments.
/plugin marketplace add pluginagentmarketplace/custom-plugin-kubernetes
/plugin install kubernetes-assistant@pluginagentmarketplace-kubernetes

Enterprise-grade Kubernetes storage and networking architecture covering persistent data management, network design, and communication patterns. This agent provides deep expertise in implementing resilient storage solutions, designing secure network topologies, and establishing efficient service-to-service communication at production scale.
Storage Hierarchy in Kubernetes
Application Layer
↓
PersistentVolumeClaim (PVC) ← What application requests
↓
PersistentVolume (PV) ← Actual storage resource
↓
StorageClass ← How storage is provisioned
↓
CSI Driver ← Interface to storage backend
↓
Storage Backend (EBS, Azure Disk, NFS, Ceph, etc.)
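At the top of this hierarchy, a pod consumes storage by referencing a claim in its volumes. A minimal sketch (pod, image, and claim names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app                      # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.27          # example image
      volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data      # PVC requested by the application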
Storage Class Comparison
| Provider | Type | Speed | IOPS | Use Case |
|---|---|---|---|---|
| gp3 (AWS) | Block | Fast | 3,000-16,000 | General workloads |
| io2 (AWS) | Block | Very Fast | Up to 64,000 | Databases |
| Premium SSD (Azure) | Block | Fast | 5,000-20,000 | Production |
| Ultra Disk (Azure) | Block | Ultra Fast | Up to 160,000 | OLTP |
| pd-ssd (GCP) | Block | Fast | 15,000-100,000 | General |
| NFS | File | Medium | Varies | Shared access |
| Ceph RBD | Block | Fast | High | On-prem |
Production Storage Class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/abc-123"
reclaimPolicy: Retain                    # Keep data on PVC delete
volumeBindingMode: WaitForFirstConsumer  # Topology-aware
allowVolumeExpansion: true
mountOptions:
  - noatime
  - nodiratime
PVC with Snapshot Restore
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-data
  labels:
    app: postgresql
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: database-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
Volume Snapshot for Backup
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: database-snapshot
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: database-data
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  tagSpecification_1: "Backup=true"
Network Architecture Overview
Internet
↓
[Load Balancer / Ingress]
↓
[Gateway / Ingress Controller]
↓
[Service (ClusterIP/NodePort/LoadBalancer)]
↓
[Endpoints / EndpointSlices]
↓
[Pods (via CNI network)]
Service Types Comparison
| Type | Scope | Port Exposure | Use Case |
|---|---|---|---|
| ClusterIP | Internal | Cluster-only | Internal services |
| NodePort | External | 30000-32767 | Development, testing |
| LoadBalancer | External | Cloud LB | Production external |
| ExternalName | DNS | N/A | External service alias |
| Headless (clusterIP: None) | Internal | None | StatefulSets, DNS-based discovery |
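ExternalName is the only type in this table without an example elsewhere in this guide; it publishes a DNS CNAME rather than proxying traffic. A minimal sketch (service and hostname are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: legacy-db                       # illustrative name
spec:
  type: ExternalName
  externalName: db.legacy.example.com   # CNAME target outside the cluster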
Production Service Definition
apiVersion: v1
kind: Service
metadata:
  name: api-server
  annotations:
    # AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:..."
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserve source IP
  sessionAffinity: None
  selector:
    app: api-server
  ports:
    - name: https
      port: 443
      targetPort: 8080
      protocol: TCP
Headless Service for StatefulSets
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None  # Headless
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
  publishNotReadyAddresses: true  # Include unready pods
Ingress Controller Comparison
| Controller | Features | Performance | Complexity |
|---|---|---|---|
| NGINX Ingress | Basic routing, SSL | High | Low |
| Istio Gateway | Full mesh, mTLS | Medium | High |
| Traefik | Auto-discovery | High | Medium |
| Kong | API gateway | High | Medium |
| AWS ALB | Native AWS | High | Low |
Production NGINX Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/limit-rpm: "100"  # 100 requests per minute per client IP
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-cert
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: api-v1
                port:
                  number: 80
          - path: /v2
            pathType: Prefix
            backend:
              service:
                name: api-v2
                port:
                  number: 80
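The cert-manager.io/cluster-issuer annotation above assumes a ClusterIssuer named letsencrypt-prod exists in the cluster. A minimal sketch of one using the ACME HTTP-01 solver (the email address is a placeholder):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key  # stores the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx       # requires cert-manager >= 1.12; older releases use 'class'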
Gateway API (GA since Gateway API v1.0; installed as CRDs)
# Gateway (shared infrastructure)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: api-gateway
  namespace: gateway-system
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: "*.example.com"
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
---
# HTTPRoute (per-service routing)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: production
spec:
  parentRefs:
    - name: api-gateway
      namespace: gateway-system
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /users
      backendRefs:
        - name: user-service
          port: 80
          weight: 100
    - matches:
        - path:
            type: PathPrefix
            value: /orders
      backendRefs:
        - name: order-service
          port: 80
          weight: 90
        - name: order-service-canary
          port: 80
          weight: 10
Default Deny All Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
    - Ingress
    - Egress
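Note that this default deny also blocks DNS egress, so pods in the namespace can no longer resolve service names until an allowance is added. A namespace-wide DNS policy is commonly applied alongside it; a sketch, mirroring the DNS rule in the policy below:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}        # all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53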
Allow Specific Traffic Pattern
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
    # Allow from frontend
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow to database
    - to:
        - podSelector:
            matchLabels:
              app: postgresql
      ports:
        - protocol: TCP
          port: 5432
    # Allow to Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
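Egress to destinations outside the cluster uses ipBlock rather than pod or namespace selectors. A sketch allowing HTTPS to an external API (the CIDR is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-https    # illustrative name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24   # illustrative external range (TEST-NET-3)
      ports:
        - protocol: TCP
          port: 443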
CNI Comparison Matrix
| CNI | Network Policy | Encryption | Performance | Features |
|---|---|---|---|---|
| Calico | Yes | WireGuard | High | BGP, eBPF |
| Cilium | Yes + L7 | WireGuard | Very High | eBPF, Hubble |
| Flannel | No | No | Medium | Simple |
| Weave | Yes | Yes | Medium | Multicast |
| AWS VPC CNI | Yes | No | Very High | Native VPC |
Cilium Network Policy with L7 Rules
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: api-l7-policy
namespace: production
spec:
endpointSelector:
matchLabels:
app: api-server
ingress:
- fromEndpoints:
- matchLabels:
app: frontend
toPorts:
- ports:
- port: "8080"
protocol: TCP
rules:
http:
- method: "GET"
path: "/api/v1/.*"
- method: "POST"
path: "/api/v1/orders"
headers:
- "Content-Type: application/json"
CoreDNS Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
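DNS tuning also happens per pod: lowering ndots reduces the search-list expansions that inflate resolution time for external names. A sketch (the value 2 is a common choice, not a universal default; pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned              # illustrative name
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # fewer search-domain expansions for external lookups
  containers:
    - name: app
      image: nginx:1.27        # example image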
DNS Resolution Patterns
# Service DNS
my-service.my-namespace.svc.cluster.local
my-service.my-namespace.svc
my-service.my-namespace
my-service (if in same namespace)
# Pod DNS (with hostname/subdomain)
pod-hostname.subdomain.my-namespace.svc.cluster.local
# StatefulSet Pods
pod-0.headless-service.my-namespace.svc.cluster.local
pod-1.headless-service.my-namespace.svc.cluster.local
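The per-pod names above require a StatefulSet bound to a headless service via serviceName. A minimal sketch matching the kafka headless service defined earlier (image and replica count are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka           # headless service that provides per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: broker
          image: apache/kafka:3.8.0   # example image
          ports:
            - containerPort: 9092
# Pods resolve as kafka-0.kafka.<namespace>.svc.cluster.local, and so on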
Internal Load Balancing
apiVersion: v1
kind: Service
metadata:
  name: internal-api
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
Session Affinity
apiVersion: v1
kind: Service
metadata:
  name: stateful-app
spec:
  type: ClusterIP
  selector:
    app: stateful-app
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour
  ports:
    - port: 80
      targetPort: 8080
Submariner Cross-Cluster Setup
# ServiceExport - make a service available across clusters
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: api-server
  namespace: production
---
# ServiceImport - representation of the exported service in consuming clusters
# (Submariner creates these automatically; shown here for reference)
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: api-server
  namespace: production
spec:
  type: ClusterSetIP
  ports:
    - port: 80
      protocol: TCP
# Consumers resolve the service as api-server.production.svc.clusterset.local
Core focus areas: storage design, network architecture, security & isolation, and troubleshooting.
Success Metrics
| Metric | Target |
|---|---|
| Storage provisioning time | <30 seconds |
| PVC binding success rate | >99.9% |
| Network latency (pod-to-pod) | <1ms (same node) |
| DNS resolution time | <10ms |
| Ingress availability | 99.99% |
| Network policy coverage | 100% namespaces |
| Storage IOPS | Meet SLA |
| Backup success rate | 100% |
PVC stuck in Pending?
|
+-- Check: kubectl describe pvc name
|
+-- "waiting for first consumer"
|   +-- VolumeBindingMode: WaitForFirstConsumer
|   +-- Create a pod using this PVC
|
+-- "no persistent volumes available"
|   +-- Check StorageClass exists
|   +-- Check CSI driver is running
|   +-- Check cloud provider quotas
|
+-- "insufficient quota"
    +-- Increase storage quota
    +-- Clean up unused PVCs
Service not reachable?
|
+-- Check: kubectl get endpoints service-name
|
+-- No endpoints
|   +-- Pod selector mismatch
|   +-- No ready pods
|
+-- Endpoints exist
    |
    +-- Check: kubectl exec test-pod -- curl service:port
    |
    +-- Connection refused
    |   +-- Wrong port / app not listening
    |
    +-- Connection timeout
    |   +-- NetworkPolicy blocking
    |   +-- CNI issue
    |
    +-- DNS resolution failed
        +-- Check CoreDNS pods
        +-- Check /etc/resolv.conf
| Issue | Root Cause | Resolution |
|---|---|---|
| PVC Pending | No matching PV/StorageClass | Check CSI driver, storage class |
| Mount failed | Permission / fsGroup | Set securityContext.fsGroup |
| Slow storage | Wrong storage class | Use faster tier (SSD/io2) |
| Service unreachable | NetworkPolicy | Add allow rule |
| DNS not resolving | CoreDNS issue | Check coredns pods, config |
| Ingress 502/504 | Backend unhealthy | Check readiness probe |
| Cross-ns blocked | Missing policy | Add namespace selector |
| TLS error | Cert mismatch | Check cert-manager, secrets |
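The fsGroup fix referenced in the table above goes on the pod's securityContext; the volume is group-owned at mount time. A sketch (GID 2000, names, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app                              # illustrative name
spec:
  securityContext:
    fsGroup: 2000                        # volume files become group-owned by GID 2000
    fsGroupChangePolicy: OnRootMismatch  # skip the recursive chown when ownership already matches
  containers:
    - name: app
      image: nginx:1.27                  # example image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data              # illustrative claim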
# Storage debugging
kubectl get pv,pvc -A
kubectl describe pvc name
kubectl get events --field-selector reason=ProvisioningFailed
# Network debugging
kubectl get svc,ep -A
kubectl run test --image=busybox --rm -it -- wget -qO- http://service
kubectl exec -it pod -- nslookup service.namespace
# DNS debugging
kubectl exec -it pod -- cat /etc/resolv.conf
kubectl logs -n kube-system -l k8s-app=kube-dns
# NetworkPolicy debugging
kubectl get networkpolicy -A
kubectl describe networkpolicy name
# Ingress debugging
kubectl get ingress -A
kubectl describe ingress name
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
| Challenge | Solution |
|---|---|
| Storage performance | Use local SSDs, optimize IOPS |
| PVC resize stuck | Restart kubelet, check CSI |
| Cross-AZ latency | Use topology-aware storage |
| Network segmentation | Start with deny-all, whitelist |
| DNS propagation delay | Reduce TTL, use headless |
| Ingress bottleneck | Scale controller, use CDN |
| mTLS overhead | Use eBPF (Cilium), optimize |
| Multi-cluster sync | Use service mesh, Submariner |
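For the cross-AZ latency row, topology-aware storage means binding volumes in the consuming pod's zone and optionally pinning the allowed zones on the StorageClass. A sketch (zone values are illustrative; older EBS CSI driver releases key on topology.ebs.csi.aws.com/zone instead):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                          # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer    # provision in the scheduled pod's zone
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a                     # illustrative zones
          - us-east-1b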