You are responsible for deploying a payment processing API on Kubernetes. The API currently runs as a Deployment with 6 replicas. You need to roll out a new version that changes the database schema. Requirements: zero downtime, ability to roll back within 60 seconds, and no request failures during the transition.
Deployment Configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 6
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payment-api   # assumed label; apps/v1 requires a selector matching the template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: payment-api:v2
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /livez
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
  minReadySeconds: 30   # Deployment-level field (spec.minReadySeconds), not a container field
Key Design Decisions:
maxUnavailable: 0 -- Never remove an old Pod until a new one is ready
maxSurge: 2 -- Allow 2 extra Pods during rollout for faster transitions
minReadySeconds: 30 -- Wait 30 seconds after readiness before continuing
revisionHistoryLimit: 5 -- Keep 5 old ReplicaSets for rollback
Database Schema Migration Strategy:
Because the schema changes, you must use a backward-compatible migration approach:
This is the "expand and contract" migration pattern.
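The expand phase can be illustrated with a minimal sketch. It assumes a hypothetical `payments` table that gains a `currency` column (neither name comes from the scenario) and uses Python's built-in `sqlite3` only to show why an additive, nullable change is safe while v1 and v2 Pods run side by side.

```python
import sqlite3

def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the new column as nullable, so v1 INSERTs that omit it keep working.
    conn.execute("ALTER TABLE payments ADD COLUMN currency TEXT")

def backfill(conn: sqlite3.Connection, default: str = "USD") -> None:
    # Backfill rows written before the migration, or by v1 Pods during the rollout.
    conn.execute("UPDATE payments SET currency = ? WHERE currency IS NULL", (default,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL)")
conn.execute("INSERT INTO payments (amount_cents) VALUES (1250)")  # written by v1 before the migration

expand(conn)
conn.execute("INSERT INTO payments (amount_cents) VALUES (990)")  # v1 still works after expand
conn.execute("INSERT INTO payments (amount_cents, currency) VALUES (500, 'EUR')")  # v2 uses the new column

backfill(conn)
rows = conn.execute("SELECT id, amount_cents, currency FROM payments ORDER BY id").fetchall()
print(rows)
```

The contract step (dropping or tightening the old schema) would run only once no v1 Pods remain and a rollback to v1 is no longer needed.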
Rollback Process:
kubectl rollout undo deployment/payment-api
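This reverts to the previous ReplicaSet, which is still available because revisionHistoryLimit: 5 retains it. While a rollout is in progress, maxUnavailable: 0 keeps old Pods running until their replacements are ready, so an undo at that point is near-instant. After a rollout has completed, the old ReplicaSet has been scaled to zero, so the undo is itself a rolling update paced by maxSurge and minReadySeconds.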
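How long a full rolling update takes with these settings, including an undo issued after the rollout has completed, can be bounded from below with a small calculation. This is a rough sketch under simplifying assumptions: Pods start instantly, pass their first readiness probe, and arrive in batches of maxSurge. Real rollouts will be slower.

```python
import math

def rollout_lower_bound_seconds(replicas: int, max_surge: int,
                                initial_delay_s: int, min_ready_s: int) -> int:
    """Rough lower bound for a full rolling update with maxUnavailable: 0."""
    # With maxUnavailable: 0, old Pods are removed only as new ones become
    # available, so new Pods arrive in batches of at most max_surge.
    batches = math.ceil(replicas / max_surge)
    # A Pod cannot pass its first readiness probe before initialDelaySeconds,
    # and it counts as available only after minReadySeconds of being ready.
    per_batch_s = initial_delay_s + min_ready_s
    return batches * per_batch_s

estimate = rollout_lower_bound_seconds(replicas=6, max_surge=2,
                                       initial_delay_s=10, min_ready_s=30)
print(estimate)  # 120
```

Under this model the settings in the manifest need at least two minutes to replace all 6 Pods, which is worth weighing against the 60-second rollback requirement.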
Set terminationGracePeriodSeconds and handle SIGTERM in your application to drain connections.
A developer deployed a new microservice. All 3 Pods are in CrashLoopBackOff. The developer says the service runs perfectly in Docker on their laptop. The service is a Node.js API that connects to PostgreSQL and Redis.
Systematic Debugging Process:
Step 1: Describe the Pod
kubectl describe pod api-server-abc123
Check the Events section:
Events:
Type Reason Message
---- ------ -------
Normal Scheduled Successfully assigned to node-2
Normal Pulled Container image pulled successfully
Normal Started Started container api-server
Warning BackOff Back-off restarting failed container
Check Last State:
Last State: Terminated
Exit Code: 1
Reason: Error
Exit code 1 means the application itself crashed (not OOMKilled, which is 137).
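The 137 figure follows the usual convention that a container killed by a signal exits with 128 plus the signal number (SIGKILL is 9). A small sketch of that decoding, with a hypothetical helper name and only a few common Linux signals listed:

```python
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}  # common Linux signal numbers

def describe_exit_code(code: int) -> str:
    """Classify a container exit code (illustrative helper, not a kubectl feature)."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # 128 + N means "terminated by signal N"
        return "killed by " + SIGNAL_NAMES.get(signum, f"signal {signum}")
    return "application error"

for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

Exit code 137 alone only says the process received SIGKILL; the Reason field in the Pod description is what confirms OOMKilled.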
Step 2: Check Logs
kubectl logs api-server-abc123 --previous
Output:
Error: connect ECONNREFUSED 10.96.0.50:5432
at TCPConnectWrap.afterConnect
The application cannot reach PostgreSQL.
Step 3: Verify the Database Service
kubectl get svc postgres-svc
kubectl get endpoints postgres-svc
If endpoints are empty, the label selector on the Service does not match the database Pod labels.
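The matching rule behind this check is simple: every key/value pair in the Service selector must appear in the Pod's labels. A minimal sketch of equality-based matching, using made-up labels to show how a small naming difference leaves the endpoints empty:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector key/value pair is present in the Pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

service_selector = {"app": "postgres"}             # hypothetical Service selector
pod_labels = {"app": "postgresql", "tier": "db"}   # hypothetical Pod labels

print(selector_matches(service_selector, pod_labels))            # False
print(selector_matches({"app": "postgresql"}, pod_labels))       # True
```

Comparing `kubectl get svc postgres-svc -o yaml` against `kubectl get pods --show-labels` applies the same check by hand.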
Common Root Causes Checklist:
| Symptom | Exit Code | Likely Cause |
|---|---|---|
| OOMKilled | 137 | Memory limit too low |
| Error | 1 | Application crash (missing env var, bad config, dependency unreachable) |
| Completed | 0 | Container exits normally (wrong command for long-running process) |
| ImagePullBackOff | N/A | Wrong image name, missing registry credentials |
| CrashLoopBackOff | 1 | Startup dependency not available (DB, config, secret) |
Step 4: Check ConfigMaps and Secrets
kubectl get configmap api-config -o yaml
kubectl get secret api-secrets -o yaml
Verify all expected environment variables are present and correct.
Step 5: Test Connectivity from Inside the Cluster
kubectl run debug --image=busybox --rm -it -- /bin/sh
# Inside the debug pod:
nslookup postgres-svc.default.svc.cluster.local
nc -zv postgres-svc 5432
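If the debug image has Python rather than busybox tools, the same two checks (name resolution, then a TCP connect) can be sketched with the standard library. The function name and return strings are illustrative choices, not part of any Kubernetes tooling.

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Report whether a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns-failure"   # the name did not resolve (compare with nslookup)
    except ConnectionRefusedError:
        return "refused"       # resolved, but nothing is listening (ECONNREFUSED)
    except socket.timeout:
        return "timeout"       # packets dropped, e.g. by a network policy or firewall
    except OSError:
        return "unreachable"

# Example call from inside the cluster:
# print(check_tcp("postgres-svc", 5432))
```

A "refused" result matches the ECONNREFUSED seen in the logs, while "dns-failure" points at the Service name or namespace instead.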
Your e-commerce API handles 1,000 req/s normally. During flash sales (announced 1 hour in advance), traffic spikes to 10,000 req/s within 60 seconds. Each Pod handles 200 req/s. Currently, HPA takes 5 minutes to fully scale, causing timeouts during the spike. Design a scaling strategy.
Capacity Planning:
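The scenario's figures translate directly into Pod counts. A small sketch of the arithmetic, ignoring any headroom for uneven load balancing or Pod failures:

```python
import math

def pods_needed(request_rate: float, per_pod_rate: float) -> int:
    """Minimum Pods required to serve request_rate at per_pod_rate each."""
    return math.ceil(request_rate / per_pod_rate)

normal = pods_needed(1_000, 200)    # steady-state traffic
peak = pods_needed(10_000, 200)     # flash-sale traffic
print(normal, peak, peak - normal)  # 5 50 45
```

So the cluster has to go from about 5 Pods to about 50 within the 60-second spike window.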
Multi-Layer Scaling Strategy:
Layer 1: Baseline Over-Provisioning
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ecommerce-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ecommerce-api
  minReplicas: 10   # 2x normal to absorb initial spike
  maxReplicas: 80   # Headroom above 50 for safety
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100   # Double Pods every 15 seconds
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
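The effect of the scaleUp policy can be estimated with a short simulation. This is a sketch of the policy cap only: it assumes the metric already asks for 50 Pods, and it ignores metric lag, Pod startup time, and node provisioning.

```python
import math

def scale_up_path(current: int, desired: int, percent: int, max_replicas: int) -> list:
    """Replica count after each policy period, starting from current (>= 1)."""
    target = min(desired, max_replicas)
    path = [current]
    while current < target:
        # A Percent policy caps growth per period relative to the current count.
        allowed = math.ceil(current * (100 + percent) / 100)
        current = min(target, allowed)
        path.append(current)
    return path

path = scale_up_path(current=10, desired=50, percent=100, max_replicas=80)
print(path)                  # [10, 20, 40, 50]
print((len(path) - 1) * 15)  # 45
```

Starting from minReplicas: 10, the policy allows reaching 50 Pods in three 15-second periods, so the policy itself fits the 60-second window; Pod startup and node capacity are the remaining constraints.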
Layer 2: Scheduled Pre-Scaling (for known events)
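apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-flash-sale
spec:
  schedule: "50 * * * *"   # As written this fires at :50 of every hour; pin the hour and day fields to the sale's start time
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # Job Pods must set Never or OnFailure
          # The Pod's ServiceAccount needs RBAC permission to scale the Deployment
          containers:
            - name: scaler
              image: bitnami/kubectl
              command:
                - kubectl
                - scale
                - deployment/ecommerce-api
                - --replicas=50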
Layer 3: Cluster Autoscaler
Layer 4: Pod Startup Optimization
Keep initialDelaySeconds as low as possible
Timeline During Flash Sale:
T-10min: CronJob scales to 50 Pods
T-5min: All 50 Pods running and ready
T+0: Flash sale starts, 10,000 req/s hits
T+0: 50 Pods handle load comfortably
T+30min: Flash sale ends, traffic drops
T+35min: HPA scaleDown stabilization window expires
T+40min: Pods scale down to minReplicas (10)
Set maxConnections per Pod so total connections stay within the database limit.
You need to deploy a 6-node Redis Cluster (3 primaries, 3 replicas) on Kubernetes. Redis nodes need stable network identities, persistent storage, and ordered startup. Design the Kubernetes resources.
Why StatefulSet, Not Deployment:
StatefulSet Configuration:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster-headless
  replicas: 6
  podManagementPolicy: OrderedReady   # Default; gives the ordered startup the scenario requires. Parallel skips ordering, and this field is immutable after creation
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster   # Must match spec.selector
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
            - containerPort: 16379   # Cluster bus port
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
Headless Service (for stable DNS):
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster-headless
spec:
  clusterIP: None
  selector:
    app: redis-cluster
  ports:
    - port: 6379
      targetPort: 6379
Each Pod gets a DNS name: redis-cluster-0.redis-cluster-headless.default.svc.cluster.local
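The naming pattern is `<statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>`. A short sketch that lists all six names, assuming the `default` namespace and the default `cluster.local` cluster domain:

```python
def statefulset_pod_dns(name: str, service: str, replicas: int,
                        namespace: str = "default",
                        cluster_domain: str = "cluster.local") -> list:
    """Stable per-Pod DNS names behind a headless Service."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]

names = statefulset_pod_dns("redis-cluster", "redis-cluster-headless", replicas=6)
for dns_name in names:
    print(dns_name)
```

These names stay the same across Pod restarts and rescheduling, which gives the Redis nodes addresses that do not change when a Pod is replaced.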
Key Considerations:
Your company has 3 teams sharing one Kubernetes cluster: Frontend, Backend, and Data. Requirements:
Namespace Setup:
frontend-team (namespace)
backend-team (namespace)
data-team (namespace)
Role for Developers (per namespace):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: backend-team
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "pods/exec", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update", "patch"]
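RBAC rules are purely additive: a request is allowed if any rule matches its API group, resource, and verb, and denied otherwise. The sketch below is a simplified model of that evaluation applied to the Role above; it ignores resourceNames, non-resource URLs, and aggregated roles, and the function name is illustrative.

```python
DEVELOPER_RULES = [
    {"apiGroups": ["apps"],
     "resources": ["deployments", "statefulsets", "daemonsets"],
     "verbs": ["get", "list", "watch", "create", "update", "patch"]},
    {"apiGroups": [""],
     "resources": ["pods", "pods/log", "pods/exec", "services", "configmaps"],
     "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"]},
    {"apiGroups": [""],
     "resources": ["secrets"],
     "verbs": ["get", "list", "create", "update", "patch"]},
]

def is_allowed(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """Allow if any single rule covers the group, resource, and verb."""
    def covers(values: list, wanted: str) -> bool:
        return "*" in values or wanted in values
    return any(
        covers(rule["apiGroups"], api_group)
        and covers(rule["resources"], resource)
        and covers(rule["verbs"], verb)
        for rule in rules
    )

print(is_allowed(DEVELOPER_RULES, "apps", "deployments", "create"))  # True
print(is_allowed(DEVELOPER_RULES, "apps", "deployments", "delete"))  # False
print(is_allowed(DEVELOPER_RULES, "", "secrets", "delete"))          # False
```

Against a live cluster, `kubectl auth can-i delete secrets --namespace backend-team --as <user>` answers the same question with the real authorizer.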
RoleBinding (binds Role to team):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-developers
  namespace: backend-team
subjects:
  - kind: Group
    name: backend-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
ClusterRole for Platform Team:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-admin
rules:
  - apiGroups: [""]
    resources: ["namespaces", "nodes"]
    verbs: ["*"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
    verbs: ["*"]
Security Boundaries:
backend-team cannot access Secrets in frontend-team
automountServiceAccountToken: false.
Interviewer: "Hey, welcome. I'm the DevOps lead here. I've been running Kubernetes clusters in production for about five years, across AWS, GCP, and bare metal. Today we're going to talk about Kubernetes fundamentals -- not trivia, but real operational knowledge. Let's start simple and go from there."
Interviewer: "First question -- what is a Pod? And more importantly, why does Kubernetes schedule Pods instead of individual containers?"
[Candidate discusses Pod abstraction]
Interviewer: "Good. Now tell me -- if I have a web server container and a log-shipping sidecar container in the same Pod, what do they share? And what don't they share?"
Interviewer: "Let's say you have a payment API running as a Deployment with 6 replicas. Product wants to push a new version right now. How do you ensure zero downtime during the rollout?"
[Candidate discusses rolling updates]
Interviewer: "The new version is deployed, but 2 minutes later alerts fire -- the new version has a bug. What do you do, and how fast can you fix it?"
Interviewer: "A junior developer comes to you. Their new service has all 3 Pods in CrashLoopBackOff. They say it works on their laptop. Walk me through your debugging steps."
Interviewer: "Our API does 1,000 req/s normally but we have a flash sale coming in 2 hours that will push it to 10,000 req/s. Current autoscaling takes 5 minutes to react. What do we do?"
Interviewer: "Let me give you some feedback..."
[Generate scorecard]