# Kubernetes Scenarios

## Pods Evicted Under Node Memory Pressure

What Happened:
During peak traffic hours, several critical application pods were evicted from their nodes, causing service disruptions. The evictions occurred without any manual intervention or deployment changes.
Diagnosis Steps:
Examined pod events with `kubectl describe pod <pod-name>`.
Checked node conditions with `kubectl describe node <node-name>`.
Analyzed resource usage with `kubectl top nodes` and `kubectl top pods`.
Reviewed kubelet logs on affected nodes.
Examined cluster-autoscaler logs for scaling events.
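For reference, the memory-pressure condition and recent evictions can be confirmed quickly with commands along these lines (a minimal sketch, assuming kubectl access to the affected cluster):

```bash
# Show the MemoryPressure condition for every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# List recent eviction events across all namespaces
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp
```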
Root Cause:
The nodes were experiencing memory pressure due to a combination of factors:
1. Pods had no memory limits defined, allowing them to consume excessive memory.
2. System daemons on the nodes were using more memory than expected.
3. The kubelet was configured with a low eviction threshold for memory.
4. The cluster autoscaler was not scaling up quickly enough to handle the load.
Fix/Workaround:
• Added appropriate memory limits to all pods:
```yaml
# Pod resource configuration
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
• Adjusted kubelet configuration to have more appropriate eviction thresholds:
```yaml
# kubelet configuration (KubeletConfiguration file passed to the kubelet via --config)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
evictionSoft:
  memory.available: "1Gi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m"
  nodefs.available: "1m"
```
• Implemented pod disruption budgets for critical services:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: critical-service
```
• Adjusted cluster autoscaler settings for faster scaling.
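The exact autoscaler change depends on how it is deployed; as a rough sketch, when the cluster autoscaler runs as the usual Deployment in kube-system, faster reaction generally comes from tuning flags like the following (values here are illustrative, not the ones used in this incident):
```yaml
# Excerpt from the cluster-autoscaler Deployment (illustrative flag values)
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # use the tag matching your cluster version
  command:
  - ./cluster-autoscaler
  - --scan-interval=10s              # how often pending pods are re-evaluated
  - --max-node-provision-time=10m    # give up earlier on nodes that never become ready
  - --scale-down-delay-after-add=5m  # allow scale-down to resume sooner after a scale-up
  - --expander=least-waste           # choose node groups that best fit the pending pods
```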
Lessons Learned:
Proper resource management and eviction policies are critical for cluster stability.
How to Avoid:
Always set appropriate resource requests and limits for all pods.
Configure kubelet eviction thresholds based on workload characteristics.
Implement pod disruption budgets for critical services.
Monitor node resource usage and set up alerts for pressure conditions.
Consider using vertical pod autoscaler to help set appropriate resource limits.
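If the vertical pod autoscaler route is taken, running it in recommendation-only mode is a low-risk way to derive sensible requests and limits (a sketch, assuming the VPA components are installed in the cluster; the target Deployment name critical-service is illustrative):
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: critical-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-service   # illustrative target workload
  updatePolicy:
    updateMode: "Off"         # recommendations only; no automatic pod restarts
```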

## Data Loss from PVC Deletion During StatefulSet Scale-Down

What Happened:
After scaling down a StatefulSet and later scaling it back up, the application reported data loss. The team discovered that persistent volumes that should have been retained were deleted during the scale-down operation.
Diagnosis Steps:
Examined the StatefulSet configuration with `kubectl get statefulset <name> -o yaml`.
Checked PersistentVolumeClaim status with `kubectl get pvc`.
Reviewed events related to the StatefulSet and PVCs.
Analyzed the Kubernetes controller logs.
Tested scaling operations in a non-production environment.
Root Cause:
The StatefulSet's `persistentVolumeClaimRetentionPolicy` had both `whenScaled` and `whenDeleted` set to `Delete` rather than the Kubernetes default of `Retain`. This caused the PVCs to be deleted when pods were removed during the scale-down operation.
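A quick way to verify how a StatefulSet is configured before scaling it (a sketch; the field is only exposed on clusters where the StatefulSetAutoDeletePVC feature is available):
```bash
# Print the PVC retention policy; an empty result means the policy is unset,
# in which case the default (Retain for both whenScaled and whenDeleted) applies
kubectl get statefulset <name> -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'
```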
Fix/Workaround:
• Updated the StatefulSet configuration to retain PVCs during scaling:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  replicas: 3
  serviceName: database          # assumes a matching headless Service named "database"
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain           # Keep PVCs when scaling down
    whenDeleted: Delete          # Remove PVCs when the StatefulSet is deleted
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: database
        image: postgres:14
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
```
• Implemented a backup strategy for critical data (one building block is sketched after these bullets).
• Added pre-scaling checks to verify data replication status.
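One possible building block for that backup strategy is CSI volume snapshots, assuming the storage class is backed by a CSI driver with snapshot support (the VolumeSnapshotClass name csi-snapclass below is hypothetical):
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-database-0-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass        # hypothetical; use the class your CSI driver provides
  source:
    persistentVolumeClaimName: data-database-0  # PVC created by the volumeClaimTemplate for pod database-0
```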
Lessons Learned:
StatefulSet PVC retention policies must be explicitly configured to prevent data loss during scaling operations.
How to Avoid:
Always set an explicit `persistentVolumeClaimRetentionPolicy` for StatefulSets with important data.
Implement regular backups for stateful workloads.
Test scaling operations in non-production environments.
Add monitoring for PVC creation and deletion events (see the sketch after this list).
Document StatefulSet scaling procedures with appropriate precautions.
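As a simple starting point for that monitoring, PVC lifecycle events can be watched directly; production alerting would normally go through the cluster's monitoring stack instead (sketch):
```bash
# Watch PVC-related events cluster-wide, including deletions triggered by scaling
kubectl get events -A --field-selector involvedObject.kind=PersistentVolumeClaim -w
```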

## Service Outage from Expired TLS Certificates

What Happened:
Users reported connection errors when accessing several services through HTTPS. The errors indicated TLS certificate issues, despite certificates being managed through Kubernetes.
Diagnosis Steps:
Checked Ingress resources with `kubectl get ingress -A`.
Examined TLS certificate details with `kubectl get secret <tls-secret> -o yaml`.
Decoded the certificate to check its expiration date:
```bash
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep "Not After"
```
Reviewed cert-manager logs (if using cert-manager).
Checked for certificate renewal jobs or CronJobs.
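To check every TLS secret in one pass rather than one at a time, a small loop along these lines can help (a sketch, assuming kubectl, base64, and openssl are available locally):
```bash
# Print the expiry date of every kubernetes.io/tls secret in the cluster
for item in $(kubectl get secrets -A --field-selector type=kubernetes.io/tls \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  ns=${item%%/*}; name=${item##*/}
  expiry=$(kubectl get secret "$name" -n "$ns" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate)
  echo "$item  $expiry"
done
```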
Root Cause:
The TLS certificates were managed manually through Kubernetes Secrets rather than using cert-manager or another automated solution. The team had no process for monitoring certificate expiration or automating renewals.
Fix/Workaround:
• Short-term: Manually renewed the certificates and updated the Secrets:
```bash
# Generate new certificates
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt -subj "/CN=example.com"

# Update the Kubernetes Secret
kubectl create secret tls tls-secret --cert=tls.crt --key=tls.key \
  --dry-run=client -o yaml | kubectl apply -f -
```
• Long-term: Implemented cert-manager for automated certificate management:
```yaml
# ClusterIssuer for cert-manager (cert-manager itself must already be installed in the cluster)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
# Update the Ingress to request its certificate from cert-manager
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-service
            port:
              number: 80
```
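With the ingress-shim integration that cert-manager enables by default, the annotated Ingress causes cert-manager to create and renew a Certificate resource for the example-tls Secret (the Certificate normally shares the Secret's name); issuance can be checked with:
```bash
kubectl get certificate example-tls
kubectl describe certificate example-tls   # shows issuance status and renewal details
```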
Lessons Learned:
Certificate management requires automation and monitoring to prevent unexpected expirations.
How to Avoid:
Use cert-manager or similar tools for automated certificate management.
Implement monitoring for certificate expiration dates.
Set up alerts for certificates nearing expiration (30 days in advance); a sample alert rule is sketched after this list.
Document certificate renewal processes.
Use longer validity periods for non-public certificates where appropriate.
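For the alerting item above, if cert-manager exposes its metrics to a Prometheus Operator setup, a rule along these lines covers the 30-day window (a sketch; adapt the metric, threshold, and labels to the actual monitoring stack):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry-alerts
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # Fires when a cert-manager-managed certificate expires within 30 days
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in less than 30 days"
```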