# Kubernetes Scenarios

## Pods Evicted Under Node Memory Pressure

What Happened:
During peak traffic hours, several critical application pods were evicted from their nodes, causing service disruptions. The evictions occurred without any manual intervention or deployment changes.
Diagnosis Steps:
Examined pod events with `kubectl describe pod <pod-name>`.
Checked node conditions with `kubectl describe node <node-name>`.
Analyzed resource usage with `kubectl top nodes` and `kubectl top pods`.
Reviewed kubelet logs on affected nodes.
Examined cluster-autoscaler logs for scaling events.
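For reference, the memory-pressure condition and recent evictions can be confirmed quickly with commands along these lines (a minimal sketch, assuming kubectl access to the affected cluster):

```bash
# Show the MemoryPressure condition for every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# List recent eviction events across all namespaces
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp
```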
Root Cause:
The nodes were experiencing memory pressure due to a combination of factors:
1. Pods had no memory limits defined, allowing them to consume excessive memory.
2. System daemons on the nodes were using more memory than expected.
3. The kubelet was configured with a low eviction threshold for memory.
4. The cluster autoscaler was not scaling up quickly enough to handle the load.
Fix/Workaround:
• Added appropriate memory limits to all pods:
```yaml
# Pod resource configuration
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
• Adjusted kubelet configuration to have more appropriate eviction thresholds:
```yaml
# kubelet configuration (KubeletConfiguration file passed to the kubelet via --config)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
evictionSoft:
  memory.available: "1Gi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m"
  nodefs.available: "1m"
```
• Implemented pod disruption budgets for critical services:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: critical-service
```
• Adjusted cluster autoscaler settings for faster scaling.
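The exact autoscaler change depends on how it is deployed; as a rough sketch, when the cluster autoscaler runs as the usual Deployment in kube-system, faster reaction generally comes from tuning flags like the following (values here are illustrative, not the ones used in this incident):
```yaml
# Excerpt from the cluster-autoscaler Deployment (illustrative flag values)
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # use the tag matching your cluster version
  command:
  - ./cluster-autoscaler
  - --scan-interval=10s              # how often pending pods are re-evaluated
  - --max-node-provision-time=10m    # give up earlier on nodes that never become ready
  - --scale-down-delay-after-add=5m  # allow scale-down to resume sooner after a scale-up
  - --expander=least-waste           # choose node groups that best fit the pending pods
```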
Lessons Learned:
Proper resource management and eviction policies are critical for cluster stability.
How to Avoid:
Always set appropriate resource requests and limits for all pods.
Configure kubelet eviction thresholds based on workload characteristics.
Implement pod disruption budgets for critical services.
Monitor node resource usage and set up alerts for pressure conditions.
Consider using vertical pod autoscaler to help set appropriate resource limits.
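If the vertical pod autoscaler route is taken, running it in recommendation-only mode is a low-risk way to derive sensible requests and limits (a sketch, assuming the VPA components are installed in the cluster; the target Deployment name critical-service is illustrative):
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: critical-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-service   # illustrative target workload
  updatePolicy:
    updateMode: "Off"         # recommendations only; no automatic pod restarts
```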

## Data Loss from PVC Deletion During StatefulSet Scale-Down

What Happened:
After scaling down a StatefulSet and later scaling it back up, the application reported data loss. The team discovered that persistent volumes that should have been retained were deleted during the scale-down operation.
Diagnosis Steps:
Examined the StatefulSet configuration with `kubectl get statefulset <name> -o yaml`.
Checked PersistentVolumeClaim status with `kubectl get pvc`.
Reviewed events related to the StatefulSet and PVCs.
Analyzed the Kubernetes controller logs.
Tested scaling operations in a non-production environment.
Root Cause:
The StatefulSet's `persistentVolumeClaimRetentionPolicy` had both `whenScaled` and `whenDeleted` set to `Delete` rather than the Kubernetes default of `Retain`. This caused the PVCs to be deleted when pods were removed during the scale-down operation.
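A quick way to verify how a StatefulSet is configured before scaling it (a sketch; the field is only exposed on clusters where the StatefulSetAutoDeletePVC feature is available):
```bash
# Print the PVC retention policy; an empty result means the policy is unset,
# in which case the default (Retain for both whenScaled and whenDeleted) applies
kubectl get statefulset <name> -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'
```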
Fix/Workaround:
• Updated the StatefulSet configuration to retain PVCs during scaling:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  replicas: 3
  serviceName: database          # assumes a matching headless Service named "database"
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain           # Keep PVCs when scaling down
    whenDeleted: Delete          # Remove PVCs when the StatefulSet is deleted
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: database
        image: postgres:14
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
```
• Implemented a backup strategy for critical data (one building block is sketched after these bullets).
• Added pre-scaling checks to verify data replication status.
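One possible building block for that backup strategy is CSI volume snapshots, assuming the storage class is backed by a CSI driver with snapshot support (the VolumeSnapshotClass name csi-snapclass below is hypothetical):
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-database-0-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass        # hypothetical; use the class your CSI driver provides
  source:
    persistentVolumeClaimName: data-database-0  # PVC created by the volumeClaimTemplate for pod database-0
```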
Lessons Learned:
StatefulSet PVC retention policies must be explicitly configured to prevent data loss during scaling operations.
How to Avoid:
Always set an explicit `persistentVolumeClaimRetentionPolicy` for StatefulSets with important data.
Implement regular backups for stateful workloads.
Test scaling operations in non-production environments.
Add monitoring for PVC creation and deletion events (see the sketch after this list).
Document StatefulSet scaling procedures with appropriate precautions.
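As a simple starting point for that monitoring, PVC lifecycle events can be watched directly; production alerting would normally go through the cluster's monitoring stack instead (sketch):
```bash
# Watch PVC-related events cluster-wide, including deletions triggered by scaling
kubectl get events -A --field-selector involvedObject.kind=PersistentVolumeClaim -w
```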

## Service Outage from Expired TLS Certificates

What Happened:
Users reported connection errors when accessing several services through HTTPS. The errors indicated TLS certificate issues, despite certificates being managed through Kubernetes.
Diagnosis Steps:
Checked Ingress resources with `kubectl get ingress -A`.
Examined TLS certificate details with `kubectl get secret <tls-secret> -o yaml`.
Decoded the certificate to check its expiration date:
```bash
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep "Not After"
```
Reviewed cert-manager logs (if using cert-manager).
Checked for certificate renewal jobs or CronJobs.
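To check every TLS secret in one pass rather than one at a time, a small loop along these lines can help (a sketch, assuming kubectl, base64, and openssl are available locally):
```bash
# Print the expiry date of every kubernetes.io/tls secret in the cluster
for item in $(kubectl get secrets -A --field-selector type=kubernetes.io/tls \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  ns=${item%%/*}; name=${item##*/}
  expiry=$(kubectl get secret "$name" -n "$ns" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate)
  echo "$item  $expiry"
done
```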
Root Cause:
The TLS certificates were managed manually through Kubernetes Secrets rather than using cert-manager or another automated solution. The team had no process for monitoring certificate expiration or automating renewals.
Fix/Workaround:
• Short-term: Manually renewed the certificates and updated the Secrets:
```bash
# Generate new certificates
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt -subj "/CN=example.com"

# Update the Kubernetes Secret
kubectl create secret tls tls-secret --cert=tls.crt --key=tls.key \
  --dry-run=client -o yaml | kubectl apply -f -
```
• Long-term: Implemented cert-manager for automated certificate management:
```yaml
# ClusterIssuer for cert-manager (cert-manager itself must already be installed in the cluster)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
# Update the Ingress to request its certificate from cert-manager
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-service
            port:
              number: 80
```
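With the ingress-shim integration that cert-manager enables by default, the annotated Ingress causes cert-manager to create and renew a Certificate resource for the example-tls Secret (the Certificate normally shares the Secret's name); issuance can be checked with:
```bash
kubectl get certificate example-tls
kubectl describe certificate example-tls   # shows issuance status and renewal details
```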
Lessons Learned:
Certificate management requires automation and monitoring to prevent unexpected expirations.
How to Avoid:
Use cert-manager or similar tools for automated certificate management.
Implement monitoring for certificate expiration dates.
Set up alerts for certificates nearing expiration (30 days in advance); a sample alert rule is sketched after this list.
Document certificate renewal processes.
Use longer validity periods for non-public certificates where appropriate.
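For the alerting item above, if cert-manager exposes its metrics to a Prometheus Operator setup, a rule along these lines covers the 30-day window (a sketch; adapt the metric, threshold, and labels to the actual monitoring stack):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry-alerts
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # Fires when a cert-manager-managed certificate expires within 30 days
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in less than 30 days"
```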