# Container Orchestration Scenarios
No summary provided
What Happened:
During a scheduled maintenance window, an automated process cordoned multiple nodes simultaneously to prepare for kernel updates. Despite having PodDisruptionBudgets (PDBs) configured for critical services, several applications experienced downtime because pods could not be rescheduled quickly enough.
Diagnosis Steps:
Examined cluster events during the maintenance window.
Reviewed PodDisruptionBudget configurations for affected services.
Analyzed pod scheduling and eviction logs.
Checked node resource utilization before and during the incident.
Reviewed the automated maintenance script's logic.
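The steps above correspond roughly to commands like the following; resource names are illustrative, and the kubectl top output assumes metrics-server is installed:
# Events, PDB headroom, cordoned nodes, and node resource headroom during the window
kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'cordon|drain|evict|failedscheduling'
kubectl get pdb -A        # ALLOWED DISRUPTIONS shows remaining budget per PDB
kubectl get nodes         # look for Ready,SchedulingDisabled entries
kubectl top nodes         # requires metrics-server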
Root Cause:
Multiple issues contributed to the problem:
1. The automated maintenance script cordoned too many nodes simultaneously (25% of the cluster), exceeding the cluster's ability to reschedule pods quickly.
2. Some PodDisruptionBudgets were configured with absolute values instead of percentages, which didn't scale with deployment size.
3. Several services had anti-affinity rules that made rescheduling difficult with fewer available nodes.
4. The cluster was already running at high resource utilization (85% CPU), leaving little headroom for pod migrations.
Fix/Workaround:
• Short-term: Modified the maintenance script to cordon nodes sequentially:
#!/bin/bash
# Improved maintenance script with sequential cordoning
set -euo pipefail
# Configuration
MAX_UNAVAILABLE_NODES=3
DRAIN_TIMEOUT=300
WAIT_BETWEEN_NODES=120
MAINTENANCE_LABEL="maintenance=true"
# Get all nodes
NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
TOTAL_NODES=$(echo "$NODES" | wc -w)
echo "Total nodes in cluster: $TOTAL_NODES"
# Calculate maximum nodes to cordon at once
MAX_NODES_TO_CORDON=$(( TOTAL_NODES * 10 / 100 ))
if [ "$MAX_NODES_TO_CORDON" -gt "$MAX_UNAVAILABLE_NODES" ]; then
MAX_NODES_TO_CORDON=$MAX_UNAVAILABLE_NODES
fi
echo "Maximum nodes to cordon at once: $MAX_NODES_TO_CORDON"
# Function to check cluster health
check_cluster_health() {
# Check if any PDBs are violated
PDB_VIOLATIONS=$(kubectl get poddisruptionbudgets --all-namespaces -o json | jq '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.name' | wc -l)
if [ "$PDB_VIOLATIONS" -gt 0 ]; then
echo "Warning: $PDB_VIOLATIONS PDBs currently have 0 disruptions allowed"
return 1
fi
# Check if any deployments are not fully available
UNAVAILABLE_DEPLOYMENTS=$(kubectl get deployments --all-namespaces -o json | jq '.items[] | select(.status.availableReplicas < .status.replicas) | .metadata.name' | wc -l)
if [ "$UNAVAILABLE_DEPLOYMENTS" -gt 0 ]; then
echo "Warning: $UNAVAILABLE_DEPLOYMENTS deployments are not fully available"
return 1
fi
return 0
}
# Function to cordon and drain a node
cordon_and_drain() {
local node=$1
echo "Cordoning node $node..."
kubectl cordon "$node"
echo "Labeling node $node for maintenance..."
kubectl label node "$node" "$MAINTENANCE_LABEL" --overwrite
echo "Draining node $node..."
kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout="${DRAIN_TIMEOUT}s"
echo "Node $node successfully cordoned and drained"
}
# Main maintenance loop
CORDONED_COUNT=0
for node in $NODES; do
# Skip nodes that are already cordoned
if kubectl get node "$node" -o jsonpath='{.spec.unschedulable}' | grep -q 'true'; then
echo "Node $node is already cordoned, skipping"
continue
fi
# Check if we've reached the maximum number of nodes to cordon at once
if [ "$CORDONED_COUNT" -ge "$MAX_NODES_TO_CORDON" ]; then
echo "Maximum number of nodes cordoned, waiting for cluster to stabilize..."
while ! check_cluster_health; do
echo "Cluster is not healthy, waiting 60 seconds..."
sleep 60
done
CORDONED_COUNT=0
fi
# Cordon and drain the node
cordon_and_drain "$node"
CORDONED_COUNT=$((CORDONED_COUNT + 1))
# Wait between nodes to allow for pod rescheduling
echo "Waiting $WAIT_BETWEEN_NODES seconds before proceeding to next node..."
sleep "$WAIT_BETWEEN_NODES"
done
echo "All nodes have been cordoned and drained for maintenance"
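A possible invocation of the script above, assuming it is saved as sequential_drain.sh (the filename is illustrative) and run with a cluster-admin kubeconfig:
chmod +x sequential_drain.sh
./sequential_drain.sh 2>&1 | tee "maintenance-$(date +%F).log"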
• Long-term: Improved PodDisruptionBudget configurations:
# Before: Problematic PDB with absolute values
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-service-pdb
namespace: production
spec:
minAvailable: 3
selector:
matchLabels:
app: api-service
# After: Improved PDB with a percentage-based budget that scales with replica count
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-service-pdb
namespace: production
spec:
minAvailable: 80%
selector:
matchLabels:
app: api-service
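As a sanity check on the percentage-based budget: with 10 replicas, minAvailable: 80% resolves to 8 pods and allows at most 2 voluntary disruptions at a time. The resolved budget can be inspected directly (names match the manifests above):
kubectl get pdb api-service-pdb -n production \
  -o custom-columns=NAME:.metadata.name,MIN_AVAILABLE:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed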
• Implemented better pod placement rules (relaxed anti-affinity plus zone-level topology spread):
# Before: Strict anti-affinity that made rescheduling difficult
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api-service
template:
metadata:
labels:
app: api-service
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api-service
topologyKey: kubernetes.io/hostname
# After: Hard spreading across zones via topology spread constraints, soft per-node anti-affinity
# (A required zone-level anti-affinity rule would cap the deployment at one pod per zone.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-service
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-service
              topologyKey: kubernetes.io/hostname
• Created a maintenance controller in Go:
// maintenance_controller.go
package main
import (
"context"
"flag"
"fmt"
"os"
"os/signal"
"syscall"
"time"
    corev1 "k8s.io/api/core/v1"
    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/klog/v2"
)
type MaintenanceController struct {
clientset *kubernetes.Clientset
maxUnavailableNodes int
drainTimeout time.Duration
waitBetweenNodes time.Duration
maintenanceLabel string
}
func NewMaintenanceController(clientset *kubernetes.Clientset, maxUnavailableNodes int, drainTimeout, waitBetweenNodes time.Duration, maintenanceLabel string) *MaintenanceController {
return &MaintenanceController{
clientset: clientset,
maxUnavailableNodes: maxUnavailableNodes,
drainTimeout: drainTimeout,
waitBetweenNodes: waitBetweenNodes,
maintenanceLabel: maintenanceLabel,
}
}
func (c *MaintenanceController) Run(ctx context.Context) error {
klog.Info("Starting maintenance controller")
// Get all nodes
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return fmt.Errorf("failed to list nodes: %w", err)
}
totalNodes := len(nodes.Items)
klog.Infof("Total nodes in cluster: %d", totalNodes)
// Calculate maximum nodes to cordon at once (10% of total, capped at maxUnavailableNodes)
maxNodesToCordon := totalNodes * 10 / 100
if maxNodesToCordon > c.maxUnavailableNodes {
maxNodesToCordon = c.maxUnavailableNodes
}
klog.Infof("Maximum nodes to cordon at once: %d", maxNodesToCordon)
// Main maintenance loop
cordonedCount := 0
for _, node := range nodes.Items {
// Skip nodes that are already cordoned
if node.Spec.Unschedulable {
klog.Infof("Node %s is already cordoned, skipping", node.Name)
continue
}
// Check if we've reached the maximum number of nodes to cordon at once
if cordonedCount >= maxNodesToCordon {
klog.Info("Maximum number of nodes cordoned, waiting for cluster to stabilize...")
for {
healthy, err := c.checkClusterHealth(ctx)
if err != nil {
klog.Warningf("Failed to check cluster health: %v", err)
}
if healthy {
break
}
klog.Info("Cluster is not healthy, waiting 60 seconds...")
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(60 * time.Second):
}
}
cordonedCount = 0
}
// Cordon and drain the node
if err := c.cordonAndDrainNode(ctx, node.Name); err != nil {
klog.Errorf("Failed to cordon and drain node %s: %v", node.Name, err)
continue
}
cordonedCount++
// Wait between nodes to allow for pod rescheduling
klog.Infof("Waiting %v before proceeding to next node...", c.waitBetweenNodes)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(c.waitBetweenNodes):
}
}
klog.Info("All nodes have been cordoned and drained for maintenance")
return nil
}
func (c *MaintenanceController) checkClusterHealth(ctx context.Context) (bool, error) {
// Check if any PDBs are violated
pdbs, err := c.clientset.PolicyV1().PodDisruptionBudgets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list PDBs: %w", err)
}
pdbViolations := 0
for _, pdb := range pdbs.Items {
if pdb.Status.DisruptionsAllowed == 0 {
pdbViolations++
}
}
if pdbViolations > 0 {
klog.Warningf("%d PDBs currently have 0 disruptions allowed", pdbViolations)
return false, nil
}
// Check if any deployments are not fully available
deployments, err := c.clientset.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list deployments: %w", err)
}
unavailableDeployments := 0
for _, deployment := range deployments.Items {
if deployment.Status.AvailableReplicas < deployment.Status.Replicas {
unavailableDeployments++
}
}
if unavailableDeployments > 0 {
klog.Warningf("%d deployments are not fully available", unavailableDeployments)
return false, nil
}
return true, nil
}
func (c *MaintenanceController) cordonAndDrainNode(ctx context.Context, nodeName string) error {
klog.Infof("Cordoning node %s...", nodeName)
// Get the node
node, err := c.clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
return fmt.Errorf("failed to get node %s: %w", nodeName, err)
}
    // Cordon the node and apply the maintenance label in a single update
    // to avoid a conflict from writing a stale object twice
    node.Spec.Unschedulable = true
    if node.Labels == nil {
        node.Labels = make(map[string]string)
    }
    key, value, found := splitLabel(c.maintenanceLabel)
    if found {
        node.Labels[key] = value
    }
    _, err = c.clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    if err != nil {
        return fmt.Errorf("failed to cordon and label node %s: %w", nodeName, err)
    }
    // Drain the node by evicting pods through the Eviction API so that
    // PodDisruptionBudgets are respected (simplified: no per-pod retry or backoff)
pods, err := c.clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
})
if err != nil {
return fmt.Errorf("failed to list pods on node %s: %w", nodeName, err)
}
for _, pod := range pods.Items {
// Skip DaemonSet pods
if isDaemonSetPod(&pod) {
continue
}
klog.Infof("Evicting pod %s/%s from node %s", pod.Namespace, pod.Name, nodeName)
        err := c.clientset.CoreV1().Pods(pod.Namespace).EvictV1(ctx, &policyv1.Eviction{
ObjectMeta: metav1.ObjectMeta{
Name: pod.Name,
Namespace: pod.Namespace,
},
})
if err != nil {
klog.Warningf("Failed to evict pod %s/%s: %v", pod.Namespace, pod.Name, err)
}
}
// Wait for pods to be evicted
deadline := time.Now().Add(c.drainTimeout)
for time.Now().Before(deadline) {
pods, err := c.clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
})
if err != nil {
return fmt.Errorf("failed to list pods on node %s: %w", nodeName, err)
}
nonDaemonSetPods := 0
for _, pod := range pods.Items {
if !isDaemonSetPod(&pod) {
nonDaemonSetPods++
}
}
if nonDaemonSetPods == 0 {
klog.Infof("Node %s successfully cordoned and drained", nodeName)
return nil
}
klog.Infof("Waiting for %d pods to be evicted from node %s...", nonDaemonSetPods, nodeName)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(5 * time.Second):
}
}
return fmt.Errorf("timed out waiting for pods to be evicted from node %s", nodeName)
}
func isDaemonSetPod(pod *corev1.Pod) bool {
for _, ownerRef := range pod.OwnerReferences {
if ownerRef.Kind == "DaemonSet" {
return true
}
}
return false
}
func splitLabel(label string) (string, string, bool) {
for i, c := range label {
if c == '=' {
return label[:i], label[i+1:], true
}
}
return "", "", false
}
func main() {
klog.InitFlags(nil)
var (
kubeconfig string
maxUnavailableNodes int
drainTimeout time.Duration
waitBetweenNodes time.Duration
maintenanceLabel string
)
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.IntVar(&maxUnavailableNodes, "max-unavailable-nodes", 3, "Maximum number of nodes to cordon at once")
flag.DurationVar(&drainTimeout, "drain-timeout", 5*time.Minute, "Timeout for draining a node")
flag.DurationVar(&waitBetweenNodes, "wait-between-nodes", 2*time.Minute, "Time to wait between cordoning nodes")
flag.StringVar(&maintenanceLabel, "maintenance-label", "maintenance=true", "Label to apply to nodes under maintenance")
flag.Parse()
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
klog.Info("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
klog.Infof("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
klog.Fatalf("Failed to create Kubernetes client config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
klog.Fatalf("Failed to create Kubernetes clientset: %v", err)
}
// Create and run controller
controller := NewMaintenanceController(
clientset,
maxUnavailableNodes,
drainTimeout,
waitBetweenNodes,
maintenanceLabel,
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Handle signals
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
go func() {
sig := <-sigCh
klog.Infof("Received signal %s, shutting down", sig)
cancel()
}()
if err := controller.Run(ctx); err != nil && err != context.Canceled {
klog.Fatalf("Error running controller: %v", err)
}
}
Lessons Learned:
Node cordoning and draining requires careful orchestration to prevent service disruption.
How to Avoid:
Implement percentage-based PodDisruptionBudgets instead of absolute values.
Cordon and drain nodes sequentially with health checks between operations.
Use soft (preferred) anti-affinity or topology spread constraints instead of strict required anti-affinity.
Maintain sufficient resource headroom for pod migrations during maintenance.
Test maintenance procedures in a staging environment before production.
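A lightweight pre-flight check before any maintenance run might look like this sketch, reusing the same jq-based PDB check as the script above:
# Abort maintenance if any PDB already has zero allowed disruptions
BLOCKED=$(kubectl get pdb -A -o json | jq '[.items[] | select(.status.disruptionsAllowed == 0)] | length')
if [ "$BLOCKED" -gt 0 ]; then
  echo "Refusing to start: $BLOCKED PDBs have no disruption budget left"
  exit 1
fi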
No summary provided
What Happened:
During a marketing campaign, several critical microservices became unavailable. Investigation showed that pods were being evicted from nodes, and some nodes were being marked as NotReady. The issue affected multiple services across different namespaces.
Diagnosis Steps:
Examined Kubernetes events for eviction notices and node status changes.
Analyzed node resource metrics (CPU, memory, disk) before and during the incident.
Reviewed pod resource requests and limits across affected namespaces.
Checked kubelet logs for resource pressure signals.
Investigated system logs for OOM killer activity.
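The memory-pressure diagnosis steps above map roughly onto commands like these (node names are placeholders):
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp
kubectl describe node node-1 | grep -A8 'Conditions:'     # look for MemoryPressure=True
ssh node-1 "journalctl -u kubelet --since '2 hours ago' | grep -i evict"
ssh node-1 "dmesg -T | grep -i 'out of memory'"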
Root Cause:
The cluster was experiencing memory pressure due to a combination of factors:
1. Many pods had no memory limits defined, allowing them to consume excessive memory.
2. System daemon and kubelet memory reservations were not properly configured.
3. Node memory was fragmented, preventing efficient allocation.
4. Some pods had memory leaks that gradually consumed available memory.
5. The cluster autoscaler was not configured to scale based on memory pressure.
Fix/Workaround:
• Short-term: Increased node size and manually cordoned affected nodes:
# Cordon affected nodes to prevent new pods from being scheduled
kubectl cordon node-1 node-2 node-3
# Drain pods from affected nodes (with grace period to allow proper termination)
kubectl drain node-1 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
kubectl drain node-2 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
kubectl drain node-3 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
# Restart kubelet on affected nodes
ssh node-1 "sudo systemctl restart kubelet"
ssh node-2 "sudo systemctl restart kubelet"
ssh node-3 "sudo systemctl restart kubelet"
# Uncordon nodes after recovery
kubectl uncordon node-1 node-2 node-3
• Long-term: Implemented proper resource management:
# Set appropriate kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
memory.available: "500Mi"
nodefs.available: "10%"
nodefs.inodesFree: "5%"
evictionSoft:
memory.available: "1Gi"
nodefs.available: "15%"
nodefs.inodesFree: "10%"
evictionSoftGracePeriod:
memory.available: "1m"
nodefs.available: "1m"
nodefs.inodesFree: "1m"
evictionMaxPodGracePeriod: 300
evictionPressureTransitionPeriod: "30s"
systemReserved:
memory: "1Gi"
cpu: "500m"
kubeReserved:
memory: "1Gi"
cpu: "1"
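With these reservations in place, a node's Allocatable should equal its Capacity minus systemReserved, kubeReserved, and the hard eviction threshold; this can be spot-checked per node (node name is a placeholder):
kubectl describe node node-1 | grep -A14 'Capacity:'
kubectl get node node-1 -o jsonpath='{.status.allocatable.memory}{"\n"}'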
• Created a LimitRange to enforce default resource limits:
# Default resource limits for namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
memory: 512Mi
cpu: 500m
defaultRequest:
memory: 256Mi
cpu: 100m
type: Container
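To confirm the LimitRange defaults are being injected, one can create a throwaway pod without explicit resources and inspect what the admission controller filled in (pod name and image are illustrative):
kubectl run limits-check -n production --image=busybox --restart=Never -- sleep 30
kubectl get pod limits-check -n production -o jsonpath='{.spec.containers[0].resources}{"\n"}'
kubectl delete pod limits-check -n production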
• Implemented a ResourceQuota for namespaces:
# Resource quota for namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: namespace-quota
namespace: production
spec:
hard:
requests.cpu: "20"
requests.memory: 20Gi
limits.cpu: "40"
limits.memory: 40Gi
pods: "50"
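Quota consumption versus the hard limits can then be tracked with:
kubectl describe resourcequota namespace-quota -n production
kubectl get quota -A     # cluster-wide view of request/limit consumption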
• Created a Vertical Pod Autoscaler configuration:
# Vertical Pod Autoscaler configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: my-app-vpa
namespace: production
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: my-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: '*'
minAllowed:
cpu: 50m
memory: 100Mi
maxAllowed:
cpu: 1
memory: 1Gi
controlledResources: ["cpu", "memory"]
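With the VPA objects in place (this assumes the VPA CRDs and recommender are installed), the current recommendations can be reviewed before trusting updateMode: Auto:
kubectl describe vpa my-app-vpa -n production | grep -A12 'Recommendation:'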
• Configured Cluster Autoscaler to consider memory pressure:
# Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
extraArgs:
scale-down-utilization-threshold: "0.5"
scale-down-unneeded-time: "5m"
max-node-provision-time: "15m"
scan-interval: "10s"
scale-down-delay-after-add: "5m"
scale-down-delay-after-delete: "0s"
scale-down-delay-after-failure: "3m"
expendable-pods-priority-cutoff: "-10"
balance-similar-node-groups: "true"
expander: "least-waste"
skip-nodes-with-local-storage: "false"
skip-nodes-with-system-pods: "true"
max-graceful-termination-sec: "600"
memory-utilization-threshold: "0.7"
memory-pressure-threshold: "0.85"
• Implemented a Go-based memory monitoring sidecar:
// memory_monitor.go
package main
import (
    "context"
    "fmt"
    "log"
    "math"
    "net/http"
    "os"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/metrics/pkg/client/clientset/versioned"
)
var (
// Prometheus metrics
nodeMemoryPressure = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_pressure",
Help: "Memory pressure on nodes (0-1)",
},
[]string{"node"},
)
podMemoryUsage = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_usage_bytes",
Help: "Memory usage of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
podMemoryLimit = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_limit_bytes",
Help: "Memory limit of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
podMemoryRequest = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_request_bytes",
Help: "Memory request of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
nodeMemoryCapacity = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_capacity_bytes",
Help: "Memory capacity of nodes in bytes",
},
[]string{"node"},
)
nodeMemoryAllocatable = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_allocatable_bytes",
Help: "Allocatable memory of nodes in bytes",
},
[]string{"node"},
)
evictionRisk = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_eviction_risk",
Help: "Risk of pod eviction (0-1)",
},
[]string{"namespace", "pod", "node"},
)
)
func main() {
// Set up Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to get cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
metricsClient, err := versioned.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Metrics client: %v", err)
}
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Get monitoring interval from environment variable or use default
interval := 60
if intervalStr := os.Getenv("MONITORING_INTERVAL"); intervalStr != "" {
if i, err := strconv.Atoi(intervalStr); err == nil {
interval = i
}
}
// Start monitoring loop
for {
monitorNodes(clientset)
monitorPods(clientset, metricsClient)
time.Sleep(time.Duration(interval) * time.Second)
}
}
func monitorNodes(clientset *kubernetes.Clientset) {
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list nodes: %v", err)
return
}
for _, node := range nodes.Items {
// Get node capacity and allocatable memory
memoryCapacity := node.Status.Capacity.Memory().Value()
memoryAllocatable := node.Status.Allocatable.Memory().Value()
nodeMemoryCapacity.WithLabelValues(node.Name).Set(float64(memoryCapacity))
nodeMemoryAllocatable.WithLabelValues(node.Name).Set(float64(memoryAllocatable))
// Check for memory pressure condition
for _, condition := range node.Status.Conditions {
if condition.Type == "MemoryPressure" {
if condition.Status == "True" {
nodeMemoryPressure.WithLabelValues(node.Name).Set(1)
} else {
nodeMemoryPressure.WithLabelValues(node.Name).Set(0)
}
}
}
}
}
func monitorPods(clientset *kubernetes.Clientset, metricsClient *versioned.Clientset) {
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list pods: %v", err)
return
}
// Get pod metrics
podMetrics, err := metricsClient.MetricsV1beta1().PodMetricses("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to get pod metrics: %v", err)
return
}
// Create a map of pod metrics for easier lookup
podMetricsMap := make(map[string]map[string]int64)
for _, podMetric := range podMetrics.Items {
key := fmt.Sprintf("%s/%s", podMetric.Namespace, podMetric.Name)
podMetricsMap[key] = make(map[string]int64)
for _, container := range podMetric.Containers {
memory := container.Usage.Memory().Value()
if _, exists := podMetricsMap[key]["memory"]; exists {
podMetricsMap[key]["memory"] += memory
} else {
podMetricsMap[key]["memory"] = memory
}
}
}
// Get node allocatable memory for calculating pressure
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list nodes: %v", err)
return
}
    nodeAllocatableMemory := make(map[string]int64)
    nodeUnderMemoryPressure := make(map[string]bool)
    for _, node := range nodes.Items {
        nodeAllocatableMemory[node.Name] = node.Status.Allocatable.Memory().Value()
        for _, condition := range node.Status.Conditions {
            if condition.Type == "MemoryPressure" && condition.Status == "True" {
                nodeUnderMemoryPressure[node.Name] = true
            }
        }
    }
// Process each pod
for _, pod := range pods.Items {
// Skip pods that are not running
if pod.Status.Phase != "Running" {
continue
}
// Get pod memory usage from metrics
key := fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
memoryUsage := int64(0)
if metrics, exists := podMetricsMap[key]; exists {
memoryUsage = metrics["memory"]
}
// Set pod memory usage metric
podMemoryUsage.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryUsage))
// Calculate total memory requests and limits for the pod
memoryRequest := int64(0)
memoryLimit := int64(0)
for _, container := range pod.Spec.Containers {
if container.Resources.Requests != nil {
if memory, exists := container.Resources.Requests["memory"]; exists {
memoryRequest += memory.Value()
}
}
if container.Resources.Limits != nil {
if memory, exists := container.Resources.Limits["memory"]; exists {
memoryLimit += memory.Value()
}
}
}
// Set pod memory request and limit metrics
podMemoryRequest.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryRequest))
podMemoryLimit.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryLimit))
// Calculate eviction risk
risk := 0.0
        // If the pod's node is under memory pressure, every pod on that node is at risk
        if nodeUnderMemoryPressure[pod.Spec.NodeName] {
            risk = 1.0
        }
// If pod has no memory limit, it's at higher risk
if memoryLimit == 0 {
risk = math.Max(risk, 0.8)
}
// If pod is using more than 90% of its limit, it's at risk
if memoryLimit > 0 && float64(memoryUsage)/float64(memoryLimit) > 0.9 {
risk = math.Max(risk, 0.7)
}
// If node allocatable memory is getting low, pods are at risk
if nodeAllocatable, exists := nodeAllocatableMemory[pod.Spec.NodeName]; exists {
nodeUsageRatio := float64(memoryUsage) / float64(nodeAllocatable)
if nodeUsageRatio > 0.85 {
risk = math.Max(risk, nodeUsageRatio - 0.5)
}
}
// Set eviction risk metric
evictionRisk.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(risk)
}
}
• Created a Rust-based memory leak detection tool:
// memory_leak_detector.rs
use chrono::{DateTime, Utc};
use futures::stream::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
api::{Api, ListParams, WatchEvent},
Client,
};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;
use tokio::time;
#[derive(Debug, Clone, Serialize, Deserialize)]
struct MemoryUsageSample {
timestamp: DateTime<Utc>,
usage_bytes: u64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct PodMemoryHistory {
namespace: String,
pod_name: String,
node_name: String,
samples: Vec<MemoryUsageSample>,
start_time: DateTime<Utc>,
last_updated: DateTime<Utc>,
}
impl PodMemoryHistory {
fn new(namespace: String, pod_name: String, node_name: String) -> Self {
let now = Utc::now();
PodMemoryHistory {
namespace,
pod_name,
node_name,
samples: Vec::new(),
start_time: now,
last_updated: now,
}
}
fn add_sample(&mut self, usage_bytes: u64) {
let now = Utc::now();
self.samples.push(MemoryUsageSample {
timestamp: now,
usage_bytes,
});
self.last_updated = now;
// Keep only the last 24 hours of samples
let one_day_ago = now - chrono::Duration::hours(24);
self.samples.retain(|sample| sample.timestamp > one_day_ago);
}
fn detect_leak(&self) -> Option<LeakInfo> {
// Need at least 10 samples to detect a trend
if self.samples.len() < 10 {
return None;
}
// Calculate linear regression to detect trend
let n = self.samples.len() as f64;
let timestamps: Vec<f64> = self.samples
.iter()
.map(|s| s.timestamp.timestamp() as f64)
.collect();
let usages: Vec<f64> = self.samples
.iter()
.map(|s| s.usage_bytes as f64)
.collect();
let sum_x: f64 = timestamps.iter().sum();
let sum_y: f64 = usages.iter().sum();
let sum_xy: f64 = timestamps.iter().zip(usages.iter()).map(|(x, y)| x * y).sum();
let sum_xx: f64 = timestamps.iter().map(|x| x * x).sum();
let slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x);
let intercept = (sum_y - slope * sum_x) / n;
// Calculate R-squared to determine how well the line fits
let mean_y = sum_y / n;
let ss_tot: f64 = usages.iter().map(|y| (y - mean_y).powi(2)).sum();
let ss_res: f64 = usages
.iter()
.zip(timestamps.iter())
.map(|(y, x)| (y - (slope * x + intercept)).powi(2))
.sum();
let r_squared = 1.0 - (ss_res / ss_tot);
// Calculate growth rate in bytes per hour
let growth_rate_per_second = slope;
let growth_rate_per_hour = growth_rate_per_second * 3600.0;
// If we have a positive slope, good R-squared, and significant growth, it's a leak
if slope > 0.0 && r_squared > 0.7 && growth_rate_per_hour > 1024.0 * 1024.0 {
let hours_to_1gb = (1024.0 * 1024.0 * 1024.0 - self.samples.last().unwrap().usage_bytes as f64) / growth_rate_per_hour;
Some(LeakInfo {
growth_rate_bytes_per_hour: growth_rate_per_hour as u64,
r_squared,
hours_to_1gb: if hours_to_1gb > 0.0 { Some(hours_to_1gb) } else { None },
current_usage_bytes: self.samples.last().unwrap().usage_bytes,
duration_hours: (self.last_updated - self.start_time).num_seconds() as f64 / 3600.0,
})
} else {
None
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct LeakInfo {
growth_rate_bytes_per_hour: u64,
r_squared: f64,
hours_to_1gb: Option<f64>,
current_usage_bytes: u64,
duration_hours: f64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct LeakReport {
namespace: String,
pod_name: String,
node_name: String,
leak_info: LeakInfo,
timestamp: DateTime<Utc>,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize Kubernetes client
let client = Client::try_default().await?;
// Create a shared state for pod memory history
let pod_memory_history = Arc::new(Mutex::new(HashMap::<String, PodMemoryHistory>::new()));
// Start the memory usage collector
    let collector_history = pod_memory_history.clone();
    let collector_client = client.clone();
    tokio::spawn(async move {
        collect_memory_usage(collector_client, collector_history).await;
    });
    // Start the leak detector
    let detector_history = pod_memory_history.clone();
    tokio::spawn(async move {
        detect_memory_leaks(detector_history).await;
    });
    // Start the pod watcher to track new and deleted pods
    let watcher_history = pod_memory_history.clone();
    tokio::spawn(async move {
        watch_pods(client, watcher_history).await;
    });
// Keep the main thread running
loop {
time::sleep(Duration::from_secs(3600)).await;
}
}
// Minimal local types for the metrics.k8s.io API response; these types are not
// provided by k8s-openapi, so only the fields needed here are deserialized.
#[derive(Debug, Deserialize)]
struct PodMetricsList { items: Vec<PodMetricsItem> }
#[derive(Debug, Deserialize)]
struct PodMetricsItem { metadata: PodMetricsMeta, containers: Vec<ContainerUsage> }
#[derive(Debug, Deserialize)]
struct PodMetricsMeta { name: Option<String>, namespace: Option<String> }
#[derive(Debug, Deserialize)]
struct ContainerUsage { usage: HashMap<String, String> }

// Parse a Kubernetes memory quantity (e.g. "128974848" or "129Mi") into bytes.
fn parse_memory_quantity(q: &str) -> Option<u64> {
    let suffixes: [(&str, u64); 6] =
        [("Ki", 1 << 10), ("Mi", 1 << 20), ("Gi", 1 << 30), ("Ti", 1u64 << 40), ("Pi", 1u64 << 50), ("Ei", 1u64 << 60)];
    for (suffix, factor) in suffixes {
        if let Some(num) = q.strip_suffix(suffix) {
            return num.parse::<u64>().ok().map(|n| n.saturating_mul(factor));
        }
    }
    q.parse::<u64>().ok()
}

async fn collect_memory_usage(client: Client, history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
    loop {
        // Query the metrics.k8s.io API directly and deserialize into the local
        // types above (the `http` crate is required as a dependency).
        let req = http::Request::get("/apis/metrics.k8s.io/v1beta1/pods")
            .body(Vec::new())
            .unwrap();
        match client.request::<PodMetricsList>(req).await {
            Ok(metrics) => {
                let mut history_lock = history.lock().await;
                for pod_metrics in metrics.items {
                    let namespace = pod_metrics.metadata.namespace.unwrap_or_default();
                    let pod_name = pod_metrics.metadata.name.unwrap_or_default();
                    let key = format!("{}/{}", namespace, pod_name);
                    // Sum memory usage across all containers in the pod
                    let total_memory: u64 = pod_metrics.containers.iter()
                        .filter_map(|c| c.usage.get("memory").and_then(|q| parse_memory_quantity(q)))
                        .sum();
                    // Update or create the pod's history
                    if let Some(pod_history) = history_lock.get_mut(&key) {
                        pod_history.add_sample(total_memory);
                    } else {
                        // Look up the node name from the pod object for new entries
                        let pods: Api<Pod> = Api::namespaced(client.clone(), &namespace);
                        if let Ok(pod) = pods.get(&pod_name).await {
                            let node_name = pod.spec.and_then(|s| s.node_name).unwrap_or_default();
                            let mut new_history = PodMemoryHistory::new(namespace, pod_name, node_name);
                            new_history.add_sample(total_memory);
                            history_lock.insert(key, new_history);
                        }
                    }
                }
            }
            Err(e) => {
                eprintln!("Failed to get pod metrics: {}", e);
            }
        }
        // Collect metrics every minute
        time::sleep(Duration::from_secs(60)).await;
    }
}
async fn detect_memory_leaks(history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
loop {
let mut reports = Vec::new();
// Check for leaks
{
let history_lock = history.lock().await;
for (key, pod_history) in history_lock.iter() {
if let Some(leak_info) = pod_history.detect_leak() {
reports.push(LeakReport {
namespace: pod_history.namespace.clone(),
pod_name: pod_history.pod_name.clone(),
node_name: pod_history.node_name.clone(),
leak_info,
timestamp: Utc::now(),
});
}
}
}
// Report leaks
if !reports.is_empty() {
println!("Memory leak reports:");
for report in reports {
println!("Pod: {}/{} (Node: {})", report.namespace, report.pod_name, report.node_name);
println!(" Growth rate: {} MB/hour", report.leak_info.growth_rate_bytes_per_hour / (1024 * 1024));
println!(" R-squared: {:.4}", report.leak_info.r_squared);
if let Some(hours) = report.leak_info.hours_to_1gb {
println!(" Hours to 1GB: {:.2}", hours);
}
println!(" Current usage: {} MB", report.leak_info.current_usage_bytes / (1024 * 1024));
println!(" Monitored for: {:.2} hours", report.leak_info.duration_hours);
println!();
// Send alert (implement your preferred alerting mechanism)
send_alert(&report).await;
}
}
// Run leak detection every 15 minutes
time::sleep(Duration::from_secs(900)).await;
}
}
async fn watch_pods(client: Client, history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
let pods: Api<Pod> = Api::all(client);
let lp = ListParams::default()
.timeout(60)
.allow_bookmarks();
let mut stream = pods.watch(&lp, "").await.unwrap().boxed();
while let Some(event) = stream.next().await {
match event {
Ok(WatchEvent::Added(pod)) | Ok(WatchEvent::Modified(pod)) => {
let namespace = pod.metadata.namespace.unwrap_or_default();
let pod_name = pod.metadata.name.unwrap_or_default();
let node_name = pod.spec.and_then(|s| s.node_name).unwrap_or_default();
let key = format!("{}/{}", namespace, pod_name);
let mut history_lock = history.lock().await;
if !history_lock.contains_key(&key) {
history_lock.insert(key, PodMemoryHistory::new(namespace, pod_name, node_name));
}
}
Ok(WatchEvent::Deleted(pod)) => {
let namespace = pod.metadata.namespace.unwrap_or_default();
let pod_name = pod.metadata.name.unwrap_or_default();
let key = format!("{}/{}", namespace, pod_name);
let mut history_lock = history.lock().await;
history_lock.remove(&key);
}
Ok(WatchEvent::Bookmark(_)) => {}
Err(e) => {
eprintln!("Watch error: {}", e);
time::sleep(Duration::from_secs(5)).await;
}
}
}
}
async fn send_alert(report: &LeakReport) {
// Implement your preferred alerting mechanism
// This could be Slack, PagerDuty, email, etc.
println!("ALERT: Memory leak detected in pod {}/{}", report.namespace, report.pod_name);
// Example: Send to webhook
let client = reqwest::Client::new();
let _ = client.post("https://alerts.example.com/webhook")
.json(report)
.send()
.await;
}
Lessons Learned:
Proper resource management is critical for Kubernetes cluster stability.
How to Avoid:
Define appropriate resource requests and limits for all containers.
Configure kubelet with proper eviction thresholds and system reservations.
Implement monitoring for memory usage and pressure.
Use LimitRange and ResourceQuota to enforce resource constraints.
Implement memory leak detection for critical applications.
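A quick way to audit the first point, finding containers that still have no memory limit, is a jq query along these lines:
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.limits.memory == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'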
No summary provided
What Happened:
After implementing Cluster Autoscaler for a production EKS cluster, the operations team noticed that nodes were being added and removed frequently throughout the day, sometimes within minutes of each other. This caused pods to be evicted and rescheduled repeatedly, leading to service disruptions and increased AWS costs due to the frequent scaling operations.
Diagnosis Steps:
Analyzed Cluster Autoscaler logs to understand scaling decisions.
Reviewed pod resource requests and limits across the cluster.
Examined node utilization patterns and metrics.
Checked HorizontalPodAutoscaler configurations and behavior.
Investigated pod scheduling and eviction events.
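These diagnosis steps can be approximated with commands like the following; the deployment name cluster-autoscaler and the status ConfigMap are assumptions that depend on how the autoscaler was installed:
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=500 | grep -Ei 'scale.?down|scale.?up|removing node'
kubectl get events -A --sort-by=.lastTimestamp | grep -i cluster-autoscaler
kubectl get hpa -A
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml | head -n 40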
Root Cause:
Multiple issues contributed to the autoscaler thrashing:
1. Pod resource requests were set too low compared to actual usage.
2. The Cluster Autoscaler's scale-down delay was too short (5 minutes).
3. HorizontalPodAutoscalers were configured with overly aggressive scaling parameters.
4. Some stateful applications had no Pod Disruption Budgets defined.
5. The node groups were configured with small step sizes (one node at a time).
Fix/Workaround:
• Short-term: Implemented immediate configuration changes:
# Before: Problematic Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
scaleDownUtilizationThreshold: 0.5
scaleDownUnneededTime: 5m
scaleDownDelayAfterAdd: 5m
maxNodeProvisionTime: 15m
scanInterval: 10s
expendablePodsPriorityCutoff: -10
# After: Optimized Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
scaleDownUtilizationThreshold: 0.5
scaleDownUnneededTime: 15m
scaleDownDelayAfterAdd: 15m
maxNodeProvisionTime: 15m
scanInterval: 30s
expendablePodsPriorityCutoff: -10
scaleDownUnreadyTime: 20m
okTotalUnreadyCount: 3
maxGracefulTerminationSec: 600
ignoreDaemonSetsUtilization: true
skipNodesWithLocalStorage: true
• Added Pod Disruption Budgets for critical services:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
minAvailable: 2 # or use maxUnavailable: 1
selector:
matchLabels:
app: critical-service
• Adjusted resource requests to match actual usage:
# Before: Underprovisioned resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 5
template:
spec:
containers:
- name: api-container
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
# After: Properly sized resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 5
template:
spec:
containers:
- name: api-container
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
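Right-sizing decisions like the one above come from comparing live usage with declared requests; with metrics-server installed the comparison is straightforward:
kubectl top pod -n production --containers | grep api-service
kubectl get deploy api-service -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'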
• Long-term: Implemented a comprehensive autoscaling strategy:
// autoscaling_analyzer.go
package main
import (
    "context"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "sort"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/metrics/pkg/client/clientset/versioned"
)
// PodMetrics represents resource usage metrics for a pod
type PodMetrics struct {
Name string
Namespace string
Containers map[string]ContainerMetrics
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPURatio float64
MemoryRatio float64
}
// ContainerMetrics represents resource usage metrics for a container
type ContainerMetrics struct {
Name string
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPURatio float64
MemoryRatio float64
}
// NodeMetrics represents resource usage metrics for a node
type NodeMetrics struct {
Name string
AllocatableCPU int64
AllocatableMemory int64
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPUUtilization float64
MemoryUtilization float64
Pods int
PodCapacity int
PodUtilization float64
}
// AutoscalingRecommendation represents a recommendation for autoscaling configuration
type AutoscalingRecommendation struct {
ResourceType string `json:"resourceType"`
Name string `json:"name"`
Namespace string `json:"namespace"`
CurrentValue string `json:"currentValue"`
RecommendedValue string `json:"recommendedValue"`
Reason string `json:"reason"`
Impact string `json:"impact"`
Priority int `json:"priority"`
}
func main() {
// Load Kubernetes configuration
kubeconfig := os.Getenv("KUBECONFIG")
var config *rest.Config
var err error
if kubeconfig != "" {
// Use kubeconfig file
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
} else {
// Use in-cluster config
config, err = rest.InClusterConfig()
}
if err != nil {
log.Fatalf("Error building kubeconfig: %v", err)
}
// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating Kubernetes client: %v", err)
}
// Create metrics clientset
metricsClientset, err := versioned.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating metrics client: %v", err)
}
// Collect pod metrics
podMetrics, err := collectPodMetrics(clientset, metricsClientset)
if err != nil {
log.Fatalf("Error collecting pod metrics: %v", err)
}
// Collect node metrics
nodeMetrics, err := collectNodeMetrics(clientset, metricsClientset)
if err != nil {
log.Fatalf("Error collecting node metrics: %v", err)
}
// Analyze autoscaling configuration
recommendations := analyzeAutoscaling(clientset, podMetrics, nodeMetrics)
// Output recommendations
outputRecommendations(recommendations)
}
func collectPodMetrics(clientset *kubernetes.Clientset, metricsClientset *versioned.Clientset) ([]PodMetrics, error) {
// Get all pods
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing pods: %v", err)
}
// Get pod metrics
podMetricsList, err := metricsClientset.MetricsV1beta1().PodMetricses("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error getting pod metrics: %v", err)
}
    // Create a map of pod metrics: namespace -> pod -> container -> resource -> value
    podMetricsMap := make(map[string]map[string]map[string]map[string]int64)
    for _, podMetric := range podMetricsList.Items {
        if _, ok := podMetricsMap[podMetric.Namespace]; !ok {
            podMetricsMap[podMetric.Namespace] = make(map[string]map[string]map[string]int64)
        }
        podMetricsMap[podMetric.Namespace][podMetric.Name] = make(map[string]map[string]int64)
        for _, container := range podMetric.Containers {
            if _, ok := podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]; !ok {
                podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name] = make(map[string]int64)
            }
            podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]["cpu"] = container.Usage.Cpu().MilliValue()
            podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]["memory"] = container.Usage.Memory().Value() / (1024 * 1024) // convert to Mi
        }
    }
}
// Process pod metrics
var result []PodMetrics
for _, pod := range pods.Items {
if pod.Status.Phase != "Running" {
continue
}
podMetric := PodMetrics{
Name: pod.Name,
Namespace: pod.Namespace,
Containers: make(map[string]ContainerMetrics),
}
for _, container := range pod.Spec.Containers {
containerMetric := ContainerMetrics{
Name: container.Name,
}
// Get resource requests
if container.Resources.Requests != nil {
containerMetric.RequestCPU = container.Resources.Requests.Cpu().MilliValue()
containerMetric.RequestMemory = container.Resources.Requests.Memory().Value() / (1024 * 1024) // Convert to Mi
}
// Get resource usage
if metrics, ok := podMetricsMap[pod.Namespace][pod.Name][container.Name]; ok {
containerMetric.UsageCPU = metrics["cpu"]
containerMetric.UsageMemory = metrics["memory"]
// Calculate ratios
if containerMetric.RequestCPU > 0 {
containerMetric.CPURatio = float64(containerMetric.UsageCPU) / float64(containerMetric.RequestCPU)
}
if containerMetric.RequestMemory > 0 {
containerMetric.MemoryRatio = float64(containerMetric.UsageMemory) / float64(containerMetric.RequestMemory)
}
}
podMetric.Containers[container.Name] = containerMetric
// Aggregate pod-level metrics
podMetric.RequestCPU += containerMetric.RequestCPU
podMetric.RequestMemory += containerMetric.RequestMemory
podMetric.UsageCPU += containerMetric.UsageCPU
podMetric.UsageMemory += containerMetric.UsageMemory
}
// Calculate pod-level ratios
if podMetric.RequestCPU > 0 {
podMetric.CPURatio = float64(podMetric.UsageCPU) / float64(podMetric.RequestCPU)
}
if podMetric.RequestMemory > 0 {
podMetric.MemoryRatio = float64(podMetric.UsageMemory) / float64(podMetric.RequestMemory)
}
result = append(result, podMetric)
}
return result, nil
}
func collectNodeMetrics(clientset *kubernetes.Clientset, metricsClientset *versioned.Clientset) ([]NodeMetrics, error) {
// Get all nodes
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing nodes: %v", err)
}
// Get node metrics
nodeMetricsList, err := metricsClientset.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error getting node metrics: %v", err)
}
// Create a map of node metrics
nodeMetricsMap := make(map[string]map[string]int64)
for _, nodeMetric := range nodeMetricsList.Items {
nodeMetricsMap[nodeMetric.Name] = make(map[string]int64)
nodeMetricsMap[nodeMetric.Name]["cpu"] = nodeMetric.Usage.Cpu().MilliValue()
nodeMetricsMap[nodeMetric.Name]["memory"] = nodeMetric.Usage.Memory().Value() / (1024 * 1024) // Convert to Mi
}
// Get all pods to calculate resource requests per node
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing pods: %v", err)
}
// Create a map of resource requests per node
nodeRequestsMap := make(map[string]map[string]int64)
nodePodsMap := make(map[string]int)
for _, pod := range pods.Items {
if pod.Status.Phase != "Running" {
continue
}
nodeName := pod.Spec.NodeName
if nodeName == "" {
continue
}
if _, ok := nodeRequestsMap[nodeName]; !ok {
nodeRequestsMap[nodeName] = make(map[string]int64)
nodeRequestsMap[nodeName]["cpu"] = 0
nodeRequestsMap[nodeName]["memory"] = 0
}
nodePodsMap[nodeName]++
for _, container := range pod.Spec.Containers {
if container.Resources.Requests != nil {
nodeRequestsMap[nodeName]["cpu"] += container.Resources.Requests.Cpu().MilliValue()
nodeRequestsMap[nodeName]["memory"] += container.Resources.Requests.Memory().Value() / (1024 * 1024) // Convert to Mi
}
}
}
// Process node metrics
var result []NodeMetrics
for _, node := range nodes.Items {
nodeMetric := NodeMetrics{
Name: node.Name,
}
// Get allocatable resources
nodeMetric.AllocatableCPU = node.Status.Allocatable.Cpu().MilliValue()
nodeMetric.AllocatableMemory = node.Status.Allocatable.Memory().Value() / (1024 * 1024) // Convert to Mi
nodeMetric.PodCapacity = int(node.Status.Allocatable.Pods().Value())
// Get resource requests
if requests, ok := nodeRequestsMap[node.Name]; ok {
nodeMetric.RequestCPU = requests["cpu"]
nodeMetric.RequestMemory = requests["memory"]
}
// Get resource usage
if metrics, ok := nodeMetricsMap[node.Name]; ok {
nodeMetric.UsageCPU = metrics["cpu"]
nodeMetric.UsageMemory = metrics["memory"]
}
// Get pod count
nodeMetric.Pods = nodePodsMap[node.Name]
// Calculate utilization
nodeMetric.CPUUtilization = float64(nodeMetric.UsageCPU) / float64(nodeMetric.AllocatableCPU)
nodeMetric.MemoryUtilization = float64(nodeMetric.UsageMemory) / float64(nodeMetric.AllocatableMemory)
nodeMetric.PodUtilization = float64(nodeMetric.Pods) / float64(nodeMetric.PodCapacity)
result = append(result, nodeMetric)
}
return result, nil
}
func analyzeAutoscaling(clientset *kubernetes.Clientset, podMetrics []PodMetrics, nodeMetrics []NodeMetrics) []AutoscalingRecommendation {
var recommendations []AutoscalingRecommendation
// Analyze pod resource requests
for _, pod := range podMetrics {
// Check for underprovisioned CPU requests
if pod.CPURatio > 1.5 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodCPURequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dm", pod.RequestCPU),
RecommendedValue: fmt.Sprintf("%dm", int64(float64(pod.UsageCPU)*1.2)),
Reason: fmt.Sprintf("CPU usage (%.2f%%) significantly exceeds request", pod.CPURatio*100),
Impact: "Pod may be throttled and cause node autoscaler thrashing",
Priority: 1,
})
}
// Check for overprovisioned CPU requests
if pod.CPURatio < 0.3 && pod.RequestCPU > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodCPURequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dm", pod.RequestCPU),
RecommendedValue: fmt.Sprintf("%dm", int64(float64(pod.UsageCPU)*1.5)),
Reason: fmt.Sprintf("CPU usage (%.2f%%) is much lower than request", pod.CPURatio*100),
Impact: "Wasted resources and unnecessary scaling",
Priority: 2,
})
}
// Check for underprovisioned memory requests
if pod.MemoryRatio > 1.2 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodMemoryRequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dMi", pod.RequestMemory),
RecommendedValue: fmt.Sprintf("%dMi", int64(float64(pod.UsageMemory)*1.2)),
Reason: fmt.Sprintf("Memory usage (%.2f%%) exceeds request", pod.MemoryRatio*100),
Impact: "Pod may be OOM killed and cause node autoscaler thrashing",
Priority: 1,
})
}
// Check for overprovisioned memory requests
if pod.MemoryRatio < 0.3 && pod.RequestMemory > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodMemoryRequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dMi", pod.RequestMemory),
RecommendedValue: fmt.Sprintf("%dMi", int64(float64(pod.UsageMemory)*1.5)),
Reason: fmt.Sprintf("Memory usage (%.2f%%) is much lower than request", pod.MemoryRatio*100),
Impact: "Wasted resources and unnecessary scaling",
Priority: 2,
})
}
}
// Analyze node utilization
var cpuUtilization, memoryUtilization, podUtilization float64
for _, node := range nodeMetrics {
cpuUtilization += node.CPUUtilization
memoryUtilization += node.MemoryUtilization
podUtilization += node.PodUtilization
}
nodeCount := float64(len(nodeMetrics))
if nodeCount > 0 {
cpuUtilization /= nodeCount
memoryUtilization /= nodeCount
podUtilization /= nodeCount
// Check for cluster-wide resource imbalance
if cpuUtilization < 0.3 && memoryUtilization > 0.7 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "Memory-bound scaling",
RecommendedValue: "Consider memory-optimized node types",
Reason: fmt.Sprintf("Cluster is memory-bound (CPU: %.2f%%, Memory: %.2f%%)", cpuUtilization*100, memoryUtilization*100),
Impact: "Inefficient resource usage and unnecessary scaling",
Priority: 3,
})
}
if memoryUtilization < 0.3 && cpuUtilization > 0.7 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "CPU-bound scaling",
RecommendedValue: "Consider compute-optimized node types",
Reason: fmt.Sprintf("Cluster is CPU-bound (CPU: %.2f%%, Memory: %.2f%%)", cpuUtilization*100, memoryUtilization*100),
Impact: "Inefficient resource usage and unnecessary scaling",
Priority: 3,
})
}
if podUtilization > 0.8 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "Pod-bound scaling",
RecommendedValue: "Increase max pods per node",
Reason: fmt.Sprintf("Cluster is pod-bound (Pod utilization: %.2f%%)", podUtilization*100),
Impact: "Scaling based on pod count rather than resource usage",
Priority: 3,
})
}
}
// Get Cluster Autoscaler configuration
caConfigMap, err := clientset.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "cluster-autoscaler-config", metav1.GetOptions{})
if err == nil {
// Check for aggressive scale-down settings
if caConfig, ok := caConfigMap.Data["config.yaml"]; ok {
if scaleDownTime := getConfigValue(caConfig, "scaleDownUnneededTime"); scaleDownTime != "" {
if scaleDownTime == "5m" || scaleDownTime == "300s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scaleDownUnneededTime",
Namespace: "kube-system",
CurrentValue: scaleDownTime,
RecommendedValue: "15m",
Reason: "Scale-down delay is too short",
Impact: "Causes frequent node removal and pod disruption",
Priority: 1,
})
}
}
if scaleDownDelay := getConfigValue(caConfig, "scaleDownDelayAfterAdd"); scaleDownDelay != "" {
if scaleDownDelay == "5m" || scaleDownDelay == "300s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scaleDownDelayAfterAdd",
Namespace: "kube-system",
CurrentValue: scaleDownDelay,
RecommendedValue: "15m",
Reason: "Scale-down delay after add is too short",
Impact: "Causes thrashing when load fluctuates",
Priority: 1,
})
}
}
if scanInterval := getConfigValue(caConfig, "scanInterval"); scanInterval != "" {
if scanInterval == "10s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scanInterval",
Namespace: "kube-system",
CurrentValue: scanInterval,
RecommendedValue: "30s",
Reason: "Scan interval is too short",
Impact: "Causes frequent scaling decisions and API server load",
Priority: 2,
})
}
}
}
}
// Get HorizontalPodAutoscaler configurations
hpas, err := clientset.AutoscalingV2beta2().HorizontalPodAutoscalers("").List(context.TODO(), metav1.ListOptions{})
if err == nil {
for _, hpa := range hpas.Items {
// Check for aggressive scaling settings
if hpa.Spec.Behavior != nil {
if hpa.Spec.Behavior.ScaleDown != nil {
for _, policy := range hpa.Spec.Behavior.ScaleDown.Policies {
if policy.PeriodSeconds < 60 && policy.Type == "Percent" && policy.Value > 20 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "HorizontalPodAutoscaler",
Name: hpa.Name,
Namespace: hpa.Namespace,
CurrentValue: fmt.Sprintf("%d%% every %ds", policy.Value, policy.PeriodSeconds),
RecommendedValue: fmt.Sprintf("10%% every 60s"),
Reason: "Scale-down policy is too aggressive",
Impact: "Causes rapid pod scaling and potential thrashing",
Priority: 2,
})
}
}
}
if hpa.Spec.Behavior.ScaleUp != nil {
for _, policy := range hpa.Spec.Behavior.ScaleUp.Policies {
if policy.PeriodSeconds < 60 && policy.Type == "Percent" && policy.Value > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "HorizontalPodAutoscaler",
Name: hpa.Name,
Namespace: hpa.Namespace,
CurrentValue: fmt.Sprintf("%d%% every %ds", policy.Value, policy.PeriodSeconds),
RecommendedValue: fmt.Sprintf("50%% every 60s"),
Reason: "Scale-up policy is too aggressive",
Impact: "Causes rapid pod scaling and potential thrashing",
Priority: 2,
})
}
}
}
}
}
}
// Sort recommendations by priority
sort.Slice(recommendations, func(i, j int) bool {
return recommendations[i].Priority < recommendations[j].Priority
})
return recommendations
}
func getConfigValue(config string, key string) string {
// Simple parser for YAML-like config
// In a real implementation, use a proper YAML parser
lines := strings.Split(config, "\n")
for _, line := range lines {
parts := strings.SplitN(line, ":", 2)
if len(parts) == 2 && strings.TrimSpace(parts[0]) == key {
return strings.TrimSpace(parts[1])
}
}
return ""
}
func outputRecommendations(recommendations []AutoscalingRecommendation) {
// Output recommendations as JSON
output, err := json.MarshalIndent(recommendations, "", " ")
if err != nil {
log.Fatalf("Error marshaling recommendations: %v", err)
}
fmt.Println(string(output))
// Write recommendations to file
err = ioutil.WriteFile("autoscaling_recommendations.json", output, 0644)
if err != nil {
log.Fatalf("Error writing recommendations to file: %v", err)
}
// Print summary
fmt.Printf("\nAutoscaling Analysis Summary:\n")
fmt.Printf("Found %d recommendations\n", len(recommendations))
priorityCount := make(map[int]int)
typeCount := make(map[string]int)
for _, rec := range recommendations {
priorityCount[rec.Priority]++
typeCount[rec.ResourceType]++
}
fmt.Printf("\nRecommendations by priority:\n")
for priority := 1; priority <= 3; priority++ {
fmt.Printf("Priority %d: %d recommendations\n", priority, priorityCount[priority])
}
fmt.Printf("\nRecommendations by type:\n")
for resourceType, count := range typeCount {
fmt.Printf("%s: %d recommendations\n", resourceType, count)
}
}
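The analyzer above can be run ad hoc; one way to invoke it, assuming it is saved as autoscaling_analyzer.go inside a Go module with client-go and k8s.io/metrics as dependencies:
KUBECONFIG="$HOME/.kube/config" go run ./autoscaling_analyzer.go
jq '[.[] | select(.priority == 1)]' autoscaling_recommendations.json   # highest-priority findings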
• Created a comprehensive autoscaling configuration guide:
# autoscaling_best_practices.yaml
---
cluster_autoscaler:
configuration:
# Delay before scaling down a node after it becomes unneeded
scaleDownUnneededTime: 15m
# Delay before scaling down after a node was added
scaleDownDelayAfterAdd: 15m
# Delay before scaling down after a node was removed
scaleDownDelayAfterDelete: 15m
# Delay before scaling down after a failure
scaleDownDelayAfterFailure: 3m
# How often CA checks for scale up/down
scanInterval: 30s
# Utilization threshold below which a node may be considered for scale down
scaleDownUtilizationThreshold: 0.5
# Maximum time to wait for node provisioning
maxNodeProvisionTime: 15m
    # Maximum total number of nodes in the cluster
    maxNodesTotal: 100
    # Maximum percentage of unready nodes tolerated before the autoscaler halts operations
    maxTotalUnreadyPercentage: 45
# Skip nodes with local storage
skipNodesWithLocalStorage: true
# Skip nodes with system pods
skipNodesWithSystemPods: true
# Ignore DaemonSet pods when calculating resource utilization
ignoreDaemonSetsUtilization: true
# Maximum time to wait for pod deletion during scale down
maxGracefulTerminationSec: 600
node_groups:
# Configure multiple node groups with different instance types
- name: "general-purpose"
min_size: 3
max_size: 20
instance_type: "m5.2xlarge"
- name: "cpu-optimized"
min_size: 0
max_size: 10
instance_type: "c5.2xlarge"
- name: "memory-optimized"
min_size: 0
max_size: 10
instance_type: "r5.2xlarge"
scaling_strategy:
# Scale up quickly, scale down slowly
scale_up_step_size: 2
scale_down_step_size: 1
# Use priority expander to prefer cheaper node groups
expander: "priority"
expander_priorities:
- group_name: "general-purpose"
priority: 100
- group_name: "spot-instances"
priority: 50
- group_name: "cpu-optimized"
priority: 25
- group_name: "memory-optimized"
priority: 25
horizontal_pod_autoscaler:
configuration:
# Default HPA behavior
default_behavior:
scale_up:
stabilization_window_seconds: 300
select_policy: "Max"
policies:
- type: "Percent"
value: 50
period_seconds: 60
- type: "Pods"
value: 4
period_seconds: 60
scale_down:
stabilization_window_seconds: 300
select_policy: "Min"
policies:
- type: "Percent"
value: 10
period_seconds: 60
metrics:
# Use multiple metrics for more stable scaling
- type: "Resource"
resource:
name: "cpu"
target:
type: "Utilization"
averageUtilization: 70
- type: "Resource"
resource:
name: "memory"
target:
type: "Utilization"
averageUtilization: 70
# Consider custom metrics for application-specific scaling
- type: "Pods"
pods:
metric:
name: "requests_per_second"
target:
type: "AverageValue"
averageValue: 1000
pod_disruption_budgets:
# Configure PDBs for all critical services
critical_services:
min_available: "50%"
stateful_services:
min_available: 1
batch_jobs:
max_unavailable: "50%"
resource_requests:
# Guidelines for setting resource requests
cpu:
# Set requests based on observed P90 usage
buffer_factor: 1.5
min_request: "50m"
memory:
# Set requests based on observed P95 usage
buffer_factor: 1.2
min_request: "64Mi"
# Regularly review and adjust based on actual usage
review_period: "2 weeks"
monitoring:
# Metrics to monitor for autoscaling health
metrics:
- name: "cluster_autoscaler_activity"
description: "Count of scale up/down activities"
alert_threshold: "10 per hour"
- name: "node_creation_time"
description: "Time taken to create and ready a new node"
alert_threshold: "> 5 minutes"
- name: "pod_disruption_events"
description: "Count of pods disrupted due to node scaling"
alert_threshold: "> 20 per hour"
- name: "resource_request_vs_usage_ratio"
description: "Ratio of requested resources to actual usage"
alert_threshold: "< 0.3 or > 1.5"
dashboards:
- name: "Autoscaling Overview"
panels:
- title: "Cluster Autoscaler Activity"
metrics: ["cluster_autoscaler_scale_up_count", "cluster_autoscaler_scale_down_count"]
- title: "Node Count by Type"
metrics: ["kube_node_labels"]
- title: "Pod Disruptions"
metrics: ["kube_pod_deleted", "kube_pod_container_status_waiting"]
- title: "Resource Utilization vs Requests"
metrics: ["container_cpu_usage_seconds_total", "kube_pod_container_resource_requests"]
Lessons Learned:
Effective autoscaling requires careful tuning of multiple components working together.
How to Avoid:
Configure appropriate scale-down delays to prevent thrashing.
Set accurate resource requests based on actual usage.
Implement Pod Disruption Budgets for all critical services.
Use multiple node groups with different instance types.
Monitor autoscaling behavior and adjust configurations regularly.
No summary provided
What Happened:
A company upgraded their Kubernetes cluster from version 1.23 to 1.24. After the upgrade, their distributed database running as a StatefulSet experienced severe issues. Pods were unable to form a proper cluster, and some nodes reported data corruption. The application logs showed that pods were unable to recognize their peers, despite the StatefulSet appearing to be running normally according to Kubernetes.
Diagnosis Steps:
Analyzed application logs for error patterns.
Compared pod hostnames and network identities before and after the upgrade.
Reviewed StatefulSet configuration and PersistentVolumeClaim bindings.
Examined Kubernetes events and controller logs.
Tested pod-to-pod communication using network utilities.
Root Cause:
The investigation revealed multiple issues: 1. The Kubernetes upgrade included changes to the StatefulSet controller's pod identity management 2. Custom pod management policy settings in the StatefulSet were incompatible with the new version 3. The headless service DNS records were not properly updated after the upgrade 4. PersistentVolumeClaims were correctly bound, but pods couldn't resolve their peers by hostname 5. A custom admission controller was modifying pod hostnames in a way that conflicted with StatefulSet naming
Fix/Workaround:
• Short-term: Implemented a temporary fix by manually correcting DNS entries and restarting pods in the correct order
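A minimal sketch of the ordered restart, assuming the db-cluster StatefulSet and production namespace used in the manifests below; pods are deleted from the highest ordinal down, waiting for each replacement to become Ready before moving on:
#!/bin/bash
# Restart StatefulSet pods one at a time, highest ordinal first
set -euo pipefail
NAMESPACE=production
STS=db-cluster
REPLICAS=$(kubectl get statefulset "$STS" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
for ((i=REPLICAS-1; i>=0; i--)); do
POD="$STS-$i"
echo "Restarting $POD..."
kubectl delete pod "$POD" -n "$NAMESPACE"
# Give the controller a moment to recreate the pod before waiting on it
sleep 10
kubectl wait --for=condition=Ready "pod/$POD" -n "$NAMESPACE" --timeout=300s
done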
• Updated the StatefulSet configuration to be compatible with Kubernetes 1.24:
# Before: Problematic StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db-cluster
namespace: production
spec:
serviceName: db-cluster-headless
replicas: 5
podManagementPolicy: Parallel
updateStrategy:
type: OnDelete
selector:
matchLabels:
app: distributed-db
template:
metadata:
labels:
app: distributed-db
spec:
terminationGracePeriodSeconds: 10
hostname: db-node
subdomain: db-cluster-domain
containers:
- name: db
image: distributed-db:v1.2.3
ports:
- containerPort: 9042
name: client
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
env:
- name: CLUSTER_NAME
value: "db-cluster"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: data
mountPath: /var/lib/db
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "fast-ssd"
resources:
requests:
storage: 100Gi
# After: Fixed StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db-cluster
namespace: production
spec:
serviceName: db-cluster-headless
replicas: 5
podManagementPolicy: OrderedReady # Changed from Parallel
updateStrategy:
type: RollingUpdate # Changed from OnDelete
rollingUpdate:
partition: 0
selector:
matchLabels:
app: distributed-db
template:
metadata:
labels:
app: distributed-db
spec:
terminationGracePeriodSeconds: 30 # Increased grace period
# Removed custom hostname and subdomain settings
containers:
- name: db
image: distributed-db:v1.2.3
ports:
- containerPort: 9042
name: client
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
env:
- name: CLUSTER_NAME
value: "db-cluster"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# Added readiness probe
readinessProbe:
exec:
command:
- /bin/sh
- -c
- nodetool status | grep -E "^UN\\s+${POD_IP}"
initialDelaySeconds: 15
timeoutSeconds: 5
# Added lifecycle hooks
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- nodetool drain
volumeMounts:
- name: data
mountPath: /var/lib/db
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "fast-ssd"
resources:
requests:
storage: 100Gi
• Fixed the headless service configuration:
# Before: Problematic headless service
apiVersion: v1
kind: Service
metadata:
name: db-cluster-headless
namespace: production
labels:
app: distributed-db
spec:
ports:
- port: 9042
name: client
- port: 7000
name: intra-node
- port: 7001
name: tls-intra-node
clusterIP: None
selector:
app: distributed-db
publishNotReadyAddresses: true
# After: Fixed headless service
apiVersion: v1
kind: Service
metadata:
name: db-cluster-headless
namespace: production
labels:
app: distributed-db
annotations:
service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
ports:
- port: 9042
name: client
- port: 7000
name: intra-node
- port: 7001
name: tls-intra-node
clusterIP: None
selector:
app: distributed-db
publishNotReadyAddresses: false # Changed to false
• Implemented a custom StatefulSet controller in Go to handle the migration:
// statefulset_migration.go
package main
import (
"context"
"flag"
"fmt"
"os"
"os/signal"
"sort"
"strconv"
"strings"
"syscall"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog/v2"
)
var (
kubeconfig string
namespace string
statefulSetName string
migrationMode string
gracePeriodSec int
waitTimeSec int
skipConfirmation bool
preserveNodeOrder bool
)
func init() {
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.StringVar(&namespace, "namespace", "", "Namespace of the StatefulSet")
flag.StringVar(&statefulSetName, "statefulset", "", "Name of the StatefulSet")
flag.StringVar(&migrationMode, "mode", "validate", "Migration mode: validate, prepare, migrate, rollback")
flag.IntVar(&gracePeriodSec, "grace-period", 30, "Grace period in seconds for pod termination")
flag.IntVar(&waitTimeSec, "wait-time", 60, "Wait time in seconds between pod operations")
flag.BoolVar(&skipConfirmation, "yes", false, "Skip confirmation prompts")
flag.BoolVar(&preserveNodeOrder, "preserve-order", true, "Preserve node order during migration")
flag.Parse()
}
func main() {
if namespace == "" || statefulSetName == "" {
klog.Fatal("Namespace and StatefulSet name are required")
}
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
klog.Info("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
klog.Infof("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
klog.Fatalf("Error building kubeconfig: %s", err.Error())
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
klog.Fatalf("Error creating Kubernetes client: %s", err.Error())
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Handle termination signals
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
go func() {
<-signalChan
klog.Info("Received termination signal, shutting down gracefully...")
cancel()
os.Exit(0)
}()
// Execute the selected mode
switch migrationMode {
case "validate":
validateStatefulSet(ctx, clientset)
case "prepare":
prepareStatefulSetMigration(ctx, clientset)
case "migrate":
migrateStatefulSet(ctx, clientset)
case "rollback":
rollbackStatefulSetMigration(ctx, clientset)
default:
klog.Fatalf("Unknown migration mode: %s", migrationMode)
}
}
func validateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Validating StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Validate StatefulSet configuration
issues := []string{}
// Check podManagementPolicy
if sts.Spec.PodManagementPolicy == appsv1.ParallelPodManagement {
issues = append(issues, "PodManagementPolicy is set to Parallel, which may cause issues after upgrade. Consider changing to OrderedReady.")
}
// Check updateStrategy
if sts.Spec.UpdateStrategy.Type == appsv1.OnDeleteStatefulSetStrategyType {
issues = append(issues, "UpdateStrategy is set to OnDelete, which may cause issues after upgrade. Consider changing to RollingUpdate.")
}
// Check for custom hostname and subdomain
if sts.Spec.Template.Spec.Hostname != "" || sts.Spec.Template.Spec.Subdomain != "" {
issues = append(issues, "Custom hostname or subdomain is set, which may conflict with StatefulSet naming in newer Kubernetes versions.")
}
// Check for readiness probes
for _, container := range sts.Spec.Template.Spec.Containers {
if container.ReadinessProbe == nil {
issues = append(issues, fmt.Sprintf("Container %s does not have a readiness probe, which may cause issues with pod identity.", container.Name))
}
}
// Check for lifecycle hooks
for _, container := range sts.Spec.Template.Spec.Containers {
if container.Lifecycle == nil || container.Lifecycle.PreStop == nil {
issues = append(issues, fmt.Sprintf("Container %s does not have a preStop lifecycle hook, which may cause issues with graceful termination.", container.Name))
}
}
// Check termination grace period
if sts.Spec.Template.Spec.TerminationGracePeriodSeconds != nil && *sts.Spec.Template.Spec.TerminationGracePeriodSeconds < 30 {
issues = append(issues, "TerminationGracePeriodSeconds is less than 30 seconds, which may not be enough for graceful termination.")
}
// Check headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
issues = append(issues, fmt.Sprintf("Error getting headless service %s: %s", sts.Spec.ServiceName, err.Error()))
} else {
if svc.Spec.ClusterIP != "None" {
issues = append(issues, "Service is not headless (ClusterIP is not None).")
}
if svc.Spec.PublishNotReadyAddresses {
issues = append(issues, "PublishNotReadyAddresses is set to true, which may cause issues with pod identity in newer Kubernetes versions.")
}
}
// Check PVCs
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
issues = append(issues, fmt.Sprintf("Error listing PVCs: %s", err.Error()))
} else {
klog.Infof("Found %d PVCs for StatefulSet %s", len(pvcs.Items), statefulSetName)
for _, pvc := range pvcs.Items {
if pvc.Status.Phase != corev1.ClaimBound {
issues = append(issues, fmt.Sprintf("PVC %s is not bound (status: %s)", pvc.Name, pvc.Status.Phase))
}
}
}
// Check pods
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
issues = append(issues, fmt.Sprintf("Error listing pods: %s", err.Error()))
} else {
klog.Infof("Found %d pods for StatefulSet %s", len(pods.Items), statefulSetName)
for _, pod := range pods.Items {
if !strings.HasPrefix(pod.Name, statefulSetName+"-") {
issues = append(issues, fmt.Sprintf("Pod %s does not follow StatefulSet naming convention", pod.Name))
}
}
}
// Print validation results
if len(issues) == 0 {
klog.Info("No issues found with StatefulSet configuration.")
} else {
klog.Info("Found the following issues with StatefulSet configuration:")
for i, issue := range issues {
klog.Infof("%d. %s", i+1, issue)
}
}
}
func prepareStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Preparing StatefulSet %s in namespace %s for migration", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Create a backup of the StatefulSet
backupName := fmt.Sprintf("%s-backup-%d", statefulSetName, time.Now().Unix())
backupSts := sts.DeepCopy()
backupSts.ObjectMeta.Name = backupName
backupSts.ObjectMeta.ResourceVersion = ""
backupSts.ObjectMeta.UID = ""
backupSts.ObjectMeta.CreationTimestamp = metav1.Time{}
backupSts.Status = appsv1.StatefulSetStatus{}
// Add backup annotation to the original StatefulSet
if sts.Annotations == nil {
sts.Annotations = make(map[string]string)
}
sts.Annotations["statefulset-migration.kubernetes.io/backup"] = backupName
// Create the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, backupSts, metav1.CreateOptions{})
if err != nil {
klog.Fatalf("Error creating backup StatefulSet: %s", err.Error())
}
klog.Infof("Created backup StatefulSet %s", backupName)
// Update the original StatefulSet with the backup annotation
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, sts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet with backup annotation: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s with backup annotation", statefulSetName)
// Create a ConfigMap with the migration plan
migrationPlan := map[string]string{
"backup": backupName,
"statefulset": statefulSetName,
"namespace": namespace,
"created_at": time.Now().Format(time.RFC3339),
"created_by": "statefulset-migration-tool",
"instructions": "This ConfigMap contains the migration plan for StatefulSet " + statefulSetName + " in namespace " + namespace + ". Do not delete this ConfigMap until the migration is complete.",
}
migrationPlanCM := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-migration-plan", statefulSetName),
Namespace: namespace,
Labels: map[string]string{
"app.kubernetes.io/managed-by": "statefulset-migration-tool",
"app.kubernetes.io/name": statefulSetName,
},
},
Data: migrationPlan,
}
_, err = clientset.CoreV1().ConfigMaps(namespace).Create(ctx, migrationPlanCM, metav1.CreateOptions{})
if err != nil {
klog.Fatalf("Error creating migration plan ConfigMap: %s", err.Error())
}
klog.Infof("Created migration plan ConfigMap %s", migrationPlanCM.Name)
klog.Info("StatefulSet migration preparation complete.")
}
func migrateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Migrating StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Check if the StatefulSet has a backup annotation
backupName, ok := sts.Annotations["statefulset-migration.kubernetes.io/backup"]
if !ok {
klog.Fatal("StatefulSet does not have a backup annotation. Run prepare mode first.")
}
// Get the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting backup StatefulSet: %s", err.Error())
}
// Get the pods
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
// Sort pods by ordinal index
sort.Slice(pods.Items, func(i, j int) bool {
ordinalI := getOrdinalFromPodName(pods.Items[i].Name, statefulSetName)
ordinalJ := getOrdinalFromPodName(pods.Items[j].Name, statefulSetName)
return ordinalI < ordinalJ
})
// Confirm migration
if !skipConfirmation {
klog.Infof("Will migrate %d pods of StatefulSet %s in namespace %s", len(pods.Items), statefulSetName, namespace)
klog.Info("This will cause temporary disruption to the service.")
klog.Info("Press Ctrl+C to cancel or Enter to continue...")
fmt.Scanln()
}
// Update StatefulSet configuration
newSts := sts.DeepCopy()
// Update podManagementPolicy
newSts.Spec.PodManagementPolicy = appsv1.OrderedReadyPodManagement
// Update updateStrategy
newSts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
Type: appsv1.RollingUpdateStatefulSetStrategyType,
RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
Partition: new(int32),
},
}
// Remove custom hostname and subdomain
newSts.Spec.Template.Spec.Hostname = ""
newSts.Spec.Template.Spec.Subdomain = ""
// Set termination grace period
gracePeriod := int64(gracePeriodSec)
newSts.Spec.Template.Spec.TerminationGracePeriodSeconds = &gracePeriod
// Add migration in progress annotation
if newSts.Annotations == nil {
newSts.Annotations = make(map[string]string)
}
newSts.Annotations["statefulset-migration.kubernetes.io/in-progress"] = "true"
// Update the StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, newSts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s configuration", statefulSetName)
// Scale down the StatefulSet to 0 replicas
klog.Info("Scaling down StatefulSet to 0 replicas")
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
originalReplicas := scale.Spec.Replicas
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling down StatefulSet: %s", err.Error())
}
// Wait for all pods to terminate
klog.Info("Waiting for all pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
if len(pods.Items) == 0 {
break
}
klog.Infof("%d pods still running, waiting...", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Update the headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting headless service: %s", err.Error())
}
newSvc := svc.DeepCopy()
newSvc.Spec.PublishNotReadyAddresses = false
// Add migration annotation
if newSvc.Annotations == nil {
newSvc.Annotations = make(map[string]string)
}
newSvc.Annotations["service.alpha.kubernetes.io/tolerate-unready-endpoints"] = "true"
_, err = clientset.CoreV1().Services(namespace).Update(ctx, newSvc, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating headless service: %s", err.Error())
}
klog.Infof("Updated headless service %s", svc.Name)
// Scale up the StatefulSet to the original number of replicas
klog.Infof("Scaling up StatefulSet to %d replicas", originalReplicas)
// Re-fetch the scale subresource so the update does not fail on a stale resourceVersion
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling up StatefulSet: %s", err.Error())
}
// Wait for all pods to be ready
klog.Info("Waiting for all pods to be ready...")
for {
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
if sts.Status.ReadyReplicas == sts.Status.Replicas {
break
}
klog.Infof("%d/%d pods ready, waiting...", sts.Status.ReadyReplicas, sts.Status.Replicas)
time.Sleep(5 * time.Second)
}
// Remove migration in progress annotation
sts, err = clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
delete(sts.Annotations, "statefulset-migration.kubernetes.io/in-progress")
sts.Annotations["statefulset-migration.kubernetes.io/completed"] = time.Now().Format(time.RFC3339)
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, sts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Info("StatefulSet migration completed successfully.")
}
func rollbackStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Rolling back StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Check if the StatefulSet has a backup annotation
backupName, ok := sts.Annotations["statefulset-migration.kubernetes.io/backup"]
if !ok {
klog.Fatal("StatefulSet does not have a backup annotation. Cannot rollback.")
}
// Get the backup StatefulSet
backupSts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting backup StatefulSet: %s", err.Error())
}
// Confirm rollback
if !skipConfirmation {
klog.Infof("Will roll back StatefulSet %s in namespace %s to backup %s", statefulSetName, namespace, backupName)
klog.Info("This will cause temporary disruption to the service.")
klog.Info("Press Ctrl+C to cancel or Enter to continue...")
fmt.Scanln()
}
// Scale down the StatefulSet to 0 replicas
klog.Info("Scaling down StatefulSet to 0 replicas")
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
originalReplicas := scale.Spec.Replicas
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling down StatefulSet: %s", err.Error())
}
// Wait for all pods to terminate
klog.Info("Waiting for all pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
if len(pods.Items) == 0 {
break
}
klog.Infof("%d pods still running, waiting...", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Update the StatefulSet with the backup configuration
newSts := sts.DeepCopy()
newSts.Spec = backupSts.Spec
// Add rollback annotation
if newSts.Annotations == nil {
newSts.Annotations = make(map[string]string)
}
newSts.Annotations["statefulset-migration.kubernetes.io/rollback"] = time.Now().Format(time.RFC3339)
delete(newSts.Annotations, "statefulset-migration.kubernetes.io/in-progress")
delete(newSts.Annotations, "statefulset-migration.kubernetes.io/completed")
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, newSts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s configuration to backup", statefulSetName)
// Restore the headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting headless service: %s", err.Error())
}
newSvc := svc.DeepCopy()
newSvc.Spec.PublishNotReadyAddresses = true
// Remove migration annotation
if newSvc.Annotations != nil {
delete(newSvc.Annotations, "service.alpha.kubernetes.io/tolerate-unready-endpoints")
}
_, err = clientset.CoreV1().Services(namespace).Update(ctx, newSvc, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating headless service: %s", err.Error())
}
klog.Infof("Updated headless service %s", svc.Name)
// Scale up the StatefulSet to the original number of replicas
klog.Infof("Scaling up StatefulSet to %d replicas", originalReplicas)
// Re-fetch the scale subresource so the update does not fail on a stale resourceVersion
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling up StatefulSet: %s", err.Error())
}
// Wait for all pods to be ready
klog.Info("Waiting for all pods to be ready...")
for {
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
if sts.Status.ReadyReplicas == sts.Status.Replicas {
break
}
klog.Infof("%d/%d pods ready, waiting...", sts.Status.ReadyReplicas, sts.Status.Replicas)
time.Sleep(5 * time.Second)
}
klog.Info("StatefulSet rollback completed successfully.")
}
// getOrdinalFromPodName extracts the ordinal index from a pod name
func getOrdinalFromPodName(podName, statefulSetName string) int {
if !strings.HasPrefix(podName, statefulSetName+"-") {
return -1
}
ordinalStr := strings.TrimPrefix(podName, statefulSetName+"-")
ordinal, err := strconv.Atoi(ordinalStr)
if err != nil {
return -1
}
return ordinal
}
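A possible invocation sequence for the tool above, using the flags defined in its init function (binary name and kubeconfig path are examples); run validate first and only migrate once no issues are reported:
go build -o statefulset-migration statefulset_migration.go
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=validate
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=prepare
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=migrate --grace-period=30 --wait-time=60
# If the migration misbehaves, restore the previous configuration
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=rollback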
• Long-term: Implemented a comprehensive StatefulSet management strategy:
- Created a StatefulSet upgrade validation tool
- Implemented automated pre-upgrade testing
- Added monitoring for StatefulSet identity issues
- Documented best practices for StatefulSet configuration
- Created a StatefulSet migration playbook
Lessons Learned:
StatefulSet pod identity management is critical for distributed systems and requires careful handling during Kubernetes upgrades.
How to Avoid:
Test StatefulSet configurations in a staging environment before upgrading production.
Follow Kubernetes version upgrade notes carefully for StatefulSet changes.
Use the default StatefulSet naming convention without custom hostnames.
Implement proper readiness probes and lifecycle hooks.
Monitor DNS resolution between StatefulSet pods after upgrades (a check is sketched below).
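A minimal sketch of such a DNS check, run via short-lived busybox pods; the names follow the db-cluster example above and the image tag is an example:
# Resolve every pod's stable DNS name through the headless service
NAMESPACE=production
STS=db-cluster
SVC=db-cluster-headless
REPLICAS=$(kubectl get statefulset "$STS" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
for ((i=0; i<REPLICAS; i++)); do
FQDN="$STS-$i.$SVC.$NAMESPACE.svc.cluster.local"
kubectl run "dns-check-$i" --rm -i --restart=Never -n "$NAMESPACE" \
--image=busybox:1.36 -- nslookup "$FQDN" || echo "FAILED to resolve $FQDN"
done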
No summary provided
What Happened:
A company was upgrading their Kubernetes cluster from version 1.23 to 1.25. During the upgrade process, several StatefulSet-managed applications experienced unexpected pod restarts. After the upgrade, users reported data inconsistencies and some services failed to start properly. Investigation revealed that some StatefulSet pods had been assigned incorrect persistent volumes, causing data corruption and service disruption.
Diagnosis Steps:
Analyzed pod events and logs before and after the upgrade.
Examined StatefulSet configurations and PersistentVolumeClaim bindings.
Reviewed the upgrade process and component versions.
Tested pod identity preservation in a staging environment.
Consulted Kubernetes release notes for known issues with StatefulSets.
Root Cause:
The investigation revealed multiple issues: 1. The upgrade process did not properly preserve StatefulSet pod identities 2. A custom admission controller was modifying pod names during recreation 3. The storage class used for PVCs had reclaim policy set to "Delete" instead of "Retain" 4. Some PVCs were using volumeMode: Filesystem with block storage, causing mounting issues 5. The StatefulSet controller had a race condition when recreating pods during the upgrade
Fix/Workaround:
• Short-term: Implemented a controlled StatefulSet migration process:
// statefulset_migration.go
package main
import (
"context"
"flag"
"fmt"
"os"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
kubeconfig := flag.String("kubeconfig", "", "Path to kubeconfig file")
namespace := flag.String("namespace", "", "Namespace of the StatefulSet")
name := flag.String("name", "", "Name of the StatefulSet")
action := flag.String("action", "validate", "Action to perform: validate, prepare, migrate, rollback")
flag.Parse()
if *namespace == "" || *name == "" {
fmt.Println("Error: namespace and name are required")
os.Exit(1)
}
// Load kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
if err != nil {
fmt.Printf("Error building kubeconfig: %v\n", err)
os.Exit(1)
}
// Create clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
fmt.Printf("Error creating Kubernetes client: %v\n", err)
os.Exit(1)
}
ctx := context.Background()
// Perform the requested action
switch *action {
case "validate":
validateStatefulSet(ctx, clientset, *namespace, *name)
case "prepare":
prepareStatefulSetMigration(ctx, clientset, *namespace, *name)
case "migrate":
migrateStatefulSet(ctx, clientset, *namespace, *name)
case "rollback":
rollbackStatefulSetMigration(ctx, clientset, *namespace, *name)
default:
fmt.Printf("Unknown action: %s\n", *action)
os.Exit(1)
}
}
func validateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Validating StatefulSet %s in namespace %s...\n", name, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Check if StatefulSet is stable
if sts.Status.Replicas != sts.Status.ReadyReplicas {
fmt.Printf("Warning: StatefulSet is not stable. Replicas: %d, ReadyReplicas: %d\n",
sts.Status.Replicas, sts.Status.ReadyReplicas)
} else {
fmt.Printf("StatefulSet is stable with %d replicas\n", sts.Status.Replicas)
}
// Get PVCs associated with the StatefulSet
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing PVCs: %v\n", err)
os.Exit(1)
}
fmt.Printf("Found %d PVCs associated with the StatefulSet\n", len(pvcs.Items))
// Check PVC bindings
for _, pvc := range pvcs.Items {
if pvc.Status.Phase != "Bound" {
fmt.Printf("Warning: PVC %s is not bound (status: %s)\n", pvc.Name, pvc.Status.Phase)
} else {
fmt.Printf("PVC %s is bound to PV %s\n", pvc.Name, pvc.Spec.VolumeName)
}
// Check volumeMode
if pvc.Spec.VolumeMode != nil && *pvc.Spec.VolumeMode == "Block" {
fmt.Printf("Warning: PVC %s uses Block volumeMode which requires special handling\n", pvc.Name)
}
}
// Get pods associated with the StatefulSet
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
fmt.Printf("Found %d pods associated with the StatefulSet\n", len(pods.Items))
// Check pod status and volume mounts
for _, pod := range pods.Items {
fmt.Printf("Pod %s status: %s\n", pod.Name, pod.Status.Phase)
// Check if pod name follows StatefulSet naming convention
if !isValidStatefulSetPodName(pod.Name, name) {
fmt.Printf("Warning: Pod %s does not follow StatefulSet naming convention\n", pod.Name)
}
// Check volume mounts
for _, volume := range pod.Spec.Volumes {
if volume.PersistentVolumeClaim != nil {
fmt.Printf("Pod %s uses PVC %s\n", pod.Name, volume.PersistentVolumeClaim.ClaimName)
// Verify PVC name matches pod ordinal
expectedPVCName := fmt.Sprintf("data-%s-%s", name, getPodOrdinal(pod.Name))
if volume.PersistentVolumeClaim.ClaimName != expectedPVCName {
fmt.Printf("Warning: Pod %s uses PVC %s but expected %s\n",
pod.Name, volume.PersistentVolumeClaim.ClaimName, expectedPVCName)
}
}
}
}
fmt.Println("Validation complete")
}
func prepareStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Preparing StatefulSet %s in namespace %s for migration...\n", name, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Create a backup of the StatefulSet
backupName := fmt.Sprintf("%s-backup-%d", name, time.Now().Unix())
sts.ObjectMeta.Name = backupName
sts.ObjectMeta.ResourceVersion = ""
sts.ObjectMeta.UID = ""
sts.ObjectMeta.CreationTimestamp = metav1.Time{}
sts.Status = appsv1.StatefulSetStatus{}
// Add backup annotation
if sts.ObjectMeta.Annotations == nil {
sts.ObjectMeta.Annotations = make(map[string]string)
}
sts.ObjectMeta.Annotations["migration-backup-of"] = name
sts.ObjectMeta.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
// Set replicas to 0 to avoid creating pods
replicas := int32(0)
sts.Spec.Replicas = &replicas
// Create the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, sts, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating backup StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Printf("Created backup StatefulSet %s\n", backupName)
// Get PVCs associated with the StatefulSet
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing PVCs: %v\n", err)
os.Exit(1)
}
// Create snapshots or backups of PVCs if supported
fmt.Printf("Found %d PVCs to prepare for migration\n", len(pvcs.Items))
for _, pvc := range pvcs.Items {
// Add migration annotation
pvcCopy := pvc.DeepCopy()
if pvcCopy.Annotations == nil {
pvcCopy.Annotations = make(map[string]string)
}
pvcCopy.Annotations["migration-prepared"] = "true"
pvcCopy.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
_, err = clientset.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvcCopy, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Warning: Failed to update PVC %s annotations: %v\n", pvc.Name, err)
} else {
fmt.Printf("Updated PVC %s annotations for migration\n", pvc.Name)
}
}
fmt.Println("Preparation complete. Ready for migration.")
}
func migrateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Migrating StatefulSet %s in namespace %s...\n", name, namespace)
// Scale down the StatefulSet
fmt.Printf("Scaling down StatefulSet %s to 0 replicas...\n", name)
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
// Save original replicas for later
originalReplicas := scale.Spec.Replicas
// Set replicas to 0
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling down StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for pods to terminate
fmt.Println("Waiting for pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
if len(pods.Items) == 0 {
fmt.Println("All pods terminated")
break
}
fmt.Printf("%d pods still running, waiting...\n", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Get the current StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Create a new StatefulSet with migration fixes
newSTS := sts.DeepCopy()
newSTS.ObjectMeta.ResourceVersion = ""
newSTS.ObjectMeta.UID = ""
newSTS.ObjectMeta.CreationTimestamp = metav1.Time{}
newSTS.Status = appsv1.StatefulSetStatus{}
// Add migration annotation
if newSTS.ObjectMeta.Annotations == nil {
newSTS.ObjectMeta.Annotations = make(map[string]string)
}
newSTS.ObjectMeta.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
// Set replicas to 0 initially
replicas := int32(0)
newSTS.Spec.Replicas = &replicas
// Fix volumeClaimTemplates if needed
for i := range newSTS.Spec.VolumeClaimTemplates {
// Ensure volumeMode is set correctly
if newSTS.Spec.VolumeClaimTemplates[i].Spec.VolumeMode == nil {
filesystemMode := "Filesystem"
newSTS.Spec.VolumeClaimTemplates[i].Spec.VolumeMode = &filesystemMode
}
}
// Delete the old StatefulSet
fmt.Printf("Deleting StatefulSet %s...\n", name)
err = clientset.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil {
fmt.Printf("Error deleting StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for StatefulSet to be deleted
fmt.Println("Waiting for StatefulSet to be deleted...")
for {
_, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Println("StatefulSet deleted")
break
}
time.Sleep(2 * time.Second)
}
// Create the new StatefulSet
fmt.Printf("Creating new StatefulSet %s...\n", name)
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, newSTS, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating new StatefulSet: %v\n", err)
os.Exit(1)
}
// Scale up the StatefulSet to original replicas
fmt.Printf("Scaling up StatefulSet %s to %d replicas...\n", name, originalReplicas)
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling up StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Println("Migration complete. Monitoring pod creation...")
// Monitor pod creation
for i := 0; i < int(originalReplicas); i++ {
podName := fmt.Sprintf("%s-%d", name, i)
fmt.Printf("Waiting for pod %s to be ready...\n", podName)
for {
pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
fmt.Printf("Pod %s not found yet, waiting...\n", podName)
time.Sleep(5 * time.Second)
continue
}
if isPodReady(pod) {
fmt.Printf("Pod %s is ready\n", podName)
break
}
fmt.Printf("Pod %s is not ready yet (status: %s), waiting...\n", podName, pod.Status.Phase)
time.Sleep(5 * time.Second)
}
}
fmt.Println("All pods are ready. Migration successful.")
}
func rollbackStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Rolling back StatefulSet %s migration in namespace %s...\n", name, namespace)
// Find the backup StatefulSet
statefulSets, err := clientset.AppsV1().StatefulSets(namespace).List(ctx, metav1.ListOptions{})
if err != nil {
fmt.Printf("Error listing StatefulSets: %v\n", err)
os.Exit(1)
}
var backupSTS *metav1.ObjectMeta
for _, sts := range statefulSets.Items {
if sts.Annotations != nil && sts.Annotations["migration-backup-of"] == name {
stsCopy := sts.ObjectMeta
backupSTS = &stsCopy
break
}
}
if backupSTS == nil {
fmt.Printf("Error: No backup StatefulSet found for %s\n", name)
os.Exit(1)
}
fmt.Printf("Found backup StatefulSet %s\n", backupSTS.Name)
// Scale down the current StatefulSet
fmt.Printf("Scaling down StatefulSet %s to 0 replicas...\n", name)
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
// Save original replicas for later
originalReplicas := scale.Spec.Replicas
// Set replicas to 0
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling down StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for pods to terminate
fmt.Println("Waiting for pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
if len(pods.Items) == 0 {
fmt.Println("All pods terminated")
break
}
fmt.Printf("%d pods still running, waiting...\n", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Get the backup StatefulSet
backupStatefulSet, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupSTS.Name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting backup StatefulSet: %v\n", err)
os.Exit(1)
}
// Prepare the backup for restoration
backupStatefulSet.ObjectMeta.Name = name
backupStatefulSet.ObjectMeta.ResourceVersion = ""
backupStatefulSet.ObjectMeta.UID = ""
backupStatefulSet.ObjectMeta.CreationTimestamp = metav1.Time{}
backupStatefulSet.Status = appsv1.StatefulSetStatus{}
// Remove backup annotations
delete(backupStatefulSet.ObjectMeta.Annotations, "migration-backup-of")
delete(backupStatefulSet.ObjectMeta.Annotations, "migration-timestamp")
// Set replicas to original count
backupStatefulSet.Spec.Replicas = &originalReplicas
// Delete the current StatefulSet
fmt.Printf("Deleting StatefulSet %s...\n", name)
err = clientset.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil {
fmt.Printf("Error deleting StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for StatefulSet to be deleted
fmt.Println("Waiting for StatefulSet to be deleted...")
for {
_, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Println("StatefulSet deleted")
break
}
time.Sleep(2 * time.Second)
}
// Create the restored StatefulSet
fmt.Printf("Creating restored StatefulSet %s...\n", name)
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, backupStatefulSet, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating restored StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Println("Rollback complete. Monitoring pod creation...")
// Monitor pod creation
for i := 0; i < int(originalReplicas); i++ {
podName := fmt.Sprintf("%s-%d", name, i)
fmt.Printf("Waiting for pod %s to be ready...\n", podName)
for {
pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
fmt.Printf("Pod %s not found yet, waiting...\n", podName)
time.Sleep(5 * time.Second)
continue
}
if isPodReady(pod) {
fmt.Printf("Pod %s is ready\n", podName)
break
}
fmt.Printf("Pod %s is not ready yet (status: %s), waiting...\n", podName, pod.Status.Phase)
time.Sleep(5 * time.Second)
}
}
fmt.Println("All pods are ready. Rollback successful.")
}
// Helper functions
func isValidStatefulSetPodName(podName, stsName string) bool {
return len(podName) > len(stsName) &&
podName[:len(stsName)] == stsName &&
podName[len(stsName)] == '-'
}
func getPodOrdinal(podName string) string {
for i := len(podName) - 1; i >= 0; i-- {
if podName[i] == '-' {
return podName[i+1:]
}
}
return ""
}
func isPodReady(pod *corev1.Pod) bool {
// A pod is ready when it is Running and reports the Ready condition as True
if pod.Status.Phase != corev1.PodRunning {
return false
}
for _, condition := range pod.Status.Conditions {
if condition.Type == corev1.PodReady && condition.Status == corev1.ConditionTrue {
return true
}
}
return false
}
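A possible invocation of this version of the tool, using the flags defined in main (paths and names are examples):
go build -o sts-migrate statefulset_migration.go
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=validate
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=prepare
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=migrate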
• Fixed StatefulSet configuration to ensure proper pod identity preservation:
# Before: Problematic StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
namespace: production
spec:
serviceName: database
replicas: 3
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
containers:
- name: database
image: postgres:14
ports:
- containerPort: 5432
name: postgres
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: standard
resources:
requests:
storage: 10Gi
# After: Improved StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
namespace: production
annotations:
statefulset.kubernetes.io/pod-name-generation: "consistent"
spec:
serviceName: database
replicas: 3
podManagementPolicy: OrderedReady
updateStrategy:
type: OnDelete # Prevents automatic updates during cluster upgrades
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
terminationGracePeriodSeconds: 60
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
topologyKey: kubernetes.io/hostname
containers:
- name: database
image: postgres:14
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: database-credentials
key: password
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
readinessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 5
periodSeconds: 10
lifecycle:
preStop:
exec:
command:
- sh
- -c
- "pg_ctl -D /var/lib/postgresql/data stop -m fast"
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: managed-premium
volumeMode: Filesystem
resources:
requests:
storage: 10Gi
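Because the root cause included a storage class with a Delete reclaim policy, a hedged sketch for auditing the bound volumes and switching them to Retain; PVC names follow the data-database-<ordinal> pattern from the manifest above, and patches should be verified before use in production:
# Check and fix the reclaim policy on the PVs backing the StatefulSet
for i in 0 1 2; do
PV=$(kubectl get pvc "data-database-$i" -n production -o jsonpath='{.spec.volumeName}')
POLICY=$(kubectl get pv "$PV" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}')
echo "data-database-$i -> $PV ($POLICY)"
if [ "$POLICY" = "Delete" ]; then
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
fi
done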
• Implemented a StatefulSet upgrade procedure:
#!/bin/bash
# statefulset_upgrade.sh
# Usage: ./statefulset_upgrade.sh <namespace> <statefulset-name>
set -e
NAMESPACE=$1
STATEFULSET=$2
if [ -z "$NAMESPACE" ] || [ -z "$STATEFULSET" ]; then
echo "Usage: $0 <namespace> <statefulset-name>"
exit 1
fi
echo "Starting controlled upgrade of StatefulSet $STATEFULSET in namespace $NAMESPACE"
# Validate StatefulSet
echo "Validating StatefulSet..."
kubectl get statefulset -n $NAMESPACE $STATEFULSET -o json > /tmp/statefulset-original.json
REPLICAS=$(kubectl get statefulset -n $NAMESPACE $STATEFULSET -o jsonpath='{.spec.replicas}')
echo "StatefulSet has $REPLICAS replicas"
# Check for PVCs
echo "Checking PersistentVolumeClaims..."
PVC_COUNT=$(kubectl get pvc -n $NAMESPACE -l app=$STATEFULSET -o name | wc -l)
echo "Found $PVC_COUNT PVCs"
# Backup StatefulSet definition
echo "Creating backup of StatefulSet definition..."
TIMESTAMP=$(date +%Y%m%d%H%M%S)
kubectl get statefulset -n $NAMESPACE $STATEFULSET -o yaml > statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml
echo "Backup saved to statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml"
# Scale down StatefulSet one pod at a time
echo "Scaling down StatefulSet one pod at a time..."
for ((i=$REPLICAS-1; i>=0; i--)); do
POD_NAME="$STATEFULSET-$i"
echo "Processing pod $POD_NAME"
# Check if pod exists
if kubectl get pod -n $NAMESPACE $POD_NAME &>/dev/null; then
echo "Taking pod $POD_NAME out of service..."
# Check if pod has a preStop hook
HAS_PRESTOP=$(kubectl get pod -n $NAMESPACE $POD_NAME -o json | jq '.spec.containers[0].lifecycle.preStop')
if [ "$HAS_PRESTOP" != "null" ]; then
echo "Pod has preStop hook, allowing time for graceful shutdown..."
GRACE_PERIOD=$(kubectl get pod -n $NAMESPACE $POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}')
if [ -z "$GRACE_PERIOD" ]; then
GRACE_PERIOD=30
fi
echo "Grace period is $GRACE_PERIOD seconds"
fi
# Delete pod
kubectl delete pod -n $NAMESPACE $POD_NAME
# Wait for pod to be deleted
echo "Waiting for pod $POD_NAME to be deleted..."
kubectl wait --for=delete pod/$POD_NAME -n $NAMESPACE --timeout=300s
echo "Pod $POD_NAME deleted"
else
echo "Pod $POD_NAME does not exist, skipping"
fi
done
# Delete the StatefulSet but keep the pods
echo "Deleting StatefulSet $STATEFULSET while keeping pods..."
kubectl delete statefulset -n $NAMESPACE $STATEFULSET --cascade=false
# Create the new StatefulSet
echo "Creating new StatefulSet from backup with modifications..."
cat statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml | \
sed '/creationTimestamp:/d' | \
sed '/resourceVersion:/d' | \
sed '/uid:/d' | \
sed '/status:/,$ d' | \
kubectl apply -f -
# Force the OnDelete update strategy with a patch instead of editing YAML with sed
kubectl patch statefulset $STATEFULSET -n $NAMESPACE \
--type merge -p '{"spec":{"updateStrategy":{"type":"OnDelete","rollingUpdate":null}}}'
# Wait for StatefulSet to be ready (rollout status does not work with OnDelete)
echo "Waiting for StatefulSet to be ready..."
kubectl wait --for=jsonpath='{.status.readyReplicas}'=$REPLICAS \
statefulset/$STATEFULSET -n $NAMESPACE --timeout=600s
echo "StatefulSet upgrade completed successfully"
• Long-term: Implemented a comprehensive StatefulSet management strategy:
- Created a StatefulSet operator for advanced lifecycle management
- Implemented automated backup and restore procedures
- Developed a StatefulSet migration framework for version upgrades
- Established clear procedures for StatefulSet maintenance
- Implemented monitoring for StatefulSet identity and data integrity
Lessons Learned:
StatefulSet pod identity preservation requires careful management during Kubernetes upgrades.
How to Avoid:
Use the "OnDelete" update strategy for critical StatefulSets.
Implement proper backup procedures before cluster upgrades.
Test StatefulSet behavior in a staging environment before production upgrades.
Use consistent storage classes with appropriate reclaim policies.
Implement StatefulSet-aware upgrade procedures.
No summary provided
What Happened:
During a planned maintenance window, the operations team initiated a rolling upgrade of Kubernetes nodes. Despite having Pod Disruption Budgets in place, a critical service became completely unavailable, causing a production outage. The incident occurred when multiple nodes were drained simultaneously, and the service pods could not be rescheduled fast enough due to resource constraints and misconfigured PDBs.
Diagnosis Steps:
Analyzed Kubernetes events during the outage timeframe.
Reviewed Pod Disruption Budget configurations for affected services.
Examined node drain logs and kubectl drain command parameters.
Checked pod scheduling and resource allocation during the incident.
Reviewed cluster autoscaling configuration and behavior.
Root Cause:
The investigation revealed multiple issues with the Pod Disruption Budget implementation: 1. The PDB was configured with maxUnavailable: 50%, allowing too many pods to be evicted simultaneously 2. The service had resource requests that were too high, limiting where pods could be scheduled 3. Node draining was performed without appropriate --timeout and --grace-period settings, causing rapid evictions 4. Cluster autoscaling was too slow to provision new nodes for rescheduled pods 5. The service lacked proper readiness probes, causing traffic to be sent to initializing pods
Fix/Workaround:
• Short-term: Implemented immediate fixes to prevent recurrence:
# Before: Problematic PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
maxUnavailable: 50%
selector:
matchLabels:
app: critical-service
# After: Improved PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
minAvailable: 80% # Ensure at least 80% of pods are available
selector:
matchLabels:
app: critical-service
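Before any drain, the PDB status can be checked directly to confirm that evictions are currently allowed; a minimal sketch using the names above:
# Abort the drain if the PDB currently allows zero disruptions
ALLOWED=$(kubectl get pdb critical-service-pdb -n production -o jsonpath='{.status.disruptionsAllowed}')
if [ "${ALLOWED:-0}" -lt 1 ]; then
echo "PDB allows no disruptions right now; postpone the drain"
exit 1
fi
echo "PDB allows $ALLOWED disruption(s); safe to proceed"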
• Created a node drain script with proper safeguards:
#!/bin/bash
# safe_node_drain.sh - Safely drain nodes with proper timeouts and monitoring
set -e
NODE=$1
NAMESPACE=${2:-"production"}
EVICTION_TIMEOUT=${3:-"300s"}
DRAIN_TIMEOUT=${4:-"600"}
GRACE_PERIOD=${5:-"120"}
if [ -z "$NODE" ]; then
echo "Usage: $0 <node-name> [namespace] [eviction-timeout] [drain-timeout] [grace-period]"
exit 1
fi
echo "Checking PDBs in namespace $NAMESPACE..."
kubectl get pdb -n $NAMESPACE -o wide
echo "Checking pods on node $NODE..."
POD_COUNT=$(kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE | grep -v NAME | wc -l)
echo "Found $POD_COUNT pods running on $NODE"
echo "Cordoning node $NODE..."
kubectl cordon $NODE
echo "Waiting for grace period ($GRACE_PERIOD seconds) before starting evictions..."
sleep $GRACE_PERIOD
echo "Starting drain with eviction timeout $EVICTION_TIMEOUT..."
timeout $DRAIN_TIMEOUT kubectl drain $NODE \
--pod-eviction-timeout=$EVICTION_TIMEOUT \
--ignore-daemonsets \
--delete-emptydir-data \
--force
if [ $? -eq 0 ]; then
echo "Node $NODE drained successfully"
else
echo "Node drain timed out or failed. Check node status."
kubectl uncordon $NODE
exit 1
fi
echo "Checking service health..."
for svc in $(kubectl get svc -n $NAMESPACE -o name); do
echo "Checking $svc..."
kubectl get $svc -n $NAMESPACE -o wide
done
echo "Node drain completed. Node is ready for maintenance."
• Improved the deployment configuration with proper resource settings and probes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: critical-service
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: critical-service
spec:
terminationGracePeriodSeconds: 60
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- critical-service
topologyKey: "kubernetes.io/hostname"
containers:
- name: critical-service
image: critical-service:v1.2.3
ports:
- containerPort: 8080
resources:
requests:
cpu: 250m # Reduced from 1000m
memory: 512Mi # Reduced from 2Gi
limits:
cpu: 1000m
memory: 1Gi
readinessProbe: # Added proper readiness probe
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 2
successThreshold: 1
failureThreshold: 3
livenessProbe: # Added proper liveness probe
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
startupProbe: # Added startup probe
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 12 # Allow 60s for startup
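Since oversized resource requests contributed to the outage, a hedged sketch for comparing requested CPU and memory with live usage (requires metrics-server; output is indicative only):
# Compare resource requests with observed usage for the critical-service pods
kubectl get pods -n production -l app=critical-service \
-o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'
kubectl top pods -n production -l app=critical-service
# If usage stays well below the requests, lower them as in the manifest above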
• Long-term: Implemented a comprehensive node maintenance strategy:
- Created a node maintenance controller to automate safe node drains
- Implemented PDB validation in CI/CD pipelines
- Improved cluster autoscaling configuration for faster node provisioning
- Developed a service impact prediction tool for maintenance operations
- Established clear maintenance procedures with proper testing
Lessons Learned:
Pod Disruption Budgets require careful configuration and testing to be effective.
How to Avoid:
Configure PDBs with minAvailable rather than maxUnavailable for critical services.
Test PDB configurations in a staging environment before production.
Implement proper node drain procedures with adequate timeouts.
Ensure services have appropriate resource requests and limits.
Use pod anti-affinity to distribute pods across nodes.
No summary provided
What Happened:
A large production Kubernetes cluster began experiencing increased pod scheduling latency and performance issues. Some nodes were heavily utilized while others remained underutilized. Critical workloads were competing for resources on specific nodes, while other nodes had abundant capacity. The issue gradually worsened over several weeks as new workloads were deployed, eventually leading to pod evictions and scheduling failures during peak traffic periods.
Diagnosis Steps:
Analyzed node resource utilization across the cluster.
Examined pod distribution and resource requests/limits.
Reviewed node labels, taints, and pod affinity/anti-affinity rules.
Checked the scheduler configuration and default policies.
Investigated historical deployment patterns and workload growth.
Root Cause:
The investigation revealed multiple issues contributing to the resource imbalance: 1. Pod affinity rules were causing workloads to cluster on specific nodes 2. Node labels for specialized hardware were applied inconsistently 3. Some applications had overly restrictive node selectors 4. The default scheduler configuration wasn't optimized for the workload mix 5. Resource requests were significantly lower than actual usage for some workloads
Fix/Workaround:
• Implemented immediate fixes to rebalance the cluster
• Optimized pod affinity/anti-affinity rules
• Standardized node labeling and workload placement
• Adjusted resource requests to match actual usage patterns
• Implemented proper node autoscaling
Lessons Learned:
Kubernetes resource allocation requires ongoing monitoring and optimization as workloads evolve.
How to Avoid:
Implement regular resource utilization reviews and optimization.
Use pod topology spread constraints for better workload distribution (see the sketch below).
Configure appropriate resource requests based on actual usage.
Standardize node labeling and workload placement strategies.
Monitor scheduling metrics to detect imbalances early.
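A minimal sketch of the topology spread constraints recommended above, placed in a workload's pod template spec; the app label web is hypothetical:
```yaml
# Sketch: spread replicas across zones (hard) and nodes (soft).
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule    # Hard requirement across zones
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # Soft preference across nodes
  labelSelector:
    matchLabels:
      app: web
```
Unlike pure anti-affinity, spread constraints keep the skew bounded as the replica count grows, which addresses the gradual imbalance described in this scenario.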
No summary provided
What Happened:
During a scheduled update to a database cluster running in Kubernetes as a StatefulSet, the operations team initiated a rolling update to upgrade the database version. Shortly after the update began, the entire cluster became unavailable, causing a major production outage. The update process had terminated all pods simultaneously instead of following the expected ordered, one-at-a-time update pattern. Recovery required manual intervention and data restoration from backups.
Diagnosis Steps:
Examined the StatefulSet configuration and update history.
Reviewed Kubernetes events and pod termination sequences.
Analyzed the StatefulSet controller logs.
Checked PersistentVolume and PersistentVolumeClaim status.
Verified the database cluster's quorum and replication state.
Root Cause:
The investigation revealed multiple issues with the StatefulSet configuration: 1. The StatefulSet updateStrategy was incorrectly set to "OnDelete" instead of "RollingUpdate" 2. The team manually deleted all pods simultaneously to trigger the update 3. Pod disruption budgets were not configured to protect the cluster quorum 4. The database was not configured to handle multiple simultaneous node failures 5. The StatefulSet did not have proper readiness probes to verify cluster health
Fix/Workaround:
• Restored the database cluster from backups
• Corrected the StatefulSet updateStrategy to "RollingUpdate" (see the sketch below)
• Implemented proper pod disruption budgets
• Configured appropriate readiness and liveness probes
• Created a comprehensive update procedure for stateful applications
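A minimal sketch of the corrected configuration, using a hypothetical three-replica database StatefulSet named db: with updateStrategy set to RollingUpdate the controller replaces one pod at a time in reverse ordinal order, and the PDB keeps voluntary disruptions from taking the cluster below quorum.
```yaml
# Sketch: ordered rolling updates plus a quorum-protecting PDB (names are hypothetical).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  replicas: 3
  serviceName: db
  selector:
    matchLabels:
      app: db
  updateStrategy:
    type: RollingUpdate        # Pods are updated one at a time, highest ordinal first
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: example-db:2.0  # Placeholder image for the upgrade target
        readinessProbe:
          tcpSocket:
            port: 5432         # Hypothetical database port
          periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 2              # Preserves quorum for a 3-member cluster
  selector:
    matchLabels:
      app: db
```
The readiness probe matters here as well: the rolling update only moves to the next ordinal once the updated pod reports Ready, so a probe that genuinely reflects cluster membership prevents a cascade of unhealthy replicas.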
Lessons Learned:
StatefulSet update strategies require careful configuration and understanding of the application's resilience properties.
How to Avoid:
Always use "RollingUpdate" strategy for StatefulSets unless there's a specific reason not to.
Implement pod disruption budgets to protect application quorum.
Configure appropriate readiness probes that verify application health.
Test update procedures in a staging environment before production.
Create detailed runbooks for stateful application updates.
No summary provided
What Happened:
A company deployed a microservices application to a multi-zone Kubernetes cluster. After several weeks of operation, they noticed that certain nodes were consistently overloaded while others remained underutilized. During a traffic spike, the overloaded nodes experienced performance degradation, causing service disruptions. Investigation revealed that the workload distribution was severely imbalanced due to misconfigured node affinity and anti-affinity rules.
Diagnosis Steps:
Analyzed node resource utilization across the cluster.
Examined pod distribution and placement patterns.
Reviewed node affinity and anti-affinity configurations.
Checked node labels and taints/tolerations.
Assessed the impact of topology spread constraints.
Root Cause:
The investigation revealed multiple issues with workload placement: 1. Node affinity rules were too restrictive, forcing certain workloads onto specific nodes 2. Anti-affinity rules were missing for critical components, allowing them to co-locate 3. Some node labels were inconsistent across the cluster, affecting affinity rules 4. Topology spread constraints were not implemented for multi-zone distribution 5. Pod resource requests were inaccurate, leading to poor scheduling decisions
Fix/Workaround:
• Implemented immediate fixes to rebalance workloads
• Corrected node affinity and anti-affinity configurations
• Standardized node labels across the cluster
• Implemented topology spread constraints for zone distribution
• Adjusted pod resource requests based on actual usage
Lessons Learned:
Node affinity and anti-affinity configurations significantly impact workload distribution and cluster stability.
How to Avoid:
Implement proper topology spread constraints for multi-zone deployments.
Use node affinity selectively and avoid overly restrictive rules.
Ensure consistent node labeling across the cluster.
Regularly review and validate workload distribution patterns.
Implement pod disruption budgets to protect critical services during rebalancing.
```yaml
# Example of improved pod specification with proper affinity rules
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web
    component: frontend
spec:
  # Topology spread constraints for multi-zone distribution
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
  # Pod affinity for related services
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - cache
          topologyKey: kubernetes.io/hostname
    # Pod anti-affinity to avoid co-location of same components
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    # Node affinity for hardware requirements
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - c5.large
            - c5.xlarge
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone   # Updated from the deprecated failure-domain.beta.kubernetes.io/zone label
            operator: In
            values:
            - us-east-1a
            - us-east-1b
  containers:
  - name: web-app
    image: web-app:v1.2.3
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"
```
No summary provided
What Happened:
A large e-commerce company used an automated node maintenance system to perform regular updates on their Kubernetes cluster. The system was designed to cordon nodes, drain workloads, perform updates, and then uncordon nodes to return them to service. During a scheduled maintenance window, the system identified multiple nodes requiring kernel updates and initiated the cordoning process for all of them simultaneously. This resulted in insufficient capacity for the rescheduled workloads, causing pod evictions, service disruptions, and eventually a partial outage of the platform during a high-traffic period.
Diagnosis Steps:
Analyzed cluster events and pod scheduling logs.
Examined node cordoning timestamps and patterns.
Reviewed the automated maintenance system's configuration.
Checked cluster capacity and resource allocation.
Investigated pod priority and preemption settings.
Root Cause:
The investigation revealed multiple issues with the node maintenance automation: 1. The maintenance system had no limits on how many nodes could be cordoned simultaneously 2. There was no pre-check for available cluster capacity before cordoning 3. Pod disruption budgets were not properly configured for critical services 4. The system lacked awareness of application-level dependencies 5. There was no circuit breaker to halt the process if too many pods failed to reschedule
Fix/Workaround:
• Implemented immediate fixes to restore service
• Uncordoned sufficient nodes to restore capacity
• Implemented a phased maintenance approach with capacity checks
• Configured proper pod disruption budgets for all services
• Added circuit breaker logic to the maintenance system
# Improved Node Maintenance Controller Configuration
# File: node-maintenance-controller-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-maintenance-controller-config
  namespace: kube-system
data:
  config.yaml: |
    maintenance:
      # Maximum percentage of nodes that can be under maintenance simultaneously
      maxConcurrentNodePercentage: 10
      # Minimum available capacity required before cordoning additional nodes
      minimumAvailableCapacity:
        cpu: 20%
        memory: 20%
      # Wait time between node maintenance operations
      nodeProcessingInterval: 5m
      # Circuit breaker settings
      circuitBreaker:
        # Maximum number of pod scheduling failures before pausing
        maxPodSchedulingFailures: 5
        # Pause duration when circuit breaker is triggered
        pauseDuration: 30m
        # Automatically uncordon nodes if circuit breaker is triggered
        autoUncordonOnBreak: true
      # Node selection strategy
      nodeSelectionStrategy: LeastDisruptive
      # Pre-cordoning validation
      preCordoning:
        # Verify all PDBs are satisfied before cordoning
        checkPodDisruptionBudgets: true
        # Check for critical pods that might not reschedule
        checkCriticalPods: true
        # Verify node affinity/anti-affinity constraints can be satisfied
        checkAffinityConstraints: true
      # Post-cordoning validation
      postCordoning:
        # Maximum time to wait for pods to be evicted
        maxPodEvictionWaitTime: 5m
        # Monitor for scheduling failures
        monitorSchedulingFailures: true
      # Maintenance window settings
      maintenanceWindows:
      - name: "overnight"
        daysOfWeek: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
        timeWindow:
          start: "01:00"
          end: "05:00"
          timeZone: "UTC"
      - name: "weekend"
        daysOfWeek: ["Saturday", "Sunday"]
        timeWindow:
          start: "10:00"
          end: "16:00"
          timeZone: "UTC"
// Improved Node Maintenance Controller Logic
// File: node_maintenance_controller.go
package controller
import (
"context"
"fmt"
"time"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/util/retry"
"k8s.io/klog/v2"
)
// NodeMaintenanceController manages safe node maintenance operations
type NodeMaintenanceController struct {
kubeClient kubernetes.Interface
config *MaintenanceConfig
maintenanceState *MaintenanceState
capacityCalculator *ClusterCapacityCalculator
pdbChecker *PodDisruptionBudgetChecker
}
// MaintenanceState tracks the current state of maintenance operations
type MaintenanceState struct {
NodesInMaintenance map[string]*NodeMaintenance
FailedSchedulingEvents int
CircuitBreakerTripped bool
CircuitBreakerTrippedAt time.Time
}
// NodeMaintenance tracks the state of a node in maintenance
type NodeMaintenance struct {
NodeName string
CordonedAt time.Time
PodsEvicted int
PodsRemaining int
MaintenanceCompleted bool
}
// ProcessNodeMaintenance evaluates and performs maintenance on eligible nodes
func (c *NodeMaintenanceController) ProcessNodeMaintenance(ctx context.Context) error {
// Check if we're in a maintenance window
if !c.isInMaintenanceWindow() {
klog.V(4).Info("Not currently in a maintenance window")
return nil
}
// Check if circuit breaker is tripped
if c.maintenanceState.CircuitBreakerTripped {
elapsed := time.Since(c.maintenanceState.CircuitBreakerTrippedAt)
if elapsed < c.config.CircuitBreaker.PauseDuration {
klog.Warningf("Circuit breaker tripped, pausing maintenance for %v more",
c.config.CircuitBreaker.PauseDuration - elapsed)
// Auto-uncordon nodes if configured
if c.config.CircuitBreaker.AutoUncordonOnBreak {
if err := c.uncordonNodesInMaintenance(ctx); err != nil {
klog.Errorf("Failed to auto-uncordon nodes: %v", err)
}
}
return nil
}
// Reset circuit breaker
c.maintenanceState.CircuitBreakerTripped = false
c.maintenanceState.FailedSchedulingEvents = 0
}
// Get nodes that need maintenance
nodesToMaintain, err := c.getNodesNeedingMaintenance(ctx)
if err != nil {
return fmt.Errorf("failed to get nodes needing maintenance: %w", err)
}
// Check how many nodes are already in maintenance
currentMaintenanceCount := len(c.maintenanceState.NodesInMaintenance)
// Get all nodes to calculate percentage
allNodes, err := c.kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return fmt.Errorf("failed to list nodes: %w", err)
}
// Calculate maximum allowed concurrent maintenance
maxConcurrent := int(float64(len(allNodes.Items)) * c.config.MaxConcurrentNodePercentage / 100.0)
if maxConcurrent < 1 {
maxConcurrent = 1
}
// Check if we're already at the limit
if currentMaintenanceCount >= maxConcurrent {
klog.Infof("Already at maximum concurrent maintenance limit (%d nodes)", maxConcurrent)
return nil
}
// Calculate how many more nodes we can process
remainingSlots := maxConcurrent - currentMaintenanceCount
// Check available capacity before cordoning more nodes
availableCapacity, err := c.capacityCalculator.GetAvailableCapacity(ctx)
if err != nil {
return fmt.Errorf("failed to calculate available capacity: %w", err)
}
// Check if we have enough capacity
requiredCPU := c.config.MinimumAvailableCapacity.CPU
requiredMemory := c.config.MinimumAvailableCapacity.Memory
if availableCapacity.CPU.Cmp(requiredCPU) < 0 ||
availableCapacity.Memory.Cmp(requiredMemory) < 0 {
klog.Warningf("Insufficient capacity for maintenance: CPU: %v (required %v), Memory: %v (required %v)",
availableCapacity.CPU, requiredCPU, availableCapacity.Memory, requiredMemory)
return nil
}
// Sort nodes by maintenance strategy
sortedNodes := c.sortNodesByStrategy(nodesToMaintain)
// Process nodes up to the remaining slot limit
nodesToProcess := min(len(sortedNodes), remainingSlots)
for i := 0; i < nodesToProcess; i++ {
node := sortedNodes[i]
// Perform pre-cordoning validation
if err := c.validatePreCordoning(ctx, node); err != nil {
klog.Warningf("Pre-cordoning validation failed for node %s: %v", node.Name, err)
continue
}
// Cordon the node
if err := c.cordonNode(ctx, node); err != nil {
klog.Errorf("Failed to cordon node %s: %v", node.Name, err)
continue
}
// Track the node in maintenance
c.maintenanceState.NodesInMaintenance[node.Name] = &NodeMaintenance{
NodeName: node.Name,
CordonedAt: time.Now(),
}
// Wait before processing the next node
time.Sleep(c.config.NodeProcessingInterval)
}
return nil
}
// validatePreCordoning performs checks before cordoning a node
func (c *NodeMaintenanceController) validatePreCordoning(ctx context.Context, node *corev1.Node) error {
// Check if PDBs would be violated
if c.config.PreCordoning.CheckPodDisruptionBudgets {
if violations, err := c.pdbChecker.CheckPDBViolations(ctx, node.Name); err != nil {
return fmt.Errorf("failed to check PDB violations: %w", err)
} else if len(violations) > 0 {
return fmt.Errorf("cordoning would violate PDBs: %v", violations)
}
}
// Check for critical pods
if c.config.PreCordoning.CheckCriticalPods {
if criticalPods, err := c.getCriticalPodsOnNode(ctx, node.Name); err != nil {
return fmt.Errorf("failed to check critical pods: %w", err)
} else if len(criticalPods) > 0 {
return fmt.Errorf("node has %d critical pods that may not reschedule", len(criticalPods))
}
}
// Check affinity constraints
if c.config.PreCordoning.CheckAffinityConstraints {
if err := c.validateAffinityConstraints(ctx, node); err != nil {
return fmt.Errorf("affinity validation failed: %w", err)
}
}
return nil
}
// monitorNodeDrain watches for pod scheduling failures during node draining
func (c *NodeMaintenanceController) monitorNodeDrain(ctx context.Context, nodeName string) {
// Set up event watcher for scheduling failures
// Implementation omitted for brevity
// If too many scheduling failures occur, trip the circuit breaker
if c.maintenanceState.FailedSchedulingEvents >= c.config.CircuitBreaker.MaxPodSchedulingFailures {
klog.Warningf("Circuit breaker tripped due to %d pod scheduling failures",
c.maintenanceState.FailedSchedulingEvents)
c.maintenanceState.CircuitBreakerTripped = true
c.maintenanceState.CircuitBreakerTrippedAt = time.Now()
}
}
// Helper functions
func (c *NodeMaintenanceController) isInMaintenanceWindow() bool {
// Implementation omitted for brevity
return true
}
func (c *NodeMaintenanceController) getNodesNeedingMaintenance(ctx context.Context) ([]*corev1.Node, error) {
// Implementation omitted for brevity
return nil, nil
}
func (c *NodeMaintenanceController) sortNodesByStrategy(nodes []*corev1.Node) []*corev1.Node {
// Implementation omitted for brevity
return nodes
}
func (c *NodeMaintenanceController) cordonNode(ctx context.Context, node *corev1.Node) error {
// Implementation omitted for brevity
return nil
}
func (c *NodeMaintenanceController) uncordonNodesInMaintenance(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func (c *NodeMaintenanceController) getCriticalPodsOnNode(ctx context.Context, nodeName string) ([]corev1.Pod, error) {
// Implementation omitted for brevity
return nil, nil
}
func (c *NodeMaintenanceController) validateAffinityConstraints(ctx context.Context, node *corev1.Node) error {
// Implementation omitted for brevity
return nil
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
Lessons Learned:
Automated node maintenance requires careful orchestration to prevent capacity issues and service disruptions.
How to Avoid:
Implement limits on concurrent node maintenance operations.
Perform capacity checks before cordoning nodes.
Configure proper pod disruption budgets for all services.
Implement circuit breaker logic to halt operations if too many failures occur.
Consider application dependencies when designing maintenance processes.
No summary provided
What Happened:
A large e-commerce company was performing routine maintenance on their Kubernetes cluster, which involved upgrading the operating system on a subset of nodes. The operations team followed their standard procedure: cordon the nodes to prevent new pods from being scheduled, drain existing pods to other nodes, perform the maintenance, and then uncordon the nodes. However, despite the nodes showing as cordoned in kubectl, new pods were still being scheduled to these nodes. When the team proceeded with rebooting the nodes for maintenance, several critical services experienced unexpected downtime as newly scheduled pods were terminated.
Diagnosis Steps:
Verified the cordoned status of nodes using kubectl.
Examined the Kubernetes scheduler logs.
Reviewed recent changes to the cluster configuration.
Checked the Kubernetes API server audit logs.
Tested node cordoning in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the node cordoning process: 1. A custom admission controller was bypassing the standard scheduling restrictions 2. The admission controller was implemented to ensure certain workloads always had capacity 3. The controller was not respecting the cordoned status of nodes 4. The maintenance procedure didn't account for this custom component 5. There was no testing of the cordoning process before the maintenance window
Fix/Workaround:
• Implemented immediate improvements to the maintenance process
• Modified the custom admission controller to respect node cordoning
• Created a pre-flight check for maintenance procedures
• Implemented a maintenance mode flag in the custom controller (see the sketch below)
• Established a testing protocol for maintenance procedures
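Because the admission controller is an internal component, only a sketch is possible; the ConfigMap below illustrates how a maintenance mode flag and a cordon-awareness switch might be surfaced to it (all names, keys, and the namespace are hypothetical):
```yaml
# Hypothetical configuration consumed by the custom admission controller.
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-admission-controller-config
  namespace: kube-system
data:
  # When "true", the controller stops overriding scheduler placement decisions.
  maintenanceMode: "true"
  # When "true", the controller never targets nodes with spec.unschedulable set,
  # even when its capacity-guarantee logic would otherwise prefer them.
  respectCordonedNodes: "true"
```
A pre-flight check in the maintenance runbook can then verify that both flags are set before the first node is cordoned.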
Lessons Learned:
Custom Kubernetes components can interfere with standard cluster operations if not properly designed and tested.
How to Avoid:
Ensure custom components respect standard Kubernetes control mechanisms.
Test maintenance procedures in a staging environment before production.
Implement pre-flight checks for maintenance operations.
Document all custom components and their behavior.
Create comprehensive testing protocols for maintenance procedures.