# Container Orchestration Scenarios
No summary provided
What Happened:
During a scheduled maintenance window, an automated process cordoned multiple nodes simultaneously to prepare for kernel updates. Despite having PodDisruptionBudgets (PDBs) configured for critical services, several applications experienced downtime because pods could not be rescheduled quickly enough.
Diagnosis Steps:
Examined cluster events during the maintenance window.
Reviewed PodDisruptionBudget configurations for affected services.
Analyzed pod scheduling and eviction logs.
Checked node resource utilization before and during the incident.
Reviewed the automated maintenance script's logic.
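The steps above correspond roughly to commands like the following; resource names are illustrative, and the kubectl top output assumes metrics-server is installed:
# Events, PDB headroom, cordoned nodes, and node resource headroom during the window
kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'cordon|drain|evict|failedscheduling'
kubectl get pdb -A        # ALLOWED DISRUPTIONS shows remaining budget per PDB
kubectl get nodes         # look for Ready,SchedulingDisabled entries
kubectl top nodes         # requires metrics-server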
Root Cause:
Multiple issues contributed to the problem:
1. The automated maintenance script cordoned too many nodes simultaneously (25% of the cluster), exceeding the cluster's ability to reschedule pods quickly.
2. Some PodDisruptionBudgets were configured with absolute values instead of percentages, which didn't scale with deployment size.
3. Several services had anti-affinity rules that made rescheduling difficult with fewer available nodes.
4. The cluster was already running at high resource utilization (85% CPU), leaving little headroom for pod migrations.
Fix/Workaround:
• Short-term: Modified the maintenance script to cordon nodes sequentially:
#!/bin/bash
# Improved maintenance script with sequential cordoning
set -euo pipefail
# Configuration
MAX_UNAVAILABLE_NODES=3
DRAIN_TIMEOUT=300
WAIT_BETWEEN_NODES=120
MAINTENANCE_LABEL="maintenance=true"
# Get all nodes
NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
TOTAL_NODES=$(echo "$NODES" | wc -w)
echo "Total nodes in cluster: $TOTAL_NODES"
# Calculate maximum nodes to cordon at once
MAX_NODES_TO_CORDON=$(( TOTAL_NODES * 10 / 100 ))
if [ "$MAX_NODES_TO_CORDON" -gt "$MAX_UNAVAILABLE_NODES" ]; then
MAX_NODES_TO_CORDON=$MAX_UNAVAILABLE_NODES
fi
echo "Maximum nodes to cordon at once: $MAX_NODES_TO_CORDON"
# Function to check cluster health
check_cluster_health() {
# Check if any PDBs are violated
PDB_VIOLATIONS=$(kubectl get poddisruptionbudgets --all-namespaces -o json | jq '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.name' | wc -l)
if [ "$PDB_VIOLATIONS" -gt 0 ]; then
echo "Warning: $PDB_VIOLATIONS PDBs currently have 0 disruptions allowed"
return 1
fi
# Check if any deployments are not fully available
UNAVAILABLE_DEPLOYMENTS=$(kubectl get deployments --all-namespaces -o json | jq '.items[] | select(.status.availableReplicas < .status.replicas) | .metadata.name' | wc -l)
if [ "$UNAVAILABLE_DEPLOYMENTS" -gt 0 ]; then
echo "Warning: $UNAVAILABLE_DEPLOYMENTS deployments are not fully available"
return 1
fi
return 0
}
# Function to cordon and drain a node
cordon_and_drain() {
local node=$1
echo "Cordoning node $node..."
kubectl cordon "$node"
echo "Labeling node $node for maintenance..."
kubectl label node "$node" "$MAINTENANCE_LABEL" --overwrite
echo "Draining node $node..."
kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout="${DRAIN_TIMEOUT}s"
echo "Node $node successfully cordoned and drained"
}
# Main maintenance loop
CORDONED_COUNT=0
for node in $NODES; do
# Skip nodes that are already cordoned
if kubectl get node "$node" -o jsonpath='{.spec.unschedulable}' | grep -q 'true'; then
echo "Node $node is already cordoned, skipping"
continue
fi
# Check if we've reached the maximum number of nodes to cordon at once
if [ "$CORDONED_COUNT" -ge "$MAX_NODES_TO_CORDON" ]; then
echo "Maximum number of nodes cordoned, waiting for cluster to stabilize..."
while ! check_cluster_health; do
echo "Cluster is not healthy, waiting 60 seconds..."
sleep 60
done
CORDONED_COUNT=0
fi
# Cordon and drain the node
cordon_and_drain "$node"
CORDONED_COUNT=$((CORDONED_COUNT + 1))
# Wait between nodes to allow for pod rescheduling
echo "Waiting $WAIT_BETWEEN_NODES seconds before proceeding to next node..."
sleep "$WAIT_BETWEEN_NODES"
done
echo "All nodes have been cordoned and drained for maintenance"
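A possible invocation of the script above, assuming it is saved as sequential_drain.sh (the filename is illustrative) and run with a cluster-admin kubeconfig:
chmod +x sequential_drain.sh
./sequential_drain.sh 2>&1 | tee "maintenance-$(date +%F).log"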
• Long-term: Improved PodDisruptionBudget configurations:
# Before: Problematic PDB with absolute values
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-service-pdb
namespace: production
spec:
minAvailable: 3
selector:
matchLabels:
app: api-service
# After: Improved PDB with a percentage-based budget that scales with replica count
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-service-pdb
namespace: production
spec:
minAvailable: 80%
selector:
matchLabels:
app: api-service
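As a sanity check on the percentage-based budget: with 10 replicas, minAvailable: 80% resolves to 8 pods and allows at most 2 voluntary disruptions at a time. The resolved budget can be inspected directly (names match the manifests above):
kubectl get pdb api-service-pdb -n production \
  -o custom-columns=NAME:.metadata.name,MIN_AVAILABLE:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed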
• Implemented better pod placement rules (relaxed anti-affinity plus zone-level topology spread):
# Before: Strict anti-affinity that made rescheduling difficult
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api-service
template:
metadata:
labels:
app: api-service
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api-service
topologyKey: kubernetes.io/hostname
# After: Hard spreading across zones via topology spread constraints, soft per-node anti-affinity
# (A required zone-level anti-affinity rule would cap the deployment at one pod per zone.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-service
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-service
              topologyKey: kubernetes.io/hostname
• Created a maintenance controller in Go:
// maintenance_controller.go
package main
import (
"context"
"flag"
"fmt"
"os"
"os/signal"
"syscall"
"time"
    corev1 "k8s.io/api/core/v1"
    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/klog/v2"
)
type MaintenanceController struct {
clientset *kubernetes.Clientset
maxUnavailableNodes int
drainTimeout time.Duration
waitBetweenNodes time.Duration
maintenanceLabel string
}
func NewMaintenanceController(clientset *kubernetes.Clientset, maxUnavailableNodes int, drainTimeout, waitBetweenNodes time.Duration, maintenanceLabel string) *MaintenanceController {
return &MaintenanceController{
clientset: clientset,
maxUnavailableNodes: maxUnavailableNodes,
drainTimeout: drainTimeout,
waitBetweenNodes: waitBetweenNodes,
maintenanceLabel: maintenanceLabel,
}
}
func (c *MaintenanceController) Run(ctx context.Context) error {
klog.Info("Starting maintenance controller")
// Get all nodes
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return fmt.Errorf("failed to list nodes: %w", err)
}
totalNodes := len(nodes.Items)
klog.Infof("Total nodes in cluster: %d", totalNodes)
// Calculate maximum nodes to cordon at once (10% of total, capped at maxUnavailableNodes)
maxNodesToCordon := totalNodes * 10 / 100
if maxNodesToCordon > c.maxUnavailableNodes {
maxNodesToCordon = c.maxUnavailableNodes
}
klog.Infof("Maximum nodes to cordon at once: %d", maxNodesToCordon)
// Main maintenance loop
cordonedCount := 0
for _, node := range nodes.Items {
// Skip nodes that are already cordoned
if node.Spec.Unschedulable {
klog.Infof("Node %s is already cordoned, skipping", node.Name)
continue
}
// Check if we've reached the maximum number of nodes to cordon at once
if cordonedCount >= maxNodesToCordon {
klog.Info("Maximum number of nodes cordoned, waiting for cluster to stabilize...")
for {
healthy, err := c.checkClusterHealth(ctx)
if err != nil {
klog.Warningf("Failed to check cluster health: %v", err)
}
if healthy {
break
}
klog.Info("Cluster is not healthy, waiting 60 seconds...")
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(60 * time.Second):
}
}
cordonedCount = 0
}
// Cordon and drain the node
if err := c.cordonAndDrainNode(ctx, node.Name); err != nil {
klog.Errorf("Failed to cordon and drain node %s: %v", node.Name, err)
continue
}
cordonedCount++
// Wait between nodes to allow for pod rescheduling
klog.Infof("Waiting %v before proceeding to next node...", c.waitBetweenNodes)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(c.waitBetweenNodes):
}
}
klog.Info("All nodes have been cordoned and drained for maintenance")
return nil
}
func (c *MaintenanceController) checkClusterHealth(ctx context.Context) (bool, error) {
// Check if any PDBs are violated
pdbs, err := c.clientset.PolicyV1().PodDisruptionBudgets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list PDBs: %w", err)
}
pdbViolations := 0
for _, pdb := range pdbs.Items {
if pdb.Status.DisruptionsAllowed == 0 {
pdbViolations++
}
}
if pdbViolations > 0 {
klog.Warningf("%d PDBs currently have 0 disruptions allowed", pdbViolations)
return false, nil
}
// Check if any deployments are not fully available
deployments, err := c.clientset.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list deployments: %w", err)
}
unavailableDeployments := 0
for _, deployment := range deployments.Items {
if deployment.Status.AvailableReplicas < deployment.Status.Replicas {
unavailableDeployments++
}
}
if unavailableDeployments > 0 {
klog.Warningf("%d deployments are not fully available", unavailableDeployments)
return false, nil
}
return true, nil
}
func (c *MaintenanceController) cordonAndDrainNode(ctx context.Context, nodeName string) error {
klog.Infof("Cordoning node %s...", nodeName)
// Get the node
node, err := c.clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
return fmt.Errorf("failed to get node %s: %w", nodeName, err)
}
    // Cordon the node and apply the maintenance label in a single update
    // to avoid a conflict from writing a stale object twice
    node.Spec.Unschedulable = true
    if node.Labels == nil {
        node.Labels = make(map[string]string)
    }
    key, value, found := splitLabel(c.maintenanceLabel)
    if found {
        node.Labels[key] = value
    }
    _, err = c.clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    if err != nil {
        return fmt.Errorf("failed to cordon and label node %s: %w", nodeName, err)
    }
    // Drain the node by evicting pods through the Eviction API so that
    // PodDisruptionBudgets are respected (simplified: no per-pod retry or backoff)
pods, err := c.clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
})
if err != nil {
return fmt.Errorf("failed to list pods on node %s: %w", nodeName, err)
}
for _, pod := range pods.Items {
// Skip DaemonSet pods
if isDaemonSetPod(&pod) {
continue
}
klog.Infof("Evicting pod %s/%s from node %s", pod.Namespace, pod.Name, nodeName)
        err := c.clientset.CoreV1().Pods(pod.Namespace).EvictV1(ctx, &policyv1.Eviction{
ObjectMeta: metav1.ObjectMeta{
Name: pod.Name,
Namespace: pod.Namespace,
},
})
if err != nil {
klog.Warningf("Failed to evict pod %s/%s: %v", pod.Namespace, pod.Name, err)
}
}
// Wait for pods to be evicted
deadline := time.Now().Add(c.drainTimeout)
for time.Now().Before(deadline) {
pods, err := c.clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
})
if err != nil {
return fmt.Errorf("failed to list pods on node %s: %w", nodeName, err)
}
nonDaemonSetPods := 0
for _, pod := range pods.Items {
if !isDaemonSetPod(&pod) {
nonDaemonSetPods++
}
}
if nonDaemonSetPods == 0 {
klog.Infof("Node %s successfully cordoned and drained", nodeName)
return nil
}
klog.Infof("Waiting for %d pods to be evicted from node %s...", nonDaemonSetPods, nodeName)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(5 * time.Second):
}
}
return fmt.Errorf("timed out waiting for pods to be evicted from node %s", nodeName)
}
func isDaemonSetPod(pod *corev1.Pod) bool {
for _, ownerRef := range pod.OwnerReferences {
if ownerRef.Kind == "DaemonSet" {
return true
}
}
return false
}
func splitLabel(label string) (string, string, bool) {
for i, c := range label {
if c == '=' {
return label[:i], label[i+1:], true
}
}
return "", "", false
}
func main() {
klog.InitFlags(nil)
var (
kubeconfig string
maxUnavailableNodes int
drainTimeout time.Duration
waitBetweenNodes time.Duration
maintenanceLabel string
)
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.IntVar(&maxUnavailableNodes, "max-unavailable-nodes", 3, "Maximum number of nodes to cordon at once")
flag.DurationVar(&drainTimeout, "drain-timeout", 5*time.Minute, "Timeout for draining a node")
flag.DurationVar(&waitBetweenNodes, "wait-between-nodes", 2*time.Minute, "Time to wait between cordoning nodes")
flag.StringVar(&maintenanceLabel, "maintenance-label", "maintenance=true", "Label to apply to nodes under maintenance")
flag.Parse()
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
klog.Info("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
klog.Infof("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
klog.Fatalf("Failed to create Kubernetes client config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
klog.Fatalf("Failed to create Kubernetes clientset: %v", err)
}
// Create and run controller
controller := NewMaintenanceController(
clientset,
maxUnavailableNodes,
drainTimeout,
waitBetweenNodes,
maintenanceLabel,
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Handle signals
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
go func() {
sig := <-sigCh
klog.Infof("Received signal %s, shutting down", sig)
cancel()
}()
if err := controller.Run(ctx); err != nil && err != context.Canceled {
klog.Fatalf("Error running controller: %v", err)
}
}
Lessons Learned:
Node cordoning and draining requires careful orchestration to prevent service disruption.
How to Avoid:
Implement percentage-based PodDisruptionBudgets instead of absolute values.
Cordon and drain nodes sequentially with health checks between operations.
Use soft (preferred) anti-affinity or topology spread constraints instead of strict required anti-affinity.
Maintain sufficient resource headroom for pod migrations during maintenance.
Test maintenance procedures in a staging environment before production.
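A lightweight pre-flight check before any maintenance run might look like this sketch, reusing the same jq-based PDB check as the script above:
# Abort maintenance if any PDB already has zero allowed disruptions
BLOCKED=$(kubectl get pdb -A -o json | jq '[.items[] | select(.status.disruptionsAllowed == 0)] | length')
if [ "$BLOCKED" -gt 0 ]; then
  echo "Refusing to start: $BLOCKED PDBs have no disruption budget left"
  exit 1
fi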
No summary provided
What Happened:
During a marketing campaign, several critical microservices became unavailable. Investigation showed that pods were being evicted from nodes, and some nodes were being marked as NotReady. The issue affected multiple services across different namespaces.
Diagnosis Steps:
Examined Kubernetes events for eviction notices and node status changes.
Analyzed node resource metrics (CPU, memory, disk) before and during the incident.
Reviewed pod resource requests and limits across affected namespaces.
Checked kubelet logs for resource pressure signals.
Investigated system logs for OOM killer activity.
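The memory-pressure diagnosis steps above map roughly onto commands like these (node names are placeholders):
kubectl get events -A --field-selector reason=Evicted --sort-by=.lastTimestamp
kubectl describe node node-1 | grep -A8 'Conditions:'     # look for MemoryPressure=True
ssh node-1 "journalctl -u kubelet --since '2 hours ago' | grep -i evict"
ssh node-1 "dmesg -T | grep -i 'out of memory'"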
Root Cause:
The cluster was experiencing memory pressure due to a combination of factors:
1. Many pods had no memory limits defined, allowing them to consume excessive memory.
2. System daemon and kubelet memory reservations were not properly configured.
3. Node memory was fragmented, preventing efficient allocation.
4. Some pods had memory leaks that gradually consumed available memory.
5. The cluster autoscaler was not configured to scale based on memory pressure.
Fix/Workaround:
• Short-term: Increased node size and manually cordoned affected nodes:
# Cordon affected nodes to prevent new pods from being scheduled
kubectl cordon node-1 node-2 node-3
# Drain pods from affected nodes (with grace period to allow proper termination)
kubectl drain node-1 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
kubectl drain node-2 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
kubectl drain node-3 --grace-period=300 --ignore-daemonsets --delete-emptydir-data
# Restart kubelet on affected nodes
ssh node-1 "sudo systemctl restart kubelet"
ssh node-2 "sudo systemctl restart kubelet"
ssh node-3 "sudo systemctl restart kubelet"
# Uncordon nodes after recovery
kubectl uncordon node-1 node-2 node-3
• Long-term: Implemented proper resource management:
# Set appropriate kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
memory.available: "500Mi"
nodefs.available: "10%"
nodefs.inodesFree: "5%"
evictionSoft:
memory.available: "1Gi"
nodefs.available: "15%"
nodefs.inodesFree: "10%"
evictionSoftGracePeriod:
memory.available: "1m"
nodefs.available: "1m"
nodefs.inodesFree: "1m"
evictionMaxPodGracePeriod: 300
evictionPressureTransitionPeriod: "30s"
systemReserved:
memory: "1Gi"
cpu: "500m"
kubeReserved:
memory: "1Gi"
cpu: "1"
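With these reservations in place, a node's Allocatable should equal its Capacity minus systemReserved, kubeReserved, and the hard eviction threshold; this can be spot-checked per node (node name is a placeholder):
kubectl describe node node-1 | grep -A14 'Capacity:'
kubectl get node node-1 -o jsonpath='{.status.allocatable.memory}{"\n"}'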
• Created a LimitRange to enforce default resource limits:
# Default resource limits for namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
memory: 512Mi
cpu: 500m
defaultRequest:
memory: 256Mi
cpu: 100m
type: Container
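To confirm the LimitRange defaults are being injected, one can create a throwaway pod without explicit resources and inspect what the admission controller filled in (pod name and image are illustrative):
kubectl run limits-check -n production --image=busybox --restart=Never -- sleep 30
kubectl get pod limits-check -n production -o jsonpath='{.spec.containers[0].resources}{"\n"}'
kubectl delete pod limits-check -n production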
• Implemented a ResourceQuota for namespaces:
# Resource quota for namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: namespace-quota
namespace: production
spec:
hard:
requests.cpu: "20"
requests.memory: 20Gi
limits.cpu: "40"
limits.memory: 40Gi
pods: "50"
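Quota consumption versus the hard limits can then be tracked with:
kubectl describe resourcequota namespace-quota -n production
kubectl get quota -A     # cluster-wide view of request/limit consumption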
• Created a Vertical Pod Autoscaler configuration:
# Vertical Pod Autoscaler configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: my-app-vpa
namespace: production
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: my-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: '*'
minAllowed:
cpu: 50m
memory: 100Mi
maxAllowed:
cpu: 1
memory: 1Gi
controlledResources: ["cpu", "memory"]
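With the VPA objects in place (this assumes the VPA CRDs and recommender are installed), the current recommendations can be reviewed before trusting updateMode: Auto:
kubectl describe vpa my-app-vpa -n production | grep -A12 'Recommendation:'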
• Configured Cluster Autoscaler to consider memory pressure:
# Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
extraArgs:
scale-down-utilization-threshold: "0.5"
scale-down-unneeded-time: "5m"
max-node-provision-time: "15m"
scan-interval: "10s"
scale-down-delay-after-add: "5m"
scale-down-delay-after-delete: "0s"
scale-down-delay-after-failure: "3m"
expendable-pods-priority-cutoff: "-10"
balance-similar-node-groups: "true"
expander: "least-waste"
skip-nodes-with-local-storage: "false"
skip-nodes-with-system-pods: "true"
max-graceful-termination-sec: "600"
memory-utilization-threshold: "0.7"
memory-pressure-threshold: "0.85"
• Implemented a Go-based memory monitoring sidecar:
// memory_monitor.go
package main
import (
    "context"
    "fmt"
    "log"
    "math"
    "net/http"
    "os"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/metrics/pkg/client/clientset/versioned"
)
var (
// Prometheus metrics
nodeMemoryPressure = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_pressure",
Help: "Memory pressure on nodes (0-1)",
},
[]string{"node"},
)
podMemoryUsage = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_usage_bytes",
Help: "Memory usage of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
podMemoryLimit = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_limit_bytes",
Help: "Memory limit of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
podMemoryRequest = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_memory_request_bytes",
Help: "Memory request of pods in bytes",
},
[]string{"namespace", "pod", "node"},
)
nodeMemoryCapacity = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_capacity_bytes",
Help: "Memory capacity of nodes in bytes",
},
[]string{"node"},
)
nodeMemoryAllocatable = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "node_memory_allocatable_bytes",
Help: "Allocatable memory of nodes in bytes",
},
[]string{"node"},
)
evictionRisk = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "pod_eviction_risk",
Help: "Risk of pod eviction (0-1)",
},
[]string{"namespace", "pod", "node"},
)
)
func main() {
// Set up Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to get cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
metricsClient, err := versioned.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Metrics client: %v", err)
}
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Get monitoring interval from environment variable or use default
interval := 60
if intervalStr := os.Getenv("MONITORING_INTERVAL"); intervalStr != "" {
if i, err := strconv.Atoi(intervalStr); err == nil {
interval = i
}
}
// Start monitoring loop
for {
monitorNodes(clientset)
monitorPods(clientset, metricsClient)
time.Sleep(time.Duration(interval) * time.Second)
}
}
func monitorNodes(clientset *kubernetes.Clientset) {
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list nodes: %v", err)
return
}
for _, node := range nodes.Items {
// Get node capacity and allocatable memory
memoryCapacity := node.Status.Capacity.Memory().Value()
memoryAllocatable := node.Status.Allocatable.Memory().Value()
nodeMemoryCapacity.WithLabelValues(node.Name).Set(float64(memoryCapacity))
nodeMemoryAllocatable.WithLabelValues(node.Name).Set(float64(memoryAllocatable))
// Check for memory pressure condition
for _, condition := range node.Status.Conditions {
if condition.Type == "MemoryPressure" {
if condition.Status == "True" {
nodeMemoryPressure.WithLabelValues(node.Name).Set(1)
} else {
nodeMemoryPressure.WithLabelValues(node.Name).Set(0)
}
}
}
}
}
func monitorPods(clientset *kubernetes.Clientset, metricsClient *versioned.Clientset) {
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list pods: %v", err)
return
}
// Get pod metrics
podMetrics, err := metricsClient.MetricsV1beta1().PodMetricses("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to get pod metrics: %v", err)
return
}
// Create a map of pod metrics for easier lookup
podMetricsMap := make(map[string]map[string]int64)
for _, podMetric := range podMetrics.Items {
key := fmt.Sprintf("%s/%s", podMetric.Namespace, podMetric.Name)
podMetricsMap[key] = make(map[string]int64)
for _, container := range podMetric.Containers {
memory := container.Usage.Memory().Value()
if _, exists := podMetricsMap[key]["memory"]; exists {
podMetricsMap[key]["memory"] += memory
} else {
podMetricsMap[key]["memory"] = memory
}
}
}
// Get node allocatable memory for calculating pressure
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list nodes: %v", err)
return
}
    nodeAllocatableMemory := make(map[string]int64)
    nodeUnderMemoryPressure := make(map[string]bool)
    for _, node := range nodes.Items {
        nodeAllocatableMemory[node.Name] = node.Status.Allocatable.Memory().Value()
        for _, condition := range node.Status.Conditions {
            if condition.Type == "MemoryPressure" && condition.Status == "True" {
                nodeUnderMemoryPressure[node.Name] = true
            }
        }
    }
// Process each pod
for _, pod := range pods.Items {
// Skip pods that are not running
if pod.Status.Phase != "Running" {
continue
}
// Get pod memory usage from metrics
key := fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
memoryUsage := int64(0)
if metrics, exists := podMetricsMap[key]; exists {
memoryUsage = metrics["memory"]
}
// Set pod memory usage metric
podMemoryUsage.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryUsage))
// Calculate total memory requests and limits for the pod
memoryRequest := int64(0)
memoryLimit := int64(0)
for _, container := range pod.Spec.Containers {
if container.Resources.Requests != nil {
if memory, exists := container.Resources.Requests["memory"]; exists {
memoryRequest += memory.Value()
}
}
if container.Resources.Limits != nil {
if memory, exists := container.Resources.Limits["memory"]; exists {
memoryLimit += memory.Value()
}
}
}
// Set pod memory request and limit metrics
podMemoryRequest.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryRequest))
podMemoryLimit.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(float64(memoryLimit))
// Calculate eviction risk
risk := 0.0
        // If the pod's node is under memory pressure, every pod on that node is at risk
        if nodeUnderMemoryPressure[pod.Spec.NodeName] {
            risk = 1.0
        }
// If pod has no memory limit, it's at higher risk
if memoryLimit == 0 {
risk = math.Max(risk, 0.8)
}
// If pod is using more than 90% of its limit, it's at risk
if memoryLimit > 0 && float64(memoryUsage)/float64(memoryLimit) > 0.9 {
risk = math.Max(risk, 0.7)
}
// If node allocatable memory is getting low, pods are at risk
if nodeAllocatable, exists := nodeAllocatableMemory[pod.Spec.NodeName]; exists {
nodeUsageRatio := float64(memoryUsage) / float64(nodeAllocatable)
if nodeUsageRatio > 0.85 {
risk = math.Max(risk, nodeUsageRatio - 0.5)
}
}
// Set eviction risk metric
evictionRisk.WithLabelValues(pod.Namespace, pod.Name, pod.Spec.NodeName).Set(risk)
}
}
• Created a Rust-based memory leak detection tool:
// memory_leak_detector.rs
use chrono::{DateTime, Utc};
use futures::stream::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
api::{Api, ListParams, WatchEvent},
Client,
};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;
use tokio::time;
#[derive(Debug, Clone, Serialize, Deserialize)]
struct MemoryUsageSample {
timestamp: DateTime<Utc>,
usage_bytes: u64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct PodMemoryHistory {
namespace: String,
pod_name: String,
node_name: String,
samples: Vec<MemoryUsageSample>,
start_time: DateTime<Utc>,
last_updated: DateTime<Utc>,
}
impl PodMemoryHistory {
fn new(namespace: String, pod_name: String, node_name: String) -> Self {
let now = Utc::now();
PodMemoryHistory {
namespace,
pod_name,
node_name,
samples: Vec::new(),
start_time: now,
last_updated: now,
}
}
fn add_sample(&mut self, usage_bytes: u64) {
let now = Utc::now();
self.samples.push(MemoryUsageSample {
timestamp: now,
usage_bytes,
});
self.last_updated = now;
// Keep only the last 24 hours of samples
let one_day_ago = now - chrono::Duration::hours(24);
self.samples.retain(|sample| sample.timestamp > one_day_ago);
}
fn detect_leak(&self) -> Option<LeakInfo> {
// Need at least 10 samples to detect a trend
if self.samples.len() < 10 {
return None;
}
// Calculate linear regression to detect trend
let n = self.samples.len() as f64;
let timestamps: Vec<f64> = self.samples
.iter()
.map(|s| s.timestamp.timestamp() as f64)
.collect();
let usages: Vec<f64> = self.samples
.iter()
.map(|s| s.usage_bytes as f64)
.collect();
let sum_x: f64 = timestamps.iter().sum();
let sum_y: f64 = usages.iter().sum();
let sum_xy: f64 = timestamps.iter().zip(usages.iter()).map(|(x, y)| x * y).sum();
let sum_xx: f64 = timestamps.iter().map(|x| x * x).sum();
let slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x);
let intercept = (sum_y - slope * sum_x) / n;
// Calculate R-squared to determine how well the line fits
let mean_y = sum_y / n;
let ss_tot: f64 = usages.iter().map(|y| (y - mean_y).powi(2)).sum();
let ss_res: f64 = usages
.iter()
.zip(timestamps.iter())
.map(|(y, x)| (y - (slope * x + intercept)).powi(2))
.sum();
let r_squared = 1.0 - (ss_res / ss_tot);
// Calculate growth rate in bytes per hour
let growth_rate_per_second = slope;
let growth_rate_per_hour = growth_rate_per_second * 3600.0;
// If we have a positive slope, good R-squared, and significant growth, it's a leak
if slope > 0.0 && r_squared > 0.7 && growth_rate_per_hour > 1024.0 * 1024.0 {
let hours_to_1gb = (1024.0 * 1024.0 * 1024.0 - self.samples.last().unwrap().usage_bytes as f64) / growth_rate_per_hour;
Some(LeakInfo {
growth_rate_bytes_per_hour: growth_rate_per_hour as u64,
r_squared,
hours_to_1gb: if hours_to_1gb > 0.0 { Some(hours_to_1gb) } else { None },
current_usage_bytes: self.samples.last().unwrap().usage_bytes,
duration_hours: (self.last_updated - self.start_time).num_seconds() as f64 / 3600.0,
})
} else {
None
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct LeakInfo {
growth_rate_bytes_per_hour: u64,
r_squared: f64,
hours_to_1gb: Option<f64>,
current_usage_bytes: u64,
duration_hours: f64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct LeakReport {
namespace: String,
pod_name: String,
node_name: String,
leak_info: LeakInfo,
timestamp: DateTime<Utc>,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize Kubernetes client
let client = Client::try_default().await?;
// Create a shared state for pod memory history
let pod_memory_history = Arc::new(Mutex::new(HashMap::<String, PodMemoryHistory>::new()));
// Start the memory usage collector
    let collector_history = pod_memory_history.clone();
    let collector_client = client.clone();
    tokio::spawn(async move {
        collect_memory_usage(collector_client, collector_history).await;
    });
    // Start the leak detector
    let detector_history = pod_memory_history.clone();
    tokio::spawn(async move {
        detect_memory_leaks(detector_history).await;
    });
    // Start the pod watcher to track new and deleted pods
    let watcher_history = pod_memory_history.clone();
    tokio::spawn(async move {
        watch_pods(client, watcher_history).await;
    });
// Keep the main thread running
loop {
time::sleep(Duration::from_secs(3600)).await;
}
}
// Minimal local types for the metrics.k8s.io API response; these types are not
// provided by k8s-openapi, so only the fields needed here are deserialized.
#[derive(Debug, Deserialize)]
struct PodMetricsList { items: Vec<PodMetricsItem> }
#[derive(Debug, Deserialize)]
struct PodMetricsItem { metadata: PodMetricsMeta, containers: Vec<ContainerUsage> }
#[derive(Debug, Deserialize)]
struct PodMetricsMeta { name: Option<String>, namespace: Option<String> }
#[derive(Debug, Deserialize)]
struct ContainerUsage { usage: HashMap<String, String> }

// Parse a Kubernetes memory quantity (e.g. "128974848" or "129Mi") into bytes.
fn parse_memory_quantity(q: &str) -> Option<u64> {
    let suffixes: [(&str, u64); 6] =
        [("Ki", 1 << 10), ("Mi", 1 << 20), ("Gi", 1 << 30), ("Ti", 1u64 << 40), ("Pi", 1u64 << 50), ("Ei", 1u64 << 60)];
    for (suffix, factor) in suffixes {
        if let Some(num) = q.strip_suffix(suffix) {
            return num.parse::<u64>().ok().map(|n| n.saturating_mul(factor));
        }
    }
    q.parse::<u64>().ok()
}

async fn collect_memory_usage(client: Client, history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
    loop {
        // Query the metrics.k8s.io API directly and deserialize into the local
        // types above (the `http` crate is required as a dependency).
        let req = http::Request::get("/apis/metrics.k8s.io/v1beta1/pods")
            .body(Vec::new())
            .unwrap();
        match client.request::<PodMetricsList>(req).await {
            Ok(metrics) => {
                let mut history_lock = history.lock().await;
                for pod_metrics in metrics.items {
                    let namespace = pod_metrics.metadata.namespace.unwrap_or_default();
                    let pod_name = pod_metrics.metadata.name.unwrap_or_default();
                    let key = format!("{}/{}", namespace, pod_name);
                    // Sum memory usage across all containers in the pod
                    let total_memory: u64 = pod_metrics.containers.iter()
                        .filter_map(|c| c.usage.get("memory").and_then(|q| parse_memory_quantity(q)))
                        .sum();
                    // Update or create the pod's history
                    if let Some(pod_history) = history_lock.get_mut(&key) {
                        pod_history.add_sample(total_memory);
                    } else {
                        // Look up the node name from the pod object for new entries
                        let pods: Api<Pod> = Api::namespaced(client.clone(), &namespace);
                        if let Ok(pod) = pods.get(&pod_name).await {
                            let node_name = pod.spec.and_then(|s| s.node_name).unwrap_or_default();
                            let mut new_history = PodMemoryHistory::new(namespace, pod_name, node_name);
                            new_history.add_sample(total_memory);
                            history_lock.insert(key, new_history);
                        }
                    }
                }
            }
            Err(e) => {
                eprintln!("Failed to get pod metrics: {}", e);
            }
        }
        // Collect metrics every minute
        time::sleep(Duration::from_secs(60)).await;
    }
}
async fn detect_memory_leaks(history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
loop {
let mut reports = Vec::new();
// Check for leaks
{
let history_lock = history.lock().await;
for (key, pod_history) in history_lock.iter() {
if let Some(leak_info) = pod_history.detect_leak() {
reports.push(LeakReport {
namespace: pod_history.namespace.clone(),
pod_name: pod_history.pod_name.clone(),
node_name: pod_history.node_name.clone(),
leak_info,
timestamp: Utc::now(),
});
}
}
}
// Report leaks
if !reports.is_empty() {
println!("Memory leak reports:");
for report in reports {
println!("Pod: {}/{} (Node: {})", report.namespace, report.pod_name, report.node_name);
println!(" Growth rate: {} MB/hour", report.leak_info.growth_rate_bytes_per_hour / (1024 * 1024));
println!(" R-squared: {:.4}", report.leak_info.r_squared);
if let Some(hours) = report.leak_info.hours_to_1gb {
println!(" Hours to 1GB: {:.2}", hours);
}
println!(" Current usage: {} MB", report.leak_info.current_usage_bytes / (1024 * 1024));
println!(" Monitored for: {:.2} hours", report.leak_info.duration_hours);
println!();
// Send alert (implement your preferred alerting mechanism)
send_alert(&report).await;
}
}
// Run leak detection every 15 minutes
time::sleep(Duration::from_secs(900)).await;
}
}
async fn watch_pods(client: Client, history: Arc<Mutex<HashMap<String, PodMemoryHistory>>>) {
let pods: Api<Pod> = Api::all(client);
let lp = ListParams::default()
.timeout(60)
.allow_bookmarks();
let mut stream = pods.watch(&lp, "").await.unwrap().boxed();
while let Some(event) = stream.next().await {
match event {
Ok(WatchEvent::Added(pod)) | Ok(WatchEvent::Modified(pod)) => {
let namespace = pod.metadata.namespace.unwrap_or_default();
let pod_name = pod.metadata.name.unwrap_or_default();
let node_name = pod.spec.and_then(|s| s.node_name).unwrap_or_default();
let key = format!("{}/{}", namespace, pod_name);
let mut history_lock = history.lock().await;
if !history_lock.contains_key(&key) {
history_lock.insert(key, PodMemoryHistory::new(namespace, pod_name, node_name));
}
}
Ok(WatchEvent::Deleted(pod)) => {
let namespace = pod.metadata.namespace.unwrap_or_default();
let pod_name = pod.metadata.name.unwrap_or_default();
let key = format!("{}/{}", namespace, pod_name);
let mut history_lock = history.lock().await;
history_lock.remove(&key);
}
Ok(WatchEvent::Bookmark(_)) => {}
Err(e) => {
eprintln!("Watch error: {}", e);
time::sleep(Duration::from_secs(5)).await;
}
}
}
}
async fn send_alert(report: &LeakReport) {
// Implement your preferred alerting mechanism
// This could be Slack, PagerDuty, email, etc.
println!("ALERT: Memory leak detected in pod {}/{}", report.namespace, report.pod_name);
// Example: Send to webhook
let client = reqwest::Client::new();
let _ = client.post("https://alerts.example.com/webhook")
.json(report)
.send()
.await;
}
Lessons Learned:
Proper resource management is critical for Kubernetes cluster stability.
How to Avoid:
Define appropriate resource requests and limits for all containers.
Configure kubelet with proper eviction thresholds and system reservations.
Implement monitoring for memory usage and pressure.
Use LimitRange and ResourceQuota to enforce resource constraints.
Implement memory leak detection for critical applications.
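A quick way to audit the first point, finding containers that still have no memory limit, is a jq query along these lines:
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.limits.memory == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'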
No summary provided
What Happened:
After implementing Cluster Autoscaler for a production EKS cluster, the operations team noticed that nodes were being added and removed frequently throughout the day, sometimes within minutes of each other. This caused pods to be evicted and rescheduled repeatedly, leading to service disruptions and increased AWS costs due to the frequent scaling operations.
Diagnosis Steps:
Analyzed Cluster Autoscaler logs to understand scaling decisions.
Reviewed pod resource requests and limits across the cluster.
Examined node utilization patterns and metrics.
Checked HorizontalPodAutoscaler configurations and behavior.
Investigated pod scheduling and eviction events.
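These diagnosis steps can be approximated with commands like the following; the deployment name cluster-autoscaler and the status ConfigMap are assumptions that depend on how the autoscaler was installed:
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=500 | grep -Ei 'scale.?down|scale.?up|removing node'
kubectl get events -A --sort-by=.lastTimestamp | grep -i cluster-autoscaler
kubectl get hpa -A
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml | head -n 40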
Root Cause:
Multiple issues contributed to the autoscaler thrashing:
1. Pod resource requests were set too low compared to actual usage.
2. The Cluster Autoscaler's scale-down delay was too short (5 minutes).
3. HorizontalPodAutoscalers were configured with overly aggressive scaling parameters.
4. Some stateful applications had no Pod Disruption Budgets defined.
5. The node groups were configured with small step sizes (one node at a time).
Fix/Workaround:
• Short-term: Implemented immediate configuration changes:
# Before: Problematic Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
scaleDownUtilizationThreshold: 0.5
scaleDownUnneededTime: 5m
scaleDownDelayAfterAdd: 5m
maxNodeProvisionTime: 15m
scanInterval: 10s
expendablePodsPriorityCutoff: -10
# After: Optimized Cluster Autoscaler configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
scaleDownUtilizationThreshold: 0.5
scaleDownUnneededTime: 15m
scaleDownDelayAfterAdd: 15m
maxNodeProvisionTime: 15m
scanInterval: 30s
expendablePodsPriorityCutoff: -10
scaleDownUnreadyTime: 20m
okTotalUnreadyCount: 3
maxGracefulTerminationSec: 600
ignoreDaemonSetsUtilization: true
skipNodesWithLocalStorage: true
• Added Pod Disruption Budgets for critical services:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
minAvailable: 2 # or use maxUnavailable: 1
selector:
matchLabels:
app: critical-service
• Adjusted resource requests to match actual usage:
# Before: Underprovisioned resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 5
template:
spec:
containers:
- name: api-container
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
# After: Properly sized resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
spec:
replicas: 5
template:
spec:
containers:
- name: api-container
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
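Right-sizing decisions like the one above come from comparing live usage with declared requests; with metrics-server installed the comparison is straightforward:
kubectl top pod -n production --containers | grep api-service
kubectl get deploy api-service -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'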
• Long-term: Implemented a comprehensive autoscaling strategy:
// autoscaling_analyzer.go
package main
import (
    "context"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "sort"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/metrics/pkg/client/clientset/versioned"
)
// PodMetrics represents resource usage metrics for a pod
type PodMetrics struct {
Name string
Namespace string
Containers map[string]ContainerMetrics
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPURatio float64
MemoryRatio float64
}
// ContainerMetrics represents resource usage metrics for a container
type ContainerMetrics struct {
Name string
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPURatio float64
MemoryRatio float64
}
// NodeMetrics represents resource usage metrics for a node
type NodeMetrics struct {
Name string
AllocatableCPU int64
AllocatableMemory int64
RequestCPU int64
RequestMemory int64
UsageCPU int64
UsageMemory int64
CPUUtilization float64
MemoryUtilization float64
Pods int
PodCapacity int
PodUtilization float64
}
// AutoscalingRecommendation represents a recommendation for autoscaling configuration
type AutoscalingRecommendation struct {
ResourceType string `json:"resourceType"`
Name string `json:"name"`
Namespace string `json:"namespace"`
CurrentValue string `json:"currentValue"`
RecommendedValue string `json:"recommendedValue"`
Reason string `json:"reason"`
Impact string `json:"impact"`
Priority int `json:"priority"`
}
func main() {
// Load Kubernetes configuration
kubeconfig := os.Getenv("KUBECONFIG")
var config *rest.Config
var err error
if kubeconfig != "" {
// Use kubeconfig file
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
} else {
// Use in-cluster config
config, err = rest.InClusterConfig()
}
if err != nil {
log.Fatalf("Error building kubeconfig: %v", err)
}
// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating Kubernetes client: %v", err)
}
// Create metrics clientset
metricsClientset, err := versioned.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating metrics client: %v", err)
}
// Collect pod metrics
podMetrics, err := collectPodMetrics(clientset, metricsClientset)
if err != nil {
log.Fatalf("Error collecting pod metrics: %v", err)
}
// Collect node metrics
nodeMetrics, err := collectNodeMetrics(clientset, metricsClientset)
if err != nil {
log.Fatalf("Error collecting node metrics: %v", err)
}
// Analyze autoscaling configuration
recommendations := analyzeAutoscaling(clientset, podMetrics, nodeMetrics)
// Output recommendations
outputRecommendations(recommendations)
}
func collectPodMetrics(clientset *kubernetes.Clientset, metricsClientset *versioned.Clientset) ([]PodMetrics, error) {
// Get all pods
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing pods: %v", err)
}
// Get pod metrics
podMetricsList, err := metricsClientset.MetricsV1beta1().PodMetricses("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error getting pod metrics: %v", err)
}
    // Create a map of pod metrics: namespace -> pod -> container -> resource -> value
    podMetricsMap := make(map[string]map[string]map[string]map[string]int64)
    for _, podMetric := range podMetricsList.Items {
        if _, ok := podMetricsMap[podMetric.Namespace]; !ok {
            podMetricsMap[podMetric.Namespace] = make(map[string]map[string]map[string]int64)
        }
        podMetricsMap[podMetric.Namespace][podMetric.Name] = make(map[string]map[string]int64)
        for _, container := range podMetric.Containers {
            if _, ok := podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]; !ok {
                podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name] = make(map[string]int64)
            }
            podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]["cpu"] = container.Usage.Cpu().MilliValue()
            podMetricsMap[podMetric.Namespace][podMetric.Name][container.Name]["memory"] = container.Usage.Memory().Value() / (1024 * 1024) // convert to Mi
        }
    }
}
// Process pod metrics
var result []PodMetrics
for _, pod := range pods.Items {
if pod.Status.Phase != "Running" {
continue
}
podMetric := PodMetrics{
Name: pod.Name,
Namespace: pod.Namespace,
Containers: make(map[string]ContainerMetrics),
}
for _, container := range pod.Spec.Containers {
containerMetric := ContainerMetrics{
Name: container.Name,
}
// Get resource requests
if container.Resources.Requests != nil {
containerMetric.RequestCPU = container.Resources.Requests.Cpu().MilliValue()
containerMetric.RequestMemory = container.Resources.Requests.Memory().Value() / (1024 * 1024) // Convert to Mi
}
// Get resource usage
if metrics, ok := podMetricsMap[pod.Namespace][pod.Name][container.Name]; ok {
containerMetric.UsageCPU = metrics["cpu"]
containerMetric.UsageMemory = metrics["memory"]
// Calculate ratios
if containerMetric.RequestCPU > 0 {
containerMetric.CPURatio = float64(containerMetric.UsageCPU) / float64(containerMetric.RequestCPU)
}
if containerMetric.RequestMemory > 0 {
containerMetric.MemoryRatio = float64(containerMetric.UsageMemory) / float64(containerMetric.RequestMemory)
}
}
podMetric.Containers[container.Name] = containerMetric
// Aggregate pod-level metrics
podMetric.RequestCPU += containerMetric.RequestCPU
podMetric.RequestMemory += containerMetric.RequestMemory
podMetric.UsageCPU += containerMetric.UsageCPU
podMetric.UsageMemory += containerMetric.UsageMemory
}
// Calculate pod-level ratios
if podMetric.RequestCPU > 0 {
podMetric.CPURatio = float64(podMetric.UsageCPU) / float64(podMetric.RequestCPU)
}
if podMetric.RequestMemory > 0 {
podMetric.MemoryRatio = float64(podMetric.UsageMemory) / float64(podMetric.RequestMemory)
}
result = append(result, podMetric)
}
return result, nil
}
func collectNodeMetrics(clientset *kubernetes.Clientset, metricsClientset *versioned.Clientset) ([]NodeMetrics, error) {
// Get all nodes
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing nodes: %v", err)
}
// Get node metrics
nodeMetricsList, err := metricsClientset.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error getting node metrics: %v", err)
}
// Create a map of node metrics
nodeMetricsMap := make(map[string]map[string]int64)
for _, nodeMetric := range nodeMetricsList.Items {
nodeMetricsMap[nodeMetric.Name] = make(map[string]int64)
nodeMetricsMap[nodeMetric.Name]["cpu"] = nodeMetric.Usage.Cpu().MilliValue()
nodeMetricsMap[nodeMetric.Name]["memory"] = nodeMetric.Usage.Memory().Value() / (1024 * 1024) // Convert to Mi
}
// Get all pods to calculate resource requests per node
pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("error listing pods: %v", err)
}
// Create a map of resource requests per node
nodeRequestsMap := make(map[string]map[string]int64)
nodePodsMap := make(map[string]int)
for _, pod := range pods.Items {
if pod.Status.Phase != "Running" {
continue
}
nodeName := pod.Spec.NodeName
if nodeName == "" {
continue
}
if _, ok := nodeRequestsMap[nodeName]; !ok {
nodeRequestsMap[nodeName] = make(map[string]int64)
nodeRequestsMap[nodeName]["cpu"] = 0
nodeRequestsMap[nodeName]["memory"] = 0
}
nodePodsMap[nodeName]++
for _, container := range pod.Spec.Containers {
if container.Resources.Requests != nil {
nodeRequestsMap[nodeName]["cpu"] += container.Resources.Requests.Cpu().MilliValue()
nodeRequestsMap[nodeName]["memory"] += container.Resources.Requests.Memory().Value() / (1024 * 1024) // Convert to Mi
}
}
}
// Process node metrics
var result []NodeMetrics
for _, node := range nodes.Items {
nodeMetric := NodeMetrics{
Name: node.Name,
}
// Get allocatable resources
nodeMetric.AllocatableCPU = node.Status.Allocatable.Cpu().MilliValue()
nodeMetric.AllocatableMemory = node.Status.Allocatable.Memory().Value() / (1024 * 1024) // Convert to Mi
nodeMetric.PodCapacity = int(node.Status.Allocatable.Pods().Value())
// Get resource requests
if requests, ok := nodeRequestsMap[node.Name]; ok {
nodeMetric.RequestCPU = requests["cpu"]
nodeMetric.RequestMemory = requests["memory"]
}
// Get resource usage
if metrics, ok := nodeMetricsMap[node.Name]; ok {
nodeMetric.UsageCPU = metrics["cpu"]
nodeMetric.UsageMemory = metrics["memory"]
}
// Get pod count
nodeMetric.Pods = nodePodsMap[node.Name]
// Calculate utilization
nodeMetric.CPUUtilization = float64(nodeMetric.UsageCPU) / float64(nodeMetric.AllocatableCPU)
nodeMetric.MemoryUtilization = float64(nodeMetric.UsageMemory) / float64(nodeMetric.AllocatableMemory)
nodeMetric.PodUtilization = float64(nodeMetric.Pods) / float64(nodeMetric.PodCapacity)
result = append(result, nodeMetric)
}
return result, nil
}
func analyzeAutoscaling(clientset *kubernetes.Clientset, podMetrics []PodMetrics, nodeMetrics []NodeMetrics) []AutoscalingRecommendation {
var recommendations []AutoscalingRecommendation
// Analyze pod resource requests
for _, pod := range podMetrics {
// Check for underprovisioned CPU requests
if pod.CPURatio > 1.5 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodCPURequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dm", pod.RequestCPU),
RecommendedValue: fmt.Sprintf("%dm", int64(float64(pod.UsageCPU)*1.2)),
Reason: fmt.Sprintf("CPU usage (%.2f%%) significantly exceeds request", pod.CPURatio*100),
Impact: "Pod may be throttled and cause node autoscaler thrashing",
Priority: 1,
})
}
// Check for overprovisioned CPU requests
if pod.CPURatio < 0.3 && pod.RequestCPU > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodCPURequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dm", pod.RequestCPU),
RecommendedValue: fmt.Sprintf("%dm", int64(float64(pod.UsageCPU)*1.5)),
Reason: fmt.Sprintf("CPU usage (%.2f%%) is much lower than request", pod.CPURatio*100),
Impact: "Wasted resources and unnecessary scaling",
Priority: 2,
})
}
// Check for underprovisioned memory requests
if pod.MemoryRatio > 1.2 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodMemoryRequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dMi", pod.RequestMemory),
RecommendedValue: fmt.Sprintf("%dMi", int64(float64(pod.UsageMemory)*1.2)),
Reason: fmt.Sprintf("Memory usage (%.2f%%) exceeds request", pod.MemoryRatio*100),
Impact: "Pod may be OOM killed and cause node autoscaler thrashing",
Priority: 1,
})
}
// Check for overprovisioned memory requests
if pod.MemoryRatio < 0.3 && pod.RequestMemory > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "PodMemoryRequest",
Name: pod.Name,
Namespace: pod.Namespace,
CurrentValue: fmt.Sprintf("%dMi", pod.RequestMemory),
RecommendedValue: fmt.Sprintf("%dMi", int64(float64(pod.UsageMemory)*1.5)),
Reason: fmt.Sprintf("Memory usage (%.2f%%) is much lower than request", pod.MemoryRatio*100),
Impact: "Wasted resources and unnecessary scaling",
Priority: 2,
})
}
}
// Analyze node utilization
var cpuUtilization, memoryUtilization, podUtilization float64
for _, node := range nodeMetrics {
cpuUtilization += node.CPUUtilization
memoryUtilization += node.MemoryUtilization
podUtilization += node.PodUtilization
}
nodeCount := float64(len(nodeMetrics))
if nodeCount > 0 {
cpuUtilization /= nodeCount
memoryUtilization /= nodeCount
podUtilization /= nodeCount
// Check for cluster-wide resource imbalance
if cpuUtilization < 0.3 && memoryUtilization > 0.7 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "Memory-bound scaling",
RecommendedValue: "Consider memory-optimized node types",
Reason: fmt.Sprintf("Cluster is memory-bound (CPU: %.2f%%, Memory: %.2f%%)", cpuUtilization*100, memoryUtilization*100),
Impact: "Inefficient resource usage and unnecessary scaling",
Priority: 3,
})
}
if memoryUtilization < 0.3 && cpuUtilization > 0.7 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "CPU-bound scaling",
RecommendedValue: "Consider compute-optimized node types",
Reason: fmt.Sprintf("Cluster is CPU-bound (CPU: %.2f%%, Memory: %.2f%%)", cpuUtilization*100, memoryUtilization*100),
Impact: "Inefficient resource usage and unnecessary scaling",
Priority: 3,
})
}
if podUtilization > 0.8 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscaler",
Name: "cluster-autoscaler",
Namespace: "kube-system",
CurrentValue: "Pod-bound scaling",
RecommendedValue: "Increase max pods per node",
Reason: fmt.Sprintf("Cluster is pod-bound (Pod utilization: %.2f%%)", podUtilization*100),
Impact: "Scaling based on pod count rather than resource usage",
Priority: 3,
})
}
}
// Get Cluster Autoscaler configuration
caConfigMap, err := clientset.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "cluster-autoscaler-config", metav1.GetOptions{})
if err == nil {
// Check for aggressive scale-down settings
if caConfig, ok := caConfigMap.Data["config.yaml"]; ok {
if scaleDownTime := getConfigValue(caConfig, "scaleDownUnneededTime"); scaleDownTime != "" {
if scaleDownTime == "5m" || scaleDownTime == "300s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scaleDownUnneededTime",
Namespace: "kube-system",
CurrentValue: scaleDownTime,
RecommendedValue: "15m",
Reason: "Scale-down delay is too short",
Impact: "Causes frequent node removal and pod disruption",
Priority: 1,
})
}
}
if scaleDownDelay := getConfigValue(caConfig, "scaleDownDelayAfterAdd"); scaleDownDelay != "" {
if scaleDownDelay == "5m" || scaleDownDelay == "300s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scaleDownDelayAfterAdd",
Namespace: "kube-system",
CurrentValue: scaleDownDelay,
RecommendedValue: "15m",
Reason: "Scale-down delay after add is too short",
Impact: "Causes thrashing when load fluctuates",
Priority: 1,
})
}
}
if scanInterval := getConfigValue(caConfig, "scanInterval"); scanInterval != "" {
if scanInterval == "10s" {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "ClusterAutoscalerConfig",
Name: "scanInterval",
Namespace: "kube-system",
CurrentValue: scanInterval,
RecommendedValue: "30s",
Reason: "Scan interval is too short",
Impact: "Causes frequent scaling decisions and API server load",
Priority: 2,
})
}
}
}
}
// Get HorizontalPodAutoscaler configurations
hpas, err := clientset.AutoscalingV2beta2().HorizontalPodAutoscalers("").List(context.TODO(), metav1.ListOptions{})
if err == nil {
for _, hpa := range hpas.Items {
// Check for aggressive scaling settings
if hpa.Spec.Behavior != nil {
if hpa.Spec.Behavior.ScaleDown != nil {
for _, policy := range hpa.Spec.Behavior.ScaleDown.Policies {
if policy.PeriodSeconds < 60 && policy.Type == "Percent" && policy.Value > 20 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "HorizontalPodAutoscaler",
Name: hpa.Name,
Namespace: hpa.Namespace,
CurrentValue: fmt.Sprintf("%d%% every %ds", policy.Value, policy.PeriodSeconds),
RecommendedValue: fmt.Sprintf("10%% every 60s"),
Reason: "Scale-down policy is too aggressive",
Impact: "Causes rapid pod scaling and potential thrashing",
Priority: 2,
})
}
}
}
if hpa.Spec.Behavior.ScaleUp != nil {
for _, policy := range hpa.Spec.Behavior.ScaleUp.Policies {
if policy.PeriodSeconds < 60 && policy.Type == "Percent" && policy.Value > 100 {
recommendations = append(recommendations, AutoscalingRecommendation{
ResourceType: "HorizontalPodAutoscaler",
Name: hpa.Name,
Namespace: hpa.Namespace,
CurrentValue: fmt.Sprintf("%d%% every %ds", policy.Value, policy.PeriodSeconds),
RecommendedValue: fmt.Sprintf("50%% every 60s"),
Reason: "Scale-up policy is too aggressive",
Impact: "Causes rapid pod scaling and potential thrashing",
Priority: 2,
})
}
}
}
}
}
}
// Sort recommendations by priority
sort.Slice(recommendations, func(i, j int) bool {
return recommendations[i].Priority < recommendations[j].Priority
})
return recommendations
}
func getConfigValue(config string, key string) string {
// Simple parser for YAML-like config
// In a real implementation, use a proper YAML parser
lines := strings.Split(config, "\n")
for _, line := range lines {
parts := strings.SplitN(line, ":", 2)
if len(parts) == 2 && strings.TrimSpace(parts[0]) == key {
return strings.TrimSpace(parts[1])
}
}
return ""
}
func outputRecommendations(recommendations []AutoscalingRecommendation) {
// Output recommendations as JSON
output, err := json.MarshalIndent(recommendations, "", " ")
if err != nil {
log.Fatalf("Error marshaling recommendations: %v", err)
}
fmt.Println(string(output))
// Write recommendations to file
err = ioutil.WriteFile("autoscaling_recommendations.json", output, 0644)
if err != nil {
log.Fatalf("Error writing recommendations to file: %v", err)
}
// Print summary
fmt.Printf("\nAutoscaling Analysis Summary:\n")
fmt.Printf("Found %d recommendations\n", len(recommendations))
priorityCount := make(map[int]int)
typeCount := make(map[string]int)
for _, rec := range recommendations {
priorityCount[rec.Priority]++
typeCount[rec.ResourceType]++
}
fmt.Printf("\nRecommendations by priority:\n")
for priority := 1; priority <= 3; priority++ {
fmt.Printf("Priority %d: %d recommendations\n", priority, priorityCount[priority])
}
fmt.Printf("\nRecommendations by type:\n")
for resourceType, count := range typeCount {
fmt.Printf("%s: %d recommendations\n", resourceType, count)
}
}
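The analyzer above can be run ad hoc; one way to invoke it, assuming it is saved as autoscaling_analyzer.go inside a Go module with client-go and k8s.io/metrics as dependencies:
KUBECONFIG="$HOME/.kube/config" go run ./autoscaling_analyzer.go
jq '[.[] | select(.priority == 1)]' autoscaling_recommendations.json   # highest-priority findings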
• Created a comprehensive autoscaling configuration guide:
# autoscaling_best_practices.yaml
---
cluster_autoscaler:
configuration:
# Delay before scaling down a node after it becomes unneeded
scaleDownUnneededTime: 15m
# Delay before scaling down after a node was added
scaleDownDelayAfterAdd: 15m
# Delay before scaling down after a node was removed
scaleDownDelayAfterDelete: 15m
# Delay before scaling down after a failure
scaleDownDelayAfterFailure: 3m
# How often CA checks for scale up/down
scanInterval: 30s
# Utilization threshold below which a node may be considered for scale down
scaleDownUtilizationThreshold: 0.5
# Maximum time to wait for node provisioning
maxNodeProvisionTime: 15m
    # Maximum total number of nodes in the cluster
    maxNodesTotal: 100
    # Maximum percentage of unready nodes tolerated before the autoscaler halts operations
    maxTotalUnreadyPercentage: 45
# Skip nodes with local storage
skipNodesWithLocalStorage: true
# Skip nodes with system pods
skipNodesWithSystemPods: true
# Ignore DaemonSet pods when calculating resource utilization
ignoreDaemonSetsUtilization: true
# Maximum time to wait for pod deletion during scale down
maxGracefulTerminationSec: 600
node_groups:
# Configure multiple node groups with different instance types
- name: "general-purpose"
min_size: 3
max_size: 20
instance_type: "m5.2xlarge"
- name: "cpu-optimized"
min_size: 0
max_size: 10
instance_type: "c5.2xlarge"
- name: "memory-optimized"
min_size: 0
max_size: 10
instance_type: "r5.2xlarge"
scaling_strategy:
# Scale up quickly, scale down slowly
scale_up_step_size: 2
scale_down_step_size: 1
# Use priority expander to prefer cheaper node groups
expander: "priority"
expander_priorities:
- group_name: "general-purpose"
priority: 100
- group_name: "spot-instances"
priority: 50
- group_name: "cpu-optimized"
priority: 25
- group_name: "memory-optimized"
priority: 25
horizontal_pod_autoscaler:
configuration:
# Default HPA behavior
default_behavior:
scale_up:
stabilization_window_seconds: 300
select_policy: "Max"
policies:
- type: "Percent"
value: 50
period_seconds: 60
- type: "Pods"
value: 4
period_seconds: 60
scale_down:
stabilization_window_seconds: 300
select_policy: "Min"
policies:
- type: "Percent"
value: 10
period_seconds: 60
metrics:
# Use multiple metrics for more stable scaling
- type: "Resource"
resource:
name: "cpu"
target:
type: "Utilization"
averageUtilization: 70
- type: "Resource"
resource:
name: "memory"
target:
type: "Utilization"
averageUtilization: 70
# Consider custom metrics for application-specific scaling
- type: "Pods"
pods:
metric:
name: "requests_per_second"
target:
type: "AverageValue"
averageValue: 1000
pod_disruption_budgets:
# Configure PDBs for all critical services
critical_services:
min_available: "50%"
stateful_services:
min_available: 1
batch_jobs:
max_unavailable: "50%"
resource_requests:
# Guidelines for setting resource requests
cpu:
# Set requests based on observed P90 usage
buffer_factor: 1.5
min_request: "50m"
memory:
# Set requests based on observed P95 usage
buffer_factor: 1.2
min_request: "64Mi"
# Regularly review and adjust based on actual usage
review_period: "2 weeks"
monitoring:
# Metrics to monitor for autoscaling health
metrics:
- name: "cluster_autoscaler_activity"
description: "Count of scale up/down activities"
alert_threshold: "10 per hour"
- name: "node_creation_time"
description: "Time taken to create and ready a new node"
alert_threshold: "> 5 minutes"
- name: "pod_disruption_events"
description: "Count of pods disrupted due to node scaling"
alert_threshold: "> 20 per hour"
- name: "resource_request_vs_usage_ratio"
description: "Ratio of requested resources to actual usage"
alert_threshold: "< 0.3 or > 1.5"
dashboards:
- name: "Autoscaling Overview"
panels:
- title: "Cluster Autoscaler Activity"
metrics: ["cluster_autoscaler_scale_up_count", "cluster_autoscaler_scale_down_count"]
- title: "Node Count by Type"
metrics: ["kube_node_labels"]
- title: "Pod Disruptions"
metrics: ["kube_pod_deleted", "kube_pod_container_status_waiting"]
- title: "Resource Utilization vs Requests"
metrics: ["container_cpu_usage_seconds_total", "kube_pod_container_resource_requests"]
Lessons Learned:
Effective autoscaling requires careful tuning of multiple components working together.
How to Avoid:
Configure appropriate scale-down delays to prevent thrashing.
Set accurate resource requests based on actual usage.
Implement Pod Disruption Budgets for all critical services.
Use multiple node groups with different instance types.
Monitor autoscaling behavior and adjust configurations regularly.
No summary provided
What Happened:
A company upgraded their Kubernetes cluster from version 1.23 to 1.24. After the upgrade, their distributed database running as a StatefulSet experienced severe issues. Pods were unable to form a proper cluster, and some nodes reported data corruption. The application logs showed that pods were unable to recognize their peers, despite the StatefulSet appearing to be running normally according to Kubernetes.
Diagnosis Steps:
Analyzed application logs for error patterns.
Compared pod hostnames and network identities before and after the upgrade.
Reviewed StatefulSet configuration and PersistentVolumeClaim bindings.
Examined Kubernetes events and controller logs.
Tested pod-to-pod communication using network utilities.
Root Cause:
The investigation revealed multiple issues: 1. The Kubernetes upgrade included changes to the StatefulSet controller's pod identity management 2. Custom pod management policy settings in the StatefulSet were incompatible with the new version 3. The headless service DNS records were not properly updated after the upgrade 4. PersistentVolumeClaims were correctly bound, but pods couldn't resolve their peers by hostname 5. A custom admission controller was modifying pod hostnames in a way that conflicted with StatefulSet naming
Fix/Workaround:
• Short-term: Implemented a temporary fix by manually correcting DNS entries and restarting pods in the correct order
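A minimal sketch of the ordered restart, assuming the db-cluster StatefulSet and production namespace used in the manifests below; pods are deleted from the highest ordinal down, waiting for each replacement to become Ready before moving on:
#!/bin/bash
# Restart StatefulSet pods one at a time, highest ordinal first
set -euo pipefail
NAMESPACE=production
STS=db-cluster
REPLICAS=$(kubectl get statefulset "$STS" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
for ((i=REPLICAS-1; i>=0; i--)); do
POD="$STS-$i"
echo "Restarting $POD..."
kubectl delete pod "$POD" -n "$NAMESPACE"
# Give the controller a moment to recreate the pod before waiting on it
sleep 10
kubectl wait --for=condition=Ready "pod/$POD" -n "$NAMESPACE" --timeout=300s
done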
• Updated the StatefulSet configuration to be compatible with Kubernetes 1.24:
# Before: Problematic StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db-cluster
namespace: production
spec:
serviceName: db-cluster-headless
replicas: 5
podManagementPolicy: Parallel
updateStrategy:
type: OnDelete
selector:
matchLabels:
app: distributed-db
template:
metadata:
labels:
app: distributed-db
spec:
terminationGracePeriodSeconds: 10
hostname: db-node
subdomain: db-cluster-domain
containers:
- name: db
image: distributed-db:v1.2.3
ports:
- containerPort: 9042
name: client
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
env:
- name: CLUSTER_NAME
value: "db-cluster"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: data
mountPath: /var/lib/db
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "fast-ssd"
resources:
requests:
storage: 100Gi
# After: Fixed StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db-cluster
namespace: production
spec:
serviceName: db-cluster-headless
replicas: 5
podManagementPolicy: OrderedReady # Changed from Parallel
updateStrategy:
type: RollingUpdate # Changed from OnDelete
rollingUpdate:
partition: 0
selector:
matchLabels:
app: distributed-db
template:
metadata:
labels:
app: distributed-db
spec:
terminationGracePeriodSeconds: 30 # Increased grace period
# Removed custom hostname and subdomain settings
containers:
- name: db
image: distributed-db:v1.2.3
ports:
- containerPort: 9042
name: client
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
env:
- name: CLUSTER_NAME
value: "db-cluster"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# Added readiness probe
readinessProbe:
exec:
command:
- /bin/sh
- -c
- nodetool status | grep -E "^UN\\s+${POD_IP}"
initialDelaySeconds: 15
timeoutSeconds: 5
# Added lifecycle hooks
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- nodetool drain
volumeMounts:
- name: data
mountPath: /var/lib/db
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "fast-ssd"
resources:
requests:
storage: 100Gi
• Fixed the headless service configuration:
# Before: Problematic headless service
apiVersion: v1
kind: Service
metadata:
name: db-cluster-headless
namespace: production
labels:
app: distributed-db
spec:
ports:
- port: 9042
name: client
- port: 7000
name: intra-node
- port: 7001
name: tls-intra-node
clusterIP: None
selector:
app: distributed-db
publishNotReadyAddresses: true
# After: Fixed headless service
apiVersion: v1
kind: Service
metadata:
name: db-cluster-headless
namespace: production
labels:
app: distributed-db
annotations:
service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
ports:
- port: 9042
name: client
- port: 7000
name: intra-node
- port: 7001
name: tls-intra-node
clusterIP: None
selector:
app: distributed-db
publishNotReadyAddresses: false # Changed to false
• Implemented a custom StatefulSet controller in Go to handle the migration:
// statefulset_migration.go
package main
import (
"context"
"flag"
"fmt"
"os"
"os/signal"
"sort"
"strconv"
"strings"
"syscall"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog/v2"
)
var (
kubeconfig string
namespace string
statefulSetName string
migrationMode string
gracePeriodSec int
waitTimeSec int
skipConfirmation bool
preserveNodeOrder bool
)
func init() {
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.StringVar(&namespace, "namespace", "", "Namespace of the StatefulSet")
flag.StringVar(&statefulSetName, "statefulset", "", "Name of the StatefulSet")
flag.StringVar(&migrationMode, "mode", "validate", "Migration mode: validate, prepare, migrate, rollback")
flag.IntVar(&gracePeriodSec, "grace-period", 30, "Grace period in seconds for pod termination")
flag.IntVar(&waitTimeSec, "wait-time", 60, "Wait time in seconds between pod operations")
flag.BoolVar(&skipConfirmation, "yes", false, "Skip confirmation prompts")
flag.BoolVar(&preserveNodeOrder, "preserve-order", true, "Preserve node order during migration")
flag.Parse()
}
func main() {
if namespace == "" || statefulSetName == "" {
klog.Fatal("Namespace and StatefulSet name are required")
}
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
klog.Info("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
klog.Infof("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
klog.Fatalf("Error building kubeconfig: %s", err.Error())
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
klog.Fatalf("Error creating Kubernetes client: %s", err.Error())
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Handle termination signals
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
go func() {
<-signalChan
klog.Info("Received termination signal, shutting down gracefully...")
cancel()
os.Exit(0)
}()
// Execute the selected mode
switch migrationMode {
case "validate":
validateStatefulSet(ctx, clientset)
case "prepare":
prepareStatefulSetMigration(ctx, clientset)
case "migrate":
migrateStatefulSet(ctx, clientset)
case "rollback":
rollbackStatefulSetMigration(ctx, clientset)
default:
klog.Fatalf("Unknown migration mode: %s", migrationMode)
}
}
func validateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Validating StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Validate StatefulSet configuration
issues := []string{}
// Check podManagementPolicy
if sts.Spec.PodManagementPolicy == appsv1.ParallelPodManagement {
issues = append(issues, "PodManagementPolicy is set to Parallel, which may cause issues after upgrade. Consider changing to OrderedReady.")
}
// Check updateStrategy
if sts.Spec.UpdateStrategy.Type == appsv1.OnDeleteStatefulSetStrategyType {
issues = append(issues, "UpdateStrategy is set to OnDelete, which may cause issues after upgrade. Consider changing to RollingUpdate.")
}
// Check for custom hostname and subdomain
if sts.Spec.Template.Spec.Hostname != "" || sts.Spec.Template.Spec.Subdomain != "" {
issues = append(issues, "Custom hostname or subdomain is set, which may conflict with StatefulSet naming in newer Kubernetes versions.")
}
// Check for readiness probes
for _, container := range sts.Spec.Template.Spec.Containers {
if container.ReadinessProbe == nil {
issues = append(issues, fmt.Sprintf("Container %s does not have a readiness probe, which may cause issues with pod identity.", container.Name))
}
}
// Check for lifecycle hooks
for _, container := range sts.Spec.Template.Spec.Containers {
if container.Lifecycle == nil || container.Lifecycle.PreStop == nil {
issues = append(issues, fmt.Sprintf("Container %s does not have a preStop lifecycle hook, which may cause issues with graceful termination.", container.Name))
}
}
// Check termination grace period
if sts.Spec.Template.Spec.TerminationGracePeriodSeconds != nil && *sts.Spec.Template.Spec.TerminationGracePeriodSeconds < 30 {
issues = append(issues, "TerminationGracePeriodSeconds is less than 30 seconds, which may not be enough for graceful termination.")
}
// Check headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
issues = append(issues, fmt.Sprintf("Error getting headless service %s: %s", sts.Spec.ServiceName, err.Error()))
} else {
if svc.Spec.ClusterIP != "None" {
issues = append(issues, "Service is not headless (ClusterIP is not None).")
}
if svc.Spec.PublishNotReadyAddresses {
issues = append(issues, "PublishNotReadyAddresses is set to true, which may cause issues with pod identity in newer Kubernetes versions.")
}
}
// Check PVCs
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
issues = append(issues, fmt.Sprintf("Error listing PVCs: %s", err.Error()))
} else {
klog.Infof("Found %d PVCs for StatefulSet %s", len(pvcs.Items), statefulSetName)
for _, pvc := range pvcs.Items {
if pvc.Status.Phase != corev1.ClaimBound {
issues = append(issues, fmt.Sprintf("PVC %s is not bound (status: %s)", pvc.Name, pvc.Status.Phase))
}
}
}
// Check pods
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
issues = append(issues, fmt.Sprintf("Error listing pods: %s", err.Error()))
} else {
klog.Infof("Found %d pods for StatefulSet %s", len(pods.Items), statefulSetName)
for _, pod := range pods.Items {
if !strings.HasPrefix(pod.Name, statefulSetName+"-") {
issues = append(issues, fmt.Sprintf("Pod %s does not follow StatefulSet naming convention", pod.Name))
}
}
}
// Print validation results
if len(issues) == 0 {
klog.Info("No issues found with StatefulSet configuration.")
} else {
klog.Info("Found the following issues with StatefulSet configuration:")
for i, issue := range issues {
klog.Infof("%d. %s", i+1, issue)
}
}
}
func prepareStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Preparing StatefulSet %s in namespace %s for migration", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Create a backup of the StatefulSet
backupName := fmt.Sprintf("%s-backup-%d", statefulSetName, time.Now().Unix())
backupSts := sts.DeepCopy()
backupSts.ObjectMeta.Name = backupName
backupSts.ObjectMeta.ResourceVersion = ""
backupSts.ObjectMeta.UID = ""
backupSts.ObjectMeta.CreationTimestamp = metav1.Time{}
backupSts.Status = appsv1.StatefulSetStatus{}
// Add backup annotation to the original StatefulSet
if sts.Annotations == nil {
sts.Annotations = make(map[string]string)
}
sts.Annotations["statefulset-migration.kubernetes.io/backup"] = backupName
// Create the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, backupSts, metav1.CreateOptions{})
if err != nil {
klog.Fatalf("Error creating backup StatefulSet: %s", err.Error())
}
klog.Infof("Created backup StatefulSet %s", backupName)
// Update the original StatefulSet with the backup annotation
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, sts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet with backup annotation: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s with backup annotation", statefulSetName)
// Create a ConfigMap with the migration plan
migrationPlan := map[string]string{
"backup": backupName,
"statefulset": statefulSetName,
"namespace": namespace,
"created_at": time.Now().Format(time.RFC3339),
"created_by": "statefulset-migration-tool",
"instructions": "This ConfigMap contains the migration plan for StatefulSet " + statefulSetName + " in namespace " + namespace + ". Do not delete this ConfigMap until the migration is complete.",
}
migrationPlanCM := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-migration-plan", statefulSetName),
Namespace: namespace,
Labels: map[string]string{
"app.kubernetes.io/managed-by": "statefulset-migration-tool",
"app.kubernetes.io/name": statefulSetName,
},
},
Data: migrationPlan,
}
_, err = clientset.CoreV1().ConfigMaps(namespace).Create(ctx, migrationPlanCM, metav1.CreateOptions{})
if err != nil {
klog.Fatalf("Error creating migration plan ConfigMap: %s", err.Error())
}
klog.Infof("Created migration plan ConfigMap %s", migrationPlanCM.Name)
klog.Info("StatefulSet migration preparation complete.")
}
func migrateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Migrating StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Check if the StatefulSet has a backup annotation
backupName, ok := sts.Annotations["statefulset-migration.kubernetes.io/backup"]
if !ok {
klog.Fatal("StatefulSet does not have a backup annotation. Run prepare mode first.")
}
// Get the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting backup StatefulSet: %s", err.Error())
}
// Get the pods
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
// Sort pods by ordinal index
sort.Slice(pods.Items, func(i, j int) bool {
ordinalI := getOrdinalFromPodName(pods.Items[i].Name, statefulSetName)
ordinalJ := getOrdinalFromPodName(pods.Items[j].Name, statefulSetName)
return ordinalI < ordinalJ
})
// Confirm migration
if !skipConfirmation {
klog.Infof("Will migrate %d pods of StatefulSet %s in namespace %s", len(pods.Items), statefulSetName, namespace)
klog.Info("This will cause temporary disruption to the service.")
klog.Info("Press Ctrl+C to cancel or Enter to continue...")
fmt.Scanln()
}
// Update StatefulSet configuration
newSts := sts.DeepCopy()
// Update podManagementPolicy
newSts.Spec.PodManagementPolicy = appsv1.OrderedReadyPodManagement
// Update updateStrategy
newSts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
Type: appsv1.RollingUpdateStatefulSetStrategyType,
RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
Partition: new(int32),
},
}
// Remove custom hostname and subdomain
newSts.Spec.Template.Spec.Hostname = ""
newSts.Spec.Template.Spec.Subdomain = ""
// Set termination grace period
gracePeriod := int64(gracePeriodSec)
newSts.Spec.Template.Spec.TerminationGracePeriodSeconds = &gracePeriod
// Add migration in progress annotation
if newSts.Annotations == nil {
newSts.Annotations = make(map[string]string)
}
newSts.Annotations["statefulset-migration.kubernetes.io/in-progress"] = "true"
// Update the StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, newSts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s configuration", statefulSetName)
// Scale down the StatefulSet to 0 replicas
klog.Info("Scaling down StatefulSet to 0 replicas")
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
originalReplicas := scale.Spec.Replicas
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling down StatefulSet: %s", err.Error())
}
// Wait for all pods to terminate
klog.Info("Waiting for all pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
if len(pods.Items) == 0 {
break
}
klog.Infof("%d pods still running, waiting...", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Update the headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting headless service: %s", err.Error())
}
newSvc := svc.DeepCopy()
newSvc.Spec.PublishNotReadyAddresses = false
// Add migration annotation
if newSvc.Annotations == nil {
newSvc.Annotations = make(map[string]string)
}
newSvc.Annotations["service.alpha.kubernetes.io/tolerate-unready-endpoints"] = "true"
_, err = clientset.CoreV1().Services(namespace).Update(ctx, newSvc, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating headless service: %s", err.Error())
}
klog.Infof("Updated headless service %s", svc.Name)
// Scale up the StatefulSet to the original number of replicas
klog.Infof("Scaling up StatefulSet to %d replicas", originalReplicas)
// Re-fetch the scale subresource so the update does not fail on a stale resourceVersion
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling up StatefulSet: %s", err.Error())
}
// Wait for all pods to be ready
klog.Info("Waiting for all pods to be ready...")
for {
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
if sts.Status.ReadyReplicas == sts.Status.Replicas {
break
}
klog.Infof("%d/%d pods ready, waiting...", sts.Status.ReadyReplicas, sts.Status.Replicas)
time.Sleep(5 * time.Second)
}
// Remove migration in progress annotation
sts, err = clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
delete(sts.Annotations, "statefulset-migration.kubernetes.io/in-progress")
sts.Annotations["statefulset-migration.kubernetes.io/completed"] = time.Now().Format(time.RFC3339)
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, sts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Info("StatefulSet migration completed successfully.")
}
func rollbackStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset) {
klog.Infof("Rolling back StatefulSet %s in namespace %s", statefulSetName, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
// Check if the StatefulSet has a backup annotation
backupName, ok := sts.Annotations["statefulset-migration.kubernetes.io/backup"]
if !ok {
klog.Fatal("StatefulSet does not have a backup annotation. Cannot rollback.")
}
// Get the backup StatefulSet
backupSts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting backup StatefulSet: %s", err.Error())
}
// Confirm rollback
if !skipConfirmation {
klog.Infof("Will roll back StatefulSet %s in namespace %s to backup %s", statefulSetName, namespace, backupName)
klog.Info("This will cause temporary disruption to the service.")
klog.Info("Press Ctrl+C to cancel or Enter to continue...")
fmt.Scanln()
}
// Scale down the StatefulSet to 0 replicas
klog.Info("Scaling down StatefulSet to 0 replicas")
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
originalReplicas := scale.Spec.Replicas
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling down StatefulSet: %s", err.Error())
}
// Wait for all pods to terminate
klog.Info("Waiting for all pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(sts.Spec.Selector.MatchLabels).String(),
})
if err != nil {
klog.Fatalf("Error listing pods: %s", err.Error())
}
if len(pods.Items) == 0 {
break
}
klog.Infof("%d pods still running, waiting...", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Update the StatefulSet with the backup configuration
newSts := sts.DeepCopy()
newSts.Spec = backupSts.Spec
// Add rollback annotation
if newSts.Annotations == nil {
newSts.Annotations = make(map[string]string)
}
newSts.Annotations["statefulset-migration.kubernetes.io/rollback"] = time.Now().Format(time.RFC3339)
delete(newSts.Annotations, "statefulset-migration.kubernetes.io/in-progress")
delete(newSts.Annotations, "statefulset-migration.kubernetes.io/completed")
_, err = clientset.AppsV1().StatefulSets(namespace).Update(ctx, newSts, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating StatefulSet: %s", err.Error())
}
klog.Infof("Updated StatefulSet %s configuration to backup", statefulSetName)
// Restore the headless service
svc, err := clientset.CoreV1().Services(namespace).Get(ctx, sts.Spec.ServiceName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting headless service: %s", err.Error())
}
newSvc := svc.DeepCopy()
newSvc.Spec.PublishNotReadyAddresses = true
// Remove migration annotation
if newSvc.Annotations != nil {
delete(newSvc.Annotations, "service.alpha.kubernetes.io/tolerate-unready-endpoints")
}
_, err = clientset.CoreV1().Services(namespace).Update(ctx, newSvc, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error updating headless service: %s", err.Error())
}
klog.Infof("Updated headless service %s", svc.Name)
// Scale up the StatefulSet to the original number of replicas
klog.Infof("Scaling up StatefulSet to %d replicas", originalReplicas)
// Re-fetch the scale subresource so the update does not fail on a stale resourceVersion
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet scale: %s", err.Error())
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, statefulSetName, scale, metav1.UpdateOptions{})
if err != nil {
klog.Fatalf("Error scaling up StatefulSet: %s", err.Error())
}
// Wait for all pods to be ready
klog.Info("Waiting for all pods to be ready...")
for {
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, statefulSetName, metav1.GetOptions{})
if err != nil {
klog.Fatalf("Error getting StatefulSet: %s", err.Error())
}
if sts.Status.ReadyReplicas == sts.Status.Replicas {
break
}
klog.Infof("%d/%d pods ready, waiting...", sts.Status.ReadyReplicas, sts.Status.Replicas)
time.Sleep(5 * time.Second)
}
klog.Info("StatefulSet rollback completed successfully.")
}
// getOrdinalFromPodName extracts the ordinal index from a pod name
func getOrdinalFromPodName(podName, statefulSetName string) int {
if !strings.HasPrefix(podName, statefulSetName+"-") {
return -1
}
ordinalStr := strings.TrimPrefix(podName, statefulSetName+"-")
ordinal, err := strconv.Atoi(ordinalStr)
if err != nil {
return -1
}
return ordinal
}
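A possible invocation sequence for the tool above, using the flags defined in its init function (binary name and kubeconfig path are examples); run validate first and only migrate once no issues are reported:
go build -o statefulset-migration statefulset_migration.go
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=validate
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=prepare
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=migrate --grace-period=30 --wait-time=60
# If the migration misbehaves, restore the previous configuration
./statefulset-migration --kubeconfig="$HOME/.kube/config" --namespace=production --statefulset=db-cluster --mode=rollback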
• Long-term: Implemented a comprehensive StatefulSet management strategy:
- Created a StatefulSet upgrade validation tool
- Implemented automated pre-upgrade testing
- Added monitoring for StatefulSet identity issues
- Documented best practices for StatefulSet configuration
- Created a StatefulSet migration playbook
Lessons Learned:
StatefulSet pod identity management is critical for distributed systems and requires careful handling during Kubernetes upgrades.
How to Avoid:
Test StatefulSet configurations in a staging environment before upgrading production.
Follow Kubernetes version upgrade notes carefully for StatefulSet changes.
Use the default StatefulSet naming convention without custom hostnames.
Implement proper readiness probes and lifecycle hooks.
Monitor DNS resolution between StatefulSet pods after upgrades (a check is sketched below).
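A minimal sketch of such a DNS check, run via short-lived busybox pods; the names follow the db-cluster example above and the image tag is an example:
# Resolve every pod's stable DNS name through the headless service
NAMESPACE=production
STS=db-cluster
SVC=db-cluster-headless
REPLICAS=$(kubectl get statefulset "$STS" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
for ((i=0; i<REPLICAS; i++)); do
FQDN="$STS-$i.$SVC.$NAMESPACE.svc.cluster.local"
kubectl run "dns-check-$i" --rm -i --restart=Never -n "$NAMESPACE" \
--image=busybox:1.36 -- nslookup "$FQDN" || echo "FAILED to resolve $FQDN"
done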
No summary provided
What Happened:
A company was upgrading their Kubernetes cluster from version 1.23 to 1.25. During the upgrade process, several StatefulSet-managed applications experienced unexpected pod restarts. After the upgrade, users reported data inconsistencies and some services failed to start properly. Investigation revealed that some StatefulSet pods had been assigned incorrect persistent volumes, causing data corruption and service disruption.
Diagnosis Steps:
Analyzed pod events and logs before and after the upgrade.
Examined StatefulSet configurations and PersistentVolumeClaim bindings.
Reviewed the upgrade process and component versions.
Tested pod identity preservation in a staging environment.
Consulted Kubernetes release notes for known issues with StatefulSets.
Root Cause:
The investigation revealed multiple issues: 1. The upgrade process did not properly preserve StatefulSet pod identities 2. A custom admission controller was modifying pod names during recreation 3. The storage class used for PVCs had reclaim policy set to "Delete" instead of "Retain" 4. Some PVCs were using volumeMode: Filesystem with block storage, causing mounting issues 5. The StatefulSet controller had a race condition when recreating pods during the upgrade
Fix/Workaround:
• Short-term: Implemented a controlled StatefulSet migration process:
// statefulset_migration.go
package main
import (
"context"
"flag"
"fmt"
"os"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
kubeconfig := flag.String("kubeconfig", "", "Path to kubeconfig file")
namespace := flag.String("namespace", "", "Namespace of the StatefulSet")
name := flag.String("name", "", "Name of the StatefulSet")
action := flag.String("action", "validate", "Action to perform: validate, prepare, migrate, rollback")
flag.Parse()
if *namespace == "" || *name == "" {
fmt.Println("Error: namespace and name are required")
os.Exit(1)
}
// Load kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
if err != nil {
fmt.Printf("Error building kubeconfig: %v\n", err)
os.Exit(1)
}
// Create clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
fmt.Printf("Error creating Kubernetes client: %v\n", err)
os.Exit(1)
}
ctx := context.Background()
// Perform the requested action
switch *action {
case "validate":
validateStatefulSet(ctx, clientset, *namespace, *name)
case "prepare":
prepareStatefulSetMigration(ctx, clientset, *namespace, *name)
case "migrate":
migrateStatefulSet(ctx, clientset, *namespace, *name)
case "rollback":
rollbackStatefulSetMigration(ctx, clientset, *namespace, *name)
default:
fmt.Printf("Unknown action: %s\n", *action)
os.Exit(1)
}
}
func validateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Validating StatefulSet %s in namespace %s...\n", name, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Check if StatefulSet is stable
if sts.Status.Replicas != sts.Status.ReadyReplicas {
fmt.Printf("Warning: StatefulSet is not stable. Replicas: %d, ReadyReplicas: %d\n",
sts.Status.Replicas, sts.Status.ReadyReplicas)
} else {
fmt.Printf("StatefulSet is stable with %d replicas\n", sts.Status.Replicas)
}
// Get PVCs associated with the StatefulSet
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing PVCs: %v\n", err)
os.Exit(1)
}
fmt.Printf("Found %d PVCs associated with the StatefulSet\n", len(pvcs.Items))
// Check PVC bindings
for _, pvc := range pvcs.Items {
if pvc.Status.Phase != "Bound" {
fmt.Printf("Warning: PVC %s is not bound (status: %s)\n", pvc.Name, pvc.Status.Phase)
} else {
fmt.Printf("PVC %s is bound to PV %s\n", pvc.Name, pvc.Spec.VolumeName)
}
// Check volumeMode
if pvc.Spec.VolumeMode != nil && *pvc.Spec.VolumeMode == "Block" {
fmt.Printf("Warning: PVC %s uses Block volumeMode which requires special handling\n", pvc.Name)
}
}
// Get pods associated with the StatefulSet
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
fmt.Printf("Found %d pods associated with the StatefulSet\n", len(pods.Items))
// Check pod status and volume mounts
for _, pod := range pods.Items {
fmt.Printf("Pod %s status: %s\n", pod.Name, pod.Status.Phase)
// Check if pod name follows StatefulSet naming convention
if !isValidStatefulSetPodName(pod.Name, name) {
fmt.Printf("Warning: Pod %s does not follow StatefulSet naming convention\n", pod.Name)
}
// Check volume mounts
for _, volume := range pod.Spec.Volumes {
if volume.PersistentVolumeClaim != nil {
fmt.Printf("Pod %s uses PVC %s\n", pod.Name, volume.PersistentVolumeClaim.ClaimName)
// Verify PVC name matches pod ordinal
expectedPVCName := fmt.Sprintf("data-%s-%s", name, getPodOrdinal(pod.Name))
if volume.PersistentVolumeClaim.ClaimName != expectedPVCName {
fmt.Printf("Warning: Pod %s uses PVC %s but expected %s\n",
pod.Name, volume.PersistentVolumeClaim.ClaimName, expectedPVCName)
}
}
}
}
fmt.Println("Validation complete")
}
func prepareStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Preparing StatefulSet %s in namespace %s for migration...\n", name, namespace)
// Get the StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Create a backup of the StatefulSet
backupName := fmt.Sprintf("%s-backup-%d", name, time.Now().Unix())
sts.ObjectMeta.Name = backupName
sts.ObjectMeta.ResourceVersion = ""
sts.ObjectMeta.UID = ""
sts.ObjectMeta.CreationTimestamp = metav1.Time{}
sts.Status = appsv1.StatefulSetStatus{}
// Add backup annotation
if sts.ObjectMeta.Annotations == nil {
sts.ObjectMeta.Annotations = make(map[string]string)
}
sts.ObjectMeta.Annotations["migration-backup-of"] = name
sts.ObjectMeta.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
// Set replicas to 0 to avoid creating pods
replicas := int32(0)
sts.Spec.Replicas = &replicas
// Create the backup StatefulSet
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, sts, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating backup StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Printf("Created backup StatefulSet %s\n", backupName)
// Get PVCs associated with the StatefulSet
pvcs, err := clientset.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing PVCs: %v\n", err)
os.Exit(1)
}
// Create snapshots or backups of PVCs if supported
fmt.Printf("Found %d PVCs to prepare for migration\n", len(pvcs.Items))
for _, pvc := range pvcs.Items {
// Add migration annotation
pvcCopy := pvc.DeepCopy()
if pvcCopy.Annotations == nil {
pvcCopy.Annotations = make(map[string]string)
}
pvcCopy.Annotations["migration-prepared"] = "true"
pvcCopy.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
_, err = clientset.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvcCopy, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Warning: Failed to update PVC %s annotations: %v\n", pvc.Name, err)
} else {
fmt.Printf("Updated PVC %s annotations for migration\n", pvc.Name)
}
}
fmt.Println("Preparation complete. Ready for migration.")
}
func migrateStatefulSet(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Migrating StatefulSet %s in namespace %s...\n", name, namespace)
// Scale down the StatefulSet
fmt.Printf("Scaling down StatefulSet %s to 0 replicas...\n", name)
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
// Save original replicas for later
originalReplicas := scale.Spec.Replicas
// Set replicas to 0
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling down StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for pods to terminate
fmt.Println("Waiting for pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
if len(pods.Items) == 0 {
fmt.Println("All pods terminated")
break
}
fmt.Printf("%d pods still running, waiting...\n", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Get the current StatefulSet
sts, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet: %v\n", err)
os.Exit(1)
}
// Create a new StatefulSet with migration fixes
newSTS := sts.DeepCopy()
newSTS.ObjectMeta.ResourceVersion = ""
newSTS.ObjectMeta.UID = ""
newSTS.ObjectMeta.CreationTimestamp = metav1.Time{}
newSTS.Status = appsv1.StatefulSetStatus{}
// Add migration annotation
if newSTS.ObjectMeta.Annotations == nil {
newSTS.ObjectMeta.Annotations = make(map[string]string)
}
newSTS.ObjectMeta.Annotations["migration-timestamp"] = fmt.Sprintf("%d", time.Now().Unix())
// Set replicas to 0 initially
replicas := int32(0)
newSTS.Spec.Replicas = &replicas
// Fix volumeClaimTemplates if needed
for i := range newSTS.Spec.VolumeClaimTemplates {
// Ensure volumeMode is set correctly
if newSTS.Spec.VolumeClaimTemplates[i].Spec.VolumeMode == nil {
filesystemMode := "Filesystem"
newSTS.Spec.VolumeClaimTemplates[i].Spec.VolumeMode = &filesystemMode
}
}
// Delete the old StatefulSet
fmt.Printf("Deleting StatefulSet %s...\n", name)
err = clientset.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil {
fmt.Printf("Error deleting StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for StatefulSet to be deleted
fmt.Println("Waiting for StatefulSet to be deleted...")
for {
_, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Println("StatefulSet deleted")
break
}
time.Sleep(2 * time.Second)
}
// Create the new StatefulSet
fmt.Printf("Creating new StatefulSet %s...\n", name)
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, newSTS, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating new StatefulSet: %v\n", err)
os.Exit(1)
}
// Scale up the StatefulSet to original replicas
fmt.Printf("Scaling up StatefulSet %s to %d replicas...\n", name, originalReplicas)
scale, err = clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
scale.Spec.Replicas = originalReplicas
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling up StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Println("Migration complete. Monitoring pod creation...")
// Monitor pod creation
for i := 0; i < int(originalReplicas); i++ {
podName := fmt.Sprintf("%s-%d", name, i)
fmt.Printf("Waiting for pod %s to be ready...\n", podName)
for {
pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
fmt.Printf("Pod %s not found yet, waiting...\n", podName)
time.Sleep(5 * time.Second)
continue
}
if isPodReady(pod) {
fmt.Printf("Pod %s is ready\n", podName)
break
}
fmt.Printf("Pod %s is not ready yet (status: %s), waiting...\n", podName, pod.Status.Phase)
time.Sleep(5 * time.Second)
}
}
fmt.Println("All pods are ready. Migration successful.")
}
func rollbackStatefulSetMigration(ctx context.Context, clientset *kubernetes.Clientset, namespace, name string) {
fmt.Printf("Rolling back StatefulSet %s migration in namespace %s...\n", name, namespace)
// Find the backup StatefulSet
statefulSets, err := clientset.AppsV1().StatefulSets(namespace).List(ctx, metav1.ListOptions{})
if err != nil {
fmt.Printf("Error listing StatefulSets: %v\n", err)
os.Exit(1)
}
var backupSTS *metav1.ObjectMeta
for _, sts := range statefulSets.Items {
if sts.Annotations != nil && sts.Annotations["migration-backup-of"] == name {
stsCopy := sts.ObjectMeta
backupSTS = &stsCopy
break
}
}
if backupSTS == nil {
fmt.Printf("Error: No backup StatefulSet found for %s\n", name)
os.Exit(1)
}
fmt.Printf("Found backup StatefulSet %s\n", backupSTS.Name)
// Scale down the current StatefulSet
fmt.Printf("Scaling down StatefulSet %s to 0 replicas...\n", name)
scale, err := clientset.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting StatefulSet scale: %v\n", err)
os.Exit(1)
}
// Save original replicas for later
originalReplicas := scale.Spec.Replicas
// Set replicas to 0
scale.Spec.Replicas = 0
_, err = clientset.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
if err != nil {
fmt.Printf("Error scaling down StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for pods to terminate
fmt.Println("Waiting for pods to terminate...")
for {
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
LabelSelector: fmt.Sprintf("app=%s", name),
})
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
os.Exit(1)
}
if len(pods.Items) == 0 {
fmt.Println("All pods terminated")
break
}
fmt.Printf("%d pods still running, waiting...\n", len(pods.Items))
time.Sleep(5 * time.Second)
}
// Get the backup StatefulSet
backupStatefulSet, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, backupSTS.Name, metav1.GetOptions{})
if err != nil {
fmt.Printf("Error getting backup StatefulSet: %v\n", err)
os.Exit(1)
}
// Prepare the backup for restoration
backupStatefulSet.ObjectMeta.Name = name
backupStatefulSet.ObjectMeta.ResourceVersion = ""
backupStatefulSet.ObjectMeta.UID = ""
backupStatefulSet.ObjectMeta.CreationTimestamp = metav1.Time{}
backupStatefulSet.Status = appsv1.StatefulSetStatus{}
// Remove backup annotations
delete(backupStatefulSet.ObjectMeta.Annotations, "migration-backup-of")
delete(backupStatefulSet.ObjectMeta.Annotations, "migration-timestamp")
// Set replicas to original count
backupStatefulSet.Spec.Replicas = &originalReplicas
// Delete the current StatefulSet
fmt.Printf("Deleting StatefulSet %s...\n", name)
err = clientset.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil {
fmt.Printf("Error deleting StatefulSet: %v\n", err)
os.Exit(1)
}
// Wait for StatefulSet to be deleted
fmt.Println("Waiting for StatefulSet to be deleted...")
for {
_, err := clientset.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
fmt.Println("StatefulSet deleted")
break
}
time.Sleep(2 * time.Second)
}
// Create the restored StatefulSet
fmt.Printf("Creating restored StatefulSet %s...\n", name)
_, err = clientset.AppsV1().StatefulSets(namespace).Create(ctx, backupStatefulSet, metav1.CreateOptions{})
if err != nil {
fmt.Printf("Error creating restored StatefulSet: %v\n", err)
os.Exit(1)
}
fmt.Println("Rollback complete. Monitoring pod creation...")
// Monitor pod creation
for i := 0; i < int(originalReplicas); i++ {
podName := fmt.Sprintf("%s-%d", name, i)
fmt.Printf("Waiting for pod %s to be ready...\n", podName)
for {
pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
fmt.Printf("Pod %s not found yet, waiting...\n", podName)
time.Sleep(5 * time.Second)
continue
}
if isPodReady(pod) {
fmt.Printf("Pod %s is ready\n", podName)
break
}
fmt.Printf("Pod %s is not ready yet (status: %s), waiting...\n", podName, pod.Status.Phase)
time.Sleep(5 * time.Second)
}
}
fmt.Println("All pods are ready. Rollback successful.")
}
// Helper functions
func isValidStatefulSetPodName(podName, stsName string) bool {
return len(podName) > len(stsName) &&
podName[:len(stsName)] == stsName &&
podName[len(stsName)] == '-'
}
func getPodOrdinal(podName string) string {
for i := len(podName) - 1; i >= 0; i-- {
if podName[i] == '-' {
return podName[i+1:]
}
}
return ""
}
func isPodReady(pod *corev1.Pod) bool {
// A pod is ready when it is Running and reports the Ready condition as True
if pod.Status.Phase != corev1.PodRunning {
return false
}
for _, condition := range pod.Status.Conditions {
if condition.Type == corev1.PodReady && condition.Status == corev1.ConditionTrue {
return true
}
}
return false
}
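A possible invocation of this version of the tool, using the flags defined in main (paths and names are examples):
go build -o sts-migrate statefulset_migration.go
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=validate
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=prepare
./sts-migrate -kubeconfig="$HOME/.kube/config" -namespace=production -name=database -action=migrate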
• Fixed StatefulSet configuration to ensure proper pod identity preservation:
# Before: Problematic StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
namespace: production
spec:
serviceName: database
replicas: 3
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
containers:
- name: database
image: postgres:14
ports:
- containerPort: 5432
name: postgres
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: standard
resources:
requests:
storage: 10Gi
# After: Improved StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
namespace: production
annotations:
statefulset.kubernetes.io/pod-name-generation: "consistent"
spec:
serviceName: database
replicas: 3
podManagementPolicy: OrderedReady
updateStrategy:
type: OnDelete # Prevents automatic updates during cluster upgrades
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
terminationGracePeriodSeconds: 60
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
topologyKey: kubernetes.io/hostname
containers:
- name: database
image: postgres:14
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: database-credentials
key: password
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
readinessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 5
periodSeconds: 10
lifecycle:
preStop:
exec:
command:
- sh
- -c
- "pg_ctl -D /var/lib/postgresql/data stop -m fast"
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: managed-premium
volumeMode: Filesystem
resources:
requests:
storage: 10Gi
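Because the root cause included a storage class with a Delete reclaim policy, a hedged sketch for auditing the bound volumes and switching them to Retain; PVC names follow the data-database-<ordinal> pattern from the manifest above, and patches should be verified before use in production:
# Check and fix the reclaim policy on the PVs backing the StatefulSet
for i in 0 1 2; do
PV=$(kubectl get pvc "data-database-$i" -n production -o jsonpath='{.spec.volumeName}')
POLICY=$(kubectl get pv "$PV" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}')
echo "data-database-$i -> $PV ($POLICY)"
if [ "$POLICY" = "Delete" ]; then
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
fi
done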
• Implemented a StatefulSet upgrade procedure:
#!/bin/bash
# statefulset_upgrade.sh
# Usage: ./statefulset_upgrade.sh <namespace> <statefulset-name>
set -e
NAMESPACE=$1
STATEFULSET=$2
if [ -z "$NAMESPACE" ] || [ -z "$STATEFULSET" ]; then
echo "Usage: $0 <namespace> <statefulset-name>"
exit 1
fi
echo "Starting controlled upgrade of StatefulSet $STATEFULSET in namespace $NAMESPACE"
# Validate StatefulSet
echo "Validating StatefulSet..."
kubectl get statefulset -n $NAMESPACE $STATEFULSET -o json > /tmp/statefulset-original.json
REPLICAS=$(kubectl get statefulset -n $NAMESPACE $STATEFULSET -o jsonpath='{.spec.replicas}')
echo "StatefulSet has $REPLICAS replicas"
# Check for PVCs
echo "Checking PersistentVolumeClaims..."
PVC_COUNT=$(kubectl get pvc -n $NAMESPACE -l app=$STATEFULSET -o name | wc -l)
echo "Found $PVC_COUNT PVCs"
# Backup StatefulSet definition
echo "Creating backup of StatefulSet definition..."
TIMESTAMP=$(date +%Y%m%d%H%M%S)
kubectl get statefulset -n $NAMESPACE $STATEFULSET -o yaml > statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml
echo "Backup saved to statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml"
# Scale down StatefulSet one pod at a time
echo "Scaling down StatefulSet one pod at a time..."
for ((i=$REPLICAS-1; i>=0; i--)); do
POD_NAME="$STATEFULSET-$i"
echo "Processing pod $POD_NAME"
# Check if pod exists
if kubectl get pod -n $NAMESPACE $POD_NAME &>/dev/null; then
echo "Taking pod $POD_NAME out of service..."
# Check if pod has a preStop hook
HAS_PRESTOP=$(kubectl get pod -n $NAMESPACE $POD_NAME -o json | jq '.spec.containers[0].lifecycle.preStop')
if [ "$HAS_PRESTOP" != "null" ]; then
echo "Pod has preStop hook, allowing time for graceful shutdown..."
GRACE_PERIOD=$(kubectl get pod -n $NAMESPACE $POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}')
if [ -z "$GRACE_PERIOD" ]; then
GRACE_PERIOD=30
fi
echo "Grace period is $GRACE_PERIOD seconds"
fi
# Delete pod
kubectl delete pod -n $NAMESPACE $POD_NAME
# Wait for pod to be deleted
echo "Waiting for pod $POD_NAME to be deleted..."
kubectl wait --for=delete pod/$POD_NAME -n $NAMESPACE --timeout=300s
echo "Pod $POD_NAME deleted"
else
echo "Pod $POD_NAME does not exist, skipping"
fi
done
# Delete the StatefulSet but keep the pods
echo "Deleting StatefulSet $STATEFULSET while keeping pods..."
kubectl delete statefulset -n $NAMESPACE $STATEFULSET --cascade=false
# Create the new StatefulSet
echo "Creating new StatefulSet from backup with modifications..."
cat statefulset-$STATEFULSET-backup-$TIMESTAMP.yaml | \
sed '/creationTimestamp:/d' | \
sed '/resourceVersion:/d' | \
sed '/uid:/d' | \
sed '/status:/,$ d' | \
kubectl apply -f -
# Force the OnDelete update strategy with a patch instead of editing YAML with sed
kubectl patch statefulset $STATEFULSET -n $NAMESPACE \
--type merge -p '{"spec":{"updateStrategy":{"type":"OnDelete","rollingUpdate":null}}}'
# Wait for StatefulSet to be ready (rollout status does not work with OnDelete)
echo "Waiting for StatefulSet to be ready..."
kubectl wait --for=jsonpath='{.status.readyReplicas}'=$REPLICAS \
statefulset/$STATEFULSET -n $NAMESPACE --timeout=600s
echo "StatefulSet upgrade completed successfully"
• Long-term: Implemented a comprehensive StatefulSet management strategy:
- Created a StatefulSet operator for advanced lifecycle management
- Implemented automated backup and restore procedures
- Developed a StatefulSet migration framework for version upgrades
- Established clear procedures for StatefulSet maintenance
- Implemented monitoring for StatefulSet identity and data integrity
Lessons Learned:
StatefulSet pod identity preservation requires careful management during Kubernetes upgrades.
How to Avoid:
Use the "OnDelete" update strategy for critical StatefulSets.
Implement proper backup procedures before cluster upgrades.
Test StatefulSet behavior in a staging environment before production upgrades.
Use consistent storage classes with appropriate reclaim policies.
Implement StatefulSet-aware upgrade procedures.
No summary provided
What Happened:
During a planned maintenance window, the operations team initiated a rolling upgrade of Kubernetes nodes. Despite having Pod Disruption Budgets in place, a critical service became completely unavailable, causing a production outage. The incident occurred when multiple nodes were drained simultaneously, and the service pods could not be rescheduled fast enough due to resource constraints and misconfigured PDBs.
Diagnosis Steps:
Analyzed Kubernetes events during the outage timeframe.
Reviewed Pod Disruption Budget configurations for affected services.
Examined node drain logs and kubectl drain command parameters.
Checked pod scheduling and resource allocation during the incident.
Reviewed cluster autoscaling configuration and behavior.
Root Cause:
The investigation revealed multiple issues with the Pod Disruption Budget implementation: 1. The PDB was configured with maxUnavailable: 50%, allowing too many pods to be evicted simultaneously 2. The service had resource requests that were too high, limiting where pods could be scheduled 3. Node draining was performed without appropriate --timeout and --grace-period settings, causing rapid evictions 4. Cluster autoscaling was too slow to provision new nodes for rescheduled pods 5. The service lacked proper readiness probes, causing traffic to be sent to initializing pods
Fix/Workaround:
• Short-term: Implemented immediate fixes to prevent recurrence:
# Before: Problematic PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
maxUnavailable: 50%
selector:
matchLabels:
app: critical-service
# After: Improved PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
namespace: production
spec:
minAvailable: 80% # Ensure at least 80% of pods are available
selector:
matchLabels:
app: critical-service
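Before any drain, the PDB status can be checked directly to confirm that evictions are currently allowed; a minimal sketch using the names above:
# Abort the drain if the PDB currently allows zero disruptions
ALLOWED=$(kubectl get pdb critical-service-pdb -n production -o jsonpath='{.status.disruptionsAllowed}')
if [ "${ALLOWED:-0}" -lt 1 ]; then
echo "PDB allows no disruptions right now; postpone the drain"
exit 1
fi
echo "PDB allows $ALLOWED disruption(s); safe to proceed"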
• Created a node drain script with proper safeguards:
#!/bin/bash
# safe_node_drain.sh - Safely drain nodes with proper timeouts and monitoring
set -e
NODE=$1
NAMESPACE=${2:-"production"}
EVICTION_TIMEOUT=${3:-"300s"}
DRAIN_TIMEOUT=${4:-"600"}
GRACE_PERIOD=${5:-"120"}
if [ -z "$NODE" ]; then
echo "Usage: $0 <node-name> [namespace] [eviction-timeout] [drain-timeout] [grace-period]"
exit 1
fi
echo "Checking PDBs in namespace $NAMESPACE..."
kubectl get pdb -n $NAMESPACE -o wide
echo "Checking pods on node $NODE..."
POD_COUNT=$(kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE | grep -v NAME | wc -l)
echo "Found $POD_COUNT pods running on $NODE"
echo "Cordoning node $NODE..."
kubectl cordon $NODE
echo "Waiting for grace period ($GRACE_PERIOD seconds) before starting evictions..."
sleep $GRACE_PERIOD
echo "Starting drain with eviction timeout $EVICTION_TIMEOUT..."
timeout $DRAIN_TIMEOUT kubectl drain $NODE \
--pod-eviction-timeout=$EVICTION_TIMEOUT \
--ignore-daemonsets \
--delete-emptydir-data \
--force
if [ $? -eq 0 ]; then
echo "Node $NODE drained successfully"
else
echo "Node drain timed out or failed. Check node status."
kubectl uncordon $NODE
exit 1
fi
echo "Checking service health..."
for svc in $(kubectl get svc -n $NAMESPACE -o name); do
echo "Checking $svc..."
kubectl get $svc -n $NAMESPACE -o wide
done
echo "Node drain completed. Node is ready for maintenance."
• Improved the deployment configuration with proper resource settings and probes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: critical-service
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: critical-service
spec:
terminationGracePeriodSeconds: 60
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- critical-service
topologyKey: "kubernetes.io/hostname"
containers:
- name: critical-service
image: critical-service:v1.2.3
ports:
- containerPort: 8080
resources:
requests:
cpu: 250m # Reduced from 1000m
memory: 512Mi # Reduced from 2Gi
limits:
cpu: 1000m
memory: 1Gi
readinessProbe: # Added proper readiness probe
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 2
successThreshold: 1
failureThreshold: 3
livenessProbe: # Added proper liveness probe
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
startupProbe: # Added startup probe
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 12 # Allow 60s for startup
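Since oversized resource requests contributed to the outage, a hedged sketch for comparing requested CPU and memory with live usage (requires metrics-server; output is indicative only):
# Compare resource requests with observed usage for the critical-service pods
kubectl get pods -n production -l app=critical-service \
-o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'
kubectl top pods -n production -l app=critical-service
# If usage stays well below the requests, lower them as in the manifest above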
• Long-term: Implemented a comprehensive node maintenance strategy:
- Created a node maintenance controller to automate safe node drains
- Implemented PDB validation in CI/CD pipelines
- Improved cluster autoscaling configuration for faster node provisioning
- Developed a service impact prediction tool for maintenance operations
- Established clear maintenance procedures with proper testing
Lessons Learned:
Pod Disruption Budgets require careful configuration and testing to be effective.
How to Avoid:
Configure PDBs with minAvailable rather than maxUnavailable for critical services.
Test PDB configurations in a staging environment before production.
Implement proper node drain procedures with adequate timeouts.
Ensure services have appropriate resource requests and limits.
Use pod anti-affinity to distribute pods across nodes.
No summary provided
What Happened:
A large production Kubernetes cluster began experiencing increased pod scheduling latency and performance issues. Some nodes were heavily utilized while others remained underutilized. Critical workloads were competing for resources on specific nodes, while other nodes had abundant capacity. The issue gradually worsened over several weeks as new workloads were deployed, eventually leading to pod evictions and scheduling failures during peak traffic periods.
Diagnosis Steps:
Analyzed node resource utilization across the cluster.
Examined pod distribution and resource requests/limits.
Reviewed node labels, taints, and pod affinity/anti-affinity rules.
Checked the scheduler configuration and default policies.
Investigated historical deployment patterns and workload growth.
Root Cause:
The investigation revealed multiple issues contributing to the resource imbalance: 1. Pod affinity rules were causing workloads to cluster on specific nodes 2. Node labels for specialized hardware were applied inconsistently 3. Some applications had overly restrictive node selectors 4. The default scheduler configuration wasn't optimized for the workload mix 5. Resource requests were significantly lower than actual usage for some workloads
Fix/Workaround:
• Implemented immediate fixes to rebalance the cluster
• Optimized pod affinity/anti-affinity rules
• Standardized node labeling and workload placement
• Adjusted resource requests to match actual usage patterns
• Implemented proper node autoscaling
Lessons Learned:
Kubernetes resource allocation requires ongoing monitoring and optimization as workloads evolve.
How to Avoid:
Implement regular resource utilization reviews and optimization.
Use pod topology spread constraints for better workload distribution (see the sketch below).
Configure appropriate resource requests based on actual usage.
Standardize node labeling and workload placement strategies.
Monitor scheduling metrics to detect imbalances early.
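A minimal sketch of the topology spread constraints recommended above, placed in a workload's pod template spec; the app label web is hypothetical:
```yaml
# Sketch: spread replicas across zones (hard) and nodes (soft).
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule    # Hard requirement across zones
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # Soft preference across nodes
  labelSelector:
    matchLabels:
      app: web
```
Unlike pure anti-affinity, spread constraints keep the skew bounded as the replica count grows, which addresses the gradual imbalance described in this scenario.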
No summary provided
What Happened:
During a scheduled update to a database cluster running in Kubernetes as a StatefulSet, the operations team initiated a rolling update to upgrade the database version. Shortly after the update began, the entire cluster became unavailable, causing a major production outage. The update process had terminated all pods simultaneously instead of following the expected ordered, one-at-a-time update pattern. Recovery required manual intervention and data restoration from backups.
Diagnosis Steps:
Examined the StatefulSet configuration and update history.
Reviewed Kubernetes events and pod termination sequences.
Analyzed the StatefulSet controller logs.
Checked PersistentVolume and PersistentVolumeClaim status.
Verified the database cluster's quorum and replication state.
Root Cause:
The investigation revealed multiple issues with the StatefulSet configuration: 1. The StatefulSet updateStrategy was incorrectly set to "OnDelete" instead of "RollingUpdate" 2. The team manually deleted all pods simultaneously to trigger the update 3. Pod disruption budgets were not configured to protect the cluster quorum 4. The database was not configured to handle multiple simultaneous node failures 5. The StatefulSet did not have proper readiness probes to verify cluster health
Fix/Workaround:
• Restored the database cluster from backups
• Corrected the StatefulSet updateStrategy to "RollingUpdate" (see the sketch below)
• Implemented proper pod disruption budgets
• Configured appropriate readiness and liveness probes
• Created a comprehensive update procedure for stateful applications
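A minimal sketch of the corrected configuration, using a hypothetical three-replica database StatefulSet named db: with updateStrategy set to RollingUpdate the controller replaces one pod at a time in reverse ordinal order, and the PDB keeps voluntary disruptions from taking the cluster below quorum.
```yaml
# Sketch: ordered rolling updates plus a quorum-protecting PDB (names are hypothetical).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  replicas: 3
  serviceName: db
  selector:
    matchLabels:
      app: db
  updateStrategy:
    type: RollingUpdate        # Pods are updated one at a time, highest ordinal first
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: example-db:2.0  # Placeholder image for the upgrade target
        readinessProbe:
          tcpSocket:
            port: 5432         # Hypothetical database port
          periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 2              # Preserves quorum for a 3-member cluster
  selector:
    matchLabels:
      app: db
```
The readiness probe matters here as well: the rolling update only moves to the next ordinal once the updated pod reports Ready, so a probe that genuinely reflects cluster membership prevents a cascade of unhealthy replicas.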
Lessons Learned:
StatefulSet update strategies require careful configuration and understanding of the application's resilience properties.
How to Avoid:
Always use "RollingUpdate" strategy for StatefulSets unless there's a specific reason not to.
Implement pod disruption budgets to protect application quorum.
Configure appropriate readiness probes that verify application health.
Test update procedures in a staging environment before production.
Create detailed runbooks for stateful application updates.
No summary provided
What Happened:
A company deployed a microservices application to a multi-zone Kubernetes cluster. After several weeks of operation, they noticed that certain nodes were consistently overloaded while others remained underutilized. During a traffic spike, the overloaded nodes experienced performance degradation, causing service disruptions. Investigation revealed that the workload distribution was severely imbalanced due to misconfigured node affinity and anti-affinity rules.
Diagnosis Steps:
Analyzed node resource utilization across the cluster.
Examined pod distribution and placement patterns.
Reviewed node affinity and anti-affinity configurations.
Checked node labels and taints/tolerations.
Assessed the impact of topology spread constraints.
Root Cause:
The investigation revealed multiple issues with workload placement: 1. Node affinity rules were too restrictive, forcing certain workloads onto specific nodes 2. Anti-affinity rules were missing for critical components, allowing them to co-locate 3. Some node labels were inconsistent across the cluster, affecting affinity rules 4. Topology spread constraints were not implemented for multi-zone distribution 5. Pod resource requests were inaccurate, leading to poor scheduling decisions
Fix/Workaround:
• Implemented immediate fixes to rebalance workloads
• Corrected node affinity and anti-affinity configurations
• Standardized node labels across the cluster
• Implemented topology spread constraints for zone distribution
• Adjusted pod resource requests based on actual usage
Lessons Learned:
Node affinity and anti-affinity configurations significantly impact workload distribution and cluster stability.
How to Avoid:
Implement proper topology spread constraints for multi-zone deployments.
Use node affinity selectively and avoid overly restrictive rules.
Ensure consistent node labeling across the cluster.
Regularly review and validate workload distribution patterns.
Implement pod disruption budgets to protect critical services during rebalancing.
```yaml
# Example of improved pod specification with proper affinity rules
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web
    component: frontend
spec:
  # Topology spread constraints for multi-zone distribution
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
  # Pod affinity for related services
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - cache
          topologyKey: kubernetes.io/hostname
    # Pod anti-affinity to avoid co-location of same components
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    # Node affinity for hardware requirements
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - c5.large
            - c5.xlarge
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone   # Updated from the deprecated failure-domain.beta.kubernetes.io/zone label
            operator: In
            values:
            - us-east-1a
            - us-east-1b
  containers:
  - name: web-app
    image: web-app:v1.2.3
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"
```
No summary provided
What Happened:
A large e-commerce company used an automated node maintenance system to perform regular updates on their Kubernetes cluster. The system was designed to cordon nodes, drain workloads, perform updates, and then uncordon nodes to return them to service. During a scheduled maintenance window, the system identified multiple nodes requiring kernel updates and initiated the cordoning process for all of them simultaneously. This resulted in insufficient capacity for the rescheduled workloads, causing pod evictions, service disruptions, and eventually a partial outage of the platform during a high-traffic period.
Diagnosis Steps:
Analyzed cluster events and pod scheduling logs.
Examined node cordoning timestamps and patterns.
Reviewed the automated maintenance system's configuration.
Checked cluster capacity and resource allocation.
Investigated pod priority and preemption settings.
Root Cause:
The investigation revealed multiple issues with the node maintenance automation: 1. The maintenance system had no limits on how many nodes could be cordoned simultaneously 2. There was no pre-check for available cluster capacity before cordoning 3. Pod disruption budgets were not properly configured for critical services 4. The system lacked awareness of application-level dependencies 5. There was no circuit breaker to halt the process if too many pods failed to reschedule
Fix/Workaround:
• Implemented immediate fixes to restore service
• Uncordoned sufficient nodes to restore capacity
• Implemented a phased maintenance approach with capacity checks
• Configured proper pod disruption budgets for all services
• Added circuit breaker logic to the maintenance system
# Improved Node Maintenance Controller Configuration
# File: node-maintenance-controller-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-maintenance-controller-config
  namespace: kube-system
data:
  config.yaml: |
    maintenance:
      # Maximum percentage of nodes that can be under maintenance simultaneously
      maxConcurrentNodePercentage: 10
      # Minimum available capacity required before cordoning additional nodes
      minimumAvailableCapacity:
        cpu: 20%
        memory: 20%
      # Wait time between node maintenance operations
      nodeProcessingInterval: 5m
      # Circuit breaker settings
      circuitBreaker:
        # Maximum number of pod scheduling failures before pausing
        maxPodSchedulingFailures: 5
        # Pause duration when circuit breaker is triggered
        pauseDuration: 30m
        # Automatically uncordon nodes if circuit breaker is triggered
        autoUncordonOnBreak: true
      # Node selection strategy
      nodeSelectionStrategy: LeastDisruptive
      # Pre-cordoning validation
      preCordoning:
        # Verify all PDBs are satisfied before cordoning
        checkPodDisruptionBudgets: true
        # Check for critical pods that might not reschedule
        checkCriticalPods: true
        # Verify node affinity/anti-affinity constraints can be satisfied
        checkAffinityConstraints: true
      # Post-cordoning validation
      postCordoning:
        # Maximum time to wait for pods to be evicted
        maxPodEvictionWaitTime: 5m
        # Monitor for scheduling failures
        monitorSchedulingFailures: true
      # Maintenance window settings
      maintenanceWindows:
      - name: "overnight"
        daysOfWeek: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
        timeWindow:
          start: "01:00"
          end: "05:00"
          timeZone: "UTC"
      - name: "weekend"
        daysOfWeek: ["Saturday", "Sunday"]
        timeWindow:
          start: "10:00"
          end: "16:00"
          timeZone: "UTC"
// Improved Node Maintenance Controller Logic
// File: node_maintenance_controller.go
package controller
import (
"context"
"fmt"
"time"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/util/retry"
"k8s.io/klog/v2"
)
// NodeMaintenanceController manages safe node maintenance operations
type NodeMaintenanceController struct {
kubeClient kubernetes.Interface
config *MaintenanceConfig
maintenanceState *MaintenanceState
capacityCalculator *ClusterCapacityCalculator
pdbChecker *PodDisruptionBudgetChecker
}
// MaintenanceState tracks the current state of maintenance operations
type MaintenanceState struct {
NodesInMaintenance map[string]*NodeMaintenance
FailedSchedulingEvents int
CircuitBreakerTripped bool
CircuitBreakerTrippedAt time.Time
}
// NodeMaintenance tracks the state of a node in maintenance
type NodeMaintenance struct {
NodeName string
CordonedAt time.Time
PodsEvicted int
PodsRemaining int
MaintenanceCompleted bool
}
// ProcessNodeMaintenance evaluates and performs maintenance on eligible nodes
func (c *NodeMaintenanceController) ProcessNodeMaintenance(ctx context.Context) error {
// Check if we're in a maintenance window
if !c.isInMaintenanceWindow() {
klog.V(4).Info("Not currently in a maintenance window")
return nil
}
// Check if circuit breaker is tripped
if c.maintenanceState.CircuitBreakerTripped {
elapsed := time.Since(c.maintenanceState.CircuitBreakerTrippedAt)
if elapsed < c.config.CircuitBreaker.PauseDuration {
klog.Warningf("Circuit breaker tripped, pausing maintenance for %v more",
c.config.CircuitBreaker.PauseDuration - elapsed)
// Auto-uncordon nodes if configured
if c.config.CircuitBreaker.AutoUncordonOnBreak {
if err := c.uncordonNodesInMaintenance(ctx); err != nil {
klog.Errorf("Failed to auto-uncordon nodes: %v", err)
}
}
return nil
}
// Reset circuit breaker
c.maintenanceState.CircuitBreakerTripped = false
c.maintenanceState.FailedSchedulingEvents = 0
}
// Get nodes that need maintenance
nodesToMaintain, err := c.getNodesNeedingMaintenance(ctx)
if err != nil {
return fmt.Errorf("failed to get nodes needing maintenance: %w", err)
}
// Check how many nodes are already in maintenance
currentMaintenanceCount := len(c.maintenanceState.NodesInMaintenance)
// Get all nodes to calculate percentage
allNodes, err := c.kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return fmt.Errorf("failed to list nodes: %w", err)
}
// Calculate maximum allowed concurrent maintenance
maxConcurrent := int(float64(len(allNodes.Items)) * c.config.MaxConcurrentNodePercentage / 100.0)
if maxConcurrent < 1 {
maxConcurrent = 1
}
// Check if we're already at the limit
if currentMaintenanceCount >= maxConcurrent {
klog.Infof("Already at maximum concurrent maintenance limit (%d nodes)", maxConcurrent)
return nil
}
// Calculate how many more nodes we can process
remainingSlots := maxConcurrent - currentMaintenanceCount
// Check available capacity before cordoning more nodes
availableCapacity, err := c.capacityCalculator.GetAvailableCapacity(ctx)
if err != nil {
return fmt.Errorf("failed to calculate available capacity: %w", err)
}
// Check if we have enough capacity
requiredCPU := c.config.MinimumAvailableCapacity.CPU
requiredMemory := c.config.MinimumAvailableCapacity.Memory
if availableCapacity.CPU.Cmp(requiredCPU) < 0 ||
availableCapacity.Memory.Cmp(requiredMemory) < 0 {
klog.Warningf("Insufficient capacity for maintenance: CPU: %v (required %v), Memory: %v (required %v)",
availableCapacity.CPU, requiredCPU, availableCapacity.Memory, requiredMemory)
return nil
}
// Sort nodes by maintenance strategy
sortedNodes := c.sortNodesByStrategy(nodesToMaintain)
// Process nodes up to the remaining slot limit
nodesToProcess := min(len(sortedNodes), remainingSlots)
for i := 0; i < nodesToProcess; i++ {
node := sortedNodes[i]
// Perform pre-cordoning validation
if err := c.validatePreCordoning(ctx, node); err != nil {
klog.Warningf("Pre-cordoning validation failed for node %s: %v", node.Name, err)
continue
}
// Cordon the node
if err := c.cordonNode(ctx, node); err != nil {
klog.Errorf("Failed to cordon node %s: %v", node.Name, err)
continue
}
// Track the node in maintenance
c.maintenanceState.NodesInMaintenance[node.Name] = &NodeMaintenance{
NodeName: node.Name,
CordonedAt: time.Now(),
}
// Wait before processing the next node
time.Sleep(c.config.NodeProcessingInterval)
}
return nil
}
// validatePreCordoning performs checks before cordoning a node
func (c *NodeMaintenanceController) validatePreCordoning(ctx context.Context, node *corev1.Node) error {
// Check if PDBs would be violated
if c.config.PreCordoning.CheckPodDisruptionBudgets {
if violations, err := c.pdbChecker.CheckPDBViolations(ctx, node.Name); err != nil {
return fmt.Errorf("failed to check PDB violations: %w", err)
} else if len(violations) > 0 {
return fmt.Errorf("cordoning would violate PDBs: %v", violations)
}
}
// Check for critical pods
if c.config.PreCordoning.CheckCriticalPods {
if criticalPods, err := c.getCriticalPodsOnNode(ctx, node.Name); err != nil {
return fmt.Errorf("failed to check critical pods: %w", err)
} else if len(criticalPods) > 0 {
return fmt.Errorf("node has %d critical pods that may not reschedule", len(criticalPods))
}
}
// Check affinity constraints
if c.config.PreCordoning.CheckAffinityConstraints {
if err := c.validateAffinityConstraints(ctx, node); err != nil {
return fmt.Errorf("affinity validation failed: %w", err)
}
}
return nil
}
// monitorNodeDrain watches for pod scheduling failures during node draining
func (c *NodeMaintenanceController) monitorNodeDrain(ctx context.Context, nodeName string) {
// Set up event watcher for scheduling failures
// Implementation omitted for brevity
// If too many scheduling failures occur, trip the circuit breaker
if c.maintenanceState.FailedSchedulingEvents >= c.config.CircuitBreaker.MaxPodSchedulingFailures {
klog.Warningf("Circuit breaker tripped due to %d pod scheduling failures",
c.maintenanceState.FailedSchedulingEvents)
c.maintenanceState.CircuitBreakerTripped = true
c.maintenanceState.CircuitBreakerTrippedAt = time.Now()
}
}
// Helper functions
func (c *NodeMaintenanceController) isInMaintenanceWindow() bool {
// Implementation omitted for brevity
return true
}
func (c *NodeMaintenanceController) getNodesNeedingMaintenance(ctx context.Context) ([]*corev1.Node, error) {
// Implementation omitted for brevity
return nil, nil
}
func (c *NodeMaintenanceController) sortNodesByStrategy(nodes []*corev1.Node) []*corev1.Node {
// Implementation omitted for brevity
return nodes
}
func (c *NodeMaintenanceController) cordonNode(ctx context.Context, node *corev1.Node) error {
// Implementation omitted for brevity
return nil
}
func (c *NodeMaintenanceController) uncordonNodesInMaintenance(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func (c *NodeMaintenanceController) getCriticalPodsOnNode(ctx context.Context, nodeName string) ([]corev1.Pod, error) {
// Implementation omitted for brevity
return nil, nil
}
func (c *NodeMaintenanceController) validateAffinityConstraints(ctx context.Context, node *corev1.Node) error {
// Implementation omitted for brevity
return nil
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
Lessons Learned:
Automated node maintenance requires careful orchestration to prevent capacity issues and service disruptions.
How to Avoid:
Implement limits on concurrent node maintenance operations.
Perform capacity checks before cordoning nodes.
Configure proper pod disruption budgets for all services.
Implement circuit breaker logic to halt operations if too many failures occur.
Consider application dependencies when designing maintenance processes.
No summary provided
What Happened:
A large e-commerce company was performing routine maintenance on their Kubernetes cluster, which involved upgrading the operating system on a subset of nodes. The operations team followed their standard procedure: cordon the nodes to prevent new pods from being scheduled, drain existing pods to other nodes, perform the maintenance, and then uncordon the nodes. However, despite the nodes showing as cordoned in kubectl, new pods were still being scheduled to these nodes. When the team proceeded with rebooting the nodes for maintenance, several critical services experienced unexpected downtime as newly scheduled pods were terminated.
Diagnosis Steps:
Verified the cordoned status of nodes using kubectl.
Examined the Kubernetes scheduler logs.
Reviewed recent changes to the cluster configuration.
Checked the Kubernetes API server audit logs.
Tested node cordoning in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the node cordoning process: 1. A custom admission controller was bypassing the standard scheduling restrictions 2. The admission controller was implemented to ensure certain workloads always had capacity 3. The controller was not respecting the cordoned status of nodes 4. The maintenance procedure didn't account for this custom component 5. There was no testing of the cordoning process before the maintenance window
Fix/Workaround:
• Implemented immediate improvements to the maintenance process
• Modified the custom admission controller to respect node cordoning
• Created a pre-flight check for maintenance procedures
• Implemented a maintenance mode flag in the custom controller (see the sketch below)
• Established a testing protocol for maintenance procedures
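Because the admission controller is an internal component, only a sketch is possible; the ConfigMap below illustrates how a maintenance mode flag and a cordon-awareness switch might be surfaced to it (all names, keys, and the namespace are hypothetical):
```yaml
# Hypothetical configuration consumed by the custom admission controller.
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-admission-controller-config
  namespace: kube-system
data:
  # When "true", the controller stops overriding scheduler placement decisions.
  maintenanceMode: "true"
  # When "true", the controller never targets nodes with spec.unschedulable set,
  # even when its capacity-guarantee logic would otherwise prefer them.
  respectCordonedNodes: "true"
```
A pre-flight check in the maintenance runbook can then verify that both flags are set before the first node is cordoned.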
Lessons Learned:
Custom Kubernetes components can interfere with standard cluster operations if not properly designed and tested.
How to Avoid:
Ensure custom components respect standard Kubernetes control mechanisms.
Test maintenance procedures in a staging environment before production.
Implement pre-flight checks for maintenance operations.
Document all custom components and their behavior.
Create comprehensive testing protocols for maintenance procedures.