# Site Reliability Engineering (SRE) Scenarios
No summary provided
What Happened:
During a database slowdown, one service began experiencing timeouts. Instead of failing gracefully, it continued to accept requests but responded slowly. Dependent services slowed down in turn, and the ripple effect eventually brought down the entire system even though only one component was unhealthy.
Diagnosis Steps:
Analyzed service dependency graph to understand the failure propagation.
Reviewed logs to identify the initial failure point.
Examined circuit breaker configurations across services.
Tested service behavior under simulated failure conditions.
Reviewed recent changes to resilience configurations.
Root Cause:
The circuit breaker in the database access service was misconfigured with too high a failure threshold (90%) and too long a timeout period (30 seconds). This allowed the service to continue accepting requests despite being unable to process them in a timely manner. Additionally, dependent services had no circuit breakers configured, allowing the failure to propagate.
Fix/Workaround:
• Short-term: Implemented proper circuit breaker configuration:
// Before: Misconfigured circuit breaker
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
.circuitBreakerConfig(CircuitBreakerConfig.custom()
.failureRateThreshold(90.0f)
.waitDurationInOpenState(Duration.ofSeconds(60))
.slidingWindowSize(100)
.build())
.timeLimiterConfig(TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(30))
.build())
.build());
}
// After: Properly configured circuit breaker
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
.circuitBreakerConfig(CircuitBreakerConfig.custom()
.failureRateThreshold(50.0f)
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10)
.permittedNumberOfCallsInHalfOpenState(5)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build())
.timeLimiterConfig(TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(5))
.build())
.build());
}
• Long-term: Implemented a comprehensive resilience strategy:
# Service Resilience Configuration (resilience.yaml)
global:
  circuitBreaker:
    failureRateThreshold: 50.0
    slowCallRateThreshold: 50.0
    slowCallDurationThreshold: 2s
    waitDurationInOpenState: 30s
    permittedNumberOfCallsInHalfOpenState: 5
    slidingWindowSize: 10
    slidingWindowType: COUNT_BASED
    minimumNumberOfCalls: 5
    automaticTransitionFromOpenToHalfOpenEnabled: true
  timeLimiter:
    timeoutDuration: 5s
    cancelRunningFuture: true
  bulkhead:
    maxConcurrentCalls: 25
    maxWaitDuration: 1s
  retry:
    maxAttempts: 3
    waitDuration: 500ms
    retryExceptions:
      - java.io.IOException
      - java.net.SocketTimeoutException
    ignoreExceptions:
      - java.lang.IllegalArgumentException
services:
  database-service:
    circuitBreaker:
      failureRateThreshold: 30.0
      waitDurationInOpenState: 45s
    timeLimiter:
      timeoutDuration: 3s
  payment-service:
    circuitBreaker:
      failureRateThreshold: 20.0
    bulkhead:
      maxConcurrentCalls: 10
  inventory-service:
    retry:
      maxAttempts: 5
      waitDuration: 1s
• Implemented service degradation strategies:
// Graceful degradation with fallbacks
@Service
public class ProductServiceImpl implements ProductService {
private final ProductRepository repository;
private final RecommendationService recommendationService;
private final InventoryService inventoryService;
private final CircuitBreakerFactory circuitBreakerFactory;
public ProductServiceImpl(ProductRepository repository,
RecommendationService recommendationService,
InventoryService inventoryService,
CircuitBreakerFactory circuitBreakerFactory) {
this.repository = repository;
this.recommendationService = recommendationService;
this.inventoryService = inventoryService;
this.circuitBreakerFactory = circuitBreakerFactory;
}
@Override
public ProductDetails getProductDetails(String productId) {
Product product = repository.findById(productId)
.orElseThrow(() -> new ProductNotFoundException(productId));
// Get recommendations with circuit breaker and fallback
List<Product> recommendations = circuitBreakerFactory
.create("recommendations")
.run(() -> recommendationService.getRecommendations(productId),
throwable -> getDefaultRecommendations(product.getCategory()));
// Get inventory status with circuit breaker and fallback
InventoryStatus inventoryStatus = circuitBreakerFactory
.create("inventory")
.run(() -> inventoryService.getInventoryStatus(productId),
throwable -> InventoryStatus.UNKNOWN);
return new ProductDetails(product, recommendations, inventoryStatus);
}
private List<Product> getDefaultRecommendations(String category) {
// Fallback to cached or popular products in the same category
return repository.findTopByCategory(category, 5);
}
}
• Implemented a service mesh for centralized resilience management:
# Istio VirtualService with resilience policies
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: database-service
spec:
  hosts:
    - database-service
  http:
    - route:
        - destination:
            host: database-service
            subset: v1
      timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: gateway-error,connect-failure,refused-stream,unavailable,cancelled,resource-exhausted
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: database-service
spec:
  host: database-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
  subsets:
    - name: v1
      labels:
        version: v1
Lessons Learned:
Circuit breakers must be properly configured to prevent cascading failures.
How to Avoid:
Implement circuit breakers with appropriate thresholds and timeouts.
Test failure scenarios regularly through chaos engineering.
Configure all services with appropriate resilience patterns.
Monitor circuit breaker states and failure rates (a monitoring sketch follows this list).
Implement fallback mechanisms for critical services.
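For illustration, a minimal sketch of threshold tuning plus state monitoring using the Go library sony/gobreaker (a library choice assumed here; the services in this incident use Resilience4j, which exposes equivalent event listeners for the same purpose):
// circuit_monitor.go - sketch: tuned thresholds plus state-change monitoring
package main

import (
	"errors"
	"log"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "database",
		MaxRequests: 5,                // calls allowed through while half-open
		Interval:    10 * time.Second, // window over which counts accumulate
		Timeout:     30 * time.Second, // how long the breaker stays open
		// Trip at a 50% failure rate once a minimum number of calls is seen,
		// mirroring the failureRateThreshold/minimumNumberOfCalls pair above.
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 5 && failureRatio >= 0.5
		},
		// Log every state transition; in production this hook would also emit
		// a metric so dashboards and alerts can track breaker health.
		OnStateChange: func(name string, from, to gobreaker.State) {
			log.Printf("circuit breaker %q: %s -> %s", name, from, to)
		},
	})

	// Wrap the protected call; once the breaker opens, Execute fails fast
	// instead of queueing requests against an unhealthy dependency.
	_, err := cb.Execute(func() (interface{}, error) {
		return nil, errors.New("simulated database timeout")
	})
	if err != nil {
		log.Printf("call failed: %v", err)
	}
}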
No summary provided
What Happened:
A web application was migrated to Kubernetes with Horizontal Pod Autoscaler (HPA) configured to scale based on CPU utilization. During traffic spikes, users experienced increased error rates and latency despite the system scaling up as configured.
Diagnosis Steps:
Analyzed metrics during scaling events.
Reviewed pod startup logs and timing.
Examined database connection patterns during scaling.
Tested scaling behavior in a controlled environment.
Profiled application performance under load.
Root Cause:
Multiple issues contributed to the problem:
1. New pods took too long to become fully operational (slow startup time).
2. The application didn't properly handle connection pooling, creating a database connection storm during scaling events.
3. The readiness probe was too permissive, allowing pods to receive traffic before they were truly ready.
4. The scaling threshold was too high (80% CPU), causing scaling to begin only after performance had already degraded.
Fix/Workaround:
• Short-term: Implemented more conservative scaling parameters:
# Before: Aggressive scaling with high threshold
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
# After: More conservative scaling with lower threshold and better behavior
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 5
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 1000
  behavior:
    scaleUp:
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 100
          periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 0
    scaleDown:
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 300
• Long-term: Improved application startup and connection handling:
// Connection pool configuration with proper handling for scaling
@Configuration
public class DatabaseConfig {
@Value("${db.min-pool-size:5}")
private int minPoolSize;
@Value("${db.max-pool-size:20}")
private int maxPoolSize;
@Value("${db.connection-timeout:5000}")
private int connectionTimeout;
@Value("${db.idle-timeout:600000}")
private int idleTimeout;
@Value("${db.max-lifetime:1800000}")
private int maxLifetime;
@Value("${db.pool-name:app-connection-pool}")
private String poolName;
@Bean(destroyMethod = "close")
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(System.getenv("DB_URL"));
config.setUsername(System.getenv("DB_USER"));
config.setPassword(System.getenv("DB_PASSWORD"));
config.setDriverClassName("org.postgresql.Driver");
// Connection pool settings optimized for scaling
config.setMinimumIdle(minPoolSize);
config.setMaximumPoolSize(maxPoolSize);
config.setConnectionTimeout(connectionTimeout);
config.setIdleTimeout(idleTimeout);
config.setMaxLifetime(maxLifetime);
config.setPoolName(poolName);
// Add metrics for monitoring
config.setMetricRegistry(metricRegistry());
// Add health checks
config.setHealthCheckRegistry(healthCheckRegistry());
// Connection initialization
config.setInitializationFailTimeout(0); // Don't fail startup if DB is unavailable
config.setConnectionInitSql("SELECT 1"); // Validate connections
// Connection testing
config.setConnectionTestQuery("SELECT 1");
config.setValidationTimeout(5000);
return new HikariDataSource(config);
}
@Bean
public MetricRegistry metricRegistry() {
return new MetricRegistry();
}
@Bean
public HealthCheckRegistry healthCheckRegistry() {
return new HealthCheckRegistry();
}
}
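One sizing detail worth calling out: the effective connection count is pods × pool size. With the HPA above allowed to reach 30 replicas and db.max-pool-size defaulting to 20, a full scale-out can open up to 30 × 20 = 600 database connections, so the pool ceiling and the database's max_connections limit must be budgeted together or the scale-out itself recreates the connection storm.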
• Improved readiness probe implementation:
// health.go - Improved readiness probe
package main
import (
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"sync"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver used by sql.Open below
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
isReady bool
readyMutex sync.RWMutex
startupTime time.Time
db *sql.DB
// Prometheus metrics
healthCheckDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "health_check_duration_seconds",
Help: "Duration of health check in seconds",
Buckets: prometheus.LinearBuckets(0.01, 0.05, 10),
})
databaseLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "database_query_duration_seconds",
Help: "Duration of database queries in seconds",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
})
)
func init() {
startupTime = time.Now()
prometheus.MustRegister(healthCheckDuration, databaseLatency)
}
type HealthStatus struct {
Status string `json:"status"`
Components map[string]string `json:"components"`
StartupDuration float64 `json:"startupDurationSeconds,omitempty"`
UptimeSeconds float64 `json:"uptimeSeconds"`
ResourceUtilization ResourceStats `json:"resourceUtilization"`
}
type ResourceStats struct {
CPUUsagePercent float64 `json:"cpuUsagePercent"`
MemoryUsagePercent float64 `json:"memoryUsagePercent"`
ConnectionCount int `json:"connectionCount"`
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
timer := prometheus.NewTimer(healthCheckDuration)
defer timer.ObserveDuration()
readyMutex.RLock()
ready := isReady
readyMutex.RUnlock()
if !ready {
// Check if minimum startup time has elapsed
if time.Since(startupTime) < 10*time.Second {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Service is still initializing"))
return
}
// Check database connectivity
dbReady := checkDatabaseConnectivity()
if !dbReady {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Database connectivity check failed"))
return
}
// Check cache warmup
cacheReady := checkCacheWarmup()
if !cacheReady {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Cache warmup incomplete"))
return
}
// All checks passed, service is ready
readyMutex.Lock()
isReady = true
readyMutex.Unlock()
}
// Prepare health status response
status := HealthStatus{
Status: "UP",
Components: map[string]string{
"database": "UP",
"cache": "UP",
"disk": "UP",
},
StartupDuration: time.Since(startupTime).Seconds(),
UptimeSeconds: time.Since(startupTime).Seconds(),
ResourceUtilization: ResourceStats{
CPUUsagePercent: getCPUUsage(),
MemoryUsagePercent: getMemoryUsage(),
ConnectionCount: getConnectionCount(),
},
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(status)
}
func checkDatabaseConnectivity() bool {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
timer := prometheus.NewTimer(databaseLatency)
defer timer.ObserveDuration()
err := db.PingContext(ctx)
if err != nil {
log.Printf("Database connectivity check failed: %v", err)
return false
}
// Execute a simple query to verify functionality
var result int
err = db.QueryRowContext(ctx, "SELECT 1").Scan(&result)
if err != nil || result != 1 {
log.Printf("Database query check failed: %v", err)
return false
}
return true
}
func checkCacheWarmup() bool {
// Implementation of cache warmup check
return true
}
func getCPUUsage() float64 {
// Implementation to get CPU usage
return 0.0
}
func getMemoryUsage() float64 {
// Implementation to get memory usage
return 0.0
}
func getConnectionCount() int {
// Implementation to get connection count
return 0
}
func main() {
// Initialize database connection
var err error
db, err = sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil {
log.Fatalf("Failed to connect to database: %v", err)
}
defer db.Close()
// Configure connection pool
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(5 * time.Minute)
db.SetConnMaxIdleTime(10 * time.Minute)
// Set up HTTP server
http.HandleFunc("/health/readiness", readinessHandler)
http.HandleFunc("/health/liveness", livenessHandler)
http.Handle("/metrics", promhttp.Handler())
// Start HTTP server
log.Fatal(http.ListenAndServe(":8080", nil))
}
func livenessHandler(w http.ResponseWriter, r *http.Request) {
// Simple liveness check
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}
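These handlers only take effect once the Deployment's probes point at them. A minimal wiring sketch follows (thresholds and timings are illustrative assumptions; the paths and port match the Go handlers above):
# Sketch: container probes for the Deployment (values illustrative)
startupProbe:
  httpGet:
    path: /health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # up to 150s for slow startup before liveness applies
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3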
• Implemented predictive scaling:
# predictive_scaler.py
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from kubernetes import client, config
import time
import logging
from datetime import datetime, timedelta
import schedule
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('predictive_scaler')

# Load Kubernetes configuration
try:
    config.load_incluster_config()  # When running inside cluster
except config.ConfigException:
    config.load_kube_config()  # When running locally

# Initialize Kubernetes clients
api_client = client.ApiClient()
custom_api = client.CustomObjectsApi(api_client)
apps_api = client.AppsV1Api(api_client)

# Configuration
NAMESPACE = os.environ.get('NAMESPACE', 'production')
DEPLOYMENT_NAME = os.environ.get('DEPLOYMENT_NAME', 'web-app')
HPA_NAME = os.environ.get('HPA_NAME', 'web-app')
PROMETHEUS_URL = os.environ.get('PROMETHEUS_URL', 'http://prometheus:9090')
PREDICTION_WINDOW_HOURS = int(os.environ.get('PREDICTION_WINDOW_HOURS', '1'))
HISTORY_WINDOW_DAYS = int(os.environ.get('HISTORY_WINDOW_DAYS', '14'))
MIN_REPLICAS = int(os.environ.get('MIN_REPLICAS', '5'))
MAX_REPLICAS = int(os.environ.get('MAX_REPLICAS', '30'))

class PredictiveScaler:
    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=42)
        self.scaler = StandardScaler()
        self.is_model_trained = False

    def fetch_historical_metrics(self):
        """Fetch historical metrics from Prometheus"""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=HISTORY_WINDOW_DAYS)
        # Query Prometheus for historical data.
        # This is a simplified example - in production, use the Prometheus API
        # to fetch real metrics like request rate, CPU usage, etc.
        # For this example, we'll create synthetic data.
        logger.info(f"Fetching historical metrics from {start_time} to {end_time}")
        # Create a date range with hourly intervals
        date_range = pd.date_range(start=start_time, end=end_time, freq='H')
        # Create synthetic metrics
        data = {
            'timestamp': date_range,
            'hour_of_day': [ts.hour for ts in date_range],
            'day_of_week': [ts.dayofweek for ts in date_range],
            'is_weekend': [(ts.dayofweek >= 5) for ts in date_range],
            'request_rate': np.random.normal(1000, 300, size=len(date_range)),
            'cpu_usage': np.random.normal(50, 15, size=len(date_range)),
            'memory_usage': np.random.normal(40, 10, size=len(date_range)),
            'error_rate': np.random.normal(0.5, 0.2, size=len(date_range)),
            'latency_p95': np.random.normal(200, 50, size=len(date_range)),
            'required_replicas': np.random.randint(MIN_REPLICAS, MAX_REPLICAS, size=len(date_range))
        }
        # Add seasonal patterns: increase load during business hours on weekdays
        for i, hour in enumerate(data['hour_of_day']):
            if 9 <= hour <= 17 and data['day_of_week'][i] < 5:
                data['request_rate'][i] *= 1.5
                data['cpu_usage'][i] *= 1.3
                data['required_replicas'][i] = max(MIN_REPLICAS, min(MAX_REPLICAS, int(data['required_replicas'][i] * 1.5)))
        df = pd.DataFrame(data)
        return df

    def train_model(self):
        """Train the predictive model using historical data"""
        logger.info("Training predictive scaling model")
        # Fetch historical data
        df = self.fetch_historical_metrics()
        # Prepare features and target
        features = df[['hour_of_day', 'day_of_week', 'is_weekend',
                       'request_rate', 'cpu_usage', 'memory_usage',
                       'error_rate', 'latency_p95']]
        target = df['required_replicas']
        # Scale features
        scaled_features = self.scaler.fit_transform(features)
        # Train model
        self.model.fit(scaled_features, target)
        self.is_model_trained = True
        logger.info("Model training completed")
        # Evaluate model
        predictions = self.model.predict(scaled_features)
        mse = np.mean((predictions - target) ** 2)
        logger.info(f"Model MSE on training data: {mse:.2f}")
        return mse

    def predict_required_replicas(self):
        """Predict the required number of replicas for the next time window"""
        if not self.is_model_trained:
            logger.warning("Model not trained yet, training now")
            self.train_model()
        # Generate features for prediction window
        prediction_window = []
        current_time = datetime.now()
        for hour_offset in range(PREDICTION_WINDOW_HOURS):
            future_time = current_time + timedelta(hours=hour_offset)
            # Fetch current metrics.
            # In production, get these from Prometheus or other monitoring system.
            current_metrics = {
                'hour_of_day': future_time.hour,
                'day_of_week': future_time.weekday(),
                'is_weekend': future_time.weekday() >= 5,
                'request_rate': 1000,  # Example value
                'cpu_usage': 50,       # Example value
                'memory_usage': 40,    # Example value
                'error_rate': 0.5,     # Example value
                'latency_p95': 200     # Example value
            }
            prediction_window.append(current_metrics)
        # Convert to DataFrame and scale
        prediction_df = pd.DataFrame(prediction_window)
        scaled_prediction = self.scaler.transform(prediction_df)
        # Make prediction
        predicted_replicas = self.model.predict(scaled_prediction)
        # Take the maximum predicted value for the window
        max_predicted_replicas = int(np.ceil(max(predicted_replicas)))
        # Ensure within bounds
        max_predicted_replicas = max(MIN_REPLICAS, min(MAX_REPLICAS, max_predicted_replicas))
        logger.info(f"Predicted required replicas for next {PREDICTION_WINDOW_HOURS} hour(s): {max_predicted_replicas}")
        return max_predicted_replicas

    def update_hpa(self, predicted_replicas):
        """Update the HPA min replicas based on prediction"""
        try:
            # Get current HPA
            hpa = custom_api.get_namespaced_custom_object(
                group="autoscaling",
                version="v2",
                namespace=NAMESPACE,
                plural="horizontalpodautoscalers",
                name=HPA_NAME
            )
            current_min_replicas = hpa['spec']['minReplicas']
            if current_min_replicas != predicted_replicas:
                logger.info(f"Updating HPA min replicas from {current_min_replicas} to {predicted_replicas}")
                # Update HPA
                hpa['spec']['minReplicas'] = predicted_replicas
                custom_api.patch_namespaced_custom_object(
                    group="autoscaling",
                    version="v2",
                    namespace=NAMESPACE,
                    plural="horizontalpodautoscalers",
                    name=HPA_NAME,
                    body=hpa
                )
                logger.info("HPA updated successfully")
            else:
                logger.info(f"No update needed, current min replicas already set to {current_min_replicas}")
        except Exception as e:
            logger.error(f"Error updating HPA: {e}")

    def run_scaling_cycle(self):
        """Run a complete predictive scaling cycle"""
        try:
            logger.info("Starting predictive scaling cycle")
            # Predict required replicas
            predicted_replicas = self.predict_required_replicas()
            # Update HPA
            self.update_hpa(predicted_replicas)
            logger.info("Predictive scaling cycle completed")
        except Exception as e:
            logger.error(f"Error in scaling cycle: {e}")

def main():
    scaler = PredictiveScaler()
    # Initial training
    scaler.train_model()
    # Run immediately
    scaler.run_scaling_cycle()
    # Schedule regular runs
    schedule.every(15).minutes.do(scaler.run_scaling_cycle)
    schedule.every(6).hours.do(scaler.train_model)  # Retrain model periodically
    logger.info("Predictive scaler started, running every 15 minutes")
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()
Lessons Learned:
Auto-scaling requires careful configuration and application design to be effective.
How to Avoid:
Optimize application startup time to reduce scaling lag.
Implement proper connection pooling and resource management.
Configure more sophisticated readiness probes.
Set more conservative scaling thresholds.
Consider predictive scaling for workloads with predictable patterns.
No summary provided
What Happened:
A database performance issue in one region cascaded into a global service outage. Despite having SRE teams in multiple regions, the incident response was uncoordinated, leading to conflicting remediation attempts and significantly extended downtime.
Diagnosis Steps:
Reviewed incident timeline and communication logs.
Analyzed actions taken by each regional SRE team.
Examined monitoring data and alerts across regions.
Interviewed team members about decision-making processes.
Reviewed existing incident response procedures.
Root Cause:
Multiple issues contributed to the coordination failure:
1. No clear incident command structure across regions.
2. Inconsistent tooling and procedures between regional teams.
3. Communication channels were fragmented and not centralized.
4. No shared visibility into remediation actions being taken.
5. Lack of predefined escalation paths and decision authority.
Fix/Workaround:
• Short-term: Implemented an emergency incident command structure:
# incident_roles.yaml - Emergency incident response roles
roles:
  incident_commander:
    description: "Single point of decision-making authority during incidents"
    responsibilities:
      - Declare incident severity and manage overall response
      - Assign roles to team members
      - Make final decisions on remediation actions
      - Coordinate communication across teams
      - Declare incident resolution
    rotation:
      - name: "Alex Chen"
        region: "us-west"
        contact: "alex.chen@example.com, +1-555-123-4567"
      - name: "Priya Sharma"
        region: "eu-central"
        contact: "priya.sharma@example.com, +44-555-987-6543"
      - name: "Jamal Washington"
        region: "ap-southeast"
        contact: "jamal.washington@example.com, +65-555-789-0123"
  communications_lead:
    description: "Manages all external and internal communications"
    responsibilities:
      - Provide regular status updates to stakeholders
      - Coordinate with customer support
      - Draft and send customer communications
      - Document timeline of events
    rotation:
      - name: "Sarah Johnson"
        region: "us-west"
        contact: "sarah.johnson@example.com, +1-555-234-5678"
      - name: "Marco Rossi"
        region: "eu-central"
        contact: "marco.rossi@example.com, +39-555-876-5432"
      - name: "Lin Wei"
        region: "ap-southeast"
        contact: "lin.wei@example.com, +86-555-678-9012"
  operations_lead:
    description: "Coordinates technical response actions"
    responsibilities:
      - Direct technical remediation efforts
      - Coordinate between engineering teams
      - Manage resource allocation during incident
      - Ensure technical actions are documented
    rotation:
      - name: "David Kim"
        region: "us-west"
        contact: "david.kim@example.com, +1-555-345-6789"
      - name: "Sophia Mueller"
        region: "eu-central"
        contact: "sophia.mueller@example.com, +49-555-765-4321"
      - name: "Raj Patel"
        region: "ap-southeast"
        contact: "raj.patel@example.com, +91-555-567-8901"
• Created a centralized incident response tool:
// incident_manager.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"sync"
"time"
"github.com/gorilla/mux"
"github.com/gorilla/websocket"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
// Incident represents an ongoing incident
type Incident struct {
ID string `json:"id" bson:"_id"`
Title string `json:"title" bson:"title"`
Description string `json:"description" bson:"description"`
Status string `json:"status" bson:"status"` // "active", "mitigated", "resolved"
Severity int `json:"severity" bson:"severity"` // 1-5, with 1 being most severe
StartTime time.Time `json:"startTime" bson:"startTime"`
MitigationTime *time.Time `json:"mitigationTime,omitempty" bson:"mitigationTime,omitempty"`
ResolutionTime *time.Time `json:"resolutionTime,omitempty" bson:"resolutionTime,omitempty"`
AffectedServices []string `json:"affectedServices" bson:"affectedServices"`
AffectedRegions []string `json:"affectedRegions" bson:"affectedRegions"`
Commander string `json:"commander" bson:"commander"`
CommsLead string `json:"commsLead" bson:"commsLead"`
OpsLead string `json:"opsLead" bson:"opsLead"`
Timeline []TimelineEntry `json:"timeline" bson:"timeline"`
Actions []RemediationAction `json:"actions" bson:"actions"`
Metadata map[string]interface{} `json:"metadata" bson:"metadata"`
}
// TimelineEntry represents an entry in the incident timeline
type TimelineEntry struct {
Timestamp time.Time `json:"timestamp" bson:"timestamp"`
Message string `json:"message" bson:"message"`
Author string `json:"author" bson:"author"`
Type string `json:"type" bson:"type"` // "info", "action", "decision", "milestone"
}
// RemediationAction represents an action taken to remediate the incident
type RemediationAction struct {
ID string `json:"id" bson:"id"`
Description string `json:"description" bson:"description"`
Status string `json:"status" bson:"status"` // "proposed", "in-progress", "completed", "abandoned"
Owner string `json:"owner" bson:"owner"`
Region string `json:"region" bson:"region"`
CreatedAt time.Time `json:"createdAt" bson:"createdAt"`
StartedAt *time.Time `json:"startedAt,omitempty" bson:"startedAt,omitempty"`
CompletedAt *time.Time `json:"completedAt,omitempty" bson:"completedAt,omitempty"`
Impact string `json:"impact" bson:"impact"` // Description of potential impact
Approved bool `json:"approved" bson:"approved"`
ApprovedBy string `json:"approvedBy,omitempty" bson:"approvedBy,omitempty"`
}
// IncidentManager manages incidents
type IncidentManager struct {
db *mongo.Database
incidents *mongo.Collection
activeIncident *Incident
mutex sync.RWMutex
clients map[*websocket.Conn]bool
broadcast chan []byte
upgrader websocket.Upgrader
}
// NewIncidentManager creates a new incident manager
func NewIncidentManager(mongoURI string) (*IncidentManager, error) {
// Connect to MongoDB
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
client, err := mongo.Connect(ctx, options.Client().ApplyURI(mongoURI))
if err != nil {
return nil, fmt.Errorf("failed to connect to MongoDB: %w", err)
}
// Ping the database
if err := client.Ping(ctx, nil); err != nil {
return nil, fmt.Errorf("failed to ping MongoDB: %w", err)
}
db := client.Database("incident_manager")
incidents := db.Collection("incidents")
// Create indexes
indexModel := mongo.IndexModel{
Keys: bson.D{
{Key: "status", Value: 1},
{Key: "startTime", Value: -1},
},
}
_, err = incidents.Indexes().CreateOne(ctx, indexModel)
if err != nil {
return nil, fmt.Errorf("failed to create index: %w", err)
}
return &IncidentManager{
db: db,
incidents: incidents,
clients: make(map[*websocket.Conn]bool),
broadcast: make(chan []byte),
upgrader: websocket.Upgrader{
ReadBufferSize: 1024,
WriteBufferSize: 1024,
CheckOrigin: func(r *http.Request) bool {
return true // Allow all connections in this example
},
},
}, nil
}
// Start starts the incident manager
func (im *IncidentManager) Start() {
// Start the broadcaster
go im.broadcaster()
// Load active incident if any
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"status": "active"}).Decode(&incident)
if err == nil {
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
log.Printf("Loaded active incident: %s", incident.ID)
} else if err != mongo.ErrNoDocuments {
log.Printf("Error loading active incident: %v", err)
}
}
// broadcaster broadcasts messages to all connected clients
func (im *IncidentManager) broadcaster() {
for {
message := <-im.broadcast
for client := range im.clients {
err := client.WriteMessage(websocket.TextMessage, message)
if err != nil {
log.Printf("Error broadcasting message: %v", err)
client.Close()
delete(im.clients, client)
}
}
}
}
// handleWebSocket handles WebSocket connections
func (im *IncidentManager) handleWebSocket(w http.ResponseWriter, r *http.Request) {
conn, err := im.upgrader.Upgrade(w, r, nil)
if err != nil {
log.Printf("Error upgrading to WebSocket: %v", err)
return
}
// Register client
im.clients[conn] = true
// Send current incident state
im.mutex.RLock()
if im.activeIncident != nil {
data, err := json.Marshal(im.activeIncident)
if err == nil {
conn.WriteMessage(websocket.TextMessage, data)
}
}
im.mutex.RUnlock()
// Handle incoming messages
go func() {
defer func() {
conn.Close()
delete(im.clients, conn)
}()
for {
_, message, err := conn.ReadMessage()
if err != nil {
if websocket.IsUnexpectedCloseError(err, websocket.CloseGoingAway, websocket.CloseAbnormalClosure) {
log.Printf("WebSocket error: %v", err)
}
break
}
// Process message (e.g., update incident, add timeline entry, etc.)
var data map[string]interface{}
if err := json.Unmarshal(message, &data); err != nil {
log.Printf("Error unmarshaling message: %v", err)
continue
}
// Handle different message types
messageType, ok := data["type"].(string)
if !ok {
continue
}
switch messageType {
case "add_timeline":
im.handleAddTimeline(data)
case "add_action":
im.handleAddAction(data)
case "update_action":
im.handleUpdateAction(data)
case "update_incident":
im.handleUpdateIncident(data)
}
}
}()
}
// handleAddTimeline handles adding a timeline entry
func (im *IncidentManager) handleAddTimeline(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleAddAction handles adding a remediation action
func (im *IncidentManager) handleAddAction(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleUpdateAction handles updating a remediation action
func (im *IncidentManager) handleUpdateAction(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleUpdateIncident handles updating incident details
func (im *IncidentManager) handleUpdateIncident(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// createIncident creates a new incident
func (im *IncidentManager) createIncident(w http.ResponseWriter, r *http.Request) {
var incident Incident
if err := json.NewDecoder(r.Body).Decode(&incident); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
// Set incident fields
incident.ID = fmt.Sprintf("INC-%d", time.Now().Unix())
incident.Status = "active"
incident.StartTime = time.Now()
incident.Timeline = []TimelineEntry{
{
Timestamp: time.Now(),
Message: "Incident created",
Author: "System",
Type: "milestone",
},
}
incident.Actions = []RemediationAction{}
// Save to database
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_, err := im.incidents.InsertOne(ctx, incident)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// Set as active incident
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
// Broadcast to all clients
data, _ := json.Marshal(incident)
im.broadcast <- data
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incident)
}
// getIncident gets an incident by ID
func (im *IncidentManager) getIncident(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"_id": id}).Decode(&incident)
if err != nil {
if err == mongo.ErrNoDocuments {
http.Error(w, "Incident not found", http.StatusNotFound)
} else {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incident)
}
// listIncidents lists all incidents
func (im *IncidentManager) listIncidents(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
findOptions := options.Find()
findOptions.SetSort(bson.D{{Key: "startTime", Value: -1}})
findOptions.SetLimit(100)
cursor, err := im.incidents.Find(ctx, bson.M{}, findOptions)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
defer cursor.Close(ctx)
var incidents []Incident
if err := cursor.All(ctx, &incidents); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incidents)
}
// addTimelineEntry adds a timeline entry to an incident
func (im *IncidentManager) addTimelineEntry(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var entry TimelineEntry
if err := json.NewDecoder(r.Body).Decode(&entry); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
entry.Timestamp = time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$push": bson.M{
"timeline": entry,
},
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
im.mutex.Lock()
im.activeIncident.Timeline = append(im.activeIncident.Timeline, entry)
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
w.WriteHeader(http.StatusCreated)
}
// addRemediationAction adds a remediation action to an incident
func (im *IncidentManager) addRemediationAction(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var action RemediationAction
if err := json.NewDecoder(r.Body).Decode(&action); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
action.ID = fmt.Sprintf("ACT-%d", time.Now().Unix())
action.Status = "proposed"
action.CreatedAt = time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$push": bson.M{
"actions": action,
},
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
im.mutex.Lock()
im.activeIncident.Actions = append(im.activeIncident.Actions, action)
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
}
// updateRemediationAction updates a remediation action
func (im *IncidentManager) updateRemediationAction(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
actionID := vars["actionID"]
var updateData map[string]interface{}
if err := json.NewDecoder(r.Body).Decode(&updateData); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
// Build update document
updateFields := bson.M{}
for key, value := range updateData {
updateFields["actions.$."+key] = value
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$set": updateFields,
}
result, err := im.incidents.UpdateOne(
ctx,
bson.M{
"_id": id,
"actions.id": actionID,
},
update,
)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident or action not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
// Reload the incident
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"_id": id}).Decode(&incident)
if err == nil {
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
}
w.WriteHeader(http.StatusOK)
}
// updateIncidentStatus updates the status of an incident
func (im *IncidentManager) updateIncidentStatus(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var updateData struct {
Status string `json:"status"`
}
if err := json.NewDecoder(r.Body).Decode(&updateData); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$set": bson.M{
"status": updateData.Status,
},
}
// Add appropriate timestamp based on status
now := time.Now()
if updateData.Status == "mitigated" {
update["$set"].(bson.M)["mitigationTime"] = now
} else if updateData.Status == "resolved" {
update["$set"].(bson.M)["resolutionTime"] = now
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident and it's being resolved, clear it
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
if updateData.Status == "mitigated" {
im.mutex.Lock()
mitigationTime := now
im.activeIncident.Status = "mitigated"
im.activeIncident.MitigationTime = &mitigationTime
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
} else if updateData.Status == "resolved" {
im.mutex.Lock()
im.activeIncident = nil
im.mutex.Unlock()
// Broadcast incident closure
data, _ := json.Marshal(map[string]interface{}{
"type": "incident_closed",
"id": id,
"message": "Incident has been resolved",
})
im.broadcast <- data
}
}
w.WriteHeader(http.StatusOK)
}
func main() {
// Create incident manager
im, err := NewIncidentManager("mongodb://localhost:27017")
if err != nil {
log.Fatalf("Failed to create incident manager: %v", err)
}
im.Start()
// Create router
r := mux.NewRouter()
// API routes
r.HandleFunc("/api/incidents", im.createIncident).Methods("POST")
r.HandleFunc("/api/incidents", im.listIncidents).Methods("GET")
r.HandleFunc("/api/incidents/{id}", im.getIncident).Methods("GET")
r.HandleFunc("/api/incidents/{id}/timeline", im.addTimelineEntry).Methods("POST")
r.HandleFunc("/api/incidents/{id}/actions", im.addRemediationAction).Methods("POST")
r.HandleFunc("/api/incidents/{id}/actions/{actionID}", im.updateRemediationAction).Methods("PUT")
r.HandleFunc("/api/incidents/{id}/status", im.updateIncidentStatus).Methods("PUT")
// WebSocket endpoint
r.HandleFunc("/ws", im.handleWebSocket)
// Serve static files
r.PathPrefix("/").Handler(http.FileServer(http.Dir("./static")))
// Start server
log.Println("Starting server on :8080")
log.Fatal(http.ListenAndServe(":8080", r))
}
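For illustration, a minimal Go client sketch against the API above (the service hostname and all field values are assumptions; the endpoint and incident fields come from the handlers and struct defined earlier):
// report_incident.go - sketch: opening an incident via the central API
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	incident := map[string]interface{}{
		"title":            "Database latency in us-west",
		"description":      "p99 query latency above 2s; replication lag growing",
		"severity":         2,
		"affectedServices": []string{"database-service"},
		"affectedRegions":  []string{"us-west"},
		"commander":        "Alex Chen",
	}
	body, err := json.Marshal(incident)
	if err != nil {
		log.Fatalf("marshal: %v", err)
	}
	// POST /api/incidents creates the incident and broadcasts it to all
	// connected WebSocket clients, giving every region the same live view.
	resp, err := http.Post("http://incident-manager:8080/api/incidents",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("create incident: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("incident created: HTTP %s", resp.Status)
}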
• Long-term: Implemented a comprehensive incident management framework:
- Established a global incident command structure with clear roles and responsibilities
- Standardized tooling and procedures across all regions
- Created a centralized incident management platform for real-time coordination
- Implemented regular cross-region incident response drills
- Developed a post-incident review process with actionable improvements
Lessons Learned:
Effective incident response requires clear command structure and coordination across teams.
How to Avoid:
Establish a clear incident command structure with defined roles and escalation paths (see the sketch after this list).
Standardize tools and procedures across all teams and regions.
Implement centralized communication channels for incident coordination.
Conduct regular cross-region incident response drills.
Document and share lessons learned from each incident.
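For illustration, a sketch of predefined escalation paths in the same YAML convention as incident_roles.yaml above (the schema, timings, and channel name are assumptions, not an existing format):
# escalation_paths.yaml - sketch, companion to incident_roles.yaml
escalation:
  severity_1:
    page_immediately: [incident_commander, operations_lead, communications_lead]
    decision_authority: incident_commander
    status_update_interval: 15m
  severity_2:
    page_immediately: [operations_lead]
    escalate_to: severity_1
    escalate_after: 30m
  default:
    channel: "#incident-response"   # single, centralized coordination channel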
No summary provided
What Happened:
An SRE team implemented a chaos engineering experiment to test the resilience of a critical microservices application. The experiment was designed to randomly terminate pods to ensure the system could handle unexpected failures. However, when executed, it caused a cascading failure that took down the entire production environment for over an hour, affecting thousands of users.
Diagnosis Steps:
Analyzed chaos experiment configuration and execution logs.
Reviewed system metrics and alerts during the incident.
Examined application logs for error patterns.
Checked Kubernetes events and pod status changes.
Investigated network and service mesh configurations.
Root Cause:
The investigation revealed multiple issues:
1. The chaos experiment targeted critical infrastructure components that weren't properly excluded.
2. The blast radius wasn't properly limited, allowing too many simultaneous failures.
3. The application's retry mechanisms created a thundering herd problem.
4. Circuit breakers were misconfigured and didn't trip as expected.
5. Monitoring didn't detect the early warning signs of the cascade.
Fix/Workaround:
• Short-term: Implemented immediate safeguards for chaos experiments:
# Chaos Mesh experiment with proper safeguards
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-demo
  namespace: chaos-testing
  annotations:
    experiment.chaos-mesh.org/pause: "false"
    experiment.chaos-mesh.org/duration: "30s"
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - app
    labelSelectors:
      "app.kubernetes.io/component": "worker"
  scheduler:
    cron: "@every 5m"
  # Critical safeguards
  containerNames: ["worker"]
  gracePeriod: 30
  maxPercentage: 20
  excludedPods:
    - app-leader-0
    - app-database-0
  excludedNamespaces:
    - kube-system
    - monitoring
    - istio-system
• Long-term: Implemented a comprehensive chaos engineering framework in Go:
// chaos_framework.go - Safe chaos engineering framework
package chaos
import (
	"context"
	"fmt"
	"math/rand"
	"strings"
	"sync"
	"time"

	"github.com/sirupsen/logrus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)
// ExperimentConfig defines the configuration for a chaos experiment
type ExperimentConfig struct {
// Name of the experiment
Name string `json:"name"`
// Description of what the experiment tests
Description string `json:"description"`
// Target namespaces to include in the experiment
TargetNamespaces []string `json:"targetNamespaces"`
// Namespaces to exclude from the experiment
ExcludedNamespaces []string `json:"excludedNamespaces"`
// Label selectors to target specific resources
LabelSelectors map[string]string `json:"labelSelectors"`
// Resources to exclude by name
ExcludedResources []string `json:"excludedResources"`
// Maximum percentage of resources to affect at once (0-100)
MaxPercentage int `json:"maxPercentage"`
// Duration of the experiment
Duration time.Duration `json:"duration"`
// Interval between actions
Interval time.Duration `json:"interval"`
// Whether to run the experiment in dry-run mode
DryRun bool `json:"dryRun"`
// Critical service indicators to monitor during experiment
CSIThresholds map[string]float64 `json:"csiThresholds"`
// Automatic rollback if thresholds are exceeded
AutoRollback bool `json:"autoRollback"`
// Notification channels for experiment events
NotificationChannels []string `json:"notificationChannels"`
// Experiment actions to perform
Actions []ExperimentAction `json:"actions"`
}
// ExperimentAction defines a specific chaos action
type ExperimentAction struct {
// Type of action (pod-kill, network-delay, cpu-stress, etc.)
Type string `json:"type"`
// Action-specific parameters
Parameters map[string]interface{} `json:"parameters"`
// Weight of this action (for random selection)
Weight int `json:"weight"`
}
// ExperimentResult contains the results of a chaos experiment
type ExperimentResult struct {
// Name of the experiment
ExperimentName string `json:"experimentName"`
// Start time of the experiment
StartTime time.Time `json:"startTime"`
// End time of the experiment
EndTime time.Time `json:"endTime"`
// Whether the experiment completed successfully
Successful bool `json:"successful"`
// Reason for failure if not successful
FailureReason string `json:"failureReason,omitempty"`
// Resources affected by the experiment
AffectedResources []string `json:"affectedResources"`
// Metrics collected during the experiment
Metrics map[string][]MetricDataPoint `json:"metrics"`
// Events that occurred during the experiment
Events []ExperimentEvent `json:"events"`
}
// MetricDataPoint represents a single metric measurement
type MetricDataPoint struct {
// Timestamp of the measurement
Timestamp time.Time `json:"timestamp"`
// Value of the metric
Value float64 `json:"value"`
}
// ExperimentEvent represents a significant event during the experiment
type ExperimentEvent struct {
// Timestamp of the event
Timestamp time.Time `json:"timestamp"`
// Type of event
Type string `json:"type"`
// Description of the event
Description string `json:"description"`
// Severity of the event (info, warning, error)
Severity string `json:"severity"`
}
// ChaosFramework orchestrates chaos engineering experiments
type ChaosFramework struct {
// Kubernetes client
client *kubernetes.Clientset
// Logger
logger *logrus.Logger
// Metrics collector
metricsCollector MetricsCollector
// Active experiments
activeExperiments map[string]*runningExperiment
// Mutex for thread safety
mu sync.Mutex
}
// runningExperiment tracks an active experiment
type runningExperiment struct {
// Configuration of the experiment
config ExperimentConfig
// Context for cancellation
ctx context.Context
// Cancel function
cancel context.CancelFunc
// Start time
startTime time.Time
// Affected resources
affectedResources []string
// Events
events []ExperimentEvent
}
// MetricsCollector interface for collecting metrics
type MetricsCollector interface {
// CollectMetrics collects metrics for the given resources
CollectMetrics(ctx context.Context, resources []string) (map[string]float64, error)
// StartCollection starts continuous metric collection
StartCollection(ctx context.Context, resources []string, interval time.Duration) error
// StopCollection stops continuous metric collection
StopCollection() (map[string][]MetricDataPoint, error)
}
// NewChaosFramework creates a new chaos engineering framework
func NewChaosFramework(metricsCollector MetricsCollector) (*ChaosFramework, error) {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to get cluster config: %w", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
}
// Create logger
logger := logrus.New()
logger.SetFormatter(&logrus.JSONFormatter{})
return &ChaosFramework{
client: clientset,
logger: logger,
metricsCollector: metricsCollector,
activeExperiments: make(map[string]*runningExperiment),
}, nil
}
// RunExperiment runs a chaos experiment
func (c *ChaosFramework) RunExperiment(config ExperimentConfig) (*ExperimentResult, error) {
c.mu.Lock()
defer c.mu.Unlock()
// Validate experiment configuration
if err := c.validateExperimentConfig(config); err != nil {
return nil, fmt.Errorf("invalid experiment configuration: %w", err)
}
// Check if experiment is already running
if _, exists := c.activeExperiments[config.Name]; exists {
return nil, fmt.Errorf("experiment %s is already running", config.Name)
}
// Create context with cancellation
ctx, cancel := context.WithTimeout(context.Background(), config.Duration)
// Create running experiment
experiment := &runningExperiment{
config: config,
ctx: ctx,
cancel: cancel,
startTime: time.Now(),
events: []ExperimentEvent{
{
Timestamp: time.Now(),
Type: "ExperimentStarted",
Description: fmt.Sprintf("Started experiment %s", config.Name),
Severity: "info",
},
},
}
// Store active experiment
c.activeExperiments[config.Name] = experiment
// Log experiment start
c.logger.WithFields(logrus.Fields{
"experiment": config.Name,
"duration": config.Duration.String(),
"dry_run": config.DryRun,
}).Info("Starting chaos experiment")
// Start experiment in a goroutine
go func() {
defer cancel()
// Start metrics collection
targetResources, err := c.getTargetResources(config)
if err != nil {
c.recordEvent(config.Name, "ResourceSelectionFailed", err.Error(), "error")
return
}
if err := c.metricsCollector.StartCollection(ctx, targetResources, 10*time.Second); err != nil {
c.recordEvent(config.Name, "MetricsCollectionFailed", err.Error(), "warning")
}
// Run experiment actions
c.runExperimentActions(experiment, targetResources)
// Wait for experiment completion or cancellation
<-ctx.Done()
// Stop metrics collection
metrics, _ := c.metricsCollector.StopCollection()
// Create experiment result
result := &ExperimentResult{
ExperimentName: config.Name,
StartTime: experiment.startTime,
EndTime: time.Now(),
Successful: ctx.Err() == context.DeadlineExceeded, // Completed normally
AffectedResources: experiment.affectedResources,
Metrics: metrics,
Events: experiment.events,
}
if ctx.Err() != context.DeadlineExceeded {
result.FailureReason = ctx.Err().Error()
}
// Log experiment completion
c.logger.WithFields(logrus.Fields{
"experiment": config.Name,
"duration": result.EndTime.Sub(result.StartTime).String(),
"successful": result.Successful,
}).Info("Completed chaos experiment")
// Remove from active experiments
c.mu.Lock()
delete(c.activeExperiments, config.Name)
c.mu.Unlock()
}()
return &ExperimentResult{
ExperimentName: config.Name,
StartTime: experiment.startTime,
}, nil
}
// StopExperiment stops a running experiment
func (c *ChaosFramework) StopExperiment(name string) error {
c.mu.Lock()
defer c.mu.Unlock()
experiment, exists := c.activeExperiments[name]
if !exists {
return fmt.Errorf("experiment %s is not running", name)
}
// Cancel experiment
experiment.cancel()
// Record event
c.recordEvent(name, "ExperimentStopped", "Experiment was manually stopped", "warning")
return nil
}
// GetActiveExperiments returns the names of all active experiments
func (c *ChaosFramework) GetActiveExperiments() []string {
c.mu.Lock()
defer c.mu.Unlock()
names := make([]string, 0, len(c.activeExperiments))
for name := range c.activeExperiments {
names = append(names, name)
}
return names
}
// validateExperimentConfig validates the experiment configuration
func (c *ChaosFramework) validateExperimentConfig(config ExperimentConfig) error {
// Check required fields
if config.Name == "" {
return fmt.Errorf("experiment name is required")
}
if len(config.TargetNamespaces) == 0 && len(config.LabelSelectors) == 0 {
return fmt.Errorf("at least one target namespace or label selector is required")
}
if config.Duration <= 0 {
return fmt.Errorf("experiment duration must be positive")
}
if config.MaxPercentage <= 0 || config.MaxPercentage > 100 {
return fmt.Errorf("max percentage must be between 1 and 100")
}
// Validate actions
if len(config.Actions) == 0 {
return fmt.Errorf("at least one action is required")
}
for i, action := range config.Actions {
if action.Type == "" {
return fmt.Errorf("action %d has no type", i)
}
if action.Weight <= 0 {
return fmt.Errorf("action %d has invalid weight", i)
}
// Validate action-specific parameters
switch action.Type {
case "pod-kill":
// No specific parameters required
case "network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("network-delay action requires latency parameter")
}
case "cpu-stress":
if _, ok := action.Parameters["load"]; !ok {
return fmt.Errorf("cpu-stress action requires load parameter")
}
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
// Check for critical namespaces
criticalNamespaces := []string{"kube-system", "kube-public", "kube-node-lease"}
for _, ns := range config.TargetNamespaces {
for _, critical := range criticalNamespaces {
if ns == critical && !contains(config.ExcludedNamespaces, critical) {
return fmt.Errorf("targeting critical namespace %s without excluding it", critical)
}
}
}
return nil
}
// getTargetResources gets the resources targeted by the experiment
func (c *ChaosFramework) getTargetResources(config ExperimentConfig) ([]string, error) {
var resources []string
// Get pods matching the criteria
for _, namespace := range config.TargetNamespaces {
// Skip excluded namespaces
if contains(config.ExcludedNamespaces, namespace) {
continue
}
// Create label selector string
var labelSelector string
for k, v := range config.LabelSelectors {
if labelSelector != "" {
labelSelector += ","
}
labelSelector += fmt.Sprintf("%s=%s", k, v)
}
// List pods
pods, err := c.client.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
return nil, fmt.Errorf("failed to list pods in namespace %s: %w", namespace, err)
}
// Add pods to resources
for _, pod := range pods.Items {
// Skip excluded resources
if contains(config.ExcludedResources, pod.Name) {
continue
}
resources = append(resources, fmt.Sprintf("%s/%s", namespace, pod.Name))
}
}
// Apply max percentage
maxCount := len(resources) * config.MaxPercentage / 100
if maxCount < 1 {
maxCount = 1
}
// Shuffle and limit resources
rand.Shuffle(len(resources), func(i, j int) {
resources[i], resources[j] = resources[j], resources[i]
})
if len(resources) > maxCount {
resources = resources[:maxCount]
}
return resources, nil
}
// runExperimentActions runs the experiment actions
func (c *ChaosFramework) runExperimentActions(experiment *runningExperiment, targetResources []string) {
config := experiment.config
ctx := experiment.ctx
// Calculate total weight
totalWeight := 0
for _, action := range config.Actions {
totalWeight += action.Weight
}
// Run actions at intervals
ticker := time.NewTicker(config.Interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
// Select random action based on weight
actionIndex := -1
randomWeight := rand.Intn(totalWeight)
weightSum := 0
for i, action := range config.Actions {
weightSum += action.Weight
if randomWeight < weightSum {
actionIndex = i
break
}
}
if actionIndex == -1 {
continue
}
action := config.Actions[actionIndex]
// Select random resource
if len(targetResources) == 0 {
continue
}
resourceIndex := rand.Intn(len(targetResources))
resource := targetResources[resourceIndex]
// Execute action
if err := c.executeAction(ctx, action, resource, config.DryRun); err != nil {
c.recordEvent(config.Name, "ActionFailed",
fmt.Sprintf("Failed to execute action %s on resource %s: %v", action.Type, resource, err),
"error")
continue
}
// Record affected resource
experiment.affectedResources = append(experiment.affectedResources, resource)
// Record event
c.recordEvent(config.Name, "ActionExecuted",
fmt.Sprintf("Executed action %s on resource %s", action.Type, resource),
"info")
// Check metrics for abort conditions
metrics, err := c.metricsCollector.CollectMetrics(ctx, []string{resource})
if err != nil {
c.logger.WithError(err).Warn("Failed to collect metrics")
continue
}
// Check thresholds
for metric, threshold := range config.CSIThresholds {
if value, ok := metrics[metric]; ok && value > threshold {
c.logger.WithFields(logrus.Fields{
"metric": metric,
"value": value,
"threshold": threshold,
}).Warn("Metric threshold exceeded")
if config.AutoRollback {
c.recordEvent(config.Name, "ThresholdExceeded",
fmt.Sprintf("Metric %s exceeded threshold (%.2f > %.2f)", metric, value, threshold),
"warning")
// Cancel experiment
experiment.cancel()
return
}
}
}
case <-ctx.Done():
return
}
}
}
// executeAction executes a chaos action on a resource
func (c *ChaosFramework) executeAction(ctx context.Context, action ExperimentAction, resource string, dryRun bool) error {
// Parse namespace and name
parts := splitResource(resource)
if len(parts) != 2 {
return fmt.Errorf("invalid resource format: %s", resource)
}
namespace, name := parts[0], parts[1]
// Execute action based on type
switch action.Type {
case "pod-kill":
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "pod-kill",
"namespace": namespace,
"pod": name,
"dry_run": true,
}).Info("Would delete pod")
return nil
}
return c.client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
case "network-delay":
latency, ok := action.Parameters["latency"].(string)
if !ok {
return fmt.Errorf("invalid latency parameter")
}
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "network-delay",
"namespace": namespace,
"pod": name,
"latency": latency,
"dry_run": true,
}).Info("Would add network delay")
return nil
}
// In a real implementation, this would use a CNI plugin or sidecar to add network delay (see the exec sketch after the helper functions below)
c.logger.WithFields(logrus.Fields{
"action": "network-delay",
"namespace": namespace,
"pod": name,
"latency": latency,
}).Info("Adding network delay")
return nil
case "cpu-stress":
load, ok := action.Parameters["load"].(float64)
if !ok {
return fmt.Errorf("invalid load parameter")
}
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "cpu-stress",
"namespace": namespace,
"pod": name,
"load": load,
"dry_run": true,
}).Info("Would add CPU stress")
return nil
}
// In a real implementation, this would inject a stress test into the pod
c.logger.WithFields(logrus.Fields{
"action": "cpu-stress",
"namespace": namespace,
"pod": name,
"load": load,
}).Info("Adding CPU stress")
return nil
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
// recordEvent records an experiment event
func (c *ChaosFramework) recordEvent(experimentName, eventType, description, severity string) {
c.mu.Lock()
defer c.mu.Unlock()
experiment, exists := c.activeExperiments[experimentName]
if !exists {
return
}
event := ExperimentEvent{
Timestamp: time.Now(),
Type: eventType,
Description: description,
Severity: severity,
}
experiment.events = append(experiment.events, event)
c.logger.WithFields(logrus.Fields{
"experiment": experimentName,
"event_type": eventType,
"severity": severity,
}).Info(description)
}
// Helper functions
// contains checks if a string is in a slice
func contains(slice []string, item string) bool {
for _, s := range slice {
if s == item {
return true
}
}
return false
}
// splitResource splits a resource string into namespace and name
func splitResource(resource string) []string {
return strings.Split(resource, "/")
}
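The network-delay and cpu-stress branches above only log what a real injection would do. The sketch below shows one way to wire the injection up with client-go's exec subresource; the execInPod helper, the restConfig field, and the tc command are illustrative assumptions, not part of the framework above. The target container must ship the tc binary and hold NET_ADMIN, and the service account would additionally need the pods/exec verb, which the RBAC policy below does not grant by default.
// execInPod is a hypothetical helper sketching how executeAction could run a
// command (for example "tc qdisc add dev eth0 root netem delay 100ms") inside
// a target pod. It assumes the framework also stores the *rest.Config used to
// build c.client in a restConfig field, plus these imports: bytes,
// corev1 "k8s.io/api/core/v1", "k8s.io/client-go/kubernetes/scheme",
// "k8s.io/client-go/tools/remotecommand".
func (c *ChaosFramework) execInPod(ctx context.Context, namespace, pod, container string, command []string) error {
    req := c.client.CoreV1().RESTClient().Post().
        Resource("pods").
        Namespace(namespace).
        Name(pod).
        SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: container,
            Command:   command,
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(c.restConfig, "POST", req.URL())
    if err != nil {
        return fmt.Errorf("failed to create executor: %w", err)
    }

    var stdout, stderr bytes.Buffer
    if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
        Stdout: &stdout,
        Stderr: &stderr,
    }); err != nil {
        return fmt.Errorf("exec failed: %w (stderr: %s)", err, stderr.String())
    }
    return nil
}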
• Created a comprehensive chaos testing policy:
# chaos-testing-policy.yaml
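# NOTE: PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25;
# on newer clusters the equivalent constraints would be enforced with the
# built-in Pod Security Admission controller or a policy engine such as
# Kyverno or OPA Gatekeeper.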
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: chaos-testing-psp
annotations:
seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
readOnlyRootFilesystem: true
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: chaos-testing-role
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "patch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list"]
- apiGroups: ["metrics.k8s.io"]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: chaos-testing-binding
subjects:
- kind: ServiceAccount
name: chaos-testing-sa
namespace: chaos-testing
roleRef:
kind: ClusterRole
name: chaos-testing-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Namespace
metadata:
name: chaos-testing
labels:
name: chaos-testing
purpose: chaos-engineering
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: chaos-testing-sa
namespace: chaos-testing
---
apiVersion: v1
kind: ConfigMap
metadata:
name: chaos-testing-config
namespace: chaos-testing
data:
config.yaml: |
# Global chaos testing configuration
defaultExperimentDuration: 10m
defaultInterval: 30s
maxConcurrentExperiments: 1
notificationChannels:
- slack
- email
slackWebhook: "${SLACK_WEBHOOK}"
emailRecipients:
- sre-team@example.com
# Default safeguards
defaultSafeguards:
excludedNamespaces:
- kube-system
- kube-public
- kube-node-lease
- istio-system
- monitoring
- logging
excludedLabels:
"chaos-exempt": "true"
"criticality": "high"
maxPercentage: 20
autoRollback: true
csiThresholds:
"error_rate": 5.0
"latency_p99": 1000
"cpu_usage": 90.0
Lessons Learned:
Chaos engineering requires careful planning, proper safeguards, and controlled blast radius.
How to Avoid:
Implement proper safeguards for all chaos experiments.
Start with non-production environments and gradually expand.
Define clear abort conditions and automatic rollbacks.
Ensure monitoring can detect early warning signs of cascading failures.
Educate teams about chaos engineering principles and practices.
No summary provided
What Happened:
An SRE team implemented a chaos engineering program to improve system resilience. During a scheduled experiment to test database failover, the chaos experiment unexpectedly affected critical production services outside the intended blast radius. The experiment was supposed to terminate a single database pod in a non-critical service, but due to misconfiguration, it terminated multiple database pods across several services, including the company's payment processing system. This resulted in a 45-minute outage for critical business functions.
Diagnosis Steps:
Analyzed chaos experiment configuration and execution logs.
Reviewed Kubernetes resource selectors and labels used in the experiment.
Examined the blast radius containment mechanisms.
Checked monitoring alerts and system behavior during the incident.
Reviewed the experiment approval and execution process.
Root Cause:
The investigation revealed multiple issues:
1. The chaos experiment used overly broad pod selectors that matched more resources than intended
2. The blast radius containment mechanism failed due to incorrect namespace configuration
3. Pre-experiment verification steps were skipped due to time pressure
4. Monitoring for critical services affected by the experiment was inadequate
5. The experiment was run during a high-traffic period without proper scheduling controls
Fix/Workaround:
• Short-term: Implemented immediate improvements to chaos engineering practices:
- Added strict label selectors for all experiments
- Implemented mandatory pre-experiment verification
- Created a chaos engineering approval workflow
• Long-term: Developed a comprehensive chaos engineering framework in Go:
// chaos_framework.go
package chaos
import (
"context"
"fmt"
"log"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
var (
// Prometheus metrics
experimentsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "chaos_experiments_total",
Help: "Total number of chaos experiments executed",
},
[]string{"name", "target_type", "namespace", "status"},
)
experimentDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "chaos_experiment_duration_seconds",
Help: "Duration of chaos experiments in seconds",
Buckets: prometheus.LinearBuckets(10, 30, 10), // 10s, 40s, 70s, ...
},
[]string{"name", "target_type"},
)
affectedResources = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "chaos_affected_resources",
Help: "Number of resources affected by chaos experiments",
},
[]string{"name", "target_type", "namespace"},
)
)
// ExperimentConfig defines the configuration for a chaos experiment
type ExperimentConfig struct {
Name string `json:"name"`
Description string `json:"description"`
TargetType string `json:"target_type"` // pod, node, service, etc.
Namespace string `json:"namespace"`
LabelSelectors map[string]string `json:"label_selectors"`
ExcludeLabels map[string]string `json:"exclude_labels"`
Duration time.Duration `json:"duration"`
MaxTargets int `json:"max_targets"`
ConcurrentTargets int `json:"concurrent_targets"`
DryRun bool `json:"dry_run"`
Actions []Action `json:"actions"`
PreChecks []Check `json:"pre_checks"`
PostChecks []Check `json:"post_checks"`
Schedule *Schedule `json:"schedule"`
Notifications []Notification `json:"notifications"`
SafetyChecks []SafetyCheck `json:"safety_checks"`
}
// Action defines a chaos action to be performed
type Action struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
}
// Check defines a check to be performed before or after an experiment
type Check struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
Timeout time.Duration `json:"timeout"`
}
// Schedule defines when an experiment should be executed
type Schedule struct {
Cron string `json:"cron"`
AllowedDays []string `json:"allowed_days"`
AllowedHours []int `json:"allowed_hours"`
ExcludedDates []string `json:"excluded_dates"`
TimeZone string `json:"time_zone"`
MaxConcurrent int `json:"max_concurrent"`
RequireApproval bool `json:"require_approval"`
}
// Notification defines how to notify about experiment execution
type Notification struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
Events []string `json:"events"`
}
// SafetyCheck defines a safety check to prevent unintended damage
type SafetyCheck struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
StopOnFail bool `json:"stop_on_fail"`
}
// ExperimentResult contains the result of a chaos experiment
type ExperimentResult struct {
ExperimentName string `json:"experiment_name"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
TargetsAffected int `json:"targets_affected"`
ActionResults map[string]ActionResult `json:"action_results"`
CheckResults map[string]CheckResult `json:"check_results"`
Logs []LogEntry `json:"logs"`
Metrics map[string]float64 `json:"metrics"`
}
// ActionResult contains the result of a chaos action
type ActionResult struct {
ActionType string `json:"action_type"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
TargetCount int `json:"target_count"`
Error string `json:"error,omitempty"`
}
// CheckResult contains the result of a pre or post check
type CheckResult struct {
CheckType string `json:"check_type"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
Error string `json:"error,omitempty"`
}
// LogEntry represents a log entry from the experiment
type LogEntry struct {
Timestamp time.Time `json:"timestamp"`
Level string `json:"level"`
Component string `json:"component"`
Message string `json:"message"`
}
// ChaosFramework is the main entry point for chaos experiments
type ChaosFramework struct {
clientset *kubernetes.Clientset
logger *log.Logger
config *rest.Config
}
// NewChaosFramework creates a new chaos framework
func NewChaosFramework(logger *log.Logger) (*ChaosFramework, error) {
// Get in-cluster config
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to get in-cluster config: %w", err)
}
// Create clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("failed to create clientset: %w", err)
}
return &ChaosFramework{
clientset: clientset,
logger: logger,
config: config,
}, nil
}
// RunExperiment runs a chaos experiment
func (c *ChaosFramework) RunExperiment(ctx context.Context, config ExperimentConfig) (*ExperimentResult, error) {
c.logger.Printf("Starting chaos experiment: %s", config.Name)
// Initialize result
result := &ExperimentResult{
ExperimentName: config.Name,
StartTime: time.Now(),
Status: "running",
ActionResults: make(map[string]ActionResult),
CheckResults: make(map[string]CheckResult),
Logs: []LogEntry{},
Metrics: make(map[string]float64),
}
// Add initial log entry
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: fmt.Sprintf("Starting experiment: %s", config.Name),
})
// Validate configuration
if err := c.validateConfig(config); err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Configuration validation failed: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("invalid configuration: %w", err)
}
// Check if experiment is allowed to run now
if config.Schedule != nil && !c.isScheduleAllowed(config.Schedule) {
result.Status = "skipped"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Experiment skipped due to schedule constraints",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "skipped").Inc()
return result, nil
}
// Run safety checks
if err := c.runSafetyChecks(ctx, config); err != nil {
result.Status = "aborted"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Safety check failed: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "aborted").Inc()
return result, fmt.Errorf("safety check failed: %w", err)
}
// Get targets
targets, err := c.getTargets(ctx, config)
if err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Failed to get targets: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("failed to get targets: %w", err)
}
// Check if we have targets
if len(targets) == 0 {
result.Status = "skipped"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "No targets found for experiment",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "skipped").Inc()
return result, nil
}
// Limit targets if needed
if config.MaxTargets > 0 && len(targets) > config.MaxTargets {
targets = targets[:config.MaxTargets]
}
// Update result with target count
result.TargetsAffected = len(targets)
// Update metrics
affectedResources.WithLabelValues(config.Name, config.TargetType, config.Namespace).Set(float64(len(targets)))
// Run pre-checks
preChecksPassed, preCheckResults := c.runChecks(ctx, config.PreChecks, "pre")
for name, checkResult := range preCheckResults {
result.CheckResults[name] = checkResult
}
if !preChecksPassed {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: "Pre-checks failed",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("pre-checks failed")
}
// Send start notifications
c.sendNotifications(config.Notifications, "start", config, result)
// Run actions
if !config.DryRun {
actionResults, err := c.runActions(ctx, config.Actions, targets, config)
if err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Failed to run actions: %v", err),
})
// Send failure notifications
c.sendNotifications(config.Notifications, "failure", config, result)
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("failed to run actions: %w", err)
}
for name, actionResult := range actionResults {
result.ActionResults[name] = actionResult
}
} else {
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Dry run mode, skipping actions",
})
}
// Wait for experiment duration
select {
case <-ctx.Done():
result.Status = "interrupted"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Experiment interrupted",
})
case <-time.After(config.Duration):
// Continue with post-checks
}
// Run post-checks
postChecksPassed, postCheckResults := c.runChecks(ctx, config.PostChecks, "post")
for name, checkResult := range postCheckResults {
result.CheckResults[name] = checkResult
}
// Finalize result
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if result.Status == "running" {
if postChecksPassed {
result.Status = "completed"
} else {
result.Status = "failed"
}
}
// Update metrics
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, result.Status).Inc()
experimentDuration.WithLabelValues(config.Name, config.TargetType).Observe(result.Duration.Seconds())
// Send completion notifications
c.sendNotifications(config.Notifications, result.Status, config, result)
c.logger.Printf("Completed chaos experiment: %s, status: %s, duration: %v",
config.Name, result.Status, result.Duration)
return result, nil
}
// validateConfig validates the experiment configuration
func (c *ChaosFramework) validateConfig(config ExperimentConfig) error {
if config.Name == "" {
return fmt.Errorf("experiment name is required")
}
if config.TargetType == "" {
return fmt.Errorf("target type is required")
}
if config.Namespace == "" {
return fmt.Errorf("namespace is required")
}
if len(config.LabelSelectors) == 0 {
return fmt.Errorf("at least one label selector is required")
}
if config.Duration <= 0 {
return fmt.Errorf("duration must be greater than zero")
}
if len(config.Actions) == 0 {
return fmt.Errorf("at least one action is required")
}
// Validate actions
for _, action := range config.Actions {
if action.Type == "" {
return fmt.Errorf("action type is required")
}
// Validate action parameters based on type
switch action.Type {
case "pod-delete":
// No specific parameters required
case "pod-cpu-stress":
if _, ok := action.Parameters["cpu"]; !ok {
return fmt.Errorf("cpu parameter is required for pod-cpu-stress action")
}
case "pod-memory-stress":
if _, ok := action.Parameters["memory"]; !ok {
return fmt.Errorf("memory parameter is required for pod-memory-stress action")
}
case "pod-network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("latency parameter is required for pod-network-delay action")
}
case "pod-network-loss":
if _, ok := action.Parameters["loss"]; !ok {
return fmt.Errorf("loss parameter is required for pod-network-loss action")
}
case "node-cpu-stress":
if _, ok := action.Parameters["cpu"]; !ok {
return fmt.Errorf("cpu parameter is required for node-cpu-stress action")
}
case "node-memory-stress":
if _, ok := action.Parameters["memory"]; !ok {
return fmt.Errorf("memory parameter is required for node-memory-stress action")
}
case "node-network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("latency parameter is required for node-network-delay action")
}
case "node-network-loss":
if _, ok := action.Parameters["loss"]; !ok {
return fmt.Errorf("loss parameter is required for node-network-loss action")
}
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
return nil
}
// isScheduleAllowed checks if the experiment is allowed to run based on schedule
func (c *ChaosFramework) isScheduleAllowed(schedule *Schedule) bool {
now := time.Now()
// Check time zone
loc, err := time.LoadLocation(schedule.TimeZone)
if err != nil {
c.logger.Printf("Failed to load time zone %s: %v", schedule.TimeZone, err)
loc = time.UTC
}
// Convert to specified time zone
now = now.In(loc)
// Check allowed days
if len(schedule.AllowedDays) > 0 {
dayName := now.Weekday().String()
allowed := false
for _, day := range schedule.AllowedDays {
if day == dayName {
allowed = true
break
}
}
if !allowed {
return false
}
}
// Check allowed hours
if len(schedule.AllowedHours) > 0 {
hour := now.Hour()
allowed := false
for _, h := range schedule.AllowedHours {
if h == hour {
allowed = true
break
}
}
if !allowed {
return false
}
}
// Check excluded dates
if len(schedule.ExcludedDates) > 0 {
dateStr := now.Format("2006-01-02")
for _, date := range schedule.ExcludedDates {
if date == dateStr {
return false
}
}
}
return true
}
// runSafetyChecks runs all safety checks
func (c *ChaosFramework) runSafetyChecks(ctx context.Context, config ExperimentConfig) error {
for _, check := range config.SafetyChecks {
passed, err := c.runSafetyCheck(ctx, check, config)
if err != nil {
return fmt.Errorf("safety check %s failed: %w", check.Type, err)
}
if !passed && check.StopOnFail {
return fmt.Errorf("safety check %s failed", check.Type)
}
}
return nil
}
// runSafetyCheck runs a single safety check
func (c *ChaosFramework) runSafetyCheck(ctx context.Context, check SafetyCheck, config ExperimentConfig) (bool, error) {
switch check.Type {
case "target-count-limit":
maxTargets, ok := check.Parameters["max_targets"].(int)
if !ok {
return false, fmt.Errorf("max_targets parameter must be an integer")
}
targets, err := c.getTargets(ctx, config)
if err != nil {
return false, fmt.Errorf("failed to get targets: %w", err)
}
return len(targets) <= maxTargets, nil
case "target-percentage-limit":
maxPercentage, ok := check.Parameters["max_percentage"].(float64)
if !ok {
return false, fmt.Errorf("max_percentage parameter must be a float")
}
targets, err := c.getTargets(ctx, config)
if err != nil {
return false, fmt.Errorf("failed to get targets: %w", err)
}
// Get total resources
var totalCount int
switch config.TargetType {
case "pod":
pods, err := c.clientset.CoreV1().Pods(config.Namespace).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list pods: %w", err)
}
totalCount = len(pods.Items)
case "node":
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list nodes: %w", err)
}
totalCount = len(nodes.Items)
default:
return false, fmt.Errorf("unsupported target type for percentage check: %s", config.TargetType)
}
if totalCount == 0 {
return false, fmt.Errorf("no resources found of type %s", config.TargetType)
}
percentage := float64(len(targets)) / float64(totalCount) * 100
return percentage <= maxPercentage, nil
case "critical-service-check":
serviceName, ok := check.Parameters["service_name"].(string)
if !ok {
return false, fmt.Errorf("service_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if service exists and is healthy
service, err := c.clientset.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get service %s: %w", serviceName, err)
}
// Check if service has endpoints
endpoints, err := c.clientset.CoreV1().Endpoints(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get endpoints for service %s: %w", serviceName, err)
}
// Check if service has at least one endpoint
hasEndpoints := false
for _, subset := range endpoints.Subsets {
if len(subset.Addresses) > 0 {
hasEndpoints = true
break
}
}
return service != nil && hasEndpoints, nil
case "prometheus-query":
query, ok := check.Parameters["query"].(string)
if !ok {
return false, fmt.Errorf("query parameter must be a string")
}
threshold, ok := check.Parameters["threshold"].(float64)
if !ok {
return false, fmt.Errorf("threshold parameter must be a float")
}
operator, ok := check.Parameters["operator"].(string)
if !ok {
operator = "lt" // less than
}
// TODO: Implement Prometheus query execution using query, threshold, and
// operator (a sketch follows runSafetyCheck below); until then the check
// passes unconditionally
_, _, _ = query, threshold, operator // keeps the variables used so the file compiles
return true, nil
default:
return false, fmt.Errorf("unknown safety check type: %s", check.Type)
}
}
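// The prometheus-query safety check above is stubbed out. Below is a minimal
// sketch of how it could be completed with the official Go client
// (github.com/prometheus/client_golang/api); the evalPrometheusQuery name and
// the promURL parameter are illustrative assumptions, not part of the original
// framework. Additional imports assumed:
//   promapi "github.com/prometheus/client_golang/api"
//   promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
//   "github.com/prometheus/common/model"
func evalPrometheusQuery(ctx context.Context, promURL, query string, threshold float64, operator string) (bool, error) {
    client, err := promapi.NewClient(promapi.Config{Address: promURL})
    if err != nil {
        return false, fmt.Errorf("failed to create Prometheus client: %w", err)
    }
    // Instant query evaluated at "now"; warnings are ignored in this sketch
    result, _, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
    if err != nil {
        return false, fmt.Errorf("query failed: %w", err)
    }
    vector, ok := result.(model.Vector)
    if !ok || len(vector) == 0 {
        return false, fmt.Errorf("query returned no samples")
    }
    value := float64(vector[0].Value)
    switch operator {
    case "lt":
        return value < threshold, nil
    case "gt":
        return value > threshold, nil
    default:
        return false, fmt.Errorf("unsupported operator: %s", operator)
    }
}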
// getTargets gets the targets for the experiment
func (c *ChaosFramework) getTargets(ctx context.Context, config ExperimentConfig) ([]string, error) {
var targets []string
switch config.TargetType {
case "pod":
// Get pods matching label selectors
pods, err := c.clientset.CoreV1().Pods(config.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: metav1.FormatLabelSelector(&metav1.LabelSelector{
MatchLabels: config.LabelSelectors,
}),
})
if err != nil {
return nil, fmt.Errorf("failed to list pods: %w", err)
}
// Filter out pods with exclude labels
for _, pod := range pods.Items {
exclude := false
for key, value := range config.ExcludeLabels {
if podValue, ok := pod.Labels[key]; ok && podValue == value {
exclude = true
break
}
}
if !exclude {
targets = append(targets, pod.Name)
}
}
case "node":
// Get nodes matching label selectors
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{
LabelSelector: metav1.FormatLabelSelector(&metav1.LabelSelector{
MatchLabels: config.LabelSelectors,
}),
})
if err != nil {
return nil, fmt.Errorf("failed to list nodes: %w", err)
}
// Filter out nodes with exclude labels
for _, node := range nodes.Items {
exclude := false
for key, value := range config.ExcludeLabels {
if nodeValue, ok := node.Labels[key]; ok && nodeValue == value {
exclude = true
break
}
}
if !exclude {
targets = append(targets, node.Name)
}
}
default:
return nil, fmt.Errorf("unsupported target type: %s", config.TargetType)
}
return targets, nil
}
// runChecks runs all checks of a specific type (pre or post)
func (c *ChaosFramework) runChecks(ctx context.Context, checks []Check, checkType string) (bool, map[string]CheckResult) {
results := make(map[string]CheckResult)
allPassed := true
for i, check := range checks {
checkName := fmt.Sprintf("%s-check-%d-%s", checkType, i, check.Type)
result := CheckResult{
CheckType: check.Type,
StartTime: time.Now(),
Status: "running",
}
// Run check with timeout; cancel immediately after the check rather than
// deferring, so contexts are not leaked across loop iterations
checkCtx, cancel := context.WithTimeout(ctx, check.Timeout)
passed, err := c.runCheck(checkCtx, check)
cancel()
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if err != nil {
result.Status = "error"
result.Error = err.Error()
allPassed = false
} else if !passed {
result.Status = "failed"
allPassed = false
} else {
result.Status = "passed"
}
results[checkName] = result
}
return allPassed, results
}
// runCheck runs a single check
func (c *ChaosFramework) runCheck(ctx context.Context, check Check) (bool, error) {
switch check.Type {
case "pod-running":
podName, ok := check.Parameters["pod_name"].(string)
if !ok {
return false, fmt.Errorf("pod_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
pod, err := c.clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get pod %s: %w", podName, err)
}
return pod.Status.Phase == "Running", nil
case "service-available":
serviceName, ok := check.Parameters["service_name"].(string)
if !ok {
return false, fmt.Errorf("service_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if service exists
service, err := c.clientset.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get service %s: %w", serviceName, err)
}
// Check if service has endpoints
endpoints, err := c.clientset.CoreV1().Endpoints(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get endpoints for service %s: %w", serviceName, err)
}
// Check if service has at least one endpoint
for _, subset := range endpoints.Subsets {
if len(subset.Addresses) > 0 {
return true, nil
}
}
return false, nil
case "deployment-available":
deploymentName, ok := check.Parameters["deployment_name"].(string)
if !ok {
return false, fmt.Errorf("deployment_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if deployment exists
deployment, err := c.clientset.AppsV1().Deployments(namespace).Get(ctx, deploymentName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get deployment %s: %w", deploymentName, err)
}
// Check if deployment is available
for _, condition := range deployment.Status.Conditions {
if condition.Type == "Available" && condition.Status == "True" {
return true, nil
}
}
return false, nil
case "http-check":
url, ok := check.Parameters["url"].(string)
if !ok {
return false, fmt.Errorf("url parameter must be a string")
}
expectedStatus, ok := check.Parameters["expected_status"].(int)
if !ok {
// JSON-decoded numbers arrive as float64; fall back before defaulting
if f, isFloat := check.Parameters["expected_status"].(float64); isFloat {
expectedStatus = int(f)
} else {
expectedStatus = 200
}
}
// TODO: Implement the HTTP call using url and expectedStatus (a sketch
// follows runCheck below); until then the check passes unconditionally
_, _ = url, expectedStatus // keeps the variables used so the file compiles
return true, nil
default:
return false, fmt.Errorf("unknown check type: %s", check.Type)
}
}
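// The http-check above is likewise stubbed. A small sketch of the missing
// call, assuming the check should simply compare the response status code;
// the evalHTTPCheck name is illustrative and "net/http" would be added to
// the imports.
func evalHTTPCheck(ctx context.Context, url string, expectedStatus int) (bool, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return false, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode == expectedStatus, nil
}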
// runActions runs all actions on the targets
func (c *ChaosFramework) runActions(ctx context.Context, actions []Action, targets []string, config ExperimentConfig) (map[string]ActionResult, error) {
results := make(map[string]ActionResult)
for i, action := range actions {
actionName := fmt.Sprintf("action-%d-%s", i, action.Type)
result := ActionResult{
ActionType: action.Type,
StartTime: time.Now(),
Status: "running",
TargetCount: len(targets),
}
// Run action
err := c.runAction(ctx, action, targets, config)
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if err != nil {
result.Status = "failed"
result.Error = err.Error()
// Record the failed action before returning so callers can see it
results[actionName] = result
return results, fmt.Errorf("action %s failed: %w", actionName, err)
}
result.Status = "completed"
results[actionName] = result
}
return results, nil
}
// runAction runs a single action on the targets
func (c *ChaosFramework) runAction(ctx context.Context, action Action, targets []string, config ExperimentConfig) error {
switch action.Type {
case "pod-delete":
// Delete pods
for _, podName := range targets {
if err := c.clientset.CoreV1().Pods(config.Namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
return fmt.Errorf("failed to delete pod %s: %w", podName, err)
}
}
case "pod-cpu-stress":
// TODO: Implement pod CPU stress
// This would require executing commands in the pods
case "pod-memory-stress":
// TODO: Implement pod memory stress
// This would require executing commands in the pods
case "pod-network-delay":
// TODO: Implement pod network delay
// This would require executing commands in the pods
case "pod-network-loss":
// TODO: Implement pod network loss
// This would require executing commands in the pods
case "node-cpu-stress":
// TODO: Implement node CPU stress
// This would require executing commands on the nodes
case "node-memory-stress":
// TODO: Implement node memory stress
// This would require executing commands on the nodes
case "node-network-delay":
// TODO: Implement node network delay
// This would require executing commands on the nodes
case "node-network-loss":
// TODO: Implement node network loss
// This would require executing commands on the nodes
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
return nil
}
// sendNotifications sends notifications for an experiment
func (c *ChaosFramework) sendNotifications(notifications []Notification, event string, config ExperimentConfig, result *ExperimentResult) {
for _, notification := range notifications {
// Check if this notification should be sent for this event
shouldSend := false
for _, notificationEvent := range notification.Events {
if notificationEvent == event || notificationEvent == "all" {
shouldSend = true
break
}
}
if !shouldSend {
continue
}
// Send notification
switch notification.Type {
case "slack":
// TODO: Implement Slack notification
case "email":
// TODO: Implement email notification
case "webhook":
// TODO: Implement webhook notification
default:
c.logger.Printf("Unknown notification type: %s", notification.Type)
}
}
}
• Created a comprehensive chaos engineering policy:
# chaos_policy.yaml - Chaos Engineering Policy
apiVersion: chaos.example.com/v1
kind: ChaosPolicy
metadata:
name: chaos-engineering-policy
namespace: chaos-system
spec:
# Global settings
global:
# Default blast radius limits
blastRadius:
maxTargets: 5
maxPercentage: 10
targetLimits:
pod: 5
node: 2
deployment: 2
statefulset: 1
# Default scheduling constraints
scheduling:
allowedDays:
- Monday
- Tuesday
- Wednesday
- Thursday
allowedHours:
- 10
- 11
- 14
- 15
excludedDates:
- "2023-12-25"
- "2024-01-01"
timeZone: "UTC"
# Default notification settings
notifications:
- type: slack
parameters:
channel: "#chaos-engineering"
username: "Chaos Bot"
events:
- start
- failure
- completed
- type: email
parameters:
recipients:
- "sre-team@example.com"
events:
- failure
# Default safety checks
safetyChecks:
- type: critical-service-check
parameters:
service_name: "payment-service"
namespace: "production"
stopOnFail: true
- type: prometheus-query
parameters:
query: "sum(rate(http_requests_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
threshold: 5
operator: "lt"
stopOnFail: true
# Environment-specific settings
environments:
- name: production
# Override global settings for production
blastRadius:
maxTargets: 2
maxPercentage: 5
scheduling:
requireApproval: true
safetyChecks:
- type: target-percentage-limit
parameters:
max_percentage: 5
stopOnFail: true
- name: staging
# Less restrictive settings for staging
blastRadius:
maxTargets: 10
maxPercentage: 20
scheduling:
requireApproval: false
- name: development
# Least restrictive settings for development
blastRadius:
maxTargets: 20
maxPercentage: 50
scheduling:
requireApproval: false
# Target-specific settings
targets:
- type: pod
selectors:
app: "frontend"
actions:
- type: pod-delete
parameters: {}
- type: pod-cpu-stress
parameters:
cpu: 80
- type: pod-memory-stress
parameters:
memory: 80
- type: pod-network-delay
parameters:
latency: "100ms"
- type: node
selectors:
role: "worker"
actions:
- type: node-cpu-stress
parameters:
cpu: 80
- type: node-memory-stress
parameters:
memory: 80
- type: node-network-delay
parameters:
latency: "100ms"
- type: deployment
selectors:
app: "backend"
actions:
- type: pod-delete
parameters:
gracePeriod: 0
- type: statefulset
selectors:
app: "database"
actions:
- type: pod-delete
parameters:
gracePeriod: 30
# Experiment templates
templates:
- name: pod-failure
description: "Simulates pod failures"
targetType: "pod"
duration: "5m"
actions:
- type: pod-delete
parameters: {}
- name: network-latency
description: "Simulates network latency"
targetType: "pod"
duration: "10m"
actions:
- type: pod-network-delay
parameters:
latency: "200ms"
jitter: "50ms"
- name: cpu-stress
description: "Simulates CPU stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-cpu-stress
parameters:
cpu: 80
- name: memory-stress
description: "Simulates memory stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-memory-stress
parameters:
memory: 80
- name: disk-stress
description: "Simulates disk stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-disk-stress
parameters:
workers: 4
size: "1GB"
- name: node-failure
description: "Simulates node failures"
targetType: "node"
duration: "5m"
actions:
- type: node-poweroff
parameters: {}
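The environments section of the policy above overrides the global blast-radius defaults per environment. A minimal sketch of how a policy loader might resolve the effective limits, with hypothetical Go types mirroring the YAML structure (a zero value is treated as "inherit the global default"):
// policy_resolve.go - resolving effective blast-radius limits from the
// layered policy (global defaults overridden per environment)
package main

import "fmt"

type BlastRadius struct {
    MaxTargets    int
    MaxPercentage int
}

type Policy struct {
    Global       BlastRadius
    Environments map[string]BlastRadius
}

// effectiveBlastRadius applies the environment override field-by-field,
// falling back to the global default where the override is unset (zero).
func (p Policy) effectiveBlastRadius(env string) BlastRadius {
    eff := p.Global
    if o, ok := p.Environments[env]; ok {
        if o.MaxTargets > 0 {
            eff.MaxTargets = o.MaxTargets
        }
        if o.MaxPercentage > 0 {
            eff.MaxPercentage = o.MaxPercentage
        }
    }
    return eff
}

func main() {
    p := Policy{
        Global:       BlastRadius{MaxTargets: 5, MaxPercentage: 10},
        Environments: map[string]BlastRadius{"production": {MaxTargets: 2, MaxPercentage: 5}},
    }
    fmt.Printf("production limits: %+v\n", p.effectiveBlastRadius("production"))
    fmt.Printf("staging limits: %+v\n", p.effectiveBlastRadius("staging")) // inherits global
}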
Lessons Learned:
Chaos engineering requires careful planning, strict blast radius controls, and comprehensive monitoring.
How to Avoid:
Implement strict blast radius controls for all chaos experiments.
Create a formal approval process for chaos experiments.
Test all experiments in non-production environments first.
Implement comprehensive monitoring for all experiments.
Schedule experiments during low-traffic periods.
No summary provided
What Happened:
A company implemented an SRE practice with error budgets based on Service Level Objectives (SLOs). One product team consistently pushed new features without adequate testing, resulting in multiple production incidents. Three weeks into the quarter, the team had depleted its entire error budget, triggering an automatic feature freeze per the established policy. This led to significant tension between the development team (who wanted to continue shipping features) and the SRE team (who were enforcing the error budget policy).
Diagnosis Steps:
Analyzed SLO violations and error budget consumption patterns.
Reviewed recent deployments and their impact on reliability.
Examined post-incident reviews for recurring themes.
Interviewed development and SRE team members.
Assessed the effectiveness of the error budget policy implementation.
Root Cause:
The investigation revealed multiple issues with the error budget implementation:
1. The development team didn't fully understand the implications of the error budget policy
2. Error budget consumption wasn't visible enough during the development process
3. Testing practices were inadequate for catching reliability issues before production
4. The feature freeze policy was implemented without clear escalation paths
5. There was no graduated response to error budget consumption (it was all-or-nothing)
Fix/Workaround:
• Implemented improved error budget visibility and alerting with Prometheus rules
• Created a Go-based error budget tracking service with real-time dashboards
• Established a graduated response to error budget consumption (50%, 75%, 90%, 100%), as sketched after this list
• Implemented a formal exception process for the feature freeze policy
• Provided training on SRE principles and error budgets for all teams
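As referenced above, a minimal sketch of the graduated-response mapping. The tier thresholds mirror the 50/75/90/100% policy; the action descriptions and function names are illustrative assumptions, not the team's actual tracking service:
// budget_policy.go - graduated responses to error budget consumption
package main

import "fmt"

type responseTier struct {
    threshold float64 // fraction of the error budget consumed
    action    string
}

// Ordered from most to least severe, so the strongest applicable tier wins.
var tiers = []responseTier{
    {1.00, "feature freeze: only reliability work ships (with a formal exception process)"},
    {0.90, "deploys require SRE sign-off"},
    {0.75, "deploy frequency reduced; reliability items prioritized"},
    {0.50, "warning posted to the team channel; recent changes reviewed"},
}

// responseFor maps consumption (0.0 = untouched, 1.0 = exhausted) to a tier.
func responseFor(consumed float64) (responseTier, bool) {
    for _, t := range tiers {
        if consumed >= t.threshold {
            return t, true
        }
    }
    return responseTier{}, false
}

func main() {
    for _, c := range []float64{0.40, 0.60, 0.95, 1.10} {
        if t, ok := responseFor(c); ok {
            fmt.Printf("%.0f%% consumed -> %s\n", c*100, t.action)
        } else {
            fmt.Printf("%.0f%% consumed -> no action\n", c*100)
        }
    }
}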
Lessons Learned:
Error budgets are only effective when teams understand them and have visibility into consumption.
How to Avoid:
Implement graduated responses to error budget consumption, not just all-or-nothing.
Ensure error budget visibility throughout the development process.
Provide clear documentation and training on error budget policies.
Establish escalation paths for exceptional circumstances.
Focus on the collaborative aspect of error budgets, not just enforcement.
No summary provided
What Happened:
An SRE team implemented SLOs for their microservices platform using Prometheus and Alertmanager. Within days of deployment, on-call engineers were overwhelmed with alerts, many of which were for minor, transient issues that self-resolved. The constant stream of alerts led to alert fatigue, with some critical alerts being missed or ignored. Team morale declined, and the SLO implementation was at risk of being abandoned.
Diagnosis Steps:
Analyzed alert frequency and patterns over time.
Reviewed SLO definitions and error budget calculations.
Examined alert thresholds and burn rate configurations.
Collected feedback from on-call engineers about alert relevance.
Compared alerting patterns with actual customer impact.
Root Cause:
The investigation revealed multiple issues with the SLO implementation:
1. Error budgets were too small, triggering alerts for minor fluctuations
2. Alert thresholds were set at 100% SLO achievement rather than using error budgets
3. No distinction between fast and slow burn rates for error budget consumption
4. Alerting windows were too short, causing alerts for transient issues
5. No alert severity differentiation based on customer impact
Fix/Workaround:
• Redesigned SLO implementation with more realistic error budgets
• Implemented multi-window alert thresholds based on burn rates (see the sketch after this list)
• Created tiered alerting based on impact severity
• Added alert suppression for known issues and maintenance windows
• Developed a feedback loop for continuous improvement of alerts
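As referenced above, multi-window burn-rate alerting pages only when both a long and a short window burn hot: the long window proves the problem is sustained, the short window proves it is still happening, which filters out exactly the transient, self-resolving blips that caused the fatigue. The sketch below shows the decision logic; the window pairs and thresholds are the commonly cited Google SRE Workbook defaults for a 30-day SLO (2% of budget in 1h → burn rate 14.4, 5% in 6h → 6, 10% in 3d → 1), not values taken from this incident:
// burn_rate.go - decision logic for multi-window burn-rate alerting
package main

import "fmt"

// A burn rate of 1.0 means error budget is being consumed exactly fast
// enough to exhaust it at the end of the SLO period.
type windowPair struct {
    long, short string  // window labels, for readability
    longBurn    float64 // measured burn rate over the long window
    shortBurn   float64 // measured burn rate over the short window
    threshold   float64 // burn-rate threshold for this pair
    severity    string  // "page" or "ticket"
}

// evaluate fires only when BOTH windows exceed the threshold, so sustained
// burns page quickly while alerts clear soon after recovery.
func evaluate(pairs []windowPair) []string {
    var alerts []string
    for _, p := range pairs {
        if p.longBurn > p.threshold && p.shortBurn > p.threshold {
            alerts = append(alerts,
                fmt.Sprintf("%s: burn rate %.1fx over %s/%s windows", p.severity, p.longBurn, p.long, p.short))
        }
    }
    return alerts
}

func main() {
    pairs := []windowPair{
        {"1h", "5m", 15.2, 14.8, 14.4, "page"}, // fast burn: 2% of budget in 1h
        {"6h", "30m", 3.1, 2.2, 6.0, "page"},   // medium burn: suppressed here
        {"3d", "6h", 1.4, 1.2, 1.0, "ticket"},  // slow burn: next business day
    }
    for _, alert := range evaluate(pairs) {
        fmt.Println(alert)
    }
}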
Lessons Learned:
Effective SLO implementation requires careful balance between reliability goals and operational burden.
How to Avoid:
Start with conservative error budgets and refine over time.
Implement multi-window alert thresholds to distinguish between fast and slow burns.
Create tiered alerting based on customer impact and business priority.
Include alert suppression mechanisms for maintenance and known issues.
Establish a regular review process for SLOs and alerting effectiveness.
No summary provided
What Happened:
A technology company decided to implement SRE practices, including error budgets, to balance reliability and innovation. They defined service level objectives (SLOs) and created error budget policies that would halt feature development when budgets were exhausted. However, when the first service exhausted its error budget, the development team pushed back against stopping feature work. Leadership overrode the policy, creating confusion about the purpose and enforcement of error budgets. The incident highlighted deeper organizational challenges in adopting SRE practices.
Diagnosis Steps:
Analyzed the error budget consumption patterns.
Reviewed the communication and decision-making process during the incident.
Interviewed stakeholders about their understanding of error budgets.
Examined the error budget policy documentation and communication.
Assessed the alignment between team incentives and reliability goals.
Root Cause:
The investigation revealed multiple organizational issues:
1. Misalignment between team incentives (feature delivery vs. reliability)
2. Lack of executive understanding and buy-in for error budget concepts
3. Insufficient communication about the purpose and benefits of error budgets
4. No clear escalation path or decision-making process for policy exceptions
5. Error budget policies created without input from development teams
Fix/Workaround:
• Facilitated workshops to build shared understanding of error budgets
• Created a joint task force with representatives from all stakeholders
• Developed a more nuanced error budget policy with graduated responses
• Aligned team incentives and performance metrics with reliability goals
• Established clear governance for error budget policy exceptions
Lessons Learned:
Successful SRE implementation requires organizational alignment and cultural change, not just technical solutions.
How to Avoid:
Involve all stakeholders in developing error budget policies.
Ensure executive understanding and support for SRE principles.
Align team incentives and performance metrics with reliability goals.
Create graduated responses to error budget depletion, not just binary halts.
Establish clear governance and decision-making processes for exceptions.
No summary provided
What Happened:
A financial services company had implemented SLOs for their critical payment processing service, with targets for availability, latency, and error rates. Despite having dashboards and alerts configured, the service experienced degraded performance over several days that violated the SLOs but did not trigger any alerts. The issue was only discovered during a routine weekly review when an engineer noticed that the service had been operating below its SLO thresholds for nearly a week, affecting thousands of customers.
Diagnosis Steps:
Analyzed historical performance data for the service.
Reviewed SLO definitions and alert configurations.
Examined the monitoring system's data collection and aggregation.
Checked alert routing and notification channels.
Investigated recent changes to the service and monitoring infrastructure.
Root Cause:
The investigation revealed multiple issues with the SLO implementation:
1. The SLO was defined using averaged metrics over long time windows, masking short-term degradations
2. Alert thresholds were set too leniently, allowing significant degradation before alerting
3. The error budget calculation did not account for all types of failures
4. Some critical user journeys were not covered by the SLOs
5. Recent changes to the service introduced new failure modes not captured by existing SLIs
Fix/Workaround:
• Implemented immediate improvements to SLO monitoring
• Redefined SLOs with appropriate time windows and aggregation methods
• Created multi-level alerting with early warning thresholds
• Expanded SLIs to cover all critical user journeys
• Implemented synthetic monitoring for end-to-end validation
# Improved SLO Configuration in Prometheus
# File: improved-slo-config.yaml
groups:
- name: payment-service-slos
rules:
# Availability SLI - Percentage of successful requests
- record: sli:availability:ratio_rate5m
expr: |
sum(rate(http_requests_total{service="payment-service", status=~"2.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))
# Latency SLI - Percentage of requests faster than threshold
- record: sli:latency:ratio_rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{service="payment-service", le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="payment-service"}[5m]))
# Error SLI - Percentage of requests without application errors
- record: sli:errors:ratio_rate5m
expr: |
1 -
sum(rate(application_errors_total{service="payment-service"}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))
# Combined SLI - All three objectives must be met
# (PromQL's min() aggregates a single vector rather than taking multiple
# arguments, so the three series are first unioned with distinct labels)
- record: sli:combined:ratio_rate5m
expr: |
min(
label_replace(sli:availability:ratio_rate5m, "slo_component", "availability", "", "")
or label_replace(sli:latency:ratio_rate5m, "slo_component", "latency", "", "")
or label_replace(sli:errors:ratio_rate5m, "slo_component", "errors", "", "")
)
# 1-hour SLO
- record: slo:payment_service:1h
expr: avg_over_time(sli:combined:ratio_rate5m[1h])
# 6-hour SLO
- record: slo:payment_service:6h
expr: avg_over_time(sli:combined:ratio_rate5m[6h])
# 24-hour SLO
- record: slo:payment_service:24h
expr: avg_over_time(sli:combined:ratio_rate5m[24h])
# 30-day SLO (for error budget calculation)
- record: slo:payment_service:30d
expr: avg_over_time(sli:combined:ratio_rate5m[30d])
# Error budget consumption rate
- record: error_budget:payment_service:consumption_rate
expr: |
(1 - slo:payment_service:24h) / (1 - 0.995) # 99.5% is our SLO target
# Multi-level alerting
- name: payment-service-slo-alerts
rules:
# Early warning - 2x normal error budget burn rate
- alert: PaymentServiceErrorBudgetBurning
expr: error_budget:payment_service:consumption_rate > 2
for: 10m
labels:
severity: warning
team: payments
annotations:
summary: "Payment service error budget burning 2x faster than expected"
description: "The payment service is consuming error budget at {{ $value }}x the expected rate."
# Critical alert - 10x normal error budget burn rate
- alert: PaymentServiceErrorBudgetCritical
expr: error_budget:payment_service:consumption_rate > 10
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Payment service error budget burning at critical rate"
description: "The payment service is consuming error budget at {{ $value }}x the expected rate. Immediate action required."
# SLO breach alert
- alert: PaymentServiceSLOBreach
expr: slo:payment_service:1h < 0.995
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Payment service SLO breach"
description: "The payment service has breached its 99.5% SLO. Current value: {{ $value | humanizePercentage }}."
// Synthetic Monitoring for Critical User Journeys
// File: synthetic_monitor.go
package main
import (
"context"
"fmt"
"log"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
journeyDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "synthetic_journey_duration_seconds",
Help: "Duration of synthetic user journeys",
Buckets: prometheus.ExponentialBuckets(0.1, 1.5, 10),
},
[]string{"journey", "status"},
)
journeyErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "synthetic_journey_errors_total",
Help: "Total errors encountered during synthetic user journeys",
},
[]string{"journey", "error_type"},
)
)
func init() {
prometheus.MustRegister(journeyDuration)
prometheus.MustRegister(journeyErrors)
}
// PaymentJourney simulates a complete payment flow
func PaymentJourney(ctx context.Context) error {
startTime := time.Now()
status := "success"
defer func() {
journeyDuration.WithLabelValues("payment_flow", status).Observe(time.Since(startTime).Seconds())
}()
// Step 1: Create payment
err := createPayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "create_payment").Inc()
status = "failure"
return fmt.Errorf("create payment failed: %w", err)
}
// Step 2: Authorize payment
err = authorizePayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "authorize_payment").Inc()
status = "failure"
return fmt.Errorf("authorize payment failed: %w", err)
}
// Step 3: Capture payment
err = capturePayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "capture_payment").Inc()
status = "failure"
return fmt.Errorf("capture payment failed: %w", err)
}
// Step 4: Verify receipt
err = verifyReceipt(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "verify_receipt").Inc()
status = "failure"
return fmt.Errorf("verify receipt failed: %w", err)
}
return nil
}
// Implementation of individual steps
func createPayment(ctx context.Context) error {
// Actual implementation would make API calls to the payment service
// This is a simplified example
req, err := http.NewRequestWithContext(ctx, "POST", "https://payment-service/api/payments", nil)
if err != nil {
return err
}
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusCreated {
return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
}
return nil
}
func authorizePayment(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func capturePayment(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func verifyReceipt(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func runSyntheticMonitoring() {
ticker := time.NewTicker(1 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ticker.C:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
err := PaymentJourney(ctx)
if err != nil {
log.Printf("Payment journey failed: %v", err)
}
cancel()
}
}
}
func main() {
// Start synthetic monitoring in a goroutine
go runSyntheticMonitoring()
// Expose Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
log.Fatal(http.ListenAndServe(":8080", nil))
}
Lessons Learned:
Effective SLO implementation requires careful consideration of measurement methods, alert thresholds, and comprehensive coverage of user journeys.
How to Avoid:
Define SLOs with appropriate time windows and aggregation methods.
Implement multi-level alerting with early warning thresholds.
Ensure SLIs cover all critical user journeys and failure modes.
Validate SLO implementations with synthetic monitoring.
Regularly review and refine SLOs based on service evolution.
No summary provided
What Happened:
A large e-commerce platform experienced a complete outage of their checkout and payment services during a high-traffic sales event. The incident began when a database service started experiencing performance degradation due to an unexpected query pattern. Instead of isolating the failure, the misconfigured circuit breakers in the service mesh allowed the degradation to cascade through dependent services. Within minutes, the entire checkout flow was unavailable, resulting in significant revenue loss and customer frustration.
Diagnosis Steps:
Analyzed service mesh telemetry to identify the failure patterns.
Examined circuit breaker configurations across services.
Reviewed traffic patterns and service dependencies.
Checked database performance metrics and query patterns.
Tested circuit breaker behavior in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the circuit breaker implementation:
1. Circuit breakers were configured with thresholds that were too high
2. Timeout settings did not account for the actual service performance characteristics
3. Retry policies were too aggressive, causing additional load on degraded services
4. Fallback mechanisms were not properly implemented for critical services
5. Health check criteria did not accurately reflect service health
Fix/Workaround:
• Implemented immediate improvements to circuit breaker configuration (a sketch follows this list)
• Adjusted timeout and retry settings based on service performance profiles
• Implemented proper fallback mechanisms for critical services
• Enhanced health check criteria to better reflect service health
• Established a comprehensive testing framework for resilience patterns
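As referenced above, the key change was deriving breaker settings from measured service behavior instead of defaults. A minimal sketch using the widely used sony/gobreaker library; the thresholds are placeholders standing in for values taken from empirical performance data, and the breaker name and simulated call are illustrative:
// breaker.go - a data-driven circuit breaker sketch
package main

import (
    "errors"
    "fmt"
    "time"

    "github.com/sony/gobreaker"
)

func newPaymentBreaker() *gobreaker.CircuitBreaker {
    return gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name:        "payment-db",
        MaxRequests: 5,                // probes allowed while half-open
        Interval:    30 * time.Second, // counters reset while closed
        Timeout:     15 * time.Second, // open -> half-open after this
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Trip at a 50% failure ratio, but only after enough samples
            // to avoid tripping on a single slow request.
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 10 && failureRatio >= 0.5
        },
        OnStateChange: func(name string, from, to gobreaker.State) {
            fmt.Printf("breaker %s: %s -> %s\n", name, from, to)
        },
    })
}

func main() {
    cb := newPaymentBreaker()
    // Wrap the dependency call; once the breaker opens, Execute fails fast
    // instead of piling retries onto an already degraded service.
    _, err := cb.Execute(func() (interface{}, error) {
        return nil, errors.New("simulated database timeout")
    })
    if err != nil {
        fmt.Println("call failed:", err)
    }
}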
Lessons Learned:
Circuit breakers and other resilience patterns require careful configuration and testing to effectively prevent cascading failures.
How to Avoid:
Configure circuit breakers based on empirical service performance data.
Implement comprehensive fallback mechanisms for critical services.
Test resilience patterns under realistic failure scenarios.
Establish clear ownership for resilience configuration.
Regularly review and update circuit breaker settings as services evolve.