# Site Reliability Engineering (SRE) Scenarios
No summary provided
What Happened:
During a database slowdown, one service began experiencing timeouts. Instead of failing gracefully, it continued to accept requests but responded slowly. Dependent services slowed down in turn, and the ripple effect eventually brought down the entire system even though only one component was unhealthy.
Diagnosis Steps:
Analyzed service dependency graph to understand the failure propagation.
Reviewed logs to identify the initial failure point.
Examined circuit breaker configurations across services.
Tested service behavior under simulated failure conditions.
Reviewed recent changes to resilience configurations.
Root Cause:
The circuit breaker in the database access service was misconfigured with too high a failure threshold (90%) and too long a timeout period (30 seconds). This allowed the service to continue accepting requests despite being unable to process them in a timely manner. Additionally, dependent services had no circuit breakers configured, allowing the failure to propagate.
Fix/Workaround:
• Short-term: Implemented proper circuit breaker configuration:
// Before: Misconfigured circuit breaker
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
.circuitBreakerConfig(CircuitBreakerConfig.custom()
.failureRateThreshold(90.0f)
.waitDurationInOpenState(Duration.ofSeconds(60))
.slidingWindowSize(100)
.build())
.timeLimiterConfig(TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(30))
.build())
.build());
}
// After: Properly configured circuit breaker
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
.circuitBreakerConfig(CircuitBreakerConfig.custom()
.failureRateThreshold(50.0f)
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10)
.permittedNumberOfCallsInHalfOpenState(5)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build())
.timeLimiterConfig(TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(5))
.build())
.build());
}
• Long-term: Implemented a comprehensive resilience strategy:
# Service Resilience Configuration (resilience.yaml)
global:
  circuitBreaker:
    failureRateThreshold: 50.0
    slowCallRateThreshold: 50.0
    slowCallDurationThreshold: 2s
    waitDurationInOpenState: 30s
    permittedNumberOfCallsInHalfOpenState: 5
    slidingWindowSize: 10
    slidingWindowType: COUNT_BASED
    minimumNumberOfCalls: 5
    automaticTransitionFromOpenToHalfOpenEnabled: true
  timeLimiter:
    timeoutDuration: 5s
    cancelRunningFuture: true
  bulkhead:
    maxConcurrentCalls: 25
    maxWaitDuration: 1s
  retry:
    maxAttempts: 3
    waitDuration: 500ms
    retryExceptions:
      - java.io.IOException
      - java.net.SocketTimeoutException
    ignoreExceptions:
      - java.lang.IllegalArgumentException
services:
  database-service:
    circuitBreaker:
      failureRateThreshold: 30.0
      waitDurationInOpenState: 45s
    timeLimiter:
      timeoutDuration: 3s
  payment-service:
    circuitBreaker:
      failureRateThreshold: 20.0
    bulkhead:
      maxConcurrentCalls: 10
  inventory-service:
    retry:
      maxAttempts: 5
      waitDuration: 1s
• Implemented service degradation strategies:
// Graceful degradation with fallbacks
@Service
public class ProductServiceImpl implements ProductService {
private final ProductRepository repository;
private final RecommendationService recommendationService;
private final InventoryService inventoryService;
private final CircuitBreakerFactory circuitBreakerFactory;
public ProductServiceImpl(ProductRepository repository,
RecommendationService recommendationService,
InventoryService inventoryService,
CircuitBreakerFactory circuitBreakerFactory) {
this.repository = repository;
this.recommendationService = recommendationService;
this.inventoryService = inventoryService;
this.circuitBreakerFactory = circuitBreakerFactory;
}
@Override
public ProductDetails getProductDetails(String productId) {
Product product = repository.findById(productId)
.orElseThrow(() -> new ProductNotFoundException(productId));
// Get recommendations with circuit breaker and fallback
List<Product> recommendations = circuitBreakerFactory
.create("recommendations")
.run(() -> recommendationService.getRecommendations(productId),
throwable -> getDefaultRecommendations(product.getCategory()));
// Get inventory status with circuit breaker and fallback
InventoryStatus inventoryStatus = circuitBreakerFactory
.create("inventory")
.run(() -> inventoryService.getInventoryStatus(productId),
throwable -> InventoryStatus.UNKNOWN);
return new ProductDetails(product, recommendations, inventoryStatus);
}
private List<Product> getDefaultRecommendations(String category) {
// Fallback to cached or popular products in the same category
return repository.findTopByCategory(category, 5);
}
}
• Implemented a service mesh for centralized resilience management:
# Istio VirtualService with resilience policies
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: database-service
spec:
  hosts:
    - database-service
  http:
    - route:
        - destination:
            host: database-service
            subset: v1
      timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: gateway-error,connect-failure,refused-stream,unavailable,cancelled,resource-exhausted
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: database-service
spec:
  host: database-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
  subsets:
    - name: v1
      labels:
        version: v1
Lessons Learned:
Circuit breakers must be properly configured to prevent cascading failures.
How to Avoid:
Implement circuit breakers with appropriate thresholds and timeouts.
Test failure scenarios regularly through chaos engineering.
Configure all services with appropriate resilience patterns.
Monitor circuit breaker states and failure rates (a monitoring sketch follows this list).
Implement fallback mechanisms for critical services.
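For illustration, a minimal sketch of threshold tuning plus state monitoring using the Go library sony/gobreaker (a library choice assumed here; the services in this incident use Resilience4j, which exposes equivalent event listeners for the same purpose):
// circuit_monitor.go - sketch: tuned thresholds plus state-change monitoring
package main

import (
	"errors"
	"log"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "database",
		MaxRequests: 5,                // calls allowed through while half-open
		Interval:    10 * time.Second, // window over which counts accumulate
		Timeout:     30 * time.Second, // how long the breaker stays open
		// Trip at a 50% failure rate once a minimum number of calls is seen,
		// mirroring the failureRateThreshold/minimumNumberOfCalls pair above.
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 5 && failureRatio >= 0.5
		},
		// Log every state transition; in production this hook would also emit
		// a metric so dashboards and alerts can track breaker health.
		OnStateChange: func(name string, from, to gobreaker.State) {
			log.Printf("circuit breaker %q: %s -> %s", name, from, to)
		},
	})

	// Wrap the protected call; once the breaker opens, Execute fails fast
	// instead of queueing requests against an unhealthy dependency.
	_, err := cb.Execute(func() (interface{}, error) {
		return nil, errors.New("simulated database timeout")
	})
	if err != nil {
		log.Printf("call failed: %v", err)
	}
}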
No summary provided
What Happened:
A web application was migrated to Kubernetes with Horizontal Pod Autoscaler (HPA) configured to scale based on CPU utilization. During traffic spikes, users experienced increased error rates and latency despite the system scaling up as configured.
Diagnosis Steps:
Analyzed metrics during scaling events.
Reviewed pod startup logs and timing.
Examined database connection patterns during scaling.
Tested scaling behavior in a controlled environment.
Profiled application performance under load.
Root Cause:
Multiple issues contributed to the problem:
1. New pods took too long to become fully operational (slow startup time).
2. The application didn't properly handle connection pooling, creating a database connection storm during scaling events.
3. The readiness probe was too permissive, allowing pods to receive traffic before they were truly ready.
4. The scaling threshold was too high (80% CPU), causing scaling to begin only after performance had already degraded.
Fix/Workaround:
• Short-term: Implemented more conservative scaling parameters:
# Before: Aggressive scaling with high threshold
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
# After: More conservative scaling with lower threshold and better behavior
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 5
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 1000
  behavior:
    scaleUp:
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 100
          periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 0
    scaleDown:
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 300
• Long-term: Improved application startup and connection handling:
// Connection pool configuration with proper handling for scaling
@Configuration
public class DatabaseConfig {
@Value("${db.min-pool-size:5}")
private int minPoolSize;
@Value("${db.max-pool-size:20}")
private int maxPoolSize;
@Value("${db.connection-timeout:5000}")
private int connectionTimeout;
@Value("${db.idle-timeout:600000}")
private int idleTimeout;
@Value("${db.max-lifetime:1800000}")
private int maxLifetime;
@Value("${db.pool-name:app-connection-pool}")
private String poolName;
@Bean(destroyMethod = "close")
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(System.getenv("DB_URL"));
config.setUsername(System.getenv("DB_USER"));
config.setPassword(System.getenv("DB_PASSWORD"));
config.setDriverClassName("org.postgresql.Driver");
// Connection pool settings optimized for scaling
config.setMinimumIdle(minPoolSize);
config.setMaximumPoolSize(maxPoolSize);
config.setConnectionTimeout(connectionTimeout);
config.setIdleTimeout(idleTimeout);
config.setMaxLifetime(maxLifetime);
config.setPoolName(poolName);
// Add metrics for monitoring
config.setMetricRegistry(metricRegistry());
// Add health checks
config.setHealthCheckRegistry(healthCheckRegistry());
// Connection initialization
config.setInitializationFailTimeout(0); // Don't fail startup if DB is unavailable
config.setConnectionInitSql("SELECT 1"); // Validate connections
// Connection testing
config.setConnectionTestQuery("SELECT 1");
config.setValidationTimeout(5000);
return new HikariDataSource(config);
}
@Bean
public MetricRegistry metricRegistry() {
return new MetricRegistry();
}
@Bean
public HealthCheckRegistry healthCheckRegistry() {
return new HealthCheckRegistry();
}
}
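One sizing detail worth calling out: the effective connection count is pods × pool size. With the HPA above allowed to reach 30 replicas and db.max-pool-size defaulting to 20, a full scale-out can open up to 30 × 20 = 600 database connections, so the pool ceiling and the database's max_connections limit must be budgeted together or the scale-out itself recreates the connection storm.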
• Improved readiness probe implementation:
// health.go - Improved readiness probe
package main
import (
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"sync"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver used by sql.Open below
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
isReady bool
readyMutex sync.RWMutex
startupTime time.Time
db *sql.DB
// Prometheus metrics
healthCheckDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "health_check_duration_seconds",
Help: "Duration of health check in seconds",
Buckets: prometheus.LinearBuckets(0.01, 0.05, 10),
})
databaseLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "database_query_duration_seconds",
Help: "Duration of database queries in seconds",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
})
)
func init() {
startupTime = time.Now()
prometheus.MustRegister(healthCheckDuration, databaseLatency)
}
type HealthStatus struct {
Status string `json:"status"`
Components map[string]string `json:"components"`
StartupDuration float64 `json:"startupDurationSeconds,omitempty"`
UptimeSeconds float64 `json:"uptimeSeconds"`
ResourceUtilization ResourceStats `json:"resourceUtilization"`
}
type ResourceStats struct {
CPUUsagePercent float64 `json:"cpuUsagePercent"`
MemoryUsagePercent float64 `json:"memoryUsagePercent"`
ConnectionCount int `json:"connectionCount"`
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
timer := prometheus.NewTimer(healthCheckDuration)
defer timer.ObserveDuration()
readyMutex.RLock()
ready := isReady
readyMutex.RUnlock()
if !ready {
// Check if minimum startup time has elapsed
if time.Since(startupTime) < 10*time.Second {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Service is still initializing"))
return
}
// Check database connectivity
dbReady := checkDatabaseConnectivity()
if !dbReady {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Database connectivity check failed"))
return
}
// Check cache warmup
cacheReady := checkCacheWarmup()
if !cacheReady {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Cache warmup incomplete"))
return
}
// All checks passed, service is ready
readyMutex.Lock()
isReady = true
readyMutex.Unlock()
}
// Prepare health status response
status := HealthStatus{
Status: "UP",
Components: map[string]string{
"database": "UP",
"cache": "UP",
"disk": "UP",
},
StartupDuration: time.Since(startupTime).Seconds(),
UptimeSeconds: time.Since(startupTime).Seconds(),
ResourceUtilization: ResourceStats{
CPUUsagePercent: getCPUUsage(),
MemoryUsagePercent: getMemoryUsage(),
ConnectionCount: getConnectionCount(),
},
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(status)
}
func checkDatabaseConnectivity() bool {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
timer := prometheus.NewTimer(databaseLatency)
defer timer.ObserveDuration()
err := db.PingContext(ctx)
if err != nil {
log.Printf("Database connectivity check failed: %v", err)
return false
}
// Execute a simple query to verify functionality
var result int
err = db.QueryRowContext(ctx, "SELECT 1").Scan(&result)
if err != nil || result != 1 {
log.Printf("Database query check failed: %v", err)
return false
}
return true
}
func checkCacheWarmup() bool {
// Implementation of cache warmup check
return true
}
func getCPUUsage() float64 {
// Implementation to get CPU usage
return 0.0
}
func getMemoryUsage() float64 {
// Implementation to get memory usage
return 0.0
}
func getConnectionCount() int {
// Implementation to get connection count
return 0
}
func main() {
// Initialize database connection
var err error
db, err = sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil {
log.Fatalf("Failed to connect to database: %v", err)
}
defer db.Close()
// Configure connection pool
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(5 * time.Minute)
db.SetConnMaxIdleTime(10 * time.Minute)
// Set up HTTP server
http.HandleFunc("/health/readiness", readinessHandler)
http.HandleFunc("/health/liveness", livenessHandler)
http.Handle("/metrics", promhttp.Handler())
// Start HTTP server
log.Fatal(http.ListenAndServe(":8080", nil))
}
func livenessHandler(w http.ResponseWriter, r *http.Request) {
// Simple liveness check
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}
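These handlers only take effect once the Deployment's probes point at them. A minimal wiring sketch follows (thresholds and timings are illustrative assumptions; the paths and port match the Go handlers above):
# Sketch: container probes for the Deployment (values illustrative)
startupProbe:
  httpGet:
    path: /health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # up to 150s for slow startup before liveness applies
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3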
• Implemented predictive scaling:
# predictive_scaler.py
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from kubernetes import client, config
import time
import logging
from datetime import datetime, timedelta
import schedule
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('predictive_scaler')

# Load Kubernetes configuration
try:
    config.load_incluster_config()  # When running inside cluster
except config.ConfigException:
    config.load_kube_config()  # When running locally

# Initialize Kubernetes clients
api_client = client.ApiClient()
custom_api = client.CustomObjectsApi(api_client)
apps_api = client.AppsV1Api(api_client)

# Configuration
NAMESPACE = os.environ.get('NAMESPACE', 'production')
DEPLOYMENT_NAME = os.environ.get('DEPLOYMENT_NAME', 'web-app')
HPA_NAME = os.environ.get('HPA_NAME', 'web-app')
PROMETHEUS_URL = os.environ.get('PROMETHEUS_URL', 'http://prometheus:9090')
PREDICTION_WINDOW_HOURS = int(os.environ.get('PREDICTION_WINDOW_HOURS', '1'))
HISTORY_WINDOW_DAYS = int(os.environ.get('HISTORY_WINDOW_DAYS', '14'))
MIN_REPLICAS = int(os.environ.get('MIN_REPLICAS', '5'))
MAX_REPLICAS = int(os.environ.get('MAX_REPLICAS', '30'))

class PredictiveScaler:
    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=42)
        self.scaler = StandardScaler()
        self.is_model_trained = False

    def fetch_historical_metrics(self):
        """Fetch historical metrics from Prometheus"""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=HISTORY_WINDOW_DAYS)
        # Query Prometheus for historical data.
        # This is a simplified example - in production, use the Prometheus API
        # to fetch real metrics like request rate, CPU usage, etc.
        # For this example, we'll create synthetic data.
        logger.info(f"Fetching historical metrics from {start_time} to {end_time}")
        # Create a date range with hourly intervals
        date_range = pd.date_range(start=start_time, end=end_time, freq='H')
        # Create synthetic metrics
        data = {
            'timestamp': date_range,
            'hour_of_day': [ts.hour for ts in date_range],
            'day_of_week': [ts.dayofweek for ts in date_range],
            'is_weekend': [(ts.dayofweek >= 5) for ts in date_range],
            'request_rate': np.random.normal(1000, 300, size=len(date_range)),
            'cpu_usage': np.random.normal(50, 15, size=len(date_range)),
            'memory_usage': np.random.normal(40, 10, size=len(date_range)),
            'error_rate': np.random.normal(0.5, 0.2, size=len(date_range)),
            'latency_p95': np.random.normal(200, 50, size=len(date_range)),
            'required_replicas': np.random.randint(MIN_REPLICAS, MAX_REPLICAS, size=len(date_range))
        }
        # Add seasonal patterns: increase load during business hours on weekdays
        for i, hour in enumerate(data['hour_of_day']):
            if 9 <= hour <= 17 and data['day_of_week'][i] < 5:
                data['request_rate'][i] *= 1.5
                data['cpu_usage'][i] *= 1.3
                data['required_replicas'][i] = max(MIN_REPLICAS, min(MAX_REPLICAS, int(data['required_replicas'][i] * 1.5)))
        df = pd.DataFrame(data)
        return df

    def train_model(self):
        """Train the predictive model using historical data"""
        logger.info("Training predictive scaling model")
        # Fetch historical data
        df = self.fetch_historical_metrics()
        # Prepare features and target
        features = df[['hour_of_day', 'day_of_week', 'is_weekend',
                       'request_rate', 'cpu_usage', 'memory_usage',
                       'error_rate', 'latency_p95']]
        target = df['required_replicas']
        # Scale features
        scaled_features = self.scaler.fit_transform(features)
        # Train model
        self.model.fit(scaled_features, target)
        self.is_model_trained = True
        logger.info("Model training completed")
        # Evaluate model
        predictions = self.model.predict(scaled_features)
        mse = np.mean((predictions - target) ** 2)
        logger.info(f"Model MSE on training data: {mse:.2f}")
        return mse

    def predict_required_replicas(self):
        """Predict the required number of replicas for the next time window"""
        if not self.is_model_trained:
            logger.warning("Model not trained yet, training now")
            self.train_model()
        # Generate features for prediction window
        prediction_window = []
        current_time = datetime.now()
        for hour_offset in range(PREDICTION_WINDOW_HOURS):
            future_time = current_time + timedelta(hours=hour_offset)
            # Fetch current metrics.
            # In production, get these from Prometheus or other monitoring system.
            current_metrics = {
                'hour_of_day': future_time.hour,
                'day_of_week': future_time.weekday(),
                'is_weekend': future_time.weekday() >= 5,
                'request_rate': 1000,  # Example value
                'cpu_usage': 50,       # Example value
                'memory_usage': 40,    # Example value
                'error_rate': 0.5,     # Example value
                'latency_p95': 200     # Example value
            }
            prediction_window.append(current_metrics)
        # Convert to DataFrame and scale
        prediction_df = pd.DataFrame(prediction_window)
        scaled_prediction = self.scaler.transform(prediction_df)
        # Make prediction
        predicted_replicas = self.model.predict(scaled_prediction)
        # Take the maximum predicted value for the window
        max_predicted_replicas = int(np.ceil(max(predicted_replicas)))
        # Ensure within bounds
        max_predicted_replicas = max(MIN_REPLICAS, min(MAX_REPLICAS, max_predicted_replicas))
        logger.info(f"Predicted required replicas for next {PREDICTION_WINDOW_HOURS} hour(s): {max_predicted_replicas}")
        return max_predicted_replicas

    def update_hpa(self, predicted_replicas):
        """Update the HPA min replicas based on prediction"""
        try:
            # Get current HPA
            hpa = custom_api.get_namespaced_custom_object(
                group="autoscaling",
                version="v2",
                namespace=NAMESPACE,
                plural="horizontalpodautoscalers",
                name=HPA_NAME
            )
            current_min_replicas = hpa['spec']['minReplicas']
            if current_min_replicas != predicted_replicas:
                logger.info(f"Updating HPA min replicas from {current_min_replicas} to {predicted_replicas}")
                # Update HPA
                hpa['spec']['minReplicas'] = predicted_replicas
                custom_api.patch_namespaced_custom_object(
                    group="autoscaling",
                    version="v2",
                    namespace=NAMESPACE,
                    plural="horizontalpodautoscalers",
                    name=HPA_NAME,
                    body=hpa
                )
                logger.info("HPA updated successfully")
            else:
                logger.info(f"No update needed, current min replicas already set to {current_min_replicas}")
        except Exception as e:
            logger.error(f"Error updating HPA: {e}")

    def run_scaling_cycle(self):
        """Run a complete predictive scaling cycle"""
        try:
            logger.info("Starting predictive scaling cycle")
            # Predict required replicas
            predicted_replicas = self.predict_required_replicas()
            # Update HPA
            self.update_hpa(predicted_replicas)
            logger.info("Predictive scaling cycle completed")
        except Exception as e:
            logger.error(f"Error in scaling cycle: {e}")

def main():
    scaler = PredictiveScaler()
    # Initial training
    scaler.train_model()
    # Run immediately
    scaler.run_scaling_cycle()
    # Schedule regular runs
    schedule.every(15).minutes.do(scaler.run_scaling_cycle)
    schedule.every(6).hours.do(scaler.train_model)  # Retrain model periodically
    logger.info("Predictive scaler started, running every 15 minutes")
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()
Lessons Learned:
Auto-scaling requires careful configuration and application design to be effective.
How to Avoid:
Optimize application startup time to reduce scaling lag.
Implement proper connection pooling and resource management.
Configure more sophisticated readiness probes.
Set more conservative scaling thresholds.
Consider predictive scaling for workloads with predictable patterns.
No summary provided
What Happened:
A database performance issue in one region cascaded into a global service outage. Despite having SRE teams in multiple regions, the incident response was uncoordinated, leading to conflicting remediation attempts and significantly extended downtime.
Diagnosis Steps:
Reviewed incident timeline and communication logs.
Analyzed actions taken by each regional SRE team.
Examined monitoring data and alerts across regions.
Interviewed team members about decision-making processes.
Reviewed existing incident response procedures.
Root Cause:
Multiple issues contributed to the coordination failure:
1. No clear incident command structure across regions.
2. Inconsistent tooling and procedures between regional teams.
3. Communication channels were fragmented and not centralized.
4. No shared visibility into remediation actions being taken.
5. Lack of predefined escalation paths and decision authority.
Fix/Workaround:
• Short-term: Implemented an emergency incident command structure:
# incident_roles.yaml - Emergency incident response roles
roles:
  incident_commander:
    description: "Single point of decision-making authority during incidents"
    responsibilities:
      - Declare incident severity and manage overall response
      - Assign roles to team members
      - Make final decisions on remediation actions
      - Coordinate communication across teams
      - Declare incident resolution
    rotation:
      - name: "Alex Chen"
        region: "us-west"
        contact: "alex.chen@example.com, +1-555-123-4567"
      - name: "Priya Sharma"
        region: "eu-central"
        contact: "priya.sharma@example.com, +44-555-987-6543"
      - name: "Jamal Washington"
        region: "ap-southeast"
        contact: "jamal.washington@example.com, +65-555-789-0123"
  communications_lead:
    description: "Manages all external and internal communications"
    responsibilities:
      - Provide regular status updates to stakeholders
      - Coordinate with customer support
      - Draft and send customer communications
      - Document timeline of events
    rotation:
      - name: "Sarah Johnson"
        region: "us-west"
        contact: "sarah.johnson@example.com, +1-555-234-5678"
      - name: "Marco Rossi"
        region: "eu-central"
        contact: "marco.rossi@example.com, +39-555-876-5432"
      - name: "Lin Wei"
        region: "ap-southeast"
        contact: "lin.wei@example.com, +86-555-678-9012"
  operations_lead:
    description: "Coordinates technical response actions"
    responsibilities:
      - Direct technical remediation efforts
      - Coordinate between engineering teams
      - Manage resource allocation during incident
      - Ensure technical actions are documented
    rotation:
      - name: "David Kim"
        region: "us-west"
        contact: "david.kim@example.com, +1-555-345-6789"
      - name: "Sophia Mueller"
        region: "eu-central"
        contact: "sophia.mueller@example.com, +49-555-765-4321"
      - name: "Raj Patel"
        region: "ap-southeast"
        contact: "raj.patel@example.com, +91-555-567-8901"
• Created a centralized incident response tool:
// incident_manager.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"sync"
"time"
"github.com/gorilla/mux"
"github.com/gorilla/websocket"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
// Incident represents an ongoing incident
type Incident struct {
ID string `json:"id" bson:"_id"`
Title string `json:"title" bson:"title"`
Description string `json:"description" bson:"description"`
Status string `json:"status" bson:"status"` // "active", "mitigated", "resolved"
Severity int `json:"severity" bson:"severity"` // 1-5, with 1 being most severe
StartTime time.Time `json:"startTime" bson:"startTime"`
MitigationTime *time.Time `json:"mitigationTime,omitempty" bson:"mitigationTime,omitempty"`
ResolutionTime *time.Time `json:"resolutionTime,omitempty" bson:"resolutionTime,omitempty"`
AffectedServices []string `json:"affectedServices" bson:"affectedServices"`
AffectedRegions []string `json:"affectedRegions" bson:"affectedRegions"`
Commander string `json:"commander" bson:"commander"`
CommsLead string `json:"commsLead" bson:"commsLead"`
OpsLead string `json:"opsLead" bson:"opsLead"`
Timeline []TimelineEntry `json:"timeline" bson:"timeline"`
Actions []RemediationAction `json:"actions" bson:"actions"`
Metadata map[string]interface{} `json:"metadata" bson:"metadata"`
}
// TimelineEntry represents an entry in the incident timeline
type TimelineEntry struct {
Timestamp time.Time `json:"timestamp" bson:"timestamp"`
Message string `json:"message" bson:"message"`
Author string `json:"author" bson:"author"`
Type string `json:"type" bson:"type"` // "info", "action", "decision", "milestone"
}
// RemediationAction represents an action taken to remediate the incident
type RemediationAction struct {
ID string `json:"id" bson:"id"`
Description string `json:"description" bson:"description"`
Status string `json:"status" bson:"status"` // "proposed", "in-progress", "completed", "abandoned"
Owner string `json:"owner" bson:"owner"`
Region string `json:"region" bson:"region"`
CreatedAt time.Time `json:"createdAt" bson:"createdAt"`
StartedAt *time.Time `json:"startedAt,omitempty" bson:"startedAt,omitempty"`
CompletedAt *time.Time `json:"completedAt,omitempty" bson:"completedAt,omitempty"`
Impact string `json:"impact" bson:"impact"` // Description of potential impact
Approved bool `json:"approved" bson:"approved"`
ApprovedBy string `json:"approvedBy,omitempty" bson:"approvedBy,omitempty"`
}
// IncidentManager manages incidents
type IncidentManager struct {
db *mongo.Database
incidents *mongo.Collection
activeIncident *Incident
mutex sync.RWMutex
clients map[*websocket.Conn]bool
broadcast chan []byte
upgrader websocket.Upgrader
}
// NewIncidentManager creates a new incident manager
func NewIncidentManager(mongoURI string) (*IncidentManager, error) {
// Connect to MongoDB
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
client, err := mongo.Connect(ctx, options.Client().ApplyURI(mongoURI))
if err != nil {
return nil, fmt.Errorf("failed to connect to MongoDB: %w", err)
}
// Ping the database
if err := client.Ping(ctx, nil); err != nil {
return nil, fmt.Errorf("failed to ping MongoDB: %w", err)
}
db := client.Database("incident_manager")
incidents := db.Collection("incidents")
// Create indexes
indexModel := mongo.IndexModel{
Keys: bson.D{
{Key: "status", Value: 1},
{Key: "startTime", Value: -1},
},
}
_, err = incidents.Indexes().CreateOne(ctx, indexModel)
if err != nil {
return nil, fmt.Errorf("failed to create index: %w", err)
}
return &IncidentManager{
db: db,
incidents: incidents,
clients: make(map[*websocket.Conn]bool),
broadcast: make(chan []byte),
upgrader: websocket.Upgrader{
ReadBufferSize: 1024,
WriteBufferSize: 1024,
CheckOrigin: func(r *http.Request) bool {
return true // Allow all connections in this example
},
},
}, nil
}
// Start starts the incident manager
func (im *IncidentManager) Start() {
// Start the broadcaster
go im.broadcaster()
// Load active incident if any
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"status": "active"}).Decode(&incident)
if err == nil {
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
log.Printf("Loaded active incident: %s", incident.ID)
} else if err != mongo.ErrNoDocuments {
log.Printf("Error loading active incident: %v", err)
}
}
// broadcaster broadcasts messages to all connected clients
func (im *IncidentManager) broadcaster() {
for {
message := <-im.broadcast
for client := range im.clients {
err := client.WriteMessage(websocket.TextMessage, message)
if err != nil {
log.Printf("Error broadcasting message: %v", err)
client.Close()
delete(im.clients, client)
}
}
}
}
// handleWebSocket handles WebSocket connections
func (im *IncidentManager) handleWebSocket(w http.ResponseWriter, r *http.Request) {
conn, err := im.upgrader.Upgrade(w, r, nil)
if err != nil {
log.Printf("Error upgrading to WebSocket: %v", err)
return
}
// Register client
im.clients[conn] = true
// Send current incident state
im.mutex.RLock()
if im.activeIncident != nil {
data, err := json.Marshal(im.activeIncident)
if err == nil {
conn.WriteMessage(websocket.TextMessage, data)
}
}
im.mutex.RUnlock()
// Handle incoming messages
go func() {
defer func() {
conn.Close()
delete(im.clients, conn)
}()
for {
_, message, err := conn.ReadMessage()
if err != nil {
if websocket.IsUnexpectedCloseError(err, websocket.CloseGoingAway, websocket.CloseAbnormalClosure) {
log.Printf("WebSocket error: %v", err)
}
break
}
// Process message (e.g., update incident, add timeline entry, etc.)
var data map[string]interface{}
if err := json.Unmarshal(message, &data); err != nil {
log.Printf("Error unmarshaling message: %v", err)
continue
}
// Handle different message types
messageType, ok := data["type"].(string)
if !ok {
continue
}
switch messageType {
case "add_timeline":
im.handleAddTimeline(data)
case "add_action":
im.handleAddAction(data)
case "update_action":
im.handleUpdateAction(data)
case "update_incident":
im.handleUpdateIncident(data)
}
}
}()
}
// handleAddTimeline handles adding a timeline entry
func (im *IncidentManager) handleAddTimeline(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleAddAction handles adding a remediation action
func (im *IncidentManager) handleAddAction(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleUpdateAction handles updating a remediation action
func (im *IncidentManager) handleUpdateAction(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// handleUpdateIncident handles updating incident details
func (im *IncidentManager) handleUpdateIncident(data map[string]interface{}) {
// Implementation details omitted for brevity
}
// createIncident creates a new incident
func (im *IncidentManager) createIncident(w http.ResponseWriter, r *http.Request) {
var incident Incident
if err := json.NewDecoder(r.Body).Decode(&incident); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
// Set incident fields
incident.ID = fmt.Sprintf("INC-%d", time.Now().Unix())
incident.Status = "active"
incident.StartTime = time.Now()
incident.Timeline = []TimelineEntry{
{
Timestamp: time.Now(),
Message: "Incident created",
Author: "System",
Type: "milestone",
},
}
incident.Actions = []RemediationAction{}
// Save to database
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_, err := im.incidents.InsertOne(ctx, incident)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// Set as active incident
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
// Broadcast to all clients
data, _ := json.Marshal(incident)
im.broadcast <- data
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incident)
}
// getIncident gets an incident by ID
func (im *IncidentManager) getIncident(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"_id": id}).Decode(&incident)
if err != nil {
if err == mongo.ErrNoDocuments {
http.Error(w, "Incident not found", http.StatusNotFound)
} else {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incident)
}
// listIncidents lists all incidents
func (im *IncidentManager) listIncidents(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
findOptions := options.Find()
findOptions.SetSort(bson.D{{Key: "startTime", Value: -1}})
findOptions.SetLimit(100)
cursor, err := im.incidents.Find(ctx, bson.M{}, findOptions)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
defer cursor.Close(ctx)
var incidents []Incident
if err := cursor.All(ctx, &incidents); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incidents)
}
// addTimelineEntry adds a timeline entry to an incident
func (im *IncidentManager) addTimelineEntry(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var entry TimelineEntry
if err := json.NewDecoder(r.Body).Decode(&entry); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
entry.Timestamp = time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$push": bson.M{
"timeline": entry,
},
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
im.mutex.Lock()
im.activeIncident.Timeline = append(im.activeIncident.Timeline, entry)
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
w.WriteHeader(http.StatusCreated)
}
// addRemediationAction adds a remediation action to an incident
func (im *IncidentManager) addRemediationAction(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var action RemediationAction
if err := json.NewDecoder(r.Body).Decode(&action); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
action.ID = fmt.Sprintf("ACT-%d", time.Now().Unix())
action.Status = "proposed"
action.CreatedAt = time.Now()
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$push": bson.M{
"actions": action,
},
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
im.mutex.Lock()
im.activeIncident.Actions = append(im.activeIncident.Actions, action)
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
}
// updateRemediationAction updates a remediation action
func (im *IncidentManager) updateRemediationAction(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
actionID := vars["actionID"]
var updateData map[string]interface{}
if err := json.NewDecoder(r.Body).Decode(&updateData); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
// Build update document
updateFields := bson.M{}
for key, value := range updateData {
updateFields["actions.$."+key] = value
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$set": updateFields,
}
result, err := im.incidents.UpdateOne(
ctx,
bson.M{
"_id": id,
"actions.id": actionID,
},
update,
)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident or action not found", http.StatusNotFound)
return
}
// If this is the active incident, update it and broadcast
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
// Reload the incident
var incident Incident
err := im.incidents.FindOne(ctx, bson.M{"_id": id}).Decode(&incident)
if err == nil {
im.mutex.Lock()
im.activeIncident = &incident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
}
}
w.WriteHeader(http.StatusOK)
}
// updateIncidentStatus updates the status of an incident
func (im *IncidentManager) updateIncidentStatus(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
id := vars["id"]
var updateData struct {
Status string `json:"status"`
}
if err := json.NewDecoder(r.Body).Decode(&updateData); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
update := bson.M{
"$set": bson.M{
"status": updateData.Status,
},
}
// Add appropriate timestamp based on status
now := time.Now()
if updateData.Status == "mitigated" {
update["$set"].(bson.M)["mitigationTime"] = now
} else if updateData.Status == "resolved" {
update["$set"].(bson.M)["resolutionTime"] = now
}
result, err := im.incidents.UpdateOne(ctx, bson.M{"_id": id}, update)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
if result.MatchedCount == 0 {
http.Error(w, "Incident not found", http.StatusNotFound)
return
}
// If this is the active incident and it's being resolved, clear it
im.mutex.RLock()
isActive := im.activeIncident != nil && im.activeIncident.ID == id
im.mutex.RUnlock()
if isActive {
if updateData.Status == "mitigated" {
im.mutex.Lock()
mitigationTime := now
im.activeIncident.Status = "mitigated"
im.activeIncident.MitigationTime = &mitigationTime
incident := im.activeIncident
im.mutex.Unlock()
data, _ := json.Marshal(incident)
im.broadcast <- data
} else if updateData.Status == "resolved" {
im.mutex.Lock()
im.activeIncident = nil
im.mutex.Unlock()
// Broadcast incident closure
data, _ := json.Marshal(map[string]interface{}{
"type": "incident_closed",
"id": id,
"message": "Incident has been resolved",
})
im.broadcast <- data
}
}
w.WriteHeader(http.StatusOK)
}
func main() {
// Create incident manager
im, err := NewIncidentManager("mongodb://localhost:27017")
if err != nil {
log.Fatalf("Failed to create incident manager: %v", err)
}
im.Start()
// Create router
r := mux.NewRouter()
// API routes
r.HandleFunc("/api/incidents", im.createIncident).Methods("POST")
r.HandleFunc("/api/incidents", im.listIncidents).Methods("GET")
r.HandleFunc("/api/incidents/{id}", im.getIncident).Methods("GET")
r.HandleFunc("/api/incidents/{id}/timeline", im.addTimelineEntry).Methods("POST")
r.HandleFunc("/api/incidents/{id}/actions", im.addRemediationAction).Methods("POST")
r.HandleFunc("/api/incidents/{id}/actions/{actionID}", im.updateRemediationAction).Methods("PUT")
r.HandleFunc("/api/incidents/{id}/status", im.updateIncidentStatus).Methods("PUT")
// WebSocket endpoint
r.HandleFunc("/ws", im.handleWebSocket)
// Serve static files
r.PathPrefix("/").Handler(http.FileServer(http.Dir("./static")))
// Start server
log.Println("Starting server on :8080")
log.Fatal(http.ListenAndServe(":8080", r))
}
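For illustration, a minimal Go client sketch against the API above (the service hostname and all field values are assumptions; the endpoint and incident fields come from the handlers and struct defined earlier):
// report_incident.go - sketch: opening an incident via the central API
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	incident := map[string]interface{}{
		"title":            "Database latency in us-west",
		"description":      "p99 query latency above 2s; replication lag growing",
		"severity":         2,
		"affectedServices": []string{"database-service"},
		"affectedRegions":  []string{"us-west"},
		"commander":        "Alex Chen",
	}
	body, err := json.Marshal(incident)
	if err != nil {
		log.Fatalf("marshal: %v", err)
	}
	// POST /api/incidents creates the incident and broadcasts it to all
	// connected WebSocket clients, giving every region the same live view.
	resp, err := http.Post("http://incident-manager:8080/api/incidents",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("create incident: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("incident created: HTTP %s", resp.Status)
}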
• Long-term: Implemented a comprehensive incident management framework:
- Established a global incident command structure with clear roles and responsibilities
- Standardized tooling and procedures across all regions
- Created a centralized incident management platform for real-time coordination
- Implemented regular cross-region incident response drills
- Developed a post-incident review process with actionable improvements
Lessons Learned:
Effective incident response requires clear command structure and coordination across teams.
How to Avoid:
Establish a clear incident command structure with defined roles and escalation paths (see the sketch after this list).
Standardize tools and procedures across all teams and regions.
Implement centralized communication channels for incident coordination.
Conduct regular cross-region incident response drills.
Document and share lessons learned from each incident.
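For illustration, a sketch of predefined escalation paths in the same YAML convention as incident_roles.yaml above (the schema, timings, and channel name are assumptions, not an existing format):
# escalation_paths.yaml - sketch, companion to incident_roles.yaml
escalation:
  severity_1:
    page_immediately: [incident_commander, operations_lead, communications_lead]
    decision_authority: incident_commander
    status_update_interval: 15m
  severity_2:
    page_immediately: [operations_lead]
    escalate_to: severity_1
    escalate_after: 30m
  default:
    channel: "#incident-response"   # single, centralized coordination channel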
No summary provided
What Happened:
An SRE team implemented a chaos engineering experiment to test the resilience of a critical microservices application. The experiment was designed to randomly terminate pods to ensure the system could handle unexpected failures. However, when executed, it caused a cascading failure that took down the entire production environment for over an hour, affecting thousands of users.
Diagnosis Steps:
Analyzed chaos experiment configuration and execution logs.
Reviewed system metrics and alerts during the incident.
Examined application logs for error patterns.
Checked Kubernetes events and pod status changes.
Investigated network and service mesh configurations.
Root Cause:
The investigation revealed multiple issues:
1. The chaos experiment targeted critical infrastructure components that weren't properly excluded.
2. The blast radius wasn't properly limited, allowing too many simultaneous failures.
3. The application's retry mechanisms created a thundering herd problem.
4. Circuit breakers were misconfigured and didn't trip as expected.
5. Monitoring didn't detect the early warning signs of the cascade.
Fix/Workaround:
• Short-term: Implemented immediate safeguards for chaos experiments:
# Chaos Mesh experiment with proper safeguards
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-demo
  namespace: chaos-testing
  annotations:
    experiment.chaos-mesh.org/pause: "false"
    experiment.chaos-mesh.org/duration: "30s"
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - app
    labelSelectors:
      "app.kubernetes.io/component": "worker"
  scheduler:
    cron: "@every 5m"
  # Critical safeguards
  containerNames: ["worker"]
  gracePeriod: 30
  maxPercentage: 20
  excludedPods:
    - app-leader-0
    - app-database-0
  excludedNamespaces:
    - kube-system
    - monitoring
    - istio-system
• Long-term: Implemented a comprehensive chaos engineering framework in Go:
// chaos_framework.go - Safe chaos engineering framework
package chaos
import (
	"context"
	"fmt"
	"math/rand"
	"strings"
	"sync"
	"time"

	"github.com/sirupsen/logrus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)
// ExperimentConfig defines the configuration for a chaos experiment
type ExperimentConfig struct {
// Name of the experiment
Name string `json:"name"`
// Description of what the experiment tests
Description string `json:"description"`
// Target namespaces to include in the experiment
TargetNamespaces []string `json:"targetNamespaces"`
// Namespaces to exclude from the experiment
ExcludedNamespaces []string `json:"excludedNamespaces"`
// Label selectors to target specific resources
LabelSelectors map[string]string `json:"labelSelectors"`
// Resources to exclude by name
ExcludedResources []string `json:"excludedResources"`
// Maximum percentage of resources to affect at once (0-100)
MaxPercentage int `json:"maxPercentage"`
// Duration of the experiment
Duration time.Duration `json:"duration"`
// Interval between actions
Interval time.Duration `json:"interval"`
// Whether to run the experiment in dry-run mode
DryRun bool `json:"dryRun"`
// Critical service indicators to monitor during experiment
CSIThresholds map[string]float64 `json:"csiThresholds"`
// Automatic rollback if thresholds are exceeded
AutoRollback bool `json:"autoRollback"`
// Notification channels for experiment events
NotificationChannels []string `json:"notificationChannels"`
// Experiment actions to perform
Actions []ExperimentAction `json:"actions"`
}
// ExperimentAction defines a specific chaos action
type ExperimentAction struct {
// Type of action (pod-kill, network-delay, cpu-stress, etc.)
Type string `json:"type"`
// Action-specific parameters
Parameters map[string]interface{} `json:"parameters"`
// Weight of this action (for random selection)
Weight int `json:"weight"`
}
// ExperimentResult contains the results of a chaos experiment
type ExperimentResult struct {
// Name of the experiment
ExperimentName string `json:"experimentName"`
// Start time of the experiment
StartTime time.Time `json:"startTime"`
// End time of the experiment
EndTime time.Time `json:"endTime"`
// Whether the experiment completed successfully
Successful bool `json:"successful"`
// Reason for failure if not successful
FailureReason string `json:"failureReason,omitempty"`
// Resources affected by the experiment
AffectedResources []string `json:"affectedResources"`
// Metrics collected during the experiment
Metrics map[string][]MetricDataPoint `json:"metrics"`
// Events that occurred during the experiment
Events []ExperimentEvent `json:"events"`
}
// MetricDataPoint represents a single metric measurement
type MetricDataPoint struct {
// Timestamp of the measurement
Timestamp time.Time `json:"timestamp"`
// Value of the metric
Value float64 `json:"value"`
}
// ExperimentEvent represents a significant event during the experiment
type ExperimentEvent struct {
// Timestamp of the event
Timestamp time.Time `json:"timestamp"`
// Type of event
Type string `json:"type"`
// Description of the event
Description string `json:"description"`
// Severity of the event (info, warning, error)
Severity string `json:"severity"`
}
// ChaosFramework orchestrates chaos engineering experiments
type ChaosFramework struct {
// Kubernetes client
client *kubernetes.Clientset
// Logger
logger *logrus.Logger
// Metrics collector
metricsCollector MetricsCollector
// Active experiments
activeExperiments map[string]*runningExperiment
// Mutex for thread safety
mu sync.Mutex
}
// runningExperiment tracks an active experiment
type runningExperiment struct {
// Configuration of the experiment
config ExperimentConfig
// Context for cancellation
ctx context.Context
// Cancel function
cancel context.CancelFunc
// Start time
startTime time.Time
// Affected resources
affectedResources []string
// Events
events []ExperimentEvent
}
// MetricsCollector interface for collecting metrics
type MetricsCollector interface {
// CollectMetrics collects metrics for the given resources
CollectMetrics(ctx context.Context, resources []string) (map[string]float64, error)
// StartCollection starts continuous metric collection
StartCollection(ctx context.Context, resources []string, interval time.Duration) error
// StopCollection stops continuous metric collection
StopCollection() (map[string][]MetricDataPoint, error)
}
// NewChaosFramework creates a new chaos engineering framework
func NewChaosFramework(metricsCollector MetricsCollector) (*ChaosFramework, error) {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to get cluster config: %w", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
}
// Create logger
logger := logrus.New()
logger.SetFormatter(&logrus.JSONFormatter{})
return &ChaosFramework{
client: clientset,
logger: logger,
metricsCollector: metricsCollector,
activeExperiments: make(map[string]*runningExperiment),
}, nil
}
// RunExperiment runs a chaos experiment
func (c *ChaosFramework) RunExperiment(config ExperimentConfig) (*ExperimentResult, error) {
c.mu.Lock()
defer c.mu.Unlock()
// Validate experiment configuration
if err := c.validateExperimentConfig(config); err != nil {
return nil, fmt.Errorf("invalid experiment configuration: %w", err)
}
// Check if experiment is already running
if _, exists := c.activeExperiments[config.Name]; exists {
return nil, fmt.Errorf("experiment %s is already running", config.Name)
}
// Create context with cancellation
ctx, cancel := context.WithTimeout(context.Background(), config.Duration)
// Create running experiment
experiment := &runningExperiment{
config: config,
ctx: ctx,
cancel: cancel,
startTime: time.Now(),
events: []ExperimentEvent{
{
Timestamp: time.Now(),
Type: "ExperimentStarted",
Description: fmt.Sprintf("Started experiment %s", config.Name),
Severity: "info",
},
},
}
// Store active experiment
c.activeExperiments[config.Name] = experiment
// Log experiment start
c.logger.WithFields(logrus.Fields{
"experiment": config.Name,
"duration": config.Duration.String(),
"dry_run": config.DryRun,
}).Info("Starting chaos experiment")
// Start experiment in a goroutine
go func() {
defer cancel()
// Start metrics collection
targetResources, err := c.getTargetResources(config)
if err != nil {
c.recordEvent(config.Name, "ResourceSelectionFailed", err.Error(), "error")
return
}
if err := c.metricsCollector.StartCollection(ctx, targetResources, 10*time.Second); err != nil {
c.recordEvent(config.Name, "MetricsCollectionFailed", err.Error(), "warning")
}
// Run experiment actions
c.runExperimentActions(experiment, targetResources)
// Wait for experiment completion or cancellation
<-ctx.Done()
// Stop metrics collection
metrics, _ := c.metricsCollector.StopCollection()
// Create experiment result
result := &ExperimentResult{
ExperimentName: config.Name,
StartTime: experiment.startTime,
EndTime: time.Now(),
Successful: ctx.Err() == context.DeadlineExceeded, // Completed normally
AffectedResources: experiment.affectedResources,
Metrics: metrics,
Events: experiment.events,
}
if ctx.Err() != context.DeadlineExceeded {
result.FailureReason = ctx.Err().Error()
}
// Log experiment completion
c.logger.WithFields(logrus.Fields{
"experiment": config.Name,
"duration": result.EndTime.Sub(result.StartTime).String(),
"successful": result.Successful,
}).Info("Completed chaos experiment")
// Remove from active experiments
c.mu.Lock()
delete(c.activeExperiments, config.Name)
c.mu.Unlock()
}()
return &ExperimentResult{
ExperimentName: config.Name,
StartTime: experiment.startTime,
}, nil
}
// StopExperiment stops a running experiment
func (c *ChaosFramework) StopExperiment(name string) error {
c.mu.Lock()
defer c.mu.Unlock()
experiment, exists := c.activeExperiments[name]
if !exists {
return fmt.Errorf("experiment %s is not running", name)
}
// Cancel experiment
experiment.cancel()
// Record event
c.recordEvent(name, "ExperimentStopped", "Experiment was manually stopped", "warning")
return nil
}
// GetActiveExperiments returns the names of all active experiments
func (c *ChaosFramework) GetActiveExperiments() []string {
c.mu.Lock()
defer c.mu.Unlock()
names := make([]string, 0, len(c.activeExperiments))
for name := range c.activeExperiments {
names = append(names, name)
}
return names
}
// validateExperimentConfig validates the experiment configuration
func (c *ChaosFramework) validateExperimentConfig(config ExperimentConfig) error {
// Check required fields
if config.Name == "" {
return fmt.Errorf("experiment name is required")
}
if len(config.TargetNamespaces) == 0 && len(config.LabelSelectors) == 0 {
return fmt.Errorf("at least one target namespace or label selector is required")
}
if config.Duration <= 0 {
return fmt.Errorf("experiment duration must be positive")
}
if config.MaxPercentage <= 0 || config.MaxPercentage > 100 {
return fmt.Errorf("max percentage must be between 1 and 100")
}
// Validate actions
if len(config.Actions) == 0 {
return fmt.Errorf("at least one action is required")
}
for i, action := range config.Actions {
if action.Type == "" {
return fmt.Errorf("action %d has no type", i)
}
if action.Weight <= 0 {
return fmt.Errorf("action %d has invalid weight", i)
}
// Validate action-specific parameters
switch action.Type {
case "pod-kill":
// No specific parameters required
case "network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("network-delay action requires latency parameter")
}
case "cpu-stress":
if _, ok := action.Parameters["load"]; !ok {
return fmt.Errorf("cpu-stress action requires load parameter")
}
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
// Check for critical namespaces
criticalNamespaces := []string{"kube-system", "kube-public", "kube-node-lease"}
for _, ns := range config.TargetNamespaces {
for _, critical := range criticalNamespaces {
if ns == critical && !contains(config.ExcludedNamespaces, critical) {
return fmt.Errorf("targeting critical namespace %s without excluding it", critical)
}
}
}
return nil
}
// getTargetResources gets the resources targeted by the experiment
func (c *ChaosFramework) getTargetResources(config ExperimentConfig) ([]string, error) {
var resources []string
// Get pods matching the criteria
for _, namespace := range config.TargetNamespaces {
// Skip excluded namespaces
if contains(config.ExcludedNamespaces, namespace) {
continue
}
// Create label selector string
var labelSelector string
for k, v := range config.LabelSelectors {
if labelSelector != "" {
labelSelector += ","
}
labelSelector += fmt.Sprintf("%s=%s", k, v)
}
// List pods
pods, err := c.client.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
return nil, fmt.Errorf("failed to list pods in namespace %s: %w", namespace, err)
}
// Add pods to resources
for _, pod := range pods.Items {
// Skip excluded resources
if contains(config.ExcludedResources, pod.Name) {
continue
}
resources = append(resources, fmt.Sprintf("%s/%s", namespace, pod.Name))
}
}
// Apply max percentage
maxCount := len(resources) * config.MaxPercentage / 100
if maxCount < 1 {
maxCount = 1
}
// Shuffle and limit resources
rand.Shuffle(len(resources), func(i, j int) {
resources[i], resources[j] = resources[j], resources[i]
})
if len(resources) > maxCount {
resources = resources[:maxCount]
}
return resources, nil
}
// runExperimentActions runs the experiment actions
func (c *ChaosFramework) runExperimentActions(experiment *runningExperiment, targetResources []string) {
config := experiment.config
ctx := experiment.ctx
// Calculate total weight
totalWeight := 0
for _, action := range config.Actions {
totalWeight += action.Weight
}
// Run actions at intervals
ticker := time.NewTicker(config.Interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
// Select random action based on weight
actionIndex := -1
randomWeight := rand.Intn(totalWeight)
weightSum := 0
for i, action := range config.Actions {
weightSum += action.Weight
if randomWeight < weightSum {
actionIndex = i
break
}
}
if actionIndex == -1 {
continue
}
action := config.Actions[actionIndex]
// Select random resource
if len(targetResources) == 0 {
continue
}
resourceIndex := rand.Intn(len(targetResources))
resource := targetResources[resourceIndex]
// Execute action
if err := c.executeAction(ctx, action, resource, config.DryRun); err != nil {
c.recordEvent(config.Name, "ActionFailed",
fmt.Sprintf("Failed to execute action %s on resource %s: %v", action.Type, resource, err),
"error")
continue
}
// Record affected resource
experiment.affectedResources = append(experiment.affectedResources, resource)
// Record event
c.recordEvent(config.Name, "ActionExecuted",
fmt.Sprintf("Executed action %s on resource %s", action.Type, resource),
"info")
// Check metrics for abort conditions
metrics, err := c.metricsCollector.CollectMetrics(ctx, []string{resource})
if err != nil {
c.logger.WithError(err).Warn("Failed to collect metrics")
continue
}
// Check thresholds
for metric, threshold := range config.CSIThresholds {
if value, ok := metrics[metric]; ok && value > threshold {
c.logger.WithFields(logrus.Fields{
"metric": metric,
"value": value,
"threshold": threshold,
}).Warn("Metric threshold exceeded")
if config.AutoRollback {
c.recordEvent(config.Name, "ThresholdExceeded",
fmt.Sprintf("Metric %s exceeded threshold (%.2f > %.2f)", metric, value, threshold),
"warning")
// Cancel experiment
experiment.cancel()
return
}
}
}
case <-ctx.Done():
return
}
}
}
// executeAction executes a chaos action on a resource
func (c *ChaosFramework) executeAction(ctx context.Context, action ExperimentAction, resource string, dryRun bool) error {
// Parse namespace and name
parts := splitResource(resource)
if len(parts) != 2 {
return fmt.Errorf("invalid resource format: %s", resource)
}
namespace, name := parts[0], parts[1]
// Execute action based on type
switch action.Type {
case "pod-kill":
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "pod-kill",
"namespace": namespace,
"pod": name,
"dry_run": true,
}).Info("Would delete pod")
return nil
}
return c.client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
case "network-delay":
latency, ok := action.Parameters["latency"].(string)
if !ok {
return fmt.Errorf("invalid latency parameter")
}
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "network-delay",
"namespace": namespace,
"pod": name,
"latency": latency,
"dry_run": true,
}).Info("Would add network delay")
return nil
}
// In a real implementation, this would use a CNI plugin or sidecar to add network delay (see the exec sketch after the helper functions below)
c.logger.WithFields(logrus.Fields{
"action": "network-delay",
"namespace": namespace,
"pod": name,
"latency": latency,
}).Info("Adding network delay")
return nil
case "cpu-stress":
load, ok := action.Parameters["load"].(float64)
if !ok {
return fmt.Errorf("invalid load parameter")
}
if dryRun {
c.logger.WithFields(logrus.Fields{
"action": "cpu-stress",
"namespace": namespace,
"pod": name,
"load": load,
"dry_run": true,
}).Info("Would add CPU stress")
return nil
}
// In a real implementation, this would inject a stress test into the pod
c.logger.WithFields(logrus.Fields{
"action": "cpu-stress",
"namespace": namespace,
"pod": name,
"load": load,
}).Info("Adding CPU stress")
return nil
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
// recordEvent records an experiment event
func (c *ChaosFramework) recordEvent(experimentName, eventType, description, severity string) {
c.mu.Lock()
defer c.mu.Unlock()
experiment, exists := c.activeExperiments[experimentName]
if !exists {
return
}
event := ExperimentEvent{
Timestamp: time.Now(),
Type: eventType,
Description: description,
Severity: severity,
}
experiment.events = append(experiment.events, event)
c.logger.WithFields(logrus.Fields{
"experiment": experimentName,
"event_type": eventType,
"severity": severity,
}).Info(description)
}
// Helper functions
// contains checks if a string is in a slice
func contains(slice []string, item string) bool {
for _, s := range slice {
if s == item {
return true
}
}
return false
}
// splitResource splits a resource string into namespace and name
func splitResource(resource string) []string {
return strings.Split(resource, "/")
}
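The network-delay and cpu-stress branches above only log what a real injection would do. The sketch below shows one way to wire the injection up with client-go's exec subresource; the execInPod helper, the restConfig field, and the tc command are illustrative assumptions, not part of the framework above. The target container must ship the tc binary and hold NET_ADMIN, and the service account would additionally need the pods/exec verb, which the RBAC policy below does not grant by default.
// execInPod is a hypothetical helper sketching how executeAction could run a
// command (for example "tc qdisc add dev eth0 root netem delay 100ms") inside
// a target pod. It assumes the framework also stores the *rest.Config used to
// build c.client in a restConfig field, plus these imports: bytes,
// corev1 "k8s.io/api/core/v1", "k8s.io/client-go/kubernetes/scheme",
// "k8s.io/client-go/tools/remotecommand".
func (c *ChaosFramework) execInPod(ctx context.Context, namespace, pod, container string, command []string) error {
    req := c.client.CoreV1().RESTClient().Post().
        Resource("pods").
        Namespace(namespace).
        Name(pod).
        SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: container,
            Command:   command,
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(c.restConfig, "POST", req.URL())
    if err != nil {
        return fmt.Errorf("failed to create executor: %w", err)
    }

    var stdout, stderr bytes.Buffer
    if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
        Stdout: &stdout,
        Stderr: &stderr,
    }); err != nil {
        return fmt.Errorf("exec failed: %w (stderr: %s)", err, stderr.String())
    }
    return nil
}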
• Created a comprehensive chaos testing policy:
# chaos-testing-policy.yaml
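# NOTE: PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25;
# on newer clusters the equivalent constraints would be enforced with the
# built-in Pod Security Admission controller or a policy engine such as
# Kyverno or OPA Gatekeeper.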
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: chaos-testing-psp
annotations:
seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
readOnlyRootFilesystem: true
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: chaos-testing-role
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "patch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list"]
- apiGroups: ["metrics.k8s.io"]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: chaos-testing-binding
subjects:
- kind: ServiceAccount
name: chaos-testing-sa
namespace: chaos-testing
roleRef:
kind: ClusterRole
name: chaos-testing-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Namespace
metadata:
name: chaos-testing
labels:
name: chaos-testing
purpose: chaos-engineering
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: chaos-testing-sa
namespace: chaos-testing
---
apiVersion: v1
kind: ConfigMap
metadata:
name: chaos-testing-config
namespace: chaos-testing
data:
config.yaml: |
# Global chaos testing configuration
defaultExperimentDuration: 10m
defaultInterval: 30s
maxConcurrentExperiments: 1
notificationChannels:
- slack
- email
slackWebhook: "${SLACK_WEBHOOK}"
emailRecipients:
- sre-team@example.com
# Default safeguards
defaultSafeguards:
excludedNamespaces:
- kube-system
- kube-public
- kube-node-lease
- istio-system
- monitoring
- logging
excludedLabels:
"chaos-exempt": "true"
"criticality": "high"
maxPercentage: 20
autoRollback: true
csiThresholds:
"error_rate": 5.0
"latency_p99": 1000
"cpu_usage": 90.0
Lessons Learned:
Chaos engineering requires careful planning, proper safeguards, and controlled blast radius.
How to Avoid:
Implement proper safeguards for all chaos experiments.
Start with non-production environments and gradually expand.
Define clear abort conditions and automatic rollbacks.
Ensure monitoring can detect early warning signs of cascading failures.
Educate teams about chaos engineering principles and practices.
No summary provided
What Happened:
An SRE team implemented a chaos engineering program to improve system resilience. During a scheduled experiment to test database failover, the chaos experiment unexpectedly affected critical production services outside the intended blast radius. The experiment was supposed to terminate a single database pod in a non-critical service, but due to misconfiguration, it terminated multiple database pods across several services, including the company's payment processing system. This resulted in a 45-minute outage for critical business functions.
Diagnosis Steps:
Analyzed chaos experiment configuration and execution logs.
Reviewed Kubernetes resource selectors and labels used in the experiment.
Examined the blast radius containment mechanisms.
Checked monitoring alerts and system behavior during the incident.
Reviewed the experiment approval and execution process.
Root Cause:
The investigation revealed multiple issues:
1. The chaos experiment used overly broad pod selectors that matched more resources than intended
2. The blast radius containment mechanism failed due to incorrect namespace configuration
3. Pre-experiment verification steps were skipped due to time pressure
4. Monitoring for critical services affected by the experiment was inadequate
5. The experiment was run during a high-traffic period without proper scheduling controls
Fix/Workaround:
• Short-term: Implemented immediate improvements to chaos engineering practices:
- Added strict label selectors for all experiments
- Implemented mandatory pre-experiment verification
- Created a chaos engineering approval workflow
• Long-term: Developed a comprehensive chaos engineering framework in Go:
// chaos_framework.go
package chaos
import (
"context"
"fmt"
"log"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
var (
// Prometheus metrics
experimentsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "chaos_experiments_total",
Help: "Total number of chaos experiments executed",
},
[]string{"name", "target_type", "namespace", "status"},
)
experimentDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "chaos_experiment_duration_seconds",
Help: "Duration of chaos experiments in seconds",
Buckets: prometheus.LinearBuckets(10, 30, 10), // 10s, 40s, 70s, ...
},
[]string{"name", "target_type"},
)
affectedResources = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "chaos_affected_resources",
Help: "Number of resources affected by chaos experiments",
},
[]string{"name", "target_type", "namespace"},
)
)
// ExperimentConfig defines the configuration for a chaos experiment
type ExperimentConfig struct {
Name string `json:"name"`
Description string `json:"description"`
TargetType string `json:"target_type"` // pod, node, service, etc.
Namespace string `json:"namespace"`
LabelSelectors map[string]string `json:"label_selectors"`
ExcludeLabels map[string]string `json:"exclude_labels"`
Duration time.Duration `json:"duration"`
MaxTargets int `json:"max_targets"`
ConcurrentTargets int `json:"concurrent_targets"`
DryRun bool `json:"dry_run"`
Actions []Action `json:"actions"`
PreChecks []Check `json:"pre_checks"`
PostChecks []Check `json:"post_checks"`
Schedule *Schedule `json:"schedule"`
Notifications []Notification `json:"notifications"`
SafetyChecks []SafetyCheck `json:"safety_checks"`
}
// Action defines a chaos action to be performed
type Action struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
}
// Check defines a check to be performed before or after an experiment
type Check struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
Timeout time.Duration `json:"timeout"`
}
// Schedule defines when an experiment should be executed
type Schedule struct {
Cron string `json:"cron"`
AllowedDays []string `json:"allowed_days"`
AllowedHours []int `json:"allowed_hours"`
ExcludedDates []string `json:"excluded_dates"`
TimeZone string `json:"time_zone"`
MaxConcurrent int `json:"max_concurrent"`
RequireApproval bool `json:"require_approval"`
}
// Notification defines how to notify about experiment execution
type Notification struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
Events []string `json:"events"`
}
// SafetyCheck defines a safety check to prevent unintended damage
type SafetyCheck struct {
Type string `json:"type"`
Parameters map[string]interface{} `json:"parameters"`
StopOnFail bool `json:"stop_on_fail"`
}
// ExperimentResult contains the result of a chaos experiment
type ExperimentResult struct {
ExperimentName string `json:"experiment_name"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
TargetsAffected int `json:"targets_affected"`
ActionResults map[string]ActionResult `json:"action_results"`
CheckResults map[string]CheckResult `json:"check_results"`
Logs []LogEntry `json:"logs"`
Metrics map[string]float64 `json:"metrics"`
}
// ActionResult contains the result of a chaos action
type ActionResult struct {
ActionType string `json:"action_type"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
TargetCount int `json:"target_count"`
Error string `json:"error,omitempty"`
}
// CheckResult contains the result of a pre or post check
type CheckResult struct {
CheckType string `json:"check_type"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Duration time.Duration `json:"duration"`
Status string `json:"status"`
Error string `json:"error,omitempty"`
}
// LogEntry represents a log entry from the experiment
type LogEntry struct {
Timestamp time.Time `json:"timestamp"`
Level string `json:"level"`
Component string `json:"component"`
Message string `json:"message"`
}
// ChaosFramework is the main entry point for chaos experiments
type ChaosFramework struct {
clientset *kubernetes.Clientset
logger *log.Logger
config *rest.Config
}
// NewChaosFramework creates a new chaos framework
func NewChaosFramework(logger *log.Logger) (*ChaosFramework, error) {
// Get in-cluster config
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to get in-cluster config: %w", err)
}
// Create clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("failed to create clientset: %w", err)
}
return &ChaosFramework{
clientset: clientset,
logger: logger,
config: config,
}, nil
}
// RunExperiment runs a chaos experiment
func (c *ChaosFramework) RunExperiment(ctx context.Context, config ExperimentConfig) (*ExperimentResult, error) {
c.logger.Printf("Starting chaos experiment: %s", config.Name)
// Initialize result
result := &ExperimentResult{
ExperimentName: config.Name,
StartTime: time.Now(),
Status: "running",
ActionResults: make(map[string]ActionResult),
CheckResults: make(map[string]CheckResult),
Logs: []LogEntry{},
Metrics: make(map[string]float64),
}
// Add initial log entry
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: fmt.Sprintf("Starting experiment: %s", config.Name),
})
// Validate configuration
if err := c.validateConfig(config); err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Configuration validation failed: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("invalid configuration: %w", err)
}
// Check if experiment is allowed to run now
if config.Schedule != nil && !c.isScheduleAllowed(config.Schedule) {
result.Status = "skipped"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Experiment skipped due to schedule constraints",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "skipped").Inc()
return result, nil
}
// Run safety checks
if err := c.runSafetyChecks(ctx, config); err != nil {
result.Status = "aborted"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Safety check failed: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "aborted").Inc()
return result, fmt.Errorf("safety check failed: %w", err)
}
// Get targets
targets, err := c.getTargets(ctx, config)
if err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Failed to get targets: %v", err),
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("failed to get targets: %w", err)
}
// Check if we have targets
if len(targets) == 0 {
result.Status = "skipped"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "No targets found for experiment",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "skipped").Inc()
return result, nil
}
// Limit targets if needed
if config.MaxTargets > 0 && len(targets) > config.MaxTargets {
targets = targets[:config.MaxTargets]
}
// Update result with target count
result.TargetsAffected = len(targets)
// Update metrics
affectedResources.WithLabelValues(config.Name, config.TargetType, config.Namespace).Set(float64(len(targets)))
// Run pre-checks
preChecksPassed, preCheckResults := c.runChecks(ctx, config.PreChecks, "pre")
for name, checkResult := range preCheckResults {
result.CheckResults[name] = checkResult
}
if !preChecksPassed {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: "Pre-checks failed",
})
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("pre-checks failed")
}
// Send start notifications
c.sendNotifications(config.Notifications, "start", config, result)
// Run actions
if !config.DryRun {
actionResults, err := c.runActions(ctx, config.Actions, targets, config)
if err != nil {
result.Status = "failed"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "error",
Component: "framework",
Message: fmt.Sprintf("Failed to run actions: %v", err),
})
// Send failure notifications
c.sendNotifications(config.Notifications, "failure", config, result)
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, "failed").Inc()
return result, fmt.Errorf("failed to run actions: %w", err)
}
for name, actionResult := range actionResults {
result.ActionResults[name] = actionResult
}
} else {
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Dry run mode, skipping actions",
})
}
// Wait for experiment duration
select {
case <-ctx.Done():
result.Status = "interrupted"
result.Logs = append(result.Logs, LogEntry{
Timestamp: time.Now(),
Level: "info",
Component: "framework",
Message: "Experiment interrupted",
})
case <-time.After(config.Duration):
// Continue with post-checks
}
// Run post-checks
postChecksPassed, postCheckResults := c.runChecks(ctx, config.PostChecks, "post")
for name, checkResult := range postCheckResults {
result.CheckResults[name] = checkResult
}
// Finalize result
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if result.Status == "running" {
if postChecksPassed {
result.Status = "completed"
} else {
result.Status = "failed"
}
}
// Update metrics
experimentsTotal.WithLabelValues(config.Name, config.TargetType, config.Namespace, result.Status).Inc()
experimentDuration.WithLabelValues(config.Name, config.TargetType).Observe(result.Duration.Seconds())
// Send completion notifications
c.sendNotifications(config.Notifications, result.Status, config, result)
c.logger.Printf("Completed chaos experiment: %s, status: %s, duration: %v",
config.Name, result.Status, result.Duration)
return result, nil
}
// validateConfig validates the experiment configuration
func (c *ChaosFramework) validateConfig(config ExperimentConfig) error {
if config.Name == "" {
return fmt.Errorf("experiment name is required")
}
if config.TargetType == "" {
return fmt.Errorf("target type is required")
}
if config.Namespace == "" {
return fmt.Errorf("namespace is required")
}
if len(config.LabelSelectors) == 0 {
return fmt.Errorf("at least one label selector is required")
}
if config.Duration <= 0 {
return fmt.Errorf("duration must be greater than zero")
}
if len(config.Actions) == 0 {
return fmt.Errorf("at least one action is required")
}
// Validate actions
for _, action := range config.Actions {
if action.Type == "" {
return fmt.Errorf("action type is required")
}
// Validate action parameters based on type
switch action.Type {
case "pod-delete":
// No specific parameters required
case "pod-cpu-stress":
if _, ok := action.Parameters["cpu"]; !ok {
return fmt.Errorf("cpu parameter is required for pod-cpu-stress action")
}
case "pod-memory-stress":
if _, ok := action.Parameters["memory"]; !ok {
return fmt.Errorf("memory parameter is required for pod-memory-stress action")
}
case "pod-network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("latency parameter is required for pod-network-delay action")
}
case "pod-network-loss":
if _, ok := action.Parameters["loss"]; !ok {
return fmt.Errorf("loss parameter is required for pod-network-loss action")
}
case "node-cpu-stress":
if _, ok := action.Parameters["cpu"]; !ok {
return fmt.Errorf("cpu parameter is required for node-cpu-stress action")
}
case "node-memory-stress":
if _, ok := action.Parameters["memory"]; !ok {
return fmt.Errorf("memory parameter is required for node-memory-stress action")
}
case "node-network-delay":
if _, ok := action.Parameters["latency"]; !ok {
return fmt.Errorf("latency parameter is required for node-network-delay action")
}
case "node-network-loss":
if _, ok := action.Parameters["loss"]; !ok {
return fmt.Errorf("loss parameter is required for node-network-loss action")
}
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
return nil
}
// isScheduleAllowed checks if the experiment is allowed to run based on schedule
func (c *ChaosFramework) isScheduleAllowed(schedule *Schedule) bool {
now := time.Now()
// Check time zone
loc, err := time.LoadLocation(schedule.TimeZone)
if err != nil {
c.logger.Printf("Failed to load time zone %s: %v", schedule.TimeZone, err)
loc = time.UTC
}
// Convert to specified time zone
now = now.In(loc)
// Check allowed days
if len(schedule.AllowedDays) > 0 {
dayName := now.Weekday().String()
allowed := false
for _, day := range schedule.AllowedDays {
if day == dayName {
allowed = true
break
}
}
if !allowed {
return false
}
}
// Check allowed hours
if len(schedule.AllowedHours) > 0 {
hour := now.Hour()
allowed := false
for _, h := range schedule.AllowedHours {
if h == hour {
allowed = true
break
}
}
if !allowed {
return false
}
}
// Check excluded dates
if len(schedule.ExcludedDates) > 0 {
dateStr := now.Format("2006-01-02")
for _, date := range schedule.ExcludedDates {
if date == dateStr {
return false
}
}
}
return true
}
// runSafetyChecks runs all safety checks
func (c *ChaosFramework) runSafetyChecks(ctx context.Context, config ExperimentConfig) error {
for _, check := range config.SafetyChecks {
passed, err := c.runSafetyCheck(ctx, check, config)
if err != nil {
return fmt.Errorf("safety check %s failed: %w", check.Type, err)
}
if !passed && check.StopOnFail {
return fmt.Errorf("safety check %s failed", check.Type)
}
}
return nil
}
// runSafetyCheck runs a single safety check
func (c *ChaosFramework) runSafetyCheck(ctx context.Context, check SafetyCheck, config ExperimentConfig) (bool, error) {
switch check.Type {
case "target-count-limit":
maxTargets, ok := check.Parameters["max_targets"].(int)
if !ok {
return false, fmt.Errorf("max_targets parameter must be an integer")
}
targets, err := c.getTargets(ctx, config)
if err != nil {
return false, fmt.Errorf("failed to get targets: %w", err)
}
return len(targets) <= maxTargets, nil
case "target-percentage-limit":
maxPercentage, ok := check.Parameters["max_percentage"].(float64)
if !ok {
return false, fmt.Errorf("max_percentage parameter must be a float")
}
targets, err := c.getTargets(ctx, config)
if err != nil {
return false, fmt.Errorf("failed to get targets: %w", err)
}
// Get total resources
var totalCount int
switch config.TargetType {
case "pod":
pods, err := c.clientset.CoreV1().Pods(config.Namespace).List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list pods: %w", err)
}
totalCount = len(pods.Items)
case "node":
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
return false, fmt.Errorf("failed to list nodes: %w", err)
}
totalCount = len(nodes.Items)
default:
return false, fmt.Errorf("unsupported target type for percentage check: %s", config.TargetType)
}
if totalCount == 0 {
return false, fmt.Errorf("no resources found of type %s", config.TargetType)
}
percentage := float64(len(targets)) / float64(totalCount) * 100
return percentage <= maxPercentage, nil
case "critical-service-check":
serviceName, ok := check.Parameters["service_name"].(string)
if !ok {
return false, fmt.Errorf("service_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if service exists and is healthy
service, err := c.clientset.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get service %s: %w", serviceName, err)
}
// Check if service has endpoints
endpoints, err := c.clientset.CoreV1().Endpoints(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get endpoints for service %s: %w", serviceName, err)
}
// Check if service has at least one endpoint
hasEndpoints := false
for _, subset := range endpoints.Subsets {
if len(subset.Addresses) > 0 {
hasEndpoints = true
break
}
}
return service != nil && hasEndpoints, nil
case "prometheus-query":
query, ok := check.Parameters["query"].(string)
if !ok {
return false, fmt.Errorf("query parameter must be a string")
}
threshold, ok := check.Parameters["threshold"].(float64)
if !ok {
return false, fmt.Errorf("threshold parameter must be a float")
}
operator, ok := check.Parameters["operator"].(string)
if !ok {
operator = "lt" // less than
}
// TODO: Implement Prometheus query execution using query, threshold, and
// operator (a sketch follows runSafetyCheck below); until then the check
// passes unconditionally
_, _, _ = query, threshold, operator // keeps the variables used so the file compiles
return true, nil
default:
return false, fmt.Errorf("unknown safety check type: %s", check.Type)
}
}
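// The prometheus-query safety check above is stubbed out. Below is a minimal
// sketch of how it could be completed with the official Go client
// (github.com/prometheus/client_golang/api); the evalPrometheusQuery name and
// the promURL parameter are illustrative assumptions, not part of the original
// framework. Additional imports assumed:
//   promapi "github.com/prometheus/client_golang/api"
//   promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
//   "github.com/prometheus/common/model"
func evalPrometheusQuery(ctx context.Context, promURL, query string, threshold float64, operator string) (bool, error) {
    client, err := promapi.NewClient(promapi.Config{Address: promURL})
    if err != nil {
        return false, fmt.Errorf("failed to create Prometheus client: %w", err)
    }
    // Instant query evaluated at "now"; warnings are ignored in this sketch
    result, _, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
    if err != nil {
        return false, fmt.Errorf("query failed: %w", err)
    }
    vector, ok := result.(model.Vector)
    if !ok || len(vector) == 0 {
        return false, fmt.Errorf("query returned no samples")
    }
    value := float64(vector[0].Value)
    switch operator {
    case "lt":
        return value < threshold, nil
    case "gt":
        return value > threshold, nil
    default:
        return false, fmt.Errorf("unsupported operator: %s", operator)
    }
}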
// getTargets gets the targets for the experiment
func (c *ChaosFramework) getTargets(ctx context.Context, config ExperimentConfig) ([]string, error) {
var targets []string
switch config.TargetType {
case "pod":
// Get pods matching label selectors
pods, err := c.clientset.CoreV1().Pods(config.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: metav1.FormatLabelSelector(&metav1.LabelSelector{
MatchLabels: config.LabelSelectors,
}),
})
if err != nil {
return nil, fmt.Errorf("failed to list pods: %w", err)
}
// Filter out pods with exclude labels
for _, pod := range pods.Items {
exclude := false
for key, value := range config.ExcludeLabels {
if podValue, ok := pod.Labels[key]; ok && podValue == value {
exclude = true
break
}
}
if !exclude {
targets = append(targets, pod.Name)
}
}
case "node":
// Get nodes matching label selectors
nodes, err := c.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{
LabelSelector: metav1.FormatLabelSelector(&metav1.LabelSelector{
MatchLabels: config.LabelSelectors,
}),
})
if err != nil {
return nil, fmt.Errorf("failed to list nodes: %w", err)
}
// Filter out nodes with exclude labels
for _, node := range nodes.Items {
exclude := false
for key, value := range config.ExcludeLabels {
if nodeValue, ok := node.Labels[key]; ok && nodeValue == value {
exclude = true
break
}
}
if !exclude {
targets = append(targets, node.Name)
}
}
default:
return nil, fmt.Errorf("unsupported target type: %s", config.TargetType)
}
return targets, nil
}
// runChecks runs all checks of a specific type (pre or post)
func (c *ChaosFramework) runChecks(ctx context.Context, checks []Check, checkType string) (bool, map[string]CheckResult) {
results := make(map[string]CheckResult)
allPassed := true
for i, check := range checks {
checkName := fmt.Sprintf("%s-check-%d-%s", checkType, i, check.Type)
result := CheckResult{
CheckType: check.Type,
StartTime: time.Now(),
Status: "running",
}
// Run check with timeout; cancel immediately after the check rather than
// deferring, so contexts are not leaked across loop iterations
checkCtx, cancel := context.WithTimeout(ctx, check.Timeout)
passed, err := c.runCheck(checkCtx, check)
cancel()
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if err != nil {
result.Status = "error"
result.Error = err.Error()
allPassed = false
} else if !passed {
result.Status = "failed"
allPassed = false
} else {
result.Status = "passed"
}
results[checkName] = result
}
return allPassed, results
}
// runCheck runs a single check
func (c *ChaosFramework) runCheck(ctx context.Context, check Check) (bool, error) {
switch check.Type {
case "pod-running":
podName, ok := check.Parameters["pod_name"].(string)
if !ok {
return false, fmt.Errorf("pod_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
pod, err := c.clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get pod %s: %w", podName, err)
}
return pod.Status.Phase == "Running", nil
case "service-available":
serviceName, ok := check.Parameters["service_name"].(string)
if !ok {
return false, fmt.Errorf("service_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if service exists
service, err := c.clientset.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get service %s: %w", serviceName, err)
}
// Check if service has endpoints
endpoints, err := c.clientset.CoreV1().Endpoints(namespace).Get(ctx, serviceName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get endpoints for service %s: %w", serviceName, err)
}
// Check if service has at least one endpoint
for _, subset := range endpoints.Subsets {
if len(subset.Addresses) > 0 {
return true, nil
}
}
return false, nil
case "deployment-available":
deploymentName, ok := check.Parameters["deployment_name"].(string)
if !ok {
return false, fmt.Errorf("deployment_name parameter must be a string")
}
namespace, ok := check.Parameters["namespace"].(string)
if !ok {
namespace = "default"
}
// Check if deployment exists
deployment, err := c.clientset.AppsV1().Deployments(namespace).Get(ctx, deploymentName, metav1.GetOptions{})
if err != nil {
return false, fmt.Errorf("failed to get deployment %s: %w", deploymentName, err)
}
// Check if deployment is available
for _, condition := range deployment.Status.Conditions {
if condition.Type == "Available" && condition.Status == "True" {
return true, nil
}
}
return false, nil
case "http-check":
url, ok := check.Parameters["url"].(string)
if !ok {
return false, fmt.Errorf("url parameter must be a string")
}
expectedStatus, ok := check.Parameters["expected_status"].(int)
if !ok {
// JSON-decoded numbers arrive as float64; fall back before defaulting
if f, isFloat := check.Parameters["expected_status"].(float64); isFloat {
expectedStatus = int(f)
} else {
expectedStatus = 200
}
}
// TODO: Implement the HTTP call using url and expectedStatus (a sketch
// follows runCheck below); until then the check passes unconditionally
_, _ = url, expectedStatus // keeps the variables used so the file compiles
return true, nil
default:
return false, fmt.Errorf("unknown check type: %s", check.Type)
}
}
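// The http-check above is likewise stubbed. A small sketch of the missing
// call, assuming the check should simply compare the response status code;
// the evalHTTPCheck name is illustrative and "net/http" would be added to
// the imports.
func evalHTTPCheck(ctx context.Context, url string, expectedStatus int) (bool, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return false, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode == expectedStatus, nil
}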
// runActions runs all actions on the targets
func (c *ChaosFramework) runActions(ctx context.Context, actions []Action, targets []string, config ExperimentConfig) (map[string]ActionResult, error) {
results := make(map[string]ActionResult)
for i, action := range actions {
actionName := fmt.Sprintf("action-%d-%s", i, action.Type)
result := ActionResult{
ActionType: action.Type,
StartTime: time.Now(),
Status: "running",
TargetCount: len(targets),
}
// Run action
err := c.runAction(ctx, action, targets, config)
result.EndTime = time.Now()
result.Duration = result.EndTime.Sub(result.StartTime)
if err != nil {
result.Status = "failed"
result.Error = err.Error()
// Record the failed action before returning so callers can see it
results[actionName] = result
return results, fmt.Errorf("action %s failed: %w", actionName, err)
}
result.Status = "completed"
results[actionName] = result
}
return results, nil
}
// runAction runs a single action on the targets
func (c *ChaosFramework) runAction(ctx context.Context, action Action, targets []string, config ExperimentConfig) error {
switch action.Type {
case "pod-delete":
// Delete pods
for _, podName := range targets {
if err := c.clientset.CoreV1().Pods(config.Namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
return fmt.Errorf("failed to delete pod %s: %w", podName, err)
}
}
case "pod-cpu-stress":
// TODO: Implement pod CPU stress
// This would require executing commands in the pods
case "pod-memory-stress":
// TODO: Implement pod memory stress
// This would require executing commands in the pods
case "pod-network-delay":
// TODO: Implement pod network delay
// This would require executing commands in the pods
case "pod-network-loss":
// TODO: Implement pod network loss
// This would require executing commands in the pods
case "node-cpu-stress":
// TODO: Implement node CPU stress
// This would require executing commands on the nodes
case "node-memory-stress":
// TODO: Implement node memory stress
// This would require executing commands on the nodes
case "node-network-delay":
// TODO: Implement node network delay
// This would require executing commands on the nodes
case "node-network-loss":
// TODO: Implement node network loss
// This would require executing commands on the nodes
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
return nil
}
// sendNotifications sends notifications for an experiment
func (c *ChaosFramework) sendNotifications(notifications []Notification, event string, config ExperimentConfig, result *ExperimentResult) {
for _, notification := range notifications {
// Check if this notification should be sent for this event
shouldSend := false
for _, notificationEvent := range notification.Events {
if notificationEvent == event || notificationEvent == "all" {
shouldSend = true
break
}
}
if !shouldSend {
continue
}
// Send notification
switch notification.Type {
case "slack":
// TODO: Implement Slack notification
case "email":
// TODO: Implement email notification
case "webhook":
// TODO: Implement webhook notification
default:
c.logger.Printf("Unknown notification type: %s", notification.Type)
}
}
}
• Created a comprehensive chaos engineering policy:
# chaos_policy.yaml - Chaos Engineering Policy
apiVersion: chaos.example.com/v1
kind: ChaosPolicy
metadata:
name: chaos-engineering-policy
namespace: chaos-system
spec:
# Global settings
global:
# Default blast radius limits
blastRadius:
maxTargets: 5
maxPercentage: 10
targetLimits:
pod: 5
node: 2
deployment: 2
statefulset: 1
# Default scheduling constraints
scheduling:
allowedDays:
- Monday
- Tuesday
- Wednesday
- Thursday
allowedHours:
- 10
- 11
- 14
- 15
excludedDates:
- "2023-12-25"
- "2024-01-01"
timeZone: "UTC"
# Default notification settings
notifications:
- type: slack
parameters:
channel: "#chaos-engineering"
username: "Chaos Bot"
events:
- start
- failure
- completed
- type: email
parameters:
recipients:
- "sre-team@example.com"
events:
- failure
# Default safety checks
safetyChecks:
- type: critical-service-check
parameters:
service_name: "payment-service"
namespace: "production"
stopOnFail: true
- type: prometheus-query
parameters:
query: "sum(rate(http_requests_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
threshold: 5
operator: "lt"
stopOnFail: true
# Environment-specific settings
environments:
- name: production
# Override global settings for production
blastRadius:
maxTargets: 2
maxPercentage: 5
scheduling:
requireApproval: true
safetyChecks:
- type: target-percentage-limit
parameters:
max_percentage: 5
stopOnFail: true
- name: staging
# Less restrictive settings for staging
blastRadius:
maxTargets: 10
maxPercentage: 20
scheduling:
requireApproval: false
- name: development
# Least restrictive settings for development
blastRadius:
maxTargets: 20
maxPercentage: 50
scheduling:
requireApproval: false
# Target-specific settings
targets:
- type: pod
selectors:
app: "frontend"
actions:
- type: pod-delete
parameters: {}
- type: pod-cpu-stress
parameters:
cpu: 80
- type: pod-memory-stress
parameters:
memory: 80
- type: pod-network-delay
parameters:
latency: "100ms"
- type: node
selectors:
role: "worker"
actions:
- type: node-cpu-stress
parameters:
cpu: 80
- type: node-memory-stress
parameters:
memory: 80
- type: node-network-delay
parameters:
latency: "100ms"
- type: deployment
selectors:
app: "backend"
actions:
- type: pod-delete
parameters:
gracePeriod: 0
- type: statefulset
selectors:
app: "database"
actions:
- type: pod-delete
parameters:
gracePeriod: 30
# Experiment templates
templates:
- name: pod-failure
description: "Simulates pod failures"
targetType: "pod"
duration: "5m"
actions:
- type: pod-delete
parameters: {}
- name: network-latency
description: "Simulates network latency"
targetType: "pod"
duration: "10m"
actions:
- type: pod-network-delay
parameters:
latency: "200ms"
jitter: "50ms"
- name: cpu-stress
description: "Simulates CPU stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-cpu-stress
parameters:
cpu: 80
- name: memory-stress
description: "Simulates memory stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-memory-stress
parameters:
memory: 80
- name: disk-stress
description: "Simulates disk stress"
targetType: "pod"
duration: "15m"
actions:
- type: pod-disk-stress
parameters:
workers: 4
size: "1GB"
- name: node-failure
description: "Simulates node failures"
targetType: "node"
duration: "5m"
actions:
- type: node-poweroff
parameters: {}
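The environments section of the policy above overrides the global blast-radius defaults per environment. A minimal sketch of how a policy loader might resolve the effective limits, with hypothetical Go types mirroring the YAML structure (a zero value is treated as "inherit the global default"):
// policy_resolve.go - resolving effective blast-radius limits from the
// layered policy (global defaults overridden per environment)
package main

import "fmt"

type BlastRadius struct {
    MaxTargets    int
    MaxPercentage int
}

type Policy struct {
    Global       BlastRadius
    Environments map[string]BlastRadius
}

// effectiveBlastRadius applies the environment override field-by-field,
// falling back to the global default where the override is unset (zero).
func (p Policy) effectiveBlastRadius(env string) BlastRadius {
    eff := p.Global
    if o, ok := p.Environments[env]; ok {
        if o.MaxTargets > 0 {
            eff.MaxTargets = o.MaxTargets
        }
        if o.MaxPercentage > 0 {
            eff.MaxPercentage = o.MaxPercentage
        }
    }
    return eff
}

func main() {
    p := Policy{
        Global:       BlastRadius{MaxTargets: 5, MaxPercentage: 10},
        Environments: map[string]BlastRadius{"production": {MaxTargets: 2, MaxPercentage: 5}},
    }
    fmt.Printf("production limits: %+v\n", p.effectiveBlastRadius("production"))
    fmt.Printf("staging limits: %+v\n", p.effectiveBlastRadius("staging")) // inherits global
}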
Lessons Learned:
Chaos engineering requires careful planning, strict blast radius controls, and comprehensive monitoring.
How to Avoid:
Implement strict blast radius controls for all chaos experiments.
Create a formal approval process for chaos experiments.
Test all experiments in non-production environments first.
Implement comprehensive monitoring for all experiments.
Schedule experiments during low-traffic periods.
No summary provided
What Happened:
A company implemented an SRE practice with error budgets based on Service Level Objectives (SLOs). One product team consistently pushed new features without adequate testing, resulting in multiple production incidents. Three weeks into the quarter, the team had depleted its entire error budget, triggering an automatic feature freeze per the established policy. This led to significant tension between the development team (who wanted to continue shipping features) and the SRE team (who were enforcing the error budget policy).
Diagnosis Steps:
Analyzed SLO violations and error budget consumption patterns.
Reviewed recent deployments and their impact on reliability.
Examined post-incident reviews for recurring themes.
Interviewed development and SRE team members.
Assessed the effectiveness of the error budget policy implementation.
Root Cause:
The investigation revealed multiple issues with the error budget implementation:
1. The development team didn't fully understand the implications of the error budget policy
2. Error budget consumption wasn't visible enough during the development process
3. Testing practices were inadequate for catching reliability issues before production
4. The feature freeze policy was implemented without clear escalation paths
5. There was no graduated response to error budget consumption (it was all-or-nothing)
Fix/Workaround:
• Implemented improved error budget visibility and alerting with Prometheus rules
• Created a Go-based error budget tracking service with real-time dashboards
• Established a graduated response to error budget consumption (50%, 75%, 90%, 100%), as sketched after this list
• Implemented a formal exception process for the feature freeze policy
• Provided training on SRE principles and error budgets for all teams
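As referenced above, a minimal sketch of the graduated-response mapping. The tier thresholds mirror the 50/75/90/100% policy; the action descriptions and function names are illustrative assumptions, not the team's actual tracking service:
// budget_policy.go - graduated responses to error budget consumption
package main

import "fmt"

type responseTier struct {
    threshold float64 // fraction of the error budget consumed
    action    string
}

// Ordered from most to least severe, so the strongest applicable tier wins.
var tiers = []responseTier{
    {1.00, "feature freeze: only reliability work ships (with a formal exception process)"},
    {0.90, "deploys require SRE sign-off"},
    {0.75, "deploy frequency reduced; reliability items prioritized"},
    {0.50, "warning posted to the team channel; recent changes reviewed"},
}

// responseFor maps consumption (0.0 = untouched, 1.0 = exhausted) to a tier.
func responseFor(consumed float64) (responseTier, bool) {
    for _, t := range tiers {
        if consumed >= t.threshold {
            return t, true
        }
    }
    return responseTier{}, false
}

func main() {
    for _, c := range []float64{0.40, 0.60, 0.95, 1.10} {
        if t, ok := responseFor(c); ok {
            fmt.Printf("%.0f%% consumed -> %s\n", c*100, t.action)
        } else {
            fmt.Printf("%.0f%% consumed -> no action\n", c*100)
        }
    }
}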
Lessons Learned:
Error budgets are only effective when teams understand them and have visibility into consumption.
How to Avoid:
Implement graduated responses to error budget consumption, not just all-or-nothing.
Ensure error budget visibility throughout the development process.
Provide clear documentation and training on error budget policies.
Establish escalation paths for exceptional circumstances.
Focus on the collaborative aspect of error budgets, not just enforcement.
No summary provided
What Happened:
An SRE team implemented SLOs for their microservices platform using Prometheus and Alertmanager. Within days of deployment, on-call engineers were overwhelmed with alerts, many of which were for minor, transient issues that self-resolved. The constant stream of alerts led to alert fatigue, with some critical alerts being missed or ignored. Team morale declined, and the SLO implementation was at risk of being abandoned.
Diagnosis Steps:
Analyzed alert frequency and patterns over time.
Reviewed SLO definitions and error budget calculations.
Examined alert thresholds and burn rate configurations.
Collected feedback from on-call engineers about alert relevance.
Compared alerting patterns with actual customer impact.
Root Cause:
The investigation revealed multiple issues with the SLO implementation:
1. Error budgets were too small, triggering alerts for minor fluctuations
2. Alert thresholds were set at 100% SLO achievement rather than using error budgets
3. No distinction between fast and slow burn rates for error budget consumption
4. Alerting windows were too short, causing alerts for transient issues
5. No alert severity differentiation based on customer impact
Fix/Workaround:
• Redesigned SLO implementation with more realistic error budgets
• Implemented multi-window alert thresholds based on burn rates (see the sketch after this list)
• Created tiered alerting based on impact severity
• Added alert suppression for known issues and maintenance windows
• Developed a feedback loop for continuous improvement of alerts
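As referenced above, multi-window burn-rate alerting pages only when both a long and a short window burn hot: the long window proves the problem is sustained, the short window proves it is still happening, which filters out exactly the transient, self-resolving blips that caused the fatigue. The sketch below shows the decision logic; the window pairs and thresholds are the commonly cited Google SRE Workbook defaults for a 30-day SLO (2% of budget in 1h → burn rate 14.4, 5% in 6h → 6, 10% in 3d → 1), not values taken from this incident:
// burn_rate.go - decision logic for multi-window burn-rate alerting
package main

import "fmt"

// A burn rate of 1.0 means error budget is being consumed exactly fast
// enough to exhaust it at the end of the SLO period.
type windowPair struct {
    long, short string  // window labels, for readability
    longBurn    float64 // measured burn rate over the long window
    shortBurn   float64 // measured burn rate over the short window
    threshold   float64 // burn-rate threshold for this pair
    severity    string  // "page" or "ticket"
}

// evaluate fires only when BOTH windows exceed the threshold, so sustained
// burns page quickly while alerts clear soon after recovery.
func evaluate(pairs []windowPair) []string {
    var alerts []string
    for _, p := range pairs {
        if p.longBurn > p.threshold && p.shortBurn > p.threshold {
            alerts = append(alerts,
                fmt.Sprintf("%s: burn rate %.1fx over %s/%s windows", p.severity, p.longBurn, p.long, p.short))
        }
    }
    return alerts
}

func main() {
    pairs := []windowPair{
        {"1h", "5m", 15.2, 14.8, 14.4, "page"}, // fast burn: 2% of budget in 1h
        {"6h", "30m", 3.1, 2.2, 6.0, "page"},   // medium burn: suppressed here
        {"3d", "6h", 1.4, 1.2, 1.0, "ticket"},  // slow burn: next business day
    }
    for _, alert := range evaluate(pairs) {
        fmt.Println(alert)
    }
}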
Lessons Learned:
Effective SLO implementation requires careful balance between reliability goals and operational burden.
How to Avoid:
Start with conservative error budgets and refine over time.
Implement multi-window alert thresholds to distinguish between fast and slow burns.
Create tiered alerting based on customer impact and business priority.
Include alert suppression mechanisms for maintenance and known issues.
Establish a regular review process for SLOs and alerting effectiveness.
No summary provided
What Happened:
A technology company decided to implement SRE practices, including error budgets, to balance reliability and innovation. They defined service level objectives (SLOs) and created error budget policies that would halt feature development when budgets were exhausted. However, when the first service exhausted its error budget, the development team pushed back against stopping feature work. Leadership overrode the policy, creating confusion about the purpose and enforcement of error budgets. The incident highlighted deeper organizational challenges in adopting SRE practices.
Diagnosis Steps:
Analyzed the error budget consumption patterns.
Reviewed the communication and decision-making process during the incident.
Interviewed stakeholders about their understanding of error budgets.
Examined the error budget policy documentation and communication.
Assessed the alignment between team incentives and reliability goals.
Root Cause:
The investigation revealed multiple organizational issues:
1. Misalignment between team incentives (feature delivery vs. reliability)
2. Lack of executive understanding and buy-in for error budget concepts
3. Insufficient communication about the purpose and benefits of error budgets
4. No clear escalation path or decision-making process for policy exceptions
5. Error budget policies created without input from development teams
Fix/Workaround:
• Facilitated workshops to build shared understanding of error budgets
• Created a joint task force with representatives from all stakeholders
• Developed a more nuanced error budget policy with graduated responses
• Aligned team incentives and performance metrics with reliability goals
• Established clear governance for error budget policy exceptions
Lessons Learned:
Successful SRE implementation requires organizational alignment and cultural change, not just technical solutions.
How to Avoid:
Involve all stakeholders in developing error budget policies.
Ensure executive understanding and support for SRE principles.
Align team incentives and performance metrics with reliability goals.
Create graduated responses to error budget depletion, not just binary halts.
Establish clear governance and decision-making processes for exceptions.
No summary provided
What Happened:
A financial services company had implemented SLOs for their critical payment processing service, with targets for availability, latency, and error rates. Despite having dashboards and alerts configured, the service experienced degraded performance over several days that violated the SLOs but did not trigger any alerts. The issue was only discovered during a routine weekly review when an engineer noticed that the service had been operating below its SLO thresholds for nearly a week, affecting thousands of customers.
Diagnosis Steps:
Analyzed historical performance data for the service.
Reviewed SLO definitions and alert configurations.
Examined the monitoring system's data collection and aggregation.
Checked alert routing and notification channels.
Investigated recent changes to the service and monitoring infrastructure.
Root Cause:
The investigation revealed multiple issues with the SLO implementation:
1. The SLO was defined using averaged metrics over long time windows, masking short-term degradations
2. Alert thresholds were set too leniently, allowing significant degradation before alerting
3. The error budget calculation did not account for all types of failures
4. Some critical user journeys were not covered by the SLOs
5. Recent changes to the service introduced new failure modes not captured by existing SLIs
Fix/Workaround:
• Implemented immediate improvements to SLO monitoring
• Redefined SLOs with appropriate time windows and aggregation methods
• Created multi-level alerting with early warning thresholds
• Expanded SLIs to cover all critical user journeys
• Implemented synthetic monitoring for end-to-end validation
# Improved SLO Configuration in Prometheus
# File: improved-slo-config.yaml
groups:
- name: payment-service-slos
rules:
# Availability SLI - Percentage of successful requests
- record: sli:availability:ratio_rate5m
expr: |
sum(rate(http_requests_total{service="payment-service", status=~"2.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))
# Latency SLI - Percentage of requests faster than threshold
- record: sli:latency:ratio_rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{service="payment-service", le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="payment-service"}[5m]))
# Error SLI - Percentage of requests without application errors
- record: sli:errors:ratio_rate5m
expr: |
1 -
sum(rate(application_errors_total{service="payment-service"}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))
# Combined SLI - All three objectives must be met
# (PromQL's min() aggregates a single vector rather than taking multiple
# arguments, so the three series are first unioned with distinct labels)
- record: sli:combined:ratio_rate5m
expr: |
min(
label_replace(sli:availability:ratio_rate5m, "slo_component", "availability", "", "")
or label_replace(sli:latency:ratio_rate5m, "slo_component", "latency", "", "")
or label_replace(sli:errors:ratio_rate5m, "slo_component", "errors", "", "")
)
# 1-hour SLO
- record: slo:payment_service:1h
expr: avg_over_time(sli:combined:ratio_rate5m[1h])
# 6-hour SLO
- record: slo:payment_service:6h
expr: avg_over_time(sli:combined:ratio_rate5m[6h])
# 24-hour SLO
- record: slo:payment_service:24h
expr: avg_over_time(sli:combined:ratio_rate5m[24h])
# 30-day SLO (for error budget calculation)
- record: slo:payment_service:30d
expr: avg_over_time(sli:combined:ratio_rate5m[30d])
# Error budget consumption rate
- record: error_budget:payment_service:consumption_rate
expr: |
(1 - slo:payment_service:24h) / (1 - 0.995) # 99.5% is our SLO target
# Multi-level alerting
- name: payment-service-slo-alerts
rules:
# Early warning - 2x normal error budget burn rate
- alert: PaymentServiceErrorBudgetBurning
expr: error_budget:payment_service:consumption_rate > 2
for: 10m
labels:
severity: warning
team: payments
annotations:
summary: "Payment service error budget burning 2x faster than expected"
description: "The payment service is consuming error budget at {{ $value }}x the expected rate."
# Critical alert - 10x normal error budget burn rate
- alert: PaymentServiceErrorBudgetCritical
expr: error_budget:payment_service:consumption_rate > 10
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Payment service error budget burning at critical rate"
description: "The payment service is consuming error budget at {{ $value }}x the expected rate. Immediate action required."
# SLO breach alert
- alert: PaymentServiceSLOBreach
expr: slo:payment_service:1h < 0.995
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Payment service SLO breach"
description: "The payment service has breached its 99.5% SLO. Current value: {{ $value | humanizePercentage }}."
// Synthetic Monitoring for Critical User Journeys
// File: synthetic_monitor.go
package main
import (
"context"
"fmt"
"log"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
journeyDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "synthetic_journey_duration_seconds",
Help: "Duration of synthetic user journeys",
Buckets: prometheus.ExponentialBuckets(0.1, 1.5, 10),
},
[]string{"journey", "status"},
)
journeyErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "synthetic_journey_errors_total",
Help: "Total errors encountered during synthetic user journeys",
},
[]string{"journey", "error_type"},
)
)
func init() {
prometheus.MustRegister(journeyDuration)
prometheus.MustRegister(journeyErrors)
}
// PaymentJourney simulates a complete payment flow
func PaymentJourney(ctx context.Context) error {
startTime := time.Now()
status := "success"
defer func() {
journeyDuration.WithLabelValues("payment_flow", status).Observe(time.Since(startTime).Seconds())
}()
// Step 1: Create payment
err := createPayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "create_payment").Inc()
status = "failure"
return fmt.Errorf("create payment failed: %w", err)
}
// Step 2: Authorize payment
err = authorizePayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "authorize_payment").Inc()
status = "failure"
return fmt.Errorf("authorize payment failed: %w", err)
}
// Step 3: Capture payment
err = capturePayment(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "capture_payment").Inc()
status = "failure"
return fmt.Errorf("capture payment failed: %w", err)
}
// Step 4: Verify receipt
err = verifyReceipt(ctx)
if err != nil {
journeyErrors.WithLabelValues("payment_flow", "verify_receipt").Inc()
status = "failure"
return fmt.Errorf("verify receipt failed: %w", err)
}
return nil
}
// Implementation of individual steps
func createPayment(ctx context.Context) error {
// Actual implementation would make API calls to the payment service
// This is a simplified example
req, err := http.NewRequestWithContext(ctx, "POST", "https://payment-service/api/payments", nil)
if err != nil {
return err
}
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusCreated {
return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
}
return nil
}
func authorizePayment(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func capturePayment(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func verifyReceipt(ctx context.Context) error {
// Implementation omitted for brevity
return nil
}
func runSyntheticMonitoring() {
ticker := time.NewTicker(1 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ticker.C:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
err := PaymentJourney(ctx)
if err != nil {
log.Printf("Payment journey failed: %v", err)
}
cancel()
}
}
}
func main() {
// Start synthetic monitoring in a goroutine
go runSyntheticMonitoring()
// Expose Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
log.Fatal(http.ListenAndServe(":8080", nil))
}
Lessons Learned:
Effective SLO implementation requires careful consideration of measurement methods, alert thresholds, and comprehensive coverage of user journeys.
How to Avoid:
Define SLOs with appropriate time windows and aggregation methods.
Implement multi-level alerting with early warning thresholds.
Ensure SLIs cover all critical user journeys and failure modes.
Validate SLO implementations with synthetic monitoring.
Regularly review and refine SLOs based on service evolution.
No summary provided
What Happened:
A large e-commerce platform experienced a complete outage of their checkout and payment services during a high-traffic sales event. The incident began when a database service started experiencing performance degradation due to an unexpected query pattern. Instead of isolating the failure, the misconfigured circuit breakers in the service mesh allowed the degradation to cascade through dependent services. Within minutes, the entire checkout flow was unavailable, resulting in significant revenue loss and customer frustration.
Diagnosis Steps:
Analyzed service mesh telemetry to identify the failure patterns.
Examined circuit breaker configurations across services.
Reviewed traffic patterns and service dependencies.
Checked database performance metrics and query patterns.
Tested circuit breaker behavior in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the circuit breaker implementation:
1. Circuit breakers were configured with thresholds that were too high
2. Timeout settings did not account for the actual service performance characteristics
3. Retry policies were too aggressive, causing additional load on degraded services
4. Fallback mechanisms were not properly implemented for critical services
5. Health check criteria did not accurately reflect service health
Fix/Workaround:
• Implemented immediate improvements to circuit breaker configuration (a sketch follows this list)
• Adjusted timeout and retry settings based on service performance profiles
• Implemented proper fallback mechanisms for critical services
• Enhanced health check criteria to better reflect service health
• Established a comprehensive testing framework for resilience patterns
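As referenced above, the key change was deriving breaker settings from measured service behavior instead of defaults. A minimal sketch using the widely used sony/gobreaker library; the thresholds are placeholders standing in for values taken from empirical performance data, and the breaker name and simulated call are illustrative:
// breaker.go - a data-driven circuit breaker sketch
package main

import (
    "errors"
    "fmt"
    "time"

    "github.com/sony/gobreaker"
)

func newPaymentBreaker() *gobreaker.CircuitBreaker {
    return gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name:        "payment-db",
        MaxRequests: 5,                // probes allowed while half-open
        Interval:    30 * time.Second, // counters reset while closed
        Timeout:     15 * time.Second, // open -> half-open after this
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Trip at a 50% failure ratio, but only after enough samples
            // to avoid tripping on a single slow request.
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 10 && failureRatio >= 0.5
        },
        OnStateChange: func(name string, from, to gobreaker.State) {
            fmt.Printf("breaker %s: %s -> %s\n", name, from, to)
        },
    })
}

func main() {
    cb := newPaymentBreaker()
    // Wrap the dependency call; once the breaker opens, Execute fails fast
    // instead of piling retries onto an already degraded service.
    _, err := cb.Execute(func() (interface{}, error) {
        return nil, errors.New("simulated database timeout")
    })
    if err != nil {
        fmt.Println("call failed:", err)
    }
}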
Lessons Learned:
Circuit breakers and other resilience patterns require careful configuration and testing to effectively prevent cascading failures.
How to Avoid:
Configure circuit breakers based on empirical service performance data.
Implement comprehensive fallback mechanisms for critical services.
Test resilience patterns under realistic failure scenarios.
Establish clear ownership for resilience configuration.
Regularly review and update circuit breaker settings as services evolve.