# Advanced DevOps Tools and Automation Scenarios
No summary provided
What Happened:
A CI/CD pipeline that automatically deployed infrastructure using Terraform began experiencing intermittent failures. Deployments sometimes succeeded, but at other times failed with errors about resources already existing or state inconsistencies. The failures became more frequent over time, eventually causing all deployments to fail.
Diagnosis Steps:
Examined Jenkins build logs for failed and successful deployments.
Compared Terraform state files between successful and failed runs (see the state-version sketch after these steps).
Analyzed the S3 bucket used for remote state storage.
Reviewed recent changes to the CI/CD pipeline configuration.
Tested Terraform operations with different concurrency settings.
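The state-file comparison mentioned above can be scripted. Below is a minimal sketch of that kind of check, assuming the state bucket has S3 versioning enabled and using hypothetical bucket and key names; the serial and lineage fields are part of Terraform's state format, and overlapping writers tend to show up as repeated or out-of-order serials, or as a lineage change when one run's state overwrites another's.
# inspect_state_versions.py - illustrative sketch, not the exact script used
import json
import boto3

BUCKET = "company-terraform-states"                  # hypothetical bucket name
KEY = "infrastructure/production/terraform.tfstate"  # hypothetical state key

s3 = boto3.client("s3", region_name="us-west-2")

# List historical versions of the state object (requires bucket versioning)
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY).get("Versions", [])

for v in sorted(versions, key=lambda v: v["LastModified"]):
    obj = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=v["VersionId"])
    state = json.loads(obj["Body"].read())
    # Concurrent writers typically show up as repeated or out-of-order serials,
    # or as a lineage change (a brand-new state overwriting the old one)
    print(v["LastModified"], "serial:", state.get("serial"), "lineage:", state.get("lineage"))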
Root Cause:
Multiple CI/CD jobs were running simultaneously and accessing the same Terraform state without proper locking. This occurred because:
1. The S3 backend was configured without DynamoDB state locking.
2. The CI system was configured to allow parallel execution of the same job.
3. Recent changes to the pipeline reduced job execution time, increasing the likelihood of overlapping jobs.
Fix/Workaround:
• Short-term: Disabled parallel execution of infrastructure jobs:
// Jenkinsfile - Added resource lock
pipeline {
    agent any

    options {
        // Prevent concurrent builds of the same branch
        disableConcurrentBuilds()
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Terraform Init') {
            steps {
                sh 'terraform init'
            }
        }
        stage('Terraform Plan') {
            steps {
                sh 'terraform plan -out=tfplan'
            }
        }
        stage('Terraform Apply') {
            steps {
                // Use a global lock to prevent concurrent terraform operations
                lock('terraform-state') {
                    sh 'terraform apply -auto-approve tfplan'
                }
            }
        }
    }
}
• Long-term: Implemented proper state locking with DynamoDB:
# backend.tf - Proper state configuration with locking
terraform {
  backend "s3" {
    bucket         = "company-terraform-states"
    key            = "infrastructure/production/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"

    # Encrypt state with a customer-managed KMS key
    kms_key_id = "arn:aws:kms:us-west-2:123456789012:key/abcd1234-ab12-cd34-ef56-abcdef123456"

    # Keep backend validation checks enabled
    skip_region_validation      = false
    skip_credentials_validation = false
    skip_metadata_api_check     = false

    # Use virtual-hosted-style S3 addressing (newer Terraform releases replace this with use_path_style)
    force_path_style = false
  }
}
# dynamodb_table.tf - Create the lock table
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "All"
    Purpose     = "Infrastructure"
    Managed_by  = "Terraform"
  }
}
• Implemented a state management service in Go:
// terraform_state_manager.go
package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"os/signal"
"sync"
"syscall"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/dynamodb"
"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
"github.com/aws/aws-sdk-go-v2/service/s3"
// S3 server-side encryption constants live in the S3 types package
s3types "github.com/aws/aws-sdk-go-v2/service/s3/types"
"github.com/gorilla/mux"
)
type StateManager struct {
s3Client *s3.Client
dynamoDBClient *dynamodb.Client
stateBucket string
lockTable string
locks map[string]string
locksMutex sync.RWMutex
}
type LockRequest struct {
StatePath string `json:"state_path"`
LockID string `json:"lock_id"`
Operation string `json:"operation"`
Info string `json:"info"`
}
type UnlockRequest struct {
StatePath string `json:"state_path"`
LockID string `json:"lock_id"`
}
type StateResponse struct {
Version int `json:"version"`
StateID string `json:"state_id"`
StateData interface{} `json:"state_data"`
LastUpdate time.Time `json:"last_update"`
}
func NewStateManager(stateBucket, lockTable string) (*StateManager, error) {
// Load AWS configuration
cfg, err := config.LoadDefaultConfig(context.Background())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
// Create S3 and DynamoDB clients
s3Client := s3.NewFromConfig(cfg)
dynamoDBClient := dynamodb.NewFromConfig(cfg)
return &StateManager{
s3Client: s3Client,
dynamoDBClient: dynamoDBClient,
stateBucket: stateBucket,
lockTable: lockTable,
locks: make(map[string]string),
}, nil
}
func (sm *StateManager) LockState(w http.ResponseWriter, r *http.Request) {
var req LockRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, fmt.Sprintf("Invalid request: %v", err), http.StatusBadRequest)
return
}
// Check if state is already locked
sm.locksMutex.RLock()
existingLockID, locked := sm.locks[req.StatePath]
sm.locksMutex.RUnlock()
if locked && existingLockID != req.LockID {
http.Error(w, fmt.Sprintf("State %s is already locked by %s", req.StatePath, existingLockID), http.StatusConflict)
return
}
// Acquire lock in DynamoDB
ctx := context.Background()
_, err := sm.dynamoDBClient.PutItem(ctx, &dynamodb.PutItemInput{
TableName: aws.String(sm.lockTable),
Item: map[string]types.AttributeValue{
"LockID": &types.AttributeValueMemberS{
Value: fmt.Sprintf("%s/%s", sm.stateBucket, req.StatePath),
},
"Info": &types.AttributeValueMemberS{
Value: req.Info,
},
"Operation": &types.AttributeValueMemberS{
Value: req.Operation,
},
"LockTime": &types.AttributeValueMemberS{
Value: time.Now().UTC().Format(time.RFC3339),
},
},
ConditionExpression: aws.String("attribute_not_exists(LockID)"),
})
if err != nil {
http.Error(w, fmt.Sprintf("Failed to acquire lock: %v", err), http.StatusInternalServerError)
return
}
// Store lock in memory
sm.locksMutex.Lock()
sm.locks[req.StatePath] = req.LockID
sm.locksMutex.Unlock()
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]string{
"status": "locked",
"lock_id": req.LockID,
})
}
func (sm *StateManager) UnlockState(w http.ResponseWriter, r *http.Request) {
var req UnlockRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, fmt.Sprintf("Invalid request: %v", err), http.StatusBadRequest)
return
}
// Check if state is locked by the requester
sm.locksMutex.RLock()
existingLockID, locked := sm.locks[req.StatePath]
sm.locksMutex.RUnlock()
if !locked {
http.Error(w, fmt.Sprintf("State %s is not locked", req.StatePath), http.StatusBadRequest)
return
}
if existingLockID != req.LockID {
http.Error(w, fmt.Sprintf("State %s is locked by %s, not %s", req.StatePath, existingLockID, req.LockID), http.StatusForbidden)
return
}
// Release lock in DynamoDB
ctx := context.Background()
_, err := sm.dynamoDBClient.DeleteItem(ctx, &dynamodb.DeleteItemInput{
TableName: aws.String(sm.lockTable),
Key: map[string]types.AttributeValue{
"LockID": &types.AttributeValueMemberS{
Value: fmt.Sprintf("%s/%s", sm.stateBucket, req.StatePath),
},
},
ConditionExpression: aws.String("attribute_exists(LockID)"),
})
if err != nil {
http.Error(w, fmt.Sprintf("Failed to release lock: %v", err), http.StatusInternalServerError)
return
}
// Remove lock from memory
sm.locksMutex.Lock()
delete(sm.locks, req.StatePath)
sm.locksMutex.Unlock()
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]string{
"status": "unlocked",
})
}
func (sm *StateManager) GetState(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
statePath := vars["path"]
// Get state from S3
ctx := context.Background()
result, err := sm.s3Client.GetObject(ctx, &s3.GetObjectInput{
Bucket: aws.String(sm.stateBucket),
Key: aws.String(statePath),
})
if err != nil {
http.Error(w, fmt.Sprintf("Failed to get state: %v", err), http.StatusInternalServerError)
return
}
defer result.Body.Close()
// Parse state
var stateData interface{}
if err := json.NewDecoder(result.Body).Decode(&stateData); err != nil {
http.Error(w, fmt.Sprintf("Failed to parse state: %v", err), http.StatusInternalServerError)
return
}
// Return state
response := StateResponse{
Version: 1,
StateID: statePath,
StateData: stateData,
LastUpdate: *result.LastModified,
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
func (sm *StateManager) PutState(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
statePath := vars["path"]
// Check if state is locked by the requester
lockID := r.Header.Get("X-Terraform-Lock")
if lockID == "" {
http.Error(w, "Lock ID not provided", http.StatusBadRequest)
return
}
sm.locksMutex.RLock()
existingLockID, locked := sm.locks[statePath]
sm.locksMutex.RUnlock()
if !locked {
http.Error(w, fmt.Sprintf("State %s is not locked", statePath), http.StatusBadRequest)
return
}
if existingLockID != lockID {
http.Error(w, fmt.Sprintf("State %s is locked by %s, not %s", statePath, existingLockID, lockID), http.StatusForbidden)
return
}
// Read state data
var stateData interface{}
if err := json.NewDecoder(r.Body).Decode(&stateData); err != nil {
http.Error(w, fmt.Sprintf("Invalid state data: %v", err), http.StatusBadRequest)
return
}
// Convert state data back to JSON
stateJSON, err := json.Marshal(stateData)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to encode state: %v", err), http.StatusInternalServerError)
return
}
// Put state in S3
ctx := context.Background()
_, err = sm.s3Client.PutObject(ctx, &s3.PutObjectInput{
Bucket: aws.String(sm.stateBucket),
Key: aws.String(statePath),
Body: bytes.NewReader(stateJSON),
// The encryption enum comes from the S3 types package, not the DynamoDB one
ServerSideEncryption: s3types.ServerSideEncryptionAes256,
})
if err != nil {
http.Error(w, fmt.Sprintf("Failed to save state: %v", err), http.StatusInternalServerError)
return
}
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]string{
"status": "saved",
})
}
func main() {
// Get configuration from environment
stateBucket := os.Getenv("STATE_BUCKET")
if stateBucket == "" {
stateBucket = "terraform-states"
}
lockTable := os.Getenv("LOCK_TABLE")
if lockTable == "" {
lockTable = "terraform-locks"
}
port := os.Getenv("PORT")
if port == "" {
port = "8080"
}
// Create state manager
stateManager, err := NewStateManager(stateBucket, lockTable)
if err != nil {
log.Fatalf("Failed to create state manager: %v", err)
}
// Create router
router := mux.NewRouter()
router.HandleFunc("/lock", stateManager.LockState).Methods("POST")
router.HandleFunc("/unlock", stateManager.UnlockState).Methods("POST")
router.HandleFunc("/state/{path:.+}", stateManager.GetState).Methods("GET")
router.HandleFunc("/state/{path:.+}", stateManager.PutState).Methods("PUT")
// Create HTTP server
server := &http.Server{
Addr: ":" + port,
Handler: router,
ReadTimeout: 30 * time.Second,
WriteTimeout: 30 * time.Second,
IdleTimeout: 120 * time.Second,
}
// Start server
go func() {
log.Printf("Starting server on port %s", port)
if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatalf("Failed to start server: %v", err)
}
}()
// Wait for interrupt signal
stop := make(chan os.Signal, 1)
signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
<-stop
// Shutdown server
log.Println("Shutting down server...")
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
log.Fatalf("Failed to shutdown server: %v", err)
}
}
• Created a CI/CD pipeline monitoring tool:
#!/usr/bin/env python3
# terraform_ci_monitor.py
import argparse
import boto3
import json
import logging
import os
import re
import subprocess
import sys
import time
from datetime import datetime, timedelta
from tabulate import tabulate
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('terraform_ci_monitor')
class TerraformCIMonitor:
def __init__(self, state_bucket, lock_table, region):
self.state_bucket = state_bucket
self.lock_table = lock_table
self.region = region
# Initialize AWS clients
self.s3 = boto3.client('s3', region_name=region)
self.dynamodb = boto3.client('dynamodb', region_name=region)
def check_state_integrity(self, state_path):
"""Check the integrity of a Terraform state file"""
try:
# Get the state file from S3
response = self.s3.get_object(
Bucket=self.state_bucket,
Key=state_path
)
state_data = json.loads(response['Body'].read().decode('utf-8'))
# Check state version
if 'version' not in state_data:
return False, "State file is missing version field"
# Check for required fields
required_fields = ['terraform_version', 'lineage', 'resources', 'outputs']
for field in required_fields:
if field not in state_data:
return False, f"State file is missing required field: {field}"
# Check for resource consistency
resources = state_data.get('resources', [])
for resource in resources:
if 'mode' not in resource or 'type' not in resource or 'name' not in resource:
return False, "State file contains malformed resource entries"
return True, "State file integrity check passed"
except Exception as e:
return False, f"Failed to check state integrity: {str(e)}"
def list_active_locks(self):
"""List all active Terraform locks"""
try:
response = self.dynamodb.scan(
TableName=self.lock_table
)
locks = []
for item in response.get('Items', []):
lock_id = item.get('LockID', {}).get('S', '')
info = item.get('Info', {}).get('S', '')
operation = item.get('Operation', {}).get('S', '')
lock_time_str = item.get('LockTime', {}).get('S', '')
# Parse lock time
try:
lock_time = datetime.strptime(lock_time_str, '%Y-%m-%dT%H:%M:%SZ')
duration = datetime.utcnow() - lock_time
duration_str = str(duration).split('.')[0] # Remove microseconds
except (ValueError, TypeError):
duration_str = "Unknown"
# Extract state path from lock ID
state_path = lock_id.replace(f"{self.state_bucket}/", "")
locks.append({
'state_path': state_path,
'operation': operation,
'info': info,
'lock_time': lock_time_str,
'duration': duration_str
})
return locks
except Exception as e:
logger.error(f"Failed to list active locks: {str(e)}")
return []
def detect_stale_locks(self, max_age_hours=2):
"""Detect stale Terraform locks"""
locks = self.list_active_locks()
stale_locks = []
for lock in locks:
try:
lock_time = datetime.strptime(lock['lock_time'], '%Y-%m-%dT%H:%M:%SZ')
age = datetime.utcnow() - lock_time
if age > timedelta(hours=max_age_hours):
stale_locks.append(lock)
except (ValueError, TypeError):
# If we can't parse the time, consider it stale
stale_locks.append(lock)
return stale_locks
def force_unlock(self, state_path):
"""Force unlock a Terraform state"""
try:
# Create a temporary directory
temp_dir = f"/tmp/terraform-unlock-{int(time.time())}"
os.makedirs(temp_dir, exist_ok=True)
os.chdir(temp_dir)
# Create backend configuration
with open('backend.tf', 'w') as f:
f.write(f'''
terraform {{
backend "s3" {{
bucket = "{self.state_bucket}"
key = "{state_path}"
region = "{self.region}"
dynamodb_table = "{self.lock_table}"
}}
}}
''')
# Initialize Terraform
subprocess.run(['terraform', 'init'], check=True)
# Get the lock ID
locks = self.list_active_locks()
lock_id = None
for lock in locks:
if lock['state_path'] == state_path:
lock_id = f"{self.state_bucket}/{state_path}"
break
if not lock_id:
return False, "No lock found for the specified state path"
# Force unlock
result = subprocess.run(
['terraform', 'force-unlock', '-force', lock_id],
capture_output=True,
text=True
)
if result.returncode != 0:
return False, f"Failed to force unlock: {result.stderr}"
return True, "Successfully forced unlock"
except Exception as e:
return False, f"Failed to force unlock: {str(e)}"
finally:
# Clean up
if os.path.exists(temp_dir):
subprocess.run(['rm', '-rf', temp_dir])
def monitor_state_changes(self, hours=24):
"""Monitor state changes over the specified period"""
try:
# Get all state files
response = self.s3.list_objects_v2(
Bucket=self.state_bucket
)
state_files = []
for obj in response.get('Contents', []):
key = obj['Key']
last_modified = obj['LastModified']
size = obj['Size']
# Skip non-state files
if not key.endswith('.tfstate'):
continue
# Check if modified within the specified period
age = datetime.utcnow().replace(tzinfo=last_modified.tzinfo) - last_modified
if age <= timedelta(hours=hours):
state_files.append({
'path': key,
'last_modified': last_modified,
'size': size,
'age': str(age).split('.')[0] # Remove microseconds
})
return state_files
except Exception as e:
logger.error(f"Failed to monitor state changes: {str(e)}")
return []
def analyze_ci_logs(self, log_file):
"""Analyze CI logs for Terraform issues"""
try:
with open(log_file, 'r') as f:
logs = f.read()
issues = []
# Look for state locking issues
lock_errors = re.findall(r'Error acquiring the state lock', logs)
if lock_errors:
issues.append({
'type': 'State Lock',
'count': len(lock_errors),
'description': 'Failed to acquire state lock'
})
# Look for state corruption issues
corruption_errors = re.findall(r'Error loading state', logs)
if corruption_errors:
issues.append({
'type': 'State Corruption',
'count': len(corruption_errors),
'description': 'Error loading state'
})
# Look for backend configuration issues
backend_errors = re.findall(r'Error configuring the backend', logs)
if backend_errors:
issues.append({
'type': 'Backend Configuration',
'count': len(backend_errors),
'description': 'Error configuring the backend'
})
return issues
except Exception as e:
logger.error(f"Failed to analyze CI logs: {str(e)}")
return []
def main():
parser = argparse.ArgumentParser(description='Terraform CI Monitor')
parser.add_argument('--state-bucket', required=True, help='S3 bucket for Terraform state')
parser.add_argument('--lock-table', required=True, help='DynamoDB table for Terraform locks')
parser.add_argument('--region', default='us-west-2', help='AWS region')
parser.add_argument('--action', choices=['check-integrity', 'list-locks', 'detect-stale-locks', 'force-unlock', 'monitor-changes', 'analyze-logs'], required=True, help='Action to perform')
parser.add_argument('--state-path', help='Path to Terraform state file')
parser.add_argument('--max-age', type=int, default=2, help='Maximum age in hours for stale locks')
parser.add_argument('--hours', type=int, default=24, help='Hours to look back for state changes')
parser.add_argument('--log-file', help='CI log file to analyze')
args = parser.parse_args()
monitor = TerraformCIMonitor(args.state_bucket, args.lock_table, args.region)
if args.action == 'check-integrity':
if not args.state_path:
logger.error("--state-path is required for check-integrity action")
sys.exit(1)
success, message = monitor.check_state_integrity(args.state_path)
if success:
logger.info(message)
else:
logger.error(message)
sys.exit(1)
elif args.action == 'list-locks':
locks = monitor.list_active_locks()
if locks:
print(tabulate(locks, headers='keys', tablefmt='grid'))
else:
logger.info("No active locks found")
elif args.action == 'detect-stale-locks':
stale_locks = monitor.detect_stale_locks(args.max_age)
if stale_locks:
print(tabulate(stale_locks, headers='keys', tablefmt='grid'))
else:
logger.info("No stale locks found")
elif args.action == 'force-unlock':
if not args.state_path:
logger.error("--state-path is required for force-unlock action")
sys.exit(1)
success, message = monitor.force_unlock(args.state_path)
if success:
logger.info(message)
else:
logger.error(message)
sys.exit(1)
elif args.action == 'monitor-changes':
state_files = monitor.monitor_state_changes(args.hours)
if state_files:
print(tabulate(state_files, headers='keys', tablefmt='grid'))
else:
logger.info(f"No state changes in the last {args.hours} hours")
elif args.action == 'analyze-logs':
if not args.log_file:
logger.error("--log-file is required for analyze-logs action")
sys.exit(1)
issues = monitor.analyze_ci_logs(args.log_file)
if issues:
print(tabulate(issues, headers='keys', tablefmt='grid'))
else:
logger.info("No Terraform issues found in the logs")
if __name__ == '__main__':
main()
Lessons Learned:
Terraform state management requires careful coordination in CI/CD environments.
How to Avoid:
Always configure state locking with DynamoDB when using the S3 backend.
Prevent concurrent execution of Terraform jobs in CI/CD pipelines.
Implement proper error handling and retry logic for state operations (see the retry sketch after this list).
Monitor state access patterns and lock durations.
Consider using a centralized state management service for large teams.
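For the retry-logic point above, Terraform's built-in -lock and -lock-timeout flags cover most lock contention; the sketch below wraps them with a simple retry loop. This is an illustrative helper, not part of the fixes described earlier, and the command arguments and timings are assumptions.
# terraform_retry.py - illustrative wrapper, assuming the terraform CLI is on PATH
import subprocess
import sys
import time

def run_terraform(args, attempts=3, lock_timeout="5m", delay=30):
    """Run a terraform command, waiting on and retrying state-lock contention."""
    cmd = ["terraform"] + args + [f"-lock-timeout={lock_timeout}"]
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(result.stdout)
            return 0
        # Only retry on lock contention; fail fast on anything else
        if "Error acquiring the state lock" in result.stderr and attempt < attempts:
            print(f"State locked, retrying in {delay}s (attempt {attempt}/{attempts})")
            time.sleep(delay)
            continue
        print(result.stderr, file=sys.stderr)
        return result.returncode
    return 1

if __name__ == "__main__":
    sys.exit(run_terraform(["plan", "-out=tfplan"]))
In a Jenkins pipeline, the same effect can be achieved more simply by appending -lock-timeout=5m to the sh 'terraform plan ...' and 'terraform apply ...' steps shown earlier.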
No summary provided
What Happened:
After upgrading Jenkins from version 2.303 to 2.346, the team noticed that deployment pipelines managed by Spinnaker began failing intermittently. The failures occurred specifically during the deployment stage when Spinnaker attempted to interact with Jenkins to retrieve build artifacts. The issue affected multiple teams and projects, causing deployment delays and requiring manual interventions. The problem was particularly challenging because it only occurred for certain pipeline configurations and not others.
Diagnosis Steps:
Analyzed Jenkins and Spinnaker logs for error patterns.
Compared working and failing pipeline configurations.
Reviewed API changes between Jenkins versions.
Tested API interactions directly between components.
Examined network traffic between Jenkins and Spinnaker.
Root Cause:
The investigation revealed multiple issues with the tool integration:
1. The Jenkins API had breaking changes in the new version that affected how Spinnaker retrieved artifacts.
2. The Spinnaker Jenkins integration plugin was not compatible with the new Jenkins version.
3. Authentication between the systems was failing due to security changes in Jenkins.
4. Custom scripts that bridged the tools were using deprecated endpoints.
5. The API rate-limiting configuration in Jenkins was affecting Spinnaker's parallel operations.
Fix/Workaround:
• Updated the Spinnaker Jenkins integration plugin to a compatible version
• Modified custom integration scripts to use the new API endpoints
• Implemented proper authentication handling between systems (see the API check sketch below)
• Adjusted rate limiting configuration to accommodate Spinnaker's parallel operations
• Created a comprehensive testing strategy for future upgrades
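A lightweight way to reproduce the authentication and endpoint problems outside Spinnaker is to call the Jenkins REST API directly with the service account's API token. The sketch below is illustrative only: the Jenkins URL, job name, and credentials are hypothetical values supplied via environment variables, while /crumbIssuer/api/json and /job/<name>/lastSuccessfulBuild/api/json are standard Jenkins endpoints.
# jenkins_api_check.py - illustrative check, not the exact script used
import os
import requests

JENKINS_URL = os.environ.get("JENKINS_URL", "https://jenkins.example.com")  # hypothetical
JOB_NAME = os.environ.get("JOB_NAME", "build-artifacts")                    # hypothetical
AUTH = (os.environ["JENKINS_USER"], os.environ["JENKINS_API_TOKEN"])

session = requests.Session()
session.auth = AUTH

# 1. Verify authentication and CSRF crumb issuance (needed for POST requests)
crumb = session.get(f"{JENKINS_URL}/crumbIssuer/api/json", timeout=10)
print("crumbIssuer:", crumb.status_code)

# 2. Verify the endpoint Spinnaker relies on to list a build's artifacts
build = session.get(
    f"{JENKINS_URL}/job/{JOB_NAME}/lastSuccessfulBuild/api/json"
    "?tree=number,artifacts[fileName,relativePath]",
    timeout=10,
)
build.raise_for_status()
info = build.json()
print("build:", info.get("number"))
for artifact in info.get("artifacts", []):
    print("artifact:", artifact.get("relativePath"))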
Lessons Learned:
Complex DevOps toolchains require careful version management and integration testing.
How to Avoid:
Maintain a test environment that mirrors the production toolchain.
Test tool upgrades thoroughly before applying to production.
Document all integration points between DevOps tools.
Subscribe to release notes and breaking changes for all components.
Implement canary deployments for toolchain upgrades.
No summary provided
What Happened:
A technology company implemented a CI/CD pipeline using Jenkins for build automation, GitLab CI for code quality checks, and Kubernetes for deployment. The pipeline frequently failed at various stages, causing delays in release cycles and frustration among developers. The failures were inconsistent, with some builds passing and others failing without clear reasons. The operations team struggled to identify the root cause due to the complexity of the integration.
Diagnosis Steps:
Analyzed logs from Jenkins, GitLab CI, and Kubernetes for error patterns.
Reviewed pipeline configurations and integration points.
Examined network connectivity and authentication between tools.
Tested individual pipeline stages in isolation.
Collected feedback from developers about common failure scenarios.
Root Cause:
The investigation revealed multiple issues with the pipeline integration:
1. Inconsistent environment variables and secrets management across tools.
2. Network connectivity issues between Jenkins and GitLab CI runners.
3. Misconfigured Kubernetes deployment scripts causing intermittent failures.
4. Lack of proper error handling and retry mechanisms in pipeline stages.
5. No centralized monitoring or alerting for pipeline failures.
Fix/Workaround:
• Implemented immediate fixes to stabilize the pipeline
• Standardized environment variables and secrets management
• Resolved network connectivity issues with proper routing and DNS
• Corrected Kubernetes deployment scripts and added error handling (see the rollout sketch below)
• Implemented centralized monitoring and alerting for pipeline stages
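For the deployment-script fix, the core change was adding explicit error handling and a rollout check so that failures surface in the pipeline stage instead of passing silently. Below is a minimal sketch of the pattern, assuming kubectl is already configured in the CI environment; the deployment name, namespace, and manifest path are hypothetical.
# deploy_with_checks.py - illustrative deployment wrapper
import subprocess
import sys
import time

DEPLOYMENT = "web-app"    # hypothetical deployment name
NAMESPACE = "production"  # hypothetical namespace

def run(cmd, retries=2, delay=15):
    """Run a command, retrying transient failures and surfacing the final error."""
    for attempt in range(1, retries + 2):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        print(f"Command failed (attempt {attempt}): {result.stderr.strip()}", file=sys.stderr)
        if attempt <= retries:
            time.sleep(delay)
    sys.exit(result.returncode)

# Apply the manifests, then block until the rollout completes or times out,
# so the pipeline stage fails loudly instead of reporting success prematurely
run(["kubectl", "apply", "-n", NAMESPACE, "-f", "k8s/"])
run(["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}",
     "-n", NAMESPACE, "--timeout=180s"])
print("Deployment rolled out successfully")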
Lessons Learned:
CI/CD pipeline integration requires careful coordination and consistent configuration across tools.
How to Avoid:
Standardize environment variables and secrets management across all tools.
Implement centralized monitoring and alerting for pipeline stages.
Test pipeline stages in isolation before full integration.
Ensure proper network connectivity and authentication between tools.
Create detailed documentation and runbooks for pipeline management.