# Infrastructure as Code Scenarios
No summary provided
What Happened:
A critical deployment failed because the actual infrastructure didn't match what Terraform expected. Manual changes had been made directly in the AWS console, causing state drift.
Diagnosis Steps:
Ran terraform plan and found numerous differences between state and reality.
Used terraform state pull to examine the current state file.
Compared resources in AWS console with Terraform state.
Reviewed CloudTrail logs to identify who made manual changes.
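The commands behind these diagnosis steps look roughly as follows (the output file name is illustrative):
# Show differences between the state file and the real infrastructure.
terraform plan

# Pull the remote state locally for inspection.
terraform state pull > current-state.json

# List every resource Terraform believes it manages.
terraform state list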
Root Cause:
Team members had made emergency fixes directly in the AWS console during an incident but didn't update the Terraform code afterward. The drift accumulated over time.
Fix/Workaround:
• Used terraform import to bring manually created resources into state (see the sketch after this list).
• Updated Terraform code to match current infrastructure.
• For resources that couldn't be imported, documented the manual process to recreate them.
• Implemented drift detection in the CI/CD pipeline.
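A minimal sketch of the import step, assuming the drifted resource is a manually created security group; the resource address and ID below are placeholders (on Terraform 1.5+ an import block in configuration achieves the same result):
# Write the resource block in HCL first, then attach the existing AWS
# object to that address in state (address and ID are placeholders).
terraform import aws_security_group.emergency_fix sg-0123456789abcdef0

# Re-run the plan; any remaining diff means the HCL still does not
# match the imported resource's real settings.
terraform plan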
Lessons Learned:
Infrastructure as code is only effective when it's the single source of truth.
How to Avoid:
Implement GitOps workflow where all changes go through version control.
Use AWS Config or similar tools to detect and alert on manual changes.
Add drift detection to CI/CD pipelines (see the sketch after this list).
Create break-glass procedures for emergencies that include updating IaC afterward.
Train team on the importance of maintaining infrastructure as code integrity.
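As a sketch of the drift detection item above: a scheduled pipeline job can rely on Terraform's documented -detailed-exitcode behaviour (0 = no changes, 1 = error, 2 = changes/drift) to fail or alert; the alerting step is left as a placeholder:
#!/usr/bin/env bash
# Scheduled drift check; avoid `set -e` because exit code 2 is expected.
set -u

terraform init -input=false
terraform plan -detailed-exitcode -input=false
status=$?

if [ "$status" -eq 2 ]; then
  echo "Drift detected between Terraform state and real infrastructure" >&2
  # Replace with your team's alerting (Slack webhook, ticket, etc.).
  exit 1
elif [ "$status" -eq 1 ]; then
  echo "terraform plan failed" >&2
  exit 1
fi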
No summary provided
What Happened:
A security researcher reported finding AWS access keys and database passwords in a public GitHub repository. The credentials were hardcoded in Terraform files and had been there for months.
Diagnosis Steps:
Searched repository for exposed secrets using tools like TruffleHog.
Reviewed git history to determine when secrets were committed and by whom.
Checked for any signs of unauthorized access using the exposed credentials.
Audited other repositories for similar issues.
Root Cause:
Developers hardcoded secrets in Terraform files for testing and forgot to remove them before committing. No pre-commit hooks or secret scanning was in place.
Fix/Workaround:
• Immediately rotated all exposed credentials.
• Removed secrets from git history using BFG Repo-Cleaner (see the sketch after this list).
• Implemented HashiCorp Vault for secure secret management.
• Updated Terraform code to use variables and secure backends for state.
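A hedged sketch of the rotation and history-rewrite steps; the IAM user name, key ID, repository URL, and replacements.txt file are all illustrative, and BFG is run against a --mirror clone (every existing clone must be re-cloned afterwards):
# 1. Rotate the exposed IAM access key (user name and key ID are placeholders).
aws iam create-access-key --user-name ci-deployer
aws iam delete-access-key --user-name ci-deployer --access-key-id AKIAEXAMPLEKEYID

# 2. Scrub the leaked strings from history; replacements.txt lists one secret per line.
git clone --mirror git@github.com:example-org/infra.git
java -jar bfg.jar --replace-text replacements.txt infra.git

# 3. Expire old refs and push the rewritten history (a mirror clone force-updates all refs).
cd infra.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push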
Lessons Learned:
Infrastructure as code requires careful handling of secrets and credentials.
How to Avoid:
Use secret management tools like HashiCorp Vault or AWS Secrets Manager.
Implement pre-commit hooks to prevent committing secrets (see the sketch after this list).
Add secret scanning to CI/CD pipelines.
Store Terraform state in secure, encrypted backends.
Use environment variables or external secret providers for sensitive values.
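One way to wire up the pre-commit idea above, sketched with gitleaks; the tool choice and its protect --staged mode are assumptions here — any scanner that can check staged changes fits the same hook:
#!/usr/bin/env bash
# Save as .git/hooks/pre-commit and make it executable (chmod +x).
set -euo pipefail

# Scan only the changes staged for this commit.
if ! gitleaks protect --staged; then
  echo "Potential secret detected; commit aborted." >&2
  exit 1
fi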
No summary provided
What Happened:
A Terraform apply operation was interrupted abnormally, leaving the state file locked in DynamoDB. All subsequent deployments failed with a "state locked" error.
Diagnosis Steps:
Checked DynamoDB table for lock information.
Reviewed CloudTrail logs to identify the operation that created the lock.
Verified that no Terraform operations were actually running.
Examined the Terraform state file for corruption.
Root Cause:
A Terraform apply was killed abruptly without releasing the state lock. The lock remained in DynamoDB, preventing any further operations.
Fix/Workaround:
• Manually removed the lock from DynamoDB (an alternative using terraform force-unlock is sketched after this list):
aws dynamodb delete-item --table-name terraform-state-locks --key '{"LockID":{"S":"terraform-state/environment/terraform.tfstate"}}'
• Verified state file integrity with terraform state list.
• Implemented proper error handling in CI/CD pipeline for Terraform operations.
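The same lock can usually be released through Terraform itself, which avoids hand-editing the lock table; the lock ID below is a placeholder taken from the error message:
# Release the stale lock via Terraform instead of deleting the DynamoDB item.
terraform force-unlock 6f1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d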
Lessons Learned:
Terraform state locks are critical for preventing concurrent modifications but can cause issues if not properly managed.
How to Avoid:
Use -lock-timeout in Terraform commands to prevent indefinite blocking (see the example after this list).
Implement monitoring for long-running locks.
Create automated processes for detecting and resolving stale locks.
Use remote operations (Terraform Cloud) for better lock management.
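For example (the 10-minute value is arbitrary):
# Wait up to 10 minutes for an existing lock to be released before failing.
terraform apply -lock-timeout=10m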
No summary provided
What Happened:
Multiple teams reported that their Terraform apply operations were failing with lock-related errors. The issue persisted even when no other team members were actively running Terraform.
Diagnosis Steps:
Examined Terraform error messages showing lock acquisition failures.
Checked the S3 bucket and DynamoDB table used for state storage and locking.
Reviewed CloudTrail logs for operations on the state bucket.
Analyzed recent Terraform runs that might have left stale locks.
Tested lock behavior with manual lock/unlock operations.
Root Cause:
A previous Terraform run had terminated abnormally due to network issues, leaving a stale lock in the DynamoDB table. The lock's lease time had not expired because the DynamoDB table was misconfigured without a proper Time-To-Live (TTL) attribute.
Fix/Workaround:
• Manually removed the stale lock from DynamoDB:
aws dynamodb delete-item \
--table-name terraform-state-locks \
--key '{"LockID": {"S": "terraform-state/environment/project/terraform.tfstate"}}'
• Corrected the backend configuration and set an explicit lock timeout on Terraform runs:
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "environment/project/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
# Note: the lock timeout is not a backend setting; it is passed per command,
# e.g. terraform apply -lock-timeout=10m
• Added a Lambda function to clean up stale locks automatically:
import boto3
import time

def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('terraform-state-locks')

    # Get all lock items (the table is small, so a single scan is sufficient here).
    response = table.scan()
    items = response['Items']

    # Check for stale locks (older than 24 hours). This assumes each lock item
    # carries a numeric 'Created' epoch attribute; Terraform itself only stores
    # the creation time inside the 'Info' JSON, so that attribute has to be
    # written by the pipeline when the lock is taken.
    current_time = int(time.time())
    stale_threshold = current_time - (24 * 60 * 60)  # 24 hours in seconds

    for item in items:
        if 'Created' in item and int(item['Created']) < stale_threshold:
            print(f"Removing stale lock: {item['LockID']}")
            table.delete_item(Key={'LockID': item['LockID']})

    return {
        'statusCode': 200,
        'body': f"Processed {len(items)} locks"
    }
Lessons Learned:
Terraform state locking requires proper timeout mechanisms and monitoring.
How to Avoid:
Configure DynamoDB TTL attributes for lock tables (see the example after this list).
Implement automated cleanup of stale locks.
Use Terraform Cloud or other managed solutions for state management.
Add monitoring for lock duration and alert on long-held locks.
Document procedures for manual lock resolution.
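A sketch of enabling TTL on the lock table; note that DynamoDB TTL only deletes items that carry a numeric epoch attribute (assumed here to be named ttl), and Terraform does not write that attribute itself, so the pipeline or a helper has to add it when a lock is taken:
# Enable TTL on the lock table, keyed on an epoch attribute named "ttl".
aws dynamodb update-time-to-live \
  --table-name terraform-state-locks \
  --time-to-live-specification "Enabled=true, AttributeName=ttl"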
No summary provided
What Happened:
After updating the AWS provider version in a Terraform configuration, all subsequent terraform apply operations failed with cryptic errors about incompatible arguments. The errors occurred in modules that had worked correctly before the update.
Diagnosis Steps:
Examined the exact error messages from Terraform.
Reviewed recent changes to the Terraform configuration.
Checked module version constraints in the configuration.
Compared module documentation for the old and new versions.
Tested with different provider and module versions.
Root Cause:
The AWS provider update introduced breaking changes that affected the modules. The module version constraints were too loose, allowing incompatible combinations of provider and module versions to be used together.
Fix/Workaround:
• Added explicit version constraints for both providers and modules:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16.0"  # Pin to a specific minor version
    }
  }
  required_version = ">= 1.2.0, < 1.5.0"  # Compatible Terraform versions
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.14.2"  # Pin to an exact version for stability

  # Module configuration
  name = "my-vpc"
  cidr = "10.0.0.0/16"
  # ...
}
• Created a version compatibility matrix for all modules and providers.
• Implemented a testing pipeline for version updates:
# .github/workflows/terraform-version-test.yml
name: Terraform Version Test

on:
  pull_request:
    paths:
      - '**/*.tf'
      - '.github/workflows/terraform-version-test.yml'

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        terraform: ['1.2.0', '1.3.0', '1.4.0', '1.4.6']
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ matrix.terraform }}
      - name: Terraform Init
        run: terraform init -backend=false
      - name: Terraform Validate
        run: terraform validate
Lessons Learned:
Loose version constraints can lead to unexpected compatibility issues during updates.
How to Avoid:
Use specific version constraints for both providers and modules.
Document version compatibility requirements.
Test infrastructure code with multiple Terraform versions.
Implement a staged update process for critical environments.
Use dependency locking with .terraform.lock.hcl files (see the example below).
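For instance, the lock file can be generated or refreshed for every platform the team runs Terraform on and then committed to version control; the platform list here is illustrative:
# Record provider checksums in .terraform.lock.hcl for each target platform.
terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64 \
  -platform=windows_amd64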