# Cloud Cost Optimization Scenarios
No summary provided
What Happened:
The finance team reported that the monthly AWS bill had doubled compared to the previous month, despite no significant changes in application traffic or new feature deployments. The increase appeared across multiple services, including EC2, EBS, and S3.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify the services with the largest increases (see the example query after this list).
Used AWS Cost Anomaly Detection to pinpoint specific resources contributing to the spike.
Compared resource inventories between the current and previous months.
Reviewed recent infrastructure changes through CloudTrail logs.
Examined Terraform state files and deployment history.
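A query along the following lines (a sketch using the AWS CLI; the dates and the jq/sort post-processing are illustrative) breaks the spend down by service for the two months being compared:
# cost_by_service.sh - month-over-month cost per service (dates are illustrative)
for MONTH_START in 2024-04-01 2024-05-01; do
  MONTH_END=$(date -d "$MONTH_START +1 month" +%Y-%m-%d)
  echo "== $MONTH_START to $MONTH_END =="
  aws ce get-cost-and-usage \
    --time-period Start=$MONTH_START,End=$MONTH_END \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=DIMENSION,Key=SERVICE \
    --output json \
    | jq -r '.ResultsByTime[].Groups[] | [.Keys[0], .Metrics.UnblendedCost.Amount] | @tsv' \
    | sort -t$'\t' -k2 -rn | head -15
done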
Root Cause:
Multiple issues contributed to the cost spike:
1. A load testing environment with 20 high-capacity EC2 instances was left running after tests completed.
2. Several terminated EC2 instances had orphaned EBS volumes that were never deleted.
3. A development S3 bucket was storing uncompressed log files with no lifecycle policies.
4. A misconfigured autoscaling group was scaling based on CPU rather than application-specific metrics, causing over-provisioning.
Fix/Workaround:
• Short-term: Identified and terminated unused resources:
#!/bin/bash
# cleanup_orphaned_resources.sh
set -euo pipefail
# Find and terminate orphaned EC2 instances
echo "Finding orphaned EC2 instances..."
ORPHANED_INSTANCES=$(aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=loadtest" "Name=instance-state-name,Values=running" \
--query "Reservations[].Instances[].InstanceId" \
--output text)
if [ -n "$ORPHANED_INSTANCES" ]; then
echo "Terminating orphaned instances: $ORPHANED_INSTANCES"
aws ec2 terminate-instances --instance-ids $ORPHANED_INSTANCES
else
echo "No orphaned instances found."
fi
# Find and delete unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query "Volumes[].VolumeId" \
--output text)
if [ -n "$UNATTACHED_VOLUMES" ]; then
for VOLUME_ID in $UNATTACHED_VOLUMES; do
echo "Deleting unattached volume: $VOLUME_ID"
aws ec2 delete-volume --volume-id $VOLUME_ID
done
else
echo "No unattached volumes found."
fi
# Find and clean up old snapshots
echo "Finding old EBS snapshots..."
RETENTION_DAYS=30
CUTOFF_DATE=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)
OLD_SNAPSHOTS=$(aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<='$CUTOFF_DATE'].SnapshotId" \
--output text)
if [ -n "$OLD_SNAPSHOTS" ]; then
for SNAPSHOT_ID in $OLD_SNAPSHOTS; do
echo "Deleting old snapshot: $SNAPSHOT_ID"
aws ec2 delete-snapshot --snapshot-id $SNAPSHOT_ID
done
else
echo "No old snapshots found."
fi
echo "Resource cleanup completed."
• Long-term: Implemented proper resource tagging and lifecycle management:
# resource_tagging.tf - Standardized tagging for all resources
locals {
common_tags = {
Environment = var.environment
Project = var.project_name
Owner = var.team_email
ManagedBy = "Terraform"
CostCenter = var.cost_center
Expiration = var.environment == "production" ? "permanent" : timeadd(timestamp(), "168h")
}
}
# ec2_instance.tf - EC2 instance with proper tagging and monitoring
resource "aws_instance" "application_server" {
ami = var.ami_id
instance_type = var.instance_type
subnet_id = var.subnet_id
# Ensure volumes are deleted on termination
root_block_device {
volume_type = "gp3"
volume_size = 50
delete_on_termination = true
encrypted = true
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-root-volume"
}
)
}
# Enable detailed monitoring for better autoscaling
monitoring = true
# Apply standardized tags
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-server"
}
)
# Ensure all tags are propagated to volumes
volume_tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-volumes"
}
)
}
# s3_bucket.tf - S3 bucket with lifecycle policies
resource "aws_s3_bucket" "logs_bucket" {
bucket = "${var.project_name}-${var.environment}-logs"
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-logs"
}
)
}
resource "aws_s3_bucket_lifecycle_configuration" "logs_lifecycle" {
bucket = aws_s3_bucket.logs_bucket.id
rule {
id = "log-transition-and-expiration"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
expiration {
days = 365
}
}
}
# autoscaling.tf - Improved autoscaling configuration
resource "aws_autoscaling_group" "application_asg" {
name = "${var.project_name}-${var.environment}-asg"
min_size = var.min_instances
max_size = var.max_instances
desired_capacity = var.desired_instances
vpc_zone_identifier = var.subnet_ids
launch_configuration = aws_launch_configuration.application_lc.name
# Use instance refresh for zero-downtime updates
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 90
}
}
# Tag instances launched by the ASG with the standard tag set
tag {
key = "Name"
value = "${var.project_name}-${var.environment}-asg-instance"
propagate_at_launch = true
}
dynamic "tag" {
for_each = local.common_tags
content {
key = tag.key
value = tag.value
propagate_at_launch = true
}
}
}
resource "aws_autoscaling_policy" "application_scaling_policy" {
name = "${var.project_name}-${var.environment}-scaling-policy"
autoscaling_group_name = aws_autoscaling_group.application_asg.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.application_lb.arn_suffix}/${aws_lb_target_group.application_tg.arn_suffix}"
}
target_value = 1000
disable_scale_in = false
}
}
• Implemented a cost monitoring and alerting system:
# cost_monitor.py
import boto3
import datetime
import json
import os
import logging
import urllib.request
from dateutil.relativedelta import relativedelta
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('cost_monitor')
# Configuration
BUDGET_THRESHOLD_PERCENT = 80
ANOMALY_THRESHOLD_PERCENT = 20
SNS_TOPIC_ARN = os.environ.get('SNS_TOPIC_ARN')
SLACK_WEBHOOK_URL = os.environ.get('SLACK_WEBHOOK_URL')
def lambda_handler(event, context):
"""AWS Lambda handler for cost monitoring"""
try:
# Initialize clients
ce_client = boto3.client('ce')
budgets_client = boto3.client('budgets')
sns_client = boto3.client('sns')
# Get current date information
today = datetime.datetime.utcnow().date()
first_day_month = today.replace(day=1)
last_day_month = (first_day_month + relativedelta(months=1, days=-1))
# Get month-to-date costs
mtd_costs = get_month_to_date_costs(ce_client, first_day_month, today)
# Get cost forecast for the month
forecast = get_cost_forecast(ce_client, today, last_day_month)
# Check budgets
budget_alerts = check_budgets(budgets_client)
# Check for cost anomalies
anomalies = detect_cost_anomalies(ce_client)
# Generate cost report
cost_report = {
'month_to_date': mtd_costs,
'forecast': forecast,
'budget_alerts': budget_alerts,
'anomalies': anomalies
}
# Send notifications if needed
if budget_alerts or anomalies:
send_notifications(sns_client, cost_report)
return {
'statusCode': 200,
'body': json.dumps(cost_report)
}
except Exception as e:
logger.error(f"Error in cost monitoring: {str(e)}")
raise
def get_month_to_date_costs(ce_client, start_date, end_date):
"""Get month-to-date costs from AWS Cost Explorer"""
response = ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date.isoformat(),
'End': end_date.isoformat()
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[
{
'Type': 'DIMENSION',
'Key': 'SERVICE'
}
]
)
total_cost = 0
service_costs = {}
for result in response['ResultsByTime']:
for group in result['Groups']:
service = group['Keys'][0]
amount = float(group['Metrics']['UnblendedCost']['Amount'])
service_costs[service] = amount
total_cost += amount
return {
'total': total_cost,
'by_service': service_costs
}
def get_cost_forecast(ce_client, start_date, end_date):
"""Get cost forecast from AWS Cost Explorer"""
response = ce_client.get_cost_forecast(
TimePeriod={
'Start': start_date.isoformat(),
'End': end_date.isoformat()
},
Metric='UNBLENDED_COST',
Granularity='MONTHLY'
)
return {
'total': float(response['Total']['Amount']),
'forecast_date': response['ForecastResultsByTime'][0]['TimePeriod']
}
def check_budgets(budgets_client):
"""Check AWS Budgets for alerts"""
response = budgets_client.describe_budgets(
AccountId=boto3.client('sts').get_caller_identity()['Account']
)
alerts = []
for budget in response.get('Budgets', []):
budget_name = budget['BudgetName']
budget_amount = float(budget['BudgetLimit']['Amount'])
actual_amount = float(budget.get('CalculatedSpend', {}).get('ActualSpend', {}).get('Amount', 0))
forecast_amount = float(budget.get('CalculatedSpend', {}).get('ForecastedSpend', {}).get('Amount', 0))
# Check if actual spend exceeds threshold
actual_percent = (actual_amount / budget_amount) * 100
if actual_percent >= BUDGET_THRESHOLD_PERCENT:
alerts.append({
'budget_name': budget_name,
'budget_amount': budget_amount,
'actual_amount': actual_amount,
'actual_percent': actual_percent,
'type': 'actual'
})
# Check if forecast exceeds budget
if forecast_amount > budget_amount:
forecast_percent = (forecast_amount / budget_amount) * 100
alerts.append({
'budget_name': budget_name,
'budget_amount': budget_amount,
'forecast_amount': forecast_amount,
'forecast_percent': forecast_percent,
'type': 'forecast'
})
return alerts
def detect_cost_anomalies(ce_client):
"""Detect cost anomalies using AWS Cost Anomaly Detection"""
# Get anomaly monitors
monitors_response = ce_client.get_anomaly_monitors()
anomalies = []
# For each monitor, get anomalies
for monitor in monitors_response.get('AnomalyMonitors', []):
monitor_arn = monitor['MonitorArn']
# Get anomalies for the last 30 days
end_date = datetime.datetime.utcnow().date()
start_date = end_date - datetime.timedelta(days=30)
anomalies_response = ce_client.get_anomalies(
MonitorArn=monitor_arn,
DateInterval={
'StartDate': start_date.isoformat(),
'EndDate': end_date.isoformat()
}
)
for anomaly in anomalies_response.get('Anomalies', []):
# Impact is a structure; use its dollar amount and percentage fields
impact_amount = float(anomaly['Impact'].get('TotalImpact', 0))
impact_percent = float(anomaly['Impact'].get('TotalImpactPercentage') or 0)
# Only report significant anomalies
if impact_percent >= ANOMALY_THRESHOLD_PERCENT:
anomalies.append({
'id': anomaly['AnomalyId'],
'monitor_name': monitor['MonitorName'],
'impact': impact_amount,
'impact_percent': impact_percent,
'root_causes': anomaly.get('RootCauses', []),
'start_date': anomaly['AnomalyStartDate'],
'end_date': anomaly.get('AnomalyEndDate')
})
return anomalies
def send_notifications(sns_client, cost_report):
"""Send notifications about cost issues"""
# Format message
message = format_notification_message(cost_report)
# Send SNS notification
if SNS_TOPIC_ARN:
sns_client.publish(
TopicArn=SNS_TOPIC_ARN,
Subject='AWS Cost Alert',
Message=message
)
# Send Slack notification
if SLACK_WEBHOOK_URL:
send_slack_notification(cost_report)
def format_notification_message(cost_report):
"""Format notification message"""
message = "AWS Cost Alert\n\n"
# Add budget alerts
if cost_report['budget_alerts']:
message += "Budget Alerts:\n"
for alert in cost_report['budget_alerts']:
if alert['type'] == 'actual':
message += f"- Budget '{alert['budget_name']}' has reached {alert['actual_percent']:.1f}% " \
f"(${alert['actual_amount']:.2f} of ${alert['budget_amount']:.2f})\n"
else:
message += f"- Budget '{alert['budget_name']}' is forecasted to reach {alert['forecast_percent']:.1f}% " \
f"(${alert['forecast_amount']:.2f} of ${alert['budget_amount']:.2f})\n"
message += "\n"
# Add anomalies
if cost_report['anomalies']:
message += "Cost Anomalies:\n"
for anomaly in cost_report['anomalies']:
message += f"- {anomaly['monitor_name']}: {anomaly['impact_percent']:.1f}% increase detected\n"
if anomaly['root_causes']:
message += " Root causes:\n"
for cause in anomaly['root_causes']:
service = cause.get('Service', 'Unknown service')
message += f" - {service}\n"
message += "\n"
# Add month-to-date costs
message += "Month-to-Date Costs:\n"
message += f"- Total: ${cost_report['month_to_date']['total']:.2f}\n"
message += "- Top Services:\n"
# Sort services by cost (descending)
sorted_services = sorted(
cost_report['month_to_date']['by_service'].items(),
key=lambda x: x[1],
reverse=True
)
# Show top 5 services
for service, cost in sorted_services[:5]:
message += f" - {service}: ${cost:.2f}\n"
# Add forecast
message += f"\nForecast for this month: ${cost_report['forecast']['total']:.2f}\n"
return message
def send_slack_notification(cost_report):
"""Send notification to Slack via an incoming webhook (minimal implementation)"""
payload = json.dumps({'text': format_notification_message(cost_report)}).encode('utf-8')
request = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload, headers={'Content-Type': 'application/json'})
urllib.request.urlopen(request)
if __name__ == '__main__':
# For local testing
lambda_handler(None, None)
• Created a resource tagging enforcement policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EnforceTaggingOnResourceCreation",
"Effect": "Deny",
"Action": [
"ec2:RunInstances",
"ec2:CreateVolume",
"rds:CreateDBInstance",
"s3:CreateBucket",
"dynamodb:CreateTable",
"elasticloadbalancing:CreateLoadBalancer"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/Environment": "true",
"aws:RequestTag/Project": "true",
"aws:RequestTag/Owner": "true",
"aws:RequestTag/CostCenter": "true"
}
}
},
{
"Sid": "EnforceTaggingOnResourceTagging",
"Effect": "Deny",
"Action": [
"ec2:CreateTags"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/Environment": "true",
"aws:RequestTag/Project": "true",
"aws:RequestTag/Owner": "true",
"aws:RequestTag/CostCenter": "true"
},
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"Environment",
"Project",
"Owner",
"CostCenter"
]
}
}
},
{
"Sid": "EnforceTaggingOnResourceModification",
"Effect": "Deny",
"Action": [
"ec2:ModifyInstanceAttribute",
"rds:ModifyDBInstance",
"dynamodb:UpdateTable",
"elasticloadbalancing:ModifyLoadBalancerAttributes"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:ResourceTag/Environment": "true",
"aws:ResourceTag/Project": "true",
"aws:ResourceTag/Owner": "true",
"aws:ResourceTag/CostCenter": "true"
}
}
}
]
}
Lessons Learned:
Cloud costs require proactive monitoring and governance to prevent unexpected spikes.
How to Avoid:
Implement mandatory resource tagging with ownership information.
Set up automated cleanup of non-production resources after a defined period.
Configure budget alerts and anomaly detection with appropriate thresholds (see the example budget configuration after this list).
Use infrastructure as code with proper resource lifecycle management.
Implement regular cost reviews and optimization processes.
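For the budget-alert recommendation above, a minimal sketch of creating a monthly cost budget with an 80% actual-spend notification via the AWS CLI (the limit amount and e-mail address are placeholders):
# create_budget_alert.sh - monthly cost budget with an 80% threshold notification (values are placeholders)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{
    "BudgetName": "monthly-total-cost",
    "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}]
    }
  ]'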
No summary provided
What Happened:
The finance team reported a significant spike in cloud costs during the monthly review. The increase couldn't be attributed to any planned infrastructure expansion or traffic growth. The DevOps team was tasked with identifying and resolving the issue quickly.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify which services showed cost increases.
Compared current resource usage with historical baselines.
Used AWS Cost Anomaly Detection to pinpoint specific resources (see the example command after this list).
Reviewed recent infrastructure changes across all environments.
Checked for orphaned resources using custom tagging policies.
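For the anomaly-detection step, a command along these lines (a sketch; the 30-day window and the $100 impact threshold are illustrative) lists recent anomalies together with their dollar impact and most likely root-cause service:
# List cost anomalies from the last 30 days with a total impact of at least $100
aws ce get-anomalies \
  --date-interval StartDate=$(date -d '30 days ago' +%Y-%m-%d),EndDate=$(date +%Y-%m-%d) \
  --total-impact NumericOperator=GREATER_THAN_OR_EQUAL,StartValue=100 \
  --output json \
  | jq -r '.Anomalies[] | [.AnomalyId, .Impact.TotalImpact, (.RootCauses[0].Service // "unknown")] | @tsv'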
Root Cause:
Multiple causes were identified:
1. A development team had launched large GPU instances for ML testing but didn't terminate them after testing.
2. Several EBS volumes remained after their associated EC2 instances were terminated.
3. A misconfigured autoscaling group was scaling up but not properly scaling down.
4. Several NAT Gateways were running in regions where they were no longer needed.
5. A large number of unattached Elastic IPs were being billed.
Fix/Workaround:
• Short-term: Identified and terminated unnecessary resources:
#!/bin/bash
# cleanup_orphaned_resources.sh
# Set AWS region
REGION="us-west-2"
echo "Cleaning up orphaned resources in $REGION..."
# Find and terminate stopped instances launched more than 7 days ago
echo "Finding stopped EC2 instances launched more than 7 days ago..."
STOPPED_INSTANCES=$(aws ec2 describe-instances \
--region $REGION \
--filters "Name=instance-state-name,Values=stopped" \
--query "Reservations[].Instances[?LaunchTime<='$(date -d '7 days ago' --iso-8601)'].InstanceId" \
--output text)
if [ -n "$STOPPED_INSTANCES" ]; then
echo "Terminating stopped instances: $STOPPED_INSTANCES"
aws ec2 terminate-instances --region $REGION --instance-ids $STOPPED_INSTANCES
else
echo "No stopped instances found to terminate."
fi
# Find and delete unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--region $REGION \
--filters "Name=status,Values=available" \
--query "Volumes[].VolumeId" \
--output text)
if [ -n "$UNATTACHED_VOLUMES" ]; then
for VOLUME_ID in $UNATTACHED_VOLUMES; do
echo "Deleting unattached volume: $VOLUME_ID"
aws ec2 delete-volume --region $REGION --volume-id $VOLUME_ID
done
else
echo "No unattached volumes found."
fi
# Find and release unassociated Elastic IPs
echo "Finding unassociated Elastic IPs..."
UNASSOCIATED_EIPS=$(aws ec2 describe-addresses \
--region $REGION \
--query "Addresses[?AssociationId==null].AllocationId" \
--output text)
if [ -n "$UNASSOCIATED_EIPS" ]; then
for EIP_ID in $UNASSOCIATED_EIPS; do
echo "Releasing unassociated Elastic IP: $EIP_ID"
aws ec2 release-address --region $REGION --allocation-id $EIP_ID
done
else
echo "No unassociated Elastic IPs found."
fi
# Find and delete unused NAT Gateways
echo "Finding unused NAT Gateways..."
UNUSED_NAT_GATEWAYS=$(aws ec2 describe-nat-gateways \
--region $REGION \
--filter "Name=state,Values=available" \
--query "NatGateways[].NatGatewayId" \
--output text)
if [ -n "$UNUSED_NAT_GATEWAYS" ]; then
for NAT_ID in $UNUSED_NAT_GATEWAYS; do
# Check if NAT Gateway is actually in use by checking route tables
ROUTES_USING_NAT=$(aws ec2 describe-route-tables \
--region $REGION \
--filters "Name=route.nat-gateway-id,Values=$NAT_ID" \
--query "RouteTables[].RouteTableId" \
--output text)
if [ -z "$ROUTES_USING_NAT" ]; then
echo "Deleting unused NAT Gateway: $NAT_ID"
aws ec2 delete-nat-gateway --region $REGION --nat-gateway-id $NAT_ID
else
echo "NAT Gateway $NAT_ID is in use by route tables: $ROUTES_USING_NAT"
fi
done
else
echo "No unused NAT Gateways found."
fi
echo "Cleanup completed!"
• Long-term: Implemented a comprehensive cost management strategy:
# cost_governance.tf - Resource tagging and lifecycle policies
# Define required tags
locals {
required_tags = {
Environment = "Required"
Project = "Required"
Owner = "Required"
CostCenter = "Required"
Expiration = "Optional"
}
}
# AWS Provider with default tags
provider "aws" {
region = "us-west-2"
default_tags {
tags = {
ManagedBy = "Terraform"
}
}
}
# IAM policy to enforce tagging
resource "aws_iam_policy" "enforce_tagging" {
name = "enforce-resource-tagging"
description = "Enforces tagging standards for AWS resources"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "EnforceTaggingOnResourceCreation"
Effect = "Deny"
Action = [
"ec2:RunInstances",
"ec2:CreateVolume",
"rds:CreateDBInstance",
"dynamodb:CreateTable",
"s3:CreateBucket"
]
Resource = "*"
Condition = {
"Null" = {
"aws:RequestTag/Environment" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/CostCenter" = "true"
}
}
}
]
})
}
# EC2 instance with lifecycle rules
resource "aws_instance" "example" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "example-instance"
Environment = "development"
Project = "cost-optimization"
Owner = "devops-team"
CostCenter = "engineering"
Expiration = "2023-12-31"
}
# Lifecycle rules: prevent accidental deletion and create replacements before destroying
# (Terraform allows only one lifecycle block per resource)
lifecycle {
prevent_destroy = true
create_before_destroy = true
}
}
# Auto Scaling Group with proper scaling policies
resource "aws_autoscaling_group" "example" {
name = "example-asg"
min_size = 1
max_size = 5
desired_capacity = 2
# Subnets imply the AZ spread; availability_zones conflicts with vpc_zone_identifier,
# and launch_configuration conflicts with the mixed_instances_policy below, so neither is set
vpc_zone_identifier = [aws_subnet.example.id]
# Leave scale-in protection off so the ASG can remove idle capacity
protect_from_scale_in = false
# Use mixed instances policy for cost optimization
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 1
on_demand_percentage_above_base_capacity = 50
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.example.id
version = "$Latest"
}
override {
instance_type = "t3.micro"
}
override {
instance_type = "t3a.micro"
}
}
}
# Propagate standard tags to instances launched by the ASG
tag {
key = "Name"
value = "example-asg-instance"
propagate_at_launch = true
}
tag {
key = "Environment"
value = "development"
propagate_at_launch = true
}
tag {
key = "Project"
value = "cost-optimization"
propagate_at_launch = true
}
tag {
key = "Owner"
value = "devops-team"
propagate_at_launch = true
}
tag {
key = "CostCenter"
value = "engineering"
propagate_at_launch = true
}
}
# Proper scaling policies
resource "aws_autoscaling_policy" "scale_up" {
name = "scale-up"
scaling_adjustment = 1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.example.name
}
resource "aws_autoscaling_policy" "scale_down" {
name = "scale-down"
scaling_adjustment = -1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.example.name
}
# CloudWatch alarms for scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "high-cpu-utilization"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300
statistic = "Average"
threshold = 70
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.example.name
}
alarm_description = "Scale up when CPU exceeds 70%"
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
resource "aws_cloudwatch_metric_alarm" "low_cpu" {
alarm_name = "low-cpu-utilization"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300
statistic = "Average"
threshold = 30
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.example.name
}
alarm_description = "Scale down when CPU is below 30%"
alarm_actions = [aws_autoscaling_policy.scale_down.arn]
}
# S3 lifecycle policy for cost optimization
resource "aws_s3_bucket" "example" {
bucket = "example-bucket"
tags = {
Name = "example-bucket"
Environment = "development"
Project = "cost-optimization"
Owner = "devops-team"
CostCenter = "engineering"
}
}
resource "aws_s3_bucket_lifecycle_configuration" "example" {
bucket = aws_s3_bucket.example.id
rule {
id = "transition-to-infrequent-access"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
expiration {
days = 365
}
}
}
# DynamoDB auto-scaling
resource "aws_appautoscaling_target" "dynamodb_table_read_target" {
max_capacity = 100
min_capacity = 5
resource_id = "table/example-table"
scalable_dimension = "dynamodb:table:ReadCapacityUnits"
service_namespace = "dynamodb"
}
resource "aws_appautoscaling_policy" "dynamodb_table_read_policy" {
name = "dynamodb-table-read-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.dynamodb_table_read_target.resource_id
scalable_dimension = aws_appautoscaling_target.dynamodb_table_read_target.scalable_dimension
service_namespace = aws_appautoscaling_target.dynamodb_table_read_target.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "DynamoDBReadCapacityUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 300
}
}
• Implemented a Go-based cost monitoring and alerting system:
// cost_monitor.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"os"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/costexplorer"
"github.com/aws/aws-sdk-go-v2/service/costexplorer/types"
"github.com/aws/aws-sdk-go-v2/service/ec2"
ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
"github.com/aws/aws-sdk-go-v2/service/sns"
"github.com/robfig/cron/v3"
)
type ResourceCost struct {
ResourceID string `json:"resourceId"`
Service string `json:"service"`
Cost float64 `json:"cost"`
Currency string `json:"currency"`
Region string `json:"region"`
Tags map[string]string `json:"tags"`
}
type OrphanedResource struct {
ResourceID string `json:"resourceId"`
ResourceType string `json:"resourceType"`
Region string `json:"region"`
CreatedAt time.Time `json:"createdAt"`
Tags map[string]string `json:"tags"`
}
func main() {
// Load AWS configuration
cfg, err := config.LoadDefaultConfig(context.TODO(),
config.WithRegion("us-west-2"),
)
if err != nil {
log.Fatalf("unable to load SDK config, %v", err)
}
// Create Cost Explorer client
ceClient := costexplorer.NewFromConfig(cfg)
// Create EC2 client
ec2Client := ec2.NewFromConfig(cfg)
// Create SNS client for notifications
snsClient := sns.NewFromConfig(cfg)
// Set up cron scheduler
c := cron.New()
// Run daily cost analysis
c.AddFunc("0 1 * * *", func() {
log.Println("Running daily cost analysis...")
// Get yesterday's date
yesterday := time.Now().AddDate(0, 0, -1)
startDate := yesterday.Format("2006-01-02")
endDate := time.Now().Format("2006-01-02")
// Get cost and usage data
costData, err := getCostAndUsage(ceClient, startDate, endDate)
if err != nil {
log.Printf("Error getting cost data: %v", err)
return
}
// Find orphaned resources
orphanedResources, err := findOrphanedResources(ec2Client)
if err != nil {
log.Printf("Error finding orphaned resources: %v", err)
}
// Check for cost anomalies
anomalies, err := detectCostAnomalies(ceClient, startDate, endDate)
if err != nil {
log.Printf("Error detecting cost anomalies: %v", err)
}
// Send notifications if needed
if len(orphanedResources) > 0 || len(anomalies) > 0 {
sendNotification(snsClient, costData, orphanedResources, anomalies)
}
})
// Start cron scheduler
c.Start()
// Keep the application running
select {}
}
func getCostAndUsage(client *costexplorer.Client, startDate, endDate string) ([]ResourceCost, error) {
input := &costexplorer.GetCostAndUsageInput{
TimePeriod: &types.DateInterval{
Start: aws.String(startDate),
End: aws.String(endDate),
},
Granularity: types.GranularityDaily,
Metrics: []string{"BlendedCost"},
GroupBy: []types.GroupDefinition{
{
Type: types.GroupDefinitionTypeDimension,
Key: aws.String("SERVICE"),
},
{
Type: types.GroupDefinitionTypeTag,
Key: aws.String("ResourceId"),
},
},
}
result, err := client.GetCostAndUsage(context.TODO(), input)
if err != nil {
return nil, fmt.Errorf("failed to get cost and usage: %w", err)
}
var resources []ResourceCost
for _, resultByTime := range result.ResultsByTime {
for _, group := range resultByTime.Groups {
// Parse the cost amount
cost := 0.0
if len(group.Metrics) > 0 {
if blendedCost, ok := group.Metrics["BlendedCost"]; ok {
if amount := blendedCost.Amount; amount != nil {
if parsedCost, err := parseFloat(*amount); err == nil {
cost = parsedCost
}
}
}
}
// Skip resources with zero cost
if cost == 0 {
continue
}
// Extract service and resource ID
service := ""
resourceID := ""
if len(group.Keys) >= 2 {
service = group.Keys[0]
resourceID = group.Keys[1]
}
// Create resource cost entry
resources = append(resources, ResourceCost{
ResourceID: resourceID,
Service: service,
Cost: cost,
Currency: "USD", // Assuming USD
Region: "us-west-2", // Assuming us-west-2
Tags: map[string]string{}, // Would need additional API calls to get tags
})
}
}
return resources, nil
}
func findOrphanedResources(client *ec2.Client) ([]OrphanedResource, error) {
var orphanedResources []OrphanedResource
// Find unattached EBS volumes
volumesResult, err := client.DescribeVolumes(context.TODO(), &ec2.DescribeVolumesInput{
Filters: []ec2types.Filter{
{
Name: aws.String("status"),
Values: []string{"available"},
},
},
})
if err != nil {
return nil, fmt.Errorf("failed to describe volumes: %w", err)
}
for _, volume := range volumesResult.Volumes {
tags := make(map[string]string)
for _, tag := range volume.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *volume.VolumeId,
ResourceType: "EBS Volume",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: *volume.CreateTime,
Tags: tags,
})
}
// Find unassociated Elastic IPs
addressesResult, err := client.DescribeAddresses(context.TODO(), &ec2.DescribeAddressesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe addresses: %w", err)
}
for _, address := range addressesResult.Addresses {
if address.AssociationId == nil {
tags := make(map[string]string)
for _, tag := range address.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *address.AllocationId,
ResourceType: "Elastic IP",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: time.Now(), // EIPs don't have creation time in the API
Tags: tags,
})
}
}
// Find stopped EC2 instances
instancesResult, err := client.DescribeInstances(context.TODO(), &ec2.DescribeInstancesInput{
Filters: []ec2types.Filter{
{
Name: aws.String("instance-state-name"),
Values: []string{"stopped"},
},
},
})
if err != nil {
return nil, fmt.Errorf("failed to describe instances: %w", err)
}
for _, reservation := range instancesResult.Reservations {
for _, instance := range reservation.Instances {
// Check if instance has been stopped for more than 7 days
if instance.StateTransitionReason != nil {
// Parse the state transition reason to get the stop time
// This is a bit hacky but the API doesn't provide a direct way to get this
if stopTime, err := parseStateTransitionTime(*instance.StateTransitionReason); err == nil {
if time.Since(stopTime) > 7*24*time.Hour {
tags := make(map[string]string)
for _, tag := range instance.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *instance.InstanceId,
ResourceType: "EC2 Instance",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: *instance.LaunchTime,
Tags: tags,
})
}
}
}
}
}
return orphanedResources, nil
}
func detectCostAnomalies(client *costexplorer.Client, startDate, endDate string) ([]types.AnomalyMonitor, error) {
// Get cost anomaly monitors
monitorsResult, err := client.GetAnomalyMonitors(context.TODO(), &costexplorer.GetAnomalyMonitorsInput{})
if err != nil {
return nil, fmt.Errorf("failed to get anomaly monitors: %w", err)
}
var anomalies []types.AnomalyMonitor
// For each monitor, get anomalies
for _, monitor := range monitorsResult.AnomalyMonitors {
subscriptionsResult, err := client.GetAnomalySubscriptions(context.TODO(), &costexplorer.GetAnomalySubscriptionsInput{
MonitorArn: monitor.MonitorArn,
})
if err != nil {
log.Printf("Failed to get anomaly subscriptions for monitor %s: %v", *monitor.MonitorArn, err)
continue
}
for _, subscription := range subscriptionsResult.AnomalySubscriptions {
anomaliesResult, err := client.GetAnomalies(context.TODO(), &costexplorer.GetAnomaliesInput{
DateInterval: &types.AnomalyDateInterval{
StartDate: aws.String(startDate),
EndDate: aws.String(endDate),
},
MonitorArn: monitor.MonitorArn,
SubscriptionArn: subscription.SubscriptionArn,
})
if err != nil {
log.Printf("Failed to get anomalies for subscription %s: %v", *subscription.SubscriptionArn, err)
continue
}
if len(anomaliesResult.Anomalies) > 0 {
anomalies = append(anomalies, monitor)
break
}
}
}
return anomalies, nil
}
func sendNotification(client *sns.Client, costData []ResourceCost, orphanedResources []OrphanedResource, anomalies []types.AnomalyMonitor) {
// Prepare notification message
message := "AWS Cost Optimization Report\n\n"
// Add cost data summary
totalCost := 0.0
for _, resource := range costData {
totalCost += resource.Cost
}
message += fmt.Sprintf("Total Cost: $%.2f\n\n", totalCost)
// Add orphaned resources
if len(orphanedResources) > 0 {
message += "Orphaned Resources:\n"
for _, resource := range orphanedResources {
message += fmt.Sprintf("- %s (%s): %s (Created: %s)\n",
resource.ResourceType,
resource.ResourceID,
resource.Region,
resource.CreatedAt.Format("2006-01-02"))
}
message += "\n"
}
// Add cost anomalies
if len(anomalies) > 0 {
message += "Cost Anomalies Detected:\n"
for _, anomaly := range anomalies {
message += fmt.Sprintf("- %s: %s\n", *anomaly.MonitorName, anomaly.MonitorType)
}
message += "\n"
}
// Add top 10 most expensive resources
if len(costData) > 0 {
message += "Top 10 Most Expensive Resources:\n"
// Sort cost data by cost (descending)
// This is a simple bubble sort for demonstration
for i := 0; i < len(costData); i++ {
for j := i + 1; j < len(costData); j++ {
if costData[i].Cost < costData[j].Cost {
costData[i], costData[j] = costData[j], costData[i]
}
}
}
// Take top 10 or less
count := 10
if len(costData) < 10 {
count = len(costData)
}
for i := 0; i < count; i++ {
message += fmt.Sprintf("- %s (%s): $%.2f\n",
costData[i].ResourceID,
costData[i].Service,
costData[i].Cost)
}
}
// Send SNS notification
_, err := client.Publish(context.TODO(), &sns.PublishInput{
TopicArn: aws.String(os.Getenv("SNS_TOPIC_ARN")),
Subject: aws.String("AWS Cost Optimization Report"),
Message: aws.String(message),
})
if err != nil {
log.Printf("Failed to send notification: %v", err)
} else {
log.Println("Cost optimization notification sent successfully")
}
}
// Helper functions
func parseFloat(s string) (float64, error) {
var f float64
_, err := fmt.Sscanf(s, "%f", &f)
return f, err
}
func parseStateTransitionTime(reason string) (time.Time, error) {
// Example: "User initiated (2023-05-15 10:30:00 GMT)"
var year, month, day, hour, min, sec int
_, err := fmt.Sscanf(reason, "User initiated (%d-%d-%d %d:%d:%d GMT)",
&year, &month, &day, &hour, &min, &sec)
if err != nil {
return time.Time{}, err
}
return time.Date(year, time.Month(month), day, hour, min, sec, 0, time.UTC), nil
}
• Created a Rust-based resource tagging enforcement tool:
// tag_enforcer.rs
use aws_config::meta::region::RegionProviderChain;
use aws_sdk_ec2::{Client as Ec2Client, Error as Ec2Error};
use aws_sdk_resourcegroupstaggingapi::{Client as TaggingClient, Error as TaggingError};
use aws_sdk_sns::{Client as SnsClient, Error as SnsError};
use chrono::{DateTime, Duration, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::env;
use structopt::StructOpt;
use tokio::time;
#[derive(Debug, StructOpt)]
#[structopt(name = "tag-enforcer", about = "AWS resource tag enforcement tool")]
struct Opt {
/// AWS Region
#[structopt(short, long, default_value = "us-west-2")]
region: String,
/// SNS Topic ARN for notifications
#[structopt(long, env = "SNS_TOPIC_ARN")]
sns_topic_arn: String,
/// Dry run mode (don't make any changes)
#[structopt(long)]
dry_run: bool,
/// Required tags (comma-separated)
#[structopt(long, default_value = "Environment,Project,Owner,CostCenter")]
required_tags: String,
}
#[derive(Debug, Serialize, Deserialize)]
struct UntaggedResource {
resource_arn: String,
resource_type: String,
missing_tags: Vec<String>,
existing_tags: HashMap<String, String>,
}
#[derive(Debug, Serialize, Deserialize)]
struct ExpiringResource {
resource_arn: String,
resource_type: String,
expiration_date: String,
days_until_expiration: i64,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let opt = Opt::from_args();
// Load AWS configuration
let region_provider = RegionProviderChain::first_try(opt.region.clone())
.or_default_provider()
.or_else(opt.region);
let config = aws_config::from_env().region(region_provider).load().await;
// Create clients
let tagging_client = TaggingClient::new(&config);
let ec2_client = Ec2Client::new(&config);
let sns_client = SnsClient::new(&config);
// Parse required tags
let required_tags: Vec<String> = opt.required_tags
.split(',')
.map(|s| s.trim().to_string())
.collect();
println!("Starting tag enforcement with required tags: {:?}", required_tags);
// Run tag enforcement every day
loop {
println!("Running tag enforcement check...");
// Find untagged resources
let untagged_resources = find_untagged_resources(&tagging_client, &required_tags).await?;
// Find expiring resources
let expiring_resources = find_expiring_resources(&tagging_client).await?;
// Take action on untagged resources
if !opt.dry_run && !untagged_resources.is_empty() {
handle_untagged_resources(&ec2_client, &untagged_resources).await?;
}
// Send notification
if !untagged_resources.is_empty() || !expiring_resources.is_empty() {
send_notification(
&sns_client,
&opt.sns_topic_arn,
&untagged_resources,
&expiring_resources,
opt.dry_run,
).await?;
}
// Wait for next run
println!("Tag enforcement check completed. Waiting for next run...");
time::sleep(time::Duration::from_secs(24 * 60 * 60)).await;
}
}
async fn find_untagged_resources(
client: &TaggingClient,
required_tags: &[String],
) -> Result<Vec<UntaggedResource>, TaggingError> {
let mut untagged_resources = Vec::new();
let mut pagination_token = None;
loop {
let resp = client
.get_resources()
.set_pagination_token(pagination_token)
.send()
.await?;
if let Some(resources) = resp.resource_tag_mapping_list() {
for resource in resources {
let resource_arn = match resource.resource_arn() {
Some(arn) => arn,
None => continue,
};
let tags = resource.tags().unwrap_or_default();
let tag_keys: Vec<&str> = tags.iter().map(|t| t.key().unwrap_or_default()).collect();
let mut missing_tags = Vec::new();
for required_tag in required_tags {
if !tag_keys.contains(&required_tag.as_str()) {
missing_tags.push(required_tag.clone());
}
}
if !missing_tags.is_empty() {
let mut existing_tags = HashMap::new();
for tag in tags {
if let (Some(key), Some(value)) = (tag.key(), tag.value()) {
existing_tags.insert(key.to_string(), value.to_string());
}
}
let resource_type = resource_arn.split(':').nth(2).unwrap_or("unknown").to_string();
untagged_resources.push(UntaggedResource {
resource_arn: resource_arn.to_string(),
resource_type,
missing_tags,
existing_tags,
});
}
}
}
pagination_token = resp.pagination_token().map(|s| s.to_string());
if pagination_token.is_none() {
break;
}
}
Ok(untagged_resources)
}
async fn find_expiring_resources(
client: &TaggingClient,
) -> Result<Vec<ExpiringResource>, TaggingError> {
let mut expiring_resources = Vec::new();
let mut pagination_token = None;
loop {
let resp = client
.get_resources()
.set_pagination_token(pagination_token)
.send()
.await?;
if let Some(resources) = resp.resource_tag_mapping_list() {
for resource in resources {
let resource_arn = match resource.resource_arn() {
Some(arn) => arn,
None => continue,
};
let tags = resource.tags().unwrap_or_default();
// Check for Expiration tag
for tag in tags {
if let (Some(key), Some(value)) = (tag.key(), tag.value()) {
if key == "Expiration" {
// Parse expiration date
if let Ok(expiration_date) = chrono::NaiveDate::parse_from_str(value, "%Y-%m-%d") {
let expiration_datetime = DateTime::<Utc>::from_utc(
expiration_date.and_hms(0, 0, 0),
Utc,
);
let now = Utc::now();
// Calculate days until expiration
let days_until_expiration = (expiration_datetime - now).num_days();
// If expiring within 7 days, add to list
if days_until_expiration >= 0 && days_until_expiration <= 7 {
let resource_type = resource_arn.split(':').nth(2).unwrap_or("unknown").to_string();
expiring_resources.push(ExpiringResource {
resource_arn: resource_arn.to_string(),
resource_type,
expiration_date: value.to_string(),
days_until_expiration,
});
}
}
}
}
}
}
}
pagination_token = resp.pagination_token().map(|s| s.to_string());
if pagination_token.is_none() {
break;
}
}
Ok(expiring_resources)
}
async fn handle_untagged_resources(
client: &Ec2Client,
untagged_resources: &[UntaggedResource],
) -> Result<(), Ec2Error> {
for resource in untagged_resources {
// For EC2 instances, stop if missing required tags
if resource.resource_type == "ec2" && resource.resource_arn.contains(":instance/") {
let instance_id = resource.resource_arn.split('/').last().unwrap_or_default();
println!("Stopping untagged EC2 instance: {}", instance_id);
client
.stop_instances()
.instance_ids(instance_id)
.send()
.await?;
}
}
Ok(())
}
async fn send_notification(
client: &SnsClient,
topic_arn: &str,
untagged_resources: &[UntaggedResource],
expiring_resources: &[ExpiringResource],
dry_run: bool,
) -> Result<(), SnsError> {
let mut message = String::new();
message.push_str("AWS Resource Tag Enforcement Report\n\n");
if dry_run {
message.push_str("*** DRY RUN MODE - No actions taken ***\n\n");
}
// Add untagged resources
if !untagged_resources.is_empty() {
message.push_str(&format!("Untagged Resources ({})\n", untagged_resources.len()));
message.push_str("====================\n");
for resource in untagged_resources {
message.push_str(&format!("Resource: {}\n", resource.resource_arn));
message.push_str(&format!("Type: {}\n", resource.resource_type));
message.push_str(&format!("Missing Tags: {:?}\n", resource.missing_tags));
message.push_str(&format!("Existing Tags: {:?}\n", resource.existing_tags));
if resource.resource_type == "ec2" && resource.resource_arn.contains(":instance/") && !dry_run {
message.push_str("Action: Instance stopped due to missing required tags\n");
}
message.push_str("\n");
}
}
// Add expiring resources
if !expiring_resources.is_empty() {
message.push_str(&format!("Expiring Resources ({})\n", expiring_resources.len()));
message.push_str("====================\n");
for resource in expiring_resources {
message.push_str(&format!("Resource: {}\n", resource.resource_arn));
message.push_str(&format!("Type: {}\n", resource.resource_type));
message.push_str(&format!("Expiration Date: {}\n", resource.expiration_date));
message.push_str(&format!("Days Until Expiration: {}\n", resource.days_until_expiration));
message.push_str("\n");
}
}
// Send SNS notification
client
.publish()
.topic_arn(topic_arn)
.subject("AWS Resource Tag Enforcement Report")
.message(message)
.send()
.await?;
println!("Notification sent to SNS topic: {}", topic_arn);
Ok(())
}
Lessons Learned:
Proactive cost management requires both automated monitoring and proper resource lifecycle policies.
How to Avoid:
Implement mandatory resource tagging with ownership and expiration information.
Set up automated cleanup of orphaned resources.
Configure proper autoscaling policies with scale-down rules.
Use cost anomaly detection with alerting (see the sketch after this list).
Implement resource quotas and budget alerts.
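For the anomaly-detection recommendation, a minimal sketch that creates a service-level anomaly monitor plus a daily e-mail subscription via the AWS CLI (the names, threshold, and address are placeholders; newer API versions may expect a ThresholdExpression instead of the plain Threshold field):
# create_anomaly_alerting.sh - dimensional anomaly monitor and a daily e-mail subscription (values are placeholders)
MONITOR_ARN=$(aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName": "service-spend-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}' \
  --query MonitorArn --output text)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "daily-anomaly-alerts",
    "MonitorArnList": ["'"$MONITOR_ARN"'"],
    "Subscribers": [{"Type": "EMAIL", "Address": "finops@example.com"}],
    "Threshold": 100,
    "Frequency": "DAILY"
  }'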
No summary provided
What Happened:
After deploying a new feature, the company's cloud costs tripled overnight. The finance team raised an urgent alert when they saw the preliminary billing report. The spike occurred despite no significant increase in user traffic or application load.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify cost drivers.
Reviewed recent infrastructure changes and deployments.
Examined auto-scaling configurations and scaling events.
Analyzed application metrics and logs for unusual patterns.
Checked for potential security incidents or unauthorized resource usage.
Root Cause:
Multiple issues contributed to the cost spike:
1. The Horizontal Pod Autoscaler (HPA) was configured with overly aggressive scaling parameters.
2. Missing scaling limits allowed unbounded scale-out during brief traffic spikes.
3. A monitoring agent was generating artificial CPU load, triggering unnecessary scaling.
4. The Cluster Autoscaler was configured to scale up quickly but scale down slowly.
5. Unused resources (EBS volumes, load balancers) were not being cleaned up.
Fix/Workaround:
• Short-term: Implemented immediate cost controls with an optimized HPA configuration (see the sketch after this list)
• Fixed Cluster Autoscaler configuration to scale down more efficiently
• Implemented resource tagging and lifecycle policies for unused resources
• Adjusted monitoring agent configuration to reduce CPU overhead
• Long-term: Implemented a comprehensive cloud cost optimization strategy with automated reporting and alerting
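The short-term HPA change was along the lines of the following sketch (names, replica bounds, and thresholds are illustrative rather than the production manifest). Explicit min/max replica limits and a scale-down stabilization window keep brief spikes from pinning the deployment at peak size, and the Cluster Autoscaler flags noted at the end let under-utilised nodes be reclaimed sooner:
# Apply a bounded HPA with conservative scale-down behaviour (names and numbers are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20   # hard ceiling; unbounded scale-out was a major cost driver
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
EOF
# Cluster Autoscaler flags for faster scale-down (illustrative values):
#   --scale-down-unneeded-time=10m --scale-down-utilization-threshold=0.5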
Lessons Learned:
Auto-scaling configurations require careful tuning to balance performance and cost.
How to Avoid:
Implement cost monitoring with alerts for unexpected spikes.
Set appropriate scaling limits in all auto-scaling configurations.
Regularly audit and clean up unused cloud resources.
Test auto-scaling behavior before deploying to production.
Use spot instances and reserved instances where appropriate.
No summary provided
What Happened:
During a monthly financial review, the finance team flagged a significant increase in cloud costs. The DevOps team was tasked with investigating the spike and found numerous orphaned resources including unused EBS volumes, idle RDS instances, and forgotten development environments that had been running for months without proper monitoring or cost allocation tags.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify cost anomalies.
Used AWS Cost and Usage Reports to break down costs by service and region.
Reviewed resource tagging compliance across all AWS accounts.
Ran AWS Trusted Advisor cost optimization checks.
Compared infrastructure-as-code definitions with actual deployed resources.
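Drift between the Terraform code and what is actually deployed can be surfaced with a refresh-only plan; the following is a sketch, and the environments/ directory layout is assumed for illustration:
#!/bin/bash
# drift_check.sh - report Terraform drift for each environment (directory layout is illustrative)
set -euo pipefail
for ENV_DIR in environments/*/; do
  echo "Checking drift in $ENV_DIR"
  terraform -chdir="$ENV_DIR" init -input=false >/dev/null
  # -detailed-exitcode: 0 = no changes, 2 = state differs from real infrastructure
  RC=0
  terraform -chdir="$ENV_DIR" plan -refresh-only -detailed-exitcode -input=false >/dev/null || RC=$?
  if [ "$RC" -eq 0 ]; then
    echo "  no drift detected"
  elif [ "$RC" -eq 2 ]; then
    echo "  DRIFT DETECTED - review with: terraform -chdir=$ENV_DIR plan -refresh-only"
  else
    echo "  terraform plan failed (exit $RC)" >&2
  fi
done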
Root Cause:
The investigation revealed multiple issues contributing to the cost spike:
1. Developers were creating temporary resources for testing but not cleaning them up.
2. Automated CI/CD pipelines were creating resources but failing to destroy them when builds failed.
3. Many resources lacked proper ownership and project tags for cost allocation.
4. No automated process existed for identifying and removing unused resources.
5. Terraform state files were inconsistent with actual deployed resources.
Fix/Workaround:
• Short-term: Implemented immediate cost reduction measures:
#!/bin/bash
# cleanup_orphaned_resources.sh
# Script to identify and clean up orphaned AWS resources
# Set AWS region
AWS_REGION="us-west-2"
echo "Starting orphaned resource cleanup in region $AWS_REGION"
# Find and remove unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--region $AWS_REGION \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
--output json)
VOLUME_COUNT=$(echo $UNATTACHED_VOLUMES | jq length)
echo "Found $VOLUME_COUNT unattached volumes"
if [ $VOLUME_COUNT -gt 0 ]; then
echo "Volumes to be removed:"
echo $UNATTACHED_VOLUMES | jq -r '.[] | "ID: \(.ID), Size: \(.Size)GB, Created: \(.Created)"'
# Ask for confirmation before deleting
read -p "Do you want to delete these volumes? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNATTACHED_VOLUMES | jq -r '.[].ID' | while read VOLUME_ID; do
echo "Deleting volume $VOLUME_ID"
aws ec2 delete-volume --region $AWS_REGION --volume-id $VOLUME_ID
done
fi
fi
# Find and remove unused Elastic IPs
echo "Finding unused Elastic IPs..."
UNUSED_EIPS=$(aws ec2 describe-addresses \
--region $AWS_REGION \
--query 'Addresses[?AssociationId==null]' \
--output json)
EIP_COUNT=$(echo $UNUSED_EIPS | jq length)
echo "Found $EIP_COUNT unused Elastic IPs"
if [ $EIP_COUNT -gt 0 ]; then
echo "Elastic IPs to be released:"
echo $UNUSED_EIPS | jq -r '.[] | "ID: \(.AllocationId), IP: \(.PublicIp)"'
# Ask for confirmation before releasing
read -p "Do you want to release these Elastic IPs? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNUSED_EIPS | jq -r '.[].AllocationId' | while read EIP_ID; do
echo "Releasing Elastic IP $EIP_ID"
aws ec2 release-address --region $AWS_REGION --allocation-id $EIP_ID
done
fi
fi
# Find and remove old snapshots
echo "Finding old EBS snapshots..."
# Get snapshots older than 90 days that are not used by AMIs
NINETY_DAYS_AGO=$(date -d "90 days ago" +%Y-%m-%dT%H:%M:%S)
OLD_SNAPSHOTS=$(aws ec2 describe-snapshots \
--region $AWS_REGION \
--owner-ids self \
--query "Snapshots[?StartTime<='$NINETY_DAYS_AGO']" \
--output json)
# Filter out snapshots used by AMIs
AMI_SNAPSHOTS=$(aws ec2 describe-images \
--region $AWS_REGION \
--owners self \
--query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' \
--output json)
UNUSED_OLD_SNAPSHOTS=$(echo $OLD_SNAPSHOTS | jq --argjson ami_snaps "$AMI_SNAPSHOTS" '[.[] | select(.SnapshotId as $snap_id | $ami_snaps | index($snap_id) | not)]')
SNAPSHOT_COUNT=$(echo $UNUSED_OLD_SNAPSHOTS | jq length)
echo "Found $SNAPSHOT_COUNT old unused snapshots"
if [ $SNAPSHOT_COUNT -gt 0 ]; then
echo "Old snapshots to be removed:"
echo $UNUSED_OLD_SNAPSHOTS | jq -r '.[] | "ID: \(.SnapshotId), Created: \(.StartTime), Size: \(.VolumeSize)GB"'
# Ask for confirmation before deleting
read -p "Do you want to delete these snapshots? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNUSED_OLD_SNAPSHOTS | jq -r '.[].SnapshotId' | while read SNAPSHOT_ID; do
echo "Deleting snapshot $SNAPSHOT_ID"
aws ec2 delete-snapshot --region $AWS_REGION --snapshot-id $SNAPSHOT_ID
done
fi
fi
# Find idle RDS instances (low connection count)
echo "Finding potentially idle RDS instances..."
aws rds describe-db-instances \
--region $AWS_REGION \
--query 'DBInstances[*].{ID:DBInstanceIdentifier,Class:DBInstanceClass,Engine:Engine,Status:DBInstanceStatus}' \
--output table
echo "To check CloudWatch metrics for connection counts, run:"
echo "aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name DatabaseConnections --dimensions Name=DBInstanceIdentifier,Value=<instance-id> --start-time $(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average"
# Find and report on untagged resources
echo "Finding resources missing required tags..."
REQUIRED_TAGS="Owner,Project,Environment"
# Check EC2 instances
echo "Checking EC2 instances for missing tags..."
aws ec2 describe-instances \
--region $AWS_REGION \
--query 'Reservations[].Instances[?!not_null(Tags[?Key==`Owner`].Value|[0]) || !not_null(Tags[?Key==`Project`].Value|[0]) || !not_null(Tags[?Key==`Environment`].Value|[0])].[InstanceId,InstanceType,State.Name]' \
--output table
# Check EBS volumes
echo "Checking EBS volumes for missing tags..."
aws ec2 describe-volumes \
--region $AWS_REGION \
--query 'Volumes[?!not_null(Tags[?Key==`Owner`].Value|[0]) || !not_null(Tags[?Key==`Project`].Value|[0]) || !not_null(Tags[?Key==`Environment`].Value|[0])].[VolumeId,Size,State]' \
--output table
echo "Cleanup script completed."
• Implemented a Go-based cloud resource analyzer:
// cloud_resource_analyzer.go
package main
import (
"context"
"encoding/json"
"flag"
"fmt"
"log"
"os"
"sort"
"strings"
"sync"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
"github.com/aws/aws-sdk-go-v2/service/ec2"
ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
"github.com/aws/aws-sdk-go-v2/service/rds"
rdstypes "github.com/aws/aws-sdk-go-v2/service/rds/types"
"github.com/aws/aws-sdk-go-v2/service/s3"
"github.com/olekukonko/tablewriter"
)
// ResourceInfo represents information about a cloud resource
type ResourceInfo struct {
ResourceID string
ResourceType string
Region string
Account string
Size string
State string
CreatedAt time.Time
LastUsed time.Time
Tags map[string]string
EstimatedCost float64
Utilization float64
Recommendation string
}
// ResourceAnalyzer analyzes cloud resources
type ResourceAnalyzer struct {
cfg aws.Config
regions []string
requiredTags []string
resourceChan chan ResourceInfo
wg sync.WaitGroup
ctx context.Context
}
func main() {
// Parse command line flags
regionFlag := flag.String("region", "", "AWS region (comma-separated for multiple regions)")
outputFlag := flag.String("output", "table", "Output format (table, json, csv)")
tagsFlag := flag.String("required-tags", "Owner,Project,Environment", "Required tags (comma-separated)")
daysFlag := flag.Int("days", 30, "Number of days to consider a resource unused")
flag.Parse()
// Set up regions
var regions []string
if *regionFlag != "" {
regions = strings.Split(*regionFlag, ",")
} else {
// Default to common regions
regions = []string{"us-east-1", "us-west-2", "eu-west-1"}
}
// Set up required tags
requiredTags := strings.Split(*tagsFlag, ",")
// Initialize analyzer
analyzer, err := NewResourceAnalyzer(regions, requiredTags)
if err != nil {
log.Fatalf("Failed to initialize resource analyzer: %v", err)
}
// Analyze resources
resources, err := analyzer.AnalyzeResources(*daysFlag)
if err != nil {
log.Fatalf("Failed to analyze resources: %v", err)
}
// Output results
switch *outputFlag {
case "json":
outputJSON(resources)
case "csv":
outputCSV(resources)
default:
outputTable(resources)
}
// Print summary
printSummary(resources)
}
// NewResourceAnalyzer creates a new resource analyzer
func NewResourceAnalyzer(regions, requiredTags []string) (*ResourceAnalyzer, error) {
// Load AWS configuration
cfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
return &ResourceAnalyzer{
cfg: cfg,
regions: regions,
requiredTags: requiredTags,
resourceChan: make(chan ResourceInfo, 100),
ctx: context.Background(),
}, nil
}
// AnalyzeResources analyzes all resources across regions
func (ra *ResourceAnalyzer) AnalyzeResources(unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
resultChan := make(chan []ResourceInfo)
errorChan := make(chan error)
// Start a goroutine to collect results; closing done signals that all results have been appended
done := make(chan struct{})
go func() {
for res := range resultChan {
resources = append(resources, res...)
}
close(done)
}()
// Process each region
for _, region := range ra.regions {
ra.wg.Add(1)
go func(region string) {
defer ra.wg.Done()
// Create region-specific config
regionCfg := ra.cfg.Copy()
regionCfg.Region = region
// Analyze EC2 resources
ec2Resources, err := ra.analyzeEC2Resources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze EC2 resources in %s: %w", region, err)
return
}
resultChan <- ec2Resources
// Analyze RDS resources
rdsResources, err := ra.analyzeRDSResources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze RDS resources in %s: %w", region, err)
return
}
resultChan <- rdsResources
// Analyze S3 resources (global, so only do this once)
if region == ra.regions[0] {
s3Resources, err := ra.analyzeS3Resources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze S3 resources: %w", err)
return
}
resultChan <- s3Resources
}
}(region)
}
// Wait for all goroutines to complete
go func() {
ra.wg.Wait()
close(resultChan)
close(errorChan)
}()
// Check for errors (errorChan is closed once all workers have finished)
for err := range errorChan {
return nil, err
}
// Wait for the collector goroutine to finish before returning the results
<-done
return resources, nil
}
// analyzeEC2Resources analyzes EC2 resources in a region
func (ra *ResourceAnalyzer) analyzeEC2Resources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create EC2 client
ec2Client := ec2.NewFromConfig(cfg)
// Get EC2 instances
instances, err := ec2Client.DescribeInstances(ra.ctx, &ec2.DescribeInstancesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe EC2 instances: %w", err)
}
// Process instances
for _, reservation := range instances.Reservations {
for _, instance := range reservation.Instances {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range instance.Tags {
tags[*tag.Key] = *tag.Value
}
// Get instance details
resourceInfo := ResourceInfo{
ResourceID: *instance.InstanceId,
ResourceType: "EC2 Instance",
Region: cfg.Region,
Size: string(instance.InstanceType),
State: string(instance.State.Name),
CreatedAt: *instance.LaunchTime,
Tags: tags,
}
// Check utilization
utilization, lastUsed, err := ra.getEC2Utilization(cfg, *instance.InstanceId, unusedDays)
if err != nil {
log.Printf("Warning: Failed to get utilization for instance %s: %v", *instance.InstanceId, err)
} else {
resourceInfo.Utilization = utilization
resourceInfo.LastUsed = lastUsed
}
// Estimate cost
resourceInfo.EstimatedCost = estimateEC2Cost(string(instance.InstanceType), cfg.Region)
// Generate recommendation
resourceInfo.Recommendation = generateEC2Recommendation(resourceInfo, unusedDays)
resources = append(resources, resourceInfo)
}
}
// Get EBS volumes
volumes, err := ec2Client.DescribeVolumes(ra.ctx, &ec2.DescribeVolumesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe EBS volumes: %w", err)
}
// Process volumes
for _, volume := range volumes.Volumes {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range volume.Tags {
tags[*tag.Key] = *tag.Value
}
// Get volume details
resourceInfo := ResourceInfo{
ResourceID: *volume.VolumeId,
ResourceType: "EBS Volume",
Region: cfg.Region,
Size: fmt.Sprintf("%d GB", *volume.Size),
State: string(volume.State),
CreatedAt: *volume.CreateTime,
Tags: tags,
}
// Check if volume is attached
isAttached := len(volume.Attachments) > 0
// Estimate cost
resourceInfo.EstimatedCost = estimateEBSCost(int(*volume.Size), string(volume.VolumeType), cfg.Region)
// Generate recommendation
if !isAttached {
resourceInfo.Recommendation = "Delete unattached volume"
} else {
resourceInfo.Recommendation = "Volume is in use"
}
resources = append(resources, resourceInfo)
}
// Get Elastic IPs
eips, err := ec2Client.DescribeAddresses(ra.ctx, &ec2.DescribeAddressesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe Elastic IPs: %w", err)
}
// Process Elastic IPs
for _, eip := range eips.Addresses {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range eip.Tags {
tags[*tag.Key] = *tag.Value
}
// Get EIP details
resourceInfo := ResourceInfo{
ResourceID: *eip.AllocationId,
ResourceType: "Elastic IP",
Region: cfg.Region,
State: "allocated",
Tags: tags,
}
// Check if EIP is associated
isAssociated := eip.AssociationId != nil
// Estimate cost (only charged if not associated with running instance)
if !isAssociated {
resourceInfo.EstimatedCost = 3.6 // $3.6/month for unused EIP
resourceInfo.Recommendation = "Release unused Elastic IP"
} else {
resourceInfo.Recommendation = "Elastic IP is in use"
}
resources = append(resources, resourceInfo)
}
return resources, nil
}
// analyzeRDSResources analyzes RDS resources in a region
func (ra *ResourceAnalyzer) analyzeRDSResources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create RDS client
rdsClient := rds.NewFromConfig(cfg)
// Get RDS instances
instances, err := rdsClient.DescribeDBInstances(ra.ctx, &rds.DescribeDBInstancesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe RDS instances: %w", err)
}
// Process instances
for _, instance := range instances.DBInstances {
// Convert tags to map
tags := make(map[string]string)
tagList, err := rdsClient.ListTagsForResource(ra.ctx, &rds.ListTagsForResourceInput{
ResourceName: instance.DBInstanceArn,
})
if err == nil {
for _, tag := range tagList.TagList {
tags[*tag.Key] = *tag.Value
}
}
// Get instance details
resourceInfo := ResourceInfo{
ResourceID: *instance.DBInstanceIdentifier,
ResourceType: "RDS Instance",
Region: cfg.Region,
Size: *instance.DBInstanceClass,
State: *instance.DBInstanceStatus,
CreatedAt: *instance.InstanceCreateTime,
Tags: tags,
}
// Check utilization
utilization, lastUsed, err := ra.getRDSUtilization(cfg, *instance.DBInstanceIdentifier, unusedDays)
if err != nil {
log.Printf("Warning: Failed to get utilization for RDS instance %s: %v", *instance.DBInstanceIdentifier, err)
} else {
resourceInfo.Utilization = utilization
resourceInfo.LastUsed = lastUsed
}
// Estimate cost
resourceInfo.EstimatedCost = estimateRDSCost(*instance.DBInstanceClass, *instance.Engine, cfg.Region)
// Generate recommendation
resourceInfo.Recommendation = ra.generateRDSRecommendation(resourceInfo, unusedDays)
resources = append(resources, resourceInfo)
}
return resources, nil
}
// analyzeS3Resources analyzes S3 resources
func (ra *ResourceAnalyzer) analyzeS3Resources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create S3 client
s3Client := s3.NewFromConfig(cfg)
// Get S3 buckets
buckets, err := s3Client.ListBuckets(ra.ctx, &s3.ListBucketsInput{})
if err != nil {
return nil, fmt.Errorf("failed to list S3 buckets: %w", err)
}
// Process buckets
for _, bucket := range buckets.Buckets {
// Get bucket location
location, err := s3Client.GetBucketLocation(ra.ctx, &s3.GetBucketLocationInput{
Bucket: bucket.Name,
})
if err != nil {
log.Printf("Warning: Failed to get location for bucket %s: %v", *bucket.Name, err)
continue
}
region := "us-east-1" // Default region
if location.LocationConstraint != "" {
region = string(location.LocationConstraint)
}
// Get bucket tags
tags := make(map[string]string)
tagging, err := s3Client.GetBucketTagging(ra.ctx, &s3.GetBucketTaggingInput{
Bucket: bucket.Name,
})
if err == nil {
for _, tag := range tagging.TagSet {
tags[*tag.Key] = *tag.Value
}
}
// Get bucket details
resourceInfo := ResourceInfo{
ResourceID: *bucket.Name,
ResourceType: "S3 Bucket",
Region: region,
CreatedAt: *bucket.CreationDate,
Tags: tags,
}
// Check last access
lastUsed, err := ra.getS3LastAccess(cfg, *bucket.Name)
if err != nil {
log.Printf("Warning: Failed to get last access for bucket %s: %v", *bucket.Name, err)
} else {
resourceInfo.LastUsed = lastUsed
}
// Generate recommendation
if time.Since(resourceInfo.LastUsed).Hours() > float64(unusedDays*24) {
resourceInfo.Recommendation = "Consider deleting unused bucket"
} else {
resourceInfo.Recommendation = "Bucket is in use"
}
resources = append(resources, resourceInfo)
}
return resources, nil
}
// getEC2Utilization gets the CPU utilization of an EC2 instance
func (ra *ResourceAnalyzer) getEC2Utilization(cfg aws.Config, instanceID string, unusedDays int) (float64, time.Time, error) {
// Create CloudWatch client
cwClient := cloudwatch.NewFromConfig(cfg)
// Set up time range
endTime := time.Now()
startTime := endTime.AddDate(0, 0, -unusedDays)
// Get CPU utilization
result, err := cwClient.GetMetricStatistics(ra.ctx, &cloudwatch.GetMetricStatisticsInput{
Namespace: aws.String("AWS/EC2"),
MetricName: aws.String("CPUUtilization"),
Dimensions: []types.Dimension{
{
Name: aws.String("InstanceId"),
Value: aws.String(instanceID),
},
},
StartTime: aws.Time(startTime),
EndTime: aws.Time(endTime),
Period: aws.Int32(86400), // 1 day
Statistics: []types.Statistic{types.StatisticAverage},
})
if err != nil {
return 0, time.Time{}, err
}
// Process results
if len(result.Datapoints) == 0 {
return 0, time.Time{}, nil
}
// Sort datapoints by time
sort.Slice(result.Datapoints, func(i, j int) bool {
return result.Datapoints[i].Timestamp.After(*result.Datapoints[j].Timestamp)
})
// Calculate average utilization
var totalUtilization float64
for _, dp := range result.Datapoints {
totalUtilization += *dp.Average
}
avgUtilization := totalUtilization / float64(len(result.Datapoints))
// Get last used time (most recent datapoint with non-zero utilization)
var lastUsed time.Time
for _, dp := range result.Datapoints {
if *dp.Average > 1.0 { // Consider >1% CPU as "used"
lastUsed = *dp.Timestamp
break
}
}
return avgUtilization, lastUsed, nil
}
// getRDSUtilization gets the connection count of an RDS instance
func (ra *ResourceAnalyzer) getRDSUtilization(cfg aws.Config, instanceID string, unusedDays int) (float64, time.Time, error) {
// Create CloudWatch client
cwClient := cloudwatch.NewFromConfig(cfg)
// Set up time range
endTime := time.Now()
startTime := endTime.AddDate(0, 0, -unusedDays)
// Get database connections
result, err := cwClient.GetMetricStatistics(ra.ctx, &cloudwatch.GetMetricStatisticsInput{
Namespace: aws.String("AWS/RDS"),
MetricName: aws.String("DatabaseConnections"),
Dimensions: []types.Dimension{
{
Name: aws.String("DBInstanceIdentifier"),
Value: aws.String(instanceID),
},
},
StartTime: aws.Time(startTime),
EndTime: aws.Time(endTime),
Period: aws.Int32(86400), // 1 day
Statistics: []types.Statistic{types.StatisticAverage},
})
if err != nil {
return 0, time.Time{}, err
}
// Process results
if len(result.Datapoints) == 0 {
return 0, time.Time{}, nil
}
// Sort datapoints by time
sort.Slice(result.Datapoints, func(i, j int) bool {
return result.Datapoints[i].Timestamp.After(*result.Datapoints[j].Timestamp)
})
// Calculate average connections
var totalConnections float64
for _, dp := range result.Datapoints {
totalConnections += *dp.Average
}
avgConnections := totalConnections / float64(len(result.Datapoints))
// Get last used time (most recent datapoint with non-zero connections)
var lastUsed time.Time
for _, dp := range result.Datapoints {
if *dp.Average > 0 {
lastUsed = *dp.Timestamp
break
}
}
return avgConnections, lastUsed, nil
}
// getS3LastAccess gets the last access time of an S3 bucket
func (ra *ResourceAnalyzer) getS3LastAccess(cfg aws.Config, bucketName string) (time.Time, error) {
// In a real implementation, this would use S3 analytics or CloudTrail
// For this example, we'll return a random time within the last 90 days
days := rand.Intn(90)
return time.Now().AddDate(0, 0, -days), nil
}
// Helper functions for cost estimation and recommendations
func estimateEC2Cost(instanceType, region string) float64 {
// Simplified cost estimation based on instance type
// In a real implementation, this would use pricing API or a pricing database
switch {
case strings.HasPrefix(instanceType, "t2.micro"):
return 8.5
case strings.HasPrefix(instanceType, "t2.small"):
return 17.0
case strings.HasPrefix(instanceType, "t2.medium"):
return 34.0
case strings.HasPrefix(instanceType, "m5.large"):
return 69.0
case strings.HasPrefix(instanceType, "m5.xlarge"):
return 138.0
default:
return 50.0 // Default estimate
}
}
func estimateEBSCost(sizeGB int, volumeType, region string) float64 {
// Simplified cost estimation based on volume type and size
var pricePerGB float64
switch volumeType {
case "gp2", "gp3":
pricePerGB = 0.1
case "io1", "io2":
pricePerGB = 0.125
case "st1":
pricePerGB = 0.045
case "sc1":
pricePerGB = 0.025
default:
pricePerGB = 0.1
}
return float64(sizeGB) * pricePerGB
}
func estimateRDSCost(instanceClass, engine, region string) float64 {
// Simplified cost estimation based on instance class
switch {
case strings.HasPrefix(instanceClass, "db.t3.micro"):
return 12.5
case strings.HasPrefix(instanceClass, "db.t3.small"):
return 25.0
case strings.HasPrefix(instanceClass, "db.t3.medium"):
return 50.0
case strings.HasPrefix(instanceClass, "db.m5.large"):
return 120.0
case strings.HasPrefix(instanceClass, "db.m5.xlarge"):
return 240.0
default:
return 100.0 // Default estimate
}
}
func (ra *ResourceAnalyzer) generateEC2Recommendation(resource ResourceInfo, unusedDays int) string {
// Generate recommendation based on resource state and utilization
if resource.State == "stopped" {
return "Consider terminating stopped instance"
}
if resource.Utilization < 5.0 && time.Since(resource.LastUsed).Hours() > float64(unusedDays*24) {
return "Terminate idle instance (low CPU utilization)"
}
if resource.Utilization < 20.0 {
return "Consider downsizing instance (low CPU utilization)"
}
// Check for missing required tags
missingTags := []string{}
for _, tag := range ra.requiredTags {
if _, ok := resource.Tags[tag]; !ok {
missingTags = append(missingTags, tag)
}
}
if len(missingTags) > 0 {
return fmt.Sprintf("Add missing tags: %s", strings.Join(missingTags, ", "))
}
return "No action needed"
}
func (ra *ResourceAnalyzer) generateRDSRecommendation(resource ResourceInfo, unusedDays int) string {
// Generate recommendation based on resource state and utilization
if resource.Utilization < 1.0 && time.Since(resource.LastUsed).Hours() > float64(unusedDays*24) {
return "Consider deleting idle database (no connections)"
}
if resource.Utilization < 5.0 {
return "Consider downsizing instance (low connection count)"
}
// Check for missing required tags
missingTags := []string{}
for _, tag := range ra.requiredTags {
if _, ok := resource.Tags[tag]; !ok {
missingTags = append(missingTags, tag)
}
}
if len(missingTags) > 0 {
return fmt.Sprintf("Add missing tags: %s", strings.Join(missingTags, ", "))
}
return "No action needed"
}
// Output functions
func outputTable(resources []ResourceInfo) {
table := tablewriter.NewWriter(os.Stdout)
table.SetHeader([]string{"ID", "Type", "Region", "Size", "State", "Created", "Last Used", "Cost ($)", "Recommendation"})
for _, res := range resources {
lastUsedStr := ""
if !res.LastUsed.IsZero() {
lastUsedStr = res.LastUsed.Format("2006-01-02")
}
table.Append([]string{
res.ResourceID,
res.ResourceType,
res.Region,
res.Size,
res.State,
res.CreatedAt.Format("2006-01-02"),
lastUsedStr,
fmt.Sprintf("%.2f", res.EstimatedCost),
res.Recommendation,
})
}
table.Render()
}
func outputJSON(resources []ResourceInfo) {
data, err := json.MarshalIndent(resources, "", " ")
if err != nil {
log.Fatalf("Failed to marshal resources to JSON: %v", err)
}
fmt.Println(string(data))
}
func outputCSV(resources []ResourceInfo) {
fmt.Println("ID,Type,Region,Size,State,Created,LastUsed,Cost,Recommendation")
for _, res := range resources {
lastUsedStr := ""
if !res.LastUsed.IsZero() {
lastUsedStr = res.LastUsed.Format("2006-01-02")
}
fmt.Printf("%s,%s,%s,%s,%s,%s,%s,%.2f,%s\n",
res.ResourceID,
res.ResourceType,
res.Region,
res.Size,
res.State,
res.CreatedAt.Format("2006-01-02"),
lastUsedStr,
res.EstimatedCost,
res.Recommendation,
)
}
}
func printSummary(resources []ResourceInfo) {
// Calculate total cost
var totalCost float64
var potentialSavings float64
resourceCounts := make(map[string]int)
recommendationCounts := make(map[string]int)
for _, res := range resources {
totalCost += res.EstimatedCost
resourceCounts[res.ResourceType]++
// Count anything that is not "No action needed" or an "... is in use" note as a savings opportunity
if res.Recommendation != "No action needed" && !strings.HasSuffix(res.Recommendation, "is in use") {
potentialSavings += res.EstimatedCost
recommendationCounts[res.Recommendation]++
}
}
fmt.Println("\nSummary:")
fmt.Printf("Total resources: %d\n", len(resources))
fmt.Printf("Total monthly cost: $%.2f\n", totalCost)
// Guard against division by zero when no resources were found
savingsPercent := 0.0
if totalCost > 0 {
savingsPercent = (potentialSavings / totalCost) * 100
}
fmt.Printf("Potential monthly savings: $%.2f (%.1f%%)\n", potentialSavings, savingsPercent)
fmt.Println("\nResource Types:")
for resType, count := range resourceCounts {
fmt.Printf(" %s: %d\n", resType, count)
}
fmt.Println("\nTop Recommendations:")
for rec, count := range recommendationCounts {
if count > 0 {
fmt.Printf(" %s: %d resources\n", rec, count)
}
}
}
• Implemented a Terraform module for enforcing resource tagging:
# aws_resource_tagging.tf
variable "required_tags" {
description = "Required tags for all resources"
type = map(string)
default = {
Owner = "unknown"
Project = "unknown"
Environment = "unknown"
ManagedBy = "terraform"
}
}
variable "tag_enforcement_enabled" {
description = "Enable tag enforcement"
type = bool
default = true
}
# AWS Organizations policy to enforce tagging
resource "aws_organizations_policy" "tag_policy" {
count = var.tag_enforcement_enabled ? 1 : 0
name = "required-tags-policy"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "RequireTagsForEC2"
Effect = "Deny"
Action = ["ec2:RunInstances", "ec2:CreateVolume"]
Resource = ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
},
{
Sid = "RequireTagsForRDS"
Effect = "Deny"
Action = ["rds:CreateDBInstance"]
Resource = ["arn:aws:rds:*:*:db:*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
},
{
Sid = "RequireTagsForS3"
Effect = "Deny"
Action = ["s3:CreateBucket"]
Resource = ["arn:aws:s3:::*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
}
]
})
}
# Lambda function to check for untagged resources
resource "aws_lambda_function" "tag_compliance_checker" {
function_name = "tag-compliance-checker"
role = aws_iam_role.tag_compliance_checker.arn
handler = "index.handler"
runtime = "nodejs14.x"
timeout = 300
memory_size = 256
source_code_hash = filebase64sha256("${path.module}/tag_compliance_checker.zip")
filename = "${path.module}/tag_compliance_checker.zip"
environment {
variables = {
REQUIRED_TAGS = jsonencode(keys(var.required_tags))
SNS_TOPIC_ARN = aws_sns_topic.tag_compliance_alerts.arn
}
}
}
# IAM role for the Lambda function
resource "aws_iam_role" "tag_compliance_checker" {
name = "tag-compliance-checker-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
# IAM policy for the Lambda function
resource "aws_iam_policy" "tag_compliance_checker" {
name = "tag-compliance-checker-policy"
description = "Policy for tag compliance checker Lambda"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"ec2:DescribeInstances",
"ec2:DescribeVolumes",
"rds:DescribeDBInstances",
"s3:ListAllMyBuckets",
"s3:GetBucketTagging",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"sns:Publish"
]
Effect = "Allow"
Resource = "*"
}
]
})
}
# Attach policy to role
resource "aws_iam_role_policy_attachment" "tag_compliance_checker" {
role = aws_iam_role.tag_compliance_checker.name
policy_arn = aws_iam_policy.tag_compliance_checker.arn
}
# CloudWatch event rule to trigger the Lambda function daily
resource "aws_cloudwatch_event_rule" "tag_compliance_daily" {
name = "tag-compliance-daily-check"
description = "Trigger tag compliance check daily"
schedule_expression = "rate(1 day)"
}
# CloudWatch event target
resource "aws_cloudwatch_event_target" "tag_compliance_lambda" {
rule = aws_cloudwatch_event_rule.tag_compliance_daily.name
target_id = "tag-compliance-checker"
arn = aws_lambda_function.tag_compliance_checker.arn
}
# Permission for CloudWatch to invoke Lambda
resource "aws_lambda_permission" "allow_cloudwatch" {
statement_id = "AllowExecutionFromCloudWatch"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.tag_compliance_checker.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.tag_compliance_daily.arn
}
# SNS topic for alerts
resource "aws_sns_topic" "tag_compliance_alerts" {
name = "tag-compliance-alerts"
}
# Default tags for all resources
provider "aws" {
default_tags {
tags = var.required_tags
}
}
# Output the SNS topic ARN
output "tag_compliance_sns_topic_arn" {
value = aws_sns_topic.tag_compliance_alerts.arn
}
# Output the Lambda function ARN
output "tag_compliance_lambda_arn" {
value = aws_lambda_function.tag_compliance_checker.arn
}
• Long-term: Implemented a comprehensive cloud cost optimization strategy:
- Created a centralized tagging and cost allocation system
- Implemented automated resource cleanup for unused resources
- Developed a cost optimization dashboard
- Established clear procedures for resource provisioning
- Implemented monitoring and alerting for cost anomalies
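The anomaly alerting in the last item can be driven from the Cost Explorer anomaly APIs. A minimal sketch follows; the SNS topic name and the $50 impact threshold are illustrative rather than values from this incident:
# cost_anomaly_alert.py - a minimal sketch; the topic ARN and threshold are illustrative
import datetime
import os

import boto3

ce = boto3.client("ce")
sns = boto3.client("sns")

ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]  # hypothetical SNS topic for cost alerts
IMPACT_THRESHOLD = 50.0  # only alert on anomalies with more than $50 of total impact


def check_recent_anomalies(lookback_days: int = 7) -> None:
    end = datetime.date.today()
    start = end - datetime.timedelta(days=lookback_days)
    resp = ce.get_anomalies(
        DateInterval={"StartDate": start.isoformat(), "EndDate": end.isoformat()},
        TotalImpact={"NumericOperator": "GREATER_THAN", "StartValue": IMPACT_THRESHOLD},
    )
    for anomaly in resp.get("Anomalies", []):
        impact = anomaly["Impact"]["TotalImpact"]
        message = (
            f"Cost anomaly {anomaly['AnomalyId']}: ~${impact:.2f} total impact, "
            f"started {anomaly['AnomalyStartDate']}"
        )
        sns.publish(TopicArn=ALERT_TOPIC_ARN, Subject="Cost anomaly detected", Message=message)


if __name__ == "__main__":
    check_recent_anomalies()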
Lessons Learned:
Cloud cost management requires proactive monitoring and automated cleanup of unused resources.
How to Avoid:
Implement mandatory resource tagging for cost allocation.
Set up automated detection and cleanup of unused resources.
Establish clear ownership and lifecycle policies for cloud resources.
Implement cost anomaly detection and alerting.
Regularly review and optimize cloud resource utilization.
No summary provided
What Happened:
During a monthly financial review, the finance team flagged a significant increase in AWS costs across multiple accounts. The increase occurred gradually over several months but accelerated in the last billing cycle. Initial investigation showed no corresponding increase in application traffic or planned infrastructure expansion. The cost increase was spread across multiple services and accounts, making it difficult to identify the root cause through standard AWS Cost Explorer views.
Diagnosis Steps:
Analyzed detailed AWS Cost and Usage Reports (CUR) to identify cost anomalies.
Compared resource counts and types across multiple billing periods.
Used AWS Resource Explorer to identify resources across all accounts and regions.
Reviewed recent infrastructure deployments and changes.
Analyzed resource tagging compliance across the organization.
Root Cause:
The investigation revealed multiple issues contributing to the cost increase:
1. Numerous orphaned EBS volumes remained after EC2 instance termination
2. Development environments were provisioned but not decommissioned after project completion
3. Unused Elastic IPs were allocated but not attached to any resources
4. Several large RDS instances were over-provisioned despite minimal utilization
5. Multiple Lambda functions had excessive memory allocation and timeout settings
Fix/Workaround:
• Created a resource cleanup script to identify and remove unused resources
• Implemented a Terraform module for enforcing resource tagging:
# aws_tagging_policy.tf - Enforce resource tagging
module "tagging_policy" {
source = "./modules/tagging-policy"
required_tags = {
Environment = ["dev", "staging", "prod"]
Project = true
Owner = true
CostCenter = true
}
tag_enforcement_resources = [
"aws_instance",
"aws_volume",
"aws_db_instance",
"aws_lambda_function"
]
}
• Developed a cost optimization dashboard with resource utilization metrics (the cost feed behind it is sketched after this list)
• Implemented automated resource cleanup for development environments
• Created a cost allocation tagging strategy with enforcement
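The dashboard noted above needs a recurring cost feed. A minimal sketch of that data pull against the Cost Explorer API, with illustrative file and function names:
# cost_by_service.py - a minimal sketch of the recurring cost feed behind the dashboard
import datetime
import json

import boto3

ce = boto3.client("ce")


def monthly_cost_by_service(months: int = 3) -> dict:
    """Return unblended cost per AWS service for the last N full calendar months."""
    end = datetime.date.today().replace(day=1)
    start = end
    for _ in range(months):
        start = (start - datetime.timedelta(days=1)).replace(day=1)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    costs = {}
    for period in resp["ResultsByTime"]:
        month = period["TimePeriod"]["Start"]
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs.setdefault(month, {})[service] = round(amount, 2)
    return costs


if __name__ == "__main__":
    print(json.dumps(monthly_cost_by_service(), indent=2))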
Lessons Learned:
Proactive resource management and tagging are essential for cloud cost control.
How to Avoid:
Implement automated resource cleanup for orphaned and unused resources.
Enforce tagging policies for all cloud resources.
Set up cost anomaly detection with automated alerts.
Implement resource lifecycle policies for development environments.
Regularly review and right-size provisioned resources.
No summary provided
What Happened:
During a quarterly cloud cost review, the finance team identified that despite having purchased a large number of Reserved Instances across multiple AWS accounts, the company's overall cloud costs were higher than expected. Further investigation revealed that many RIs were underutilized or completely unused, while on-demand instances were being provisioned for workloads that could have been covered by existing RIs. The situation resulted in the company effectively paying twice for some resources - once for the unused RI commitment and again for the on-demand instances.
Diagnosis Steps:
Analyzed Reserved Instance utilization reports across all accounts.
Compared RI inventory with actual instance usage patterns.
Reviewed RI purchase history and decision-making process.
Examined instance type distribution across workloads.
Assessed RI sharing configuration across the organization.
Root Cause:
The investigation revealed multiple issues with RI management:
1. RIs were purchased based on historical usage without accounting for planned workload changes
2. RI sharing was not properly configured across all accounts in the organization
3. Instance type standardization was lacking, leading to fragmented RI coverage
4. No regular review process existed for RI utilization and optimization
5. Teams were provisioning resources without visibility into existing RI inventory
Fix/Workaround:
• Implemented immediate RI optimization actions
• Created a centralized RI management strategy
• Established regular RI utilization reviews
• Developed instance type standardization guidelines
• Implemented automated RI coverage monitoring
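A minimal sketch of the automated RI coverage monitoring from the last item, built on the Cost Explorer reservation APIs; the alert thresholds are illustrative:
# ri_coverage_report.py - a minimal sketch; the 80%/70% alert thresholds are illustrative
import datetime

import boto3

ce = boto3.client("ce")

UTILIZATION_ALERT_THRESHOLD = 80.0  # percent
COVERAGE_ALERT_THRESHOLD = 70.0     # percent


def last_month_ri_report() -> None:
    end = datetime.date.today().replace(day=1)
    start = (end - datetime.timedelta(days=1)).replace(day=1)
    period = {"Start": start.isoformat(), "End": end.isoformat()}

    utilization = ce.get_reservation_utilization(TimePeriod=period, Granularity="MONTHLY")
    for item in utilization["UtilizationsByTime"]:
        pct = float(item["Total"]["UtilizationPercentage"])
        if pct < UTILIZATION_ALERT_THRESHOLD:
            print(f"LOW RI UTILIZATION: {pct:.1f}% for month starting {item['TimePeriod']['Start']}")

    coverage = ce.get_reservation_coverage(TimePeriod=period, Granularity="MONTHLY")
    for item in coverage["CoveragesByTime"]:
        pct = float(item["Total"]["CoverageHours"]["CoverageHoursPercentage"])
        if pct < COVERAGE_ALERT_THRESHOLD:
            print(f"LOW RI COVERAGE: {pct:.1f}% for month starting {item['TimePeriod']['Start']}")


if __name__ == "__main__":
    last_month_ri_report()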
Lessons Learned:
Reserved Instance management requires ongoing optimization and organization-wide visibility.
How to Avoid:
Implement centralized RI purchasing and management.
Configure proper RI sharing across all accounts in the organization.
Establish regular RI utilization reviews and optimization cycles.
Create instance type standardization guidelines for workloads.
Develop automated alerting for underutilized RIs and coverage gaps.
No summary provided
What Happened:
A large enterprise with multiple business units migrated to a multi-cloud environment. The finance team implemented a chargeback model to allocate cloud costs to different departments. However, after several months, they discovered that a significant portion of cloud resources (approximately 40%) were untagged or incorrectly tagged, making accurate cost allocation impossible. This led to financial disputes between departments, budget overruns, and delayed cloud adoption initiatives.
Diagnosis Steps:
Analyzed resource tagging compliance across cloud accounts.
Reviewed tagging policies and enforcement mechanisms.
Examined resource provisioning workflows and automation.
Interviewed teams about tagging practices and challenges.
Assessed the cost allocation and reporting processes.
Root Cause:
The investigation revealed multiple issues with the tagging strategy:
1. Inconsistent tagging standards across different cloud platforms
2. Lack of automated tag validation during resource provisioning
3. Manual resource creation bypassing governance controls
4. Insufficient training and awareness about tagging importance
5. No regular auditing or remediation of untagged resources
Fix/Workaround:
• Implemented immediate tagging remediation for existing resources
• Created consistent cross-cloud tagging standards
• Developed automated tag validation and enforcement (see the compliance-scan sketch after this list)
• Established regular tagging compliance audits
• Improved cost allocation reporting for untagged resources
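A minimal sketch of the automated tag validation scan referenced above, using the Resource Groups Tagging API; the required tag keys mirror those used elsewhere in this document and would be adjusted per organization:
# tag_compliance_scan.py - a minimal sketch of an untagged-resource report
import boto3

REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}  # assumed required keys


def find_non_compliant_resources(region: str) -> list:
    tagging = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = tagging.get_paginator("get_resources")
    non_compliant = []
    for page in paginator.paginate(ResourcesPerPage=100):
        for mapping in page["ResourceTagMappingList"]:
            present = {tag["Key"] for tag in mapping.get("Tags", [])}
            missing = REQUIRED_TAGS - present
            if missing:
                non_compliant.append(
                    {"arn": mapping["ResourceARN"], "missing_tags": sorted(missing)}
                )
    return non_compliant


if __name__ == "__main__":
    for item in find_non_compliant_resources("us-east-1"):
        print(f"{item['arn']} is missing tags: {', '.join(item['missing_tags'])}")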
Lessons Learned:
Effective cloud cost allocation requires consistent tagging governance and automation.
How to Avoid:
Implement automated tag validation during resource provisioning.
Create consistent tagging standards across all cloud platforms.
Establish regular tagging compliance audits and remediation.
Provide comprehensive training on tagging importance and practices.
Develop fallback allocation methods for untagged resources.
No summary provided
What Happened:
A mid-sized SaaS company noticed their cloud bill had increased by 35% over three months without a corresponding increase in customer usage or new deployments. The finance team flagged the issue during quarterly budget reviews, but the engineering team couldn't immediately identify the cause. Initial investigations focused on active production workloads, but these showed normal resource utilization. Further investigation revealed numerous orphaned resources across multiple cloud providers, including unused load balancers, detached storage volumes, idle database instances, and test environments that were never decommissioned after project completions.
Diagnosis Steps:
Analyzed billing data across all cloud providers for the past six months.
Created resource inventory reports categorized by service type, account, and team.
Cross-referenced resources with active projects and applications.
Examined resource creation patterns and identified resources without proper ownership tags.
Reviewed infrastructure-as-code repositories to identify resources created outside of the standard process.
Root Cause:
The investigation revealed multiple issues contributing to orphaned resources:
1. Lack of a consistent resource tagging strategy across cloud providers
2. No automated process for decommissioning test and development environments
3. Manual resource creation outside of infrastructure-as-code workflows
4. Incomplete ownership and lifecycle metadata for cloud resources
5. Absence of regular cost reviews and resource pruning processes
Fix/Workaround:
• Implemented immediate cost reduction measures
• Identified and terminated clearly unused resources (saving ~20% of costs)
• Established a resource tagging policy and enforcement mechanism
• Created an automated orphaned resource detection system
• Implemented regular cost review processes with team accountability
# Resource Tagging Policy Implementation
# File: resource_tagging_policy.tf
# AWS Provider Configuration with Default Tags
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = var.environment
Project = var.project_name
Owner = var.team_email
CostCenter = var.cost_center
CreatedBy = "Terraform"
CreationDate = timestamp()
ExpirationDate = var.environment == "production" ? null : timeadd(timestamp(), var.resource_ttl)
}
}
}
# Azure Provider with Required Tags Policy
resource "azurerm_policy_definition" "require_tags" {
name = "require-resource-tags"
display_name = "Require specified tags on all resources"
description = "This policy ensures all resources have the required tags"
policy_type = "Custom"
mode = "Indexed"
metadata = <<METADATA
{
"version": "1.0.0",
"category": "Tags"
}
METADATA
policy_rule = <<POLICY_RULE
{
"if": {
"anyOf": [
{
"field": "tags['Environment']",
"exists": "false"
},
{
"field": "tags['Project']",
"exists": "false"
},
{
"field": "tags['Owner']",
"exists": "false"
},
{
"field": "tags['CostCenter']",
"exists": "false"
},
{
"field": "tags['CreatedBy']",
"exists": "false"
},
{
"field": "tags['CreationDate']",
"exists": "false"
},
{
"allOf": [
{
"field": "tags['ExpirationDate']",
"exists": "false"
},
{
"field": "tags['Environment']",
"notEquals": "production"
}
]
}
]
},
"then": {
"effect": "deny"
}
}
POLICY_RULE
}
# GCP Organization Policy for Required Labels
resource "google_organization_policy" "require_labels" {
org_id = var.organization_id
constraint = "constraints/compute.requireLabels"
list_policy {
allow {
values = [
"Environment",
"Project",
"Owner",
"CostCenter",
"CreatedBy",
"CreationDate",
"ExpirationDate"
]
}
}
}
# Automated Resource Expiration Lambda Function
resource "aws_lambda_function" "resource_expiration_checker" {
function_name = "resource-expiration-checker"
role = aws_iam_role.resource_expiration_checker.arn
handler = "index.handler"
runtime = "nodejs14.x"
timeout = 300
memory_size = 256
environment {
variables = {
NOTIFICATION_SNS_TOPIC = aws_sns_topic.resource_expiration_notifications.arn
DRY_RUN = "false"
GRACE_PERIOD_DAYS = "7"
}
}
filename = "${path.module}/lambda/resource_expiration_checker.zip"
source_code_hash = filebase64sha256("${path.module}/lambda/resource_expiration_checker.zip")
}
# CloudWatch Event Rule to trigger the Lambda daily
resource "aws_cloudwatch_event_rule" "daily_resource_check" {
name = "daily-resource-expiration-check"
description = "Triggers the resource expiration checker Lambda daily"
schedule_expression = "cron(0 1 * * ? *)"
}
resource "aws_cloudwatch_event_target" "check_resources_daily" {
rule = aws_cloudwatch_event_rule.daily_resource_check.name
target_id = "resource_expiration_checker"
arn = aws_lambda_function.resource_expiration_checker.arn
}
# SNS Topic for notifications
resource "aws_sns_topic" "resource_expiration_notifications" {
name = "resource-expiration-notifications"
}
# Subscription for the team
resource "aws_sns_topic_subscription" "team_email" {
topic_arn = aws_sns_topic.resource_expiration_notifications.arn
protocol = "email"
endpoint = var.team_email
}
// AWS Lambda Function for Resource Expiration Checking
// File: resource_expiration_checker.js
const AWS = require('aws-sdk');
const moment = require('moment');
// Initialize AWS clients
const ec2 = new AWS.EC2();
const rds = new AWS.RDS();
const s3 = new AWS.S3();
const elasticache = new AWS.ElastiCache();
const sns = new AWS.SNS();
// Configuration from environment variables
const SNS_TOPIC = process.env.NOTIFICATION_SNS_TOPIC;
const DRY_RUN = process.env.DRY_RUN === 'true';
const GRACE_PERIOD_DAYS = parseInt(process.env.GRACE_PERIOD_DAYS || '7', 10);
exports.handler = async (event) => {
console.log('Starting resource expiration check');
// Track resources for reporting
const report = {
expiredResources: [],
expiringResources: [],
errors: []
};
try {
// Check EC2 instances
await checkEC2Instances(report);
// Check EBS volumes
await checkEBSVolumes(report);
// Check RDS instances
await checkRDSInstances(report);
// Check ElastiCache clusters
await checkElastiCacheClusters(report);
// Check S3 buckets
await checkS3Buckets(report);
// Send notification with report
await sendReport(report);
return {
statusCode: 200,
body: JSON.stringify({
message: 'Resource expiration check completed',
expiredCount: report.expiredResources.length,
expiringCount: report.expiringResources.length,
errorCount: report.errors.length
})
};
} catch (error) {
console.error('Error in resource expiration check:', error);
return {
statusCode: 500,
body: JSON.stringify({ error: error.message })
};
}
};
async function checkEC2Instances(report) {
console.log('Checking EC2 instances');
try {
const { Reservations } = await ec2.describeInstances({}).promise();
for (const reservation of Reservations) {
for (const instance of reservation.Instances) {
try {
// Skip terminated instances
if (instance.State.Name === 'terminated') continue;
const tags = instance.Tags || [];
const expirationTag = tags.find(tag => tag.Key === 'ExpirationDate');
if (expirationTag && expirationTag.Value) {
const expirationDate = moment(expirationTag.Value);
const now = moment();
if (expirationDate.isBefore(now)) {
// Resource is expired
report.expiredResources.push({
type: 'EC2 Instance',
id: instance.InstanceId,
expirationDate: expirationTag.Value,
action: DRY_RUN ? 'Would terminate' : 'Terminating'
});
if (!DRY_RUN) {
await ec2.terminateInstances({
InstanceIds: [instance.InstanceId]
}).promise();
}
} else if (expirationDate.isBefore(now.clone().add(GRACE_PERIOD_DAYS, 'days'))) { // clone so "now" is not mutated before diff() below
// Resource is expiring soon
report.expiringResources.push({
type: 'EC2 Instance',
id: instance.InstanceId,
expirationDate: expirationTag.Value,
daysRemaining: expirationDate.diff(now, 'days')
});
}
}
} catch (instanceError) {
report.errors.push({
type: 'EC2 Instance',
id: instance.InstanceId,
error: instanceError.message
});
}
}
}
} catch (error) {
report.errors.push({
type: 'EC2 Service',
error: error.message
});
}
}
// Similar functions for other resource types
// checkEBSVolumes, checkRDSInstances, checkElastiCacheClusters, checkS3Buckets
// Implementation omitted for brevity
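// A minimal sketch of one of the omitted checkers (checkEBSVolumes), following the same
// ExpirationDate tag convention as checkEC2Instances; the remaining checkers look similar.
async function checkEBSVolumes(report) {
console.log('Checking EBS volumes');
try {
const { Volumes } = await ec2.describeVolumes({}).promise();
for (const volume of Volumes) {
try {
const tags = volume.Tags || [];
const expirationTag = tags.find(tag => tag.Key === 'ExpirationDate');
if (!expirationTag || !expirationTag.Value) continue;
const expirationDate = moment(expirationTag.Value);
const now = moment();
if (expirationDate.isBefore(now)) {
// Only unattached ("available") volumes can be deleted directly
const canDelete = volume.State === 'available';
report.expiredResources.push({
type: 'EBS Volume',
id: volume.VolumeId,
expirationDate: expirationTag.Value,
action: DRY_RUN ? 'Would delete' : (canDelete ? 'Deleting' : 'Skipping (still attached)')
});
if (!DRY_RUN && canDelete) {
await ec2.deleteVolume({ VolumeId: volume.VolumeId }).promise();
}
} else if (expirationDate.isBefore(now.clone().add(GRACE_PERIOD_DAYS, 'days'))) {
report.expiringResources.push({
type: 'EBS Volume',
id: volume.VolumeId,
expirationDate: expirationTag.Value,
daysRemaining: expirationDate.diff(now, 'days')
});
}
} catch (volumeError) {
report.errors.push({
type: 'EBS Volume',
id: volume.VolumeId,
error: volumeError.message
});
}
}
} catch (error) {
report.errors.push({
type: 'EBS Service',
error: error.message
});
}
}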
async function sendReport(report) {
if (report.expiredResources.length === 0 &&
report.expiringResources.length === 0 &&
report.errors.length === 0) {
console.log('No resources to report');
return;
}
let message = 'Resource Expiration Report\n\n';
if (report.expiredResources.length > 0) {
message += `Expired Resources (${report.expiredResources.length}):\n`;
report.expiredResources.forEach(resource => {
message += `- ${resource.type} ${resource.id}: Expired on ${resource.expirationDate}, ${resource.action}\n`;
});
message += '\n';
}
if (report.expiringResources.length > 0) {
message += `Expiring Soon (${report.expiringResources.length}):\n`;
report.expiringResources.forEach(resource => {
message += `- ${resource.type} ${resource.id}: Expires in ${resource.daysRemaining} days\n`;
});
message += '\n';
}
if (report.errors.length > 0) {
message += `Errors (${report.errors.length}):\n`;
report.errors.forEach(error => {
message += `- ${error.type}${error.id ? ' ' + error.id : ''}: ${error.error}\n`;
});
}
await sns.publish({
TopicArn: SNS_TOPIC,
Subject: `Resource Expiration Report - ${DRY_RUN ? 'DRY RUN' : 'LIVE RUN'}`,
Message: message
}).promise();
console.log('Report sent to SNS topic');
}
# Cloud Cost Analysis and Orphaned Resource Detection
# File: orphaned_resource_detector.py
import argparse
import boto3
import csv
import datetime
import json
import logging
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.storage import StorageManagementClient
from google.cloud import compute_v1
from google.cloud import storage
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class OrphanedResourceDetector:
def __init__(self, output_dir, days_threshold=30):
self.output_dir = output_dir
self.days_threshold = days_threshold
self.orphaned_resources = []
# Ensure output directory exists
os.makedirs(output_dir, exist_ok=True)
def detect_all(self):
"""Run detection across all cloud providers"""
logger.info("Starting orphaned resource detection")
# AWS detection
self.detect_aws_orphaned_resources()
# Azure detection
self.detect_azure_orphaned_resources()
# GCP detection
self.detect_gcp_orphaned_resources()
# Generate reports
self.generate_reports()
logger.info(f"Detection complete. Found {len(self.orphaned_resources)} potentially orphaned resources")
return self.orphaned_resources
def detect_aws_orphaned_resources(self):
"""Detect orphaned resources in AWS"""
logger.info("Detecting AWS orphaned resources")
try:
# Initialize AWS clients
ec2 = boto3.client('ec2')
elb = boto3.client('elb')
elbv2 = boto3.client('elbv2')
rds = boto3.client('rds')
# Check for unattached EBS volumes
self._detect_unattached_ebs_volumes(ec2)
# Check for idle EC2 instances
self._detect_idle_ec2_instances(ec2)
# Check for unused Elastic IPs
self._detect_unused_elastic_ips(ec2)
# Check for unused load balancers
self._detect_unused_load_balancers(elb, elbv2)
# Check for idle RDS instances
self._detect_idle_rds_instances(rds)
except Exception as e:
logger.error(f"Error detecting AWS orphaned resources: {str(e)}")
def detect_azure_orphaned_resources(self):
"""Detect orphaned resources in Azure"""
logger.info("Detecting Azure orphaned resources")
try:
# Initialize Azure clients
credential = DefaultAzureCredential()
resource_client = ResourceManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
compute_client = ComputeManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
network_client = NetworkManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
storage_client = StorageManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
# Check for unused disks
self._detect_unused_azure_disks(compute_client)
# Check for idle VMs
self._detect_idle_azure_vms(compute_client)
# Check for unused public IPs
self._detect_unused_azure_public_ips(network_client)
# Check for unused network security groups
self._detect_unused_azure_nsgs(network_client)
except Exception as e:
logger.error(f"Error detecting Azure orphaned resources: {str(e)}")
def detect_gcp_orphaned_resources(self):
"""Detect orphaned resources in GCP"""
logger.info("Detecting GCP orphaned resources")
try:
# Initialize GCP clients
compute_client = compute_v1.InstancesClient()
disks_client = compute_v1.DisksClient()
addresses_client = compute_v1.AddressesClient()
storage_client = storage.Client()
# Check for unused persistent disks
self._detect_unused_gcp_disks(disks_client)
# Check for idle VM instances
self._detect_idle_gcp_instances(compute_client)
# Check for unused static IPs
self._detect_unused_gcp_addresses(addresses_client)
except Exception as e:
logger.error(f"Error detecting GCP orphaned resources: {str(e)}")
# AWS detection methods
def _detect_unattached_ebs_volumes(self, ec2):
"""Detect unattached EBS volumes"""
try:
response = ec2.describe_volumes(
Filters=[{'Name': 'status', 'Values': ['available']}]
)
for volume in response['Volumes']:
# Check if volume has been unattached for more than threshold days
create_time = volume['CreateTime']
age_days = (datetime.datetime.now(datetime.timezone.utc) - create_time).days
if age_days > self.days_threshold:
tags = {tag['Key']: tag['Value'] for tag in volume.get('Tags', [])}
self.orphaned_resources.append({
'cloud_provider': 'AWS',
'resource_type': 'EBS Volume',
'resource_id': volume['VolumeId'],
'region': volume['AvailabilityZone'][:-1], # Remove AZ letter to get region
'created_time': create_time.isoformat(),
'age_days': age_days,
'size': f"{volume['Size']} GB",
'monthly_cost_estimate': round(volume['Size'] * 0.1, 2), # Rough estimate
'tags': tags,
'owner': tags.get('Owner', 'Unknown'),
'project': tags.get('Project', 'Unknown'),
'environment': tags.get('Environment', 'Unknown'),
'last_attached': 'Unknown'
})
except Exception as e:
logger.error(f"Error detecting unattached EBS volumes: {str(e)}")
# Additional detection methods for other resource types and cloud providers
# Implementation omitted for brevity
def generate_reports(self):
"""Generate reports from detected orphaned resources"""
if not self.orphaned_resources:
logger.info("No orphaned resources detected")
return
# Save full JSON report
json_path = os.path.join(self.output_dir, 'orphaned_resources.json')
with open(json_path, 'w') as f:
json.dump(self.orphaned_resources, f, indent=2)
# Save CSV report
csv_path = os.path.join(self.output_dir, 'orphaned_resources.csv')
with open(csv_path, 'w', newline='') as f:
if not self.orphaned_resources:
f.write("No orphaned resources detected")
return
fieldnames = self.orphaned_resources[0].keys()
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(self.orphaned_resources)
# Generate cost summary by resource type
cost_summary = {}
for resource in self.orphaned_resources:
resource_type = resource['resource_type']
cost = resource.get('monthly_cost_estimate', 0)
if resource_type not in cost_summary:
cost_summary[resource_type] = {
'count': 0,
'total_cost': 0
}
cost_summary[resource_type]['count'] += 1
cost_summary[resource_type]['total_cost'] += cost
# Save cost summary
summary_path = os.path.join(self.output_dir, 'cost_summary.json')
with open(summary_path, 'w') as f:
json.dump(cost_summary, f, indent=2)
logger.info(f"Reports generated in {self.output_dir}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Detect orphaned cloud resources')
parser.add_argument('--output-dir', default='./reports', help='Directory to store reports')
parser.add_argument('--days-threshold', type=int, default=30, help='Age threshold in days')
args = parser.parse_args()
detector = OrphanedResourceDetector(args.output_dir, args.days_threshold)
detector.detect_all()
Lessons Learned:
Effective cloud cost management requires proactive resource lifecycle tracking and automated cleanup processes.
How to Avoid:
Implement comprehensive resource tagging policies across all cloud providers.
Use infrastructure-as-code for all resource provisioning with mandatory lifecycle metadata.
Create automated processes for detecting and cleaning up orphaned resources.
Establish regular cost reviews with team accountability.
Implement time-to-live (TTL) for non-production resources with automated cleanup.
No summary provided
What Happened:
A large enterprise had implemented a centralized cloud cost optimization strategy that included purchasing 3-year Reserved Instances for their predictable workloads. After a year, a cost analysis revealed that many of these RIs were significantly underutilized, with some having utilization rates below 30%. Despite the discount from on-demand pricing, the company was effectively wasting money on unused capacity. The issue was particularly severe in development and testing environments, where workloads were often turned off outside of business hours but the RIs continued to be billed.
Diagnosis Steps:
Analyzed RI utilization reports across all accounts.
Reviewed workload patterns and instance usage.
Examined the RI purchase decision process.
Compared actual usage with forecasted usage.
Investigated instance scheduling practices.
Root Cause:
The investigation revealed multiple issues with the RI strategy:
1. RI purchases were based on peak usage rather than average utilization
2. Development and testing environments were included in RI purchases despite their intermittent usage
3. There was no process for redistributing underutilized RIs across accounts
4. The company had purchased 3-year terms for all RIs without considering workload volatility
5. There was no regular review process for RI utilization
Fix/Workaround:
• Implemented immediate improvements to RI management
• Created a centralized RI management function
• Implemented instance scheduling for non-production environments (see the scheduler sketch after this list)
• Converted some RIs to more flexible Savings Plans
• Established regular RI utilization reviews
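A minimal sketch of the non-production instance scheduler mentioned above; the Schedule tag convention and region are illustrative, and the function would typically be wired to two scheduled triggers (stop in the evening, start in the morning):
# instance_scheduler.py - a minimal sketch; tag names and region are illustrative
import boto3

SCHEDULED_TAG = {"Name": "tag:Schedule", "Values": ["office-hours"]}  # assumed tag convention


def set_nonprod_instance_state(region: str, action: str) -> list:
    """Stop or start all instances carrying the office-hours Schedule tag."""
    ec2 = boto3.client("ec2", region_name=region)
    state = "running" if action == "stop" else "stopped"
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[SCHEDULED_TAG, {"Name": "instance-state-name", "Values": [state]}]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    if instance_ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=instance_ids)
        else:
            ec2.start_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    print(set_nonprod_instance_state("us-east-1", "stop"))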
Lessons Learned:
Reserved Instances require careful planning, ongoing management, and regular optimization to achieve their cost-saving potential.
How to Avoid:
Match RI terms to workload stability (longer terms for stable workloads only).
Implement instance scheduling for non-production environments.
Create a centralized RI management function with regular reviews.
Consider more flexible discount options like Savings Plans for variable workloads.
Establish clear processes for redistributing underutilized RIs.