# Cloud Cost Optimization Scenarios
No summary provided
What Happened:
The finance team reported that the monthly AWS bill had doubled compared to the previous month, despite no significant changes in application traffic or new feature deployments. The increase appeared across multiple services, including EC2, EBS, and S3.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify the services with the largest increases (see the example query after this list).
Used AWS Cost Anomaly Detection to pinpoint specific resources contributing to the spike.
Compared resource inventories between the current and previous months.
Reviewed recent infrastructure changes through CloudTrail logs.
Examined Terraform state files and deployment history.
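A query along the following lines (a sketch using the AWS CLI; the dates and the jq/sort post-processing are illustrative) breaks the spend down by service for the two months being compared:
# cost_by_service.sh - month-over-month cost per service (dates are illustrative)
for MONTH_START in 2024-04-01 2024-05-01; do
  MONTH_END=$(date -d "$MONTH_START +1 month" +%Y-%m-%d)
  echo "== $MONTH_START to $MONTH_END =="
  aws ce get-cost-and-usage \
    --time-period Start=$MONTH_START,End=$MONTH_END \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=DIMENSION,Key=SERVICE \
    --output json \
    | jq -r '.ResultsByTime[].Groups[] | [.Keys[0], .Metrics.UnblendedCost.Amount] | @tsv' \
    | sort -t$'\t' -k2 -rn | head -15
done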
Root Cause:
Multiple issues contributed to the cost spike:
1. A load testing environment with 20 high-capacity EC2 instances was left running after tests completed.
2. Several terminated EC2 instances had orphaned EBS volumes that were never deleted.
3. A development S3 bucket was storing uncompressed log files with no lifecycle policies.
4. A misconfigured autoscaling group was scaling based on CPU rather than application-specific metrics, causing over-provisioning.
Fix/Workaround:
• Short-term: Identified and terminated unused resources:
#!/bin/bash
# cleanup_orphaned_resources.sh
set -euo pipefail
# Find and terminate orphaned EC2 instances
echo "Finding orphaned EC2 instances..."
ORPHANED_INSTANCES=$(aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=loadtest" "Name=instance-state-name,Values=running" \
--query "Reservations[].Instances[].InstanceId" \
--output text)
if [ -n "$ORPHANED_INSTANCES" ]; then
echo "Terminating orphaned instances: $ORPHANED_INSTANCES"
aws ec2 terminate-instances --instance-ids $ORPHANED_INSTANCES
else
echo "No orphaned instances found."
fi
# Find and delete unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query "Volumes[].VolumeId" \
--output text)
if [ -n "$UNATTACHED_VOLUMES" ]; then
for VOLUME_ID in $UNATTACHED_VOLUMES; do
echo "Deleting unattached volume: $VOLUME_ID"
aws ec2 delete-volume --volume-id $VOLUME_ID
done
else
echo "No unattached volumes found."
fi
# Find and clean up old snapshots
echo "Finding old EBS snapshots..."
RETENTION_DAYS=30
CUTOFF_DATE=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)
OLD_SNAPSHOTS=$(aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<='$CUTOFF_DATE'].SnapshotId" \
--output text)
if [ -n "$OLD_SNAPSHOTS" ]; then
for SNAPSHOT_ID in $OLD_SNAPSHOTS; do
echo "Deleting old snapshot: $SNAPSHOT_ID"
aws ec2 delete-snapshot --snapshot-id $SNAPSHOT_ID
done
else
echo "No old snapshots found."
fi
echo "Resource cleanup completed."
• Long-term: Implemented proper resource tagging and lifecycle management:
# resource_tagging.tf - Standardized tagging for all resources
locals {
common_tags = {
Environment = var.environment
Project = var.project_name
Owner = var.team_email
ManagedBy = "Terraform"
CostCenter = var.cost_center
Expiration = var.environment == "production" ? "permanent" : timeadd(timestamp(), "168h")
}
}
# ec2_instance.tf - EC2 instance with proper tagging and monitoring
resource "aws_instance" "application_server" {
ami = var.ami_id
instance_type = var.instance_type
subnet_id = var.subnet_id
# Ensure volumes are deleted on termination
root_block_device {
volume_type = "gp3"
volume_size = 50
delete_on_termination = true
encrypted = true
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-root-volume"
}
)
}
# Enable detailed monitoring for better autoscaling
monitoring = true
# Apply standardized tags
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-server"
}
)
# Ensure all tags are propagated to volumes
volume_tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-volumes"
}
)
}
# s3_bucket.tf - S3 bucket with lifecycle policies
resource "aws_s3_bucket" "logs_bucket" {
bucket = "${var.project_name}-${var.environment}-logs"
tags = merge(
local.common_tags,
{
Name = "${var.project_name}-${var.environment}-logs"
}
)
}
resource "aws_s3_bucket_lifecycle_configuration" "logs_lifecycle" {
bucket = aws_s3_bucket.logs_bucket.id
rule {
id = "log-transition-and-expiration"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
expiration {
days = 365
}
}
}
# autoscaling.tf - Improved autoscaling configuration
resource "aws_autoscaling_group" "application_asg" {
name = "${var.project_name}-${var.environment}-asg"
min_size = var.min_instances
max_size = var.max_instances
desired_capacity = var.desired_instances
vpc_zone_identifier = var.subnet_ids
launch_configuration = aws_launch_configuration.application_lc.name
# Use instance refresh for zero-downtime updates
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 90
}
}
# Tag instances launched by the ASG with the standard tag set
tag {
key = "Name"
value = "${var.project_name}-${var.environment}-asg-instance"
propagate_at_launch = true
}
dynamic "tag" {
for_each = local.common_tags
content {
key = tag.key
value = tag.value
propagate_at_launch = true
}
}
}
resource "aws_autoscaling_policy" "application_scaling_policy" {
name = "${var.project_name}-${var.environment}-scaling-policy"
autoscaling_group_name = aws_autoscaling_group.application_asg.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.application_lb.arn_suffix}/${aws_lb_target_group.application_tg.arn_suffix}"
}
target_value = 1000
disable_scale_in = false
}
}
• Implemented a cost monitoring and alerting system:
# cost_monitor.py
import boto3
import datetime
import json
import os
import logging
import urllib.request
from dateutil.relativedelta import relativedelta
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('cost_monitor')
# Configuration
BUDGET_THRESHOLD_PERCENT = 80
ANOMALY_THRESHOLD_PERCENT = 20
SNS_TOPIC_ARN = os.environ.get('SNS_TOPIC_ARN')
SLACK_WEBHOOK_URL = os.environ.get('SLACK_WEBHOOK_URL')
def lambda_handler(event, context):
"""AWS Lambda handler for cost monitoring"""
try:
# Initialize clients
ce_client = boto3.client('ce')
budgets_client = boto3.client('budgets')
sns_client = boto3.client('sns')
# Get current date information
today = datetime.datetime.utcnow().date()
first_day_month = today.replace(day=1)
last_day_month = (first_day_month + relativedelta(months=1, days=-1))
# Get month-to-date costs
mtd_costs = get_month_to_date_costs(ce_client, first_day_month, today)
# Get cost forecast for the month
forecast = get_cost_forecast(ce_client, today, last_day_month)
# Check budgets
budget_alerts = check_budgets(budgets_client)
# Check for cost anomalies
anomalies = detect_cost_anomalies(ce_client)
# Generate cost report
cost_report = {
'month_to_date': mtd_costs,
'forecast': forecast,
'budget_alerts': budget_alerts,
'anomalies': anomalies
}
# Send notifications if needed
if budget_alerts or anomalies:
send_notifications(sns_client, cost_report)
return {
'statusCode': 200,
'body': json.dumps(cost_report)
}
except Exception as e:
logger.error(f"Error in cost monitoring: {str(e)}")
raise
def get_month_to_date_costs(ce_client, start_date, end_date):
"""Get month-to-date costs from AWS Cost Explorer"""
response = ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date.isoformat(),
'End': end_date.isoformat()
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[
{
'Type': 'DIMENSION',
'Key': 'SERVICE'
}
]
)
total_cost = 0
service_costs = {}
for result in response['ResultsByTime']:
for group in result['Groups']:
service = group['Keys'][0]
amount = float(group['Metrics']['UnblendedCost']['Amount'])
service_costs[service] = amount
total_cost += amount
return {
'total': total_cost,
'by_service': service_costs
}
def get_cost_forecast(ce_client, start_date, end_date):
"""Get cost forecast from AWS Cost Explorer"""
response = ce_client.get_cost_forecast(
TimePeriod={
'Start': start_date.isoformat(),
'End': end_date.isoformat()
},
Metric='UNBLENDED_COST',
Granularity='MONTHLY'
)
return {
'total': float(response['Total']['Amount']),
'forecast_date': response['ForecastResultsByTime'][0]['TimePeriod']
}
def check_budgets(budgets_client):
"""Check AWS Budgets for alerts"""
response = budgets_client.describe_budgets(
AccountId=boto3.client('sts').get_caller_identity()['Account']
)
alerts = []
for budget in response.get('Budgets', []):
budget_name = budget['BudgetName']
budget_amount = float(budget['BudgetLimit']['Amount'])
actual_amount = float(budget.get('CalculatedSpend', {}).get('ActualSpend', {}).get('Amount', 0))
forecast_amount = float(budget.get('CalculatedSpend', {}).get('ForecastedSpend', {}).get('Amount', 0))
# Check if actual spend exceeds threshold
actual_percent = (actual_amount / budget_amount) * 100
if actual_percent >= BUDGET_THRESHOLD_PERCENT:
alerts.append({
'budget_name': budget_name,
'budget_amount': budget_amount,
'actual_amount': actual_amount,
'actual_percent': actual_percent,
'type': 'actual'
})
# Check if forecast exceeds budget
if forecast_amount > budget_amount:
forecast_percent = (forecast_amount / budget_amount) * 100
alerts.append({
'budget_name': budget_name,
'budget_amount': budget_amount,
'forecast_amount': forecast_amount,
'forecast_percent': forecast_percent,
'type': 'forecast'
})
return alerts
def detect_cost_anomalies(ce_client):
"""Detect cost anomalies using AWS Cost Anomaly Detection"""
# Get anomaly monitors
monitors_response = ce_client.get_anomaly_monitors()
anomalies = []
# For each monitor, get anomalies
for monitor in monitors_response.get('AnomalyMonitors', []):
monitor_arn = monitor['MonitorArn']
# Get anomalies for the last 30 days
end_date = datetime.datetime.utcnow().date()
start_date = end_date - datetime.timedelta(days=30)
anomalies_response = ce_client.get_anomalies(
MonitorArn=monitor_arn,
DateInterval={
'StartDate': start_date.isoformat(),
'EndDate': end_date.isoformat()
}
)
for anomaly in anomalies_response.get('Anomalies', []):
# Impact is a structure; use its dollar amount and percentage fields
impact_amount = float(anomaly['Impact'].get('TotalImpact', 0))
impact_percent = float(anomaly['Impact'].get('TotalImpactPercentage') or 0)
# Only report significant anomalies
if impact_percent >= ANOMALY_THRESHOLD_PERCENT:
anomalies.append({
'id': anomaly['AnomalyId'],
'monitor_name': monitor['MonitorName'],
'impact': impact_amount,
'impact_percent': impact_percent,
'root_causes': anomaly.get('RootCauses', []),
'start_date': anomaly['AnomalyStartDate'],
'end_date': anomaly.get('AnomalyEndDate')
})
return anomalies
def send_notifications(sns_client, cost_report):
"""Send notifications about cost issues"""
# Format message
message = format_notification_message(cost_report)
# Send SNS notification
if SNS_TOPIC_ARN:
sns_client.publish(
TopicArn=SNS_TOPIC_ARN,
Subject='AWS Cost Alert',
Message=message
)
# Send Slack notification
if SLACK_WEBHOOK_URL:
send_slack_notification(cost_report)
def format_notification_message(cost_report):
"""Format notification message"""
message = "AWS Cost Alert\n\n"
# Add budget alerts
if cost_report['budget_alerts']:
message += "Budget Alerts:\n"
for alert in cost_report['budget_alerts']:
if alert['type'] == 'actual':
message += f"- Budget '{alert['budget_name']}' has reached {alert['actual_percent']:.1f}% " \
f"(${alert['actual_amount']:.2f} of ${alert['budget_amount']:.2f})\n"
else:
message += f"- Budget '{alert['budget_name']}' is forecasted to reach {alert['forecast_percent']:.1f}% " \
f"(${alert['forecast_amount']:.2f} of ${alert['budget_amount']:.2f})\n"
message += "\n"
# Add anomalies
if cost_report['anomalies']:
message += "Cost Anomalies:\n"
for anomaly in cost_report['anomalies']:
message += f"- {anomaly['monitor_name']}: {anomaly['impact_percent']:.1f}% increase detected\n"
if anomaly['root_causes']:
message += " Root causes:\n"
for cause in anomaly['root_causes']:
service = cause.get('Service', 'Unknown service')
message += f" - {service}\n"
message += "\n"
# Add month-to-date costs
message += "Month-to-Date Costs:\n"
message += f"- Total: ${cost_report['month_to_date']['total']:.2f}\n"
message += "- Top Services:\n"
# Sort services by cost (descending)
sorted_services = sorted(
cost_report['month_to_date']['by_service'].items(),
key=lambda x: x[1],
reverse=True
)
# Show top 5 services
for service, cost in sorted_services[:5]:
message += f" - {service}: ${cost:.2f}\n"
# Add forecast
message += f"\nForecast for this month: ${cost_report['forecast']['total']:.2f}\n"
return message
def send_slack_notification(cost_report):
"""Send notification to Slack via an incoming webhook (minimal implementation)"""
payload = json.dumps({'text': format_notification_message(cost_report)}).encode('utf-8')
request = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload, headers={'Content-Type': 'application/json'})
urllib.request.urlopen(request)
if __name__ == '__main__':
# For local testing
lambda_handler(None, None)
• Created a resource tagging enforcement policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EnforceTaggingOnResourceCreation",
"Effect": "Deny",
"Action": [
"ec2:RunInstances",
"ec2:CreateVolume",
"rds:CreateDBInstance",
"s3:CreateBucket",
"dynamodb:CreateTable",
"elasticloadbalancing:CreateLoadBalancer"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/Environment": "true",
"aws:RequestTag/Project": "true",
"aws:RequestTag/Owner": "true",
"aws:RequestTag/CostCenter": "true"
}
}
},
{
"Sid": "EnforceTaggingOnResourceTagging",
"Effect": "Deny",
"Action": [
"ec2:CreateTags"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/Environment": "true",
"aws:RequestTag/Project": "true",
"aws:RequestTag/Owner": "true",
"aws:RequestTag/CostCenter": "true"
},
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"Environment",
"Project",
"Owner",
"CostCenter"
]
}
}
},
{
"Sid": "EnforceTaggingOnResourceModification",
"Effect": "Deny",
"Action": [
"ec2:ModifyInstanceAttribute",
"rds:ModifyDBInstance",
"dynamodb:UpdateTable",
"elasticloadbalancing:ModifyLoadBalancerAttributes"
],
"Resource": "*",
"Condition": {
"Null": {
"aws:ResourceTag/Environment": "true",
"aws:ResourceTag/Project": "true",
"aws:ResourceTag/Owner": "true",
"aws:ResourceTag/CostCenter": "true"
}
}
}
]
}
Lessons Learned:
Cloud costs require proactive monitoring and governance to prevent unexpected spikes.
How to Avoid:
Implement mandatory resource tagging with ownership information.
Set up automated cleanup of non-production resources after a defined period.
Configure budget alerts and anomaly detection with appropriate thresholds (see the example budget configuration after this list).
Use infrastructure as code with proper resource lifecycle management.
Implement regular cost reviews and optimization processes.
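For the budget-alert recommendation above, a minimal sketch of creating a monthly cost budget with an 80% actual-spend notification via the AWS CLI (the limit amount and e-mail address are placeholders):
# create_budget_alert.sh - monthly cost budget with an 80% threshold notification (values are placeholders)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{
    "BudgetName": "monthly-total-cost",
    "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}]
    }
  ]'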
No summary provided
What Happened:
The finance team reported a significant spike in cloud costs during the monthly review. The increase couldn't be attributed to any planned infrastructure expansion or traffic growth. The DevOps team was tasked with identifying and resolving the issue quickly.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify which services showed cost increases.
Compared current resource usage with historical baselines.
Used AWS Cost Anomaly Detection to pinpoint specific resources (see the example command after this list).
Reviewed recent infrastructure changes across all environments.
Checked for orphaned resources using custom tagging policies.
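For the anomaly-detection step, a command along these lines (a sketch; the 30-day window and the $100 impact threshold are illustrative) lists recent anomalies together with their dollar impact and most likely root-cause service:
# List cost anomalies from the last 30 days with a total impact of at least $100
aws ce get-anomalies \
  --date-interval StartDate=$(date -d '30 days ago' +%Y-%m-%d),EndDate=$(date +%Y-%m-%d) \
  --total-impact NumericOperator=GREATER_THAN_OR_EQUAL,StartValue=100 \
  --output json \
  | jq -r '.Anomalies[] | [.AnomalyId, .Impact.TotalImpact, (.RootCauses[0].Service // "unknown")] | @tsv'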
Root Cause:
Multiple causes were identified:
1. A development team had launched large GPU instances for ML testing but didn't terminate them after testing.
2. Several EBS volumes remained after their associated EC2 instances were terminated.
3. A misconfigured autoscaling group was scaling up but not properly scaling down.
4. Several NAT Gateways were running in regions where they were no longer needed.
5. A large number of unattached Elastic IPs were being billed.
Fix/Workaround:
• Short-term: Identified and terminated unnecessary resources:
#!/bin/bash
# cleanup_orphaned_resources.sh
# Set AWS region
REGION="us-west-2"
echo "Cleaning up orphaned resources in $REGION..."
# Find and terminate stopped instances launched more than 7 days ago
echo "Finding stopped EC2 instances launched more than 7 days ago..."
STOPPED_INSTANCES=$(aws ec2 describe-instances \
--region $REGION \
--filters "Name=instance-state-name,Values=stopped" \
--query "Reservations[].Instances[?LaunchTime<='$(date -d '7 days ago' --iso-8601)'].InstanceId" \
--output text)
if [ -n "$STOPPED_INSTANCES" ]; then
echo "Terminating stopped instances: $STOPPED_INSTANCES"
aws ec2 terminate-instances --region $REGION --instance-ids $STOPPED_INSTANCES
else
echo "No stopped instances found to terminate."
fi
# Find and delete unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--region $REGION \
--filters "Name=status,Values=available" \
--query "Volumes[].VolumeId" \
--output text)
if [ -n "$UNATTACHED_VOLUMES" ]; then
for VOLUME_ID in $UNATTACHED_VOLUMES; do
echo "Deleting unattached volume: $VOLUME_ID"
aws ec2 delete-volume --region $REGION --volume-id $VOLUME_ID
done
else
echo "No unattached volumes found."
fi
# Find and release unassociated Elastic IPs
echo "Finding unassociated Elastic IPs..."
UNASSOCIATED_EIPS=$(aws ec2 describe-addresses \
--region $REGION \
--query "Addresses[?AssociationId==null].AllocationId" \
--output text)
if [ -n "$UNASSOCIATED_EIPS" ]; then
for EIP_ID in $UNASSOCIATED_EIPS; do
echo "Releasing unassociated Elastic IP: $EIP_ID"
aws ec2 release-address --region $REGION --allocation-id $EIP_ID
done
else
echo "No unassociated Elastic IPs found."
fi
# Find and delete unused NAT Gateways
echo "Finding unused NAT Gateways..."
UNUSED_NAT_GATEWAYS=$(aws ec2 describe-nat-gateways \
--region $REGION \
--filter "Name=state,Values=available" \
--query "NatGateways[].NatGatewayId" \
--output text)
if [ -n "$UNUSED_NAT_GATEWAYS" ]; then
for NAT_ID in $UNUSED_NAT_GATEWAYS; do
# Check if NAT Gateway is actually in use by checking route tables
ROUTES_USING_NAT=$(aws ec2 describe-route-tables \
--region $REGION \
--filters "Name=route.nat-gateway-id,Values=$NAT_ID" \
--query "RouteTables[].RouteTableId" \
--output text)
if [ -z "$ROUTES_USING_NAT" ]; then
echo "Deleting unused NAT Gateway: $NAT_ID"
aws ec2 delete-nat-gateway --region $REGION --nat-gateway-id $NAT_ID
else
echo "NAT Gateway $NAT_ID is in use by route tables: $ROUTES_USING_NAT"
fi
done
else
echo "No unused NAT Gateways found."
fi
echo "Cleanup completed!"
• Long-term: Implemented a comprehensive cost management strategy:
# cost_governance.tf - Resource tagging and lifecycle policies
# Define required tags
locals {
required_tags = {
Environment = "Required"
Project = "Required"
Owner = "Required"
CostCenter = "Required"
Expiration = "Optional"
}
}
# AWS Provider with default tags
provider "aws" {
region = "us-west-2"
default_tags {
tags = {
ManagedBy = "Terraform"
}
}
}
# IAM policy to enforce tagging
resource "aws_iam_policy" "enforce_tagging" {
name = "enforce-resource-tagging"
description = "Enforces tagging standards for AWS resources"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "EnforceTaggingOnResourceCreation"
Effect = "Deny"
Action = [
"ec2:RunInstances",
"ec2:CreateVolume",
"rds:CreateDBInstance",
"dynamodb:CreateTable",
"s3:CreateBucket"
]
Resource = "*"
Condition = {
"Null" = {
"aws:RequestTag/Environment" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/CostCenter" = "true"
}
}
}
]
})
}
# EC2 instance with lifecycle rules
resource "aws_instance" "example" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "example-instance"
Environment = "development"
Project = "cost-optimization"
Owner = "devops-team"
CostCenter = "engineering"
Expiration = "2023-12-31"
}
# Lifecycle rules: prevent accidental deletion and create replacements before destroying
# (Terraform allows only one lifecycle block per resource)
lifecycle {
prevent_destroy = true
create_before_destroy = true
}
}
# Auto Scaling Group with proper scaling policies
resource "aws_autoscaling_group" "example" {
name = "example-asg"
min_size = 1
max_size = 5
desired_capacity = 2
# Subnets imply the AZ spread; availability_zones conflicts with vpc_zone_identifier,
# and launch_configuration conflicts with the mixed_instances_policy below, so neither is set
vpc_zone_identifier = [aws_subnet.example.id]
# Leave scale-in protection off so the ASG can remove idle capacity
protect_from_scale_in = false
# Use mixed instances policy for cost optimization
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 1
on_demand_percentage_above_base_capacity = 50
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.example.id
version = "$Latest"
}
override {
instance_type = "t3.micro"
}
override {
instance_type = "t3a.micro"
}
}
}
# Propagate standard tags to instances launched by the ASG
tag {
key = "Name"
value = "example-asg-instance"
propagate_at_launch = true
}
tag {
key = "Environment"
value = "development"
propagate_at_launch = true
}
tag {
key = "Project"
value = "cost-optimization"
propagate_at_launch = true
}
tag {
key = "Owner"
value = "devops-team"
propagate_at_launch = true
}
tag {
key = "CostCenter"
value = "engineering"
propagate_at_launch = true
}
}
# Proper scaling policies
resource "aws_autoscaling_policy" "scale_up" {
name = "scale-up"
scaling_adjustment = 1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.example.name
}
resource "aws_autoscaling_policy" "scale_down" {
name = "scale-down"
scaling_adjustment = -1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.example.name
}
# CloudWatch alarms for scaling
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "high-cpu-utilization"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300
statistic = "Average"
threshold = 70
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.example.name
}
alarm_description = "Scale up when CPU exceeds 70%"
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
resource "aws_cloudwatch_metric_alarm" "low_cpu" {
alarm_name = "low-cpu-utilization"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300
statistic = "Average"
threshold = 30
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.example.name
}
alarm_description = "Scale down when CPU is below 30%"
alarm_actions = [aws_autoscaling_policy.scale_down.arn]
}
# S3 lifecycle policy for cost optimization
resource "aws_s3_bucket" "example" {
bucket = "example-bucket"
tags = {
Name = "example-bucket"
Environment = "development"
Project = "cost-optimization"
Owner = "devops-team"
CostCenter = "engineering"
}
}
resource "aws_s3_bucket_lifecycle_configuration" "example" {
bucket = aws_s3_bucket.example.id
rule {
id = "transition-to-infrequent-access"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
expiration {
days = 365
}
}
}
# DynamoDB auto-scaling
resource "aws_appautoscaling_target" "dynamodb_table_read_target" {
max_capacity = 100
min_capacity = 5
resource_id = "table/example-table"
scalable_dimension = "dynamodb:table:ReadCapacityUnits"
service_namespace = "dynamodb"
}
resource "aws_appautoscaling_policy" "dynamodb_table_read_policy" {
name = "dynamodb-table-read-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.dynamodb_table_read_target.resource_id
scalable_dimension = aws_appautoscaling_target.dynamodb_table_read_target.scalable_dimension
service_namespace = aws_appautoscaling_target.dynamodb_table_read_target.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "DynamoDBReadCapacityUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 300
}
}
• Implemented a Go-based cost monitoring and alerting system:
// cost_monitor.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"os"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/costexplorer"
"github.com/aws/aws-sdk-go-v2/service/costexplorer/types"
"github.com/aws/aws-sdk-go-v2/service/ec2"
ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
"github.com/aws/aws-sdk-go-v2/service/sns"
"github.com/robfig/cron/v3"
)
type ResourceCost struct {
ResourceID string `json:"resourceId"`
Service string `json:"service"`
Cost float64 `json:"cost"`
Currency string `json:"currency"`
Region string `json:"region"`
Tags map[string]string `json:"tags"`
}
type OrphanedResource struct {
ResourceID string `json:"resourceId"`
ResourceType string `json:"resourceType"`
Region string `json:"region"`
CreatedAt time.Time `json:"createdAt"`
Tags map[string]string `json:"tags"`
}
func main() {
// Load AWS configuration
cfg, err := config.LoadDefaultConfig(context.TODO(),
config.WithRegion("us-west-2"),
)
if err != nil {
log.Fatalf("unable to load SDK config, %v", err)
}
// Create Cost Explorer client
ceClient := costexplorer.NewFromConfig(cfg)
// Create EC2 client
ec2Client := ec2.NewFromConfig(cfg)
// Create SNS client for notifications
snsClient := sns.NewFromConfig(cfg)
// Set up cron scheduler
c := cron.New()
// Run daily cost analysis
c.AddFunc("0 1 * * *", func() {
log.Println("Running daily cost analysis...")
// Get yesterday's date
yesterday := time.Now().AddDate(0, 0, -1)
startDate := yesterday.Format("2006-01-02")
endDate := time.Now().Format("2006-01-02")
// Get cost and usage data
costData, err := getCostAndUsage(ceClient, startDate, endDate)
if err != nil {
log.Printf("Error getting cost data: %v", err)
return
}
// Find orphaned resources
orphanedResources, err := findOrphanedResources(ec2Client)
if err != nil {
log.Printf("Error finding orphaned resources: %v", err)
}
// Check for cost anomalies
anomalies, err := detectCostAnomalies(ceClient, startDate, endDate)
if err != nil {
log.Printf("Error detecting cost anomalies: %v", err)
}
// Send notifications if needed
if len(orphanedResources) > 0 || len(anomalies) > 0 {
sendNotification(snsClient, costData, orphanedResources, anomalies)
}
})
// Start cron scheduler
c.Start()
// Keep the application running
select {}
}
func getCostAndUsage(client *costexplorer.Client, startDate, endDate string) ([]ResourceCost, error) {
input := &costexplorer.GetCostAndUsageInput{
TimePeriod: &types.DateInterval{
Start: aws.String(startDate),
End: aws.String(endDate),
},
Granularity: types.GranularityDaily,
Metrics: []string{"BlendedCost"},
GroupBy: []types.GroupDefinition{
{
Type: types.GroupDefinitionTypeDimension,
Key: aws.String("SERVICE"),
},
{
Type: types.GroupDefinitionTypeTag,
Key: aws.String("ResourceId"),
},
},
}
result, err := client.GetCostAndUsage(context.TODO(), input)
if err != nil {
return nil, fmt.Errorf("failed to get cost and usage: %w", err)
}
var resources []ResourceCost
for _, resultByTime := range result.ResultsByTime {
for _, group := range resultByTime.Groups {
// Parse the cost amount
cost := 0.0
if len(group.Metrics) > 0 {
if blendedCost, ok := group.Metrics["BlendedCost"]; ok {
if amount := blendedCost.Amount; amount != nil {
if parsedCost, err := parseFloat(*amount); err == nil {
cost = parsedCost
}
}
}
}
// Skip resources with zero cost
if cost == 0 {
continue
}
// Extract service and resource ID
service := ""
resourceID := ""
if len(group.Keys) >= 2 {
service = group.Keys[0]
resourceID = group.Keys[1]
}
// Create resource cost entry
resources = append(resources, ResourceCost{
ResourceID: resourceID,
Service: service,
Cost: cost,
Currency: "USD", // Assuming USD
Region: "us-west-2", // Assuming us-west-2
Tags: map[string]string{}, // Would need additional API calls to get tags
})
}
}
return resources, nil
}
func findOrphanedResources(client *ec2.Client) ([]OrphanedResource, error) {
var orphanedResources []OrphanedResource
// Find unattached EBS volumes
volumesResult, err := client.DescribeVolumes(context.TODO(), &ec2.DescribeVolumesInput{
Filters: []ec2types.Filter{
{
Name: aws.String("status"),
Values: []string{"available"},
},
},
})
if err != nil {
return nil, fmt.Errorf("failed to describe volumes: %w", err)
}
for _, volume := range volumesResult.Volumes {
tags := make(map[string]string)
for _, tag := range volume.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *volume.VolumeId,
ResourceType: "EBS Volume",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: *volume.CreateTime,
Tags: tags,
})
}
// Find unassociated Elastic IPs
addressesResult, err := client.DescribeAddresses(context.TODO(), &ec2.DescribeAddressesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe addresses: %w", err)
}
for _, address := range addressesResult.Addresses {
if address.AssociationId == nil {
tags := make(map[string]string)
for _, tag := range address.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *address.AllocationId,
ResourceType: "Elastic IP",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: time.Now(), // EIPs don't have creation time in the API
Tags: tags,
})
}
}
// Find stopped EC2 instances
instancesResult, err := client.DescribeInstances(context.TODO(), &ec2.DescribeInstancesInput{
Filters: []ec2types.Filter{
{
Name: aws.String("instance-state-name"),
Values: []string{"stopped"},
},
},
})
if err != nil {
return nil, fmt.Errorf("failed to describe instances: %w", err)
}
for _, reservation := range instancesResult.Reservations {
for _, instance := range reservation.Instances {
// Check if instance has been stopped for more than 7 days
if instance.StateTransitionReason != nil {
// Parse the state transition reason to get the stop time
// This is a bit hacky but the API doesn't provide a direct way to get this
if stopTime, err := parseStateTransitionTime(*instance.StateTransitionReason); err == nil {
if time.Since(stopTime) > 7*24*time.Hour {
tags := make(map[string]string)
for _, tag := range instance.Tags {
if tag.Key != nil && tag.Value != nil {
tags[*tag.Key] = *tag.Value
}
}
orphanedResources = append(orphanedResources, OrphanedResource{
ResourceID: *instance.InstanceId,
ResourceType: "EC2 Instance",
Region: "us-west-2", // Assuming us-west-2
CreatedAt: *instance.LaunchTime,
Tags: tags,
})
}
}
}
}
}
return orphanedResources, nil
}
func detectCostAnomalies(client *costexplorer.Client, startDate, endDate string) ([]types.AnomalyMonitor, error) {
// Get cost anomaly monitors
monitorsResult, err := client.GetAnomalyMonitors(context.TODO(), &costexplorer.GetAnomalyMonitorsInput{})
if err != nil {
return nil, fmt.Errorf("failed to get anomaly monitors: %w", err)
}
var anomalies []types.AnomalyMonitor
// For each monitor, get anomalies
for _, monitor := range monitorsResult.AnomalyMonitors {
subscriptionsResult, err := client.GetAnomalySubscriptions(context.TODO(), &costexplorer.GetAnomalySubscriptionsInput{
MonitorArn: monitor.MonitorArn,
})
if err != nil {
log.Printf("Failed to get anomaly subscriptions for monitor %s: %v", *monitor.MonitorArn, err)
continue
}
for _, subscription := range subscriptionsResult.AnomalySubscriptions {
anomaliesResult, err := client.GetAnomalies(context.TODO(), &costexplorer.GetAnomaliesInput{
DateInterval: &types.AnomalyDateInterval{
StartDate: aws.String(startDate),
EndDate: aws.String(endDate),
},
MonitorArn: monitor.MonitorArn,
SubscriptionArn: subscription.SubscriptionArn,
})
if err != nil {
log.Printf("Failed to get anomalies for subscription %s: %v", *subscription.SubscriptionArn, err)
continue
}
if len(anomaliesResult.Anomalies) > 0 {
anomalies = append(anomalies, monitor)
break
}
}
}
return anomalies, nil
}
func sendNotification(client *sns.Client, costData []ResourceCost, orphanedResources []OrphanedResource, anomalies []types.AnomalyMonitor) {
// Prepare notification message
message := "AWS Cost Optimization Report\n\n"
// Add cost data summary
totalCost := 0.0
for _, resource := range costData {
totalCost += resource.Cost
}
message += fmt.Sprintf("Total Cost: $%.2f\n\n", totalCost)
// Add orphaned resources
if len(orphanedResources) > 0 {
message += "Orphaned Resources:\n"
for _, resource := range orphanedResources {
message += fmt.Sprintf("- %s (%s): %s (Created: %s)\n",
resource.ResourceType,
resource.ResourceID,
resource.Region,
resource.CreatedAt.Format("2006-01-02"))
}
message += "\n"
}
// Add cost anomalies
if len(anomalies) > 0 {
message += "Cost Anomalies Detected:\n"
for _, anomaly := range anomalies {
message += fmt.Sprintf("- %s: %s\n", *anomaly.MonitorName, anomaly.MonitorType)
}
message += "\n"
}
// Add top 10 most expensive resources
if len(costData) > 0 {
message += "Top 10 Most Expensive Resources:\n"
// Sort cost data by cost (descending)
// This is a simple bubble sort for demonstration
for i := 0; i < len(costData); i++ {
for j := i + 1; j < len(costData); j++ {
if costData[i].Cost < costData[j].Cost {
costData[i], costData[j] = costData[j], costData[i]
}
}
}
// Take top 10 or less
count := 10
if len(costData) < 10 {
count = len(costData)
}
for i := 0; i < count; i++ {
message += fmt.Sprintf("- %s (%s): $%.2f\n",
costData[i].ResourceID,
costData[i].Service,
costData[i].Cost)
}
}
// Send SNS notification
_, err := client.Publish(context.TODO(), &sns.PublishInput{
TopicArn: aws.String(os.Getenv("SNS_TOPIC_ARN")),
Subject: aws.String("AWS Cost Optimization Report"),
Message: aws.String(message),
})
if err != nil {
log.Printf("Failed to send notification: %v", err)
} else {
log.Println("Cost optimization notification sent successfully")
}
}
// Helper functions
func parseFloat(s string) (float64, error) {
var f float64
_, err := fmt.Sscanf(s, "%f", &f)
return f, err
}
func parseStateTransitionTime(reason string) (time.Time, error) {
// Example: "User initiated (2023-05-15 10:30:00 GMT)"
var year, month, day, hour, min, sec int
_, err := fmt.Sscanf(reason, "User initiated (%d-%d-%d %d:%d:%d GMT)",
&year, &month, &day, &hour, &min, &sec)
if err != nil {
return time.Time{}, err
}
return time.Date(year, time.Month(month), day, hour, min, sec, 0, time.UTC), nil
}
• Created a Rust-based resource tagging enforcement tool:
// tag_enforcer.rs
use aws_config::meta::region::RegionProviderChain;
use aws_sdk_ec2::{Client as Ec2Client, Error as Ec2Error};
use aws_sdk_resourcegroupstaggingapi::{Client as TaggingClient, Error as TaggingError};
use aws_sdk_sns::{Client as SnsClient, Error as SnsError};
use chrono::{DateTime, Duration, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::env;
use structopt::StructOpt;
use tokio::time;
#[derive(Debug, StructOpt)]
#[structopt(name = "tag-enforcer", about = "AWS resource tag enforcement tool")]
struct Opt {
/// AWS Region
#[structopt(short, long, default_value = "us-west-2")]
region: String,
/// SNS Topic ARN for notifications
#[structopt(long, env = "SNS_TOPIC_ARN")]
sns_topic_arn: String,
/// Dry run mode (don't make any changes)
#[structopt(long)]
dry_run: bool,
/// Required tags (comma-separated)
#[structopt(long, default_value = "Environment,Project,Owner,CostCenter")]
required_tags: String,
}
#[derive(Debug, Serialize, Deserialize)]
struct UntaggedResource {
resource_arn: String,
resource_type: String,
missing_tags: Vec<String>,
existing_tags: HashMap<String, String>,
}
#[derive(Debug, Serialize, Deserialize)]
struct ExpiringResource {
resource_arn: String,
resource_type: String,
expiration_date: String,
days_until_expiration: i64,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let opt = Opt::from_args();
// Load AWS configuration
let region_provider = RegionProviderChain::first_try(opt.region.clone())
.or_default_provider()
.or_else(opt.region);
let config = aws_config::from_env().region(region_provider).load().await;
// Create clients
let tagging_client = TaggingClient::new(&config);
let ec2_client = Ec2Client::new(&config);
let sns_client = SnsClient::new(&config);
// Parse required tags
let required_tags: Vec<String> = opt.required_tags
.split(',')
.map(|s| s.trim().to_string())
.collect();
println!("Starting tag enforcement with required tags: {:?}", required_tags);
// Run tag enforcement every day
loop {
println!("Running tag enforcement check...");
// Find untagged resources
let untagged_resources = find_untagged_resources(&tagging_client, &required_tags).await?;
// Find expiring resources
let expiring_resources = find_expiring_resources(&tagging_client).await?;
// Take action on untagged resources
if !opt.dry_run && !untagged_resources.is_empty() {
handle_untagged_resources(&ec2_client, &untagged_resources).await?;
}
// Send notification
if !untagged_resources.is_empty() || !expiring_resources.is_empty() {
send_notification(
&sns_client,
&opt.sns_topic_arn,
&untagged_resources,
&expiring_resources,
opt.dry_run,
).await?;
}
// Wait for next run
println!("Tag enforcement check completed. Waiting for next run...");
time::sleep(time::Duration::from_secs(24 * 60 * 60)).await;
}
}
async fn find_untagged_resources(
client: &TaggingClient,
required_tags: &[String],
) -> Result<Vec<UntaggedResource>, TaggingError> {
let mut untagged_resources = Vec::new();
let mut pagination_token = None;
loop {
let resp = client
.get_resources()
.set_pagination_token(pagination_token)
.send()
.await?;
if let Some(resources) = resp.resource_tag_mapping_list() {
for resource in resources {
let resource_arn = match resource.resource_arn() {
Some(arn) => arn,
None => continue,
};
let tags = resource.tags().unwrap_or_default();
let tag_keys: Vec<&str> = tags.iter().map(|t| t.key().unwrap_or_default()).collect();
let mut missing_tags = Vec::new();
for required_tag in required_tags {
if !tag_keys.contains(&required_tag.as_str()) {
missing_tags.push(required_tag.clone());
}
}
if !missing_tags.is_empty() {
let mut existing_tags = HashMap::new();
for tag in tags {
if let (Some(key), Some(value)) = (tag.key(), tag.value()) {
existing_tags.insert(key.to_string(), value.to_string());
}
}
let resource_type = resource_arn.split(':').nth(2).unwrap_or("unknown").to_string();
untagged_resources.push(UntaggedResource {
resource_arn: resource_arn.to_string(),
resource_type,
missing_tags,
existing_tags,
});
}
}
}
pagination_token = resp.pagination_token().map(|s| s.to_string());
if pagination_token.is_none() {
break;
}
}
Ok(untagged_resources)
}
async fn find_expiring_resources(
client: &TaggingClient,
) -> Result<Vec<ExpiringResource>, TaggingError> {
let mut expiring_resources = Vec::new();
let mut pagination_token = None;
loop {
let resp = client
.get_resources()
.set_pagination_token(pagination_token)
.send()
.await?;
if let Some(resources) = resp.resource_tag_mapping_list() {
for resource in resources {
let resource_arn = match resource.resource_arn() {
Some(arn) => arn,
None => continue,
};
let tags = resource.tags().unwrap_or_default();
// Check for Expiration tag
for tag in tags {
if let (Some(key), Some(value)) = (tag.key(), tag.value()) {
if key == "Expiration" {
// Parse expiration date
if let Ok(expiration_date) = chrono::NaiveDate::parse_from_str(value, "%Y-%m-%d") {
let expiration_datetime = DateTime::<Utc>::from_utc(
expiration_date.and_hms(0, 0, 0),
Utc,
);
let now = Utc::now();
// Calculate days until expiration
let days_until_expiration = (expiration_datetime - now).num_days();
// If expiring within 7 days, add to list
if days_until_expiration >= 0 && days_until_expiration <= 7 {
let resource_type = resource_arn.split(':').nth(2).unwrap_or("unknown").to_string();
expiring_resources.push(ExpiringResource {
resource_arn: resource_arn.to_string(),
resource_type,
expiration_date: value.to_string(),
days_until_expiration,
});
}
}
}
}
}
}
}
pagination_token = resp.pagination_token().map(|s| s.to_string());
if pagination_token.is_none() {
break;
}
}
Ok(expiring_resources)
}
async fn handle_untagged_resources(
client: &Ec2Client,
untagged_resources: &[UntaggedResource],
) -> Result<(), Ec2Error> {
for resource in untagged_resources {
// For EC2 instances, stop if missing required tags
if resource.resource_type == "ec2" && resource.resource_arn.contains(":instance/") {
let instance_id = resource.resource_arn.split('/').last().unwrap_or_default();
println!("Stopping untagged EC2 instance: {}", instance_id);
client
.stop_instances()
.instance_ids(instance_id)
.send()
.await?;
}
}
Ok(())
}
async fn send_notification(
client: &SnsClient,
topic_arn: &str,
untagged_resources: &[UntaggedResource],
expiring_resources: &[ExpiringResource],
dry_run: bool,
) -> Result<(), SnsError> {
let mut message = String::new();
message.push_str("AWS Resource Tag Enforcement Report\n\n");
if dry_run {
message.push_str("*** DRY RUN MODE - No actions taken ***\n\n");
}
// Add untagged resources
if !untagged_resources.is_empty() {
message.push_str(&format!("Untagged Resources ({})\n", untagged_resources.len()));
message.push_str("====================\n");
for resource in untagged_resources {
message.push_str(&format!("Resource: {}\n", resource.resource_arn));
message.push_str(&format!("Type: {}\n", resource.resource_type));
message.push_str(&format!("Missing Tags: {:?}\n", resource.missing_tags));
message.push_str(&format!("Existing Tags: {:?}\n", resource.existing_tags));
if resource.resource_type == "ec2" && resource.resource_arn.contains(":instance/") && !dry_run {
message.push_str("Action: Instance stopped due to missing required tags\n");
}
message.push_str("\n");
}
}
// Add expiring resources
if !expiring_resources.is_empty() {
message.push_str(&format!("Expiring Resources ({})\n", expiring_resources.len()));
message.push_str("====================\n");
for resource in expiring_resources {
message.push_str(&format!("Resource: {}\n", resource.resource_arn));
message.push_str(&format!("Type: {}\n", resource.resource_type));
message.push_str(&format!("Expiration Date: {}\n", resource.expiration_date));
message.push_str(&format!("Days Until Expiration: {}\n", resource.days_until_expiration));
message.push_str("\n");
}
}
// Send SNS notification
client
.publish()
.topic_arn(topic_arn)
.subject("AWS Resource Tag Enforcement Report")
.message(message)
.send()
.await?;
println!("Notification sent to SNS topic: {}", topic_arn);
Ok(())
}
Lessons Learned:
Proactive cost management requires both automated monitoring and proper resource lifecycle policies.
How to Avoid:
Implement mandatory resource tagging with ownership and expiration information.
Set up automated cleanup of orphaned resources.
Configure proper autoscaling policies with scale-down rules.
Use cost anomaly detection with alerting (see the sketch after this list).
Implement resource quotas and budget alerts.
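For the anomaly-detection recommendation, a minimal sketch that creates a service-level anomaly monitor plus a daily e-mail subscription via the AWS CLI (the names, threshold, and address are placeholders; newer API versions may expect a ThresholdExpression instead of the plain Threshold field):
# create_anomaly_alerting.sh - dimensional anomaly monitor and a daily e-mail subscription (values are placeholders)
MONITOR_ARN=$(aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName": "service-spend-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}' \
  --query MonitorArn --output text)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "daily-anomaly-alerts",
    "MonitorArnList": ["'"$MONITOR_ARN"'"],
    "Subscribers": [{"Type": "EMAIL", "Address": "finops@example.com"}],
    "Threshold": 100,
    "Frequency": "DAILY"
  }'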
No summary provided
What Happened:
After deploying a new feature, the company's cloud costs tripled overnight. The finance team raised an urgent alert when they saw the preliminary billing report. The spike occurred despite no significant increase in user traffic or application load.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify cost drivers.
Reviewed recent infrastructure changes and deployments.
Examined auto-scaling configurations and scaling events.
Analyzed application metrics and logs for unusual patterns.
Checked for potential security incidents or unauthorized resource usage.
Root Cause:
Multiple issues contributed to the cost spike:
1. The Horizontal Pod Autoscaler (HPA) was configured with overly aggressive scaling parameters.
2. Missing scaling limits allowed unbounded scale-out during brief traffic spikes.
3. A monitoring agent was generating artificial CPU load, triggering unnecessary scaling.
4. The Cluster Autoscaler was configured to scale up quickly but scale down slowly.
5. Unused resources (EBS volumes, load balancers) were not being cleaned up.
Fix/Workaround:
• Short-term: Implemented immediate cost controls with an optimized HPA configuration (see the sketch after this list)
• Fixed Cluster Autoscaler configuration to scale down more efficiently
• Implemented resource tagging and lifecycle policies for unused resources
• Adjusted monitoring agent configuration to reduce CPU overhead
• Long-term: Implemented a comprehensive cloud cost optimization strategy with automated reporting and alerting
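The short-term HPA change was along the lines of the following sketch (names, replica bounds, and thresholds are illustrative rather than the production manifest). Explicit min/max replica limits and a scale-down stabilization window keep brief spikes from pinning the deployment at peak size, and the Cluster Autoscaler flags noted at the end let under-utilised nodes be reclaimed sooner:
# Apply a bounded HPA with conservative scale-down behaviour (names and numbers are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20   # hard ceiling; unbounded scale-out was a major cost driver
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
EOF
# Cluster Autoscaler flags for faster scale-down (illustrative values):
#   --scale-down-unneeded-time=10m --scale-down-utilization-threshold=0.5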
Lessons Learned:
Auto-scaling configurations require careful tuning to balance performance and cost.
How to Avoid:
Implement cost monitoring with alerts for unexpected spikes.
Set appropriate scaling limits in all auto-scaling configurations.
Regularly audit and clean up unused cloud resources.
Test auto-scaling behavior before deploying to production.
Use spot instances and reserved instances where appropriate.
No summary provided
What Happened:
During a monthly financial review, the finance team flagged a significant increase in cloud costs. The DevOps team was tasked with investigating the spike and found numerous orphaned resources including unused EBS volumes, idle RDS instances, and forgotten development environments that had been running for months without proper monitoring or cost allocation tags.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify cost anomalies.
Used AWS Cost and Usage Reports to break down costs by service and region.
Reviewed resource tagging compliance across all AWS accounts.
Ran AWS Trusted Advisor cost optimization checks.
Compared infrastructure-as-code definitions with actual deployed resources.
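Drift between the Terraform code and what is actually deployed can be surfaced with a refresh-only plan; the following is a sketch, and the environments/ directory layout is assumed for illustration:
#!/bin/bash
# drift_check.sh - report Terraform drift for each environment (directory layout is illustrative)
set -euo pipefail
for ENV_DIR in environments/*/; do
  echo "Checking drift in $ENV_DIR"
  terraform -chdir="$ENV_DIR" init -input=false >/dev/null
  # -detailed-exitcode: 0 = no changes, 2 = state differs from real infrastructure
  RC=0
  terraform -chdir="$ENV_DIR" plan -refresh-only -detailed-exitcode -input=false >/dev/null || RC=$?
  if [ "$RC" -eq 0 ]; then
    echo "  no drift detected"
  elif [ "$RC" -eq 2 ]; then
    echo "  DRIFT DETECTED - review with: terraform -chdir=$ENV_DIR plan -refresh-only"
  else
    echo "  terraform plan failed (exit $RC)" >&2
  fi
done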
Root Cause:
The investigation revealed multiple issues contributing to the cost spike:
1. Developers were creating temporary resources for testing but not cleaning them up.
2. Automated CI/CD pipelines were creating resources but failing to destroy them when builds failed.
3. Many resources lacked proper ownership and project tags for cost allocation.
4. No automated process existed for identifying and removing unused resources.
5. Terraform state files were inconsistent with actual deployed resources.
Fix/Workaround:
• Short-term: Implemented immediate cost reduction measures:
#!/bin/bash
# cleanup_orphaned_resources.sh
# Script to identify and clean up orphaned AWS resources
# Set AWS region
AWS_REGION="us-west-2"
echo "Starting orphaned resource cleanup in region $AWS_REGION"
# Find and remove unattached EBS volumes
echo "Finding unattached EBS volumes..."
UNATTACHED_VOLUMES=$(aws ec2 describe-volumes \
--region $AWS_REGION \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
--output json)
VOLUME_COUNT=$(echo $UNATTACHED_VOLUMES | jq length)
echo "Found $VOLUME_COUNT unattached volumes"
if [ $VOLUME_COUNT -gt 0 ]; then
echo "Volumes to be removed:"
echo $UNATTACHED_VOLUMES | jq -r '.[] | "ID: \(.ID), Size: \(.Size)GB, Created: \(.Created)"'
# Ask for confirmation before deleting
read -p "Do you want to delete these volumes? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNATTACHED_VOLUMES | jq -r '.[].ID' | while read VOLUME_ID; do
echo "Deleting volume $VOLUME_ID"
aws ec2 delete-volume --region $AWS_REGION --volume-id $VOLUME_ID
done
fi
fi
# Find and remove unused Elastic IPs
echo "Finding unused Elastic IPs..."
UNUSED_EIPS=$(aws ec2 describe-addresses \
--region $AWS_REGION \
--query 'Addresses[?AssociationId==null]' \
--output json)
EIP_COUNT=$(echo $UNUSED_EIPS | jq length)
echo "Found $EIP_COUNT unused Elastic IPs"
if [ $EIP_COUNT -gt 0 ]; then
echo "Elastic IPs to be released:"
echo $UNUSED_EIPS | jq -r '.[] | "ID: \(.AllocationId), IP: \(.PublicIp)"'
# Ask for confirmation before releasing
read -p "Do you want to release these Elastic IPs? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNUSED_EIPS | jq -r '.[].AllocationId' | while read EIP_ID; do
echo "Releasing Elastic IP $EIP_ID"
aws ec2 release-address --region $AWS_REGION --allocation-id $EIP_ID
done
fi
fi
# Find and remove old snapshots
echo "Finding old EBS snapshots..."
# Get snapshots older than 90 days that are not used by AMIs
NINETY_DAYS_AGO=$(date -d "90 days ago" +%Y-%m-%dT%H:%M:%S)
OLD_SNAPSHOTS=$(aws ec2 describe-snapshots \
--region $AWS_REGION \
--owner-ids self \
--query "Snapshots[?StartTime<='$NINETY_DAYS_AGO']" \
--output json)
# Filter out snapshots used by AMIs
AMI_SNAPSHOTS=$(aws ec2 describe-images \
--region $AWS_REGION \
--owners self \
--query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' \
--output json)
UNUSED_OLD_SNAPSHOTS=$(echo $OLD_SNAPSHOTS | jq --argjson ami_snaps "$AMI_SNAPSHOTS" '[.[] | select(.SnapshotId as $snap_id | $ami_snaps | index($snap_id) | not)]')
SNAPSHOT_COUNT=$(echo $UNUSED_OLD_SNAPSHOTS | jq length)
echo "Found $SNAPSHOT_COUNT old unused snapshots"
if [ $SNAPSHOT_COUNT -gt 0 ]; then
echo "Old snapshots to be removed:"
echo $UNUSED_OLD_SNAPSHOTS | jq -r '.[] | "ID: \(.SnapshotId), Created: \(.StartTime), Size: \(.VolumeSize)GB"'
# Ask for confirmation before deleting
read -p "Do you want to delete these snapshots? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo $UNUSED_OLD_SNAPSHOTS | jq -r '.[].SnapshotId' | while read SNAPSHOT_ID; do
echo "Deleting snapshot $SNAPSHOT_ID"
aws ec2 delete-snapshot --region $AWS_REGION --snapshot-id $SNAPSHOT_ID
done
fi
fi
# Find idle RDS instances (low connection count)
echo "Finding potentially idle RDS instances..."
aws rds describe-db-instances \
--region $AWS_REGION \
--query 'DBInstances[*].{ID:DBInstanceIdentifier,Class:DBInstanceClass,Engine:Engine,Status:DBInstanceStatus}' \
--output table
echo "To check CloudWatch metrics for connection counts, run:"
echo "aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name DatabaseConnections --dimensions Name=DBInstanceIdentifier,Value=<instance-id> --start-time $(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average"
# Find and report on untagged resources
echo "Finding resources missing required tags..."
REQUIRED_TAGS="Owner,Project,Environment"
# Check EC2 instances
echo "Checking EC2 instances for missing tags..."
aws ec2 describe-instances \
--region $AWS_REGION \
--query 'Reservations[].Instances[?!not_null(Tags[?Key==`Owner`].Value|[0]) || !not_null(Tags[?Key==`Project`].Value|[0]) || !not_null(Tags[?Key==`Environment`].Value|[0])].[InstanceId,InstanceType,State.Name]' \
--output table
# Check EBS volumes
echo "Checking EBS volumes for missing tags..."
aws ec2 describe-volumes \
--region $AWS_REGION \
--query 'Volumes[?!not_null(Tags[?Key==`Owner`].Value|[0]) || !not_null(Tags[?Key==`Project`].Value|[0]) || !not_null(Tags[?Key==`Environment`].Value|[0])].[VolumeId,Size,State]' \
--output table
echo "Cleanup script completed."
• Implemented a Go-based cloud resource analyzer:
// cloud_resource_analyzer.go
package main
import (
"context"
"encoding/json"
"flag"
"fmt"
"log"
"os"
"sort"
"strings"
"sync"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
"github.com/aws/aws-sdk-go-v2/service/ec2"
ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
"github.com/aws/aws-sdk-go-v2/service/rds"
rdstypes "github.com/aws/aws-sdk-go-v2/service/rds/types"
"github.com/aws/aws-sdk-go-v2/service/s3"
"github.com/olekukonko/tablewriter"
)
// ResourceInfo represents information about a cloud resource
type ResourceInfo struct {
ResourceID string
ResourceType string
Region string
Account string
Size string
State string
CreatedAt time.Time
LastUsed time.Time
Tags map[string]string
EstimatedCost float64
Utilization float64
Recommendation string
}
// ResourceAnalyzer analyzes cloud resources
type ResourceAnalyzer struct {
cfg aws.Config
regions []string
requiredTags []string
resourceChan chan ResourceInfo
wg sync.WaitGroup
ctx context.Context
}
func main() {
// Parse command line flags
regionFlag := flag.String("region", "", "AWS region (comma-separated for multiple regions)")
outputFlag := flag.String("output", "table", "Output format (table, json, csv)")
tagsFlag := flag.String("required-tags", "Owner,Project,Environment", "Required tags (comma-separated)")
daysFlag := flag.Int("days", 30, "Number of days to consider a resource unused")
flag.Parse()
// Set up regions
var regions []string
if *regionFlag != "" {
regions = strings.Split(*regionFlag, ",")
} else {
// Default to common regions
regions = []string{"us-east-1", "us-west-2", "eu-west-1"}
}
// Set up required tags
requiredTags := strings.Split(*tagsFlag, ",")
// Initialize analyzer
analyzer, err := NewResourceAnalyzer(regions, requiredTags)
if err != nil {
log.Fatalf("Failed to initialize resource analyzer: %v", err)
}
// Analyze resources
resources, err := analyzer.AnalyzeResources(*daysFlag)
if err != nil {
log.Fatalf("Failed to analyze resources: %v", err)
}
// Output results
switch *outputFlag {
case "json":
outputJSON(resources)
case "csv":
outputCSV(resources)
default:
outputTable(resources)
}
// Print summary
printSummary(resources)
}
// NewResourceAnalyzer creates a new resource analyzer
func NewResourceAnalyzer(regions, requiredTags []string) (*ResourceAnalyzer, error) {
// Load AWS configuration
cfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
return &ResourceAnalyzer{
cfg: cfg,
regions: regions,
requiredTags: requiredTags,
resourceChan: make(chan ResourceInfo, 100),
ctx: context.Background(),
}, nil
}
// AnalyzeResources analyzes all resources across regions
func (ra *ResourceAnalyzer) AnalyzeResources(unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
resultChan := make(chan []ResourceInfo)
errorChan := make(chan error)
// Start a goroutine to collect results; closing done signals that all results have been appended
done := make(chan struct{})
go func() {
for res := range resultChan {
resources = append(resources, res...)
}
close(done)
}()
// Process each region
for _, region := range ra.regions {
ra.wg.Add(1)
go func(region string) {
defer ra.wg.Done()
// Create region-specific config
regionCfg := ra.cfg.Copy()
regionCfg.Region = region
// Analyze EC2 resources
ec2Resources, err := ra.analyzeEC2Resources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze EC2 resources in %s: %w", region, err)
return
}
resultChan <- ec2Resources
// Analyze RDS resources
rdsResources, err := ra.analyzeRDSResources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze RDS resources in %s: %w", region, err)
return
}
resultChan <- rdsResources
// Analyze S3 resources (global, so only do this once)
if region == ra.regions[0] {
s3Resources, err := ra.analyzeS3Resources(regionCfg, unusedDays)
if err != nil {
errorChan <- fmt.Errorf("failed to analyze S3 resources: %w", err)
return
}
resultChan <- s3Resources
}
}(region)
}
// Wait for all goroutines to complete
go func() {
ra.wg.Wait()
close(resultChan)
close(errorChan)
}()
// Check for errors (errorChan is closed once all workers have finished)
for err := range errorChan {
return nil, err
}
// Wait for the collector goroutine to finish before returning the results
<-done
return resources, nil
}
// analyzeEC2Resources analyzes EC2 resources in a region
func (ra *ResourceAnalyzer) analyzeEC2Resources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create EC2 client
ec2Client := ec2.NewFromConfig(cfg)
// Get EC2 instances
instances, err := ec2Client.DescribeInstances(ra.ctx, &ec2.DescribeInstancesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe EC2 instances: %w", err)
}
// Process instances
for _, reservation := range instances.Reservations {
for _, instance := range reservation.Instances {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range instance.Tags {
tags[*tag.Key] = *tag.Value
}
// Get instance details
resourceInfo := ResourceInfo{
ResourceID: *instance.InstanceId,
ResourceType: "EC2 Instance",
Region: cfg.Region,
Size: string(instance.InstanceType),
State: string(instance.State.Name),
CreatedAt: *instance.LaunchTime,
Tags: tags,
}
// Check utilization
utilization, lastUsed, err := ra.getEC2Utilization(cfg, *instance.InstanceId, unusedDays)
if err != nil {
log.Printf("Warning: Failed to get utilization for instance %s: %v", *instance.InstanceId, err)
} else {
resourceInfo.Utilization = utilization
resourceInfo.LastUsed = lastUsed
}
// Estimate cost
resourceInfo.EstimatedCost = estimateEC2Cost(string(instance.InstanceType), cfg.Region)
// Generate recommendation
resourceInfo.Recommendation = generateEC2Recommendation(resourceInfo, unusedDays)
resources = append(resources, resourceInfo)
}
}
// Get EBS volumes
volumes, err := ec2Client.DescribeVolumes(ra.ctx, &ec2.DescribeVolumesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe EBS volumes: %w", err)
}
// Process volumes
for _, volume := range volumes.Volumes {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range volume.Tags {
tags[*tag.Key] = *tag.Value
}
// Get volume details
resourceInfo := ResourceInfo{
ResourceID: *volume.VolumeId,
ResourceType: "EBS Volume",
Region: cfg.Region,
Size: fmt.Sprintf("%d GB", *volume.Size),
State: string(volume.State),
CreatedAt: *volume.CreateTime,
Tags: tags,
}
// Check if volume is attached
isAttached := len(volume.Attachments) > 0
// Estimate cost
resourceInfo.EstimatedCost = estimateEBSCost(int(*volume.Size), string(volume.VolumeType), cfg.Region)
// Generate recommendation
if !isAttached {
resourceInfo.Recommendation = "Delete unattached volume"
} else {
resourceInfo.Recommendation = "Volume is in use"
}
resources = append(resources, resourceInfo)
}
// Get Elastic IPs
eips, err := ec2Client.DescribeAddresses(ra.ctx, &ec2.DescribeAddressesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe Elastic IPs: %w", err)
}
// Process Elastic IPs
for _, eip := range eips.Addresses {
// Convert tags to map
tags := make(map[string]string)
for _, tag := range eip.Tags {
tags[*tag.Key] = *tag.Value
}
// Get EIP details
resourceInfo := ResourceInfo{
ResourceID: *eip.AllocationId,
ResourceType: "Elastic IP",
Region: cfg.Region,
State: "allocated",
Tags: tags,
}
// Check if EIP is associated
isAssociated := eip.AssociationId != nil
// Estimate cost (only charged if not associated with running instance)
if !isAssociated {
resourceInfo.EstimatedCost = 3.6 // $3.6/month for unused EIP
resourceInfo.Recommendation = "Release unused Elastic IP"
} else {
resourceInfo.Recommendation = "Elastic IP is in use"
}
resources = append(resources, resourceInfo)
}
return resources, nil
}
// analyzeRDSResources analyzes RDS resources in a region
func (ra *ResourceAnalyzer) analyzeRDSResources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create RDS client
rdsClient := rds.NewFromConfig(cfg)
// Get RDS instances
instances, err := rdsClient.DescribeDBInstances(ra.ctx, &rds.DescribeDBInstancesInput{})
if err != nil {
return nil, fmt.Errorf("failed to describe RDS instances: %w", err)
}
// Process instances
for _, instance := range instances.DBInstances {
// Convert tags to map
tags := make(map[string]string)
tagList, err := rdsClient.ListTagsForResource(ra.ctx, &rds.ListTagsForResourceInput{
ResourceName: instance.DBInstanceArn,
})
if err == nil {
for _, tag := range tagList.TagList {
tags[*tag.Key] = *tag.Value
}
}
// Get instance details
resourceInfo := ResourceInfo{
ResourceID: *instance.DBInstanceIdentifier,
ResourceType: "RDS Instance",
Region: cfg.Region,
Size: *instance.DBInstanceClass,
State: *instance.DBInstanceStatus,
CreatedAt: *instance.InstanceCreateTime,
Tags: tags,
}
// Check utilization
utilization, lastUsed, err := ra.getRDSUtilization(cfg, *instance.DBInstanceIdentifier, unusedDays)
if err != nil {
log.Printf("Warning: Failed to get utilization for RDS instance %s: %v", *instance.DBInstanceIdentifier, err)
} else {
resourceInfo.Utilization = utilization
resourceInfo.LastUsed = lastUsed
}
// Estimate cost
resourceInfo.EstimatedCost = estimateRDSCost(*instance.DBInstanceClass, *instance.Engine, cfg.Region)
// Generate recommendation
resourceInfo.Recommendation = ra.generateRDSRecommendation(resourceInfo, unusedDays)
resources = append(resources, resourceInfo)
}
return resources, nil
}
// analyzeS3Resources analyzes S3 resources
func (ra *ResourceAnalyzer) analyzeS3Resources(cfg aws.Config, unusedDays int) ([]ResourceInfo, error) {
var resources []ResourceInfo
// Create S3 client
s3Client := s3.NewFromConfig(cfg)
// Get S3 buckets
buckets, err := s3Client.ListBuckets(ra.ctx, &s3.ListBucketsInput{})
if err != nil {
return nil, fmt.Errorf("failed to list S3 buckets: %w", err)
}
// Process buckets
for _, bucket := range buckets.Buckets {
// Get bucket location
location, err := s3Client.GetBucketLocation(ra.ctx, &s3.GetBucketLocationInput{
Bucket: bucket.Name,
})
if err != nil {
log.Printf("Warning: Failed to get location for bucket %s: %v", *bucket.Name, err)
continue
}
region := "us-east-1" // Default region
if location.LocationConstraint != "" {
region = string(location.LocationConstraint)
}
// Get bucket tags
tags := make(map[string]string)
tagging, err := s3Client.GetBucketTagging(ra.ctx, &s3.GetBucketTaggingInput{
Bucket: bucket.Name,
})
if err == nil {
for _, tag := range tagging.TagSet {
tags[*tag.Key] = *tag.Value
}
}
// Get bucket details
resourceInfo := ResourceInfo{
ResourceID: *bucket.Name,
ResourceType: "S3 Bucket",
Region: region,
CreatedAt: *bucket.CreationDate,
Tags: tags,
}
// Check last access
lastUsed, err := ra.getS3LastAccess(cfg, *bucket.Name)
if err != nil {
log.Printf("Warning: Failed to get last access for bucket %s: %v", *bucket.Name, err)
} else {
resourceInfo.LastUsed = lastUsed
}
// Generate recommendation
if time.Since(resourceInfo.LastUsed).Hours() > float64(unusedDays*24) {
resourceInfo.Recommendation = "Consider deleting unused bucket"
} else {
resourceInfo.Recommendation = "Bucket is in use"
}
resources = append(resources, resourceInfo)
}
return resources, nil
}
// getEC2Utilization gets the CPU utilization of an EC2 instance
func (ra *ResourceAnalyzer) getEC2Utilization(cfg aws.Config, instanceID string, unusedDays int) (float64, time.Time, error) {
// Create CloudWatch client
cwClient := cloudwatch.NewFromConfig(cfg)
// Set up time range
endTime := time.Now()
startTime := endTime.AddDate(0, 0, -unusedDays)
// Get CPU utilization
result, err := cwClient.GetMetricStatistics(ra.ctx, &cloudwatch.GetMetricStatisticsInput{
Namespace: aws.String("AWS/EC2"),
MetricName: aws.String("CPUUtilization"),
Dimensions: []types.Dimension{
{
Name: aws.String("InstanceId"),
Value: aws.String(instanceID),
},
},
StartTime: aws.Time(startTime),
EndTime: aws.Time(endTime),
Period: aws.Int32(86400), // 1 day
Statistics: []types.Statistic{types.StatisticAverage},
})
if err != nil {
return 0, time.Time{}, err
}
// Process results
if len(result.Datapoints) == 0 {
return 0, time.Time{}, nil
}
// Sort datapoints by time
sort.Slice(result.Datapoints, func(i, j int) bool {
return result.Datapoints[i].Timestamp.After(*result.Datapoints[j].Timestamp)
})
// Calculate average utilization
var totalUtilization float64
for _, dp := range result.Datapoints {
totalUtilization += *dp.Average
}
avgUtilization := totalUtilization / float64(len(result.Datapoints))
// Get last used time (most recent datapoint with non-zero utilization)
var lastUsed time.Time
for _, dp := range result.Datapoints {
if *dp.Average > 1.0 { // Consider >1% CPU as "used"
lastUsed = *dp.Timestamp
break
}
}
return avgUtilization, lastUsed, nil
}
// getRDSUtilization gets the connection count of an RDS instance
func (ra *ResourceAnalyzer) getRDSUtilization(cfg aws.Config, instanceID string, unusedDays int) (float64, time.Time, error) {
// Create CloudWatch client
cwClient := cloudwatch.NewFromConfig(cfg)
// Set up time range
endTime := time.Now()
startTime := endTime.AddDate(0, 0, -unusedDays)
// Get database connections
result, err := cwClient.GetMetricStatistics(ra.ctx, &cloudwatch.GetMetricStatisticsInput{
Namespace: aws.String("AWS/RDS"),
MetricName: aws.String("DatabaseConnections"),
Dimensions: []types.Dimension{
{
Name: aws.String("DBInstanceIdentifier"),
Value: aws.String(instanceID),
},
},
StartTime: aws.Time(startTime),
EndTime: aws.Time(endTime),
Period: aws.Int32(86400), // 1 day
Statistics: []types.Statistic{types.StatisticAverage},
})
if err != nil {
return 0, time.Time{}, err
}
// Process results
if len(result.Datapoints) == 0 {
return 0, time.Time{}, nil
}
// Sort datapoints by time
sort.Slice(result.Datapoints, func(i, j int) bool {
return result.Datapoints[i].Timestamp.After(*result.Datapoints[j].Timestamp)
})
// Calculate average connections
var totalConnections float64
for _, dp := range result.Datapoints {
totalConnections += *dp.Average
}
avgConnections := totalConnections / float64(len(result.Datapoints))
// Get last used time (most recent datapoint with non-zero connections)
var lastUsed time.Time
for _, dp := range result.Datapoints {
if *dp.Average > 0 {
lastUsed = *dp.Timestamp
break
}
}
return avgConnections, lastUsed, nil
}
// getS3LastAccess gets the last access time of an S3 bucket
func (ra *ResourceAnalyzer) getS3LastAccess(cfg aws.Config, bucketName string) (time.Time, error) {
// In a real implementation, this would use S3 analytics or CloudTrail
// For this example, we'll return a random time within the last 90 days
days := rand.Intn(90)
return time.Now().AddDate(0, 0, -days), nil
}
// Helper functions for cost estimation and recommendations
func estimateEC2Cost(instanceType, region string) float64 {
// Simplified cost estimation based on instance type
// In a real implementation, this would use pricing API or a pricing database
switch {
case strings.HasPrefix(instanceType, "t2.micro"):
return 8.5
case strings.HasPrefix(instanceType, "t2.small"):
return 17.0
case strings.HasPrefix(instanceType, "t2.medium"):
return 34.0
case strings.HasPrefix(instanceType, "m5.large"):
return 69.0
case strings.HasPrefix(instanceType, "m5.xlarge"):
return 138.0
default:
return 50.0 // Default estimate
}
}
func estimateEBSCost(sizeGB int, volumeType, region string) float64 {
// Simplified cost estimation based on volume type and size
var pricePerGB float64
switch volumeType {
case "gp2", "gp3":
pricePerGB = 0.1
case "io1", "io2":
pricePerGB = 0.125
case "st1":
pricePerGB = 0.045
case "sc1":
pricePerGB = 0.025
default:
pricePerGB = 0.1
}
return float64(sizeGB) * pricePerGB
}
func estimateRDSCost(instanceClass, engine, region string) float64 {
// Simplified cost estimation based on instance class
switch {
case strings.HasPrefix(instanceClass, "db.t3.micro"):
return 12.5
case strings.HasPrefix(instanceClass, "db.t3.small"):
return 25.0
case strings.HasPrefix(instanceClass, "db.t3.medium"):
return 50.0
case strings.HasPrefix(instanceClass, "db.m5.large"):
return 120.0
case strings.HasPrefix(instanceClass, "db.m5.xlarge"):
return 240.0
default:
return 100.0 // Default estimate
}
}
func (ra *ResourceAnalyzer) generateEC2Recommendation(resource ResourceInfo, unusedDays int) string {
// Generate recommendation based on resource state and utilization
if resource.State == "stopped" {
return "Consider terminating stopped instance"
}
if resource.Utilization < 5.0 && time.Since(resource.LastUsed).Hours() > float64(unusedDays*24) {
return "Terminate idle instance (low CPU utilization)"
}
if resource.Utilization < 20.0 {
return "Consider downsizing instance (low CPU utilization)"
}
// Check for missing required tags
missingTags := []string{}
for _, tag := range ra.requiredTags {
if _, ok := resource.Tags[tag]; !ok {
missingTags = append(missingTags, tag)
}
}
if len(missingTags) > 0 {
return fmt.Sprintf("Add missing tags: %s", strings.Join(missingTags, ", "))
}
return "No action needed"
}
func (ra *ResourceAnalyzer) generateRDSRecommendation(resource ResourceInfo, unusedDays int) string {
// Generate recommendation based on resource state and utilization
if resource.Utilization < 1.0 && time.Since(resource.LastUsed).Hours() > float64(unusedDays*24) {
return "Consider deleting idle database (no connections)"
}
if resource.Utilization < 5.0 {
return "Consider downsizing instance (low connection count)"
}
// Check for missing required tags
missingTags := []string{}
for _, tag := range ra.requiredTags {
if _, ok := resource.Tags[tag]; !ok {
missingTags = append(missingTags, tag)
}
}
if len(missingTags) > 0 {
return fmt.Sprintf("Add missing tags: %s", strings.Join(missingTags, ", "))
}
return "No action needed"
}
// Output functions
func outputTable(resources []ResourceInfo) {
table := tablewriter.NewWriter(os.Stdout)
table.SetHeader([]string{"ID", "Type", "Region", "Size", "State", "Created", "Last Used", "Cost ($)", "Recommendation"})
for _, res := range resources {
lastUsedStr := ""
if !res.LastUsed.IsZero() {
lastUsedStr = res.LastUsed.Format("2006-01-02")
}
table.Append([]string{
res.ResourceID,
res.ResourceType,
res.Region,
res.Size,
res.State,
res.CreatedAt.Format("2006-01-02"),
lastUsedStr,
fmt.Sprintf("%.2f", res.EstimatedCost),
res.Recommendation,
})
}
table.Render()
}
func outputJSON(resources []ResourceInfo) {
data, err := json.MarshalIndent(resources, "", " ")
if err != nil {
log.Fatalf("Failed to marshal resources to JSON: %v", err)
}
fmt.Println(string(data))
}
func outputCSV(resources []ResourceInfo) {
fmt.Println("ID,Type,Region,Size,State,Created,LastUsed,Cost,Recommendation")
for _, res := range resources {
lastUsedStr := ""
if !res.LastUsed.IsZero() {
lastUsedStr = res.LastUsed.Format("2006-01-02")
}
fmt.Printf("%s,%s,%s,%s,%s,%s,%s,%.2f,%s\n",
res.ResourceID,
res.ResourceType,
res.Region,
res.Size,
res.State,
res.CreatedAt.Format("2006-01-02"),
lastUsedStr,
res.EstimatedCost,
res.Recommendation,
)
}
}
func printSummary(resources []ResourceInfo) {
// Calculate total cost
var totalCost float64
var potentialSavings float64
resourceCounts := make(map[string]int)
recommendationCounts := make(map[string]int)
for _, res := range resources {
totalCost += res.EstimatedCost
resourceCounts[res.ResourceType]++
// Count anything that is not "No action needed" or an "... is in use" note as a savings opportunity
if res.Recommendation != "No action needed" && !strings.HasSuffix(res.Recommendation, "is in use") {
potentialSavings += res.EstimatedCost
recommendationCounts[res.Recommendation]++
}
}
fmt.Println("\nSummary:")
fmt.Printf("Total resources: %d\n", len(resources))
fmt.Printf("Total monthly cost: $%.2f\n", totalCost)
// Guard against division by zero when no resources were found
savingsPercent := 0.0
if totalCost > 0 {
savingsPercent = (potentialSavings / totalCost) * 100
}
fmt.Printf("Potential monthly savings: $%.2f (%.1f%%)\n", potentialSavings, savingsPercent)
fmt.Println("\nResource Types:")
for resType, count := range resourceCounts {
fmt.Printf(" %s: %d\n", resType, count)
}
fmt.Println("\nTop Recommendations:")
for rec, count := range recommendationCounts {
if count > 0 {
fmt.Printf(" %s: %d resources\n", rec, count)
}
}
}
• Implemented a Terraform module for enforcing resource tagging:
# aws_resource_tagging.tf
variable "required_tags" {
description = "Required tags for all resources"
type = map(string)
default = {
Owner = "unknown"
Project = "unknown"
Environment = "unknown"
ManagedBy = "terraform"
}
}
variable "tag_enforcement_enabled" {
description = "Enable tag enforcement"
type = bool
default = true
}
# AWS Organizations policy to enforce tagging
resource "aws_organizations_policy" "tag_policy" {
count = var.tag_enforcement_enabled ? 1 : 0
name = "required-tags-policy"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "RequireTagsForEC2"
Effect = "Deny"
Action = ["ec2:RunInstances", "ec2:CreateVolume"]
Resource = ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
},
{
Sid = "RequireTagsForRDS"
Effect = "Deny"
Action = ["rds:CreateDBInstance"]
Resource = ["arn:aws:rds:*:*:db:*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
},
{
Sid = "RequireTagsForS3"
Effect = "Deny"
Action = ["s3:CreateBucket"]
Resource = ["arn:aws:s3:::*"]
Condition = {
"Null" = {
"aws:RequestTag/Owner" = "true"
"aws:RequestTag/Project" = "true"
"aws:RequestTag/Environment" = "true"
}
}
}
]
})
}
# Lambda function to check for untagged resources
resource "aws_lambda_function" "tag_compliance_checker" {
function_name = "tag-compliance-checker"
role = aws_iam_role.tag_compliance_checker.arn
handler = "index.handler"
runtime = "nodejs14.x"
timeout = 300
memory_size = 256
source_code_hash = filebase64sha256("${path.module}/tag_compliance_checker.zip")
filename = "${path.module}/tag_compliance_checker.zip"
environment {
variables = {
REQUIRED_TAGS = jsonencode(keys(var.required_tags))
SNS_TOPIC_ARN = aws_sns_topic.tag_compliance_alerts.arn
}
}
}
# IAM role for the Lambda function
resource "aws_iam_role" "tag_compliance_checker" {
name = "tag-compliance-checker-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
# IAM policy for the Lambda function
resource "aws_iam_policy" "tag_compliance_checker" {
name = "tag-compliance-checker-policy"
description = "Policy for tag compliance checker Lambda"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"ec2:DescribeInstances",
"ec2:DescribeVolumes",
"rds:DescribeDBInstances",
"s3:ListAllMyBuckets",
"s3:GetBucketTagging",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"sns:Publish"
]
Effect = "Allow"
Resource = "*"
}
]
})
}
# Attach policy to role
resource "aws_iam_role_policy_attachment" "tag_compliance_checker" {
role = aws_iam_role.tag_compliance_checker.name
policy_arn = aws_iam_policy.tag_compliance_checker.arn
}
# CloudWatch event rule to trigger the Lambda function daily
resource "aws_cloudwatch_event_rule" "tag_compliance_daily" {
name = "tag-compliance-daily-check"
description = "Trigger tag compliance check daily"
schedule_expression = "rate(1 day)"
}
# CloudWatch event target
resource "aws_cloudwatch_event_target" "tag_compliance_lambda" {
rule = aws_cloudwatch_event_rule.tag_compliance_daily.name
target_id = "tag-compliance-checker"
arn = aws_lambda_function.tag_compliance_checker.arn
}
# Permission for CloudWatch to invoke Lambda
resource "aws_lambda_permission" "allow_cloudwatch" {
statement_id = "AllowExecutionFromCloudWatch"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.tag_compliance_checker.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.tag_compliance_daily.arn
}
# SNS topic for alerts
resource "aws_sns_topic" "tag_compliance_alerts" {
name = "tag-compliance-alerts"
}
# Default tags for all resources
provider "aws" {
default_tags {
tags = var.required_tags
}
}
# Output the SNS topic ARN
output "tag_compliance_sns_topic_arn" {
value = aws_sns_topic.tag_compliance_alerts.arn
}
# Output the Lambda function ARN
output "tag_compliance_lambda_arn" {
value = aws_lambda_function.tag_compliance_checker.arn
}
• Long-term: Implemented a comprehensive cloud cost optimization strategy:
- Created a centralized tagging and cost allocation system
- Implemented automated resource cleanup for unused resources
- Developed a cost optimization dashboard
- Established clear procedures for resource provisioning
- Implemented monitoring and alerting for cost anomalies
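The anomaly alerting in the last item can be driven from the Cost Explorer anomaly APIs. A minimal sketch follows; the SNS topic name and the $50 impact threshold are illustrative rather than values from this incident:
# cost_anomaly_alert.py - a minimal sketch; the topic ARN and threshold are illustrative
import datetime
import os

import boto3

ce = boto3.client("ce")
sns = boto3.client("sns")

ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]  # hypothetical SNS topic for cost alerts
IMPACT_THRESHOLD = 50.0  # only alert on anomalies with more than $50 of total impact


def check_recent_anomalies(lookback_days: int = 7) -> None:
    end = datetime.date.today()
    start = end - datetime.timedelta(days=lookback_days)
    resp = ce.get_anomalies(
        DateInterval={"StartDate": start.isoformat(), "EndDate": end.isoformat()},
        TotalImpact={"NumericOperator": "GREATER_THAN", "StartValue": IMPACT_THRESHOLD},
    )
    for anomaly in resp.get("Anomalies", []):
        impact = anomaly["Impact"]["TotalImpact"]
        message = (
            f"Cost anomaly {anomaly['AnomalyId']}: ~${impact:.2f} total impact, "
            f"started {anomaly['AnomalyStartDate']}"
        )
        sns.publish(TopicArn=ALERT_TOPIC_ARN, Subject="Cost anomaly detected", Message=message)


if __name__ == "__main__":
    check_recent_anomalies()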
Lessons Learned:
Cloud cost management requires proactive monitoring and automated cleanup of unused resources.
How to Avoid:
Implement mandatory resource tagging for cost allocation.
Set up automated detection and cleanup of unused resources.
Establish clear ownership and lifecycle policies for cloud resources.
Implement cost anomaly detection and alerting.
Regularly review and optimize cloud resource utilization.
No summary provided
What Happened:
During a monthly financial review, the finance team flagged a significant increase in AWS costs across multiple accounts. The increase occurred gradually over several months but accelerated in the last billing cycle. Initial investigation showed no corresponding increase in application traffic or planned infrastructure expansion. The cost increase was spread across multiple services and accounts, making it difficult to identify the root cause through standard AWS Cost Explorer views.
Diagnosis Steps:
Analyzed detailed AWS Cost and Usage Reports (CUR) to identify cost anomalies.
Compared resource counts and types across multiple billing periods.
Used AWS Resource Explorer to identify resources across all accounts and regions.
Reviewed recent infrastructure deployments and changes.
Analyzed resource tagging compliance across the organization.
Root Cause:
The investigation revealed multiple issues contributing to the cost increase:
1. Numerous orphaned EBS volumes remained after EC2 instance termination
2. Development environments were provisioned but not decommissioned after project completion
3. Unused Elastic IPs were allocated but not attached to any resources
4. Several large RDS instances were over-provisioned despite minimal utilization
5. Multiple Lambda functions had excessive memory allocation and timeout settings
Fix/Workaround:
• Created a resource cleanup script to identify and remove unused resources
• Implemented a Terraform module for enforcing resource tagging:
# aws_tagging_policy.tf - Enforce resource tagging
module "tagging_policy" {
source = "./modules/tagging-policy"
required_tags = {
Environment = ["dev", "staging", "prod"]
Project = true
Owner = true
CostCenter = true
}
tag_enforcement_resources = [
"aws_instance",
"aws_volume",
"aws_db_instance",
"aws_lambda_function"
]
}
• Developed a cost optimization dashboard with resource utilization metrics (the cost feed behind it is sketched after this list)
• Implemented automated resource cleanup for development environments
• Created a cost allocation tagging strategy with enforcement
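The dashboard noted above needs a recurring cost feed. A minimal sketch of that data pull against the Cost Explorer API, with illustrative file and function names:
# cost_by_service.py - a minimal sketch of the recurring cost feed behind the dashboard
import datetime
import json

import boto3

ce = boto3.client("ce")


def monthly_cost_by_service(months: int = 3) -> dict:
    """Return unblended cost per AWS service for the last N full calendar months."""
    end = datetime.date.today().replace(day=1)
    start = end
    for _ in range(months):
        start = (start - datetime.timedelta(days=1)).replace(day=1)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    costs = {}
    for period in resp["ResultsByTime"]:
        month = period["TimePeriod"]["Start"]
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs.setdefault(month, {})[service] = round(amount, 2)
    return costs


if __name__ == "__main__":
    print(json.dumps(monthly_cost_by_service(), indent=2))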
Lessons Learned:
Proactive resource management and tagging are essential for cloud cost control.
How to Avoid:
Implement automated resource cleanup for orphaned and unused resources.
Enforce tagging policies for all cloud resources.
Set up cost anomaly detection with automated alerts.
Implement resource lifecycle policies for development environments.
Regularly review and right-size provisioned resources.
No summary provided
What Happened:
During a quarterly cloud cost review, the finance team identified that despite having purchased a large number of Reserved Instances across multiple AWS accounts, the company's overall cloud costs were higher than expected. Further investigation revealed that many RIs were underutilized or completely unused, while on-demand instances were being provisioned for workloads that could have been covered by existing RIs. The situation resulted in the company effectively paying twice for some resources - once for the unused RI commitment and again for the on-demand instances.
Diagnosis Steps:
Analyzed Reserved Instance utilization reports across all accounts.
Compared RI inventory with actual instance usage patterns.
Reviewed RI purchase history and decision-making process.
Examined instance type distribution across workloads.
Assessed RI sharing configuration across the organization.
Root Cause:
The investigation revealed multiple issues with RI management:
1. RIs were purchased based on historical usage without accounting for planned workload changes
2. RI sharing was not properly configured across all accounts in the organization
3. Instance type standardization was lacking, leading to fragmented RI coverage
4. No regular review process existed for RI utilization and optimization
5. Teams were provisioning resources without visibility into existing RI inventory
Fix/Workaround:
• Implemented immediate RI optimization actions
• Created a centralized RI management strategy
• Established regular RI utilization reviews
• Developed instance type standardization guidelines
• Implemented automated RI coverage monitoring
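A minimal sketch of the automated RI coverage monitoring from the last item, built on the Cost Explorer reservation APIs; the alert thresholds are illustrative:
# ri_coverage_report.py - a minimal sketch; the 80%/70% alert thresholds are illustrative
import datetime

import boto3

ce = boto3.client("ce")

UTILIZATION_ALERT_THRESHOLD = 80.0  # percent
COVERAGE_ALERT_THRESHOLD = 70.0     # percent


def last_month_ri_report() -> None:
    end = datetime.date.today().replace(day=1)
    start = (end - datetime.timedelta(days=1)).replace(day=1)
    period = {"Start": start.isoformat(), "End": end.isoformat()}

    utilization = ce.get_reservation_utilization(TimePeriod=period, Granularity="MONTHLY")
    for item in utilization["UtilizationsByTime"]:
        pct = float(item["Total"]["UtilizationPercentage"])
        if pct < UTILIZATION_ALERT_THRESHOLD:
            print(f"LOW RI UTILIZATION: {pct:.1f}% for month starting {item['TimePeriod']['Start']}")

    coverage = ce.get_reservation_coverage(TimePeriod=period, Granularity="MONTHLY")
    for item in coverage["CoveragesByTime"]:
        pct = float(item["Total"]["CoverageHours"]["CoverageHoursPercentage"])
        if pct < COVERAGE_ALERT_THRESHOLD:
            print(f"LOW RI COVERAGE: {pct:.1f}% for month starting {item['TimePeriod']['Start']}")


if __name__ == "__main__":
    last_month_ri_report()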
Lessons Learned:
Reserved Instance management requires ongoing optimization and organization-wide visibility.
How to Avoid:
Implement centralized RI purchasing and management.
Configure proper RI sharing across all accounts in the organization.
Establish regular RI utilization reviews and optimization cycles.
Create instance type standardization guidelines for workloads.
Develop automated alerting for underutilized RIs and coverage gaps.
No summary provided
What Happened:
A large enterprise with multiple business units migrated to a multi-cloud environment. The finance team implemented a chargeback model to allocate cloud costs to different departments. However, after several months, they discovered that a significant portion of cloud resources (approximately 40%) were untagged or incorrectly tagged, making accurate cost allocation impossible. This led to financial disputes between departments, budget overruns, and delayed cloud adoption initiatives.
Diagnosis Steps:
Analyzed resource tagging compliance across cloud accounts.
Reviewed tagging policies and enforcement mechanisms.
Examined resource provisioning workflows and automation.
Interviewed teams about tagging practices and challenges.
Assessed the cost allocation and reporting processes.
Root Cause:
The investigation revealed multiple issues with the tagging strategy:
1. Inconsistent tagging standards across different cloud platforms
2. Lack of automated tag validation during resource provisioning
3. Manual resource creation bypassing governance controls
4. Insufficient training and awareness about tagging importance
5. No regular auditing or remediation of untagged resources
Fix/Workaround:
• Implemented immediate tagging remediation for existing resources
• Created consistent cross-cloud tagging standards
• Developed automated tag validation and enforcement (see the compliance-scan sketch after this list)
• Established regular tagging compliance audits
• Improved cost allocation reporting for untagged resources
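A minimal sketch of the automated tag validation scan referenced above, using the Resource Groups Tagging API; the required tag keys mirror those used elsewhere in this document and would be adjusted per organization:
# tag_compliance_scan.py - a minimal sketch of an untagged-resource report
import boto3

REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}  # assumed required keys


def find_non_compliant_resources(region: str) -> list:
    tagging = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = tagging.get_paginator("get_resources")
    non_compliant = []
    for page in paginator.paginate(ResourcesPerPage=100):
        for mapping in page["ResourceTagMappingList"]:
            present = {tag["Key"] for tag in mapping.get("Tags", [])}
            missing = REQUIRED_TAGS - present
            if missing:
                non_compliant.append(
                    {"arn": mapping["ResourceARN"], "missing_tags": sorted(missing)}
                )
    return non_compliant


if __name__ == "__main__":
    for item in find_non_compliant_resources("us-east-1"):
        print(f"{item['arn']} is missing tags: {', '.join(item['missing_tags'])}")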
Lessons Learned:
Effective cloud cost allocation requires consistent tagging governance and automation.
How to Avoid:
Implement automated tag validation during resource provisioning.
Create consistent tagging standards across all cloud platforms.
Establish regular tagging compliance audits and remediation.
Provide comprehensive training on tagging importance and practices.
Develop fallback allocation methods for untagged resources.
No summary provided
What Happened:
A mid-sized SaaS company noticed their cloud bill had increased by 35% over three months without a corresponding increase in customer usage or new deployments. The finance team flagged the issue during quarterly budget reviews, but the engineering team couldn't immediately identify the cause. Initial investigations focused on active production workloads, but these showed normal resource utilization. Further investigation revealed numerous orphaned resources across multiple cloud providers, including unused load balancers, detached storage volumes, idle database instances, and test environments that were never decommissioned after project completions.
Diagnosis Steps:
Analyzed billing data across all cloud providers for the past six months.
Created resource inventory reports categorized by service type, account, and team.
Cross-referenced resources with active projects and applications.
Examined resource creation patterns and identified resources without proper ownership tags.
Reviewed infrastructure-as-code repositories to identify resources created outside of the standard process.
Root Cause:
The investigation revealed multiple issues contributing to orphaned resources:
1. Lack of a consistent resource tagging strategy across cloud providers
2. No automated process for decommissioning test and development environments
3. Manual resource creation outside of infrastructure-as-code workflows
4. Incomplete ownership and lifecycle metadata for cloud resources
5. Absence of regular cost reviews and resource pruning processes
Fix/Workaround:
• Implemented immediate cost reduction measures
• Identified and terminated clearly unused resources (saving ~20% of costs)
• Established a resource tagging policy and enforcement mechanism
• Created an automated orphaned resource detection system
• Implemented regular cost review processes with team accountability
# Resource Tagging Policy Implementation
# File: resource_tagging_policy.tf
# AWS Provider Configuration with Default Tags
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = var.environment
Project = var.project_name
Owner = var.team_email
CostCenter = var.cost_center
CreatedBy = "Terraform"
CreationDate = timestamp()
ExpirationDate = var.environment == "production" ? null : timeadd(timestamp(), var.resource_ttl)
}
}
}
# Azure Provider with Required Tags Policy
resource "azurerm_policy_definition" "require_tags" {
name = "require-resource-tags"
display_name = "Require specified tags on all resources"
description = "This policy ensures all resources have the required tags"
policy_type = "Custom"
mode = "Indexed"
metadata = <<METADATA
{
"version": "1.0.0",
"category": "Tags"
}
METADATA
policy_rule = <<POLICY_RULE
{
"if": {
"anyOf": [
{
"field": "tags['Environment']",
"exists": "false"
},
{
"field": "tags['Project']",
"exists": "false"
},
{
"field": "tags['Owner']",
"exists": "false"
},
{
"field": "tags['CostCenter']",
"exists": "false"
},
{
"field": "tags['CreatedBy']",
"exists": "false"
},
{
"field": "tags['CreationDate']",
"exists": "false"
},
{
"allOf": [
{
"field": "tags['ExpirationDate']",
"exists": "false"
},
{
"field": "tags['Environment']",
"notEquals": "production"
}
]
}
]
},
"then": {
"effect": "deny"
}
}
POLICY_RULE
}
# GCP Organization Policy for Required Labels
resource "google_organization_policy" "require_labels" {
org_id = var.organization_id
constraint = "constraints/compute.requireLabels"
list_policy {
allow {
values = [
"Environment",
"Project",
"Owner",
"CostCenter",
"CreatedBy",
"CreationDate",
"ExpirationDate"
]
}
}
}
# Automated Resource Expiration Lambda Function
resource "aws_lambda_function" "resource_expiration_checker" {
function_name = "resource-expiration-checker"
role = aws_iam_role.resource_expiration_checker.arn
handler = "index.handler"
runtime = "nodejs14.x"
timeout = 300
memory_size = 256
environment {
variables = {
NOTIFICATION_SNS_TOPIC = aws_sns_topic.resource_expiration_notifications.arn
DRY_RUN = "false"
GRACE_PERIOD_DAYS = "7"
}
}
filename = "${path.module}/lambda/resource_expiration_checker.zip"
source_code_hash = filebase64sha256("${path.module}/lambda/resource_expiration_checker.zip")
}
# CloudWatch Event Rule to trigger the Lambda daily
resource "aws_cloudwatch_event_rule" "daily_resource_check" {
name = "daily-resource-expiration-check"
description = "Triggers the resource expiration checker Lambda daily"
schedule_expression = "cron(0 1 * * ? *)"
}
resource "aws_cloudwatch_event_target" "check_resources_daily" {
rule = aws_cloudwatch_event_rule.daily_resource_check.name
target_id = "resource_expiration_checker"
arn = aws_lambda_function.resource_expiration_checker.arn
}
# SNS Topic for notifications
resource "aws_sns_topic" "resource_expiration_notifications" {
name = "resource-expiration-notifications"
}
# Subscription for the team
resource "aws_sns_topic_subscription" "team_email" {
topic_arn = aws_sns_topic.resource_expiration_notifications.arn
protocol = "email"
endpoint = var.team_email
}
// AWS Lambda Function for Resource Expiration Checking
// File: resource_expiration_checker.js
const AWS = require('aws-sdk');
const moment = require('moment');
// Initialize AWS clients
const ec2 = new AWS.EC2();
const rds = new AWS.RDS();
const s3 = new AWS.S3();
const elasticache = new AWS.ElastiCache();
const sns = new AWS.SNS();
// Configuration from environment variables
const SNS_TOPIC = process.env.NOTIFICATION_SNS_TOPIC;
const DRY_RUN = process.env.DRY_RUN === 'true';
const GRACE_PERIOD_DAYS = parseInt(process.env.GRACE_PERIOD_DAYS || '7', 10);
exports.handler = async (event) => {
console.log('Starting resource expiration check');
// Track resources for reporting
const report = {
expiredResources: [],
expiringResources: [],
errors: []
};
try {
// Check EC2 instances
await checkEC2Instances(report);
// Check EBS volumes
await checkEBSVolumes(report);
// Check RDS instances
await checkRDSInstances(report);
// Check ElastiCache clusters
await checkElastiCacheClusters(report);
// Check S3 buckets
await checkS3Buckets(report);
// Send notification with report
await sendReport(report);
return {
statusCode: 200,
body: JSON.stringify({
message: 'Resource expiration check completed',
expiredCount: report.expiredResources.length,
expiringCount: report.expiringResources.length,
errorCount: report.errors.length
})
};
} catch (error) {
console.error('Error in resource expiration check:', error);
return {
statusCode: 500,
body: JSON.stringify({ error: error.message })
};
}
};
async function checkEC2Instances(report) {
console.log('Checking EC2 instances');
try {
const { Reservations } = await ec2.describeInstances({}).promise();
for (const reservation of Reservations) {
for (const instance of reservation.Instances) {
try {
// Skip terminated instances
if (instance.State.Name === 'terminated') continue;
const tags = instance.Tags || [];
const expirationTag = tags.find(tag => tag.Key === 'ExpirationDate');
if (expirationTag && expirationTag.Value) {
const expirationDate = moment(expirationTag.Value);
const now = moment();
if (expirationDate.isBefore(now)) {
// Resource is expired
report.expiredResources.push({
type: 'EC2 Instance',
id: instance.InstanceId,
expirationDate: expirationTag.Value,
action: DRY_RUN ? 'Would terminate' : 'Terminating'
});
if (!DRY_RUN) {
await ec2.terminateInstances({
InstanceIds: [instance.InstanceId]
}).promise();
}
} else if (expirationDate.isBefore(now.clone().add(GRACE_PERIOD_DAYS, 'days'))) { // clone so "now" is not mutated before diff() below
// Resource is expiring soon
report.expiringResources.push({
type: 'EC2 Instance',
id: instance.InstanceId,
expirationDate: expirationTag.Value,
daysRemaining: expirationDate.diff(now, 'days')
});
}
}
} catch (instanceError) {
report.errors.push({
type: 'EC2 Instance',
id: instance.InstanceId,
error: instanceError.message
});
}
}
}
} catch (error) {
report.errors.push({
type: 'EC2 Service',
error: error.message
});
}
}
// Similar functions for other resource types
// checkEBSVolumes, checkRDSInstances, checkElastiCacheClusters, checkS3Buckets
// Implementation omitted for brevity
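// A minimal sketch of one of the omitted checkers (checkEBSVolumes), following the same
// ExpirationDate tag convention as checkEC2Instances; the remaining checkers look similar.
async function checkEBSVolumes(report) {
console.log('Checking EBS volumes');
try {
const { Volumes } = await ec2.describeVolumes({}).promise();
for (const volume of Volumes) {
try {
const tags = volume.Tags || [];
const expirationTag = tags.find(tag => tag.Key === 'ExpirationDate');
if (!expirationTag || !expirationTag.Value) continue;
const expirationDate = moment(expirationTag.Value);
const now = moment();
if (expirationDate.isBefore(now)) {
// Only unattached ("available") volumes can be deleted directly
const canDelete = volume.State === 'available';
report.expiredResources.push({
type: 'EBS Volume',
id: volume.VolumeId,
expirationDate: expirationTag.Value,
action: DRY_RUN ? 'Would delete' : (canDelete ? 'Deleting' : 'Skipping (still attached)')
});
if (!DRY_RUN && canDelete) {
await ec2.deleteVolume({ VolumeId: volume.VolumeId }).promise();
}
} else if (expirationDate.isBefore(now.clone().add(GRACE_PERIOD_DAYS, 'days'))) {
report.expiringResources.push({
type: 'EBS Volume',
id: volume.VolumeId,
expirationDate: expirationTag.Value,
daysRemaining: expirationDate.diff(now, 'days')
});
}
} catch (volumeError) {
report.errors.push({
type: 'EBS Volume',
id: volume.VolumeId,
error: volumeError.message
});
}
}
} catch (error) {
report.errors.push({
type: 'EBS Service',
error: error.message
});
}
}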
async function sendReport(report) {
if (report.expiredResources.length === 0 &&
report.expiringResources.length === 0 &&
report.errors.length === 0) {
console.log('No resources to report');
return;
}
let message = 'Resource Expiration Report\n\n';
if (report.expiredResources.length > 0) {
message += `Expired Resources (${report.expiredResources.length}):\n`;
report.expiredResources.forEach(resource => {
message += `- ${resource.type} ${resource.id}: Expired on ${resource.expirationDate}, ${resource.action}\n`;
});
message += '\n';
}
if (report.expiringResources.length > 0) {
message += `Expiring Soon (${report.expiringResources.length}):\n`;
report.expiringResources.forEach(resource => {
message += `- ${resource.type} ${resource.id}: Expires in ${resource.daysRemaining} days\n`;
});
message += '\n';
}
if (report.errors.length > 0) {
message += `Errors (${report.errors.length}):\n`;
report.errors.forEach(error => {
message += `- ${error.type}${error.id ? ' ' + error.id : ''}: ${error.error}\n`;
});
}
await sns.publish({
TopicArn: SNS_TOPIC,
Subject: `Resource Expiration Report - ${DRY_RUN ? 'DRY RUN' : 'LIVE RUN'}`,
Message: message
}).promise();
console.log('Report sent to SNS topic');
}
# Cloud Cost Analysis and Orphaned Resource Detection
# File: orphaned_resource_detector.py
import argparse
import boto3
import csv
import datetime
import json
import logging
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.storage import StorageManagementClient
from google.cloud import compute_v1
from google.cloud import storage
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class OrphanedResourceDetector:
def __init__(self, output_dir, days_threshold=30):
self.output_dir = output_dir
self.days_threshold = days_threshold
self.orphaned_resources = []
# Ensure output directory exists
os.makedirs(output_dir, exist_ok=True)
def detect_all(self):
"""Run detection across all cloud providers"""
logger.info("Starting orphaned resource detection")
# AWS detection
self.detect_aws_orphaned_resources()
# Azure detection
self.detect_azure_orphaned_resources()
# GCP detection
self.detect_gcp_orphaned_resources()
# Generate reports
self.generate_reports()
logger.info(f"Detection complete. Found {len(self.orphaned_resources)} potentially orphaned resources")
return self.orphaned_resources
def detect_aws_orphaned_resources(self):
"""Detect orphaned resources in AWS"""
logger.info("Detecting AWS orphaned resources")
try:
# Initialize AWS clients
ec2 = boto3.client('ec2')
elb = boto3.client('elb')
elbv2 = boto3.client('elbv2')
rds = boto3.client('rds')
# Check for unattached EBS volumes
self._detect_unattached_ebs_volumes(ec2)
# Check for idle EC2 instances
self._detect_idle_ec2_instances(ec2)
# Check for unused Elastic IPs
self._detect_unused_elastic_ips(ec2)
# Check for unused load balancers
self._detect_unused_load_balancers(elb, elbv2)
# Check for idle RDS instances
self._detect_idle_rds_instances(rds)
except Exception as e:
logger.error(f"Error detecting AWS orphaned resources: {str(e)}")
def detect_azure_orphaned_resources(self):
"""Detect orphaned resources in Azure"""
logger.info("Detecting Azure orphaned resources")
try:
# Initialize Azure clients
credential = DefaultAzureCredential()
resource_client = ResourceManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
compute_client = ComputeManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
network_client = NetworkManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
storage_client = StorageManagementClient(credential, os.environ.get("AZURE_SUBSCRIPTION_ID"))
# Check for unused disks
self._detect_unused_azure_disks(compute_client)
# Check for idle VMs
self._detect_idle_azure_vms(compute_client)
# Check for unused public IPs
self._detect_unused_azure_public_ips(network_client)
# Check for unused network security groups
self._detect_unused_azure_nsgs(network_client)
except Exception as e:
logger.error(f"Error detecting Azure orphaned resources: {str(e)}")
def detect_gcp_orphaned_resources(self):
"""Detect orphaned resources in GCP"""
logger.info("Detecting GCP orphaned resources")
try:
# Initialize GCP clients
compute_client = compute_v1.InstancesClient()
disks_client = compute_v1.DisksClient()
addresses_client = compute_v1.AddressesClient()
storage_client = storage.Client()
# Check for unused persistent disks
self._detect_unused_gcp_disks(disks_client)
# Check for idle VM instances
self._detect_idle_gcp_instances(compute_client)
# Check for unused static IPs
self._detect_unused_gcp_addresses(addresses_client)
except Exception as e:
logger.error(f"Error detecting GCP orphaned resources: {str(e)}")
# AWS detection methods
def _detect_unattached_ebs_volumes(self, ec2):
"""Detect unattached EBS volumes"""
try:
response = ec2.describe_volumes(
Filters=[{'Name': 'status', 'Values': ['available']}]
)
for volume in response['Volumes']:
# Check if volume has been unattached for more than threshold days
create_time = volume['CreateTime']
age_days = (datetime.datetime.now(datetime.timezone.utc) - create_time).days
if age_days > self.days_threshold:
tags = {tag['Key']: tag['Value'] for tag in volume.get('Tags', [])}
self.orphaned_resources.append({
'cloud_provider': 'AWS',
'resource_type': 'EBS Volume',
'resource_id': volume['VolumeId'],
'region': volume['AvailabilityZone'][:-1], # Remove AZ letter to get region
'created_time': create_time.isoformat(),
'age_days': age_days,
'size': f"{volume['Size']} GB",
'monthly_cost_estimate': round(volume['Size'] * 0.1, 2), # Rough estimate
'tags': tags,
'owner': tags.get('Owner', 'Unknown'),
'project': tags.get('Project', 'Unknown'),
'environment': tags.get('Environment', 'Unknown'),
'last_attached': 'Unknown'
})
except Exception as e:
logger.error(f"Error detecting unattached EBS volumes: {str(e)}")
# Additional detection methods for other resource types and cloud providers
# Implementation omitted for brevity
def generate_reports(self):
"""Generate reports from detected orphaned resources"""
if not self.orphaned_resources:
logger.info("No orphaned resources detected")
return
# Save full JSON report
json_path = os.path.join(self.output_dir, 'orphaned_resources.json')
with open(json_path, 'w') as f:
json.dump(self.orphaned_resources, f, indent=2)
# Save CSV report
csv_path = os.path.join(self.output_dir, 'orphaned_resources.csv')
with open(csv_path, 'w', newline='') as f:
if not self.orphaned_resources:
f.write("No orphaned resources detected")
return
fieldnames = self.orphaned_resources[0].keys()
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(self.orphaned_resources)
# Generate cost summary by resource type
cost_summary = {}
for resource in self.orphaned_resources:
resource_type = resource['resource_type']
cost = resource.get('monthly_cost_estimate', 0)
if resource_type not in cost_summary:
cost_summary[resource_type] = {
'count': 0,
'total_cost': 0
}
cost_summary[resource_type]['count'] += 1
cost_summary[resource_type]['total_cost'] += cost
# Save cost summary
summary_path = os.path.join(self.output_dir, 'cost_summary.json')
with open(summary_path, 'w') as f:
json.dump(cost_summary, f, indent=2)
logger.info(f"Reports generated in {self.output_dir}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Detect orphaned cloud resources')
parser.add_argument('--output-dir', default='./reports', help='Directory to store reports')
parser.add_argument('--days-threshold', type=int, default=30, help='Age threshold in days')
args = parser.parse_args()
detector = OrphanedResourceDetector(args.output_dir, args.days_threshold)
detector.detect_all()
Lessons Learned:
Effective cloud cost management requires proactive resource lifecycle tracking and automated cleanup processes.
How to Avoid:
Implement comprehensive resource tagging policies across all cloud providers.
Use infrastructure-as-code for all resource provisioning with mandatory lifecycle metadata.
Create automated processes for detecting and cleaning up orphaned resources.
Establish regular cost reviews with team accountability.
Implement time-to-live (TTL) for non-production resources with automated cleanup.
No summary provided
What Happened:
A large enterprise had implemented a centralized cloud cost optimization strategy that included purchasing 3-year Reserved Instances for their predictable workloads. After a year, a cost analysis revealed that many of these RIs were significantly underutilized, with some having utilization rates below 30%. Despite the discount from on-demand pricing, the company was effectively wasting money on unused capacity. The issue was particularly severe in development and testing environments, where workloads were often turned off outside of business hours but the RIs continued to be billed.
Diagnosis Steps:
Analyzed RI utilization reports across all accounts.
Reviewed workload patterns and instance usage.
Examined the RI purchase decision process.
Compared actual usage with forecasted usage.
Investigated instance scheduling practices.
Root Cause:
The investigation revealed multiple issues with the RI strategy:
1. RI purchases were based on peak usage rather than average utilization
2. Development and testing environments were included in RI purchases despite their intermittent usage
3. There was no process for redistributing underutilized RIs across accounts
4. The company had purchased 3-year terms for all RIs without considering workload volatility
5. There was no regular review process for RI utilization
Fix/Workaround:
• Implemented immediate improvements to RI management
• Created a centralized RI management function
• Implemented instance scheduling for non-production environments (see the scheduler sketch after this list)
• Converted some RIs to more flexible Savings Plans
• Established regular RI utilization reviews
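A minimal sketch of the non-production instance scheduler mentioned above; the Schedule tag convention and region are illustrative, and the function would typically be wired to two scheduled triggers (stop in the evening, start in the morning):
# instance_scheduler.py - a minimal sketch; tag names and region are illustrative
import boto3

SCHEDULED_TAG = {"Name": "tag:Schedule", "Values": ["office-hours"]}  # assumed tag convention


def set_nonprod_instance_state(region: str, action: str) -> list:
    """Stop or start all instances carrying the office-hours Schedule tag."""
    ec2 = boto3.client("ec2", region_name=region)
    state = "running" if action == "stop" else "stopped"
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[SCHEDULED_TAG, {"Name": "instance-state-name", "Values": [state]}]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    if instance_ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=instance_ids)
        else:
            ec2.start_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    print(set_nonprod_instance_state("us-east-1", "stop"))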
Lessons Learned:
Reserved Instances require careful planning, ongoing management, and regular optimization to achieve their cost-saving potential.
How to Avoid:
Match RI terms to workload stability (longer terms for stable workloads only).
Implement instance scheduling for non-production environments.
Create a centralized RI management function with regular reviews.
Consider more flexible discount options like Savings Plans for variable workloads.
Establish clear processes for redistributing underutilized RIs.