# Cloud Platforms Scenarios
No summary provided
What Happened:
The finance team reported a massive spike in AWS costs. Investigation revealed multiple unused resources running, including oversized instances and orphaned volumes.
Diagnosis Steps:
Analyzed AWS Cost Explorer reports to identify cost drivers.
Used AWS Trusted Advisor to find optimization opportunities.
Reviewed recent infrastructure changes in Terraform state.
Discovered test environments that were created but never torn down.
Root Cause:
Multiple issues contributed:
1. Terraform destroy commands were failing silently in CI/CD pipelines
2. Developers were creating test resources manually without tagging
3. No cost monitoring or alerting was in place
4. Auto-scaling was misconfigured, keeping minimum instances too high
Fix/Workaround:
• Implemented AWS Budgets alerts for unusual spending patterns (a CLI sketch follows this list).
• Created an emergency cleanup plan for unused resources.
• Fixed Terraform scripts to properly destroy test environments.
• Implemented mandatory tagging policy with expiration dates for temporary resources.
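One way to set up the budget alert described above, sketched with the AWS CLI; the account ID, amounts, and e-mail address are placeholders rather than values from the incident:
# Create a monthly cost budget with an 80% actual-spend notification
aws budgets create-budget \
  --account-id 123456789012 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

# budget.json
{
  "BudgetName": "monthly-platform-budget",
  "BudgetLimit": { "Amount": "10000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}

# notifications.json
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "finops-team@example.com" }
    ]
  }
]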
Lessons Learned:
Cloud costs can spiral quickly without proper governance and monitoring.
How to Avoid:
Implement cost monitoring and alerting for early detection.
Use infrastructure as code with proper cleanup procedures.
Enforce resource tagging for ownership and purpose.
Schedule regular cost reviews and cleanup of unused resources (see the tag-expiry check sketched after this list).
Implement auto-scaling with appropriate minimum and maximum values.
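As referenced above, a rough sketch of an expiry sweep that pairs with the tagging policy: it lists resources through the Resource Groups Tagging API and flags any whose expiration date has passed. The expires-on tag key and the use of jq are assumptions for illustration, not details from the incident.
#!/usr/bin/env bash
# Flag resources whose "expires-on" tag (YYYY-MM-DD) is already in the past.
today=$(date -u +%F)
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=expires-on \
  --output json |
jq -r --arg today "$today" '
  .ResourceTagMappingList[]
  | {arn: .ResourceARN, expires: (.Tags[] | select(.Key == "expires-on") | .Value)}
  | select(.expires < $today)
  | "\(.arn) expired on \(.expires)"'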
No summary provided
What Happened:
A security audit revealed that an application had access to sensitive customer data it didn't need. Further investigation showed that many services had overly permissive IAM roles with wildcard permissions.
Diagnosis Steps:
Reviewed IAM policies using AWS IAM Access Analyzer.
Analyzed CloudTrail logs for unusual access patterns.
Audited Terraform code that created the IAM roles.
Used AWS Config to identify non-compliant resources.
Root Cause:
IAM roles were created with wildcard permissions (e.g., "s3:*") instead of specific actions. Terraform modules were using default permissions that were too broad.
Fix/Workaround:
• Implemented least privilege principle by restricting permissions to only what was needed.
• Created custom IAM policies scoped to the actions actually observed in CloudTrail logs (see the before/after example following this list).
• Updated Terraform modules to use more restrictive default permissions.
• Implemented regular permission reviews and automated compliance checks.
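An illustrative before/after of that narrowing; the bucket name and action list are placeholders standing in for whatever CloudTrail showed the service actually using:
# Before: wildcard statement of the kind flagged by the audit
{
  "Effect": "Allow",
  "Action": "s3:*",
  "Resource": "*"
}

# After: only the observed actions, scoped to the service's own bucket
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::app-data-bucket",
    "arn:aws:s3:::app-data-bucket/*"
  ]
}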
Lessons Learned:
Default and wildcard permissions create significant security risks.
How to Avoid:
Follow the principle of least privilege for all IAM roles.
Use IAM Access Analyzer to identify overly permissive policies (a CLI sketch follows this list).
Implement infrastructure as code with security best practices.
Conduct regular security audits and permission reviews.
Use tools like CloudTrail Insights to detect unusual access patterns.
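A minimal sketch of standing up IAM Access Analyzer and pulling its active findings from the CLI; the analyzer name, region, and account ID are placeholders:
# Create an account-level analyzer (one-time)
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

# List active findings for review
aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:access-analyzer:us-east-1:123456789012:analyzer/account-analyzer \
  --filter '{"status": {"eq": ["ACTIVE"]}}'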
No summary provided
What Happened:
During an AWS regional outage, the application was supposed to automatically failover to a secondary region. However, the service remained unavailable, and the failover didn't occur as expected.
Diagnosis Steps:
Checked Route 53 health check configurations and logs.
Reviewed CloudWatch alarms and metrics for failover triggers.
Tested manual failover process and found configuration issues.
Analyzed recent infrastructure changes that might have affected failover.
Root Cause:
Multiple issues prevented successful failover:
1. Route 53 health checks were only monitoring the load balancer, not the actual application health
2. Database replication to the secondary region had silently failed weeks ago
3. IAM roles in the secondary region were outdated after a recent permission change
Fix/Workaround:
• Implemented application-level health checks that verify critical functionality (see the sketch after this list).
• Set up monitoring for database replication lag and status.
• Automated testing of failover scenarios on a regular schedule.
• Updated IAM roles in all regions simultaneously using infrastructure as code.
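Sketches of the health-check and replication-lag monitoring from the first two items, using the AWS CLI and assuming the secondary database is an RDS cross-region read replica; the domain, path, DB identifier, thresholds, and SNS topic are placeholders:
# Route 53 health check that exercises an application endpoint,
# not just the load balancer
aws route53 create-health-check \
  --caller-reference app-health-$(date +%s) \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "app.example.com",
    "ResourcePath": "/healthz",
    "Port": 443,
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# Alarm on replication lag for the cross-region read replica
aws cloudwatch put-metric-alarm \
  --alarm-name secondary-region-replica-lag \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=app-db-replica-usw2 \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 300 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:dr-alerts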
Lessons Learned:
Failover mechanisms need regular testing and monitoring to be reliable.
How to Avoid:
Test failover procedures regularly with chaos engineering practices.
Monitor all components of the failover system, not just infrastructure.
Automate deployment of configuration changes across all regions.
Implement canary deployments to detect issues before they affect all users.
No summary provided
What Happened:
During a scheduled disaster recovery test, the team discovered that S3 objects were not being replicated to the backup region. This posed a significant risk to the organization's business continuity plan.
Diagnosis Steps:
Checked S3 bucket replication configuration in the AWS Console.
Verified IAM roles and permissions for replication.
Examined CloudTrail logs for replication-related events.
Tested manual object copying between regions.
Reviewed AWS service health dashboard for any S3 issues.
Root Cause:
The IAM role used for S3 replication had insufficient permissions. Specifically, it was missing the s3:ReplicateObject permission for the destination bucket. Additionally, object ownership settings in the destination bucket were preventing the replication process from setting the correct ACLs.
Fix/Workaround:
• Updated the IAM role policy with the correct permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetReplicationConfiguration",
        "s3:ListBucket",
        "s3:GetObjectVersionForReplication",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectVersionTagging"
      ],
      "Resource": [
        "arn:aws:s3:::source-bucket",
        "arn:aws:s3:::source-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ReplicateObject",
        "s3:ReplicateDelete",
        "s3:ReplicateTags",
        "s3:GetObjectVersionTagging"
      ],
      "Resource": "arn:aws:s3:::destination-bucket/*"
    }
  ]
}
• Modified the destination bucket's Object Ownership setting to "Bucket owner preferred".
• Implemented a one-time sync to replicate existing objects:
aws s3 sync s3://source-bucket s3://destination-bucket --source-region us-east-1 --region us-west-2
• Added CloudWatch alarms to monitor replication metrics.
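A sketch of one such alarm, assuming replication metrics (or S3 Replication Time Control) are enabled on the rule; the rule ID, threshold, and SNS topic are placeholders:
aws cloudwatch put-metric-alarm \
  --alarm-name s3-replication-latency \
  --namespace AWS/S3 \
  --metric-name ReplicationLatency \
  --dimensions Name=SourceBucket,Value=source-bucket \
               Name=DestinationBucket,Value=destination-bucket \
               Name=RuleId,Value=replication-rule-1 \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 900 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:dr-alerts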
Lessons Learned:
S3 replication requires specific permissions and configuration that must be regularly tested.
How to Avoid:
Implement regular testing of disaster recovery procedures.
Use infrastructure as code to ensure consistent bucket configurations.
Set up monitoring for replication metrics with alerts for failures.
Document and version control all IAM policies.
Include replication validation in change management processes.
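For that validation step, a sketch of a canary check that writes a test object and waits for its replication status to complete; the bucket name and key prefix are placeholders, and both status spellings are accepted since interfaces report it slightly differently:
#!/usr/bin/env bash
# Upload a canary object, then poll its replication status on the source bucket.
key="dr-canary/$(date -u +%Y%m%dT%H%M%SZ).txt"
echo "replication canary" | aws s3 cp - "s3://source-bucket/${key}"

for attempt in $(seq 1 30); do
  status=$(aws s3api head-object --bucket source-bucket --key "$key" \
    --query ReplicationStatus --output text)
  echo "attempt ${attempt}: ReplicationStatus=${status}"
  if [ "$status" = "COMPLETED" ] || [ "$status" = "COMPLETE" ]; then
    exit 0
  fi
  sleep 30
done
echo "replication did not complete in time" >&2
exit 1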
No summary provided
What Happened:
Services in two peered Azure virtual networks were experiencing intermittent connectivity issues. Some connections would time out while others worked fine, with no clear pattern to the failures.
Diagnosis Steps:
Verified virtual network peering status in Azure Portal.
Tested connectivity using Network Watcher and packet captures.
Examined network security groups (NSGs) and route tables.
Analyzed application logs for specific error patterns.
Reviewed recent network configuration changes.
Root Cause:
The issue was caused by asymmetric routing due to custom route tables applied to subnets in one of the virtual networks. Traffic was flowing correctly in one direction but taking an unexpected path on the return journey through a network virtual appliance (NVA) that was dropping some packets.
Fix/Workaround:
• Updated route tables to ensure symmetric routing:
{
  "name": "symmetric-route-table",
  "properties": {
    "routes": [
      {
        "name": "to-vnet2",
        "properties": {
          "addressPrefix": "10.2.0.0/16",
          "nextHopType": "VnetLocal"
        }
      },
      {
        "name": "default-route",
        "properties": {
          "addressPrefix": "0.0.0.0/0",
          "nextHopType": "VirtualAppliance",
          "nextHopIpAddress": "10.1.1.4"
        }
      }
    ]
  }
}
• Configured the NVA to forward the peered traffic in both directions without NAT:
# Example iptables configuration for Linux-based NVA
iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -d 10.2.0.0/16 -j ACCEPT
iptables -A FORWARD -s 10.1.0.0/16 -d 10.2.0.0/16 -j ACCEPT
iptables -A FORWARD -s 10.2.0.0/16 -d 10.1.0.0/16 -j ACCEPT
• Implemented network monitoring with Azure Network Watcher (a next-hop check is sketched after this list).
• Added health checks between services to detect routing issues.
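A quick way to verify routing symmetry after such a change, sketched with Network Watcher's next-hop check; the resource group, VM names, and IPs are placeholders:
# Forward path: from a VM in VNet1 towards a VM in VNet2
az network watcher show-next-hop \
  --resource-group rg-network \
  --vm vm-vnet1-app \
  --source-ip 10.1.0.4 \
  --dest-ip 10.2.0.5

# Return path: repeat from the VNet2 side and compare the reported next hops
az network watcher show-next-hop \
  --resource-group rg-network \
  --vm vm-vnet2-app \
  --source-ip 10.2.0.5 \
  --dest-ip 10.1.0.4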
Lessons Learned:
Virtual network peering requires careful route table configuration to avoid asymmetric routing.
How to Avoid:
Design network topology with routing symmetry in mind.
Document and review all custom routes before implementation (see the effective-route dump sketched after this list).
Test connectivity thoroughly after network changes.
Use infrastructure as code to manage network configurations.
Implement continuous network monitoring with alerts for anomalies.
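For that route review, a sketch of dumping the routes a NIC is actually using, which shows default, peering, and custom routes together; the NIC name and resource group are placeholders:
az network nic show-effective-route-table \
  --resource-group rg-network \
  --name vm-vnet1-app-nic \
  --output table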