# Scalability and High Availability Scenarios
No summary provided
What Happened:
A successful marketing campaign drove 5x normal traffic to the application. The services scaled horizontally as designed, but the database became a bottleneck, causing timeouts and errors.
Diagnosis Steps:
Monitored RDS metrics in CloudWatch and observed that the maximum number of connections had been reached (a metric-query sketch follows this list).
Analyzed slow query logs and found increasing query times.
Checked connection pooling configuration in application services.
Reviewed database scaling parameters and limitations.
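For reference, a minimal sketch of pulling the relevant RDS metric with the AWS SDK for Java v2; the instance identifier and time window are illustrative, not taken from the incident:
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DbConnectionMetrics {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            // Maximum DatabaseConnections over the last hour, at 1-minute resolution
            GetMetricStatisticsResponse response = cloudWatch.getMetricStatistics(
                GetMetricStatisticsRequest.builder()
                    .namespace("AWS/RDS")
                    .metricName("DatabaseConnections")
                    .dimensions(Dimension.builder()
                        .name("DBInstanceIdentifier")
                        .value("mydb-instance") // hypothetical instance name
                        .build())
                    .startTime(Instant.now().minus(1, ChronoUnit.HOURS))
                    .endTime(Instant.now())
                    .period(60)
                    .statistics(Statistic.MAXIMUM)
                    .build());

            response.datapoints().forEach(dp ->
                System.out.println(dp.timestamp() + " max connections: " + dp.maximum()));
        }
    }
}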
Root Cause:
Multiple issues combined:
1. Connection pooling was improperly configured, creating too many database connections.
2. No connection limits were set per service, allowing a single service to consume all connections.
3. Read-heavy queries weren't being directed to read replicas.
Fix/Workaround:
• Implemented proper connection pooling with PgBouncer.
• Added read replicas and configured the application to use them for read queries (a routing sketch follows this list).
• Set connection limits per service to prevent resource monopolization.
• Optimized the most expensive queries with proper indexing.
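A minimal sketch of the resulting read/write split, assuming two HikariCP pools: writes through a hypothetical PgBouncer endpoint in front of the primary, reads against a replica. Hostnames, ports, and pool sizes are illustrative:
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class ReadWriteDataSources {
    private final HikariDataSource primary;
    private final HikariDataSource replica;

    public ReadWriteDataSources() {
        // Writes go through PgBouncer in front of the primary (hypothetical endpoint)
        primary = build("jdbc:postgresql://pgbouncer.example.com:6432/mydb", 20);
        // Reads go to a read replica, keeping load off the primary
        replica = build("jdbc:postgresql://replica1.example.com:5432/mydb", 30);
    }

    private HikariDataSource build(String jdbcUrl, int maxPoolSize) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername("app_user");
        config.setPassword("********");
        config.setMaximumPoolSize(maxPoolSize); // per-service connection limit
        return new HikariDataSource(config);
    }

    // Callers ask explicitly for a read or a write connection
    public Connection readConnection() throws SQLException {
        return replica.getConnection();
    }

    public Connection writeConnection() throws SQLException {
        return primary.getConnection();
    }
}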
Lessons Learned:
Database connections are a finite resource that must be carefully managed for scalability.
How to Avoid:
Implement proper connection pooling from the start.
Use read replicas for read-heavy workloads.
Set and enforce connection limits per service.
Load test with at least 5x expected peak traffic.
Consider database sharding for very large scale applications.
No summary provided
What Happened:
During a marketing campaign, the application started experiencing high latency and eventually became completely unresponsive. Monitoring showed normal CPU and memory usage, but database response times were extremely high.
Diagnosis Steps:
Checked application logs for database-related errors.
Monitored active database connections with SELECT count(*) FROM pg_stat_activity.
Examined connection pool metrics from the application.
Analyzed query patterns and execution times.
Reviewed recent code changes that might affect database usage.
Root Cause:
The application was configured with a fixed database connection pool size that was insufficient for the increased traffic. Additionally, some database operations weren't properly releasing connections back to the pool, causing connection leakage.
Fix/Workaround:
• Short-term: Increased the connection pool size and restarted the application:
// HikariCP configuration update
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/mydb");
config.setUsername("app_user");
config.setPassword("********");
config.setMaximumPoolSize(50); // Increased from 20
config.setMinimumIdle(10);
config.setIdleTimeout(30000);
config.setConnectionTimeout(10000);
config.setLeakDetectionThreshold(60000); // Added leak detection
• Long-term: Fixed connection leaks in the application code:
// Before: Connection leak
public void processData(String id) {
    Connection conn = null;
    try {
        conn = dataSource.getConnection();
        // Process data
    } catch (SQLException e) {
        logger.error("Error processing data", e);
    }
    // Missing finally block to close connection!
}

// After: Proper connection handling with try-with-resources
public void processData(String id) {
    try (Connection conn = dataSource.getConnection()) {
        // Process data
    } catch (SQLException e) {
        logger.error("Error processing data", e);
    }
    // Connection automatically closed
}
• Implemented dynamic connection pool sizing based on load:
// Dynamic connection pool sizing
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AdaptiveConnectionPool {
    private static final Logger logger = LoggerFactory.getLogger(AdaptiveConnectionPool.class);

    private final HikariDataSource dataSource;
    private final ScheduledExecutorService scheduler;

    public AdaptiveConnectionPool(HikariConfig config) {
        this.dataSource = new HikariDataSource(config);
        this.scheduler = Executors.newScheduledThreadPool(1);
        // Adjust pool size every 30 seconds based on usage
        scheduler.scheduleAtFixedRate(this::adjustPoolSize, 30, 30, TimeUnit.SECONDS);
    }

    private void adjustPoolSize() {
        int activeConnections = dataSource.getHikariPoolMXBean().getActiveConnections();
        int maxPoolSize = dataSource.getMaximumPoolSize();

        // If using more than 80% of the pool, increase the pool size (capped at 100)
        if (activeConnections > 0.8 * maxPoolSize && maxPoolSize < 100) {
            int newSize = Math.min(maxPoolSize + 10, 100);
            dataSource.setMaximumPoolSize(newSize);
            logger.info("Increased connection pool size to {}", newSize);
        }

        // If using less than 30% of the pool, decrease the pool size (floor of 20)
        if (activeConnections < 0.3 * maxPoolSize && maxPoolSize > 20) {
            int newSize = Math.max(maxPoolSize - 5, 20);
            dataSource.setMaximumPoolSize(newSize);
            logger.info("Decreased connection pool size to {}", newSize);
        }
    }
}
Lessons Learned:
Connection pool sizing is critical for application scalability and requires careful tuning.
How to Avoid:
Implement proper connection handling with try-with-resources or similar patterns.
Configure connection leak detection and timeout settings.
Size connection pools based on expected peak load.
Monitor connection pool metrics and set up alerts (a metrics-wiring sketch follows this list).
Consider implementing dynamic connection pool sizing for variable loads.
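As one way to wire up that monitoring, a minimal sketch that publishes HikariCP pool metrics to a Micrometer registry, assuming the HikariCP Micrometer integration is on the classpath; the registry choice and connection details are illustrative:
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.metrics.micrometer.MicrometerMetricsTrackerFactory;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class PoolMetrics {
    public static HikariDataSource createInstrumentedDataSource() {
        // In production this would be a Prometheus/Datadog registry; SimpleMeterRegistry is for illustration
        MeterRegistry meterRegistry = new SimpleMeterRegistry();

        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/mydb");
        config.setUsername("app_user");
        config.setPassword("********");
        config.setMaximumPoolSize(50);
        // Publishes hikaricp.connections.* gauges and timers (active, idle, pending, usage)
        // that alerting rules can watch, e.g. active close to max or pending greater than zero
        config.setMetricsTrackerFactory(new MicrometerMetricsTrackerFactory(meterRegistry));

        return new HikariDataSource(config);
    }
}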
No summary provided
What Happened:
A web application was experiencing unstable performance despite having autoscaling enabled. Monitoring showed that instances were constantly being added and removed, causing disruption to user sessions and increasing latency.
Diagnosis Steps:
Analyzed CloudWatch metrics for the Auto Scaling Group.
Reviewed scaling policies and thresholds.
Examined instance startup and shutdown times.
Monitored application performance during scaling events.
Checked for correlation between scaling events and performance issues.
Root Cause:
The autoscaling policy was configured with thresholds that were too close together and cooldown periods that were too short. This caused the system to rapidly scale out during brief traffic spikes and then immediately scale back in, only to scale out again when the next spike occurred. This "thrashing" behavior created constant disruption.
Fix/Workaround:
• Short-term: Modified the Auto Scaling Group configuration with more conservative settings:
{
  "AutoScalingGroupName": "web-app-asg",
  "MinSize": 4,
  "MaxSize": 20,
  "DesiredCapacity": 4,
  "DefaultCooldown": 300,
  "AvailabilityZones": [
    "us-east-1a",
    "us-east-1b",
    "us-east-1c"
  ],
  "HealthCheckType": "ELB",
  "HealthCheckGracePeriod": 300
}
• Updated scaling policies with wider thresholds and longer evaluation periods:
{
  "AutoScalingGroupName": "web-app-asg",
  "PolicyName": "scale-out-policy",
  "PolicyType": "TargetTrackingScaling",
  "TargetTrackingConfiguration": {
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 70.0,
    "DisableScaleIn": false
  },
  "EstimatedInstanceWarmup": 180
}
• Implemented predictive scaling using AWS Auto Scaling:
resource "aws_autoscaling_policy" "predictive_scaling" {
name = "predictive-scaling-policy"
autoscaling_group_name = aws_autoscaling_group.web_app.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 70.0
predefined_metric_pair_specification {
predefined_metric_type = "ASGCPUUtilization"
}
}
mode = "ForecastAndScale"
scheduling_buffer_time = 300
max_capacity_breach_behavior = "IncreaseMaxCapacity"
max_capacity_buffer = 10
}
}
Lessons Learned:
Autoscaling requires careful tuning to avoid instability from rapid scaling changes.
How to Avoid:
Implement appropriate cooldown periods between scaling actions.
Use target tracking policies instead of simple threshold-based policies.
Consider predictive scaling for workloads with predictable patterns.
Implement gradual scaling with step adjustments (a step-scaling sketch follows this list).
Monitor and alert on excessive scaling events.
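A minimal sketch of such a step-scaling policy using the AWS SDK for Java v2, reusing the web-app-asg group from the examples above; the step thresholds are illustrative, and the policy still needs a CloudWatch alarm attached to trigger it:
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.PutScalingPolicyRequest;
import software.amazon.awssdk.services.autoscaling.model.StepAdjustment;

public class StepScalingPolicy {
    public static void main(String[] args) {
        try (AutoScalingClient autoScaling = AutoScalingClient.create()) {
            // Scale out gradually: larger breaches of the alarm threshold add more capacity
            autoScaling.putScalingPolicy(PutScalingPolicyRequest.builder()
                .autoScalingGroupName("web-app-asg")
                .policyName("gradual-scale-out")
                .policyType("StepScaling")
                .adjustmentType("ChangeInCapacity")
                .metricAggregationType("Average")
                .estimatedInstanceWarmup(180)
                .stepAdjustments(
                    // Breach of 0-10% above the alarm threshold: add 1 instance
                    StepAdjustment.builder()
                        .metricIntervalLowerBound(0.0)
                        .metricIntervalUpperBound(10.0)
                        .scalingAdjustment(1)
                        .build(),
                    // Breach of 10-20%: add 2 instances
                    StepAdjustment.builder()
                        .metricIntervalLowerBound(10.0)
                        .metricIntervalUpperBound(20.0)
                        .scalingAdjustment(2)
                        .build(),
                    // Breach of more than 20%: add 4 instances
                    StepAdjustment.builder()
                        .metricIntervalLowerBound(20.0)
                        .scalingAdjustment(4)
                        .build())
                .build());
        }
    }
}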
No summary provided
What Happened:
During a major sales event, an e-commerce platform experienced increasing response times that eventually led to service outages. The application consisted of dozens of microservices, each maintaining its own database connection pool to a shared PostgreSQL cluster. As traffic increased, services began failing with database connection errors. The failures cascaded across the platform as dependent services also failed. Database monitoring showed connection counts at the configured maximum, but many connections were idle or in an inconsistent state.
Diagnosis Steps:
Analyzed database connection metrics and active queries.
Examined connection pool configurations across services.
Reviewed application logs for connection handling patterns.
Traced request flows through the microservices architecture.
Monitored database server resource utilization.
Root Cause:
The investigation revealed multiple issues with connection management:
1. Each microservice maintained its own connection pool without coordination.
2. Connection pools were sized based on individual service needs, not system-wide capacity.
3. Some services failed to properly release connections during error conditions.
4. Long-running transactions held connections unnecessarily.
5. No circuit breaking or backpressure mechanisms were implemented.
Fix/Workaround:
• Implemented immediate fixes to restore service
• Optimized connection pool sizes across services
• Improved connection handling with proper release in all code paths
• Implemented circuit breaking and backpressure mechanisms (a sketch follows this list)
• Created a centralized connection management strategy
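A minimal sketch of that protection using Resilience4j (the library choice is an assumption; the source does not name one): a circuit breaker that fails fast when the database is struggling, wrapped in a bulkhead that caps concurrent calls as backpressure. Thresholds are illustrative:
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class ProtectedDatabaseClient {
    // Open the circuit when more than half of recent database calls fail
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("database",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build());

    // Cap concurrent database calls per service instance as a backpressure mechanism
    private final Bulkhead bulkhead = Bulkhead.of("database",
        BulkheadConfig.custom()
            .maxConcurrentCalls(20)
            .maxWaitDuration(Duration.ofMillis(500))
            .build());

    public <T> T callDatabase(Supplier<T> query) {
        // Bulkhead rejects excess concurrent calls; circuit breaker fails fast while the database recovers
        Supplier<T> decorated = Bulkhead.decorateSupplier(bulkhead,
            CircuitBreaker.decorateSupplier(circuitBreaker, query));
        return decorated.get();
    }
}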
Lessons Learned:
Database connection management in microservices requires system-wide coordination and proper resource governance.
How to Avoid:
Implement centralized connection pool management or governance.
Size connection pools based on database capacity, not individual service needs.
Use connection monitoring and alerting to detect potential issues.
Implement circuit breaking and backpressure mechanisms.
Regularly review and optimize database access patterns.