# Database Management in DevOps Scenarios
No summary provided
What Happened:
During a marketing campaign, users reported intermittent application failures. The application logs showed database connection timeouts, but the database server itself showed low resource utilization. The issue occurred only during peak traffic periods and would resolve itself after traffic decreased.
Diagnosis Steps:
Examined application logs for error patterns.
Checked database server metrics (CPU, memory, disk I/O, connections).
Analyzed connection pool configuration in the application.
Monitored active connections and connection acquisition times.
Traced request flows through the application to identify connection usage patterns.
Root Cause:
The application was configured with a fixed connection pool size of 10 connections per pod, but during peak traffic, each pod was handling more concurrent requests than the pool could accommodate. Additionally, some database operations were not properly releasing connections back to the pool, causing connection leaks. The database server itself had a max_connections setting of 100, which was being reached during peak loads.
Fix/Workaround:
• Short-term: Increased the connection pool size and added connection timeout handling:
// Before: Insufficient connection pool configuration
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
config.setMaximumPoolSize(10);
return new HikariDataSource(config);
}
// After: Improved connection pool configuration
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
config.setMaximumPoolSize(20);
config.setMinimumIdle(5);
config.setIdleTimeout(30000);
config.setConnectionTimeout(10000);
config.setMaxLifetime(1800000);
config.setLeakDetectionThreshold(60000);
config.addDataSourceProperty("cachePrepStmts", "true");
config.addDataSourceProperty("prepStmtCacheSize", "250");
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
return new HikariDataSource(config);
}
• Fixed connection leaks in the application code:
// Before: Connection leak in service method
@Service
public class ProductService {
private final JdbcTemplate jdbcTemplate;
public ProductService(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
public List<Product> findProductsByCategory(String category) {
Connection conn = null;
try {
conn = jdbcTemplate.getDataSource().getConnection();
PreparedStatement stmt = conn.prepareStatement(
"SELECT * FROM products WHERE category = ?");
stmt.setString(1, category);
ResultSet rs = stmt.executeQuery();
List<Product> products = new ArrayList<>();
while (rs.next()) {
products.add(mapRowToProduct(rs));
}
return products;
// Connection not closed in finally block!
} catch (SQLException e) {
throw new RuntimeException("Error fetching products", e);
}
}
}
// After: Proper connection handling with try-with-resources
@Service
public class ProductService {
private final JdbcTemplate jdbcTemplate;
public ProductService(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
public List<Product> findProductsByCategory(String category) {
try (Connection conn = jdbcTemplate.getDataSource().getConnection();
PreparedStatement stmt = conn.prepareStatement(
"SELECT * FROM products WHERE category = ?")) {
stmt.setString(1, category);
try (ResultSet rs = stmt.executeQuery()) {
List<Product> products = new ArrayList<>();
while (rs.next()) {
products.add(mapRowToProduct(rs));
}
return products;
}
} catch (SQLException e) {
throw new RuntimeException("Error fetching products", e);
}
}
}
• Long-term: Implemented a comprehensive database connection management strategy:
# PostgreSQL configuration in postgresql.conf
max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 16MB
maintenance_work_mem = 1GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
wal_buffers = 16MB
checkpoint_completion_target = 0.9
random_page_cost = 1.1
effective_io_concurrency = 200
• Implemented connection pooling at the database level with PgBouncer (note: in transaction pooling mode PgBouncer does not preserve session state, so server-side prepared statements generally cannot be relied on and statement caching should stay on the driver side):
# pgbouncer.ini
[databases]
* = host=localhost port=5432
[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 100
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
max_db_connections = 100
max_user_connections = 100
server_reset_query = DISCARD ALL
server_check_delay = 30
server_check_query = SELECT 1
server_lifetime = 3600
server_idle_timeout = 600
client_idle_timeout = 0
log_connections = 1
log_disconnections = 1
application_name_add_host = 1
stats_period = 60
• Updated Kubernetes deployment to use PgBouncer as a sidecar:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0.0
        env:
        - name: SPRING_DATASOURCE_URL
          value: jdbc:postgresql://localhost:6432/mydb
        - name: SPRING_DATASOURCE_USERNAME
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: username
        - name: SPRING_DATASOURCE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "500m"
            memory: "512Mi"
      - name: pgbouncer
        image: edoburu/pgbouncer:1.15.0
        env:
        - name: DB_HOST
          value: postgres.database.svc.cluster.local
        - name: DB_PORT
          value: "5432"
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: POOL_MODE
          value: transaction
        - name: MAX_CLIENT_CONN
          value: "1000"
        - name: DEFAULT_POOL_SIZE
          value: "50"
        ports:
        - containerPort: 6432
        resources:
          limits:
            cpu: "500m"
            memory: "256Mi"
          requests:
            cpu: "100m"
            memory: "128Mi"
• Implemented connection pool monitoring with Prometheus and Grafana:
// Connection pool metrics configuration
@Configuration
public class MetricsConfig {
    // HikariCP ships with built-in Micrometer support; registering the meter
    // registry on the pool exposes hikaricp_* metrics without a custom MeterBinder.
    @Autowired
    public void bindHikariMetrics(DataSource dataSource, MeterRegistry meterRegistry) {
        if (dataSource instanceof HikariDataSource) {
            ((HikariDataSource) dataSource).setMetricRegistry(meterRegistry);
        }
    }
}
• Created a Rust-based connection pool health checker:
// connection_health_checker.rs
use std::env;
use std::time::{Duration, Instant};
use tokio::time::sleep;
use tokio_postgres::{Client, NoTls};
use prometheus::{register_gauge, register_histogram, Gauge, Histogram};
use warp::Filter;
// Prometheus metrics
lazy_static::lazy_static! {
static ref CONNECTION_ACQUISITION_TIME: Histogram = register_histogram!(
"db_connection_acquisition_seconds",
"Time to acquire a database connection",
vec![0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
).unwrap();
static ref QUERY_EXECUTION_TIME: Histogram = register_histogram!(
"db_query_execution_seconds",
"Time to execute a database query",
vec![0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
).unwrap();
static ref ACTIVE_CONNECTIONS: Gauge = register_gauge!(
"db_active_connections",
"Number of active database connections"
).unwrap();
static ref IDLE_CONNECTIONS: Gauge = register_gauge!(
"db_idle_connections",
"Number of idle database connections"
).unwrap();
static ref CONNECTION_ERRORS: Gauge = register_gauge!(
"db_connection_errors_total",
"Total number of database connection errors"
).unwrap();
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Get database connection info from environment variables
let db_host = env::var("DB_HOST").unwrap_or_else(|_| "localhost".to_string());
let db_port = env::var("DB_PORT").unwrap_or_else(|_| "5432".to_string());
let db_name = env::var("DB_NAME").unwrap_or_else(|_| "postgres".to_string());
let db_user = env::var("DB_USER").unwrap_or_else(|_| "postgres".to_string());
let db_password = env::var("DB_PASSWORD").unwrap_or_else(|_| "postgres".to_string());
let connection_string = format!(
"host={} port={} dbname={} user={} password={}",
db_host, db_port, db_name, db_user, db_password
);
// Start metrics server
let metrics_route = warp::path("metrics").map(|| {
let encoder = prometheus::TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = Vec::new();
encoder.encode(&metric_families, &mut buffer).unwrap();
String::from_utf8(buffer).unwrap()
});
let server = warp::serve(metrics_route).run(([0, 0, 0, 0], 8080));
// Run server in the background
tokio::spawn(server);
// Periodically check database connection health
loop {
check_connection_health(&connection_string).await;
sleep(Duration::from_secs(10)).await;
}
}
async fn check_connection_health(connection_string: &str) {
// Measure connection acquisition time
let start = Instant::now();
let connection_result = tokio_postgres::connect(connection_string, NoTls).await;
let acquisition_time = start.elapsed();
CONNECTION_ACQUISITION_TIME.observe(acquisition_time.as_secs_f64());
match connection_result {
Ok((client, connection)) => {
// Spawn the connection task
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("Connection error: {}", e);
CONNECTION_ERRORS.inc();
}
});
// Execute a simple query to check database health
let query_start = Instant::now();
match client.query("SELECT count(*) FROM pg_stat_activity", &[]).await {
Ok(rows) => {
let query_time = query_start.elapsed();
QUERY_EXECUTION_TIME.observe(query_time.as_secs_f64());
if let Some(row) = rows.first() {
let count: i64 = row.get(0);
println!("Total connections: {}", count);
}
// Get connection stats
if let Ok(stats_rows) = client.query(
"SELECT state, count(*) FROM pg_stat_activity GROUP BY state",
&[]
).await {
for row in stats_rows {
let state: String = row.get(0);
let count: i64 = row.get(1);
if state == "active" {
ACTIVE_CONNECTIONS.set(count as f64);
} else if state == "idle" {
IDLE_CONNECTIONS.set(count as f64);
}
println!("State {}: {}", state, count);
}
}
}
Err(e) => {
eprintln!("Query error: {}", e);
CONNECTION_ERRORS.inc();
}
}
}
Err(e) => {
eprintln!("Failed to connect to database: {}", e);
CONNECTION_ERRORS.inc();
}
}
}
Lessons Learned:
Database connection management requires careful configuration at both the application and database levels.
How to Avoid:
Configure connection pools based on expected concurrent requests, not just instance count (a rough sizing sketch follows this list).
Implement connection leak detection and prevention mechanisms.
Use connection pooling middleware like PgBouncer for high-traffic applications.
Monitor connection usage patterns and adjust pool sizes accordingly.
Implement proper error handling and connection release in application code.
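As a rough illustration of the sizing point above, Little's law (concurrent connections ≈ peak queries per second × average time a connection is held) gives a first estimate that can then be divided across pods. The sketch below is hypothetical; the traffic figures, pod count, and headroom are illustrative assumptions, not measurements from this incident:

// Rough pool sizing based on Little's law: concurrency ≈ arrival rate × service time.
// All numbers below are illustrative placeholders.
public class PoolSizeEstimate {
    public static void main(String[] args) {
        double peakQueriesPerSecond = 800.0;   // e.g. measured at the load balancer
        double avgQueryTimeSeconds  = 0.05;    // 50 ms average time a connection is held
        int podCount                = 6;       // application replicas sharing the database

        double concurrentConnections = peakQueriesPerSecond * avgQueryTimeSeconds; // ≈ 40
        int perPodPoolSize = (int) Math.ceil(concurrentConnections / podCount) + 2; // small headroom

        System.out.printf("Estimated concurrent connections: %.1f%n", concurrentConnections);
        System.out.printf("Suggested pool size per pod: %d%n", perPodPoolSize);
    }
}

The total across all pods should still stay comfortably below the database's max_connections (or PgBouncer's pool size) to leave room for maintenance and monitoring sessions.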
No summary provided
What Happened:
During a marketing campaign, users reported intermittent application failures. The application logs showed database connection timeouts, but the database server itself showed low resource utilization. The issue occurred only during peak traffic periods and would resolve itself after traffic subsided.
Diagnosis Steps:
Analyzed application logs for error patterns and timing.
Monitored database connection metrics during peak periods.
Reviewed application connection pool configuration.
Examined database server connection limits and active connections.
Traced connection lifecycle in the application code.
Root Cause:
Multiple issues contributed to the connection pool exhaustion:
1. The connection pool size was too small for peak traffic.
2. Connection leaks: application code was not properly releasing connections back to the pool.
3. The database connection timeout was too long, keeping idle connections open.
4. Connection validation was not enabled, leading to stale connections.
5. No circuit breaker pattern was in place to handle database connection failures gracefully.
Fix/Workaround:
• Short-term: Optimized connection pool configuration:
// Before: Problematic HikariCP configuration
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
config.setMaximumPoolSize(10); // Too small for peak load
return new HikariDataSource(config);
}
// After: Optimized HikariCP configuration
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
// Increased pool size based on calculation:
// (core_count * 2) + effective_spindle_count
config.setMaximumPoolSize(32);
// Connection lifecycle management
config.setMinimumIdle(5);
config.setIdleTimeout(60000); // 60 seconds
config.setMaxLifetime(1800000); // 30 minutes
// Connection validation
config.setConnectionTestQuery("SELECT 1");
config.setValidationTimeout(5000); // 5 seconds
// Leak detection
config.setLeakDetectionThreshold(30000); // 30 seconds
return new HikariDataSource(config);
}
• Fixed connection leaks in application code:
// Before: Connection leak in service code
@Service
public class ProductServiceImpl implements ProductService {
private final JdbcTemplate jdbcTemplate;
@Autowired
public ProductServiceImpl(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
public Product getProduct(Long id) {
// Connection leak: not using try-with-resources
Connection conn = DataSourceUtils.getConnection(jdbcTemplate.getDataSource());
try {
PreparedStatement stmt = conn.prepareStatement("SELECT * FROM products WHERE id = ?");
stmt.setLong(1, id);
ResultSet rs = stmt.executeQuery();
if (rs.next()) {
return mapResultSetToProduct(rs);
}
return null;
} catch (SQLException e) {
throw new RuntimeException("Error fetching product", e);
}
// Missing finally block to close resources
}
}
// After: Fixed connection handling with Spring JdbcTemplate
@Service
public class ProductServiceImpl implements ProductService {
private final JdbcTemplate jdbcTemplate;
@Autowired
public ProductServiceImpl(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
public Product getProduct(Long id) {
// Using JdbcTemplate properly handles connection lifecycle
return jdbcTemplate.queryForObject(
"SELECT * FROM products WHERE id = ?",
new Object[]{id},
(rs, rowNum) -> mapResultSetToProduct(rs)
);
}
}
• Long-term: Implemented a comprehensive database connection management strategy:
# Database connection management configuration in application.yaml
spring:
  datasource:
    type: com.zaxxer.hikari.HikariDataSource
    hikari:
      pool-name: app-connection-pool
      maximum-pool-size: 32
      minimum-idle: 5
      idle-timeout: 60000
      max-lifetime: 1800000
      connection-timeout: 30000
      validation-timeout: 5000
      leak-detection-threshold: 30000
      data-source-properties:
        cachePrepStmts: true
        prepStmtCacheSize: 250
        prepStmtCacheSqlLimit: 2048
        useServerPrepStmts: true
      health-check-properties:
        connectivityCheckTimeoutMs: 1000
        expected99thPercentileMs: 10
# Circuit breaker configuration (Resilience4j)
resilience4j:
  circuitbreaker:
    instances:
      databaseCircuitBreaker:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 5s
        failureRateThreshold: 50
        eventConsumerBufferSize: 10
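A minimal sketch of how the databaseCircuitBreaker instance configured above could wrap a database call using Resilience4j's annotation support. The service class, query, and fallback behavior are assumptions for illustration; the point is that repeated connection failures open the circuit instead of forcing every request to wait out a connection timeout:

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import java.util.Collections;
import java.util.List;

@Service
public class ProductQueryService {

    private final JdbcTemplate jdbcTemplate;

    public ProductQueryService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // If the failure rate threshold is exceeded, the breaker opens and calls go
    // straight to the fallback instead of queueing for connections.
    @CircuitBreaker(name = "databaseCircuitBreaker", fallbackMethod = "findProductNamesFallback")
    public List<String> findProductNames(String category) {
        return jdbcTemplate.queryForList(
                "SELECT name FROM products WHERE category = ?", String.class, category);
    }

    // Fallback invoked while the circuit is open or when the call fails.
    private List<String> findProductNamesFallback(String category, Throwable cause) {
        // Serve an empty (or cached) result instead of cascading the failure.
        return Collections.emptyList();
    }
}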
• Implemented database connection monitoring:
// DatabaseConnectionHealthIndicator.java
@Component
public class DatabaseConnectionHealthIndicator extends AbstractHealthIndicator {
private final DataSource dataSource;
private final MeterRegistry meterRegistry;
public DatabaseConnectionHealthIndicator(DataSource dataSource, MeterRegistry meterRegistry) {
this.dataSource = dataSource;
this.meterRegistry = meterRegistry;
// Register metrics for connection pool
if (dataSource instanceof HikariDataSource) {
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
HikariPoolMXBean poolMXBean = hikariDataSource.getHikariPoolMXBean();
Gauge.builder("hikaricp.connections.active", poolMXBean, HikariPoolMXBean::getActiveConnections)
.description("Active connections")
.tag("pool", hikariDataSource.getPoolName())
.register(meterRegistry);
Gauge.builder("hikaricp.connections.idle", poolMXBean, HikariPoolMXBean::getIdleConnections)
.description("Idle connections")
.tag("pool", hikariDataSource.getPoolName())
.register(meterRegistry);
Gauge.builder("hikaricp.connections.total", poolMXBean, HikariPoolMXBean::getTotalConnections)
.description("Total connections")
.tag("pool", hikariDataSource.getPoolName())
.register(meterRegistry);
Gauge.builder("hikaricp.connections.pending", poolMXBean, HikariPoolMXBean::getThreadsAwaitingConnection)
.description("Pending threads")
.tag("pool", hikariDataSource.getPoolName())
.register(meterRegistry);
}
}
@Override
protected void doHealthCheck(Health.Builder builder) throws Exception {
int activeConnections = 0;
int idleConnections = 0;
int maxConnections = 0;
if (dataSource instanceof HikariDataSource) {
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
HikariPoolMXBean poolMXBean = hikariDataSource.getHikariPoolMXBean();
activeConnections = poolMXBean.getActiveConnections();
idleConnections = poolMXBean.getIdleConnections();
maxConnections = hikariDataSource.getMaximumPoolSize();
}
// Check if we're approaching connection pool exhaustion (guard against a zero max)
double utilizationRatio = maxConnections > 0 ? (double) activeConnections / maxConnections : 0.0;
if (utilizationRatio > 0.9) {
builder.down()
.withDetail("activeConnections", activeConnections)
.withDetail("idleConnections", idleConnections)
.withDetail("maxConnections", maxConnections)
.withDetail("utilizationRatio", String.format("%.2f", utilizationRatio))
.withDetail("message", "Connection pool is nearly exhausted");
} else {
builder.up()
.withDetail("activeConnections", activeConnections)
.withDetail("idleConnections", idleConnections)
.withDetail("maxConnections", maxConnections)
.withDetail("utilizationRatio", String.format("%.2f", utilizationRatio));
}
}
}
• Created a database connection pool sizing calculator:
// db_pool_calculator.go
package main
import (
"fmt"
"math"
"os"
"strconv"
)
// PoolSizeCalculator calculates optimal connection pool size
type PoolSizeCalculator struct {
// System parameters
CPUCores int
SpindleCount int
MemoryGB float64
// Application parameters
AvgConnectionUsageMs float64
PeakQueriesPerSecond float64
AvgQueryComplexity float64 // 1.0 = simple, 2.0 = medium, 3.0 = complex
// Database parameters
MaxConnections int
ConnectionOverheadKB int
}
// Calculate returns the recommended connection pool size
func (c *PoolSizeCalculator) Calculate() int {
// Calculate theoretical max based on CPU and disk I/O
theoreticalMax := (c.CPUCores * 2) + c.SpindleCount
// Calculate based on query load
connectionTimeRatio := c.AvgConnectionUsageMs / 1000.0
loadBasedSize := int(math.Ceil(c.PeakQueriesPerSecond * connectionTimeRatio * c.AvgQueryComplexity))
// Calculate memory constraint
memoryLimitKB := int(c.MemoryGB * 1024 * 1024)
memoryBasedMax := memoryLimitKB / (c.ConnectionOverheadKB * 2) // Using 50% of memory max
// Take the minimum of all calculations
recommendedSize := min(theoreticalMax, loadBasedSize, memoryBasedMax, c.MaxConnections)
// Add some buffer, but don't exceed max connections
bufferedSize := int(float64(recommendedSize) * 1.2) // 20% buffer
return min(bufferedSize, c.MaxConnections)
}
// PrintRecommendation prints a detailed recommendation
func (c *PoolSizeCalculator) PrintRecommendation() {
theoreticalMax := (c.CPUCores * 2) + c.SpindleCount
connectionTimeRatio := c.AvgConnectionUsageMs / 1000.0
loadBasedSize := int(math.Ceil(c.PeakQueriesPerSecond * connectionTimeRatio * c.AvgQueryComplexity))
memoryLimitKB := int(c.MemoryGB * 1024 * 1024)
memoryBasedMax := memoryLimitKB / (c.ConnectionOverheadKB * 2)
fmt.Println("Database Connection Pool Size Calculator")
fmt.Println("=======================================")
fmt.Println()
fmt.Println("Input Parameters:")
fmt.Printf("- CPU Cores: %d\n", c.CPUCores)
fmt.Printf("- Disk Spindles: %d\n", c.SpindleCount)
fmt.Printf("- Memory: %.1f GB\n", c.MemoryGB)
fmt.Printf("- Avg Connection Usage: %.1f ms\n", c.AvgConnectionUsageMs)
fmt.Printf("- Peak Queries/Second: %.1f\n", c.PeakQueriesPerSecond)
fmt.Printf("- Query Complexity Factor: %.1f\n", c.AvgQueryComplexity)
fmt.Printf("- DB Max Connections: %d\n", c.MaxConnections)
fmt.Printf("- Connection Overhead: %d KB\n", c.ConnectionOverheadKB)
fmt.Println()
fmt.Println("Calculations:")
fmt.Printf("- CPU/Disk Formula: (cores * 2) + spindles = %d\n", theoreticalMax)
fmt.Printf("- Load Formula: peak_qps * conn_time_ratio * complexity = %.1f\n",
c.PeakQueriesPerSecond * connectionTimeRatio * c.AvgQueryComplexity)
fmt.Printf("- Memory Formula: (memory_kb / conn_overhead_kb) / 2 = %d\n", memoryBasedMax)
fmt.Println()
fmt.Println("Results:")
fmt.Printf("- Theoretical Max (CPU/Disk): %d\n", theoreticalMax)
fmt.Printf("- Load-based Size: %d\n", loadBasedSize)
fmt.Printf("- Memory-constrained Max: %d\n", memoryBasedMax)
fmt.Printf("- Database Max: %d\n", c.MaxConnections)
fmt.Println()
recommendedSize := c.Calculate()
fmt.Printf("RECOMMENDED POOL SIZE: %d\n", recommendedSize)
// Provide per-instance recommendation for microservices
fmt.Println()
fmt.Println("Microservices Deployment Recommendations:")
for _, instances := range []int{2, 3, 5, 10} {
perInstanceSize := max(1, recommendedSize / instances)
fmt.Printf("- For %d instances: %d connections per instance\n", instances, perInstanceSize)
}
}
func min(values ...int) int {
result := values[0]
for _, v := range values[1:] {
if v < result {
result = v
}
}
return result
}
func max(a, b int) int {
if a > b {
return a
}
return b
}
func main() {
calculator := &PoolSizeCalculator{
CPUCores: 8,
SpindleCount: 4,
MemoryGB: 32,
AvgConnectionUsageMs: 50,
PeakQueriesPerSecond: 1000,
AvgQueryComplexity: 1.5,
MaxConnections: 500,
ConnectionOverheadKB: 1024,
}
// Override with command line arguments if provided
if len(os.Args) > 1 {
for i := 1; i < len(os.Args); i += 2 {
if i+1 >= len(os.Args) {
break
}
key := os.Args[i]
value := os.Args[i+1]
switch key {
case "--cpu":
if v, err := strconv.Atoi(value); err == nil {
calculator.CPUCores = v
}
case "--spindles":
if v, err := strconv.Atoi(value); err == nil {
calculator.SpindleCount = v
}
case "--memory":
if v, err := strconv.ParseFloat(value, 64); err == nil {
calculator.MemoryGB = v
}
case "--conn-usage":
if v, err := strconv.ParseFloat(value, 64); err == nil {
calculator.AvgConnectionUsageMs = v
}
case "--peak-qps":
if v, err := strconv.ParseFloat(value, 64); err == nil {
calculator.PeakQueriesPerSecond = v
}
case "--complexity":
if v, err := strconv.ParseFloat(value, 64); err == nil {
calculator.AvgQueryComplexity = v
}
case "--max-conn":
if v, err := strconv.Atoi(value); err == nil {
calculator.MaxConnections = v
}
case "--conn-overhead":
if v, err := strconv.Atoi(value); err == nil {
calculator.ConnectionOverheadKB = v
}
}
}
}
calculator.PrintRecommendation()
}
Lessons Learned:
Proper database connection pool configuration is critical for application stability under load.
How to Avoid:
Size connection pools based on workload and system resources.
Implement connection leak detection and prevention.
Monitor connection pool metrics and set up alerts.
Use circuit breakers to handle database connection failures gracefully.
Test application behavior under connection pool pressure.
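For the last point, a minimal saturation-test sketch, assuming a HikariCP DataSource is already configured: it submits more concurrent callers than the pool can serve and counts how many fail to acquire a connection within the configured timeout. The caller count and sleep duration are illustrative:

import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolPressureTest {

    // Submits more concurrent "queries" than the pool can serve and reports
    // how many callers could not get a connection in time.
    public static void run(HikariDataSource dataSource, int concurrentCallers) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(concurrentCallers);
        AtomicInteger failures = new AtomicInteger();

        for (int i = 0; i < concurrentCallers; i++) {
            executor.submit(() -> {
                try (Connection conn = dataSource.getConnection()) {
                    Thread.sleep(200); // simulate a query holding the connection
                } catch (Exception e) {
                    failures.incrementAndGet(); // acquisition timed out or the query failed
                }
            });
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        System.out.printf("Callers: %d, acquisition failures: %d%n", concurrentCallers, failures.get());
    }
}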
No summary provided
What Happened:
A critical microservice began experiencing intermittent failures that gradually increased in frequency. The service would work normally after restarts but would degrade over time. During peak traffic periods, the service would become completely unresponsive, with logs showing database connection timeout errors. The issue was particularly puzzling because the service used a connection pool that should have prevented connection exhaustion.
Diagnosis Steps:
Analyzed application logs for error patterns and timing.
Monitored database connection counts using PostgreSQL's pg_stat_activity.
Reviewed connection pool configuration and behavior.
Profiled the application to identify connection usage patterns.
Examined code changes that coincided with the onset of issues.
Root Cause:
The investigation revealed multiple issues:
1. A recent code change introduced a connection leak in an exception-handling path.
2. The connection pool was configured with a very high maximum connection limit.
3. The database had a lower max_connections setting than the application's connection pool.
4. Connection timeouts were set too high, allowing zombie connections to persist.
5. The application lacked proper connection leak detection and recovery mechanisms.
Fix/Workaround:
• Short-term: Implemented an immediate fix by properly closing connections in exception paths:
// Before: Connection leak in exception handling
public List<Customer> getCustomersByRegion(String region) {
Connection conn = null;
PreparedStatement stmt = null;
ResultSet rs = null;
List<Customer> customers = new ArrayList<>();
try {
conn = dataSource.getConnection();
stmt = conn.prepareStatement("SELECT * FROM customers WHERE region = ?");
stmt.setString(1, region);
rs = stmt.executeQuery();
while (rs.next()) {
Customer customer = new Customer();
customer.setId(rs.getLong("id"));
customer.setName(rs.getString("name"));
customer.setEmail(rs.getString("email"));
customer.setRegion(rs.getString("region"));
customers.add(customer);
}
return customers;
} catch (SQLException e) {
logger.error("Error retrieving customers by region", e);
throw new DatabaseException("Failed to retrieve customers", e);
} finally {
try {
if (rs != null) rs.close();
if (stmt != null) stmt.close();
// Connection not closed if an exception occurs in the finally block
if (conn != null) conn.close();
} catch (SQLException e) {
logger.error("Error closing database resources", e);
}
}
}
// After: Properly closing connections with try-with-resources
public List<Customer> getCustomersByRegion(String region) {
String sql = "SELECT * FROM customers WHERE region = ?";
List<Customer> customers = new ArrayList<>();
try (Connection conn = dataSource.getConnection();
PreparedStatement stmt = conn.prepareStatement(sql);
) {
stmt.setString(1, region);
try (ResultSet rs = stmt.executeQuery()) {
while (rs.next()) {
Customer customer = new Customer();
customer.setId(rs.getLong("id"));
customer.setName(rs.getString("name"));
customer.setEmail(rs.getString("email"));
customer.setRegion(rs.getString("region"));
customers.add(customer);
}
}
return customers;
} catch (SQLException e) {
logger.error("Error retrieving customers by region", e);
throw new DatabaseException("Failed to retrieve customers", e);
}
}
• Optimized connection pool configuration:
# Before: Problematic HikariCP configuration
spring:
  datasource:
    url: jdbc:postgresql://db.example.com:5432/customerdb
    username: ${DB_USER}
    password: ${DB_PASSWORD}
    hikari:
      maximum-pool-size: 100
      minimum-idle: 10
      idle-timeout: 600000
      connection-timeout: 30000
      max-lifetime: 1800000
# After: Optimized HikariCP configuration
spring:
  datasource:
    url: jdbc:postgresql://db.example.com:5432/customerdb
    username: ${DB_USER}
    password: ${DB_PASSWORD}
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      idle-timeout: 300000
      connection-timeout: 10000
      max-lifetime: 600000
      leak-detection-threshold: 60000
      validation-timeout: 5000
      register-mbeans: true
• Implemented a connection leak detector in Java:
// ConnectionLeakDetector.java
package com.example.dbmonitoring;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.TimeUnit;
@Component
public class ConnectionLeakDetector {
private static final Logger logger = LoggerFactory.getLogger(ConnectionLeakDetector.class);
private final DataSource dataSource;
private final MeterRegistry meterRegistry;
public ConnectionLeakDetector(DataSource dataSource, MeterRegistry meterRegistry) {
this.dataSource = dataSource;
this.meterRegistry = meterRegistry;
if (dataSource instanceof HikariDataSource) {
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
// Register metrics
Gauge.builder("hikaricp.connections.active", hikariDataSource, ds -> {
try {
return ds.getHikariPoolMXBean().getActiveConnections();
} catch (Exception e) {
logger.error("Error getting active connections", e);
return 0;
}
}).register(meterRegistry);
Gauge.builder("hikaricp.connections.idle", hikariDataSource, ds -> {
try {
return ds.getHikariPoolMXBean().getIdleConnections();
} catch (Exception e) {
logger.error("Error getting idle connections", e);
return 0;
}
}).register(meterRegistry);
Gauge.builder("hikaricp.connections.total", hikariDataSource, ds -> {
try {
return ds.getHikariPoolMXBean().getTotalConnections();
} catch (Exception e) {
logger.error("Error getting total connections", e);
return 0;
}
}).register(meterRegistry);
Gauge.builder("hikaricp.connections.pending", hikariDataSource, ds -> {
try {
return ds.getHikariPoolMXBean().getThreadsAwaitingConnection();
} catch (Exception e) {
logger.error("Error getting pending connections", e);
return 0;
}
}).register(meterRegistry);
}
}
@Scheduled(fixedRate = 60000)
public void checkForConnectionLeaks() {
if (!(dataSource instanceof HikariDataSource)) {
return;
}
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
HikariPoolMXBean poolMXBean = hikariDataSource.getHikariPoolMXBean();
int active = poolMXBean.getActiveConnections();
int idle = poolMXBean.getIdleConnections();
int total = poolMXBean.getTotalConnections();
int awaiting = poolMXBean.getThreadsAwaitingConnection();
logger.info("Connection pool stats - Active: {}, Idle: {}, Total: {}, Awaiting: {}",
active, idle, total, awaiting);
// Check for potential leaks
if (active > (total * 0.9) && awaiting > 0) {
logger.warn("Potential connection leak detected! Active connections at {}% of total",
(active * 100) / total);
// Query database for long-running connections
queryLongRunningConnections();
}
}
private void queryLongRunningConnections() {
String sql = "SELECT pid, usename, application_name, client_addr, backend_start, state, " +
"query_start, wait_event_type, wait_event, query " +
"FROM pg_stat_activity " +
"WHERE state != 'idle' AND query_start < NOW() - INTERVAL '5 minutes' " +
"ORDER BY query_start";
try (Connection conn = dataSource.getConnection();
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql)) {
int count = 0;
while (rs.next()) {
count++;
logger.warn("Long-running connection: PID={}, User={}, App={}, State={}, Started={}, Query={}",
rs.getString("pid"),
rs.getString("usename"),
rs.getString("application_name"),
rs.getString("state"),
rs.getTimestamp("query_start"),
rs.getString("query").substring(0, Math.min(100, rs.getString("query").length())));
}
if (count > 0) {
logger.warn("Found {} long-running connections", count);
}
} catch (SQLException e) {
logger.error("Error querying long-running connections", e);
}
}
@Scheduled(fixedRate = 3600000)
public void performConnectionPoolMaintenance() {
if (!(dataSource instanceof HikariDataSource)) {
return;
}
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
logger.info("Performing connection pool maintenance");
try {
// Soft restart of the connection pool
hikariDataSource.getHikariPoolMXBean().softEvictConnections();
logger.info("Successfully evicted idle connections");
} catch (Exception e) {
logger.error("Error performing connection pool maintenance", e);
}
}
}
• Implemented a database connection monitoring dashboard in Grafana:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 10,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "hikaricp_connections_active{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Active",
"refId": "A"
},
{
"expr": "hikaricp_connections_idle{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Idle",
"refId": "B"
},
{
"expr": "hikaricp_connections_total{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Total",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "HikariCP Connections",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "hikaricp_connections_pending{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Pending",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pending Connections",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"hiddenSeries": false,
"id": 6,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "hikaricp_connections_usage_seconds_max{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Max Usage Time",
"refId": "A"
},
{
"expr": "hikaricp_connections_usage_seconds_avg{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Avg Usage Time",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Connection Usage Time",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"hiddenSeries": false,
"id": 8,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "hikaricp_connections_creation_seconds_max{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Max Creation Time",
"refId": "A"
},
{
"expr": "hikaricp_connections_creation_seconds_avg{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Avg Creation Time",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Connection Creation Time",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"hiddenSeries": false,
"id": 10,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "hikaricp_connections_acquire_seconds_max{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Max Acquire Time",
"refId": "A"
},
{
"expr": "hikaricp_connections_acquire_seconds_avg{application=\"$application\", instance=\"$instance\"}",
"interval": "",
"legendFormat": "Avg Acquire Time",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Connection Acquisition Time",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"hiddenSeries": false,
"id": 12,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "pg_stat_activity_count{application=\"$application\", instance=\"$instance\", state=\"active\"}",
"interval": "",
"legendFormat": "Active",
"refId": "A"
},
{
"expr": "pg_stat_activity_count{application=\"$application\", instance=\"$instance\", state=\"idle\"}",
"interval": "",
"legendFormat": "Idle",
"refId": "B"
},
{
"expr": "pg_stat_activity_count{application=\"$application\", instance=\"$instance\", state=\"idle in transaction\"}",
"interval": "",
"legendFormat": "Idle in Transaction",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "PostgreSQL Connections",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "5s",
"schemaVersion": 26,
"style": "dark",
"tags": [
"database",
"connections",
"hikaricp"
],
"templating": {
"list": [
{
"allValue": null,
"current": {
"selected": false,
"text": "customer-service",
"value": "customer-service"
},
"datasource": "Prometheus",
"definition": "label_values(application)",
"hide": 0,
"includeAll": false,
"label": "Application",
"multi": false,
"name": "application",
"options": [],
"query": "label_values(application)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"selected": false,
"text": "customer-service-7f8b9c5d6f-2xvqp:8080",
"value": "customer-service-7f8b9c5d6f-2xvqp:8080"
},
"datasource": "Prometheus",
"definition": "label_values(hikaricp_connections_active{application=\"$application\"}, instance)",
"hide": 0,
"includeAll": false,
"label": "Instance",
"multi": false,
"name": "instance",
"options": [],
"query": "label_values(hikaricp_connections_active{application=\"$application\"}, instance)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
]
},
"timezone": "",
"title": "Database Connection Monitoring",
"uid": "db-connections",
"version": 1
}
• Long-term: Implemented a comprehensive database connection management strategy:
- Created a connection leak detection and reporting system
- Implemented automated connection pool monitoring and alerting
- Added circuit breakers for database connections
- Documented best practices for connection management
- Implemented regular connection pool maintenance
Lessons Learned:
Proper database connection management is critical for application stability and performance.
How to Avoid:
Use try-with-resources for all database operations.
Configure connection pools with appropriate limits and timeouts.
Implement connection leak detection and monitoring.
Use circuit breakers to prevent cascading failures.
Regularly review and test connection management code.
No summary provided
What Happened:
Users reported seeing outdated data when accessing the application through different regions. Some transactions appeared to be missing or incomplete when viewed through certain application instances. The issue was intermittent and more pronounced during peak traffic periods. The application team initially suspected a caching issue, but investigation revealed a database replication problem.
Diagnosis Steps:
Analyzed application logs for transaction patterns and error messages.
Examined database server metrics and replication status.
Monitored network traffic between primary and replica instances.
Tested read consistency across different application nodes.
Reviewed recent infrastructure changes and database configuration updates.
Root Cause:
The investigation revealed multiple issues contributing to replication lag:
1. Write-heavy batch jobs were running during peak traffic periods.
2. Network bandwidth between data centers was insufficient for replication traffic.
3. Replica servers were undersized compared to the primary.
4. The application was not using synchronous replication for critical transactions.
5. Monitoring was not properly configured to alert on replication lag.
Fix/Workaround:
• Short-term: Implemented immediate fixes to reduce replication lag:
-- Before: Default PostgreSQL replication configuration
-- postgresql.conf on primary
wal_level = replica
max_wal_senders = 10
wal_keep_segments = 64
max_replication_slots = 10
-- After: Optimized PostgreSQL replication configuration
-- postgresql.conf on primary
wal_level = logical
max_wal_senders = 20
wal_keep_size = 2048 -- 2GB in MB
max_replication_slots = 20
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (app_replica1, app_replica2)'
wal_sender_timeout = 60s
wal_receiver_timeout = 60s
wal_log_hints = on
hot_standby_feedback = on
• Implemented synchronous replication for critical transactions:
-- Create a synchronous replication group for critical transactions
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (app_replica1, app_replica2)';
SELECT pg_reload_conf();
-- Create a function to dynamically control synchronous replication
CREATE OR REPLACE FUNCTION set_synchronous_commit_for_session(level text) RETURNS void AS $$
BEGIN
EXECUTE 'SET synchronous_commit TO ' || level;
END;
$$ LANGUAGE plpgsql;
-- Usage in application code
SELECT set_synchronous_commit_for_session('on'); -- For critical transactions
SELECT set_synchronous_commit_for_session('local'); -- For non-critical transactions
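On the application side the same idea can be expressed per transaction with SET LOCAL, which keeps the setting from leaking into pooled sessions. A minimal sketch using Spring's JdbcTemplate and TransactionTemplate, assuming a DataSourceTransactionManager bound to the same DataSource; the service, table names, and methods are illustrative assumptions:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.support.TransactionTemplate;

public class CriticalWriteService {

    private final JdbcTemplate jdbc;
    private final TransactionTemplate tx; // backed by a DataSourceTransactionManager on the same DataSource

    public CriticalWriteService(JdbcTemplate jdbc, TransactionTemplate tx) {
        this.jdbc = jdbc;
        this.tx = tx;
    }

    // Critical write: COMMIT waits for at least one synchronous standby to acknowledge.
    public void placeOrder(long orderId, long customerId) {
        tx.executeWithoutResult(status -> {
            jdbc.execute("SET LOCAL synchronous_commit TO on");
            jdbc.update("INSERT INTO orders (id, customer_id) VALUES (?, ?)", orderId, customerId);
        });
    }

    // Non-critical write: commit locally without waiting for standby acknowledgement.
    public void recordPageView(long customerId, String page) {
        tx.executeWithoutResult(status -> {
            jdbc.execute("SET LOCAL synchronous_commit TO local");
            jdbc.update("INSERT INTO page_views (customer_id, page) VALUES (?, ?)", customerId, page);
        });
    }
}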
• Implemented a replication lag monitoring solution:
-- Create extension for monitoring
CREATE EXTENSION pg_stat_statements;
-- Create a function to check replication lag
CREATE OR REPLACE FUNCTION check_replication_lag() RETURNS TABLE (
replica_name text,
lag_bytes bigint,
lag_seconds interval
) AS $$
BEGIN
RETURN QUERY
SELECT
application_name AS replica_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
CASE
WHEN replay_lag IS NULL THEN interval '0'
ELSE replay_lag
END AS lag_seconds
FROM pg_stat_replication;
END;
$$ LANGUAGE plpgsql;
-- Create a view for monitoring
CREATE VIEW replication_status AS
SELECT
current_timestamp AS check_time,
application_name AS replica_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
CASE
WHEN replay_lag IS NULL THEN interval '0'
ELSE replay_lag
END AS lag_seconds,
CASE
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 1073741824 THEN 'CRITICAL' -- 1GB
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 104857600 THEN 'WARNING' -- 100MB
ELSE 'OK'
END AS status
FROM pg_stat_replication;
• Implemented a Go-based replication monitoring service:
// replication_monitor.go
package main
import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/jmoiron/sqlx"
	_ "github.com/lib/pq"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
// ReplicationStatus represents the replication status of a replica
type ReplicationStatus struct {
ReplicaName string `db:"replica_name"`
LagBytes int64 `db:"lag_bytes"`
LagSeconds float64 `db:"lag_seconds"` // seconds, extracted from the replay_lag interval
Status string `db:"status"`
}
var (
replicationLagBytes = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "postgresql_replication_lag_bytes",
Help: "PostgreSQL replication lag in bytes",
},
[]string{"replica"},
)
replicationLagSeconds = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "postgresql_replication_lag_seconds",
Help: "PostgreSQL replication lag in seconds",
},
[]string{"replica"},
)
)
func init() {
prometheus.MustRegister(replicationLagBytes)
prometheus.MustRegister(replicationLagSeconds)
}
func main() {
// Get database connection parameters from environment
dbHost := getEnv("DB_HOST", "localhost")
dbPort := getEnv("DB_PORT", "5432")
dbUser := getEnv("DB_USER", "postgres")
dbPassword := getEnv("DB_PASSWORD", "")
dbName := getEnv("DB_NAME", "postgres")
checkInterval := getEnvAsInt("CHECK_INTERVAL", 15)
alertThresholdBytes := getEnvAsInt64("ALERT_THRESHOLD_BYTES", 104857600) // 100MB
alertThresholdSeconds := getEnvAsInt("ALERT_THRESHOLD_SECONDS", 30)
// Create database connection string
connStr := fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=%s sslmode=require",
dbHost, dbPort, dbUser, dbPassword, dbName)
// Connect to the database
db, err := sqlx.Connect("postgres", connStr)
if err != nil {
log.Fatalf("Failed to connect to database: %v", err)
}
defer db.Close()
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Start monitoring replication lag
ticker := time.NewTicker(time.Duration(checkInterval) * time.Second)
defer ticker.Stop()
log.Printf("Replication monitor started. Checking every %d seconds", checkInterval)
for range ticker.C {
statuses, err := checkReplicationStatus(db)
if err != nil {
log.Printf("Error checking replication status: %v", err)
continue
}
for _, status := range statuses {
// Update Prometheus metrics
replicationLagBytes.WithLabelValues(status.ReplicaName).Set(float64(status.LagBytes))
replicationLagSeconds.WithLabelValues(status.ReplicaName).Set(status.LagSeconds)
// Log status
log.Printf("Replica: %s, Lag: %d bytes (%.2f MB), %.2f seconds, Status: %s",
status.ReplicaName, status.LagBytes, float64(status.LagBytes)/(1024*1024),
status.LagSeconds, status.Status)
// Check if we need to alert
if status.LagBytes > alertThresholdBytes || status.LagSeconds > float64(alertThresholdSeconds) {
sendAlert(status)
}
}
}
}
// checkReplicationStatus queries the database for replication status
func checkReplicationStatus(db *sqlx.DB) ([]ReplicationStatus, error) {
var statuses []ReplicationStatus
query := `
SELECT
application_name AS replica_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
COALESCE(EXTRACT(EPOCH FROM replay_lag), 0)::float8 AS lag_seconds,
CASE
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 1073741824 THEN 'CRITICAL' -- 1GB
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 104857600 THEN 'WARNING' -- 100MB
ELSE 'OK'
END AS status
FROM pg_stat_replication
`
err := db.Select(&statuses, query)
if err != nil {
return nil, fmt.Errorf("failed to query replication status: %w", err)
}
return statuses, nil
}
// sendAlert sends an alert for high replication lag
func sendAlert(status ReplicationStatus) {
// In a real implementation, this would send an alert to a monitoring system
// For this example, we'll just log the alert
log.Printf("ALERT: High replication lag detected for replica %s: %d bytes (%.2f MB), %.2f seconds",
status.ReplicaName, status.LagBytes, float64(status.LagBytes)/(1024*1024),
status.LagSeconds.Seconds())
// Example of how to send an alert to Slack
// slackWebhookURL := os.Getenv("SLACK_WEBHOOK_URL")
// if slackWebhookURL != "" {
// message := fmt.Sprintf("ALERT: High replication lag detected for replica %s: %d bytes (%.2f MB), %.2f seconds",
// status.ReplicaName, status.LagBytes, float64(status.LagBytes)/(1024*1024),
// status.LagSeconds)
// sendSlackAlert(slackWebhookURL, message)
// }
}
// Helper functions
func getEnv(key, defaultValue string) string {
value := os.Getenv(key)
if value == "" {
return defaultValue
}
return value
}
func getEnvAsInt(key string, defaultValue int) int {
valueStr := getEnv(key, fmt.Sprintf("%d", defaultValue))
value := defaultValue
fmt.Sscanf(valueStr, "%d", &value)
return value
}
func getEnvAsInt64(key string, defaultValue int64) int64 {
valueStr := getEnv(key, fmt.Sprintf("%d", defaultValue))
value := defaultValue
fmt.Sscanf(valueStr, "%d", &value)
return value
}
• Implemented application-level read consistency controls:
// Java application code for handling replication lag
public class DatabaseService {
private final DataSource primaryDataSource;
private final DataSource replicaDataSource;
private final ReplicationLagChecker lagChecker;
// Maximum acceptable lag for reads in milliseconds
private static final long MAX_ACCEPTABLE_LAG_MS = 1000;
public DatabaseService(DataSource primaryDataSource, DataSource replicaDataSource,
ReplicationLagChecker lagChecker) {
this.primaryDataSource = primaryDataSource;
this.replicaDataSource = replicaDataSource;
this.lagChecker = lagChecker;
}
/**
* Executes a read query with consistency guarantees
*/
public <T> T executeReadQuery(String sql, RowMapper<T> rowMapper,
ReadConsistency consistency) {
switch (consistency) {
case EVENTUAL:
// Always read from replica
return executeQuery(replicaDataSource, sql, rowMapper);
case STRONG:
// Always read from primary
return executeQuery(primaryDataSource, sql, rowMapper);
case BOUNDED_STALENESS:
// Check lag and decide where to read from
if (isReplicaAcceptable()) {
return executeQuery(replicaDataSource, sql, rowMapper);
} else {
return executeQuery(primaryDataSource, sql, rowMapper);
}
default:
throw new IllegalArgumentException("Unknown consistency level: " + consistency);
}
}
/**
* Checks if replica lag is within acceptable bounds
*/
private boolean isReplicaAcceptable() {
try {
long lagMillis = lagChecker.getReplicationLagMillis();
return lagMillis <= MAX_ACCEPTABLE_LAG_MS;
} catch (Exception e) {
// If we can't check lag, fall back to primary
log.warn("Failed to check replication lag, falling back to primary", e);
return false;
}
}
/**
* Executes a write query (always on primary)
*/
public int executeWriteQuery(String sql, Object... params) {
// Always write to primary; update() returns the number of rows affected
JdbcTemplate template = new JdbcTemplate(primaryDataSource);
return template.update(sql, params);
}
// Helper method to execute a query on a specific data source
private <T> T executeQuery(DataSource dataSource, String sql, RowMapper<T> rowMapper) {
JdbcTemplate template = new JdbcTemplate(dataSource);
return template.queryForObject(sql, rowMapper);
}
// Enum for read consistency levels
public enum ReadConsistency {
EVENTUAL, // Read from replica, may see stale data
BOUNDED_STALENESS, // Read from replica if lag is acceptable, otherwise primary
STRONG // Read from primary, always see latest data
}
}
• Long-term: Implemented a comprehensive database replication management strategy:
- Upgraded network infrastructure between data centers
- Implemented proper resource allocation for replica servers
- Scheduled batch jobs outside of peak traffic periods
- Developed a comprehensive monitoring and alerting system
- Implemented application-level read consistency controls
Lessons Learned:
Database replication lag requires careful monitoring and application-level handling to prevent data inconsistency.
How to Avoid:
Implement proper monitoring and alerting for replication lag.
Use synchronous replication for critical transactions.
Size replica servers appropriately for the workload.
Implement application-level read consistency controls.
Schedule batch jobs to minimize impact on replication.
No summary provided
What Happened:
During a major sales event, customers reported seeing items available for purchase that were actually out of stock. Some customers were able to place orders for products that had already been sold out, leading to order cancellations and customer complaints. The operations team noticed that the issue was intermittent and seemed to affect only some customers. Investigation revealed that the database replication lag between master and read replicas had increased significantly during peak traffic periods.
Diagnosis Steps:
Analyzed replication lag metrics across all database replicas.
Examined query patterns and load distribution during peak traffic.
Reviewed application code for read/write splitting implementation.
Checked network performance between database servers.
Monitored transaction sizes and frequency on the master database.
Root Cause:
The investigation revealed multiple issues contributing to replication lag:
1. Large transactions on the master database were blocking replication.
2. Read replicas had insufficient resources to keep up with the write volume.
3. Application code was not routing queries appropriately based on consistency requirements.
4. No monitoring or circuit breaking for replication lag was implemented.
5. Database buffer pool and other configuration settings were suboptimal.
Fix/Workaround:
• Implemented immediate fixes to reduce replication lag
• Optimized MySQL configuration for better replication performance
• Implemented application-level consistency controls
• Added monitoring and alerting for replication lag
• Scaled up read replica resources to handle peak load
Lessons Learned:
Database replication lag must be actively monitored and managed in high-traffic applications.
How to Avoid:
Implement proper monitoring and alerting for replication lag.
Design application code to handle eventual consistency appropriately (one pattern is sketched after this list).
Size database replicas based on write volume, not just read capacity.
Test replication performance under peak load conditions.
Consider semi-synchronous replication for critical data paths.
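One pattern for handling eventual consistency at the application level is read-your-writes routing: after a user writes, pin that user's reads to the primary for a short window that covers typical replication lag. The sketch below is illustrative; the class name and the five-second window are assumptions, not part of the original fix:

```java
// Illustrative read-your-writes router: pin a user's reads to the primary
// for a short window after that user has written. Names and the 5-second
// window are assumptions for this sketch.
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.sql.DataSource;

public class ReadYourWritesRouter {
    private static final Duration PIN_WINDOW = Duration.ofSeconds(5);

    private final DataSource primaryDataSource;
    private final DataSource replicaDataSource;
    private final Map<String, Instant> lastWriteByUser = new ConcurrentHashMap<>();

    public ReadYourWritesRouter(DataSource primaryDataSource, DataSource replicaDataSource) {
        this.primaryDataSource = primaryDataSource;
        this.replicaDataSource = replicaDataSource;
    }

    /** Call this whenever the given user performs a write on the primary. */
    public void recordWrite(String userId) {
        lastWriteByUser.put(userId, Instant.now());
    }

    /** Reads go to the primary only while the user's last write is recent. */
    public DataSource dataSourceForRead(String userId) {
        Instant lastWrite = lastWriteByUser.get(userId);
        if (lastWrite != null && Instant.now().isBefore(lastWrite.plus(PIN_WINDOW))) {
            return primaryDataSource;
        }
        return replicaDataSource;
    }
}
```

In a multi-instance deployment, the last-write marker would need to live somewhere shared (for example a session attribute, cookie, or cache) rather than in a per-instance map.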
No summary provided
What Happened:
During a routine application deployment, the automated database schema migration process failed halfway through execution. The application became unavailable, and database queries began returning errors. The operations team had to quickly respond to restore service while preserving data integrity. The incident occurred during business hours, affecting thousands of users and causing significant business impact.
Diagnosis Steps:
Examined database migration logs for error messages.
Checked the state of the database schema and migration history table.
Reviewed the failed migration script and its dependencies.
Analyzed database locks and active connections.
Verified application logs for related errors.
Root Cause:
The investigation revealed multiple issues with the database migration:
1. The migration script contained a schema change that conflicted with existing data.
2. The migration tool was configured without proper timeout settings.
3. No pre-deployment validation was performed in a staging environment.
4. The migration lacked proper transaction handling for rollback.
5. Database connection pooling settings prevented clean termination of connections.
Fix/Workaround:
• Implemented immediate fixes to restore service
• Rolled back the partial migration using backup data
• Created a corrected migration script with proper data handling
• Deployed the fixed migration during a maintenance window
• Improved the migration validation process
Lessons Learned:
Database schema migrations require careful planning, validation, and rollback strategies.
How to Avoid:
Implement comprehensive pre-deployment validation in staging environments.
Design migrations with proper transaction handling and rollback capabilities.
Create data validation steps before executing schema changes (see the sketch after this list).
Configure appropriate timeouts and error handling in migration tools.
Establish clear procedures for database migration failures.
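As one way to add data validation before schema changes, a pipeline step can check for rows that would violate an upcoming constraint and abort before the migration touches the schema. The table, column, and constraint below are illustrative assumptions, not taken from the incident:

```java
// Illustrative pre-migration check: fail fast if existing data would violate
// a constraint the migration is about to add. Table and column names are examples.
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class PreMigrationValidator {
    private final JdbcTemplate jdbcTemplate;

    public PreMigrationValidator(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    /**
     * Example: the migration will add NOT NULL to orders.customer_id.
     * Abort before touching the schema if any existing rows would violate it.
     */
    public void validateBeforeMigration() {
        Long violations = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL", Long.class);
        if (violations != null && violations > 0) {
            throw new IllegalStateException(
                violations + " rows would violate the new NOT NULL constraint; "
                + "backfill the data before running this migration");
        }
    }
}
```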
No summary provided
What Happened:
A retail company's e-commerce platform experienced a major outage during a promotional event. The application consisted of dozens of microservices, each maintaining its own database connection pool to a shared PostgreSQL cluster. As traffic increased, services began failing with database connection errors. The failures cascaded across the platform as dependent services also failed. Database monitoring showed connection counts at the configured maximum, but many connections were idle or in an inconsistent state.
Diagnosis Steps:
Analyzed database connection metrics and active queries.
Examined connection pool configurations across services.
Reviewed application logs for connection handling patterns.
Traced request flows through the microservices architecture.
Monitored database server resource utilization.
Root Cause:
The investigation revealed multiple issues with connection management:
1. Each microservice maintained its own connection pool without coordination.
2. Connection pools were sized based on individual service needs, not system-wide capacity.
3. Some services failed to properly release connections during error conditions.
4. Long-running transactions held connections unnecessarily.
5. No circuit breaking or backpressure mechanisms were implemented.
Fix/Workaround:
• Implemented immediate fixes to restore service
• Optimized connection pool sizes across services
• Improved connection handling with proper release in all code paths
• Implemented circuit breaking and backpressure mechanisms
• Created a centralized connection management strategy
Lessons Learned:
Database connection management in microservices requires system-wide coordination and proper resource governance.
How to Avoid:
Implement centralized connection pool management or governance.
Size connection pools based on database capacity, not individual service needs.
Use connection monitoring and alerting to detect potential issues.
Implement circuit breaking and backpressure mechanisms.
Regularly review and optimize database access patterns.
```sql
-- Example SQL to monitor PostgreSQL connection usage
-- (grouped by database/user/application/client/state so connection_count is meaningful)
SELECT
datname AS database_name,
usename AS username,
application_name,
client_addr AS client_ip,
state,
COUNT(*) AS connection_count,
MAX(EXTRACT(EPOCH FROM (now() - state_change))) AS max_idle_time_seconds,
MAX(EXTRACT(EPOCH FROM (now() - xact_start))) AS max_transaction_time_seconds
FROM
pg_stat_activity
WHERE
backend_type = 'client backend'
GROUP BY
datname, usename, application_name, client_addr, state
ORDER BY
connection_count DESC,
max_transaction_time_seconds DESC NULLS LAST;
```
```java
// Example Java code for proper connection pool configuration with HikariCP
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;
import java.sql.Connection;
import java.sql.SQLException;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
public class OptimizedConnectionPool {
private final HikariDataSource dataSource;
private final CircuitBreaker circuitBreaker;
public OptimizedConnectionPool(String serviceName, int maxPoolSize, int maxSystemConnections) {
// Calculate appropriate pool size based on service needs and system capacity
int calculatedPoolSize = calculateOptimalPoolSize(serviceName, maxPoolSize, maxSystemConnections);
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/mydb");
config.setUsername("username");
config.setPassword("password");
// Connection pool tuning
config.setMaximumPoolSize(calculatedPoolSize);
config.setMinimumIdle(calculatedPoolSize / 4);
config.setIdleTimeout(TimeUnit.MINUTES.toMillis(10));
config.setMaxLifetime(TimeUnit.MINUTES.toMillis(30));
// Connection validation and leak detection
config.setConnectionTimeout(TimeUnit.SECONDS.toMillis(5));
config.setValidationTimeout(TimeUnit.SECONDS.toMillis(3));
config.setLeakDetectionThreshold(TimeUnit.SECONDS.toMillis(60));
// Proper connection initialization
config.setInitializationFailTimeout(TimeUnit.SECONDS.toMillis(30));
config.setConnectionInitSql("SET statement_timeout = '30s'");
// Metrics for monitoring
config.setMetricRegistry(getMetricRegistry());
config.setHealthCheckRegistry(getHealthCheckRegistry());
// Pool name for identification in monitoring
config.setPoolName("HikariPool-" + serviceName);
dataSource = new HikariDataSource(config);
// Circuit breaker to prevent cascading failures
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(5)
.slidingWindowSize(10)
.recordExceptions(SQLException.class)
.build();
circuitBreaker = CircuitBreaker.of("database-" + serviceName, circuitBreakerConfig);
}
public Connection getConnection() throws SQLException {
// DataSource::getConnection throws a checked SQLException, so decorate it as a checked supplier
return Try.of(
CircuitBreaker.decorateCheckedSupplier(circuitBreaker, dataSource::getConnection)
).getOrElseThrow(e -> {
if (e instanceof SQLException) {
return (SQLException) e;
}
return new SQLException("Failed to get connection", e);
});
}
private int calculateOptimalPoolSize(String serviceName, int requestedSize, int systemMaxConnections) {
// Implementation would consider:
// 1. Total system connection capacity
// 2. Number of service instances
// 3. Service priority/importance
// 4. Historical connection usage patterns
return Math.min(requestedSize, systemMaxConnections / getServiceCount());
}
private int getServiceCount() {
// Implementation would determine the number of services sharing the database
return 10;
}
private Object getMetricRegistry() {
// Implementation would return the application's metric registry
return null;
}
private Object getHealthCheckRegistry() {
// Implementation would return the application's health check registry
return null;
}
public void close() {
dataSource.close();
}
}
```
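A possible caller of the pool wrapper above, using try-with-resources so the connection is returned to the pool on every code path. The table, query, and the assumed 200-connection system budget are illustrative:

```java
// Illustrative caller of OptimizedConnectionPool; the connection is always
// returned to the pool, even when the query throws.
public class OrderLookupExample {
    public static void main(String[] args) {
        // "order-service" pool with a requested size of 20 against an assumed 200-connection system budget
        OptimizedConnectionPool pool = new OptimizedConnectionPool("order-service", 20, 200);
        try (java.sql.Connection conn = pool.getConnection();
             java.sql.PreparedStatement stmt = conn.prepareStatement(
                 "SELECT status FROM orders WHERE id = ?")) {
            stmt.setLong(1, 42L);
            try (java.sql.ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    System.out.println("Order status: " + rs.getString("status"));
                }
            }
        } catch (java.sql.SQLException e) {
            // Rejections from an open circuit breaker are also surfaced as SQLException by the wrapper.
            throw new RuntimeException("Order lookup failed", e);
        } finally {
            pool.close();
        }
    }
}
```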
No summary provided
What Happened:
A financial services company was deploying a new feature that required significant database schema changes. The deployment included adding new tables, modifying existing ones, and creating new indexes. The migration was executed during a maintenance window using Liquibase through their CI/CD pipeline. Shortly after the migration started, the database became unresponsive, and monitoring alerts triggered for high lock contention. The application experienced complete downtime, affecting thousands of users. The operations team attempted to roll back the migration, but the rollback also failed due to lock conflicts. The incident lasted for over 3 hours before being resolved.
Diagnosis Steps:
Analyzed PostgreSQL logs for lock-related errors and deadlocks.
Examined the active queries and locks using pg_stat_activity and pg_locks.
Reviewed the Liquibase changelog and migration scripts.
Checked the application connection pool configuration.
Investigated the database load and performance metrics before and during the incident.
Root Cause:
The investigation revealed multiple issues with the database migration:
1. The migration attempted to create indexes on large tables without using the CONCURRENTLY option.
2. Multiple interdependent schema changes were bundled in a single transaction.
3. The application connection pool was not drained before migration, leaving active connections.
4. The migration didn't account for foreign key constraints that created lock dependencies.
5. The rollback procedure wasn't properly tested with realistic data volumes.
Fix/Workaround:
• Implemented immediate fixes to restore service
• Terminated blocking queries and connections to release locks
• Split the migration into smaller, independent changes
• Implemented proper connection draining before migrations
• Created a more robust migration strategy with proper testing
```sql
-- Original problematic migration
-- File: V1_23__add_transaction_features.sql
-- This runs in a single transaction, causing lock contention
CREATE TABLE transaction_details (
id SERIAL PRIMARY KEY,
transaction_id INTEGER NOT NULL REFERENCES transactions(id),
detail_type VARCHAR(50) NOT NULL,
detail_value JSONB NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Adding index without CONCURRENTLY (requires exclusive lock)
CREATE INDEX idx_transactions_user_id ON transactions(user_id);
-- Adding columns to existing table (requires ACCESS EXCLUSIVE lock)
ALTER TABLE transactions
ADD COLUMN risk_score DECIMAL(5,2),
ADD COLUMN processed_by VARCHAR(100),
ADD COLUMN external_ref VARCHAR(100);
-- Creating foreign key (requires exclusive lock on both tables)
ALTER TABLE transaction_details
ADD CONSTRAINT fk_transaction_details_transaction
FOREIGN KEY (transaction_id) REFERENCES transactions(id);
-- Improved migration approach
-- File: V1_23_1__add_transaction_details_table.sql
CREATE TABLE transaction_details (
id SERIAL PRIMARY KEY,
transaction_id INTEGER NOT NULL,
detail_type VARCHAR(50) NOT NULL,
detail_value JSONB NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- File: V1_23_2__add_transactions_user_id_index.sql
-- Note: This runs in its own transaction
CREATE INDEX CONCURRENTLY idx_transactions_user_id ON transactions(user_id);
-- File: V1_23_3__add_transactions_columns.sql
-- Note: Split column additions to minimize lock time
ALTER TABLE transactions ADD COLUMN risk_score DECIMAL(5,2);
ALTER TABLE transactions ADD COLUMN processed_by VARCHAR(100);
ALTER TABLE transactions ADD COLUMN external_ref VARCHAR(100);
-- File: V1_23_4__add_transaction_details_fk.sql
-- Note: Add foreign key after all other operations
ALTER TABLE transaction_details
ADD CONSTRAINT fk_transaction_details_transaction
FOREIGN KEY (transaction_id) REFERENCES transactions(id);
```
```yaml
# Liquibase Configuration with Improved Migration Strategy
# File: liquibase.yaml
changeLogFile: db/changelog/db.changelog-master.yaml
url: jdbc:postgresql://db.example.com:5432/financial_db
username: ${LIQUIBASE_USERNAME}
password: ${LIQUIBASE_PASSWORD}
# Connection pool settings
liquibaseProLicense: true
liquibase.command.pool.enabled: true
liquibase.command.pool.connections.max: 5
liquibase.command.pool.connections.min: 1
# Locking settings
liquibase.databaseChangeLogLockTableName: databasechangeloglock
liquibase.lockWaitTime: 300
liquibase.changeLogLockWaitTimeInMinutes: 5
# Performance settings
liquibase.batchSize: 1000
liquibase.databaseChangeLogTableName: databasechangelog
# Contexts for different environments
contexts: ${LIQUIBASE_CONTEXTS}
# Improved logging
logLevel: INFO
logFile: liquibase-${LIQUIBASE_COMMAND}-${LIQUIBASE_CONTEXTS}.log
# Validation settings
validateOnUpdate: true
```
```yaml
# Database Migration CI/CD Pipeline Configuration
# File: database-migration-pipeline.yaml
version: '1.0'
stages:
- validate
- test
- backup
- drain
- migrate
- verify
- rollback
jobs:
validate:
stage: validate
image: liquibase/liquibase:4.17
script:
- liquibase --changeLogFile=db/changelog/db.changelog-master.yaml validate
- liquibase --changeLogFile=db/changelog/db.changelog-master.yaml updateSQL > migration.sql
- python3 scripts/analyze_migration_locks.py migration.sql
artifacts:
paths:
- migration.sql
- migration_analysis.json
test:
stage: test
image: postgres:14
services:
- postgres:14
variables:
POSTGRES_DB: test_db
POSTGRES_USER: test_user
POSTGRES_PASSWORD: test_password
script:
- psql -h postgres -U test_user -d test_db -f scripts/seed_test_data.sql
- liquibase --url=jdbc:postgresql://postgres:5432/test_db --username=test_user --password=test_password update
- python3 scripts/verify_migration.py
- python3 scripts/test_rollback.py
backup:
stage: backup
image: postgres:14
script:
- echo "Creating database backup..."
- pg_dump -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -F c -f backup.dump
- aws s3 cp backup.dump s3://${BACKUP_BUCKET}/migrations/$(date +%Y%m%d_%H%M%S)_pre_migration.dump
artifacts:
paths:
- backup.dump
drain:
stage: drain
image: alpine:3.16
script:
- echo "Draining connection pools..."
- curl -X POST ${APP_HEALTH_ENDPOINT}/prepare-for-maintenance
- python3 scripts/wait_for_connections_drain.py --host=${DB_HOST} --user=${DB_USER} --password=${DB_PASSWORD} --db=${DB_NAME} --timeout=300
- echo "Connections drained successfully"
migrate:
stage: migrate
image: liquibase/liquibase:4.17
script:
- echo "Running database migration..."
- liquibase --url=jdbc:postgresql://${DB_HOST}:5432/${DB_NAME} --username=${DB_USER} --password=${DB_PASSWORD} update
when: manual
allow_failure: false
verify:
stage: verify
image: alpine:3.16
script:
- echo "Verifying migration..."
- python3 scripts/verify_production_migration.py --host=${DB_HOST} --user=${DB_USER} --password=${DB_PASSWORD} --db=${DB_NAME}
- curl -X POST ${APP_HEALTH_ENDPOINT}/health-check
- echo "Migration verified successfully"
rollback:
stage: rollback
image: liquibase/liquibase:4.17
script:
- echo "Rolling back migration..."
- liquibase --url=jdbc:postgresql://${DB_HOST}:5432/${DB_NAME} --username=${DB_USER} --password=${DB_PASSWORD} rollbackCount 1
when: manual
allow_failure: false
```
```python
#!/usr/bin/env python3
# File: analyze_migration_locks.py
# Purpose: Analyze SQL migration for potential lock issues
import re
import sys
import json
from collections import defaultdict
# High-risk operations that can cause lock contention
HIGH_RISK_OPERATIONS = [
(r'CREATE\s+INDEX\s+(?!CONCURRENTLY)', 'CREATE INDEX without CONCURRENTLY'),
(r'ALTER\s+TABLE\s+\w+\s+ADD\s+CONSTRAINT', 'Adding constraints'),
(r'ALTER\s+TABLE\s+\w+\s+ADD\s+COLUMN', 'Adding columns'),
(r'ALTER\s+TABLE\s+\w+\s+ALTER\s+COLUMN', 'Altering columns'),
(r'DROP\s+TABLE', 'Dropping tables'),
(r'TRUNCATE\s+TABLE', 'Truncating tables'),
(r'UPDATE\s+\w+\s+SET', 'Bulk updates'),
(r'CREATE\s+TRIGGER', 'Creating triggers'),
(r'CREATE\s+OR\s+REPLACE\s+FUNCTION', 'Creating functions'),
(r'VACUUM\s+FULL', 'VACUUM FULL'),
]
# Tables that are frequently accessed and should be treated with care
CRITICAL_TABLES = [
'transactions',
'accounts',
'users',
'payments',
'audit_log',
]
def analyze_sql_file(filename):
with open(filename, 'r') as f:
sql_content = f.read()
# Split into individual statements
statements = re.split(r';[\s\n]+', sql_content)
analysis = {
'high_risk_operations': [],
'affected_critical_tables': defaultdict(list),
'transaction_boundaries': [],
'recommendations': []
}
# Analyze each statement
for stmt in statements:
if not stmt.strip():
continue
# Check for high-risk operations
for pattern, description in HIGH_RISK_OPERATIONS:
if re.search(pattern, stmt, re.IGNORECASE):
analysis['high_risk_operations'].append({
'description': description,
'statement': stmt.strip()
})
# Check for critical tables
for table in CRITICAL_TABLES:
table_pattern = r'\b' + table + r'\b'
if re.search(table_pattern, stmt, re.IGNORECASE):
analysis['affected_critical_tables'][table].append(stmt.strip())
# Check for transaction boundaries
if re.search(r'BEGIN|START\s+TRANSACTION', sql_content, re.IGNORECASE) and re.search(r'COMMIT', sql_content, re.IGNORECASE):
analysis['transaction_boundaries'].append("Multiple operations wrapped in a single transaction")
# Generate recommendations
if analysis['high_risk_operations']:
analysis['recommendations'].append("Split high-risk operations into separate migrations")
if len(analysis['affected_critical_tables']) > 0:
analysis['recommendations'].append("Schedule migrations affecting critical tables during low-traffic periods")
if "CREATE INDEX without CONCURRENTLY" in [op['description'] for op in analysis['high_risk_operations']]:
analysis['recommendations'].append("Use CREATE INDEX CONCURRENTLY for indexes on existing tables")
if "Adding constraints" in [op['description'] for op in analysis['high_risk_operations']]:
analysis['recommendations'].append("Add foreign key constraints after all tables and data are created")
if "Transaction boundaries found" in analysis['transaction_boundaries']:
analysis['recommendations'].append("Consider breaking large transactions into smaller ones")
# Calculate risk score
risk_score = len(analysis['high_risk_operations']) * 10
risk_score += len(analysis['affected_critical_tables']) * 15
risk_score += len(analysis['transaction_boundaries']) * 5
analysis['risk_score'] = min(risk_score, 100)
analysis['risk_level'] = "High" if risk_score > 70 else "Medium" if risk_score > 30 else "Low"
return analysis
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python analyze_migration_locks.py <sql_file>")
sys.exit(1)
sql_file = sys.argv[1]
analysis = analyze_sql_file(sql_file)
# Write analysis to file
with open('migration_analysis.json', 'w') as f:
json.dump(analysis, f, indent=2)
# Print summary
print(f"Migration Risk Analysis: {analysis['risk_level']} (Score: {analysis['risk_score']})")
print(f"High-risk operations: {len(analysis['high_risk_operations'])}")
print(f"Critical tables affected: {len(analysis['affected_critical_tables'])}")
if analysis['risk_level'] == "High":
print("\nWARNING: This migration has a high risk of causing lock contention!")
print("Recommendations:")
for rec in analysis['recommendations']:
print(f"- {rec}")
sys.exit(1)
sys.exit(0)
```
```python
#!/usr/bin/env python3
# File: wait_for_connections_drain.py
# Purpose: Wait for database connections to drain before migration
import argparse
import time
import sys
import psycopg2
from psycopg2 import sql
def parse_args():
parser = argparse.ArgumentParser(description='Wait for database connections to drain')
parser.add_argument('--host', required=True, help='Database host')
parser.add_argument('--user', required=True, help='Database user')
parser.add_argument('--password', required=True, help='Database password')
parser.add_argument('--db', required=True, help='Database name')
parser.add_argument('--timeout', type=int, default=300, help='Timeout in seconds')
parser.add_argument('--max-connections', type=int, default=5,
help='Maximum number of connections to allow (excluding this script)')
parser.add_argument('--exclude-ips', help='Comma-separated list of IPs to exclude from count')
parser.add_argument('--exclude-users', help='Comma-separated list of users to exclude from count')
return parser.parse_args()
def get_active_connections(conn, dbname, exclude_ips=None, exclude_users=None):
exclude_ips = exclude_ips or []
exclude_users = exclude_users or []
# Convert to lists if they're strings
if isinstance(exclude_ips, str):
exclude_ips = [ip.strip() for ip in exclude_ips.split(',')]
if isinstance(exclude_users, str):
exclude_users = [user.strip() for user in exclude_users.split(',')]
# Add current connection's user to exclusions
cursor = conn.cursor()
cursor.execute("SELECT current_user")
current_user = cursor.fetchone()[0]
exclude_users.append(current_user)
# Build the query with exclusions
query = sql.SQL("""
SELECT count(*)
FROM pg_stat_activity
WHERE datname = %s
AND state = 'active'
AND pid != pg_backend_pid()
""")
params = [dbname]
# Add IP exclusions if provided
if exclude_ips:
placeholders = ', '.join(['%s'] * len(exclude_ips))
query = sql.SQL(f"{query.as_string(conn)} AND client_addr NOT IN ({placeholders})")
params.extend(exclude_ips)
# Add user exclusions if provided
if exclude_users:
placeholders = ', '.join(['%s'] * len(exclude_users))
query = sql.SQL(f"{query.as_string(conn)} AND usename NOT IN ({placeholders})")
params.extend(exclude_users)
cursor.execute(query, params)
count = cursor.fetchone()[0]
cursor.close()
return count
def wait_for_connections_drain(args):
start_time = time.time()
conn = psycopg2.connect(
host=args.host,
user=args.user,
password=args.password,
dbname=args.db
)
try:
while True:
active_connections = get_active_connections(
conn, args.db, args.exclude_ips, args.exclude_users
)
elapsed = time.time() - start_time
print(f"Active connections: {active_connections} (elapsed: {elapsed:.1f}s)")
if active_connections <= args.max_connections:
print("Connection count below threshold, proceeding with migration")
break
if elapsed > args.timeout:
print(f"Timeout reached after {args.timeout} seconds")
print(f"Still have {active_connections} active connections")
print("Proceeding with migration anyway (warning: this may cause issues)")
break
# Wait before checking again
time.sleep(5)
finally:
conn.close()
if __name__ == "__main__":
args = parse_args()
try:
wait_for_connections_drain(args)
except Exception as e:
print(f"Error: {e}")
sys.exit(1)
sys.exit(0)
```
Lessons Learned:
Database schema migrations require careful planning and testing to avoid lock contention and downtime.
How to Avoid:
Split large migrations into smaller, independent changes.
Use CONCURRENTLY option for index creation on large tables.
Implement proper connection draining before migrations.
Test migrations with realistic data volumes and load patterns.
Create automated tools to analyze migration scripts for potential lock issues.
No summary provided
What Happened:
A large e-commerce platform experienced a sudden increase in response times and error rates during a promotional event. Users reported timeouts and errors when trying to complete purchases. The application logs showed numerous database connection timeout errors. The database itself showed normal CPU and memory utilization, but the number of active connections was at the configured maximum. The issue worsened over time as the application servers repeatedly attempted to establish new connections, further straining the connection pool.
Diagnosis Steps:
Analyzed application logs for error patterns.
Examined database metrics for connection counts and utilization.
Reviewed connection pool configuration in the application.
Monitored connection acquisition times and usage patterns.
Traced connection lifecycle through the application code.
Root Cause:
The investigation revealed multiple issues with connection pool management:
1. The connection pool size was configured too small for the peak traffic.
2. Connection leaks in the application code were not properly releasing connections.
3. Connection timeout settings were too long, keeping idle connections occupied.
4. No connection validation was configured, leading to stale connections.
5. The application had no circuit breaker to prevent cascading failures.
Fix/Workaround:
• Implemented immediate improvements to connection management
• Increased the connection pool size to accommodate peak traffic
• Fixed connection leaks in the application code
• Adjusted connection timeout settings
• Implemented connection validation to detect and replace stale connections
• Added a circuit breaker to prevent cascading failures
Lessons Learned:
Database connection pool configuration is critical for application performance and resilience under load.
How to Avoid:
Size connection pools based on expected peak traffic and database capacity.
Implement proper connection lifecycle management in application code.
Configure appropriate timeouts and validation for connections.
Monitor connection pool metrics and alert on high utilization (a sketch follows this list).
Implement circuit breakers to prevent cascading failures.
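To make the pool-metrics monitoring point above concrete, HikariCP exposes live pool statistics through `HikariPoolMXBean`. A minimal sketch that samples them and warns on high utilization might look like the following; the 80% threshold, 15-second period, and logging target are assumptions:

```java
// Minimal sketch: periodically sample HikariCP pool statistics and warn on
// high utilization. The 80% threshold and 15-second period are assumptions.
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PoolUtilizationMonitor {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(HikariDataSource dataSource) {
        scheduler.scheduleAtFixedRate(() -> {
            HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
            int active = pool.getActiveConnections();
            int total = pool.getTotalConnections();
            int waiting = pool.getThreadsAwaitingConnection();
            double utilization = total == 0 ? 0.0 : (double) active / total;
            if (utilization > 0.8 || waiting > 0) {
                System.err.printf(
                    "Connection pool pressure: active=%d total=%d waiting=%d (%.0f%% utilized)%n",
                    active, total, waiting, utilization * 100);
            }
        }, 15, 15, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

In production this sample would typically feed a metrics registry and alerting pipeline rather than standard error.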