# Cloud Migration Scenarios
No summary provided
What Happened:
After migrating a large Oracle database to AWS RDS PostgreSQL, users reported significant performance degradation for key business transactions. Some queries that previously completed in milliseconds were now taking several seconds, causing application timeouts and poor user experience.
Diagnosis Steps:
Compared query execution plans between the source and target databases.
Analyzed database metrics (CPU, memory, I/O, connections).
Reviewed application logs for timeout patterns.
Examined network latency between application and database.
Profiled the slowest queries in the new environment.
Root Cause:
The migration process had successfully moved the data but failed to properly optimize the database for the cloud environment. Specifically:
1. Missing indexes that existed in the Oracle database were not recreated in PostgreSQL.
2. Query patterns optimized for Oracle's query optimizer were inefficient in PostgreSQL.
3. The RDS instance was undersized for the workload.
4. Connection pooling was not properly configured, causing excessive connection overhead.
Fix/Workaround:
• Short-term: Added missing indexes and optimized the most critical queries:
-- Create missing indexes for frequently accessed columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
CREATE INDEX idx_transactions_date ON transactions(transaction_date);
-- Optimize slow query
-- Before: Inefficient query using Oracle-specific patterns
SELECT o.order_id, o.order_date, c.customer_name,
SUM(oi.quantity * oi.unit_price) as total_amount
FROM orders o, customers c, order_items oi
WHERE o.customer_id = c.customer_id
AND o.order_id = oi.order_id
AND o.order_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY o.order_id, o.order_date, c.customer_name;
-- After: PostgreSQL-optimized query with explicit joins
SELECT o.order_id, o.order_date, c.customer_name,
SUM(oi.quantity * oi.unit_price) as total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY o.order_id, o.order_date, c.customer_name;
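To confirm that the rewritten query actually picks up the new indexes, the plan can be checked programmatically. The snippet below is a minimal sketch using psycopg2 and the same illustrative connection details used later in this scenario; it simply prints the plan and flags any remaining sequential scans.
# verify_plan.py - check that the rewritten query uses the new indexes (sketch)
import psycopg2

DB_PARAMS = {
    'dbname': 'mydb', 'user': 'dbuser', 'password': 'dbpassword',
    'host': 'mydb.amazonaws.com', 'port': '5432'
}

QUERY = """
SELECT o.order_id, o.order_date, c.customer_name,
       SUM(oi.quantity * oi.unit_price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date BETWEEN %s AND %s
GROUP BY o.order_id, o.order_date, c.customer_name;
"""

def explain(conn, query, params):
    """Print the execution plan and flag any remaining sequential scans."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + query, params)
        plan_lines = [row[0] for row in cur.fetchall()]
    for line in plan_lines:
        print(line)
    if any("Seq Scan" in line for line in plan_lines):
        print("WARNING: plan still contains sequential scans - check indexes")

if __name__ == "__main__":
    conn = psycopg2.connect(**DB_PARAMS)
    try:
        explain(conn, QUERY, ('2023-01-01', '2023-01-31'))
    finally:
        conn.close()
Running it once before and once after creating the indexes gives a concrete before/after comparison of plan shape and buffer usage.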
• Implemented proper connection pooling in the application:
// Before: Direct connection management
@Configuration
public class DatabaseConfig {
@Bean
public DataSource dataSource() {
DriverManagerDataSource dataSource = new DriverManagerDataSource();
dataSource.setDriverClassName("org.postgresql.Driver");
dataSource.setUrl("jdbc:postgresql://mydb.amazonaws.com:5432/mydb");
dataSource.setUsername("dbuser");
dataSource.setPassword("dbpassword");
return dataSource;
}
}
// After: HikariCP connection pooling
@Configuration
public class DatabaseConfig {
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://mydb.amazonaws.com:5432/mydb");
config.setUsername("dbuser");
config.setPassword("dbpassword");
config.setMaximumPoolSize(20);
config.setMinimumIdle(5);
config.setIdleTimeout(30000);
config.setConnectionTimeout(10000);
config.setMaxLifetime(1800000);
// PostgreSQL JDBC driver optimizations (server-side prepared statement caching)
config.addDataSourceProperty("prepareThreshold", "3");
config.addDataSourceProperty("preparedStatementCacheQueries", "256");
config.addDataSourceProperty("preparedStatementCacheSizeMiB", "5");
return new HikariDataSource(config);
}
}
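If any Python services or batch jobs share the same RDS instance, the same pooling discipline applies. The following is a minimal sketch that mirrors the HikariCP settings with SQLAlchemy's built-in pool; the DSN and limits are illustrative assumptions, not values from the incident.
# sqlalchemy_pool.py - connection pool settings analogous to the HikariCP config (sketch)
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://dbuser:dbpassword@mydb.amazonaws.com:5432/mydb",
    pool_size=20,          # ~ maximumPoolSize
    max_overflow=0,        # hard cap instead of bursting past the pool
    pool_timeout=10,       # seconds to wait for a free connection (~ connectionTimeout)
    pool_recycle=1800,     # recycle connections after 30 min (~ maxLifetime)
    pool_pre_ping=True,    # validate connections before handing them out
)

if __name__ == "__main__":
    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())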
• Long-term: Implemented a comprehensive database optimization strategy:
#!/usr/bin/env python3
# db_performance_analyzer.py
import psycopg2
import json
import csv
from datetime import datetime
# Database connection parameters
DB_PARAMS = {
'dbname': 'mydb',
'user': 'dbuser',
'password': 'dbpassword',
'host': 'mydb.amazonaws.com',
'port': '5432'
}
def analyze_slow_queries():
"""Analyze slow queries from PostgreSQL logs"""
conn = psycopg2.connect(**DB_PARAMS)
try:
with conn.cursor() as cur:
# Note: on RDS, log_min_duration_statement is set via the DB parameter group;
# ALTER SYSTEM is not permitted. On self-managed PostgreSQL the equivalent would be:
#   ALTER SYSTEM SET log_min_duration_statement = 1000;  -- log queries taking > 1s
#   SELECT pg_reload_conf();
# Get current slow queries
cur.execute("""
SELECT query, calls, total_exec_time, mean_exec_time, rows -- columns renamed from total_time/mean_time in PostgreSQL 13
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
""")
slow_queries = cur.fetchall()
# Write results to CSV
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
with open(f'slow_queries_{timestamp}.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(['Query', 'Calls', 'Total Time (ms)', 'Mean Time (ms)', 'Rows'])
writer.writerows(slow_queries)
# Analyze missing indexes
cur.execute("""
SELECT
t.schemaname || '.' || t.relname as table,
i.indexrelname as index,
i.idx_scan as index_scans,
t.seq_scan as sequential_scans,
t.seq_tup_read as sequential_tuples_read,
t.idx_tup_fetch as index_tuples_fetched
FROM
pg_stat_user_tables t
LEFT JOIN pg_stat_user_indexes i ON t.relid = i.relid
WHERE
t.seq_scan > 0
ORDER BY
t.seq_scan DESC, t.seq_tup_read DESC;
""")
index_candidates = cur.fetchall()
# Write results to CSV
with open(f'index_candidates_{timestamp}.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(['Table', 'Index', 'Index Scans', 'Sequential Scans', 'Sequential Tuples Read', 'Index Tuples Fetched'])
writer.writerows(index_candidates)
print(f"Analysis complete. Results written to slow_queries_{timestamp}.csv and index_candidates_{timestamp}.csv")
finally:
conn.close()
def generate_optimization_recommendations():
"""Generate optimization recommendations based on database analysis"""
conn = psycopg2.connect(**DB_PARAMS)
try:
with conn.cursor() as cur:
recommendations = []
# Check for tables without primary keys
cur.execute("""
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
ON c.table_schema = t.table_schema
AND c.table_name = t.table_name
AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema NOT IN ('pg_catalog', 'information_schema')
AND t.table_type = 'BASE TABLE'
AND c.constraint_name IS NULL;
""")
tables_without_pk = cur.fetchall()
for schema, table in tables_without_pk:
recommendations.append({
'type': 'missing_primary_key',
'schema': schema,
'table': table,
'recommendation': f"Add a primary key to {schema}.{table}"
})
# Check for tables that might benefit from partitioning
cur.execute("""
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE n_live_tup > 10000000
ORDER BY n_live_tup DESC;
""")
large_tables = cur.fetchall()
for schema, table, rows in large_tables:
recommendations.append({
'type': 'partitioning_candidate',
'schema': schema,
'table': table,
'rows': rows,
'recommendation': f"Consider partitioning {schema}.{table} which has {rows} rows"
})
# Check for unused indexes
cur.execute("""
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_%';
""")
unused_indexes = cur.fetchall()
for schema, table, index, scans in unused_indexes:
recommendations.append({
'type': 'unused_index',
'schema': schema,
'table': table,
'index': index,
'recommendation': f"Consider dropping unused index {index} on {schema}.{table}"
})
# Write recommendations to JSON
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
with open(f'optimization_recommendations_{timestamp}.json', 'w') as f:
json.dump(recommendations, f, indent=2)
print(f"Generated {len(recommendations)} recommendations in optimization_recommendations_{timestamp}.json")
finally:
conn.close()
if __name__ == "__main__":
analyze_slow_queries()
generate_optimization_recommendations()
• Resized the RDS instance based on workload analysis:
# Before: Undersized RDS instance
resource "aws_db_instance" "database" {
identifier = "production-db"
engine = "postgres"
engine_version = "13.4"
instance_class = "db.t3.large"
allocated_storage = 100
max_allocated_storage = 500
name = "mydb"
username = "dbuser"
password = var.db_password
parameter_group_name = "default.postgres13"
backup_retention_period = 7
multi_az = true
skip_final_snapshot = false
final_snapshot_identifier = "prod-db-final-snapshot"
}
# After: Properly sized RDS instance with optimized parameters
resource "aws_db_parameter_group" "postgres_optimized" {
name = "postgres-optimized"
family = "postgres13"
parameter {
name = "shared_buffers"
value = "{DBInstanceClassMemory/32768}MB"
}
parameter {
name = "work_mem"
value = "16MB"
}
parameter {
name = "maintenance_work_mem"
value = "256MB"
}
parameter {
name = "effective_cache_size"
value = "{DBInstanceClassMemory/16384}MB"
}
parameter {
name = "random_page_cost"
value = "1.1" # Optimized for SSD storage
}
}
resource "aws_db_instance" "database" {
identifier = "production-db"
engine = "postgres"
engine_version = "13.4"
instance_class = "db.r5.2xlarge" # Memory-optimized instance
allocated_storage = 500
max_allocated_storage = 1000
storage_type = "gp3"
iops = 12000
name = "mydb"
username = "dbuser"
password = var.db_password
parameter_group_name = aws_db_parameter_group.postgres_optimized.name
backup_retention_period = 7
multi_az = true
performance_insights_enabled = true
performance_insights_retention_period = 7
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring_role.arn
skip_final_snapshot = false
final_snapshot_identifier = "prod-db-final-snapshot"
}
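After resizing, it is worth verifying that the new instance class actually has headroom rather than assuming it does. The sketch below pulls recent CloudWatch metrics for the instance with boto3; the thresholds are illustrative, and the identifier matches the Terraform example above.
# check_rds_headroom.py - pull recent RDS metrics to sanity-check the new instance size (sketch)
from datetime import datetime, timedelta
import boto3

INSTANCE_ID = "production-db"   # matches the Terraform identifier above
METRICS = {
    "CPUUtilization": ("Percent", 70.0),          # warn above 70% average CPU
    "FreeableMemory": ("Bytes", 2 * 1024 ** 3),   # warn below 2 GiB free
    "DatabaseConnections": ("Count", 150.0),      # warn above 150 connections
}

def recent_average(cw, metric):
    now = datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None

if __name__ == "__main__":
    cw = boto3.client("cloudwatch")
    for metric, (unit, threshold) in METRICS.items():
        value = recent_average(cw, metric)
        if value is None:
            print(f"{metric}: no datapoints")
        elif metric == "FreeableMemory" and value < threshold:
            print(f"WARNING {metric}: {value:.0f} {unit} below {threshold:.0f}")
        elif metric != "FreeableMemory" and value > threshold:
            print(f"WARNING {metric}: {value:.1f} {unit} above {threshold:.1f}")
        else:
            print(f"OK {metric}: {value:.1f} {unit}")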
Lessons Learned:
Database migrations require careful optimization beyond just moving the data.
How to Avoid:
Perform thorough performance testing before migration.
Create a database-specific migration plan that includes index and query optimization.
Size cloud resources based on actual workload, not just existing on-premises specifications.
Implement proper connection pooling and database parameter tuning.
Consider database-specific differences in query optimization between source and target platforms.
No summary provided
What Happened:
After migrating a set of microservices from an on-premises data center to AWS, users reported degraded performance. The application monitoring showed that inter-service communication latency had increased from an average of 5ms to over 100ms, causing cascading timeouts and poor user experience.
Diagnosis Steps:
Analyzed network traffic patterns between services.
Measured latency between different AWS availability zones.
Reviewed service placement and network configuration.
Examined VPC peering and transit gateway configurations.
Profiled application code for network calls and timeout settings.
Root Cause:
The migration had placed interdependent services in different AWS regions to align with data residency requirements. Additionally, the application was making excessive API calls between services due to a chatty API design that worked well in the low-latency on-premises environment but performed poorly in a distributed cloud environment.
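The cost of the chatty design is easy to quantify: every synchronous call pays a full round trip, so latency grows linearly with the number of sequential calls. The sketch below uses illustrative numbers to show why a call chain that was invisible on a LAN blows the latency budget across regions.
# chatty_api_cost.py - rough cost of sequential vs parallel service calls (sketch)

def total_latency(rtt_ms, service_time_ms, calls, parallel=False):
    """Per-call cost is one round trip plus service time; sequential calls add up."""
    per_call = rtt_ms + service_time_ms
    return per_call if parallel else per_call * calls

for rtt in (0.5, 25.0):  # on-prem LAN vs cross-region, illustrative values
    seq = total_latency(rtt, service_time_ms=5, calls=4)
    par = total_latency(rtt, service_time_ms=5, calls=4, parallel=True)
    print(f"RTT {rtt:5.1f} ms: 4 sequential calls ~ {seq:6.1f} ms, "
          f"fanned out in parallel ~ {par:5.1f} ms")
At sub-millisecond round-trip times the four calls are negligible; at cross-region round-trip times the same code path alone approaches or exceeds a 100ms budget, which is why aggregation and co-location were needed.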
Fix/Workaround:
• Short-term: Adjusted timeout settings and implemented retries:
// Before: Hard-coded short timeouts
RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new SimpleClientHttpRequestFactory());
((SimpleClientHttpRequestFactory) restTemplate.getRequestFactory()).setConnectTimeout(1000);
((SimpleClientHttpRequestFactory) restTemplate.getRequestFactory()).setReadTimeout(1000);
// After: Environment-specific timeouts with retries
@Configuration
public class RestClientConfig {
@Value("${api.connect.timeout:5000}")
private int connectTimeout;
@Value("${api.read.timeout:10000}")
private int readTimeout;
@Bean
public RestTemplate restTemplate() {
RestTemplate restTemplate = new RestTemplate();
// Set timeouts
SimpleClientHttpRequestFactory requestFactory = new SimpleClientHttpRequestFactory();
requestFactory.setConnectTimeout(connectTimeout);
requestFactory.setReadTimeout(readTimeout);
restTemplate.setRequestFactory(requestFactory);
// Add retry capability
restTemplate.setInterceptors(Collections.singletonList(new RetryInterceptor()));
return restTemplate;
}
}
public class RetryInterceptor implements ClientHttpRequestInterceptor {
private static final Logger log = LoggerFactory.getLogger(RetryInterceptor.class);
private static final int MAX_RETRIES = 3;
@Override
public ClientHttpResponse intercept(HttpRequest request, byte[] body,
ClientHttpRequestExecution execution) throws IOException {
int attempts = 0;
IOException lastException = null;
while (attempts < MAX_RETRIES) {
try {
attempts++;
return execution.execute(request, body);
} catch (IOException e) {
lastException = e;
log.warn("Request failed (attempt {}): {}", attempts, e.getMessage());
if (attempts >= MAX_RETRIES) {
break;
}
try {
// Exponential backoff
Thread.sleep((long) Math.pow(2, attempts) * 100);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
throw new IOException("Request interrupted", ie);
}
}
}
throw lastException;
}
}
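Python services in the same call path can get an equivalent adjustment without a hand-rolled loop by combining requests with urllib3's Retry helper. This is a sketch with illustrative timeout, backoff, and endpoint values.
# http_retry.py - environment-tunable timeouts with retries for service calls (sketch)
import os
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

CONNECT_TIMEOUT = float(os.getenv("API_CONNECT_TIMEOUT", "5"))
READ_TIMEOUT = float(os.getenv("API_READ_TIMEOUT", "10"))

def build_session():
    retry = Retry(
        total=3,
        backoff_factor=0.2,               # exponential backoff between retries
        status_forcelist=[502, 503, 504], # retry on transient gateway errors
        allowed_methods=["GET"],          # only retry idempotent calls
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

if __name__ == "__main__":
    session = build_session()
    resp = session.get(
        "https://user-service.example.com/users/123",  # illustrative endpoint
        timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
    )
    print(resp.status_code)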
• Long-term: Redesigned the application architecture for cloud optimization:
# AWS CloudFormation template for service placement optimization
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Optimized service placement for cloud migration'
Resources:
# VPC for primary region
PrimaryVPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.0.0.0/16
EnableDnsSupport: true
EnableDnsHostnames: true
Tags:
- Key: Name
Value: Primary-VPC
# VPC for secondary region
SecondaryVPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.1.0.0/16
EnableDnsSupport: true
EnableDnsHostnames: true
Tags:
- Key: Name
Value: Secondary-VPC
# Transit Gateway for inter-region connectivity
TransitGateway:
Type: AWS::EC2::TransitGateway
Properties:
AmazonSideAsn: 64512
AutoAcceptSharedAttachments: enable
DefaultRouteTableAssociation: enable
DefaultRouteTablePropagation: enable
Description: Transit Gateway for inter-region connectivity
Tags:
- Key: Name
Value: Inter-Region-TGW
# Transit Gateway Attachment for Primary VPC
PrimaryVPCAttachment:
Type: AWS::EC2::TransitGatewayAttachment
Properties:
TransitGatewayId: !Ref TransitGateway
VpcId: !Ref PrimaryVPC
SubnetIds:
- !Ref PrimarySubnet1
- !Ref PrimarySubnet2
Tags:
- Key: Name
Value: Primary-VPC-Attachment
# Transit Gateway Attachment for Secondary VPC
SecondaryVPCAttachment:
Type: AWS::EC2::TransitGatewayAttachment
Properties:
TransitGatewayId: !Ref TransitGateway
VpcId: !Ref SecondaryVPC
SubnetIds:
- !Ref SecondarySubnet1
- !Ref SecondarySubnet2
Tags:
- Key: Name
Value: Secondary-VPC-Attachment
# CloudFront distribution for global content delivery
ContentDeliveryDistribution:
Type: AWS::CloudFront::Distribution
Properties:
DistributionConfig:
Origins:
- DomainName: !GetAtt ApiGateway.DomainName
Id: ApiGatewayOrigin
CustomOriginConfig:
HTTPPort: 80
HTTPSPort: 443
OriginProtocolPolicy: https-only
Enabled: true
DefaultCacheBehavior:
TargetOriginId: ApiGatewayOrigin
ViewerProtocolPolicy: redirect-to-https
CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # CachingDisabled policy
OriginRequestPolicyId: 216adef6-5c7f-47e4-b989-5492eafa07d3 # AllViewer policy
• Implemented API aggregation to reduce chattiness:
// Before: Multiple API calls for a single operation
async function getUserProfile(userId: string): Promise<UserProfile> {
const user = await userService.getUser(userId);
const preferences = await preferencesService.getPreferences(userId);
const orders = await orderService.getRecentOrders(userId);
const notifications = await notificationService.getPendingNotifications(userId);
return {
user,
preferences,
orders,
notifications
};
}
// After: API Gateway with aggregation pattern
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as apigw from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';
export class ApiAggregationStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// Create API Gateway
const api = new apigw.RestApi(this, 'AggregationApi', {
deployOptions: {
stageName: 'prod',
cachingEnabled: true,
cacheClusterEnabled: true,
cacheClusterSize: '0.5',
// Per-method cache TTL for the aggregated endpoint
methodOptions: {
'/user-profiles/{userId}/GET': {
cachingEnabled: true,
cacheTtl: cdk.Duration.minutes(5)
}
}
}
});
// Create Lambda for aggregation
const aggregationFunction = new lambda.Function(this, 'AggregationFunction', {
runtime: lambda.Runtime.NODEJS_16_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda'),
timeout: cdk.Duration.seconds(30),
memorySize: 1024,
environment: {
USER_SERVICE_URL: 'https://user-service.example.com',
PREFERENCES_SERVICE_URL: 'https://preferences-service.example.com',
ORDER_SERVICE_URL: 'https://order-service.example.com',
NOTIFICATION_SERVICE_URL: 'https://notification-service.example.com'
}
});
// Create API endpoint
const userProfiles = api.root.addResource('user-profiles');
const userProfile = userProfiles.addResource('{userId}');
// Integrate with Lambda
userProfile.addMethod('GET', new apigw.LambdaIntegration(aggregationFunction));
// Response caching for GET /user-profiles/{userId} is configured above via
// deployOptions.methodOptions; stage method settings are not a mutable array
// that can be pushed onto after the stage is created.
}
}
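The aggregation function itself should fan the downstream calls out concurrently so the aggregated endpoint pays roughly one round trip instead of four. Below is a hypothetical handler sketch; it is written in Python purely to illustrate the pattern (the stack above configures a Node.js runtime), and the URL shape appended to each service base URL is an assumption.
# lambda/index.py - hypothetical aggregation handler (sketch)
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVICES = {
    "user": os.environ.get("USER_SERVICE_URL", ""),
    "preferences": os.environ.get("PREFERENCES_SERVICE_URL", ""),
    "orders": os.environ.get("ORDER_SERVICE_URL", ""),
    "notifications": os.environ.get("NOTIFICATION_SERVICE_URL", ""),
}

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

def handler(event, context):
    user_id = event["pathParameters"]["userId"]
    # Fan the downstream calls out in parallel; total latency ~ slowest call.
    with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
        futures = {
            name: pool.submit(fetch, f"{base}/{user_id}")  # illustrative URL shape
            for name, base in SERVICES.items()
        }
        profile = {name: f.result() for name, f in futures.items()}
    return {"statusCode": 200, "body": json.dumps(profile)}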
• Implemented data replication to reduce cross-region calls:
// Data replication service in Go
package main
import (
"context"
"encoding/json"
"log"
"os"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/dynamodb"
"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
"github.com/aws/aws-sdk-go-v2/service/sqs"
)
type ChangeEvent struct {
TableName string `json:"tableName"`
Operation string `json:"operation"` // INSERT, MODIFY, REMOVE
Keys map[string]interface{} `json:"keys"`
NewImage map[string]interface{} `json:"newImage,omitempty"`
}
func main() {
// Load configuration
sourceRegion := os.Getenv("SOURCE_REGION")
targetRegion := os.Getenv("TARGET_REGION")
queueURL := os.Getenv("CHANGE_QUEUE_URL")
// Configure AWS clients
sourceConfig, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(sourceRegion))
if err != nil {
log.Fatalf("Failed to load source region config: %v", err)
}
targetConfig, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(targetRegion))
if err != nil {
log.Fatalf("Failed to load target region config: %v", err)
}
sqsClient := sqs.NewFromConfig(sourceConfig)
dynamoClient := dynamodb.NewFromConfig(targetConfig)
// Process messages continuously
for {
// Receive messages from SQS
result, err := sqsClient.ReceiveMessage(context.TODO(), &sqs.ReceiveMessageInput{
QueueUrl: aws.String(queueURL),
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20,
})
if err != nil {
log.Printf("Failed to receive messages: %v", err)
time.Sleep(5 * time.Second)
continue
}
// Process each message
for _, message := range result.Messages {
// Parse the change event
var event ChangeEvent
if err := json.Unmarshal([]byte(*message.Body), &event); err != nil {
log.Printf("Failed to parse message: %v", err)
continue
}
// Replicate the change to the target region
if err := replicateChange(dynamoClient, event); err != nil {
log.Printf("Failed to replicate change: %v", err)
continue
}
// Delete the processed message
_, err := sqsClient.DeleteMessage(context.TODO(), &sqs.DeleteMessageInput{
QueueUrl: aws.String(queueURL),
ReceiptHandle: message.ReceiptHandle,
})
if err != nil {
log.Printf("Failed to delete message: %v", err)
}
}
}
}
func replicateChange(client *dynamodb.Client, event ChangeEvent) error {
switch event.Operation {
case "INSERT", "MODIFY":
// Convert the new image to DynamoDB attribute values
item := make(map[string]types.AttributeValue)
// Implementation of conversion from map to DynamoDB attributes
// ...
// Put the item in the target table
_, err := client.PutItem(context.TODO(), &dynamodb.PutItemInput{
TableName: aws.String(event.TableName),
Item: item,
})
return err
case "REMOVE":
// Convert the keys to DynamoDB attribute values
key := make(map[string]types.AttributeValue)
// Implementation of conversion from map to DynamoDB attributes
// ...
// Delete the item from the target table
_, err := client.DeleteItem(context.TODO(), &dynamodb.DeleteItemInput{
TableName: aws.String(event.TableName),
Key: key,
})
return err
default:
return nil
}
}
Lessons Learned:
Cloud migration requires rethinking application architecture, not just lifting and shifting.
How to Avoid:
Conduct network latency testing before migration.
Design applications with cloud-native patterns like API aggregation.
Consider data locality and service placement in multi-region deployments.
Implement caching and data replication strategies.
Adjust timeout and retry settings for cloud environments.
No summary provided
What Happened:
During a planned migration of a critical PostgreSQL database from on-premises to AWS RDS, the application experienced severe performance degradation and intermittent connection failures. The migration was performed during a maintenance window, but the issues persisted well beyond the expected downtime, affecting business operations.
Diagnosis Steps:
Analyzed database connection logs from the application servers.
Monitored RDS performance metrics in CloudWatch.
Reviewed network traffic between application servers and the new RDS instance.
Examined PostgreSQL logs for errors and slow queries.
Compared connection pool configurations between the old and new environments.
Root Cause:
Multiple issues contributed to the outage:
1. The connection pool settings in the application were optimized for the on-premises database but were inappropriate for RDS.
2. Network latency between application servers and RDS was significantly higher than in the on-premises setup (a quick way to measure this is sketched after this list).
3. The RDS instance was undersized compared to the on-premises database.
4. The PostgreSQL parameter group settings in RDS differed from the on-premises configuration.
5. Connection timeouts were not properly configured, causing connection leaks.
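A quick way to see the latency gap described above is to time connection setup and a trivial query against both endpoints. Minimal sketch, assuming psycopg2 and illustrative hostnames and credentials.
# measure_db_latency.py - compare connect/query latency to two PostgreSQL endpoints (sketch)
import time
import psycopg2

ENDPOINTS = {
    "on-prem": {"host": "db.internal.example.com", "dbname": "mydb",
                "user": "dbuser", "password": "dbpassword", "port": 5432},
    "rds": {"host": "mydb.amazonaws.com", "dbname": "mydb",
            "user": "dbuser", "password": "dbpassword", "port": 5432},
}

def time_endpoint(params, samples=5):
    connect_ms, query_ms = [], []
    for _ in range(samples):
        start = time.perf_counter()
        conn = psycopg2.connect(**params)
        connect_ms.append((time.perf_counter() - start) * 1000)
        try:
            start = time.perf_counter()
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            query_ms.append((time.perf_counter() - start) * 1000)
        finally:
            conn.close()
    return sum(connect_ms) / samples, sum(query_ms) / samples

if __name__ == "__main__":
    for name, params in ENDPOINTS.items():
        connect_avg, query_avg = time_endpoint(params)
        print(f"{name}: connect ~ {connect_avg:.1f} ms, SELECT 1 ~ {query_avg:.1f} ms")
When connection setup is expensive relative to the query itself, a smaller pool of long-lived, validated connections beats frequent reconnects, which is what the revised pool settings below aim for.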
Fix/Workaround:
• Short-term: Adjusted connection pool settings and increased RDS instance size:
# Application connection pool configuration (Spring Boot)
spring:
  datasource:
    hikari:
      # Before migration:
      #   maximum-pool-size: 100
      #   minimum-idle: 10
      #   connection-timeout: 30000
      #   idle-timeout: 600000
      # After optimization for RDS:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 5000
      idle-timeout: 300000
      max-lifetime: 1800000
      validation-timeout: 3000
      leak-detection-threshold: 60000
• Optimized PostgreSQL parameter group settings:
// RDS Parameter Group changes
{
"max_connections": "200",
"shared_buffers": "4GB",
"effective_cache_size": "12GB",
"work_mem": "16MB",
"maintenance_work_mem": "1GB",
"max_worker_processes": "8",
"max_parallel_workers_per_gather": "4",
"max_parallel_workers": "8",
"wal_buffers": "16MB",
"checkpoint_timeout": "15min",
"checkpoint_completion_target": "0.9",
"random_page_cost": "1.1",
"effective_io_concurrency": "200",
"idle_in_transaction_session_timeout": "60000",
"statement_timeout": "60000"
}
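Because these are RDS parameter group settings rather than postgresql.conf edits, they are applied through the RDS API. The sketch below pushes a subset of them with boto3; the group name is illustrative, memory-related values are expressed in the parameters' native kB units, and static parameters such as shared_buffers only take effect after a reboot.
# apply_parameter_group.py - push tuned settings to an RDS parameter group (sketch)
import boto3

PARAMETER_GROUP = "postgres-optimized"   # illustrative name
PARAMETERS = {
    "work_mem": "16384",                 # kB (16 MB)
    "maintenance_work_mem": "1048576",   # kB (1 GB)
    "random_page_cost": "1.1",
    "idle_in_transaction_session_timeout": "60000",
    "statement_timeout": "60000",
}

def apply_parameters(rds, group_name, params):
    # ModifyDBParameterGroup accepts at most 20 parameters per call.
    rds.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=[
            {
                "ParameterName": name,
                "ParameterValue": value,
                "ApplyMethod": "pending-reboot",  # safe default; dynamic params may use "immediate"
            }
            for name, value in params.items()
        ],
    )

if __name__ == "__main__":
    apply_parameters(boto3.client("rds"), PARAMETER_GROUP, PARAMETERS)
    print(f"Submitted {len(PARAMETERS)} parameter changes to {PARAMETER_GROUP}")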
• Long-term: Implemented a comprehensive database migration strategy:
// database_migration_validator.go
package main
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"log"
"os"
"sync"
"time"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
"github.com/aws/aws-sdk-go-v2/service/rds"
_ "github.com/lib/pq"
"gopkg.in/yaml.v3"
)
type DatabaseConfig struct {
SourceDB struct {
Host string `yaml:"host"`
Port int `yaml:"port"`
User string `yaml:"user"`
Password string `yaml:"password"`
DBName string `yaml:"dbname"`
SSLMode string `yaml:"sslmode"`
} `yaml:"sourceDB"`
TargetDB struct {
Host string `yaml:"host"`
Port int `yaml:"port"`
User string `yaml:"user"`
Password string `yaml:"password"`
DBName string `yaml:"dbname"`
SSLMode string `yaml:"sslmode"`
} `yaml:"targetDB"`
Tests struct {
Queries []string `yaml:"queries"`
Tables []string `yaml:"tables"`
RowCountDelta float64 `yaml:"rowCountDelta"`
ResponseTimeSLA float64 `yaml:"responseTimeSLA"`
Concurrency int `yaml:"concurrency"`
Duration string `yaml:"duration"`
} `yaml:"tests"`
AWS struct {
Region string `yaml:"region"`
RDSInstanceID string `yaml:"rdsInstanceId"`
AlarmThresholds struct {
CPUUtilization float64 `yaml:"cpuUtilization"`
DatabaseConnections int `yaml:"databaseConnections"`
FreeableMemory int64 `yaml:"freeableMemory"`
ReadIOPS int `yaml:"readIOPS"`
WriteIOPS int `yaml:"writeIOPS"`
} `yaml:"alarmThresholds"`
} `yaml:"aws"`
}
type TestResult struct {
Query string
SourceTime time.Duration
TargetTime time.Duration
SourceRowCount int
TargetRowCount int
RowCountMatch bool
PerformanceOK bool
Error string
}
func main() {
// Load configuration
configFile, err := os.ReadFile("migration_config.yaml")
if err != nil {
log.Fatalf("Failed to read config file: %v", err)
}
var config DatabaseConfig
if err := yaml.Unmarshal(configFile, &config); err != nil {
log.Fatalf("Failed to parse config: %v", err)
}
// Connect to source database
sourceConnStr := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=%s",
config.SourceDB.Host, config.SourceDB.Port, config.SourceDB.User,
config.SourceDB.Password, config.SourceDB.DBName, config.SourceDB.SSLMode)
sourceDB, err := sql.Open("postgres", sourceConnStr)
if err != nil {
log.Fatalf("Failed to connect to source database: %v", err)
}
defer sourceDB.Close()
// Connect to target database
targetConnStr := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=%s",
config.TargetDB.Host, config.TargetDB.Port, config.TargetDB.User,
config.TargetDB.Password, config.TargetDB.DBName, config.TargetDB.SSLMode)
targetDB, err := sql.Open("postgres", targetConnStr)
if err != nil {
log.Fatalf("Failed to connect to target database: %v", err)
}
defer targetDB.Close()
// Validate database connections
if err := sourceDB.Ping(); err != nil {
log.Fatalf("Failed to ping source database: %v", err)
}
if err := targetDB.Ping(); err != nil {
log.Fatalf("Failed to ping target database: %v", err)
}
// Compare table row counts
tableResults, err := compareTableRowCounts(sourceDB, targetDB, config.Tests.Tables, config.Tests.RowCountDelta)
if err != nil {
log.Fatalf("Failed to compare table row counts: %v", err)
}
// Run query performance tests
duration, err := time.ParseDuration(config.Tests.Duration)
if err != nil {
log.Fatalf("Failed to parse test duration: %v", err)
}
queryResults, err := runQueryTests(sourceDB, targetDB, config.Tests.Queries, config.Tests.Concurrency, duration, config.Tests.ResponseTimeSLA)
if err != nil {
log.Fatalf("Failed to run query tests: %v", err)
}
// Check RDS metrics
if config.AWS.RDSInstanceID != "" {
err := checkRDSMetrics(config.AWS.Region, config.AWS.RDSInstanceID, config.AWS.AlarmThresholds)
if err != nil {
log.Printf("Warning: Failed to check RDS metrics: %v", err)
}
}
// Output results
outputResults(tableResults, queryResults)
}
func compareTableRowCounts(sourceDB, targetDB *sql.DB, tables []string, allowedDelta float64) (map[string]TestResult, error) {
results := make(map[string]TestResult)
for _, table := range tables {
query := fmt.Sprintf("SELECT COUNT(*) FROM %s", table)
// Count rows in source
var sourceCount int
err := sourceDB.QueryRow(query).Scan(&sourceCount)
if err != nil {
results[table] = TestResult{
Error: fmt.Sprintf("Failed to count rows in source table: %v", err),
}
continue
}
// Count rows in target
var targetCount int
err = targetDB.QueryRow(query).Scan(&targetCount)
if err != nil {
results[table] = TestResult{
Error: fmt.Sprintf("Failed to count rows in target table: %v", err),
}
continue
}
// Calculate delta percentage
var deltaPercent float64
if sourceCount > 0 {
deltaPercent = math.Abs(float64(targetCount-sourceCount)) / float64(sourceCount) * 100
} else if targetCount > 0 {
deltaPercent = 100 // If source is 0 but target has rows, that's 100% different
} else {
deltaPercent = 0 // Both are 0, so 0% different
}
results[table] = TestResult{
Query: query,
SourceRowCount: sourceCount,
TargetRowCount: targetCount,
RowCountMatch: deltaPercent <= allowedDelta,
}
}
return results, nil
}
func runQueryTests(sourceDB, targetDB *sql.DB, queries []string, concurrency int, duration time.Duration, sla float64) (map[string]TestResult, error) {
results := make(map[string]TestResult)
var wg sync.WaitGroup
resultChan := make(chan TestResult, len(queries))
// Run each query test
for _, query := range queries {
wg.Add(1)
go func(q string) {
defer wg.Done()
// Test source database
sourceStart := time.Now()
sourceRows, err := sourceDB.Query(q)
if err != nil {
resultChan <- TestResult{
Query: q,
Error: fmt.Sprintf("Failed to execute query on source: %v", err),
}
return
}
sourceRows.Close()
sourceTime := time.Since(sourceStart)
// Test target database
targetStart := time.Now()
targetRows, err := targetDB.Query(q)
if err != nil {
resultChan <- TestResult{
Query: q,
SourceTime: sourceTime,
Error: fmt.Sprintf("Failed to execute query on target: %v", err),
}
return
}
targetRows.Close()
targetTime := time.Since(targetStart)
// Compare performance
performanceRatio := float64(targetTime) / float64(sourceTime)
performanceOK := performanceRatio <= sla
resultChan <- TestResult{
Query: q,
SourceTime: sourceTime,
TargetTime: targetTime,
PerformanceOK: performanceOK,
}
}(query)
}
// Run load test
if concurrency > 0 && duration > 0 {
wg.Add(1)
go func() {
defer wg.Done()
runLoadTest(sourceDB, targetDB, queries, concurrency, duration, resultChan)
}()
}
// Wait for all tests to complete
go func() {
wg.Wait()
close(resultChan)
}()
// Collect results
for result := range resultChan {
results[result.Query] = result
}
return results, nil
}
func runLoadTest(sourceDB, targetDB *sql.DB, queries []string, concurrency int, duration time.Duration, resultChan chan<- TestResult) {
ctx, cancel := context.WithTimeout(context.Background(), duration)
defer cancel()
var wg sync.WaitGroup
for i := 0; i < concurrency; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for {
select {
case <-ctx.Done():
return
default:
// Pick a random query
query := queries[rand.Intn(len(queries))]
// Execute on target DB only for load testing
start := time.Now()
rows, err := targetDB.QueryContext(ctx, query)
if err != nil {
if ctx.Err() == nil { // Only log if not due to context cancellation
log.Printf("Load test query error: %v", err)
}
continue
}
rows.Close()
_ = time.Since(start)
}
}
}()
}
wg.Wait()
}
// Note: the struct tags must match DatabaseConfig.AWS.AlarmThresholds exactly so the value is assignable.
func checkRDSMetrics(region, instanceID string, thresholds struct {
CPUUtilization float64 `yaml:"cpuUtilization"`
DatabaseConnections int `yaml:"databaseConnections"`
FreeableMemory int64 `yaml:"freeableMemory"`
ReadIOPS int `yaml:"readIOPS"`
WriteIOPS int `yaml:"writeIOPS"`
}) error {
// Initialize AWS SDK
cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region))
if err != nil {
return fmt.Errorf("failed to load AWS config: %v", err)
}
// Create CloudWatch client
cwClient := cloudwatch.NewFromConfig(cfg)
// Check CPU utilization
cpuMetric, err := getCloudWatchMetric(cwClient, "AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", instanceID, 300)
if err != nil {
return err
}
if cpuMetric > thresholds.CPUUtilization {
log.Printf("Warning: CPU utilization (%.2f%%) exceeds threshold (%.2f%%)", cpuMetric, thresholds.CPUUtilization)
}
// Check database connections
connectionsMetric, err := getCloudWatchMetric(cwClient, "AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", instanceID, 300)
if err != nil {
return err
}
if int(connectionsMetric) > thresholds.DatabaseConnections {
log.Printf("Warning: Database connections (%d) exceeds threshold (%d)", int(connectionsMetric), thresholds.DatabaseConnections)
}
// Check freeable memory
memoryMetric, err := getCloudWatchMetric(cwClient, "AWS/RDS", "FreeableMemory", "DBInstanceIdentifier", instanceID, 300)
if err != nil {
return err
}
if int64(memoryMetric) < thresholds.FreeableMemory {
log.Printf("Warning: Freeable memory (%d bytes) below threshold (%d bytes)", int64(memoryMetric), thresholds.FreeableMemory)
}
// Check read IOPS
readIOPSMetric, err := getCloudWatchMetric(cwClient, "AWS/RDS", "ReadIOPS", "DBInstanceIdentifier", instanceID, 300)
if err != nil {
return err
}
if int(readIOPSMetric) > thresholds.ReadIOPS {
log.Printf("Warning: Read IOPS (%d) exceeds threshold (%d)", int(readIOPSMetric), thresholds.ReadIOPS)
}
// Check write IOPS
writeIOPSMetric, err := getCloudWatchMetric(cwClient, "AWS/RDS", "WriteIOPS", "DBInstanceIdentifier", instanceID, 300)
if err != nil {
return err
}
if int(writeIOPSMetric) > thresholds.WriteIOPS {
log.Printf("Warning: Write IOPS (%d) exceeds threshold (%d)", int(writeIOPSMetric), thresholds.WriteIOPS)
}
return nil
}
func getCloudWatchMetric(client *cloudwatch.Client, namespace, metricName, dimensionName, dimensionValue string, periodSeconds int32) (float64, error) {
now := time.Now()
startTime := now.Add(-15 * time.Minute)
input := &cloudwatch.GetMetricStatisticsInput{
Namespace: &namespace,
MetricName: &metricName,
Dimensions: []types.Dimension{
{
Name: &dimensionName,
Value: &dimensionValue,
},
},
StartTime: &startTime,
EndTime: &now,
Period: &periodSeconds,
Statistics: []types.Statistic{types.StatisticAverage},
}
output, err := client.GetMetricStatistics(context.TODO(), input)
if err != nil {
return 0, fmt.Errorf("failed to get metric %s: %v", metricName, err)
}
if len(output.Datapoints) == 0 {
return 0, fmt.Errorf("no datapoints returned for metric %s", metricName)
}
// Sort datapoints by timestamp and get the most recent
sort.Slice(output.Datapoints, func(i, j int) bool {
return output.Datapoints[i].Timestamp.After(*output.Datapoints[j].Timestamp)
})
return *output.Datapoints[0].Average, nil
}
func outputResults(tableResults, queryResults map[string]TestResult) {
// Count issues
var tableIssues, performanceIssues int
for _, result := range tableResults {
if !result.RowCountMatch || result.Error != "" {
tableIssues++
}
}
for _, result := range queryResults {
if !result.PerformanceOK || result.Error != "" {
performanceIssues++
}
}
// Print summary
fmt.Printf("Database Migration Validation Results\n")
fmt.Printf("=====================================\n")
fmt.Printf("Tables validated: %d (issues: %d)\n", len(tableResults), tableIssues)
fmt.Printf("Queries tested: %d (issues: %d)\n", len(queryResults), performanceIssues)
fmt.Println()
// Print table results
fmt.Printf("Table Row Count Comparison\n")
fmt.Printf("-------------------------\n")
for table, result := range tableResults {
if result.Error != "" {
fmt.Printf("❌ %s: ERROR - %s\n", table, result.Error)
} else if !result.RowCountMatch {
fmt.Printf("❌ %s: MISMATCH - Source: %d, Target: %d\n", table, result.SourceRowCount, result.TargetRowCount)
} else {
fmt.Printf("✅ %s: MATCH - Source: %d, Target: %d\n", table, result.SourceRowCount, result.TargetRowCount)
}
}
fmt.Println()
// Print query results
fmt.Printf("Query Performance Comparison\n")
fmt.Printf("---------------------------\n")
for query, result := range queryResults {
if result.Error != "" {
fmt.Printf("❌ Query: %s\n ERROR - %s\n", query, result.Error)
} else if !result.PerformanceOK {
fmt.Printf("❌ Query: %s\n Source: %v, Target: %v (%.2fx slower)\n",
query, result.SourceTime, result.TargetTime, float64(result.TargetTime)/float64(result.SourceTime))
} else {
fmt.Printf("✅ Query: %s\n Source: %v, Target: %v (%.2fx ratio)\n",
query, result.SourceTime, result.TargetTime, float64(result.TargetTime)/float64(result.SourceTime))
}
}
// Output JSON report
report := struct {
Timestamp string
TableResults map[string]TestResult
QueryResults map[string]TestResult
TableIssues int
QueryIssues int
OverallSuccess bool
}{
Timestamp: time.Now().Format(time.RFC3339),
TableResults: tableResults,
QueryResults: queryResults,
TableIssues: tableIssues,
QueryIssues: performanceIssues,
OverallSuccess: tableIssues == 0 && performanceIssues == 0,
}
jsonReport, err := json.MarshalIndent(report, "", " ")
if err != nil {
log.Printf("Failed to generate JSON report: %v", err)
} else {
err = os.WriteFile("migration_validation_report.json", jsonReport, 0644)
if err != nil {
log.Printf("Failed to write JSON report: %v", err)
} else {
fmt.Printf("\nDetailed report written to migration_validation_report.json\n")
}
}
}
• Created a comprehensive migration checklist and runbook:
# Database Migration Runbook: On-Premises to AWS RDS
## Pre-Migration Phase
### 1. Assessment and Planning
- [ ] Document current database specifications (version, size, extensions, etc.)
- [ ] Identify all applications and services that connect to the database
- [ ] Map dependencies between database objects and applications
- [ ] Determine migration window and acceptable downtime
- [ ] Select appropriate RDS instance type and storage based on workload
- [ ] Define success criteria for the migration
### 2. Environment Setup
- [ ] Create VPC with appropriate subnets and security groups
- [ ] Set up VPN or Direct Connect between on-premises and AWS
- [ ] Create RDS parameter group with settings matching on-premises configuration
- [ ] Create RDS instance with appropriate settings
- [ ] Configure monitoring and alerting for the RDS instance
- [ ] Set up database users and permissions
### 3. Application Configuration
- [ ] Review and optimize connection pool settings for cloud environment
- [ ] Update connection strings in application configurations
- [ ] Implement connection retry logic with exponential backoff
- [ ] Configure read/write splitting if using read replicas
- [ ] Update firewall rules to allow connections to RDS
### 4. Testing
- [ ] Create test environment with representative data
- [ ] Perform test migration to validate process
- [ ] Run performance tests comparing on-premises and RDS
- [ ] Test application functionality with RDS
- [ ] Simulate failure scenarios and validate recovery procedures
- [ ] Validate backup and restore procedures
## Migration Phase
### 1. Pre-Migration Tasks
- [ ] Freeze schema changes
- [ ] Take full backup of on-premises database
- [ ] Verify backup integrity
- [ ] Reduce TTL for DNS records
- [ ] Scale up RDS instance if needed for migration
- [ ] Notify all stakeholders of migration timeline
### 2. Data Migration
- [ ] Stop write operations to source database
- [ ] Take final backup or use logical replication
- [ ] Restore data to RDS instance
- [ ] Verify data integrity and consistency
- [ ] Create all necessary indexes and constraints
- [ ] Run validation queries to ensure data correctness
### 3. Cutover
- [ ] Stop all application connections to source database
- [ ] Update DNS or connection strings to point to RDS
- [ ] Restart applications with new connection settings
- [ ] Verify application connectivity and functionality
- [ ] Monitor performance and connection counts
## Post-Migration Phase
### 1. Validation
- [ ] Verify all applications are functioning correctly
- [ ] Compare performance metrics between old and new environments
- [ ] Validate that all data has been migrated correctly
- [ ] Check for any errors in application logs
- [ ] Verify that backups are running correctly
### 2. Optimization
- [ ] Tune RDS parameter group settings based on workload
- [ ] Optimize connection pool settings
- [ ] Adjust auto-scaling settings if needed
- [ ] Review and optimize slow queries
- [ ] Implement performance recommendations
### 3. Cleanup
- [ ] Remove temporary resources used during migration
- [ ] Update documentation with new database details
- [ ] Archive old database backups according to retention policy
- [ ] Review and update disaster recovery procedures
- [ ] Document lessons learned from the migration
## Rollback Plan
### Triggers for Rollback
- Critical application functionality not working
- Unacceptable performance degradation
- Data corruption or loss
- Extended connectivity issues
### Rollback Procedure
1. Stop all application connections to RDS
2. Restore applications to use on-premises database
3. Update DNS or connection strings to point back to on-premises
4. Restart applications with original connection settings
5. Verify application functionality with on-premises database
6. Notify all stakeholders of rollback
Lessons Learned:
Database migrations to the cloud require careful planning and optimization of connection handling.
How to Avoid:
Test connection pool settings with realistic load before migration.
Ensure RDS instance is properly sized based on workload characteristics.
Optimize PostgreSQL parameter groups for cloud environment.
Implement proper connection timeout and retry logic in applications.
Create a comprehensive migration runbook with validation steps.
No summary provided
What Happened:
During a planned migration of a business-critical database from on-premises Oracle to AWS Aurora PostgreSQL, unexpected data integrity issues were discovered after the cutover. The team attempted to roll back to the on-premises system, but the rollback procedure failed, resulting in several hours of unplanned downtime.
Diagnosis Steps:
Analyzed migration logs and data validation reports.
Examined rollback procedure documentation and execution logs.
Reviewed database state in both source and target environments.
Checked network connectivity and replication status.
Investigated application connection configurations.
Root Cause:
Multiple issues contributed to the rollback failure:
1. The rollback procedure assumed the source database was in read-only mode during the migration, but it had been modified by a batch job that bypassed the application layer.
2. The database schema on the source system had been altered by an automated database maintenance job during the migration.
3. Network ACLs had been updated as part of the migration, blocking replication traffic back to the source.
4. Backup files needed for rollback were stored in the target environment with no redundancy (a backup-redundancy check is sketched after this list).
5. The rollback procedure had never been fully tested in a production-like environment.
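The fourth point is straightforward to guard against mechanically: before cutover, verify that the rollback backups exist somewhere other than the target environment. A minimal sketch with boto3 and illustrative bucket names follows.
# verify_backup_redundancy.py - confirm rollback backups exist in more than one location (sketch)
import boto3

PRIMARY_BUCKET = "migration-backups-target-account"   # illustrative
SECONDARY_BUCKET = "migration-backups-dr-region"      # illustrative
BACKUP_PREFIX = "oracle-prod/2023-05/"

def list_backup_keys(s3, bucket, prefix):
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

if __name__ == "__main__":
    s3 = boto3.client("s3")
    primary = list_backup_keys(s3, PRIMARY_BUCKET, BACKUP_PREFIX)
    secondary = list_backup_keys(s3, SECONDARY_BUCKET, BACKUP_PREFIX)
    missing = primary - secondary
    if not primary:
        print("FAIL: no backups found in the primary location")
    elif missing:
        print(f"FAIL: {len(missing)} backup objects missing from the secondary location")
    else:
        print(f"OK: {len(primary)} backup objects present in both locations")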
Fix/Workaround:
• Short-term: Implemented an emergency recovery plan:
-- 1. Create a recovery snapshot of the target Aurora cluster (snapshots are taken
--    via the RDS API/CLI rather than SQL; the cluster identifier is illustrative):
--    aws rds create-db-cluster-snapshot --db-cluster-identifier <aurora-cluster-id> --db-cluster-snapshot-identifier emergency-recovery-snapshot
-- 2. Export critical data to flat files for manual recovery
\copy (SELECT * FROM critical_table WHERE last_updated > '2023-05-01') TO '/tmp/critical_data_export.csv' WITH CSV HEADER;
-- 3. Restore source database from the most recent backup
RMAN> restore database;
RMAN> recover database;
RMAN> alter database open resetlogs;
-- 4. Import critical transactions that occurred after the backup
-- Oracle SQL*Loader control file
LOAD DATA
INFILE '/tmp/critical_data_export.csv'
INTO TABLE critical_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(id, name, value, last_updated DATE "YYYY-MM-DD HH24:MI:SS")
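Before reopening the application against the restored Oracle database, the re-imported rows can be cross-checked against the exported CSV. The sketch below assumes the python-oracledb driver and illustrative connection details.
# verify_reimport.py - compare re-imported rows against the exported CSV (sketch)
import csv
import oracledb

CSV_PATH = "/tmp/critical_data_export.csv"
CUTOFF = "2023-05-01"

def csv_row_count(path):
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

def oracle_row_count():
    conn = oracledb.connect(user="dbuser", password="dbpassword",
                            dsn="onprem-db.example.com/ORCLPDB1")  # illustrative DSN
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT COUNT(*) FROM critical_table "
                "WHERE last_updated > TO_DATE(:cutoff, 'YYYY-MM-DD')",
                cutoff=CUTOFF,
            )
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    exported = csv_row_count(CSV_PATH)
    reimported = oracle_row_count()
    status = "OK" if exported == reimported else "MISMATCH"
    print(f"{status}: exported={exported}, re-imported={reimported}")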
• Long-term: Implemented a comprehensive migration framework with proper rollback capabilities:
// migration_framework.go
package main
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"log"
"os"
"time"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/rds"
"github.com/aws/aws-sdk-go-v2/service/s3"
_ "github.com/lib/pq"
_ "github.com/godror/godror"
)
// MigrationConfig represents the configuration for a database migration
type MigrationConfig struct {
MigrationID string `json:"migrationId"`
SourceType string `json:"sourceType"` // "oracle", "mysql", "postgres", etc.
SourceConnection struct {
Host string `json:"host"`
Port int `json:"port"`
Database string `json:"database"`
Username string `json:"username"`
Password string `json:"password"`
SID string `json:"sid,omitempty"` // Oracle specific
} `json:"sourceConnection"`
TargetType string `json:"targetType"` // "aurora-postgresql", "aurora-mysql", "rds-oracle", etc.
TargetConnection struct {
Host string `json:"host"`
Port int `json:"port"`
Database string `json:"database"`
Username string `json:"username"`
Password string `json:"password"`
} `json:"targetConnection"`
AWS struct {
Region string `json:"region"`
S3BackupBucket string `json:"s3BackupBucket"`
RDSClusterID string `json:"rdsClusterId"`
KMSKeyID string `json:"kmsKeyId"`
SecretsManagerID string `json:"secretsManagerId"`
} `json:"aws"`
SchemaOnly bool `json:"schemaOnly"`
IncludeTables []string `json:"includeTables"`
ExcludeTables []string `json:"excludeTables"`
ValidationQueries []struct {
Name string `json:"name"`
SourceQuery string `json:"sourceQuery"`
TargetQuery string `json:"targetQuery"`
Tolerance float64 `json:"tolerance"`
} `json:"validationQueries"`
Hooks struct {
PreMigration []string `json:"preMigration"`
PostMigration []string `json:"postMigration"`
PreCutover []string `json:"preCutover"`
PostCutover []string `json:"postCutover"`
PreRollback []string `json:"preRollback"`
PostRollback []string `json:"postRollback"`
} `json:"hooks"`
Rollback struct {
Strategy string `json:"strategy"` // "backup-restore", "dual-write", "logical-replication"
BackupRetentionDays int `json:"backupRetentionDays"`
TestRollback bool `json:"testRollback"`
} `json:"rollback"`
Monitoring struct {
Enabled bool `json:"enabled"`
SNSTopicARN string `json:"snsTopicArn"`
CloudWatchLogGroup string `json:"cloudWatchLogGroup"`
} `json:"monitoring"`
}
// MigrationState represents the state of a migration
type MigrationState struct {
MigrationID string `json:"migrationId"`
Status string `json:"status"` // "planning", "in-progress", "validating", "cutover", "completed", "failed", "rolled-back"
CurrentPhase string `json:"currentPhase"`
StartTime time.Time `json:"startTime"`
EndTime *time.Time `json:"endTime,omitempty"`
ValidationResults []struct {
Name string `json:"name"`
Passed bool `json:"passed"`
Details string `json:"details"`
} `json:"validationResults"`
BackupInfo struct {
SourceBackupID string `json:"sourceBackupId"`
SourceBackupLocation string `json:"sourceBackupLocation"`
TargetBackupID string `json:"targetBackupId"`
TargetBackupLocation string `json:"targetBackupLocation"`
} `json:"backupInfo"`
RollbackInfo struct {
RollbackTested bool `json:"rollbackTested"`
RollbackTestTime *time.Time `json:"rollbackTestTime,omitempty"`
RollbackTestResult string `json:"rollbackTestResult,omitempty"`
} `json:"rollbackInfo"`
Errors []string `json:"errors"`
}
// DatabaseMigrator handles database migrations
type DatabaseMigrator struct {
config MigrationConfig
state MigrationState
sourceDB *sql.DB
targetDB *sql.DB
awsConfig aws.Config
rdsClient *rds.Client
s3Client *s3.Client
logger *log.Logger
}
// NewDatabaseMigrator creates a new database migrator
func NewDatabaseMigrator(configPath string) (*DatabaseMigrator, error) {
// Read and parse config file
configData, err := os.ReadFile(configPath)
if err != nil {
return nil, fmt.Errorf("failed to read config file: %w", err)
}
var config MigrationConfig
if err := json.Unmarshal(configData, &config); err != nil {
return nil, fmt.Errorf("failed to parse config file: %w", err)
}
// Initialize state
state := MigrationState{
MigrationID: config.MigrationID,
Status: "planning",
CurrentPhase: "initialization",
StartTime: time.Now(),
}
// Set up logging
logger := log.New(os.Stdout, fmt.Sprintf("[Migration %s] ", config.MigrationID), log.LstdFlags)
// Load AWS config
awsConfig, err := awsconfig.LoadDefaultConfig(context.Background(),
awsconfig.WithRegion(config.AWS.Region),
)
if err != nil {
return nil, fmt.Errorf("failed to load AWS config: %w", err)
}
// Create AWS clients
rdsClient := rds.NewFromConfig(awsConfig)
s3Client := s3.NewFromConfig(awsConfig)
return &DatabaseMigrator{
config: config,
state: state,
awsConfig: awsConfig,
rdsClient: rdsClient,
s3Client: s3Client,
logger: logger,
}, nil
}
// Connect connects to source and target databases
func (m *DatabaseMigrator) Connect() error {
var err error
// Connect to source database
switch m.config.SourceType {
case "oracle":
connStr := fmt.Sprintf("user=%s password=%s connectString=%s:%d/%s",
m.config.SourceConnection.Username,
m.config.SourceConnection.Password,
m.config.SourceConnection.Host,
m.config.SourceConnection.Port,
m.config.SourceConnection.SID)
m.sourceDB, err = sql.Open("godror", connStr)
case "postgres":
connStr := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=require",
m.config.SourceConnection.Host,
m.config.SourceConnection.Port,
m.config.SourceConnection.Username,
m.config.SourceConnection.Password,
m.config.SourceConnection.Database)
m.sourceDB, err = sql.Open("postgres", connStr)
default:
return fmt.Errorf("unsupported source database type: %s", m.config.SourceType)
}
if err != nil {
return fmt.Errorf("failed to connect to source database: %w", err)
}
// Test source connection
if err := m.sourceDB.Ping(); err != nil {
return fmt.Errorf("failed to ping source database: %w", err)
}
// Connect to target database
switch m.config.TargetType {
case "aurora-postgresql":
connStr := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=require",
m.config.TargetConnection.Host,
m.config.TargetConnection.Port,
m.config.TargetConnection.Username,
m.config.TargetConnection.Password,
m.config.TargetConnection.Database)
m.targetDB, err = sql.Open("postgres", connStr)
default:
return fmt.Errorf("unsupported target database type: %s", m.config.TargetType)
}
if err != nil {
return fmt.Errorf("failed to connect to target database: %w", err)
}
// Test target connection
if err := m.targetDB.Ping(); err != nil {
return fmt.Errorf("failed to ping target database: %w", err)
}
m.logger.Println("Successfully connected to source and target databases")
return nil
}
// Validate validates the migration configuration and prerequisites
func (m *DatabaseMigrator) Validate() error {
// Check AWS permissions
// Check database permissions
// Validate schema compatibility
// Check network connectivity
// Verify backup capabilities
// Test rollback procedure if configured
// For brevity, implementation details are omitted
return nil
}
// BackupSourceDatabase creates a backup of the source database
func (m *DatabaseMigrator) BackupSourceDatabase() error {
// Implementation depends on database type
// For brevity, implementation details are omitted
return nil
}
// TestRollbackProcedure tests the rollback procedure
func (m *DatabaseMigrator) TestRollbackProcedure() error {
if !m.config.Rollback.TestRollback {
m.logger.Println("Rollback testing is disabled, skipping")
return nil
}
m.logger.Println("Testing rollback procedure...")
// Create a test environment
// Perform a test migration
// Execute rollback procedure
// Validate rollback success
// For brevity, implementation details are omitted
now := time.Now()
m.state.RollbackInfo.RollbackTested = true
m.state.RollbackInfo.RollbackTestTime = &now
m.state.RollbackInfo.RollbackTestResult = "success"
m.logger.Println("Rollback procedure tested successfully")
return nil
}
// Migrate performs the actual migration
func (m *DatabaseMigrator) Migrate() error {
// Implementation depends on database type and migration strategy
// For brevity, implementation details are omitted
return nil
}
// ValidateMigration validates the migration results
func (m *DatabaseMigrator) ValidateMigration() error {
m.logger.Println("Validating migration results...")
for _, validation := range m.config.ValidationQueries {
m.logger.Printf("Running validation: %s", validation.Name)
// Execute source query
var sourceResult float64
err := m.sourceDB.QueryRow(validation.SourceQuery).Scan(&sourceResult)
if err != nil {
return fmt.Errorf("failed to execute source validation query %s: %w", validation.Name, err)
}
// Execute target query
var targetResult float64
err = m.targetDB.QueryRow(validation.TargetQuery).Scan(&targetResult)
if err != nil {
return fmt.Errorf("failed to execute target validation query %s: %w", validation.Name, err)
}
// Compare results
diff := (sourceResult - targetResult) / sourceResult
if diff < 0 {
diff = -diff
}
passed := diff <= validation.Tolerance
m.state.ValidationResults = append(m.state.ValidationResults, struct {
Name string `json:"name"`
Passed bool `json:"passed"`
Details string `json:"details"`
}{
Name: validation.Name,
Passed: passed,
Details: fmt.Sprintf("Source: %.2f, Target: %.2f, Diff: %.2f%%, Tolerance: %.2f%%",
sourceResult, targetResult, diff*100, validation.Tolerance*100),
})
if !passed {
m.logger.Printf("Validation failed: %s - %s", validation.Name,
fmt.Sprintf("Source: %.2f, Target: %.2f, Diff: %.2f%%, Tolerance: %.2f%%",
sourceResult, targetResult, diff*100, validation.Tolerance*100))
return fmt.Errorf("validation failed: %s", validation.Name)
}
m.logger.Printf("Validation passed: %s", validation.Name)
}
m.logger.Println("All validations passed")
return nil
}
// Cutover performs the cutover to the target database
func (m *DatabaseMigrator) Cutover() error {
m.logger.Println("Performing cutover...")
// Execute pre-cutover hooks
for _, hook := range m.config.Hooks.PreCutover {
// Execute hook
// For brevity, implementation details are omitted
_ = hook
}
// Backup target database before cutover
// Update application connection strings
// Verify connectivity
// Execute post-cutover hooks
for _, hook := range m.config.Hooks.PostCutover {
// Execute hook
// For brevity, implementation details are omitted
_ = hook
}
m.logger.Println("Cutover completed successfully")
return nil
}
// Rollback rolls back the migration
func (m *DatabaseMigrator) Rollback() error {
m.logger.Println("Rolling back migration...")
// Execute pre-rollback hooks
for _, hook := range m.config.Hooks.PreRollback {
// Execute hook
// For brevity, implementation details are omitted
_ = hook
}
switch m.config.Rollback.Strategy {
case "backup-restore":
// Restore from backup
// For brevity, implementation details are omitted
case "dual-write":
// Switch back to source database
// For brevity, implementation details are omitted
case "logical-replication":
// Reverse replication direction
// For brevity, implementation details are omitted
default:
return fmt.Errorf("unsupported rollback strategy: %s", m.config.Rollback.Strategy)
}
// Execute post-rollback hooks
for _, hook := range m.config.Hooks.PostRollback {
// Execute hook
// For brevity, implementation details are omitted
_ = hook
}
m.state.Status = "rolled-back"
now := time.Now()
m.state.EndTime = &now
m.logger.Println("Rollback completed successfully")
return nil
}
// SaveState saves the current migration state
func (m *DatabaseMigrator) SaveState() error {
stateData, err := json.MarshalIndent(m.state, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal state: %w", err)
}
stateFile := fmt.Sprintf("migration_%s_state.json", m.config.MigrationID)
if err := os.WriteFile(stateFile, stateData, 0644); err != nil {
return fmt.Errorf("failed to write state file: %w", err)
}
// Also save to S3 if configured
if m.config.AWS.S3BackupBucket != "" {
// For brevity, implementation details are omitted
}
return nil
}
// Run runs the complete migration process
func (m *DatabaseMigrator) Run() error {
defer func() {
if err := m.SaveState(); err != nil {
m.logger.Printf("Failed to save state: %v", err)
}
}()
// Connect to databases
if err := m.Connect(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
return fmt.Errorf("connection failed: %w", err)
}
// Validate configuration and prerequisites
m.state.CurrentPhase = "validation"
if err := m.Validate(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
return fmt.Errorf("validation failed: %w", err)
}
// Backup source database
m.state.CurrentPhase = "backup"
if err := m.BackupSourceDatabase(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
return fmt.Errorf("backup failed: %w", err)
}
// Test rollback procedure
m.state.CurrentPhase = "rollback-testing"
if err := m.TestRollbackProcedure(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
return fmt.Errorf("rollback testing failed: %w", err)
}
// Perform migration
m.state.Status = "in-progress"
m.state.CurrentPhase = "migration"
if err := m.Migrate(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
return fmt.Errorf("migration failed: %w", err)
}
// Validate migration results
m.state.CurrentPhase = "validation"
if err := m.ValidateMigration(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
// Attempt rollback
m.logger.Println("Migration validation failed, attempting rollback")
if rollbackErr := m.Rollback(); rollbackErr != nil {
m.logger.Printf("Rollback failed: %v", rollbackErr)
m.state.Errors = append(m.state.Errors, fmt.Sprintf("rollback failed: %v", rollbackErr))
return fmt.Errorf("migration validation failed and rollback failed: %w, rollback error: %v", err, rollbackErr)
}
return fmt.Errorf("migration validation failed, rolled back: %w", err)
}
// Perform cutover
m.state.CurrentPhase = "cutover"
if err := m.Cutover(); err != nil {
m.state.Status = "failed"
m.state.Errors = append(m.state.Errors, err.Error())
// Attempt rollback
m.logger.Println("Cutover failed, attempting rollback")
if rollbackErr := m.Rollback(); rollbackErr != nil {
m.logger.Printf("Rollback failed: %v", rollbackErr)
m.state.Errors = append(m.state.Errors, fmt.Sprintf("rollback failed: %v", rollbackErr))
return fmt.Errorf("cutover failed and rollback failed: %w, rollback error: %v", err, rollbackErr)
}
return fmt.Errorf("cutover failed, rolled back: %w", err)
}
// Migration completed successfully
m.state.Status = "completed"
now := time.Now()
m.state.EndTime = &now
m.logger.Println("Migration completed successfully")
return nil
}
func main() {
if len(os.Args) < 2 {
log.Fatalf("Usage: %s <config_path>", os.Args[0])
}
configPath := os.Args[1]
migrator, err := NewDatabaseMigrator(configPath)
if err != nil {
log.Fatalf("Failed to create migrator: %v", err)
}
if err := migrator.Run(); err != nil {
log.Fatalf("Migration failed: %v", err)
}
}
• Created a comprehensive migration checklist and runbook:
# database_migration_checklist.yaml
---
migration:
name: "Oracle to Aurora PostgreSQL Migration"
id: "DB-MIG-2023-05"
owner: "Database Migration Team"
pre_migration:
planning:
- task: "Define migration scope and objectives"
owner: "Project Manager"
status: "completed"
- task: "Identify all applications using the database"
owner: "Application Team"
status: "completed"
- task: "Document schema and data dependencies"
owner: "Database Team"
status: "completed"
- task: "Assess data volume and migration time requirements"
owner: "Database Team"
status: "completed"
- task: "Define success criteria and validation methods"
owner: "QA Team"
status: "completed"
technical_preparation:
- task: "Set up AWS infrastructure for target database"
owner: "Cloud Team"
status: "completed"
- task: "Configure network connectivity between source and target"
owner: "Network Team"
status: "completed"
- task: "Set up monitoring for migration process"
owner: "Operations Team"
status: "completed"
- task: "Create migration IAM roles and permissions"
owner: "Security Team"
status: "completed"
- task: "Configure AWS DMS or alternative migration tool"
owner: "Database Team"
status: "completed"
testing:
- task: "Perform test migration with subset of data"
owner: "Database Team"
status: "completed"
- task: "Validate data integrity in test environment"
owner: "QA Team"
status: "completed"
- task: "Test application compatibility with target database"
owner: "Application Team"
status: "completed"
- task: "Measure performance in test environment"
owner: "Performance Team"
status: "completed"
- task: "Test rollback procedure"
owner: "Database Team"
status: "not_started"
rollback_preparation:
- task: "Create full backup of source database"
owner: "Backup Team"
status: "not_started"
- task: "Verify backup restoration process"
owner: "Backup Team"
status: "not_started"
- task: "Document rollback decision criteria"
owner: "Project Manager"
status: "not_started"
- task: "Prepare rollback scripts and procedures"
owner: "Database Team"
status: "not_started"
- task: "Test rollback procedure in isolation"
owner: "Database Team"
status: "not_started"
- task: "Create redundant copies of backups in multiple locations"
owner: "Backup Team"
status: "not_started"
communication:
- task: "Notify all stakeholders of migration schedule"
owner: "Project Manager"
status: "not_started"
- task: "Schedule maintenance window"
owner: "Operations Team"
status: "not_started"
- task: "Prepare customer communications"
owner: "Customer Support"
status: "not_started"
- task: "Conduct pre-migration review meeting"
owner: "Project Manager"
status: "not_started"
migration:
execution:
- task: "Implement read-only mode for source database"
owner: "Database Team"
status: "not_started"
- task: "Verify no writes occurring to source database"
owner: "Database Team"
status: "not_started"
- task: "Take final backup of source database"
owner: "Backup Team"
status: "not_started"
- task: "Start data migration process"
owner: "Database Team"
status: "not_started"
- task: "Monitor migration progress"
owner: "Operations Team"
status: "not_started"
- task: "Validate data integrity in target database"
owner: "QA Team"
status: "not_started"
cutover:
- task: "Stop all application connections to source database"
owner: "Application Team"
status: "not_started"
- task: "Verify no active connections to source database"
owner: "Database Team"
status: "not_started"
- task: "Perform final data synchronization"
owner: "Database Team"
status: "not_started"
- task: "Validate final data integrity"
owner: "QA Team"
status: "not_started"
- task: "Update application connection strings to target database"
owner: "Application Team"
status: "not_started"
- task: "Start applications with new database connections"
owner: "Application Team"
status: "not_started"
- task: "Verify application functionality with target database"
owner: "QA Team"
status: "not_started"
rollback:
decision_criteria:
- "Data integrity issues detected in target database"
- "Application functionality issues with target database"
- "Performance degradation beyond acceptable thresholds"
- "Security or compliance issues with target database"
procedure:
- task: "Make rollback decision based on criteria"
owner: "Project Manager"
status: "not_started"
- task: "Stop all application connections to target database"
owner: "Application Team"
status: "not_started"
- task: "Verify source database is still intact and operational"
owner: "Database Team"
status: "not_started"
- task: "Restore source database from backup if needed"
owner: "Database Team"
status: "not_started"
- task: "Revert application connection strings to source database"
owner: "Application Team"
status: "not_started"
- task: "Start applications with original database connections"
owner: "Application Team"
status: "not_started"
- task: "Verify application functionality with source database"
owner: "QA Team"
status: "not_started"
- task: "Notify all stakeholders of rollback completion"
owner: "Project Manager"
status: "not_started"
post_migration:
verification:
- task: "Perform comprehensive application testing"
owner: "QA Team"
status: "not_started"
- task: "Verify all data has been migrated correctly"
owner: "Database Team"
status: "not_started"
- task: "Monitor application performance with new database"
owner: "Performance Team"
status: "not_started"
- task: "Verify backup and recovery procedures for new database"
owner: "Backup Team"
status: "not_started"
cleanup:
- task: "Decommission migration infrastructure"
owner: "Cloud Team"
status: "not_started"
- task: "Update documentation with new database details"
owner: "Documentation Team"
status: "not_started"
- task: "Archive migration logs and artifacts"
owner: "Operations Team"
status: "not_started"
- task: "Conduct post-migration review meeting"
owner: "Project Manager"
status: "not_started"
Lessons Learned:
Database migrations require comprehensive rollback testing and redundant recovery options.
How to Avoid:
Test rollback procedures in production-like environments.
Maintain multiple backup copies in different locations (see the verification sketch after this list).
Implement change freezes on source systems during migration.
Create detailed checklists for both migration and rollback.
Conduct dry runs of the entire process including rollback.
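For the backup items above, a small pre-cutover check can confirm that the backup artifact actually exists in more than one location before migration proceeds. A minimal sketch using boto3; the bucket names, regions, and the size-equality check are illustrative assumptions, not details from the incident.
#!/usr/bin/env python3
# verify_backup_copies.py - minimal sketch: confirm a backup exists in multiple S3 locations
# Bucket names, regions, and the object key are illustrative assumptions.
import sys
import boto3
from botocore.exceptions import ClientError

BACKUP_LOCATIONS = [
    {"bucket": "prod-db-backups-us-east-1", "region": "us-east-1"},
    {"bucket": "prod-db-backups-us-west-2", "region": "us-west-2"},
]

def backup_size(bucket, region, key):
    """Return the object size in bytes if the backup copy exists, else None."""
    s3 = boto3.client("s3", region_name=region)
    try:
        return s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    except ClientError:
        return None

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: verify_backup_copies.py <backup_object_key>")
    key = sys.argv[1]
    sizes = []
    for loc in BACKUP_LOCATIONS:
        size = backup_size(loc["bucket"], loc["region"], key)
        print(f"{loc['bucket']} ({loc['region']}): "
              f"{size if size is not None else 'MISSING'}")
        sizes.append(size)
    # Require every copy to be present and identically sized before proceeding.
    if any(s is None for s in sizes) or len(set(sizes)) != 1:
        sys.exit("Backup verification failed: copies missing or sizes differ")
    print("All backup copies present and consistent")

if __name__ == "__main__":
    main()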
No summary provided
What Happened:
During a planned migration of a business-critical application from on-premises to AWS, the database migration component failed. The migration involved moving from Oracle to PostgreSQL, with a 6-hour maintenance window allocated for the final data sync and cutover. After 4 hours of the cutover process, the migration tool encountered errors with complex stored procedures and custom data types, leaving the database in an inconsistent state. The team was forced to roll back to the on-premises system, extending the outage beyond the planned maintenance window.
Diagnosis Steps:
Analyzed migration tool logs for specific error patterns.
Examined database schema differences between source and target.
Reviewed data validation reports from pre-migration testing.
Checked network connectivity and throughput during the migration.
Analyzed resource utilization on both source and target databases.
Root Cause:
The investigation revealed multiple issues: 1. Several complex stored procedures used Oracle-specific PL/SQL features without PostgreSQL equivalents 2. Custom data types in the Oracle database weren't properly mapped to PostgreSQL types 3. The migration tool's handling of large BLOB data was inefficient, causing timeouts 4. Pre-migration testing was performed with a subset of data that didn't include edge cases 5. The rollback procedure wasn't fully tested before the migration attempt
Fix/Workaround:
• Short-term: Implemented a revised migration strategy with better testing
• Created a comprehensive database migration framework with proper transaction handling
• Developed custom data type converters for Oracle-specific types (see the sketch after this list)
• Implemented a phased migration approach with incremental validation
• Created a robust rollback procedure with automated verification
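The custom data type converters mentioned above typically start from a mapping table plus special handling for NUMBER precision and scale. A minimal sketch of such a mapper; the mappings shown are common defaults and are illustrative rather than exhaustive, and custom Oracle object types would need dedicated per-type handlers.
#!/usr/bin/env python3
# oracle_type_converter.py - minimal sketch of an Oracle-to-PostgreSQL type mapper
# The mapping below is illustrative, not exhaustive.

# Simple one-to-one mappings for common Oracle scalar types.
TYPE_MAP = {
    "VARCHAR2": "VARCHAR",
    "NVARCHAR2": "VARCHAR",
    "CLOB": "TEXT",
    "NCLOB": "TEXT",
    "BLOB": "BYTEA",
    "RAW": "BYTEA",
    "DATE": "TIMESTAMP(0)",
    "BINARY_FLOAT": "REAL",
    "BINARY_DOUBLE": "DOUBLE PRECISION",
}

def convert_column_type(oracle_type, precision=None, scale=None):
    """Translate an Oracle column type into a PostgreSQL equivalent."""
    oracle_type = oracle_type.upper()
    if oracle_type == "NUMBER":
        # NUMBER without precision becomes NUMERIC; NUMBER(p,0) becomes an integer type.
        if precision is None:
            return "NUMERIC"
        if (scale or 0) == 0:
            return "BIGINT" if precision > 9 else "INTEGER"
        return f"NUMERIC({precision},{scale})"
    if oracle_type in ("VARCHAR2", "NVARCHAR2") and precision:
        return f"VARCHAR({precision})"
    return TYPE_MAP.get(oracle_type, "TEXT")  # fall back to TEXT and flag for manual review

if __name__ == "__main__":
    samples = [("NUMBER", 10, 2), ("NUMBER", 5, 0), ("VARCHAR2", 255, None), ("CLOB", None, None)]
    for name, p, s in samples:
        print(name, p, s, "->", convert_column_type(name, p, s))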
Lessons Learned:
Database migrations between different database engines require thorough testing with production-like data volumes and edge cases.
How to Avoid:
Perform thorough schema compatibility analysis before migration.
Test with production-sized datasets including all edge cases.
Implement and test rollback procedures before migration.
Use a phased migration approach with incremental validation.
Allow for extended maintenance windows when migrating between different database engines.
No summary provided
What Happened:
During a planned weekend migration of a financial services database from on-premises Oracle to AWS RDS PostgreSQL, the process encountered multiple unexpected issues. The migration was scheduled for a 12-hour maintenance window, but ended up taking 20 hours to complete. The extended downtime impacted trading operations when markets opened on Monday, resulting in financial losses and customer complaints. The migration team had to work continuously through the night to resolve issues and complete the migration.
Diagnosis Steps:
Analyzed migration logs to identify bottlenecks and failures.
Reviewed data validation errors and schema conversion issues.
Examined network throughput and data transfer rates.
Assessed the effectiveness of the rollback plan.
Evaluated pre-migration testing and validation procedures.
Root Cause:
The investigation revealed multiple issues with the migration planning and execution: 1. Data volume was significantly underestimated due to incomplete inventory 2. Schema conversion tools missed several complex stored procedures and triggers 3. Network bandwidth between on-premises and AWS was throttled unexpectedly 4. Data validation processes were more time-consuming than anticipated 5. The rollback plan was inadequate for handling partial migration failures
Fix/Workaround:
• Implemented a phased migration approach for remaining databases
• Created a more accurate data inventory and sizing process (see the sizing sketch after this list)
• Improved schema conversion testing with automated validation
• Established dedicated network connectivity with guaranteed bandwidth
• Developed a more robust rollback strategy with clear decision points
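An accurate data inventory usually starts by measuring the source database itself rather than relying on documentation. A minimal sizing sketch, assuming the python-oracledb driver, a read-only account with access to DBA_SEGMENTS, and an assumed usable bandwidth figure; connection details are placeholders.
#!/usr/bin/env python3
# data_inventory.py - minimal sketch of a source-database sizing pass
# Assumes python-oracledb and a read-only account with access to DBA_SEGMENTS;
# the connect string and the bandwidth figure are illustrative.
import oracledb

ORACLE_DSN = "onprem-db.example.com:1521/ORCLPDB1"  # placeholder connect string
EFFECTIVE_BANDWIDTH_GBPS = 1.0  # assumed usable bandwidth to AWS; replace with measured value

def collect_inventory():
    """Return (owner, segment_name, gigabytes) ordered by size, largest first."""
    query = """
        SELECT owner, segment_name, ROUND(SUM(bytes) / 1024 / 1024 / 1024, 2) AS gb
        FROM dba_segments
        WHERE owner NOT IN ('SYS', 'SYSTEM')
        GROUP BY owner, segment_name
        ORDER BY SUM(bytes) DESC
    """
    with oracledb.connect(user="inventory_ro", password="change-me", dsn=ORACLE_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

def main():
    rows = collect_inventory()
    total_gb = sum(r[2] for r in rows)
    # Rough wall-clock estimate: size / bandwidth, with no allowance for validation time.
    transfer_hours = (total_gb * 8) / (EFFECTIVE_BANDWIDTH_GBPS * 3600)
    print(f"Total segment size: {total_gb:.1f} GB across {len(rows)} segments")
    print(f"Estimated raw transfer time at {EFFECTIVE_BANDWIDTH_GBPS} Gbps: {transfer_hours:.1f} h")
    for owner, name, gb in rows[:20]:
        print(f"  {owner}.{name}: {gb} GB")

if __name__ == "__main__":
    main()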
Lessons Learned:
Database migrations require extensive preparation, accurate data sizing, and robust contingency planning.
How to Avoid:
Conduct thorough data profiling and volume assessment before migration.
Test schema conversion tools with representative samples of all database objects.
Implement a phased migration approach for large or critical databases.
Establish clear go/no-go criteria and decision points during migration.
Create a detailed rollback plan with specific triggers and procedures.
No summary provided
What Happened:
A financial services company initiated a project to modernize their legacy trading platform, a monolithic Java application running on-premises. The plan was to containerize the application and migrate it to AWS ECS. After six months of work, the team had made minimal progress. The containerized application was unstable, performed poorly, and lacked critical functionality. The project was significantly behind schedule and over budget, with stakeholders losing confidence in the migration approach.
Diagnosis Steps:
Analyzed the application architecture and dependencies.
Reviewed containerization approach and implementation.
Examined performance metrics and stability issues.
Assessed team skills and migration strategy.
Evaluated project management and governance.
Root Cause:
The investigation revealed multiple issues with the modernization approach: 1. The application had undocumented dependencies on the underlying infrastructure 2. The team attempted a "lift and shift" without addressing architectural issues 3. Database connection management was incompatible with containerized environments 4. The application relied on local filesystem for state management 5. The team lacked experience with both the legacy system and modern cloud technologies
Fix/Workaround:
• Implemented a phased migration approach instead of all-at-once
• Created a detailed application dependency map
• Developed a strangler pattern implementation for gradual modernization (see the routing sketch after this list)
• Established clear success criteria for each migration phase
• Built a hybrid connectivity model for transitional period
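The strangler pattern boils down to a routing decision: requests for capabilities that have already been carved out go to the new service, and everything else stays on the monolith. A minimal sketch of that decision in code; the URLs and path prefixes are hypothetical, and in practice the rule usually lives in a load balancer, API gateway, or reverse proxy rather than in application code.
#!/usr/bin/env python3
# strangler_router.py - minimal sketch of strangler-pattern routing at the edge
# Endpoint URLs and migrated path prefixes are illustrative assumptions.

LEGACY_BASE_URL = "https://trading-legacy.internal.example.com"
MODERN_BASE_URL = "https://trading-api.prod.example.com"

# Capabilities already carved out of the monolith and running on ECS.
MIGRATED_PREFIXES = (
    "/api/quotes",
    "/api/market-data",
)

def route(path):
    """Return the base URL that should serve this request path."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return MODERN_BASE_URL
    return LEGACY_BASE_URL  # everything else stays on the monolith until it is migrated

if __name__ == "__main__":
    for p in ("/api/quotes/AAPL", "/api/orders/123", "/api/market-data/feed"):
        print(f"{p} -> {route(p)}")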
Lessons Learned:
Legacy application modernization requires careful planning and incremental approaches.
How to Avoid:
Conduct thorough application assessment before migration planning.
Implement a phased approach with clear success criteria for each phase.
Address architectural issues before containerization attempts.
Build expertise in both legacy systems and target technologies.
Create a detailed dependency map including infrastructure dependencies.
No summary provided
What Happened:
A financial services company was migrating their core transaction database from an on-premises Oracle system to AWS RDS PostgreSQL as part of a larger cloud migration initiative. The migration strategy involved using AWS Database Migration Service (DMS) for initial data transfer, followed by a planned cutover during a weekend maintenance window. During the cutover, the team encountered unexpected data transformation issues and performance problems that prevented the migration from completing within the allocated downtime window. The team was forced to roll back to the on-premises system after 12 hours of unsuccessful migration attempts, causing significant business impact.
Diagnosis Steps:
Analyzed DMS task logs for error patterns and failure points.
Examined data validation reports between source and target databases.
Reviewed network throughput and latency metrics during the migration.
Checked database performance metrics on both source and target systems.
Analyzed the specific data transformation rules that were causing issues.
Root Cause:
The investigation revealed multiple issues that contributed to the migration failure: 1. Complex data types in the Oracle database were not properly mapped to PostgreSQL equivalents 2. Several stored procedures contained Oracle-specific SQL syntax that wasn't compatible with PostgreSQL 3. The volume of data was significantly larger than what was used in test migrations 4. Network bandwidth between on-premises and AWS was throttled during peak hours 5. The migration validation process was too strict, failing on non-critical discrepancies
Fix/Workaround:
• Implemented a revised migration strategy with the following components:
• Created custom data transformation scripts to handle complex data types
• Rewrote problematic stored procedures to use PostgreSQL-compatible syntax
• Implemented a phased migration approach with smaller data batches
• Established dedicated network connection with reserved bandwidth for migration
• Modified validation rules to allow non-critical discrepancies with post-migration reconciliation (a reconciliation sketch follows the DMS task settings below)
-- Example of Oracle-specific SQL that needed transformation
-- Original Oracle procedure
CREATE OR REPLACE PROCEDURE calculate_interest(
p_account_id IN NUMBER,
p_interest_rate IN NUMBER,
p_result OUT NUMBER
) AS
BEGIN
SELECT balance * p_interest_rate / 100
INTO p_result
FROM accounts
WHERE account_id = p_account_id;
EXCEPTION
WHEN NO_DATA_FOUND THEN
p_result := 0;
END;
/
-- PostgreSQL compatible version
CREATE OR REPLACE FUNCTION calculate_interest(
p_account_id INTEGER,
p_interest_rate NUMERIC
) RETURNS NUMERIC AS $$
DECLARE
v_result NUMERIC;
BEGIN
SELECT balance * p_interest_rate / 100
INTO v_result
FROM accounts
WHERE account_id = p_account_id;
IF v_result IS NULL THEN
RETURN 0;
ELSE
RETURN v_result;
END IF;
END;
$$ LANGUAGE plpgsql;
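Because the Oracle procedure's OUT parameter became a return value in the PostgreSQL version, application call sites change as well. A minimal sketch of invoking the converted function from Python with psycopg2; the connection parameters and sample arguments are placeholders.
#!/usr/bin/env python3
# call_calculate_interest.py - minimal sketch of calling the converted function
# Connection parameters and sample arguments are placeholders.
import psycopg2

conn = psycopg2.connect(host="mydb.amazonaws.com", dbname="mydb",
                        user="dbuser", password="dbpassword")
with conn:
    with conn.cursor() as cur:
        # Oracle: CALL calculate_interest(:id, :rate, :result) with an OUT bind.
        # PostgreSQL: the function simply returns the value.
        cur.execute("SELECT calculate_interest(%s, %s)", (1001, 2.5))
        interest = cur.fetchone()[0]
        print(f"Accrued interest: {interest}")
conn.close()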
# AWS DMS Task Configuration with Improved Settings
# File: improved-dms-task.yaml
SchemaMapping:
rules:
- rule-type: selection
rule-id: 1
rule-name: "Include all tables"
object-locator:
schema-name: "%"
table-name: "%"
rule-action: include
- rule-type: transformation
rule-id: 2
rule-name: "Convert CLOB to TEXT"
rule-action: convert-dtype
rule-target: column
object-locator:
schema-name: "%"
table-name: "%"
column-name: "%"
data-type:
source-type: "clob"
target-type: "text"
- rule-type: transformation
rule-id: 3
rule-name: "Handle special characters in identifiers"
rule-action: rename
rule-target: schema
object-locator:
schema-name: "%"
value: "${schema-name:lowercase}"
TaskSettings:
TargetMetadata:
BatchApplyEnabled: true
BatchApplyPreserveTransaction: true
BatchSize: 10000
FullLoadSettings:
CommitRate: 10000
MaxFullLoadSubTasks: 8
TransactionConsistencyTimeout: 1200
CreatePkAfterFullLoad: true
Logging:
EnableLogging: true
LogComponents:
- SOURCE_UNLOAD
- SOURCE_CAPTURE
- TARGET_LOAD
- TARGET_APPLY
- TASK_MANAGER
CloudWatchLogGroup: "/dms/migration-task-logs"
CloudWatchLogStream: "oracle-to-postgres-migration"
ControlTablesSettings:
ControlSchema: "dms_control"
HistoryTimeslotInMinutes: 5
HistoryTableEnabled: true
SuspendedTablesTableEnabled: true
StatusTableEnabled: true
StreamBufferSettings:
StreamBufferCount: 5
StreamBufferSizeInMB: 256
ChangeProcessingTuning:
BatchApplyPreserveTransaction: true
BatchApplyTimeoutMin: 1
BatchApplyTimeoutMax: 30
BatchApplyMemoryLimit: 500
BatchSplitSize: 0
MinTransactionSize: 1000
CommitTimeout: 1
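The relaxed validation rules only work if they are paired with post-migration reconciliation. A minimal sketch of a row-count reconciliation pass, assuming the python-oracledb and psycopg2 drivers; the table list, credentials, and the 0.1% non-critical threshold are illustrative assumptions.
#!/usr/bin/env python3
# post_migration_reconciliation.py - minimal sketch of row-count reconciliation
# Table list, credentials, and thresholds are illustrative assumptions.
import oracledb
import psycopg2

TABLES = ["accounts", "transactions", "positions"]  # hypothetical table list
NON_CRITICAL_THRESHOLD = 0.001  # tolerate up to 0.1% drift before flagging as critical

def count_rows(cursor, table):
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

def reconcile():
    ora = oracledb.connect(user="readonly", password="change-me",
                           dsn="onprem-db.example.com:1521/ORCLPDB1")
    pg = psycopg2.connect(host="mydb.amazonaws.com", dbname="mydb",
                          user="dbuser", password="dbpassword")
    critical = []
    with ora.cursor() as ora_cur, pg.cursor() as pg_cur:
        for table in TABLES:
            source = count_rows(ora_cur, table)
            target = count_rows(pg_cur, table)
            drift = abs(source - target) / source if source else 0
            status = "OK" if source == target else (
                "NON-CRITICAL" if drift <= NON_CRITICAL_THRESHOLD else "CRITICAL")
            print(f"{table}: source={source} target={target} -> {status}")
            if status == "CRITICAL":
                critical.append(table)
    ora.close()
    pg.close()
    if critical:
        raise SystemExit(f"Reconciliation failed for: {', '.join(critical)}")

if __name__ == "__main__":
    reconcile()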
Lessons Learned:
Database migrations require careful planning, thorough testing with production-like data volumes, and realistic timelines that account for unexpected issues.
How to Avoid:
Conduct thorough schema and data type compatibility analysis before migration.
Test migrations with production-scale data volumes, not just sample datasets.
Develop and test custom transformation scripts for complex data types.
Implement a phased migration approach rather than a single cutover.
Establish dedicated network resources for large-scale data migrations.
Create a comprehensive rollback plan with clear decision criteria.
No summary provided
What Happened:
A manufacturing company was migrating their legacy ERP system from an on-premises data center to AWS. The migration team had completed the initial assessment and created a migration plan based on application documentation and interviews with the development team. However, during the test migration, the application failed to function correctly in the cloud environment. Investigation revealed numerous undocumented dependencies on other systems, including legacy mainframe services, local file shares, and hardcoded network paths. These dependencies were not captured in the initial assessment, requiring a significant rework of the migration plan and architecture.
Diagnosis Steps:
Analyzed application logs to identify connection failures.
Used network monitoring tools to trace actual application connections.
Reviewed application source code for hardcoded dependencies.
Conducted comprehensive dependency mapping using automated tools.
Interviewed additional stakeholders, including operations staff and power users.
Root Cause:
The investigation revealed multiple issues with the dependency assessment: 1. The application documentation was outdated and incomplete 2. The original developers had left the company, taking tribal knowledge with them 3. The application had evolved over 15 years with minimal documentation updates 4. Many dependencies were implemented through undocumented "workarounds" 5. No comprehensive dependency mapping had been performed before migration planning
Fix/Workaround:
• Implemented a comprehensive dependency discovery process
• Used application performance monitoring tools to trace all connections
• Created a complete dependency map including all integrations (see the discovery sketch after this list)
• Revised the migration architecture to accommodate discovered dependencies
• Implemented proxy services for legacy systems that couldn't be migrated
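Connection tracing for the dependency map can be as simple as repeatedly snapshotting established outbound sockets on the application hosts over a full business cycle, so infrequent integrations such as batch jobs and month-end interfaces are also captured. A minimal sketch using psutil; it assumes sufficient privileges to inspect other processes' sockets and complements, rather than replaces, a full APM tool.
#!/usr/bin/env python3
# dependency_snapshot.py - minimal sketch of live outbound-connection discovery
# Assumes psutil and sufficient privileges to inspect other processes' sockets.
from collections import defaultdict
import psutil

def snapshot_dependencies():
    """Group established outbound connections by owning process name."""
    deps = defaultdict(set)
    for conn in psutil.net_connections(kind="inet"):
        if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr or conn.pid is None:
            continue
        try:
            name = psutil.Process(conn.pid).name()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        deps[name].add(f"{conn.raddr.ip}:{conn.raddr.port}")
    return deps

if __name__ == "__main__":
    for process, endpoints in sorted(snapshot_dependencies().items()):
        print(process)
        for endpoint in sorted(endpoints):
            print(f"  -> {endpoint}")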
Lessons Learned:
Thorough dependency mapping is critical for successful cloud migrations, especially for legacy applications with limited documentation.
How to Avoid:
Implement comprehensive dependency mapping early in the migration planning process.
Use automated discovery tools to identify hidden dependencies (a static-scan sketch follows this list).
Involve operations staff and power users in the assessment process, not just developers.
Conduct proof-of-concept migrations early to validate dependency assumptions.
Create a detailed application portfolio with dependency information.
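As a starting point for automated discovery, a static scan of the codebase can surface hardcoded IP addresses, UNC paths, and connection strings before any traffic analysis is done. A minimal sketch; the patterns and file extensions are illustrative, and the output needs manual review because comments and test fixtures will also match.
#!/usr/bin/env python3
# hardcoded_dependency_scan.py - minimal sketch of a static scan for hardcoded endpoints
# Patterns and file extensions are illustrative assumptions.
import re
import sys
from pathlib import Path

PATTERNS = {
    "ip_address": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "unc_path": re.compile(r"\\\\[A-Za-z0-9_.$-]+\\[A-Za-z0-9_.$\\-]+"),
    "jdbc_url": re.compile(r"jdbc:\w+:[^\s\"']+"),
    "internal_host": re.compile(r"\b[a-z0-9-]+\.(?:corp|internal|local)\b"),
}
EXTENSIONS = {".java", ".xml", ".properties", ".sql", ".sh", ".cfg"}

def scan(root):
    """Print every suspicious match as path:line: label: match."""
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in EXTENSIONS or not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label, pattern in PATTERNS.items():
                for match in pattern.findall(line):
                    print(f"{path}:{lineno}: {label}: {match}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")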