# Backup and Disaster Recovery Scenarios

## Scenario 1: Backups That Existed but Could Not Be Restored

No summary provided

What Happened:
A production database became corrupted after a storage failure. The team attempted to restore from backups but discovered that, while the backups existed, they could not be restored successfully.
Diagnosis Steps:
Attempted to restore the latest backup to a new instance.
Encountered errors during restore process about missing binary logs.
Verified backup files in S3 and found only full backups, no transaction logs.
Tested restore process with available backups and found data inconsistency.
Root Cause:
The backup strategy only included daily full backups without binary logs or incremental backups. Additionally, backups were never tested in a restore scenario.
Fix/Workaround:
• Recovered what was possible from the latest full backup.
• Reconstructed missing data from application logs and secondary sources.
• Implemented a proper backup strategy with point-in-time recovery capability (a sketch follows this list).
• Scheduled regular backup restoration tests.
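Because the restore failures stemmed from missing binary logs, point-in-time recovery depends on shipping those logs alongside the full dumps. A minimal sketch of such a job, assuming a MySQL server with binary logging (`log_bin`) enabled; the paths and the `s3://company-db-backups` bucket are illustrative, not taken from the incident:

```bash
#!/bin/bash
# pitr_backup.sh - full dump plus binary log shipping for point-in-time recovery
# Assumptions: MySQL with log_bin enabled, credentials in ~/.my.cnf, and an
# illustrative s3://company-db-backups bucket.
set -euo pipefail

TS=$(date +%Y%m%d_%H%M%S)
BINLOG_DIR="/var/lib/mysql"   # adjust to the server's binlog location

# Full logical dump; --flush-logs rotates to a fresh binary log so the dump is a
# clean starting point, and --master-data=2 records the binlog coordinates in the dump.
mysqldump --all-databases --single-transaction --flush-logs --master-data=2 \
  > "/backups/full_${TS}.sql"
aws s3 cp "/backups/full_${TS}.sql" "s3://company-db-backups/full/"

# Copy the binary logs so any point after the dump can be replayed
# with mysqlbinlog during a restore.
for f in "${BINLOG_DIR}"/mysql-bin.[0-9]*; do
  aws s3 cp "$f" "s3://company-db-backups/binlog/"
done
```

A restore then replays the newest full dump and applies the subsequent binary logs with mysqlbinlog up to the desired point in time.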
Lessons Learned:
Backups are useless if they can't be restored successfully.
How to Avoid:
Implement a comprehensive backup strategy including transaction logs.
Test restore procedures regularly in a realistic environment.
Document and automate the restore process.
Consider multi-tier backup strategies (daily full + hourly incremental); a possible schedule is sketched after this list.
Implement database replication as an additional recovery option.
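One way to express the multi-tier schedule is a plain crontab; the script names and paths here are hypothetical placeholders, not taken from the incident:

```
# Illustrative crontab for a tiered backup schedule (script paths are hypothetical)
# Hourly: ship binary logs (incremental tier)
0 * * * *  /opt/backup/ship_binlogs.sh >> /var/log/backup/hourly.log 2>&1
# Daily at 01:00: full dump (full tier)
0 1 * * *  /opt/backup/pitr_backup.sh >> /var/log/backup/daily.log 2>&1
# Weekly on Sunday at 02:00: restore the latest backup into a scratch instance
0 2 * * 0  /opt/backup/test_restore.sh >> /var/log/backup/restore_test.log 2>&1
```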
## Scenario 2: Restore Test Reveals Missing and Corrupted Data

No summary provided
What Happened:
During a planned disaster recovery test, the team attempted to restore the production database from backups. The restore process completed successfully, but application testing revealed that significant portions of data were corrupted or missing.
Diagnosis Steps:
Examined database restore logs for errors or warnings.
Compared backup file checksums with expected values.
Analyzed the backup process configuration and logs.
Tested restoring from older backup files.
Reviewed recent changes to the backup process.
Root Cause:
The backup process was silently failing to properly capture certain database objects due to a permissions issue. The backup verification step was only checking that the backup file existed and had a non-zero size, but wasn't validating the content or completeness of the backup.
Fix/Workaround:
• Short-term: Restored from the last known good backup and reconstructed recent data from transaction logs.
• Long-term: Implemented comprehensive backup validation:
#!/bin/bash
# validate_pg_backup.sh
set -euo pipefail
BACKUP_FILE="$1"
TEMP_RESTORE_DIR="/tmp/pg_restore_test"
VALIDATION_DB="backup_validation"
# PG_PASSWORD must be exported by the caller
: "${PG_PASSWORD:?PG_PASSWORD must be set}"
echo "Validating backup file: $BACKUP_FILE"
# Create temporary directory for restore
mkdir -p $TEMP_RESTORE_DIR
# Create validation database
PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -c "DROP DATABASE IF EXISTS $VALIDATION_DB;"
PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -c "CREATE DATABASE $VALIDATION_DB;"
# Restore backup to validation database
PGPASSWORD=$PG_PASSWORD pg_restore -h localhost -U postgres -d "$VALIDATION_DB" "$BACKUP_FILE"
# Validate database objects
echo "Checking database objects..."
TABLE_COUNT=$(PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -d $VALIDATION_DB -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Found $TABLE_COUNT tables"
if [ "$TABLE_COUNT" -lt 10 ]; then
echo "ERROR: Expected at least 10 tables, found only $TABLE_COUNT"
exit 1
fi
# Validate row counts in critical tables
echo "Checking row counts in critical tables..."
PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -d $VALIDATION_DB -c "
SELECT table_name,
(SELECT count(*) FROM $VALIDATION_DB.public.\"table_name\") as row_count
FROM information_schema.tables
WHERE table_schema='public'
AND table_name IN ('users', 'accounts', 'transactions');"
# Clean up
PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -c "DROP DATABASE $VALIDATION_DB;"
rm -rf $TEMP_RESTORE_DIR
echo "Backup validation completed successfully"
• Added data integrity checks to the backup process:
-- Create a checksum table to verify data integrity
CREATE TABLE backup_checksums (
table_name text PRIMARY KEY,
row_count bigint,
checksum text,
updated_at timestamp DEFAULT now()
);
-- Function to calculate table checksums
-- (assumes every table has an "id" column for deterministic ordering; tables
--  without one are skipped with a warning by update_all_checksums below)
CREATE OR REPLACE FUNCTION calculate_table_checksum(target_table text)
RETURNS text AS $$
DECLARE
result text;
BEGIN
EXECUTE format('SELECT md5(string_agg(t::text, '''' ORDER BY id)) FROM %I t', target_table) INTO result;
RETURN result;
END;
$$ LANGUAGE plpgsql;
-- Procedure to update all checksums
CREATE OR REPLACE PROCEDURE update_all_checksums()
LANGUAGE plpgsql AS $$
DECLARE
t record;
BEGIN
FOR t IN
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
AND table_type = 'BASE TABLE'
AND table_name NOT IN ('backup_checksums')
LOOP
BEGIN
EXECUTE format('
INSERT INTO backup_checksums (table_name, row_count, checksum)
VALUES (%L, (SELECT count(*) FROM %I), %L)
ON CONFLICT (table_name)
DO UPDATE SET
row_count = (SELECT count(*) FROM %I),
checksum = %L,
updated_at = now()',
t.table_name,
t.table_name,
calculate_table_checksum(t.table_name),
t.table_name,
calculate_table_checksum(t.table_name)
);
EXCEPTION WHEN OTHERS THEN
RAISE WARNING 'Failed to calculate checksum for %: %', t.table_name, SQLERRM;
END;
END LOOP;
END;
$$;
-- Call before backup
CALL update_all_checksums();
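These checksums only pay off if they are compared again after a restore. A minimal verification pass against the restored validation database (connection details mirror the validation script above; the script name is illustrative):

```bash
#!/bin/bash
# verify_restored_checksums.sh - recompute checksums in the restored database and
# compare them with the values recorded before the backup was taken.
# Assumes backup_checksums and calculate_table_checksum() were included in the dump.
set -euo pipefail
VALIDATION_DB="backup_validation"

MISMATCHES=$(PGPASSWORD=$PG_PASSWORD psql -h localhost -U postgres -d "$VALIDATION_DB" -t -A -c "
  SELECT count(*)
  FROM backup_checksums b
  WHERE b.checksum IS DISTINCT FROM calculate_table_checksum(b.table_name);")

if [ "$MISMATCHES" -ne 0 ]; then
  echo "ERROR: $MISMATCHES table(s) fail checksum verification after restore"
  exit 1
fi
echo "All table checksums match the pre-backup values"
```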
Lessons Learned:
Backup verification must include content validation, not just file existence checks.
How to Avoid:
Implement comprehensive backup validation procedures.
Test restores regularly in isolated environments.
Include data integrity checks in the backup process.
Monitor backup success with detailed validation metrics (e.g., publish a metric after each validation run, as sketched after this list).
Document and regularly review the disaster recovery process.
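One lightweight way to get those metrics, assuming CloudWatch is already in place (the namespace, metric name, and dimension are made up for illustration):

```bash
# Publish a pass/fail metric after each validation run; namespace, metric, and dimension are illustrative.
if ./validate_pg_backup.sh "$BACKUP_FILE"; then RESULT=1; else RESULT=0; fi
aws cloudwatch put-metric-data \
  --namespace "Backups" \
  --metric-name "BackupValidationSucceeded" \
  --dimensions Database=production \
  --value "$RESULT"
# A CloudWatch alarm on Minimum < 1 over 24 hours then pages whenever validation fails or stops running.
```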
## Scenario 3: Backups Deleted by a Misconfigured S3 Lifecycle Rule

No summary provided
What Happened:
During a recovery operation, the team discovered that backups older than 30 days were missing, even though the retention policy specified a one-year retention period. This made it impossible to recover to any point in time more than 30 days in the past.
Diagnosis Steps:
Checked S3 bucket lifecycle configuration.
Reviewed CloudTrail logs for deletion events.
Examined Terraform configuration for the S3 bucket.
Compared actual bucket settings with the expected configuration.
Analyzed recent infrastructure changes.
Root Cause:
A recent Terraform change had incorrectly applied a 30-day expiration lifecycle rule to all objects in the bucket, overriding the previous configuration that had different rules for different backup types. The change had passed code review because the reviewer didn't notice the impact on existing lifecycle rules.
Fix/Workaround:
• Short-term: Immediately updated the lifecycle policy to prevent further deletions:
aws s3api put-bucket-lifecycle-configuration \
--bucket backup-bucket \
--lifecycle-configuration '{
"Rules": [
{
"ID": "KeepAllBackups",
"Status": "Enabled",
"Filter": {
"Prefix": ""
},
"Expiration": {
"Days": 365
}
}
]
}'
• Long-term: Implemented proper lifecycle policies with Terraform:
resource "aws_s3_bucket" "backup_bucket" {
bucket = "company-backups"
acl = "private"
versioning {
enabled = true
}
server_side_encryption_configuration {
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
lifecycle_rule {
id = "daily-backups"
enabled = true
prefix = "daily/"
expiration {
days = 30
}
}
lifecycle_rule {
id = "weekly-backups"
enabled = true
prefix = "weekly/"
expiration {
days = 90
}
}
lifecycle_rule {
id = "monthly-backups"
enabled = true
prefix = "monthly/"
expiration {
days = 365
}
}
lifecycle_rule {
id = "yearly-backups"
enabled = true
prefix = "yearly/"
# No expiration for yearly backups
transition {
days = 90
storage_class = "GLACIER"
}
}
}
• Added monitoring and alerting for unexpected deletions:
# CloudWatch alarm on the backup bucket's object count (CloudFormation)
Resources:
  S3DeletionAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: S3BackupDeletionAlarm
      AlarmDescription: Alarm if the backup bucket's object count drops unexpectedly
      MetricName: NumberOfObjects
      Namespace: AWS/S3
      Statistic: Minimum
      Period: 86400 # 1 day; NumberOfObjects is reported once daily
      EvaluationPeriods: 1
      Threshold: 1000 # Alert if the bucket falls below 1000 objects, a sign of mass deletion
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      Dimensions:
        - Name: BucketName
          Value: company-backups
        - Name: StorageType
          Value: AllStorageTypes
      AlarmActions:
        - !Ref SNSTopic
Lessons Learned:
Lifecycle policies require careful management and monitoring to prevent data loss.
How to Avoid:
Implement strict change management for backup configurations.
Use infrastructure as code with detailed comments for critical settings.
Add automated testing for backup retention policies (see the sketch after this list).
Implement monitoring for unexpected backup deletions.
Create separate buckets for different retention requirements.
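A retention-policy test can be as simple as asserting on the live lifecycle configuration after every apply; the bucket name and expected values below are illustrative and would need to match the Terraform rules above:

```bash
#!/bin/bash
# check_retention.sh - fail CI if the backup bucket's lifecycle rules drift
# Bucket name and expected retention values are illustrative.
set -euo pipefail
BUCKET="company-backups"

RULES=$(aws s3api get-bucket-lifecycle-configuration --bucket "$BUCKET")

# Assert that yearly backups have no expiration rule at all
if echo "$RULES" | jq -e '.Rules[] | select(.ID == "yearly-backups") | .Expiration' > /dev/null; then
  echo "ERROR: yearly-backups rule unexpectedly has an expiration"; exit 1
fi

# Assert that monthly backups are kept for 365 days
MONTHLY=$(echo "$RULES" | jq -r '.Rules[] | select(.ID == "monthly-backups") | .Expiration.Days')
if [ "$MONTHLY" != "365" ]; then
  echo "ERROR: monthly-backups expiration is $MONTHLY days, expected 365"; exit 1
fi

echo "Lifecycle retention rules match expectations"
```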
## Scenario 4: Weeks of Silently Corrupted Database Backups

No summary provided
What Happened:
During a planned database migration, a company attempted to restore their PostgreSQL database from the previous night's backup. The restoration process failed with corruption errors. Further investigation revealed that while the backup process had been running successfully and reporting completion, the backup files themselves were corrupted and unusable. This issue had been ongoing for several weeks, meaning the company had no valid recent backups of their production database.
Diagnosis Steps:
Analyzed backup logs and error messages.
Attempted to restore backups from different dates.
Examined the backup process and validation steps.
Reviewed changes to the backup system and database.
Tested the backup process in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the backup process:
1. The backup validation step was only checking file existence, not content integrity.
2. A recent database configuration change had affected the backup format.
3. The backup process was not properly handling transaction logs.
4. Monitoring was only checking process completion, not backup validity.
5. Test restores were not part of the regular backup validation process.
Fix/Workaround:
• Implemented immediate fixes to create valid backups
• Developed comprehensive backup validation including test restores
• Created proper monitoring for backup integrity (a sketch of such a check follows this list)
• Established regular backup restoration testing
• Implemented dual backup strategies with different tools
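A cheap integrity check that goes beyond "the file exists" is to ask pg_restore to read the archive's table of contents and to sanity-check the file size against the previous run. This sketch assumes custom-format (`pg_dump -Fc`) backups; the paths and the shrinkage threshold are illustrative:

```bash
#!/bin/bash
# backup_integrity_check.sh - verify a pg_dump custom-format archive is readable
# and has not shrunk drastically; paths and thresholds are illustrative.
set -euo pipefail
BACKUP_FILE="$1"
PREVIOUS_FILE="$2"

# pg_restore --list parses the archive's table of contents; it fails fast on a
# truncated or corrupted dump without needing a database to restore into.
if ! pg_restore --list "$BACKUP_FILE" > /dev/null; then
  echo "ERROR: $BACKUP_FILE is not a readable pg_dump archive"
  exit 1
fi

# Flag backups that are less than half the size of the previous one.
CURR=$(stat -c%s "$BACKUP_FILE")
PREV=$(stat -c%s "$PREVIOUS_FILE")
if [ "$CURR" -lt $((PREV / 2)) ]; then
  echo "ERROR: backup shrank from $PREV to $CURR bytes; investigate before trusting it"
  exit 1
fi

echo "Backup passed integrity and size checks"
```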
Lessons Learned:
Backup processes require comprehensive validation beyond simple completion checks.
How to Avoid:
Implement regular test restores as part of the backup validation process.
Create comprehensive integrity checks for backup files.
Monitor backup content and size, not just process completion.
Establish dual backup strategies with different tools and locations.
Regularly audit and test the entire backup and recovery process.