Disaster Recovery Runbook

Overview

This runbook provides step-by-step procedures for recovering the Internet-ID database in various disaster scenarios.

Recovery Time Objective (RTO): 4 hours (maximum tolerable downtime before service is restored)
Recovery Point Objective (RPO): 1 hour (maximum tolerable window of data loss)

Backup Strategy

Automated Backups

  1. Full Backups: Daily at 2:00 AM (scheduled via cron; see the crontab sketch below)

    • Uses pg_dump for logical backup
    • Compressed with gzip
    • Retained for 30 days
    • Stored locally and in S3 (encrypted)
  2. Incremental Backups: Hourly via WAL archiving

    • PostgreSQL WAL files archived automatically
    • Enables point-in-time recovery (PITR)
    • Retained for 30 days
  3. Backup Verification: Every 6 hours

    • Integrity checks on recent backups
    • Storage usage monitoring
    • Automated alerts on failures
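
A representative crontab for the postgres user, assuming the scripts live under /opt/internet-id/ops/backup as elsewhere in this runbook (WAL archiving is continuous and needs no cron entry):

    # m h dom mon dow  command
    # Daily full backup at 2:00 AM
    0 2 * * *   /opt/internet-id/ops/backup/backup-database.sh full >> /var/lib/postgresql/backups/backup.log 2>&1
    # Backup verification every 6 hours
    0 */6 * * * /opt/internet-id/ops/backup/verify-backup.sh >> /var/lib/postgresql/backups/verify.log 2>&1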

Storage Locations

  • Primary: Local volume /var/lib/postgresql/backups
  • Secondary: AWS S3 bucket (or compatible storage) in separate region
  • Encryption: AES256 server-side encryption on S3
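
The WAL archive that feeds point-in-time recovery is produced by PostgreSQL itself rather than by cron. A minimal postgresql.conf sketch, assuming the archive directory sits under the primary backup volume above (values are illustrative):

    # Continuous WAL archiving (postgresql.conf)
    wal_level = replica
    archive_mode = on
    archive_timeout = 3600   # force a segment switch at least hourly, matching the 1-hour RPO
    archive_command = 'test ! -f /var/lib/postgresql/backups/wal_archive/%f && cp %p /var/lib/postgresql/backups/wal_archive/%f'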

Disaster Scenarios

Scenario 1: Accidental Data Deletion or Corruption

Detection:

  • User reports missing or incorrect data
  • Application errors indicating data inconsistency

Recovery Steps:

  1. Assess the damage:

    # Connect to database
    psql -h localhost -U internetid -d internetid
    
    # Check affected tables
    SELECT * FROM <affected_table> WHERE <conditions>;
    
  2. Determine recovery target time:

    • Identify when corruption occurred
    • Select timestamp just before the incident
  3. Perform point-in-time recovery:

    cd /opt/internet-id/ops/restore
    
    # Set recovery target time (format: YYYY-MM-DD HH:MM:SS)
    export RESTORE_TARGET_TIME="2025-10-24 18:30:00"
    
    # Run PITR
    sudo -u postgres ./restore-database.sh pitr
    
  4. Follow manual PITR steps (output by the script; see the sketch after this list):

    • Stop PostgreSQL
    • Clear the data directory
    • Extract the base backup
    • Set recovery parameters in postgresql.conf and create recovery.signal (recovery.conf was removed in PostgreSQL 12)
    • Copy WAL files
    • Start PostgreSQL
  5. Verify recovery:

    # Check restored data
    psql -h localhost -U internetid -d internetid \
      -c "SELECT COUNT(*) FROM <affected_table>;"
    

Estimated Recovery Time: 30-60 minutes


Scenario 2: Complete Database Loss

Detection:

  • Database server failure
  • Data directory corruption or disk failure
  • PostgreSQL won't start

Recovery Steps:

  1. Prepare new database server (if hardware failure):

    # Install PostgreSQL 16
    sudo apt-get update
    sudo apt-get install postgresql-16
    
    # Or use Docker Compose
    cd /opt/internet-id
    docker compose up -d db
    
  2. Restore from latest full backup:

    cd /opt/internet-id/ops/restore
    
    # Use default (latest) backup
    sudo -u postgres ./restore-database.sh full
    
    # Or specify a backup file
    export BACKUP_FILE=/var/lib/postgresql/backups/full/backup_20251024_020000.dump.gz
    sudo -u postgres ./restore-database.sh full
    
  3. Verify database integrity:

    # Check table counts
    psql -h localhost -U internetid -d internetid \
      -c "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables;"
    
    # Test application connectivity
    cd /opt/internet-id
    npm run start:api
    
  4. Update application configuration if needed:

    # Update DATABASE_URL in .env if hostname changed
    DATABASE_URL="postgresql://internetid:internetid@new-host:5432/internetid"
    

Estimated Recovery Time: 1-2 hours


Scenario 3: Partial Table Recovery

Detection:

  • Specific table(s) corrupted or dropped accidentally
  • Other tables remain intact

Recovery Steps:

  1. Identify affected tables:

    # List tables in database
    psql -h localhost -U internetid -d internetid -c "\dt"
    
  2. Restore specific tables:

    cd /opt/internet-id/ops/restore
    
    # Set tables to restore (comma-separated)
    export RESTORE_TABLES="Content,PlatformBinding,Verification"
    
    # Run partial restore
    sudo -u postgres ./restore-database.sh partial
    
  3. Verify restored tables:

    psql -h localhost -U internetid -d internetid \
      -c 'SELECT COUNT(*) FROM "Content";'
    

Estimated Recovery Time: 15-30 minutes


Scenario 4: Region-Wide Outage

Detection:

  • Primary AWS region unavailable
  • Cannot access primary database or backups

Recovery Steps:

  1. Activate disaster recovery site in secondary region:

    # Download backups from S3 in secondary region
    aws s3 sync s3://internet-id-backup-secondary/full/ \
      /var/lib/postgresql/backups/full/ \
      --region us-west-2
    
  2. Deploy database in secondary region:

    # Use infrastructure as code (Terraform/CloudFormation)
    # Or manual deployment with Docker Compose
    cd /opt/internet-id
    docker compose up -d db
    
  3. Restore from S3 backup:

    cd /opt/internet-id/ops/restore
    
    # Set backup file location
    export BACKUP_FILE=/var/lib/postgresql/backups/full/backup_latest.dump.gz
    sudo -u postgres ./restore-database.sh full
    
  4. Update DNS and load balancer:

    • Point application to new database endpoint
    • Update DATABASE_URL in application configuration
    • Verify application functionality
  5. Communicate with users:

    • Post status update on status page
    • Notify users via email/social media
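
Before restoring (step 3), confirm what is actually present in the secondary bucket; the newest backup there bounds the effective RPO during a regional failover:

    # List the most recent full backups in the secondary bucket
    aws s3 ls s3://internet-id-backup-secondary/full/ --region us-west-2 \
      | sort | tail -n 5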

Estimated Recovery Time: 2-4 hours


Scenario 5: Backup Verification Failure

Detection:

  • Automated backup verification alerts
  • Backup integrity check fails

Recovery Steps:

  1. Investigate backup failure:

    # Check verification logs
    tail -100 /var/lib/postgresql/backups/verify.log
    
    # Manually verify latest backup
    cd /opt/internet-id/ops/backup
    ./verify-backup.sh
    
  2. Test backup restoration:

    # Create test database
    psql -h localhost -U internetid -d postgres \
      -c "CREATE DATABASE test_restore;"
    
    # Attempt restore to test database
    export POSTGRES_DB=test_restore
    cd /opt/internet-id/ops/restore
    ./restore-database.sh full
    
    # Drop test database after verification
    psql -h localhost -U internetid -d postgres \
      -c "DROP DATABASE test_restore;"
    
  3. If backup is corrupted, trigger immediate full backup:

    cd /opt/internet-id/ops/backup
    sudo -u postgres ./backup-database.sh full
    
  4. Investigate root cause:

    • Check disk space
    • Review backup logs
    • Verify PostgreSQL is running correctly
    • Check S3 credentials and connectivity
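
A quick first pass over these checks (the primary bucket name is an assumption; substitute your own):

    # Disk space on the backup volume
    df -h /var/lib/postgresql/backups

    # Is PostgreSQL accepting connections?
    pg_isready -h localhost -p 5432

    # Recent backup log entries
    tail -20 /var/lib/postgresql/backups/backup.log

    # S3 credentials and connectivity (bucket name is illustrative)
    aws s3 ls s3://internet-id-backup/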

Estimated Recovery Time: 30-60 minutes


Pre-Requisites Checklist

Before disaster strikes, ensure:

  • Backup scripts are installed and have correct permissions
  • Cron jobs are configured and running
  • The postgres system user can execute the backup and restore scripts
  • Backup directory has sufficient space (monitor usage)
  • S3 bucket is configured with correct permissions
  • AWS credentials are configured (for S3 backups)
  • Alert email is configured in backup scripts
  • Team has access to this runbook
  • Quarterly DR drills are scheduled
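
A quick readiness check covering several items above, using the paths established elsewhere in this runbook:

    # Scripts present and executable
    ls -l /opt/internet-id/ops/backup/*.sh /opt/internet-id/ops/restore/*.sh

    # Cron entries installed for the postgres user
    sudo crontab -u postgres -l | grep -E 'backup|verify'

    # Backup volume headroom
    df -h /var/lib/postgresql/backups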

Monitoring and Alerts

Key Metrics to Monitor

  1. Backup Success Rate

    • Alert if backup fails 2 consecutive times
    • Check: /var/lib/postgresql/backups/backup.log
  2. Backup Age

    • Alert if latest backup is > 26 hours old
    • Check via: ./verify-backup.sh
  3. Storage Usage

    • Alert if > 85% disk usage
    • Check: df -h /var/lib/postgresql/backups
  4. WAL Archiving

    • Alert if WAL files not being archived
    • Check: Count of files in wal_archive/ directory
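
For the WAL-archiving check, a one-liner like the following can back the alert (the archive path assumes the storage layout above):

    # Alert if no WAL segment has been archived in the last hour (the 1-hour RPO)
    find /var/lib/postgresql/backups/wal_archive -type f -mmin -60 | grep -q . \
      || echo "ALERT: no WAL archived in the last 60 minutes"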

Alert Configuration

Configure monitoring system (e.g., Prometheus, CloudWatch) with:

# Example Prometheus alert rules
groups:
  - name: backup_alerts
    rules:
      - alert: BackupTooOld
        expr: time() - backup_last_success_timestamp > 93600
        for: 1h
        annotations:
          summary: "Database backup is too old"

      - alert: BackupFailed
        expr: backup_failure_count >= 2
        annotations:
          summary: "Multiple backup failures detected"

      - alert: StorageAlmostFull
        expr: backup_storage_usage_percent > 85
        annotations:
          summary: "Backup storage usage high"

Testing and Validation

Quarterly DR Drill Procedure

Schedule: First Sunday of each quarter at 10:00 AM

  1. Week before drill:

    • Notify all team members
    • Review and update this runbook
    • Verify backup monitoring is working
  2. Drill day:

    • Select a disaster scenario (rotate each quarter)
    • Follow runbook procedures
    • Document time taken for each step
    • Note any issues or deviations
  3. Week after drill:

    • Conduct post-drill review meeting
    • Update runbook based on lessons learned
    • Fix any identified issues
    • Update RTO/RPO if needed

Test Restore Procedure (Monthly)

Run this test monthly to verify backup integrity:

#!/bin/bash
# Monthly test restore procedure
set -euo pipefail

TEST_DB="test_restore_$(date +%Y%m)"

# 1. Create test database
psql -h localhost -U internetid -d postgres \
  -c "CREATE DATABASE $TEST_DB;"

# 2. Restore latest backup into the test database
export POSTGRES_DB="$TEST_DB"
cd /opt/internet-id/ops/restore
./restore-database.sh full

# 3. Verify data (mixed-case table names must be double-quoted in PostgreSQL)
psql -h localhost -U internetid -d "$TEST_DB" \
  -c 'SELECT COUNT(*) FROM "Content";' \
  -c 'SELECT COUNT(*) FROM "User";' \
  -c 'SELECT COUNT(*) FROM "PlatformBinding";'

# 4. Cleanup
psql -h localhost -U internetid -d postgres \
  -c "DROP DATABASE $TEST_DB;"

echo "Test restore completed successfully"

Contact Information

On-Call Engineer: [Contact details]
Database Team Lead: [Contact details]
Infrastructure Team: [Contact details]
Escalation: [Contact details]

Revision History

Date         Version   Changes                              Author
2025-10-24   1.0       Initial disaster recovery runbook    GitHub Copilot

Last Updated: 2025-10-24
Next Review Date: 2026-01-24