
Disaster Recovery Runbook

Overview

This document provides detailed procedures for recovering from various disaster scenarios. The procedures are designed to meet a Recovery Time Objective (RTO) of under 4 hours and a Recovery Point Objective (RPO) of under 1 hour for most scenarios; per-scenario targets appear in the Quick Reference below.

Quick Reference

Scenario                          RTO       RPO      Primary Contact
Database Corruption               2 hours   1 hour   Database Admin
Complete Infrastructure Failure   4 hours   1 hour   DevOps Lead
Regional Outage                   6 hours   1 hour   Cloud Architect
Ransomware Attack                 3 hours   1 hour   Security Team

Prerequisites

Required Access

  • Database credentials (DB_PASSWORD)
  • AWS CLI configured with appropriate permissions
  • S3 bucket access (spywatcher-backups)
  • GPG keys for backup decryption
  • SSH access to production servers
  • Admin access to cloud provider console

Required Tools

  • PostgreSQL client tools (psql, pg_restore)
  • AWS CLI
  • GPG/OpenSSL
  • Docker (if using containerized deployments)
  • kubectl (if using Kubernetes)
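
A quick preflight sketch to confirm the tools above are on the PATH before starting any recovery (docker and kubectl only matter for the deployment style in use):

# Preflight: flag any missing tools
for tool in psql pg_restore aws gpg openssl docker kubectl; do
  command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
done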

Backup Strategy

Automated Backups

Our backup strategy includes:

  1. Full Database Backups (Daily at 2 AM UTC; a pipeline sketch follows this list)

    • Compressed with gzip
    • Encrypted with GPG
    • Stored in primary and secondary S3 buckets
    • Retention: 30 daily backups, plus monthly snapshots kept for 12 months
  2. Incremental Backups (Every 6 hours)

    • WAL archiving for point-in-time recovery
    • Stored in S3
    • Retention: 7 days
  3. Configuration Backups (On change)

    • Environment variables
    • SSL certificates
    • Application configuration files
    • Infrastructure as Code (Terraform/CloudFormation)
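
A minimal sketch of the full-backup pipeline from item 1 above. The bucket paths and local directory come from this runbook; the GPG recipient key and .gpg output extension are assumptions (the production implementation is $PROJECT_ROOT/scripts/backup.sh):

# Dump, compress, and encrypt in a single stream
# (the GPG recipient below is an assumed key name)
STAMP=$(date -u +%Y%m%d_%H%M%S)
FILE="spywatcher_full_${STAMP}.dump.gz.gpg"
pg_dump -h "$DB_HOST" -U spywatcher -Fc spywatcher \
  | gzip \
  | gpg --encrypt --recipient backups@spywatcher.com \
  > "/var/backups/spywatcher/${FILE}"

# Replicate to the primary and secondary buckets
aws s3 cp "/var/backups/spywatcher/${FILE}" "s3://spywatcher-backups/postgres/full/${FILE}"
aws s3 cp "/var/backups/spywatcher/${FILE}" "s3://spywatcher-backups-us-west/postgres/full/${FILE}"

# Enforce the 7-day local retention
find /var/backups/spywatcher -name 'spywatcher_full_*' -mtime +7 -delete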

Backup Locations

  • Primary: s3://spywatcher-backups/postgres/full/
  • Secondary: s3://spywatcher-backups-us-west/postgres/full/
  • WAL Archives: s3://spywatcher-backups/wal/
  • Local: /var/backups/spywatcher/ (7-day retention)
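
To spot-check that the primary and secondary buckets are in sync, compare the newest full backup in each (aws s3 ls output begins with the upload timestamp, so a plain sort is chronological):

# Newest full backup in each bucket should match
aws s3 ls s3://spywatcher-backups/postgres/full/ | sort | tail -1
aws s3 ls s3://spywatcher-backups-us-west/postgres/full/ | sort | tail -1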

Recovery Procedures

Scenario 1: Database Corruption

Symptoms:

  • Data inconsistencies
  • Query errors
  • Failed integrity checks
  • Corrupted indexes

Recovery Steps:

  1. Assess the Damage (10 minutes)

    # Connect to the database for interactive inspection
    psql -h $DB_HOST -U spywatcher -d spywatcher
    
    # Check for errors in the PostgreSQL logs
    tail -100 /var/log/postgresql/postgresql-15-main.log
    
    # Review per-database statistics for anomalies
    # (a deeper amcheck sketch follows this scenario's steps)
    psql -h $DB_HOST -U spywatcher -d spywatcher -c \
      "SELECT * FROM pg_stat_database WHERE datname = 'spywatcher';"
    
  2. Stop the Application (5 minutes)

    # If using Kubernetes
    kubectl scale deployment spywatcher-backend --replicas=0
    
    # If using Docker Compose
    docker-compose stop backend
    
    # If using systemd
    sudo systemctl stop spywatcher-backend
    
  3. Identify Last Known Good Backup (5 minutes)

    # List recent backups
    aws s3 ls s3://spywatcher-backups/postgres/full/ --recursive | sort -r | head -10
    
    # Check backup logs
    cd $PROJECT_ROOT/backend
    npm run db:backup-logs
    
  4. Restore Database (60 minutes)

    # Download and restore the backup
    cd $PROJECT_ROOT/scripts
    
    # Set environment variables
    export DB_NAME="spywatcher"
    export DB_USER="spywatcher"
    export DB_PASSWORD="your_password"
    export DB_HOST="localhost"
    export S3_BUCKET="spywatcher-backups"
    
    # Run restore
    ./restore.sh s3://spywatcher-backups/postgres/full/spywatcher_full_20240125_120000.dump.gz
    
  5. Verify Data Integrity (15 minutes)

    # Run data integrity checks
    psql -h $DB_HOST -U spywatcher -d spywatcher -c "
    SELECT
      (SELECT COUNT(*) FROM \"User\") as users,
      (SELECT COUNT(*) FROM \"Guild\") as guilds,
      (SELECT COUNT(*) FROM \"ApiKey\") as api_keys;
    "
    
    # Check for critical records
    psql -h $DB_HOST -U spywatcher -d spywatcher -c "
    SELECT * FROM \"User\" WHERE role = 'ADMIN' LIMIT 5;
    "
    
  6. Restart Application (15 minutes)

    # If using Kubernetes
    kubectl scale deployment spywatcher-backend --replicas=3
    
    # If using Docker Compose
    docker-compose up -d backend
    
    # If using systemd
    sudo systemctl start spywatcher-backend
    
  7. Monitor for Errors (20 minutes)

    # Watch application logs
    kubectl logs -f deployment/spywatcher-backend
    
    # Or with Docker
    docker-compose logs -f backend
    
    # Check health endpoint
    curl https://api.spywatcher.com/health
    
  8. Post-Recovery Verification (10 minutes)

    • Test critical API endpoints
    • Verify user logins
    • Check data consistency
    • Monitor error rates in Sentry
    • Verify Discord bot connectivity

Total RTO: ~2 hours
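
Where corrupted indexes are suspected (see Symptoms above), PostgreSQL's amcheck extension provides a deeper verification than the row counts in step 5. A sketch, assuming the extension is available and a superuser connection:

# Enable amcheck and verify every btree index in the database
psql -h $DB_HOST -U postgres -d spywatcher -c \
  "CREATE EXTENSION IF NOT EXISTS amcheck;"
psql -h $DB_HOST -U postgres -d spywatcher -c \
  "SELECT c.relname, bt_index_check(c.oid)
   FROM pg_index i
   JOIN pg_class c ON c.oid = i.indexrelid
   JOIN pg_am a ON a.oid = c.relam
   WHERE a.amname = 'btree';"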

Scenario 2: Complete Infrastructure Failure

Symptoms:

  • All services down
  • Cannot access servers
  • Cloud provider outage
  • Hardware failure

Recovery Steps:

  1. Assess Infrastructure Status (15 minutes)

    • Check cloud provider status page
    • Verify network connectivity
    • Identify affected resources
    • Contact cloud support if needed
  2. Activate Disaster Recovery Site (30 minutes)

    # If using Terraform
    cd infrastructure/
    
    # Initialize Terraform with DR workspace
    terraform workspace select disaster-recovery
    
    # Review planned changes
    terraform plan -out=dr.tfplan
    
    # Apply infrastructure
    terraform apply dr.tfplan
    
  3. Restore Database in New Environment (90 minutes)

    # Set new environment variables
    export DB_HOST="new-db-host.region.rds.amazonaws.com"
    export S3_BUCKET="spywatcher-backups"
    
    # Restore from secondary backup location
    cd $PROJECT_ROOT/scripts
    ./restore.sh s3://spywatcher-backups-us-west/postgres/full/latest.dump.gz
    
  4. Deploy Application Containers (45 minutes)

    # If using Kubernetes
    kubectl config use-context disaster-recovery
    
    # Apply Kubernetes manifests
    kubectl apply -f k8s/namespace.yaml
    kubectl apply -f k8s/secrets.yaml
    kubectl apply -f k8s/configmaps.yaml
    kubectl apply -f k8s/deployments.yaml
    kubectl apply -f k8s/services.yaml
    kubectl apply -f k8s/ingress.yaml
    
    # If using Docker Compose
    docker-compose -f docker-compose.prod.yml up -d
    
  5. Update DNS Records (15 minutes)

    # Update DNS to point to new infrastructure
    # This depends on your DNS provider
    # Example with AWS Route53 (a sketch of dns-update.json follows these steps):
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z1234567890ABC \
      --change-batch file://dns-update.json
    
  6. Run Smoke Tests (20 minutes)

    # Test critical endpoints
    curl https://api.spywatcher.com/health
    curl https://api.spywatcher.com/api/status
    
    # Test authentication
    curl -X POST https://api.spywatcher.com/api/auth/login \
      -H "Content-Type: application/json" \
      -d '{"username": "test", "password": "test"}'
    
    # Test Discord bot
    # (Check bot status in Discord server)
    
  7. Monitor System Health (20 minutes)

    • Check all services are running
    • Verify database connections
    • Monitor error rates
    • Check Discord bot presence
    • Verify frontend accessibility
  8. Notify Stakeholders

    • Update status page
    • Send notification to users
    • Post in Discord/Slack channels
    • Document incident for post-mortem

Total RTO: ~4 hours
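
The dns-update.json change batch referenced in step 5 takes the shape below; the record value and TTL are placeholders for your environment:

# Sketch of the change batch consumed by the Route53 command in step 5
cat > dns-update.json <<'EOF'
{
  "Comment": "Fail over api.spywatcher.com to DR infrastructure",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.spywatcher.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "dr-lb.example.amazonaws.com" }]
    }
  }]
}
EOF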

Scenario 3: Regional Outage

Symptoms:

  • Primary region unavailable
  • High latency to primary services
  • Cloud provider regional outage

Recovery Steps:

  1. Confirm Regional Outage (10 minutes)

    • Check cloud provider status page
    • Verify other regions are operational
    • Assess blast radius
  2. Activate Secondary Region (30 minutes)

    # Switch to secondary region infrastructure
    cd infrastructure/
    terraform workspace select us-west-2
    terraform apply
    
  3. Restore Database in Secondary Region (90 minutes)

    # Use secondary backup location
    export DB_HOST="secondary-db.us-west-2.rds.amazonaws.com"
    export S3_BUCKET="spywatcher-backups-us-west"
    
    cd $PROJECT_ROOT/scripts
    ./restore.sh s3://spywatcher-backups-us-west/postgres/full/latest.dump.gz
    
  4. Deploy to Secondary Region (60 minutes)

    # Deploy application to secondary region
    kubectl config use-context us-west-2
    kubectl apply -f k8s/
    
    # Wait for pods to be ready
    kubectl wait --for=condition=ready pod -l app=spywatcher-backend --timeout=300s
    
  5. Update Global DNS (30 minutes)

    # Update DNS to point to secondary region
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z1234567890ABC \
      --change-batch file://failover-to-west.json
    
    # Verify DNS propagation
    dig api.spywatcher.com +short
    
  6. Monitor Service Restoration (20 minutes)

    • Verify all services are healthy
    • Check database replication lag, if applicable (see the sketch after these steps)
    • Monitor error rates
    • Verify user access
  7. Plan for Failback (When primary region recovers)

    • Schedule maintenance window
    • Reverse failover procedure
    • Update DNS back to primary region
    • Run full system tests

Total RTO: ~6 hours
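
For the replication-lag check in step 6, a one-liner that works on a streaming-replication standby (an assumption; it is not meaningful if the secondary was restored from backup rather than replicated):

# Approximate replication lag, measured on the standby
psql -h $DB_HOST -U postgres -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"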

Scenario 4: Ransomware Attack

Symptoms:

  • Encrypted files
  • Ransom notes
  • Unusual file modifications
  • Compromised accounts

Recovery Steps:

  1. Contain the Attack (Immediate)

    # Isolate affected systems
    # Disable network access
    # Revoke compromised credentials
    
    # If using AWS, quarantine the instance by attaching a security group
    # with no inbound or outbound rules (the sg ID is a placeholder for a
    # pre-created quarantine group)
    aws ec2 modify-instance-attribute \
      --instance-id i-1234567890abcdef0 \
      --groups sg-0123456789abcdef0
    
  2. Assess Impact (30 minutes)

    • Identify compromised systems
    • Determine data loss
    • Check backup integrity
    • Review security logs
  3. Contact Security Team (15 minutes)

    • Notify security team
    • Contact law enforcement if required
    • Engage incident response team
    • Preserve evidence
  4. Restore from Clean Backup (90 minutes)

    # Use backup from before attack
    # Verify backup is not compromised
    
    cd $PROJECT_ROOT/scripts
    
    # Identify a clean backup dated before the attack
    # (backup filenames embed timestamps as YYYYMMDD_HHMMSS)
    aws s3 ls s3://spywatcher-backups/postgres/full/ | \
      grep "20240120"
    
    # Restore clean backup
    ./restore.sh s3://spywatcher-backups/postgres/full/spywatcher_full_20240120_020000.dump.gz
    
  5. Rebuild Infrastructure (120 minutes)

    • Provision new clean infrastructure
    • Apply security patches
    • Update all credentials (a rotation sketch follows this scenario)
    • Implement additional security controls
  6. Restore Service (45 minutes)

    • Deploy application to clean infrastructure
    • Verify all security measures
    • Enable monitoring and alerting
    • Test thoroughly before full restoration
  7. Post-Incident Actions

    • Conduct forensic analysis
    • Update security policies
    • Implement additional controls
    • Train team on security awareness
    • Schedule security audit

Total RTO: ~3 hours (excluding investigation time)
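
A sketch of the credential rotation called for in step 5. The ALTER ROLE statement is standard PostgreSQL; the ApiKey "revoked" column is hypothetical, so adapt the second statement to the actual schema:

# Rotate the application database password
NEW_PW=$(openssl rand -base64 32)
psql -h $DB_HOST -U postgres -c "ALTER ROLE spywatcher WITH PASSWORD '$NEW_PW';"

# Invalidate all issued API keys ("revoked" is a hypothetical column)
psql -h $DB_HOST -U postgres -d spywatcher -c \
  'UPDATE "ApiKey" SET revoked = true;'

# Remember to update DB_PASSWORD everywhere the application reads it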

Point-in-Time Recovery (PITR)

If you need to recover to a specific point in time:

# Restore to specific timestamp
cd $PROJECT_ROOT/scripts
./restore.sh <backup_file> '2024-01-25 14:30:00'

Requirements:

  • WAL archiving must be enabled
  • WAL files must be available in S3
  • Backup must be from before the target time
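
The WAL archiving requirement maps to a few postgresql.conf settings. A minimal sketch; the exact archive_command used in production is an assumption:

# postgresql.conf: minimal WAL-archiving setup for PITR (sketch)
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://spywatcher-backups/wal/%f'
archive_timeout = 300   # force a segment switch at least every 5 minutes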

Testing Schedule

Monthly Tests

  • Restore from latest backup to test database
  • Verify backup integrity
  • Test backup decryption (see the dry-run sketch below)
  • Validate data completeness
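
The decryption test can be a dry run that streams a backup from S3, decrypts it, and checks gzip integrity without restoring anything. A sketch; the filename follows the runbook's naming pattern, and the pipeline assumes backups are encrypted after compression:

# Verify a backup decrypts and decompresses cleanly (no restore)
aws s3 cp s3://spywatcher-backups/postgres/full/spywatcher_full_20240125_120000.dump.gz - \
  | gpg --decrypt \
  | gzip -t && echo "backup OK"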

Quarterly Drills

  • Full disaster recovery drill
  • Document time to recovery
  • Update procedures based on findings
  • Train team members

Annual Review

  • Review and update RTO/RPO targets
  • Update contact information
  • Review and update procedures
  • Conduct table-top exercise

Contacts and Escalation

Primary Contacts

Escalation Path

  1. On-Call Engineer (0-30 minutes)
  2. Team Lead (30-60 minutes)
  3. Engineering Manager (1-2 hours)
  4. CTO (2+ hours)

External Contacts

Monitoring and Alerts

Critical Alerts

  • Backup failure alerts (via PagerDuty)
  • Database health alerts
  • Service availability alerts
  • Security incident alerts

Alert Channels

  • Email: alerts@spywatcher.com
  • Slack: #production-alerts
  • Discord: #ops-alerts
  • PagerDuty: On-call rotation

Post-Recovery Checklist

After completing any recovery procedure:

  • Verify all services are operational
  • Confirm data integrity
  • Review logs for errors
  • Update status page
  • Notify stakeholders
  • Document incident
  • Schedule post-mortem
  • Identify improvement opportunities
  • Update runbook if needed
  • Test backup integrity
  • Review security measures

Appendix

Useful Commands

# Check backup status
aws s3 ls s3://spywatcher-backups/postgres/full/ --recursive | sort -r | head -10

# Check database size
psql -h $DB_HOST -U spywatcher -c "SELECT pg_size_pretty(pg_database_size('spywatcher'));"

# Check WAL archiving status
psql -h $DB_HOST -U postgres -c "SELECT * FROM pg_stat_archiver;"

# List recent backup logs
cd $PROJECT_ROOT/backend
npm run db:backup-logs

# Monitor backup health
npm run backup:health-check

Configuration Files

  • PostgreSQL Config: /etc/postgresql/15/main/postgresql.conf
  • Backup Config: $PROJECT_ROOT/scripts/backup.sh
  • Environment: $PROJECT_ROOT/backend/.env
  • Infrastructure: $PROJECT_ROOT/infrastructure/

Additional Resources


Last Updated: 2024-11-02
Version: 1.0
Next Review: 2025-02-02