subculture-collective/discord-spywatcher

Copilot dd0bec5046 Implement production-ready auto-scaling and load balancing infrastructure (#146 )

* Initial plan

* feat: add comprehensive auto-scaling and load balancing configuration

- Add frontend HPA for auto-scaling (2-5 replicas)
- Enhance backend HPA with custom metrics support
- Improve load balancer configuration with health-based routing
- Add advanced traffic management policies
- Create AUTO_SCALING.md documentation
- Add validation script for auto-scaling setup
- Add load testing script for traffic spike simulation
- Update Helm production values with enhanced configs

Co-authored-by: onnwee <211922112+onnwee@users.noreply.github.com>

* feat: add monitoring, alerting, and comprehensive documentation

- Add Prometheus alerting rules for auto-scaling events
- Add ServiceMonitor for metrics collection
- Create comprehensive AUTO_SCALING_EXAMPLES.md tutorial
- Update DEPLOYMENT.md with auto-scaling references
- Update scripts/README.md with new validation tools
- Add monitoring for HPA, deployments, and load balancers
- Include troubleshooting scenarios and examples

Co-authored-by: onnwee <211922112+onnwee@users.noreply.github.com>

* docs: add comprehensive implementation summary

- Create AUTO_SCALING_IMPLEMENTATION.md with complete overview
- Document all components, files, and specifications
- Include deployment instructions and validation results
- Add technical specifications and performance characteristics
- Document success criteria achievement

Co-authored-by: onnwee <211922112+onnwee@users.noreply.github.com>

* fix: address code review feedback

- Merge duplicate alb.ingress.kubernetes.io/load-balancer-attributes annotation
- Fix nginx.ingress.kubernetes.io/limit-burst-multiplier to correct annotation name
- Remove unused checks_warned variable from validation script
- Fix YAML escape sequence in AUTO_SCALING_EXAMPLES.md

Co-authored-by: onnwee <211922112+onnwee@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: onnwee <211922112+onnwee@users.noreply.github.com>

2025-11-02 18:03:58 -06:00

Auto-scaling & Load Balancing Guide

This document describes the auto-scaling and load balancing configuration for Spywatcher, which provides dynamic resource scaling and zero-downtime deployments.

Overview
Horizontal Pod Autoscaling (HPA)
Load Balancing Configuration
Health-based Routing
Rolling Updates Strategy
Zero-downtime Deployment
Monitoring and Metrics
Troubleshooting
Best Practices

Overview

Spywatcher implements comprehensive auto-scaling and load balancing to handle variable workloads efficiently:

Horizontal Pod Autoscaling (HPA): Automatically scales pods based on CPU, memory, and custom metrics
Load Balancing: Distributes traffic across healthy instances
Health Checks: Removes unhealthy instances from rotation
Rolling Updates: Zero-downtime deployments with gradual rollouts
Pod Disruption Budgets: Ensures minimum availability during maintenance

Horizontal Pod Autoscaling (HPA)

Backend HPA

The backend service automatically scales between 2 and 10 replicas based on resource utilization:

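# k8s/base/backend-hpa.yaml (excerpt, written as autoscaling/v2 spec fields)
minReplicas: 2
maxReplicas: 10
metrics:
    - type: Resource # CPU: 70% average utilization
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
    - type: Resource # Memory: 80% average utilization
      resource: { name: memory, target: { type: Utilization, averageUtilization: 80 } }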
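To make the thresholds concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation, applied to the backend's 70% CPU target. The function and parameter names are illustrative, and the sample utilization values are hypothetical.

```javascript
// Sketch of the replica calculation documented for the Kubernetes HPA:
//   desired = ceil(currentReplicas * currentMetric / targetMetric)
// clamped to [minReplicas, maxReplicas]. Names here are illustrative.
function desiredReplicas({ current, currentUtilization, targetUtilization, min, max, tolerance = 0.1 }) {
    const ratio = currentUtilization / targetUtilization;
    // Within the tolerance band (10% by default) no scaling happens
    if (Math.abs(ratio - 1) <= tolerance) return current;
    const raw = Math.ceil(current * ratio);
    return Math.min(max, Math.max(min, raw));
}

// Backend limits from this guide: 2 to 10 replicas, 70% CPU target
const backend = { targetUtilization: 70, min: 2, max: 10 };
const scaledUp = desiredReplicas({ ...backend, current: 4, currentUtilization: 105 });
const steady = desiredReplicas({ ...backend, current: 4, currentUtilization: 72 });
const capped = desiredReplicas({ ...backend, current: 8, currentUtilization: 140 });
console.log({ scaledUp, steady, capped });
```

The result is a proposal only. The scaling behavior policies still limit how quickly the replica count may move toward it.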

Scaling Behavior:

Scale Up: Rapid response to load increases
- 100% increase or 2 pods every 30 seconds
- No stabilization window (immediate scale-up)
Scale Down: Conservative to prevent flapping
- 50% decrease or 1 pod every 60 seconds
- 5-minute stabilization window
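The policies above cap how far the replica count can move in one period. The sketch below estimates those caps. It assumes the policy choice is governed by the HPA selectPolicy field (Max or Min), which this guide does not state, and its rounding is a simplification rather than the controller's exact arithmetic.

```javascript
// Replica bounds after one policy period. Rounding and the selectPolicy
// handling are simplified assumptions, not the controller's exact code.
function scaleUpLimit(start, { percent, pods }, selectPolicy = 'Max') {
    const byPercent = Math.ceil(start * (1 + percent / 100));
    const byPods = start + pods;
    return selectPolicy === 'Max' ? Math.max(byPercent, byPods) : Math.min(byPercent, byPods);
}

function scaleDownLimit(start, { percent, pods }, selectPolicy = 'Max') {
    const byPercent = Math.floor(start * (1 - percent / 100));
    const byPods = start - pods;
    // "Max" means the largest permitted change, i.e. the lowest replica count
    return selectPolicy === 'Max' ? Math.min(byPercent, byPods) : Math.max(byPercent, byPods);
}

// Policies described above
const up = { percent: 100, pods: 2 }; // per 30 seconds
const down = { percent: 50, pods: 1 }; // per 60 seconds

// Fastest possible climb from 2 replicas to the backend maximum of 10
let replicas = 2;
let periods = 0;
while (replicas < 10) {
    replicas = Math.min(10, scaleUpLimit(replicas, up));
    periods += 1;
}
console.log(`2 -> 10 replicas needs at least ${periods} periods (${periods * 30}s)`);
console.log(scaleDownLimit(10, down, 'Max'), scaleDownLimit(10, down, 'Min'));
```

Under these assumptions a full scale-up takes about 90 seconds of policy time, while scale-down from 10 replicas removes between 1 and 5 pods per minute depending on the selected policy.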

Frontend HPA

The frontend service scales between 2 and 5 replicas:

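# k8s/base/frontend-hpa.yaml (excerpt, written as autoscaling/v2 spec fields)
minReplicas: 2
maxReplicas: 5
metrics:
    - type: Resource # CPU: 70% average utilization
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
    - type: Resource # Memory: 80% average utilization
      resource: { name: memory, target: { type: Utilization, averageUtilization: 80 } }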

Scaling Behavior:

Same scale-up policy as the backend (100% increase or 2 pods every 30 seconds, no stabilization window)
Same conservative scale-down with a 5-minute stabilization window

Custom Metrics (Optional)

For advanced scaling, configure custom metrics using the Prometheus Adapter:

# Additional metrics can be added:
- http_requests_per_second: scale at 1000 rps/pod
- active_connections: scale at 100 connections/pod
- queue_depth: scale based on message queue length

Setup Requirements:

Install Prometheus Operator
Install Prometheus Adapter
Configure custom metrics API
Uncomment custom metrics in HPA configuration
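When several metrics are configured, the HPA computes a proposal for each metric and keeps the largest one. The sketch below illustrates that rule with the custom metric names listed above. The current values are hypothetical, and the tolerance band and replica limits are left out for brevity.

```javascript
// With several metrics, the HPA evaluates each one and keeps the largest proposal.
function desiredFromMetrics(currentReplicas, metrics) {
    const proposals = metrics.map(({ name, current, target }) => ({
        name,
        replicas: Math.ceil(currentReplicas * (current / target)),
    }));
    return proposals.reduce((best, p) => (p.replicas > best.replicas ? p : best));
}

// Hypothetical per-pod averages for a 4-replica backend
const decision = desiredFromMetrics(4, [
    { name: 'cpu_utilization', current: 35, target: 70 },
    { name: 'http_requests_per_second', current: 1500, target: 1000 },
    { name: 'active_connections', current: 80, target: 100 },
]);
console.log(decision);
```

Here the request rate drives the decision even though CPU alone would suggest scaling down, which is the usual reason for adding a traffic-based metric.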

Checking HPA Status

# View HPA status
kubectl get hpa -n spywatcher

# Detailed HPA information
kubectl describe hpa spywatcher-backend-hpa -n spywatcher

# Watch HPA in real-time
kubectl get hpa -n spywatcher --watch

# View HPA events
kubectl get events -n spywatcher | grep -i horizontal

Load Balancing Configuration

NGINX Ingress Load Balancing

The ingress controller implements intelligent load balancing:

Load Balancing Algorithm:

EWMA (Exponentially Weighted Moving Average): Distributes requests based on response time
Automatically favors faster backends
Can outperform round-robin when backend response times vary
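The idea behind EWMA balancing can be sketched in a few lines: keep a smoothed latency score per backend and send the next request to the lowest score. This is a generic illustration with hypothetical names and an arbitrary smoothing factor. The ingress controller's own implementation is more involved and is not reproduced here.

```javascript
// Generic EWMA update: score = alpha * sample + (1 - alpha) * previousScore
class EwmaBalancer {
    constructor(backends, alpha = 0.3) {
        this.alpha = alpha;
        // Unobserved backends start at 0, so they get tried first
        this.scores = new Map(backends.map((b) => [b, 0]));
    }

    record(backend, latencyMs) {
        const previous = this.scores.get(backend);
        this.scores.set(backend, this.alpha * latencyMs + (1 - this.alpha) * previous);
    }

    pick() {
        let best = null;
        for (const [backend, score] of this.scores) {
            if (best === null || score < this.scores.get(best)) best = backend;
        }
        return best;
    }
}

const balancer = new EwmaBalancer(['pod-a', 'pod-b']);
balancer.record('pod-a', 100); // slow response
balancer.record('pod-b', 20); // fast response
console.log(balancer.pick()); // the backend with the lower smoothed latency
```

Because recent samples carry the most weight, a backend that slows down loses traffic quickly and regains it once its latency recovers.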

Connection Management:

upstream-keepalive-connections: 100
upstream-keepalive-timeout: 60s
upstream-keepalive-requests: 100

Session Affinity:

Hash-based routing using client IP
Sticky sessions for WebSocket connections
3-hour timeout for backend sessions
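Hash-based routing by client IP means the same address always maps to the same backend while the backend set is unchanged. The sketch below shows the principle with an FNV-1a hash and a modulo. The hash choice and function names are assumptions for illustration, not the algorithm used by the ingress controller or kube-proxy.

```javascript
// 32-bit FNV-1a hash of a string
function fnv1a(text) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
    }
    return hash;
}

// The same client IP always lands on the same backend for a fixed backend list
function pickBackend(clientIp, backends) {
    return backends[fnv1a(clientIp) % backends.length];
}

const pods = ['pod-a', 'pod-b', 'pod-c'];
const first = pickBackend('203.0.113.7', pods);
const second = pickBackend('203.0.113.7', pods);
console.log(first === second); // true
```

A plain modulo remaps many clients whenever the pod count changes, so sessions can move during scaling events. That is one reason to keep session state in an external store as well.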

AWS Load Balancer

For AWS deployments, the ALB/NLB provides:

Features:

Cross-zone load balancing (traffic distributed across all AZs)
Connection draining (60-second timeout for graceful shutdown)
Health checks every 30 seconds
HTTP/2 support enabled
Deletion protection enabled

Health Check Configuration:

Path: /health/live
Interval: 30s
Timeout: 5s
Healthy Threshold: 2
Unhealthy Threshold: 3

Service-level Load Balancing

Kubernetes services use ClusterIP with client IP session affinity:

sessionAffinity: ClientIP
sessionAffinityConfig:
    clientIP:
        timeoutSeconds: 10800 # 3 hours

Health-based Routing

Health Check Endpoints

Backend Health Checks:

Liveness: /health/live - Container is alive
Readiness: /health/ready - Ready to serve traffic
Startup: /health/live - Slow startup tolerance

Frontend Health Checks:

Liveness: / - NGINX is responding
Readiness: / - Ready to serve traffic

Health Check Configuration

Backend:

livenessProbe:
    httpGet:
        path: /health/live
        port: 3001
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3

readinessProbe:
    httpGet:
        path: /health/ready
        port: 3001
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 3

startupProbe:
    httpGet:
        path: /health/live
        port: 3001
    periodSeconds: 10
    failureThreshold: 30 # 5 minutes total
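The probe settings translate into rough detection windows by multiplying the period by the failure threshold. The sketch below does that arithmetic for the backend values above. It ignores initialDelaySeconds and timeoutSeconds, so treat the numbers as approximations.

```javascript
// Approximate time for a probe to trip: failureThreshold consecutive
// failures, one per period (initial delay and timeouts are ignored)
function failureWindowSeconds({ periodSeconds, failureThreshold }) {
    return periodSeconds * failureThreshold;
}

// Values copied from the backend probe configuration above
const backendProbes = {
    liveness: { periodSeconds: 10, failureThreshold: 3 },
    readiness: { periodSeconds: 5, failureThreshold: 3 },
    startup: { periodSeconds: 10, failureThreshold: 30 },
};

const windows = Object.fromEntries(
    Object.entries(backendProbes).map(([name, probe]) => [name, failureWindowSeconds(probe)])
);
console.log(windows); // { liveness: 30, readiness: 15, startup: 300 }
```

So an unready pod leaves rotation after about 15 seconds, a hung container is restarted after about 30 seconds, and a slow-starting container has up to 5 minutes before it is killed.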

Automatic Retry Logic

The ingress controller automatically retries failed requests:

proxy-next-upstream: 'error timeout http_502 http_503 http_504'
proxy-next-upstream-tries: 3
proxy-next-upstream-timeout: 10s

Behavior:

Retries on backend errors, timeouts, 502/503/504
Maximum 3 attempts
10-second timeout for retries
Automatically routes to healthy backends
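The retry behavior can be pictured as a loop over upstreams that is bounded by both an attempt count and a time budget. The sketch below is a simplified model of that behavior, not NGINX's implementation. The send callback, the pod names, and the injectable clock are hypothetical.

```javascript
// `send` is a hypothetical function that forwards the request to one upstream
// and returns { status }, or throws on a connection error or timeout.
function proxyWithRetries(upstreams, send, { tries = 3, timeoutMs = 10000, now = Date.now } = {}) {
    const retryable = new Set([502, 503, 504]);
    const deadline = now() + timeoutMs;
    const attempts = [];
    let response = null;
    for (const upstream of upstreams.slice(0, tries)) {
        // After the first attempt, stop once the retry time budget is spent
        if (attempts.length > 0 && now() >= deadline) break;
        attempts.push(upstream);
        try {
            response = send(upstream);
        } catch (err) {
            response = { status: 502, error: String(err) };
        }
        if (!retryable.has(response.status)) break;
    }
    return { ...response, attempts };
}

const statusByPod = { 'pod-a': 503, 'pod-b': 200, 'pod-c': 200 };
const result = proxyWithRetries(Object.keys(statusByPod), (pod) => ({ status: statusByPod[pod] }));
console.log(result); // { status: 200, attempts: [ 'pod-a', 'pod-b' ] }
```

If every attempt fails, the client receives the last error response, so retries hide isolated pod failures but not a full outage.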

Removing Unhealthy Instances

Instances are removed from load balancer rotation when:

Readiness probe fails 3 consecutive times (15 seconds)
Health check endpoint returns non-200 status
Request timeout exceeds threshold
Container becomes unresponsive

Recovery:

Readiness probe must succeed before pod receives traffic
2 consecutive successful health checks required
Gradual traffic restoration

Rolling Updates Strategy

Deployment Strategy

Both backend and frontend use the RollingUpdate strategy:

strategy:
    type: RollingUpdate
    rollingUpdate:
        maxSurge: 1 # 1 extra pod during update
        maxUnavailable: 0 # Never drop below the desired replica count

Benefits:

Zero downtime - the desired number of pods stays available throughout
Gradual rollout - one pod at a time
Failed rollouts stall instead of replacing healthy pods; roll back with kubectl rollout undo

Update Process

Step-by-step:

New pod with updated image is created (maxSurge: 1)
New pod passes startup probe (up to 5 minutes)
New pod passes readiness probe
New pod receives traffic from load balancer
Old pod is marked for termination
Load balancer drains connections from old pod (60s)
Old pod receives SIGTERM signal
Graceful shutdown (30s timeout)
Process repeats for next pod
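The effect of maxSurge and maxUnavailable on the steps above can be checked with a small simulation. This is a simplified model that assumes every new pod passes its probes. It is not the Deployment controller's algorithm, and the function name is illustrative.

```javascript
// Simplified rolling update model: surge first, then retire old pods
function simulateRollingUpdate(replicas, { maxSurge = 1, maxUnavailable = 0 } = {}) {
    if (maxSurge === 0 && maxUnavailable === 0) {
        throw new Error('maxSurge and maxUnavailable cannot both be 0');
    }
    let oldPods = replicas;
    let newPods = 0;
    let minReady = replicas;
    const steps = [];
    while (oldPods > 0 || newPods < replicas) {
        // Surge: create new pods without exceeding replicas + maxSurge in total
        const create = Math.min(replicas + maxSurge - (oldPods + newPods), replicas - newPods);
        newPods += create; // assumes each new pod passes startup and readiness probes
        // Terminate old pods only while enough ready pods remain
        const remove = Math.min(oldPods, oldPods + newPods - (replicas - maxUnavailable));
        oldPods -= remove;
        minReady = Math.min(minReady, oldPods + newPods);
        steps.push({ oldPods, newPods });
    }
    return { steps, minReady };
}

const rollout = simulateRollingUpdate(2, { maxSurge: 1, maxUnavailable: 0 });
console.log(rollout.steps); // [ { oldPods: 1, newPods: 1 }, { oldPods: 0, newPods: 2 } ]
console.log(rollout.minReady); // 2
```

With the documented settings the ready count never drops below the desired replica count, at the cost of one extra pod's worth of cluster capacity during the update.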

Revision History

Keep last 10 revisions for rollback:

revisionHistoryLimit: 10

View revision history:

kubectl rollout history deployment/spywatcher-backend -n spywatcher

Zero-downtime Deployment

Requirements Checklist

Multiple replicas (minimum 2)
Health checks configured (liveness, readiness, startup)
Pod Disruption Budget (minAvailable: 1)
Rolling update strategy (maxUnavailable: 0)
Graceful shutdown handling
Connection draining
Pre-stop hooks (if needed)

Deployment Process

Using kubectl:

# Update image
kubectl set image deployment/spywatcher-backend \
  backend=ghcr.io/subculture-collective/spywatcher-backend:v2.0.0 \
  -n spywatcher

# Watch rollout status
kubectl rollout status deployment/spywatcher-backend -n spywatcher

# Pause rollout (if issues detected)
kubectl rollout pause deployment/spywatcher-backend -n spywatcher

# Resume rollout
kubectl rollout resume deployment/spywatcher-backend -n spywatcher

# Rollback if needed
kubectl rollout undo deployment/spywatcher-backend -n spywatcher

Using Kustomize:

# Update image tag in kustomization.yaml
kubectl apply -k k8s/overlays/production

# Monitor rollout
kubectl rollout status deployment/spywatcher-backend -n spywatcher

Graceful Shutdown

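Applications must handle the SIGTERM signal:

// Backend graceful shutdown example
// Assumes `server` (an http.Server), `prisma`, and `redis` are created elsewhere in the app
process.on('SIGTERM', async () => {
    console.log('SIGTERM received, starting graceful shutdown');

    // Safety net: force exit before the 30s grace period ends (25s is an illustrative value)
    const forceExit = setTimeout(() => process.exit(1), 25000);
    forceExit.unref();

    try {
        // Stop accepting new connections and wait for in-flight requests to finish
        await new Promise((resolve, reject) => {
            server.close((err) => (err ? reject(err) : resolve()));
        });
        console.log('Server closed');

        // Close database connections
        await prisma.$disconnect();

        // Close Redis connections
        await redis.quit();

        // Exit process
        process.exit(0);
    } catch (err) {
        console.error('Error during graceful shutdown', err);
        process.exit(1);
    }
});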

Kubernetes termination flow:

Pod marked for termination
Removed from service endpoints (stops receiving new traffic)
SIGTERM sent to container (in parallel with endpoint removal, so a few requests may still arrive briefly)
Grace period starts (default 30s)
Container performs cleanup
If not terminated after grace period, SIGKILL sent
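Since any pre-stop hook runs inside the same grace period as the application's own cleanup, it helps to check that the two fit together. The sketch below does that arithmetic. The timing values are hypothetical and the function name is illustrative.

```javascript
// The grace period countdown covers the pre-stop hook and the app's shutdown
function checkGracePeriod({ gracePeriodSeconds = 30, preStopSeconds = 0, shutdownSeconds }) {
    const needed = preStopSeconds + shutdownSeconds;
    return { needed, fits: needed <= gracePeriodSeconds, slack: gracePeriodSeconds - needed };
}

// Hypothetical timings: a 10s pre-stop sleep plus 15s of application cleanup
const shortDrain = checkGracePeriod({ preStopSeconds: 10, shutdownSeconds: 15 });
// Waiting out a full 60s drain in a pre-stop hook exceeds the 30s default
const longDrain = checkGracePeriod({ preStopSeconds: 60, shutdownSeconds: 15 });
console.log(shortDrain); // { needed: 25, fits: true, slack: 5 }
console.log(longDrain); // { needed: 75, fits: false, slack: -45 }
```

When the check fails, the container is killed mid-cleanup, so either shorten the drain or raise terminationGracePeriodSeconds to cover it.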

Connection Draining

Load Balancer Level:

60-second connection draining
Existing connections allowed to complete
No new connections routed to terminating pod

Application Level:

Stop accepting new requests
Complete in-flight requests
Close persistent connections gracefully

Pod Disruption Budget

Ensures minimum availability during voluntary disruptions:

# k8s/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
    name: spywatcher-backend-pdb
spec:
    minAvailable: 1 # At least 1 pod must be available
    selector:
        matchLabels:
            app: spywatcher
            tier: backend

Protects against:

Node drain operations
Voluntary evictions
Cluster upgrades
Node maintenance
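How a budget with minAvailable plays out during a node drain comes down to one subtraction: the number of voluntary evictions allowed is the count of healthy pods minus minAvailable, floored at zero. A minimal sketch, with an illustrative function name:

```javascript
// Voluntary evictions permitted right now under a minAvailable budget
function allowedDisruptions(healthyPods, minAvailable) {
    return Math.max(0, healthyPods - minAvailable);
}

console.log(allowedDisruptions(2, 1)); // 1: one backend pod may be evicted
console.log(allowedDisruptions(1, 1)); // 0: the drain waits for a replacement pod
```

With the backend minimum of 2 replicas and minAvailable: 1, a drain can evict one pod at a time and must wait for its replacement before evicting the next.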

Monitoring and Metrics

HPA Metrics

# View current metrics
kubectl get hpa -n spywatcher

# Detailed metrics
kubectl describe hpa spywatcher-backend-hpa -n spywatcher

# Raw metrics from metrics-server
kubectl top pods -n spywatcher
kubectl top nodes

Scaling Events

# View scaling events
kubectl get events -n spywatcher | grep -i horizontal

# Watch for scaling events
kubectl get events -n spywatcher --watch | grep -i horizontal

Load Balancer Metrics

AWS CloudWatch Metrics:

Target health count
Request count
Response time
HTTP status codes
Connection count

Prometheus Metrics:

# Request rate
rate(http_requests_total[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Running pod count (the metric is 1 for a pod's current phase and 0 otherwise)
sum(kube_pod_status_phase{namespace="spywatcher", phase="Running"})

# HPA current replicas
kube_horizontalpodautoscaler_status_current_replicas{namespace="spywatcher"}

Alerting Rules

Recommended Alerts:

# HPA at max capacity
- alert: HPAMaxedOut
  expr: |
      kube_horizontalpodautoscaler_status_current_replicas
      >= kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  labels:
      severity: warning
  annotations:
      summary: HPA has reached maximum replicas

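# High scaling frequency (replica count is a gauge, so count changes rather than taking a rate;
# the threshold of 3 changes per 15 minutes is illustrative)
- alert: FrequentScaling
  expr: |
      changes(kube_horizontalpodautoscaler_status_current_replicas[15m]) > 3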
  for: 30m
  labels:
      severity: warning
  annotations:
      summary: HPA is scaling frequently

# Deployment rollout stuck
- alert: RolloutStuck
  expr: |
      kube_deployment_status_replicas_updated
      < kube_deployment_spec_replicas
  for: 15m
  labels:
      severity: critical
  annotations:
      summary: Deployment rollout is stuck
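Scaling frequency is easiest to reason about as the number of times the replica count changes within a window, which is what the PromQL changes() function reports for a gauge. The sketch below reproduces that count on sample series. The sample values are hypothetical.

```javascript
// Number of times consecutive samples differ, like PromQL changes() over a window
function countChanges(samples) {
    let changes = 0;
    for (let i = 1; i < samples.length; i++) {
        if (samples[i] !== samples[i - 1]) changes += 1;
    }
    return changes;
}

// Hypothetical replica counts scraped over one window
const steadyWindow = [4, 4, 4, 6, 6, 6, 6];
const flappingWindow = [4, 6, 4, 6, 5, 7, 4];
console.log(countChanges(steadyWindow)); // 1
console.log(countChanges(flappingWindow)); // 6
```

A single change reflects a normal scaling step, while many changes in a short window suggest thresholds or stabilization windows that need tuning.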

Troubleshooting

HPA Not Scaling

Symptoms:

HPA shows <unknown> for metrics
Pods not scaling despite high load

Solutions:

Check metrics-server is running:

kubectl get deployment metrics-server -n kube-system
kubectl logs -n kube-system deployment/metrics-server

Verify resource requests are set:

kubectl describe deployment spywatcher-backend -n spywatcher | grep -A 5 Requests

Check HPA events:

kubectl describe hpa spywatcher-backend-hpa -n spywatcher

Verify metrics are available:

kubectl top pods -n spywatcher

Pods Not Receiving Traffic

Symptoms:

Pods are running but not receiving requests
High load on some pods while others sit idle

Solutions:

Check readiness probe:

kubectl describe pod <pod-name> -n spywatcher | grep -A 10 Readiness

Verify service endpoints:

kubectl get endpoints spywatcher-backend -n spywatcher

Check ingress configuration:

kubectl describe ingress spywatcher-ingress -n spywatcher

Test health endpoint directly:

kubectl port-forward pod/<pod-name> 3001:3001 -n spywatcher
curl http://localhost:3001/health/ready

Rolling Update Stuck

Symptoms:

Deployment shows pods pending
Old pods not terminating
Update taking too long

Solutions:

Check rollout status:

kubectl rollout status deployment/spywatcher-backend -n spywatcher
kubectl describe deployment spywatcher-backend -n spywatcher

View pod events:

kubectl get events -n spywatcher --sort-by='.lastTimestamp' | grep -i error

Check PDB is not blocking:

kubectl get pdb -n spywatcher

Verify node resources:

kubectl describe nodes | grep -A 5 "Allocated resources"

Force rollout (last resort):

kubectl rollout restart deployment/spywatcher-backend -n spywatcher

High Latency During Scaling

Symptoms:

Response times increase during scale-up
Connections failing during scale-down

Solutions:

Adjust readiness probe:
- Reduce initialDelaySeconds
- Increase periodSeconds for stability
Configure connection draining:
- Ensure pre-stop hooks are configured
- Increase termination grace period
Optimize startup time:
- Use startup probe for slow-starting apps
- Reduce container image size
- Implement application-level warmup
Review HPA behavior:
- Adjust stabilization windows
- Modify scale-up/down policies
- Consider custom metrics

Best Practices

Design for Auto-scaling

Stateless Applications
- Store state externally (Redis, database)
- Enable horizontal scaling
- Simplify deployment and recovery
Resource Requests and Limits
- Always set resource requests (required for HPA)
- Set realistic limits based on actual usage
- Leave headroom for traffic spikes
Proper Health Checks
- Implement meaningful health endpoints
- Check external dependencies
- Use startup probes for slow initialization
Graceful Shutdown
- Handle SIGTERM signal
- Complete in-flight requests
- Close connections cleanly
- Set appropriate termination grace period

Scaling Strategy

Conservative Scale-down
- Use longer stabilization windows
- Prevent flapping
- Reduce pod churn
Aggressive Scale-up
- Respond quickly to load increases
- Prevent service degradation
- Better user experience
Set Realistic Limits
- Maximum replicas based on cluster capacity
- Minimum replicas for redundancy
- Consider cost vs. performance trade-offs
Monitor and Adjust
- Review scaling patterns regularly
- Adjust thresholds based on actual load
- Optimize resource requests

Load Balancing

Health Check Tuning
- Balance between responsiveness and stability
- Consider application startup time
- Use appropriate timeout values
Connection Management
- Enable keepalive connections
- Configure appropriate timeouts
- Use connection pooling
Session Affinity
- Use for stateful sessions
- Configure appropriate timeout
- Consider sticky sessions for WebSockets
Cross-zone Distribution
- Enable cross-zone load balancing
- Use pod anti-affinity rules
- Distribute across availability zones

Deployment Strategy

Test in Staging First
- Validate changes in non-production
- Test auto-scaling behavior
- Verify health checks work correctly
Monitor During Rollout
- Watch error rates
- Check response times
- Monitor resource usage
Progressive Delivery
- Use canary deployments for risky changes
- Implement feature flags
- Have rollback plan ready
Database Migrations
- Run migrations before code deployment
- Ensure backward compatibility
- Test rollback scenarios

Cost Optimization

Right-size Resources
- Set requests based on actual usage
- Use VPA (Vertical Pod Autoscaler) for recommendations
- Review and adjust regularly
Efficient Scaling
- Scale based on meaningful metrics
- Avoid over-provisioning
- Use cluster autoscaler for nodes
Schedule-based Scaling
- Reduce replicas during off-peak hours
- Use CronJobs for scheduled scaling
- Consider regional traffic patterns
Resource Quotas
- Set namespace quotas
- Prevent runaway scaling
- Control costs

Support

For issues with auto-scaling or load balancing:

Check monitoring dashboards
Review HPA and deployment events
Consult CloudWatch/Prometheus metrics
Contact DevOps team
