Auto-scaling & Load Balancing Guide
This document describes the auto-scaling and load balancing configuration for Spywatcher, ensuring dynamic resource scaling and zero-downtime deployments.
Table of Contents
- Overview
- Horizontal Pod Autoscaling (HPA)
- Load Balancing Configuration
- Health-based Routing
- Rolling Updates Strategy
- Zero-downtime Deployment
- Monitoring and Metrics
- Troubleshooting
- Best Practices
Overview
Spywatcher implements comprehensive auto-scaling and load balancing to handle variable workloads efficiently:
- Horizontal Pod Autoscaling (HPA): Automatically scales pods based on CPU, memory, and custom metrics
- Load Balancing: Distributes traffic across healthy instances
- Health Checks: Removes unhealthy instances from rotation
- Rolling Updates: Zero-downtime deployments with gradual rollouts
- Pod Disruption Budgets: Ensures minimum availability during maintenance
Horizontal Pod Autoscaling (HPA)
Backend HPA
The backend service automatically scales between 2 and 10 replicas based on resource utilization:
# k8s/base/backend-hpa.yaml
minReplicas: 2
maxReplicas: 10
metrics:
- CPU: 70% average utilization
- Memory: 80% average utilization
Scaling Behavior:
- Scale Up: rapid response to load increases
  - 100% increase or 2 pods every 30 seconds
  - No stabilization window (immediate scale-up)
- Scale Down: conservative to prevent flapping
  - 50% decrease or 1 pod every 60 seconds
  - 5-minute stabilization window
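Expressed as an autoscaling/v2 manifest, the behavior above looks roughly like this (a sketch mirroring the described policies, not necessarily the exact contents of k8s/base/backend-hpa.yaml):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spywatcher-backend-hpa
  namespace: spywatcher
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spywatcher-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # immediate scale-up
      selectPolicy: Max               # take the larger of the two policies
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300 # 5-minute window to prevent flapping
      selectPolicy: Min               # take the smaller (more conservative) change
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 1
          periodSeconds: 60
```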
Frontend HPA
The frontend service scales between 2 and 5 replicas:
# k8s/base/frontend-hpa.yaml
minReplicas: 2
maxReplicas: 5
metrics:
- CPU: 70% average utilization
- Memory: 80% average utilization
Scaling Behavior:
- Same aggressive scale-up policy
- Conservative scale-down with 5-minute stabilization
Custom Metrics (Optional)
For advanced scaling, configure custom metrics using Prometheus adapter:
# Additional metrics can be added:
- http_requests_per_second: scale at 1000 rps/pod
- active_connections: scale at 100 connections/pod
- queue_depth: scale based on message queue length
Setup Requirements:
- Install Prometheus Operator
- Install Prometheus Adapter
- Configure custom metrics API
- Uncomment custom metrics in HPA configuration
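With the adapter in place, a custom metric can be added alongside the resource metrics. The snippet below is illustrative; http_requests_per_second must match a metric name the Prometheus adapter actually exposes in your configuration:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"   # scale out above 1000 rps per pod
```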
Checking HPA Status
# View HPA status
kubectl get hpa -n spywatcher
# Detailed HPA information
kubectl describe hpa spywatcher-backend-hpa -n spywatcher
# Watch HPA in real-time
kubectl get hpa -n spywatcher --watch
# View HPA events
kubectl get events -n spywatcher | grep -i horizontal
Load Balancing Configuration
NGINX Ingress Load Balancing
The ingress controller implements intelligent load balancing:
Load Balancing Algorithm:
- EWMA (Exponentially Weighted Moving Average): Distributes requests based on response time
- Automatically favors faster backends
- Provides better performance than round-robin
Connection Management:
upstream-keepalive-connections: 100
upstream-keepalive-timeout: 60s
upstream-keepalive-requests: 100
Session Affinity:
- Hash-based routing using client IP
- Sticky sessions for WebSocket connections
- 3-hour timeout for backend sessions
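In ingress-nginx, the algorithm and affinity settings map to the controller ConfigMap and per-ingress annotations. A sketch, assuming a recent ingress-nginx release (key names should be checked against the deployed version):

```yaml
# Controller ConfigMap entry selecting the EWMA balancer
data:
  load-balance: ewma
---
# Per-ingress annotation for hash-based routing on client IP
metadata:
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: '$remote_addr'
```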
AWS Load Balancer
For AWS deployments, the ALB/NLB provides:
Features:
- Cross-zone load balancing (traffic distributed across all AZs)
- Connection draining (60-second timeout for graceful shutdown)
- Health checks every 30 seconds
- HTTP/2 support enabled
- Deletion protection enabled
Health Check Configuration:
Path: /health/live
Interval: 30s
Timeout: 5s
Healthy Threshold: 2
Unhealthy Threshold: 3
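With the AWS Load Balancer Controller, these settings are expressed as ingress annotations. A sketch whose values mirror the configuration above:

```yaml
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /health/live
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '30'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '3'
    # 60-second connection draining before a target is deregistered
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60
```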
Service-level Load Balancing
Kubernetes services use ClusterIP with client IP session affinity:
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800 # 3 hours
Health-based Routing
Health Check Endpoints
Backend Health Checks:
- Liveness: /health/live (container is alive)
- Readiness: /health/ready (ready to serve traffic)
- Startup: /health/live (slow startup tolerance)
Frontend Health Checks:
- Liveness: / (NGINX is responding)
- Readiness: / (ready to serve traffic)
Health Check Configuration
Backend:
livenessProbe:
  httpGet:
    path: /health/live
    port: 3001
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3001
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health/live
    port: 3001
  periodSeconds: 10
  failureThreshold: 30 # 5 minutes total
Automatic Retry Logic
The ingress controller automatically retries failed requests:
proxy-next-upstream: 'error timeout http_502 http_503 http_504'
proxy-next-upstream-tries: 3
proxy-next-upstream-timeout: 10s
Behavior:
- Retries on backend errors, timeouts, 502/503/504
- Maximum 3 attempts
- 10-second timeout for retries
- Automatically routes to healthy backends
Removing Unhealthy Instances
Instances are removed from load balancer rotation when:
- Readiness probe fails 3 consecutive times (15 seconds)
- Health check endpoint returns non-200 status
- Request timeout exceeds threshold
- Container becomes unresponsive
Recovery:
- Readiness probe must succeed before pod receives traffic
- 2 consecutive successful health checks required
- Gradual traffic restoration
Rolling Updates Strategy
Deployment Strategy
Both backend and frontend use RollingUpdate strategy:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1       # 1 extra pod during update
    maxUnavailable: 0 # no pods may become unavailable during the update
Benefits:
- Zero downtime - at least minimum pods always available
- Gradual rollout - one pod at a time
- Automatic rollback on failure
- No service interruption
Update Process
Step-by-step:
1. New pod with updated image is created (maxSurge: 1)
2. New pod passes startup probe (up to 5 minutes)
3. New pod passes readiness probe
4. New pod receives traffic from load balancer
5. Old pod is marked for termination
6. Load balancer drains connections from old pod (60s)
7. Old pod receives SIGTERM signal
8. Graceful shutdown (30s timeout)
9. Process repeats for next pod
Revision History
Keep last 10 revisions for rollback:
revisionHistoryLimit: 10
View revision history:
kubectl rollout history deployment/spywatcher-backend -n spywatcher
Zero-downtime Deployment
Requirements Checklist
- Multiple replicas (minimum 2)
- Health checks configured (liveness, readiness, startup)
- Pod Disruption Budget (minAvailable: 1)
- Rolling update strategy (maxUnavailable: 0)
- Graceful shutdown handling
- Connection draining
- Pre-stop hooks (if needed)
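A pre-stop hook delays SIGTERM long enough for the load balancer to stop routing to the pod. A sketch; the 15-second sleep is an illustrative value, and the hook assumes the image ships a shell:

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: backend
      lifecycle:
        preStop:
          exec:
            command: ['sh', '-c', 'sleep 15']
```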
Deployment Process
Using kubectl:
# Update image
kubectl set image deployment/spywatcher-backend \
  backend=ghcr.io/subculture-collective/spywatcher-backend:v2.0.0 \
  -n spywatcher
# Watch rollout status
kubectl rollout status deployment/spywatcher-backend -n spywatcher
# Pause rollout (if issues detected)
kubectl rollout pause deployment/spywatcher-backend -n spywatcher
# Resume rollout
kubectl rollout resume deployment/spywatcher-backend -n spywatcher
# Rollback if needed
kubectl rollout undo deployment/spywatcher-backend -n spywatcher
Using Kustomize:
# Update image tag in kustomization.yaml
kubectl apply -k k8s/overlays/production
# Monitor rollout
kubectl rollout status deployment/spywatcher-backend -n spywatcher
Graceful Shutdown
Applications must handle SIGTERM signal:
// Backend graceful shutdown example
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, starting graceful shutdown');
  // Stop accepting new connections and wait for in-flight requests to finish
  await new Promise((resolve) => server.close(resolve));
  console.log('Server closed');
  // Close database connections
  await prisma.$disconnect();
  // Close Redis connections
  await redis.quit();
  // Exit process
  process.exit(0);
});
Kubernetes termination flow:
1. Pod marked for termination
2. Removed from service endpoints (stops receiving new traffic)
3. SIGTERM sent to container
4. Grace period starts (default 30s)
5. Container performs cleanup
6. If not terminated after grace period, SIGKILL sent
Connection Draining
Load Balancer Level:
- 60-second connection draining
- Existing connections allowed to complete
- No new connections routed to terminating pod
Application Level:
- Stop accepting new requests
- Complete in-flight requests
- Close persistent connections gracefully
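The application-level steps can be sketched as a small in-flight request tracker: refuse new work once draining starts and signal when the last request finishes. Illustrative only; a real server would wire onRequestStart/onRequestEnd into its request middleware:

```javascript
// Minimal in-flight tracker for application-level connection draining.
let inflight = 0;
let draining = false;

// Call at the start of each request; returns false once draining begins.
function onRequestStart() {
  if (draining) return false; // stop accepting new requests
  inflight += 1;
  return true;
}

// Call when a request completes.
function onRequestEnd() {
  inflight -= 1;
}

// Begin draining; invoke onIdle once all in-flight requests are done.
function startDrain(onIdle) {
  draining = true;
  const timer = setInterval(() => {
    if (inflight === 0) {
      clearInterval(timer);
      onIdle(); // safe to close persistent connections now
    }
  }, 50);
}
```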
Pod Disruption Budget
Ensures minimum availability during voluntary disruptions:
# k8s/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spywatcher-backend-pdb
spec:
  minAvailable: 1 # At least 1 pod must be available
  selector:
    matchLabels:
      app: spywatcher
      tier: backend
Protects against:
- Node drain operations
- Voluntary evictions
- Cluster upgrades
- Node maintenance
Monitoring and Metrics
HPA Metrics
# View current metrics
kubectl get hpa -n spywatcher
# Detailed metrics
kubectl describe hpa spywatcher-backend-hpa -n spywatcher
# Raw metrics from metrics-server
kubectl top pods -n spywatcher
kubectl top nodes
Scaling Events
# View scaling events
kubectl get events -n spywatcher | grep -i horizontal
# Watch for scaling events
kubectl get events -n spywatcher --watch | grep -i horizontal
Load Balancer Metrics
AWS CloudWatch Metrics:
- Target health count
- Request count
- Response time
- HTTP status codes
- Connection count
Prometheus Metrics:
# Request rate
rate(http_requests_total[5m])
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Running pod count
sum(kube_pod_status_phase{namespace="spywatcher", phase="Running"})
# HPA current replicas
kube_horizontalpodautoscaler_status_current_replicas{namespace="spywatcher"}
Alerting Rules
Recommended Alerts:
# HPA at max capacity
- alert: HPAMaxedOut
  expr: |
    kube_horizontalpodautoscaler_status_current_replicas
    >= kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: HPA has reached maximum replicas
# High scaling frequency
- alert: FrequentScaling
  expr: |
    rate(kube_horizontalpodautoscaler_status_current_replicas[15m]) > 0.5
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: HPA is scaling frequently
# Deployment rollout stuck
- alert: RolloutStuck
  expr: |
    kube_deployment_status_replicas_updated
    < kube_deployment_spec_replicas
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: Deployment rollout is stuck
Troubleshooting
HPA Not Scaling
Symptoms:
- HPA shows <unknown> for metrics
- Pods not scaling despite high load
Solutions:
- Check metrics-server is running:
kubectl get deployment metrics-server -n kube-system
kubectl logs -n kube-system deployment/metrics-server
- Verify resource requests are set:
kubectl describe deployment spywatcher-backend -n spywatcher | grep -A 5 Requests
- Check HPA events:
kubectl describe hpa spywatcher-backend-hpa -n spywatcher
- Verify metrics are available:
kubectl top pods -n spywatcher
Pods Not Receiving Traffic
Symptoms:
- Pods are running but not receiving requests
- High load on some pods, idle others
Solutions:
- Check readiness probe:
kubectl describe pod <pod-name> -n spywatcher | grep -A 10 Readiness
- Verify service endpoints:
kubectl get endpoints spywatcher-backend -n spywatcher
- Check ingress configuration:
kubectl describe ingress spywatcher-ingress -n spywatcher
- Test health endpoint directly:
kubectl port-forward pod/<pod-name> 3001:3001 -n spywatcher
curl http://localhost:3001/health/ready
Rolling Update Stuck
Symptoms:
- Deployment shows pods pending
- Old pods not terminating
- Update taking too long
Solutions:
- Check rollout status:
kubectl rollout status deployment/spywatcher-backend -n spywatcher
kubectl describe deployment spywatcher-backend -n spywatcher
- View pod events:
kubectl get events -n spywatcher --sort-by='.lastTimestamp' | grep -i error
- Check PDB is not blocking:
kubectl get pdb -n spywatcher
- Verify node resources:
kubectl describe nodes | grep -A 5 "Allocated resources"
- Force rollout (last resort):
kubectl rollout restart deployment/spywatcher-backend -n spywatcher
High Latency During Scaling
Symptoms:
- Response times increase during scale-up
- Connections failing during scale-down
Solutions:
- Adjust readiness probe:
  - Reduce initialDelaySeconds
  - Increase periodSeconds for stability
- Configure connection draining:
  - Ensure pre-stop hooks are configured
  - Increase termination grace period
- Optimize startup time:
  - Use startup probe for slow-starting apps
  - Reduce container image size
  - Implement application-level warmup
- Review HPA behavior:
  - Adjust stabilization windows
  - Modify scale-up/down policies
  - Consider custom metrics
Best Practices
Design for Auto-scaling
- Stateless Applications
  - Store state externally (Redis, database)
  - Enable horizontal scaling
  - Simplify deployment and recovery
- Resource Requests and Limits
  - Always set resource requests (required for HPA)
  - Set realistic limits based on actual usage
  - Leave headroom for traffic spikes
- Proper Health Checks
  - Implement meaningful health endpoints
  - Check external dependencies
  - Use startup probes for slow initialization
- Graceful Shutdown
  - Handle SIGTERM signal
  - Complete in-flight requests
  - Close connections cleanly
  - Set appropriate termination grace period
Scaling Strategy
- Conservative Scale-down
  - Use longer stabilization windows
  - Prevent flapping
  - Reduce pod churn
- Aggressive Scale-up
  - Respond quickly to load increases
  - Prevent service degradation
  - Better user experience
- Set Realistic Limits
  - Maximum replicas based on cluster capacity
  - Minimum replicas for redundancy
  - Consider cost vs. performance trade-offs
- Monitor and Adjust
  - Review scaling patterns regularly
  - Adjust thresholds based on actual load
  - Optimize resource requests
Load Balancing
- Health Check Tuning
  - Balance between responsiveness and stability
  - Consider application startup time
  - Use appropriate timeout values
- Connection Management
  - Enable keepalive connections
  - Configure appropriate timeouts
  - Use connection pooling
- Session Affinity
  - Use for stateful sessions
  - Configure appropriate timeout
  - Consider sticky sessions for WebSockets
- Cross-zone Distribution
  - Enable cross-zone load balancing
  - Use pod anti-affinity rules
  - Distribute across availability zones
Deployment Strategy
- Test in Staging First
  - Validate changes in non-production
  - Test auto-scaling behavior
  - Verify health checks work correctly
- Monitor During Rollout
  - Watch error rates
  - Check response times
  - Monitor resource usage
- Progressive Delivery
  - Use canary deployments for risky changes
  - Implement feature flags
  - Have rollback plan ready
- Database Migrations
  - Run migrations before code deployment
  - Ensure backward compatibility
  - Test rollback scenarios
Cost Optimization
- Right-size Resources
  - Set requests based on actual usage
  - Use VPA (Vertical Pod Autoscaler) for recommendations
  - Review and adjust regularly
- Efficient Scaling
  - Scale based on meaningful metrics
  - Avoid over-provisioning
  - Use cluster autoscaler for nodes
- Schedule-based Scaling
  - Reduce replicas during off-peak hours
  - Use CronJobs for scheduled scaling
  - Consider regional traffic patterns
- Resource Quotas
  - Set namespace quotas
  - Prevent runaway scaling
  - Control costs
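Schedule-based scaling can be sketched as a CronJob that patches the HPA floor during off-peak hours. Illustrative only: the hpa-scaler ServiceAccount, image tag, and schedule are assumptions, and the account needs RBAC permission to patch HorizontalPodAutoscalers:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-offpeak
  namespace: spywatcher
spec:
  schedule: '0 22 * * *' # 22:00 daily, cluster time zone
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-scaler # assumed SA with patch rights on HPAs
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.29
              command:
                - kubectl
                - patch
                - hpa
                - spywatcher-backend-hpa
                - --patch
                - '{"spec": {"minReplicas": 1}}'
```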
References
- Kubernetes HPA Documentation
- Kubernetes Rolling Updates
- NGINX Ingress Controller
- AWS Load Balancer Controller
- Pod Disruption Budgets
Support
For issues with auto-scaling or load balancing:
- Check monitoring dashboards
- Review HPA and deployment events
- Consult CloudWatch/Prometheus metrics
- Contact DevOps team