Add production deployment infrastructure with Kubernetes, Terraform, and multi-strategy CI/CD #145
Production Deployment Strategy & Infrastructure - Kubernetes & IaC
Addressed all PR review feedback with the following improvements:
Changes Made
Security Improvements
Infrastructure Improvements
Files Modified
- `terraform/modules/vpc/main.tf` - Scoped IAM policy resource
- `terraform/modules/rds/main.tf` - Removed timestamp from final snapshot
- `terraform/main.tf` - Updated auth API version to v1
- `scripts/deployment/canary-deploy.sh` - Added warning about traffic splitting
- `.github/workflows/deploy-production.yml` - Fixed smoke tests, improved migration
- `k8s/base/backend-deployment.yaml` - Removed init container
- `k8s/base/migration-job.yaml` - New separate migration job (created)
- `k8s/base/kustomization.yaml` - Added migration job to resources
- `k8s/README.md` - Updated documentation for migrations

Original prompt
This section details the original issue you should resolve:
<issue_title>Production Deployment Strategy & Infrastructure - Kubernetes & IaC</issue_title>
<issue_description>
## 🎯 Objective
Design and implement a production deployment strategy with CI/CD, infrastructure as code, and blue-green deployment capabilities.
## 📋 Requirements
Infrastructure as Code
Terraform/CloudFormation
Container Orchestration
Deployment Environments
Environment Setup
Cloud Provider Setup
Deployment Strategy
Blue-Green Deployment
Canary Deployment
Rolling Deployment
Load Balancing & SSL
Load Balancer Configuration
SSL Certificates
Auto-scaling
Horizontal Scaling
Database Scaling
## 🛠️ Implementation Details
Kubernetes Deployment
Terraform Infrastructure
Pull Request Overview
This PR adds comprehensive production deployment infrastructure for the Spywatcher application, including Terraform modules for AWS infrastructure, Kubernetes manifests, Helm charts, deployment scripts, and CI/CD automation via GitHub Actions.
Reviewed Changes
Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.
`@@ -0,0 +1,291 @@` (`name: Deploy to Production`):

The smoke test attempts to get the LoadBalancer hostname from a ClusterIP service (as defined in `k8s/base/backend-service.yaml` line 12: `type: ClusterIP`), which won't have a `loadBalancer` status. This will result in an empty `BACKEND_URL` and cause the smoke tests to fail. Consider testing via the ingress URL or using port-forwarding for internal service testing.

`@@ -0,0 +1,193 @@` (`apiVersion: apps/v1`):

Running database migrations in an init container means migrations run for every pod replica during deployment, which can cause race conditions and deadlocks with multiple pods trying to migrate simultaneously. Consider running migrations as a separate Kubernetes Job before deployment or using a leader election mechanism.
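A minimal sketch of the separate-Job approach the review suggests, using `completions: 1` and `parallelism: 1` so only one migration pod ever runs. The Job name, image, and migration command here are illustrative assumptions, not taken from the PR:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spywatcher-migrate      # hypothetical name
spec:
  completions: 1                # exactly one successful migration run
  parallelism: 1                # never two migration pods at once
  backoffLimit: 3               # retry a few times on transient DB errors
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: spywatcher/backend:latest    # assumed image
          command: ["npm", "run", "migrate"]  # assumed migration command
          envFrom:
            - secretRef:
                name: backend-db-credentials  # assumed secret name
```

The deploy pipeline would apply this Job and wait for it (e.g. `kubectl wait --for=condition=complete job/spywatcher-migrate`) before rolling out the application Deployment.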
`@@ -0,0 +91,4 @@` (`scripts/deployment/canary-deploy.sh`):

```shell
# Calculate replica counts based on percentages
local total_replicas=3
local canary_replicas=$(( (total_replicas * canary_weight + 50) / 100 ))
```

The replica calculation for `canary_weight=5%` will result in `canary_replicas=1` due to integer arithmetic rounding, which is actually 33% of total traffic (1 out of 3 replicas), not 5%. This defeats the purpose of gradual canary rollout at low percentages. Consider using a service mesh or ingress controller for accurate traffic splitting instead of relying on replica counts.

`@@ -0,0 +1,141 @@` (`terraform {`):

The `client.authentication.k8s.io/v1beta1` API version for EKS authentication is deprecated as of Kubernetes 1.24. Consider using `client.authentication.k8s.io/v1` instead for future compatibility with the EKS cluster version 1.28 specified in variables.

`@@ -0,0 +1,164 @@` (`resource "aws_db_subnet_group" "main" {`):

Using `timestamp()` in the final snapshot identifier will cause Terraform to detect changes on every plan, since the value is always different. Consider using a variable or removing the timestamp to avoid unnecessary drift detection. The snapshot identifier already includes the instance identifier, which should be unique enough.

`@@ -0,0 +1,213 @@` (`resource "aws_vpc" "main" {`):

The IAM policy grants VPC Flow Logs permissions on all resources (`Resource = "*"`). This should be scoped to the specific CloudWatch log group to follow the principle of least privilege. Consider changing to `Resource = aws_cloudwatch_log_group.flow_log.arn`.

@copilot apply changes based on the comments in this thread
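The canary rounding problem above is easy to reproduce. This standalone sketch replays the arithmetic for a 5% request; the minimum-of-one clamp is an assumption about how such a script would avoid a zero-replica canary, not code quoted from the PR:

```shell
total_replicas=3
canary_weight=5   # requested 5% canary traffic

# Integer round-to-nearest: (3 * 5 + 50) / 100 truncates to 0
canary_replicas=$(( (total_replicas * canary_weight + 50) / 100 ))

# A deploy script must run at least one canary pod, so assume a clamp:
[ "$canary_replicas" -lt 1 ] && canary_replicas=1

# 1 of 3 replicas now receives ~33% of traffic, not the requested 5%
echo "canary replicas: $canary_replicas of $total_replicas"
```

With only three replicas the smallest nonzero canary is one pod, i.e. a third of traffic, which is why the review recommends splitting traffic at the ingress or service-mesh layer instead of via replica counts.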
`@@ -0,0 +1,164 @@` (`resource "aws_db_subnet_group" "main" {`):

Fixed in commit 95fdde6. Removed the `timestamp()` call from `final_snapshot_identifier` to prevent Terraform drift. Now uses: `"${var.identifier}-final-snapshot"`.

`@@ -0,0 +1,213 @@` (`resource "aws_vpc" "main" {`):

Fixed in commit 95fdde6. Scoped the IAM policy resource to the specific CloudWatch log group ARN: `"${aws_cloudwatch_log_group.flow_log.arn}:*"`.

`@@ -0,0 +1,291 @@` (`name: Deploy to Production`):

Fixed in commit 95fdde6. Updated smoke tests to check for ingress first, and fall back to port-forwarding if ingress is not available. This correctly handles ClusterIP services without expecting a LoadBalancer status.

`@@ -0,0 +1,193 @@` (`apiVersion: apps/v1`):

Fixed in commit 95fdde6. Removed the init container and created a separate `migration-job.yaml` with `completions: 1` and `parallelism: 1` to prevent race conditions. The GitHub Actions workflow now creates unique migration jobs before deployment.

`@@ -0,0 +1,141 @@` (`terraform {`):

Fixed in commit 95fdde6. Updated both Kubernetes and Helm provider configurations to use `client.authentication.k8s.io/v1` instead of the deprecated `v1beta1` API version.

`@@ -0,0 +91,4 @@` (`scripts/deployment/canary-deploy.sh`):

Fixed in commit 95fdde6. Added a warning comment in the script about replica-based traffic splitting limitations and a recommendation to use a service mesh (Istio, Linkerd) or ingress controller for accurate percentage-based traffic splitting.
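Taken together, the three Terraform-side fixes look roughly like the sketch below. Resource names and the surrounding blocks are abbreviated assumptions for illustration, not the PR's exact code:

```hcl
# RDS: static final snapshot identifier, no timestamp(), no perpetual plan drift
resource "aws_db_instance" "main" {
  identifier                = var.identifier
  final_snapshot_identifier = "${var.identifier}-final-snapshot"
  # ... engine, instance_class, etc. elided
}

# VPC Flow Logs: IAM policy scoped to the one log group instead of "*"
resource "aws_iam_role_policy" "flow_log" {
  role = aws_iam_role.flow_log.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["logs:CreateLogStream", "logs:PutLogEvents"]
      Resource = "${aws_cloudwatch_log_group.flow_log.arn}:*"
    }]
  })
}

# EKS auth: non-deprecated exec plugin API version for the Kubernetes provider
provider "kubernetes" {
  host                   = aws_eks_cluster.main.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.main.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", aws_eks_cluster.main.name]
  }
}
```

The `:*` suffix on the log group ARN covers the log streams inside the group, which is what `logs:PutLogEvents` actually targets.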