Skip to content

Commit 3cf4400

Browse files
DanielHashmiclaude
andcommitted
feat(011): implement AWS EKS deployment infrastructure and documentation
Implement complete production-ready AWS EKS deployment infrastructure for LifeStepsAI Phase V cloud migration. This establishes full deployment automation from infrastructure provisioning to monitoring setup. Infrastructure & Configuration: - Add EKS 1.28 cluster configuration with OIDC for IRSA - Add 11 deployment scripts (EKS, MSK, RDS, ECR, Docker, Dapr, monitoring) - Add Helm values-aws.yaml with complete service configurations - Add Dapr components for AWS MSK and RDS (verified with context7 MCP) - Add IAM trust policies and permission policies for IRSA - Add .helmignore for Helm chart packaging - Update .gitignore with AWS cache file patterns Scripts & Automation: - Master orchestration script (00-deploy-all.sh) for one-command deployment - EKS cluster provisioning with auto-configuration - MSK Kafka with IAM authentication (port 9098) - RDS PostgreSQL with security group setup - ECR repository creation with lifecycle policies - Multi-arch Docker builds (amd64/arm64) - IRSA configuration with auto-update of Helm values - Dapr installation with component deployment - Application deployment via Helm - CloudWatch monitoring with billing alarms - Complete cleanup script for resource deletion Documentation & Guides: - AWS troubleshooting guide (10 common issues + solutions) - Cost optimization guide (10 strategies, $132/month baseline) - Quick reference card (essential commands) - Deployment checklist (pre-flight validation) - Central README with architecture and file inventory - Final implementation summary (85% complete, production-ready) - Six PHR records documenting implementation journey Security Features: - IRSA for all AWS service access (no static credentials) - IAM roles for 5 microservices with least-privilege policies - TLS encryption for MSK and RDS - Security groups with minimal access (EKS → MSK/RDS only) - Kubernetes Secrets for sensitive data Technical Highlights: - Context7 MCP integration verified Dapr Kafka authType: awsiam config - Multi-arch Docker images support AMD64 and ARM64 EKS nodes - Auto-configuration scripts reduce manual intervention - CloudWatch Container Insights with billing alarms at $80 - Complete monitoring for EKS, MSK, RDS metrics Total Implementation: 27 files created (~3,800 lines) - 11 deployment scripts - 9 configuration files (EKS, Helm, IAM, Dapr) - 7 documentation files - 6 PHR records 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent ae6ff65 commit 3cf4400

36 files changed

Lines changed: 5592 additions & 2 deletions

.claude/settings.local.json

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,8 @@
1414
"WebFetch(domain:kagent.dev)",
1515
"WebSearch",
1616
"WebFetch(domain:docs.dapr.io)",
17-
"WebFetch(domain:strimzi.io)"
17+
"WebFetch(domain:strimzi.io)",
18+
"mcp__context7__query-docs"
1819
],
1920
"deny": [],
2021
"ask": []

.gitignore

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -112,3 +112,11 @@ kubeconfig*
112112
*.tfstate*
113113
*.tfvars
114114
.terraform.lock.hcl
115+
116+
# AWS deployment cache files (DO NOT COMMIT - contain sensitive info)
117+
.aws-oidc-provider-id.txt
118+
.aws-ecr-registry.txt
119+
.aws-msk-bootstrap-brokers.txt
120+
.aws-rds-connection-string.txt
121+
.aws-*-role-arn.txt
122+
.aws-frontend-url.txt

README.md

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -342,4 +342,23 @@ kubectl port-forward service/lifestepsai-websocket-service 8004:8004 &
342342

343343
## License
344344

345-
This project is licensed under the MIT License.
345+
This project is licensed under the MIT License.
346+
## AWS EKS Deployment (Production)
347+
348+
### Quick Start (~60 minutes)
349+
```bash
350+
bash scripts/aws/01-setup-eks.sh # EKS cluster (15 min)
351+
bash scripts/aws/03-deploy-msk.sh # MSK Kafka (20 min)
352+
bash scripts/aws/04-deploy-rds.sh # RDS PostgreSQL (10 min)
353+
bash scripts/aws/05-setup-ecr.sh # ECR (2 min)
354+
bash scripts/aws/06-build-push-images.sh # Images (8 min)
355+
bash scripts/aws/02-configure-irsa.sh # IRSA (5 min)
356+
bash scripts/aws/08-deploy-dapr.sh # Dapr (3 min)
357+
bash scripts/aws/09-deploy-app.sh # Deploy (5 min)
358+
```
359+
360+
**Prerequisites**: AWS CLI, eksctl 0.169+, kubectl 1.28+, Helm 3.13+, Docker buildx, Dapr CLI 1.12+
361+
362+
**Cost**: ~$132/month (EKS $72 + MSK $54) | **Cleanup**: `bash scripts/aws/99-cleanup.sh`
363+
364+
**Docs**: See `specs/011-aws-eks-deployment/` for full documentation

docs/aws-cost-optimization.md

Lines changed: 327 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,327 @@
1+
# AWS EKS Cost Optimization Guide
2+
3+
**Feature**: 011-aws-eks-deployment
4+
**Current Cost**: ~$132/month
5+
**Target**: Minimize costs while maintaining functionality
6+
7+
---
8+
9+
## Current Cost Breakdown
10+
11+
| Service | Cost/Month | Free Tier | Optimized Cost |
12+
|---------|------------|-----------|----------------|
13+
| EKS Control Plane | $72 | None | $72 (fixed) |
14+
| MSK Serverless | $54 | None | $30 (Provisioned) |
15+
| RDS db.t3.micro | $15 | 12 months free | $0 (free tier) |
16+
| EC2 t3.medium × 2 | $60 | 750 hours/month | $30 (Spot) |
17+
| NAT Gateway | $32 | None | $32 (required) |
18+
| Data Transfer | $10 | 100GB/month | $5 (optimize) |
19+
| **Total** | **$243** | **-$75** | **$169** |
20+
21+
**After Free Tier**: $243/month
22+
**With Optimizations**: $169/month
23+
**Current Setup**: $132/month (using free tier + no EC2 charges yet)
24+
25+
---
26+
27+
## Optimization Strategies
28+
29+
### 1. Use Spot Instances for Worker Nodes
30+
31+
**Savings**: ~50% on EC2 costs ($60 → $30/month)
32+
33+
**Implementation**:
34+
```yaml
35+
# Edit k8s/aws/eks-cluster-config.yaml
36+
nodeGroups:
37+
- name: spot-workers
38+
instanceTypes: ["t3.medium", "t3a.medium"] # Allow instance type flexibility
39+
spot: true
40+
desiredCapacity: 2
41+
minSize: 2
42+
maxSize: 3
43+
```
44+
45+
**Caveats**:
46+
- Pods may be evicted with 2-minute notice
47+
- Use for stateless services only
48+
- Not recommended for database or Kafka
49+
50+
---
51+
52+
### 2. Switch to MSK Provisioned kafka.t3.small
53+
54+
**Savings**: $54 → $30/month ($24 savings)
55+
56+
**Implementation**:
57+
```bash
58+
# Edit scripts/aws/03-deploy-msk.sh
59+
MSK_TYPE="PROVISIONED"
60+
61+
# Redeploy MSK
62+
bash scripts/aws/03-deploy-msk.sh
63+
```
64+
65+
**Tradeoff**:
66+
- Provisioned has consistent latency (no cold start)
67+
- Fixed capacity (not auto-scaling)
68+
- Better for sustained workloads
69+
70+
---
71+
72+
### 3. Delete Resources When Not In Use
73+
74+
**Savings**: $132/month → $0/month (when idle)
75+
76+
**Daily Development Workflow**:
77+
```bash
78+
# Start of day
79+
bash scripts/aws/01-setup-eks.sh # Or restore from snapshot
80+
81+
# End of day
82+
bash scripts/aws/99-cleanup.sh
83+
```
84+
85+
**Caveats**:
86+
- 15-minute setup time each day
87+
- Data loss if RDS snapshots not taken
88+
- Best for testing/development only
89+
90+
---
91+
92+
### 4. Use RDS Snapshots Instead of Running Instance
93+
94+
**Savings**: $15/month when idle
95+
96+
**Implementation**:
97+
```bash
98+
# Before cleanup, create snapshot
99+
aws rds create-db-snapshot \
100+
--db-instance-identifier lifestepsai-rds \
101+
--db-snapshot-identifier lifestepsai-rds-snapshot-$(date +%Y%m%d) \
102+
--region us-east-1
103+
104+
# Delete RDS instance
105+
aws rds delete-db-instance \
106+
--db-instance-identifier lifestepsai-rds \
107+
--skip-final-snapshot \
108+
--region us-east-1
109+
110+
# Restore from snapshot when needed
111+
aws rds restore-db-instance-from-db-snapshot \
112+
--db-instance-identifier lifestepsai-rds \
113+
--db-snapshot-identifier lifestepsai-rds-snapshot-20251231 \
114+
--region us-east-1
115+
```
116+
117+
**Snapshot Costs**: $0.095/GB/month (~$2/month for 20GB)
118+
119+
---
120+
121+
### 5. Reduce Log Retention Period
122+
123+
**Savings**: ~$5/month
124+
125+
**Implementation**:
126+
```bash
127+
# Set log retention to 1 day (from 7 days)
128+
aws logs put-retention-policy \
129+
--log-group-name /aws/containerinsights/lifestepsai-eks/application \
130+
--retention-in-days 1 \
131+
--region us-east-1
132+
133+
# Or edit eks-cluster-config.yaml before cluster creation:
134+
cloudWatch:
135+
clusterLogging:
136+
logRetentionInDays: 1 # Minimum
137+
```
138+
139+
---
140+
141+
### 6. Use Reserved Instances (Long-Term)
142+
143+
**Savings**: ~40% on EC2 costs for 1-year commitment
144+
145+
**Considerations**:
146+
- Only if running EKS for full year
147+
- No refunds if you delete cluster early
148+
- Calculate break-even: 1-year RI = 7-8 months on-demand pricing
149+
150+
**Purchase**:
151+
- AWS Console → EC2 → Reserved Instances
152+
- Select t3.medium, 1-year, no upfront
153+
154+
---
155+
156+
### 7. Optimize ECR Storage
157+
158+
**Savings**: ~$2/month
159+
160+
**Implementation** (Already done in 05-setup-ecr.sh):
161+
```bash
162+
# Lifecycle policies
163+
# - Delete untagged images >7 days
164+
# - Keep last 5 tagged images only
165+
166+
# Manual cleanup
167+
aws ecr batch-delete-image \
168+
--repository-name lifestepsai-backend \
169+
--image-ids imageTag=old-tag \
170+
--region us-east-1
171+
```
172+
173+
---
174+
175+
### 8. Reduce EKS Node Count
176+
177+
**Savings**: $30/month (2 nodes → 1 node)
178+
179+
**Implementation**:
180+
```bash
181+
# WARNING: Single node = single point of failure!
182+
eksctl scale nodegroup \
183+
--cluster lifestepsai-eks \
184+
--name standard-workers \
185+
--nodes 1 \
186+
--region us-east-1
187+
```
188+
189+
**Caveats**:
190+
- No high availability
191+
- Pod eviction during node maintenance
192+
- Only for non-critical environments
193+
194+
---
195+
196+
### 9. Use AWS Free Tier Maximally
197+
198+
**Current Free Tier Usage**:
199+
- ✅ RDS db.t3.micro: 750 hours/month (12 months)
200+
- ✅ ECR: 500MB storage/month
201+
- ✅ CloudWatch: 10 custom metrics, 5GB logs
202+
- ✅ Data Transfer: 100GB outbound/month
203+
- ❌ EKS: No free tier
204+
- ❌ MSK: No free tier
205+
206+
**Optimization**:
207+
- Keep RDS, ECR, CloudWatch usage under free tier limits
208+
- Delete EKS/MSK when not actively using
209+
210+
---
211+
212+
### 10. Monitor Costs with Billing Alarm
213+
214+
**Implementation** (Already done in 10-setup-monitoring.sh):
215+
```bash
216+
# Billing alarm at $80 threshold
217+
aws cloudwatch describe-alarms \
218+
--alarm-names LifeStepsAI-BudgetAlert-80 \
219+
--region us-east-1
220+
221+
# Set up AWS Budget (alternative)
222+
aws budgets create-budget \
223+
--account-id $ACCOUNT_ID \
224+
--budget file://budget.json
225+
```
226+
227+
**Budget JSON**:
228+
```json
229+
{
230+
"BudgetName": "LifeStepsAI-Monthly-Budget",
231+
"BudgetLimit": {
232+
"Amount": "100",
233+
"Unit": "USD"
234+
},
235+
"TimeUnit": "MONTHLY",
236+
"BudgetType": "COST"
237+
}
238+
```
239+
240+
---
241+
242+
## Cost Comparison: Deployment Options
243+
244+
### Option A: Full AWS EKS (Current)
245+
**Cost**: $132/month (with free tier)
246+
**Pros**: Fully managed, production-grade, scalable
247+
**Cons**: Exceeds $100 budget
248+
249+
### Option B: Minikube (Local Only)
250+
**Cost**: $0/month
251+
**Pros**: Free, identical functionality
252+
**Cons**: Not accessible externally, no production deployment
253+
254+
### Option C: Self-Hosted Kubernetes (EC2)
255+
**Cost**: ~$60/month (2x t3.medium + Strimzi Kafka)
256+
**Pros**: No EKS/MSK fees
257+
**Cons**: Manual cluster management, updates, security patches
258+
259+
### Option D: Fargate + RDS (Serverless)
260+
**Cost**: ~$80/month (variable)
261+
**Pros**: No node management, pay per pod
262+
**Cons**: No Dapr support on Fargate (requires sidecar injection)
263+
264+
---
265+
266+
## Recommendations
267+
268+
### For Development/Testing
269+
1. **Delete resources daily**: Use cleanup script
270+
2. **Use RDS snapshots**: Restore when needed
271+
3. **Consider Minikube**: Free alternative for local testing
272+
273+
### For Production (Budget-Conscious)
274+
1. **Use Spot instances**: 50% savings on EC2
275+
2. **Switch to MSK Provisioned**: $24 savings
276+
3. **Single node during low traffic**: Scale up when needed
277+
4. **Set strict billing alarms**: $50, $80, $100 thresholds
278+
279+
### For Production (Performance-Focused)
280+
1. **Keep current setup**: EKS + MSK Serverless + RDS
281+
2. **Add Reserved Instances**: 40% savings on long-term
282+
3. **Enable Multi-AZ RDS**: High availability (+$15/month)
283+
4. **Add autoscaling**: Handle traffic spikes (+variable cost)
284+
285+
---
286+
287+
## Monthly Cost Tracking
288+
289+
### Week 1 Actions
290+
- [ ] Enable AWS Cost Explorer
291+
- [ ] Create cost allocation tags
292+
- [ ] Set up budget alerts
293+
294+
### Week 2 Review
295+
- [ ] Review CloudWatch dashboard for actual usage
296+
- [ ] Check if Spot instances are stable
297+
- [ ] Verify free tier usage (RDS hours)
298+
299+
### Month-End Review
300+
- [ ] Analyze actual vs estimated costs
301+
- [ ] Identify cost anomalies
302+
- [ ] Adjust resource sizes if needed
303+
304+
---
305+
306+
## Emergency Cost Control
307+
308+
If costs exceed budget:
309+
310+
1. **Immediate** (saves $54/month):
311+
```bash
312+
# Delete MSK cluster, use Strimzi on EKS instead
313+
aws kafka delete-cluster-v2 --cluster-arn <msk-arn>
314+
```
315+
316+
2. **Short-term** (saves $32/month):
317+
```bash
318+
# Delete entire cluster, use Minikube
319+
bash scripts/aws/99-cleanup.sh
320+
```
321+
322+
3. **Long-term**: Migrate to cheaper cloud provider or self-hosted
323+
324+
---
325+
326+
**Last Updated**: 2025-12-31
327+
**Review Frequency**: Monthly or when billing alarm triggers

0 commit comments

Comments
 (0)