
Commit ba2afca

Merge pull request #12 from cristofima/dev - docs: update architecture cost comparison details
- Updated cost calculation formulas for Lambda and Fargate Spot training jobs with more accurate estimates
- Clarified that SageMaker training alone is cost-competitive, but endpoints drive the total cost difference
- Added CI/CD status badges to README
- Enhanced troubleshooting guidance and documentation structure
2 parents 2db381f + 3eb2955 commit ba2afca

12 files changed

Lines changed: 52 additions & 23 deletions

.github/copilot-instructions.md

Lines changed: 25 additions & 8 deletions
@@ -51,10 +51,14 @@ Required env vars in training container: `DATASET_ID`, `TARGET_COLUMN`, `JOB_ID`
 
 ### Training Container (Python)
 
-- **Preprocessing** (`preprocessor.py`): Auto-detects ID columns, uses `feature-engine` for constant/duplicate detection
-- **Problem type**: `<20 unique values OR <5% unique ratio` = classification
-- **Model training** (`model_trainer.py`): FLAML with `['lgbm', 'rf', 'extra_tree']` - xgboost excluded (bugs)
-- **Multiclass**: Explicitly set `metric='accuracy'`
+Located in `backend/training/`, runs as Docker container in AWS Batch:
+
+- **Entry point** (`train.py`): Orchestrates 7-step pipeline (download → EDA → preprocess → train → reports → save → update status)
+- **Preprocessing** (`preprocessor.py`): Auto-detects ID columns using regex patterns, uses `feature-engine` for constant/duplicate detection
+- **Problem type detection**: `<20 unique values OR <5% unique ratio` = classification
+- **Model training** (`model_trainer.py`): FLAML with `['lgbm', 'rf', 'extra_tree']` - xgboost excluded due to `best_iteration` bugs
+- **Multiclass**: Explicitly set `metric='accuracy'` (FLAML's auto-detection unreliable)
+- **Reports**: Generates both EDA (`sweetviz`) and training reports with feature importance charts
 
 ### Frontend (TypeScript)
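The problem-type rule in this hunk (`<20 unique values OR <5% unique ratio` = classification) can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `detect_problem_type` and its parameters are hypothetical, and it works on a plain Python sequence instead of a DataFrame column.

```python
from typing import Hashable, Sequence


def detect_problem_type(
    target: Sequence[Hashable],
    max_unique: int = 20,            # threshold quoted in the docs
    max_unique_ratio: float = 0.05,  # 5% unique ratio quoted in the docs
) -> str:
    """Return 'classification' or 'regression' for a target column.

    Hypothetical helper: mirrors the documented heuristic only.
    """
    if not target:
        raise ValueError("target column is empty")
    n_unique = len(set(target))
    unique_ratio = n_unique / len(target)
    # Either condition alone is enough to treat the target as categorical.
    if n_unique < max_unique or unique_ratio < max_unique_ratio:
        return "classification"
    return "regression"
```

Note that the `OR` means a column with 30 distinct values across 1,000 rows (3% unique) would still be treated as classification under this rule.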

@@ -100,6 +104,8 @@ python scripts/generate_architecture_diagram.py
 | Job stuck RUNNING | Missing DynamoDB perms | Add `dynamodb:UpdateItem` to Batch task role in `iam.tf` |
 | New train.py param ignored | Not in containerOverrides | Add to `batch_service.py` environment list |
 | Frontend CORS errors | Wrong API URL | Get from `terraform output api_gateway_url` |
+| Low model accuracy | ID columns in training | Check `preprocessor.py` ID detection patterns |
+| DynamoDB Decimal errors | Floats in metrics dict | Convert to `Decimal(str(v))` before saving |
 
 ## File Reference by Task
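The `Decimal(str(v))` fix in the new troubleshooting row exists because boto3 rejects Python `float` values when writing to DynamoDB. A minimal sketch of applying the conversion to a nested metrics dict follows; the helper name `to_dynamodb_safe` and the sample metric keys are hypothetical, not taken from the repository.

```python
from decimal import Decimal
from typing import Any


def to_dynamodb_safe(value: Any) -> Any:
    """Recursively replace floats with Decimal so the item can be stored.

    Hypothetical helper. Going through str() avoids carrying binary
    floating-point noise into the Decimal value.
    """
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamodb_safe(item) for item in value]
    return value  # ints, strings, bools, None pass through unchanged


metrics = {"accuracy": 0.93, "n_classes": 5, "per_class_f1": [0.1, 0.2]}
safe_metrics = to_dynamodb_safe(metrics)
```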

@@ -112,16 +118,17 @@ python scripts/generate_architecture_diagram.py
 ## Schema Sync Pattern
 
 Backend Pydantic and Frontend TypeScript schemas must match. When adding fields:
-1. `backend/api/models/schemas.py` - Add to Pydantic model
-2. `frontend/lib/api.ts` - Add to TypeScript interface
-3. Example: `JobResponse` (backend) ↔ `JobDetails` (frontend)
+1. `backend/api/models/schemas.py` - Add to Pydantic model (e.g., `JobResponse`)
+2. `frontend/lib/api.ts` - Add to TypeScript interface (e.g., `JobDetails`)
+3. Key pairs: `JobResponse` ↔ `JobDetails`, `DatasetMetadata` ↔ `DatasetMetadata`, `TrainResponse` ↔ `TrainResponse`
 
 ## Debugging
 
 - Lambda logs: `/aws/lambda/automl-lite-{env}-api`
 - Batch logs: `/aws/batch/automl-lite-{env}-training`
 - Local API: `http://localhost:8000/docs` (Swagger UI)
 - Env var mismatch: Compare `batch_service.py` containerOverrides with `train.py` os.getenv()
+- Training issues: Check `dropped_columns` in preprocessing_info for filtered features
 
 ## Utility Scripts
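The "Env var mismatch" debugging tip in this hunk amounts to a set difference: the variables `train.py` reads minus the variables `batch_service.py` passes in `containerOverrides`. One way to automate the comparison is sketched below. It is an assumption-laden illustration: the helper names are hypothetical, and the regex only catches string-literal reads via `os.getenv`, `os.environ.get`, and `os.environ[...]`.

```python
import re
from typing import Iterable, Set

# Matches os.getenv("X"), os.environ.get("X") and os.environ["X"]
# where X is a string literal (single or double quoted).
_ENV_READ = re.compile(
    r'os\.(?:getenv\(|environ\.get\(|environ\[)\s*["\']([A-Za-z_][A-Za-z0-9_]*)["\']'
)


def env_vars_read(source: str) -> Set[str]:
    """Collect environment variable names read in a Python source string."""
    return set(_ENV_READ.findall(source))


def missing_overrides(train_source: str, override_names: Iterable[str]) -> Set[str]:
    """Names the training script reads but the job submission never sets."""
    return env_vars_read(train_source) - set(override_names)
```

A non-empty result points at the "New train.py param ignored" case from the troubleshooting table.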

@@ -131,9 +138,19 @@ Backend Pydantic and Frontend TypeScript schemas must match. When adding fields:
 | `scripts/predict.py` | Make predictions with trained models (Docker) |
 | `scripts/generate_architecture_diagram.py` | Generate AWS architecture diagrams |
 
+## CI/CD Workflows (`.github/workflows/`)
+
+| Workflow | Trigger | Purpose |
+|----------|---------|---------|
+| `deploy-lambda-api.yml` | Push to main/dev | Deploy FastAPI to Lambda |
+| `deploy-training-container.yml` | Push to main/dev | Build & push training image to ECR |
+| `deploy-infrastructure.yml` | Manual | Terraform apply |
+| `ci-terraform.yml` | PR | Terraform validate & plan |
+
 ## Key Docs
 
-- `docs/LESSONS_LEARNED.md` - Critical debugging insights
+- `docs/LESSONS_LEARNED.md` - Critical debugging insights (read this first for troubleshooting)
 - `docs/QUICKSTART.md` - Deployment guide
 - `.github/SETUP_CICD.md` - CI/CD with GitHub Actions
 - `infrastructure/terraform/ARCHITECTURE_DECISIONS.md` - Why Lambda + Batch split
+- `.github/git-commit-messages-instructions.md` - Commit message conventions

.github/git-commit-messages-instructions.md

Lines changed: 2 additions & 2 deletions
@@ -435,9 +435,9 @@ Modified files (6):
 Configuration:
 - vCPU: 2, Memory: 4GB
 - Max runtime: 60 minutes
-- Spot pricing: ~$0.05/job vs $0.17/job on-demand
+- Spot pricing: ~$0.017/job (70% savings vs on-demand)
 
-Cost estimate: $3/month for 20 training jobs
+Cost estimate: ~$0.34/month for 20 training jobs (Fargate compute only)
 ```
 
 ---

README.md

Lines changed: 10 additions & 0 deletions
@@ -2,6 +2,16 @@
 
 A lightweight, cost-effective AutoML platform built on AWS serverless architecture. Upload CSV files, automatically detect problem types, and train machine learning models with just a few clicks.
 
+## 🔄 CI/CD Status
+
+| Workflow | Main | Dev |
+|----------|------|-----|
+| CI Terraform | [![CI](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml) | [![CI](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml) |
+| Deploy Infrastructure | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml) |
+| Deploy Lambda API | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml) |
+| Deploy Training Container | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml) |
+| Deploy Frontend | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml) |
+
 ## 🚀 Features
 
 - **Smart Problem Detection**: Automatically classifies tasks as regression or classification based on data characteristics

docs/PROJECT_REFERENCE.md

Lines changed: 4 additions & 2 deletions
@@ -421,8 +421,10 @@ Total: ~$10-25/month
 
 **Comparison:**
 - SageMaker with real-time endpoint: ~$150-300/month (ml.c5.xlarge 24/7)
-- This solution: $12-15/month
-- **Savings: ~80-95%**
+- This solution: $10-25/month
+- **Savings: ~85-95%** (vs SageMaker with endpoints)
+
+> Note: SageMaker training alone costs ~$0.68-3.20/month for 20 jobs, comparable to this solution. The significant savings come from avoiding always-on inference endpoints.
 
 ---
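The savings range in this hunk can be sanity-checked from the two monthly cost ranges it quotes. The sketch below is illustrative only; `savings_range` is a hypothetical helper, and the inputs are the rounded figures from the diff, not measured bills.

```python
from typing import Tuple


def savings_range(
    baseline: Tuple[float, float], ours: Tuple[float, float]
) -> Tuple[float, float]:
    """Return (worst, best) percentage savings for two (low, high) cost ranges."""
    base_low, base_high = baseline
    ours_low, ours_high = ours
    worst = (1 - ours_high / base_low) * 100  # our highest bill vs their lowest
    best = (1 - ours_low / base_high) * 100   # our lowest bill vs their highest
    return worst, best


# Figures quoted in the diff: SageMaker with endpoint vs this solution.
worst, best = savings_range(baseline=(150.0, 300.0), ours=(10.0, 25.0))
```

With these inputs the extremes come out to roughly 83% and 97%, so the quoted "~85-95%" sits inside the computed bounds as a rounded summary.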


infrastructure/terraform/ARCHITECTURE_DECISIONS.md

Lines changed: 5 additions & 5 deletions
@@ -81,9 +81,9 @@ For datasets >10K rows, Lambda timeout is insufficient.
 - Cost: ~$0.04 per vCPU-hour (Spot)
 
 **Training job cost comparison (10 min job):**
-- Lambda (10GB): ~$0.17
-- Fargate Spot (2 vCPU, 4GB): ~$0.013
-- **Savings: 92%**
+- Lambda (10GB): ~$0.10
+- Fargate Spot (2 vCPU, 4GB): ~$0.017
+- **Savings: 83%**
 
 ---

@@ -130,7 +130,7 @@ For datasets >10K rows, Lambda timeout is insufficient.
 - ✅ No cold starts
 
 **Cons:**
-- ❌ Always running (higher cost: ~$25/month)
+- ❌ Always running (higher cost: ~$12-15/month)
 - ❌ Not ideal for batch jobs
 - ❌ Overkill for intermittent training

@@ -297,7 +297,7 @@ If you want to eliminate Docker for demo purposes, you can:
 **Containers are used ONLY in the training component, where they are technically required due to:**
 - Dependency size constraints (265MB > 250MB Lambda limit)
 - Runtime duration (can exceed 15 min Lambda limit)
-- Cost optimization (Batch Spot 92% cheaper)
+- Cost optimization (Batch Spot ~83% cheaper than Lambda for training)
 
 **This is not "unnecessary Docker complexity" - it's the right tool for the job.**
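The revised per-job figures in this file can be reproduced with a few lines of arithmetic. The sketch below assumes a Lambda price of $0.0000166667 per GB-second (an assumed x86 list price that varies by region and ignores request charges and the free tier). It takes the diff's ~$0.017 Fargate Spot figure as given rather than deriving it, and the function names are hypothetical.

```python
# Assumption: Lambda x86 list price per GB-second; check current regional pricing.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667


def lambda_job_cost(memory_gb: float, duration_s: float) -> float:
    """Compute-only Lambda cost for one invocation."""
    return memory_gb * duration_s * LAMBDA_PRICE_PER_GB_SECOND


def savings_pct(baseline: float, alternative: float) -> float:
    """Percentage saved by choosing `alternative` over `baseline`."""
    return (1 - alternative / baseline) * 100


lambda_cost = lambda_job_cost(memory_gb=10, duration_s=10 * 60)  # 10 min at 10GB
fargate_spot_cost = 0.017  # per-job figure quoted in the diff, taken as given
savings = savings_pct(lambda_cost, fargate_spot_cost)
monthly_fargate = 20 * fargate_spot_cost  # 20 jobs/month, compute only
```

Under these assumptions the Lambda job costs about $0.10 and the savings work out to about 83%, matching the updated lines. The 20-job monthly total of about $0.34 agrees with the figure in the commit-message instructions changed in the same commit.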
