cristofima
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
Lines changed: 25 additions & 8 deletions
diff --git a/.github/git-commit-messages-instructions.md b/.github/git-commit-messages-instructions.md
Lines changed: 2 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 10 additions & 0 deletions
diff --git a/docs/PROJECT_REFERENCE.md b/docs/PROJECT_REFERENCE.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/diagrams/architecture-cicd.png b/docs/diagrams/architecture-cicd.png
8.33 KB
diff --git a/docs/diagrams/architecture-cost.png b/docs/diagrams/architecture-cost.png
19 KB
diff --git a/docs/diagrams/architecture-dataflow.png b/docs/diagrams/architecture-dataflow.png
24.5 KB
diff --git a/docs/diagrams/architecture-main.png b/docs/diagrams/architecture-main.png
21.6 KB
diff --git a/docs/diagrams/architecture-training.png b/docs/diagrams/architecture-training.png
24.5 KB
diff --git a/infrastructure/terraform/ARCHITECTURE_DECISIONS.md b/infrastructure/terraform/ARCHITECTURE_DECISIONS.md
Lines changed: 5 additions & 5 deletions
@@ -51,10 +51,14 @@ Required env vars in training container: `DATASET_ID`, `TARGET_COLUMN`, `JOB_ID`
 
 ### Training Container (Python)
 
-- **Preprocessing** (`preprocessor.py`): Auto-detects ID columns, uses `feature-engine` for constant/duplicate detection
-- **Problem type**: `<20 unique values OR <5% unique ratio` = classification
-- **Model training** (`model_trainer.py`): FLAML with `['lgbm', 'rf', 'extra_tree']` - xgboost excluded (bugs)
-- **Multiclass**: Explicitly set `metric='accuracy'`
+Located in `backend/training/`, runs as Docker container in AWS Batch:
+
+- **Entry point** (`train.py`): Orchestrates 7-step pipeline (download → EDA → preprocess → train → reports → save → update status)
+- **Preprocessing** (`preprocessor.py`): Auto-detects ID columns using regex patterns, uses `feature-engine` for constant/duplicate detection
+- **Problem type detection**: `<20 unique values OR <5% unique ratio` = classification
+- **Model training** (`model_trainer.py`): FLAML with `['lgbm', 'rf', 'extra_tree']` - xgboost excluded due to `best_iteration` bugs
+- **Multiclass**: Explicitly set `metric='accuracy'` (FLAML's auto-detection unreliable)
+- **Reports**: Generates both EDA (`sweetviz`) and training reports with feature importance charts
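
The problem-type rule in the list above (`<20 unique values OR <5% unique ratio` = classification) can be sketched as follows. This is a minimal illustration, not the code in `preprocessor.py`: the function name `detect_problem_type`, its parameters, and the handling of `None` values are assumptions made for the example.

```python
from typing import Hashable, Sequence


def detect_problem_type(
    values: Sequence[Hashable],
    max_unique: int = 20,
    max_unique_ratio: float = 0.05,
) -> str:
    """Hypothetical helper: label a target column as 'classification'
    or 'regression' using the heuristic described in the docs."""
    # Ignore missing values (assumed here to be represented as None).
    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("target column has no non-null values")

    n_unique = len(set(non_null))
    unique_ratio = n_unique / len(non_null)

    # Few distinct values, or few distinct values relative to row count,
    # suggests a categorical target.
    if n_unique < max_unique or unique_ratio < max_unique_ratio:
        return "classification"
    return "regression"
```

For example, a binary target of 100 rows has 2 unique values and is labeled classification, while 1,000 distinct numeric values have a unique ratio of 1.0 and are labeled regression.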
 
 ### Frontend (TypeScript)
 
@@ -100,6 +104,8 @@ python scripts/generate_architecture_diagram.py
 | Job stuck RUNNING | Missing DynamoDB perms | Add `dynamodb:UpdateItem` to Batch task role in `iam.tf` |
 | New train.py param ignored | Not in containerOverrides | Add to `batch_service.py` environment list |
 | Frontend CORS errors | Wrong API URL | Get from `terraform output api_gateway_url` |
+| Low model accuracy | ID columns in training | Check `preprocessor.py` ID detection patterns |
+| DynamoDB Decimal errors | Floats in metrics dict | Convert to `Decimal(str(v))` before saving |
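
The `Decimal(str(v))` fix in the last row can be applied to a whole metrics dict with a small recursive helper. This is a sketch under assumptions: the name `to_dynamodb_safe` and the shape of the metrics dict are hypothetical and not taken from the repository.

```python
from decimal import Decimal
from typing import Any


def to_dynamodb_safe(value: Any) -> Any:
    """Hypothetical helper: recursively replace floats with Decimal so the
    structure can be written through the boto3 DynamoDB resource API."""
    if isinstance(value, float):
        # Going through str() avoids carrying binary float noise into Decimal.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamodb_safe(item) for item in value]
    # ints, strings, bools, None and existing Decimals pass through unchanged.
    return value


# Example metrics dict (illustrative values only).
metrics = {"accuracy": 0.93, "fold_scores": [0.9, 0.95], "n_classes": 3}
safe_metrics = to_dynamodb_safe(metrics)
```

Non-finite floats (`nan`, `inf`) would still need separate handling before saving, since this helper only changes the type.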
 
 ## File Reference by Task
 
@@ -112,16 +118,17 @@ python scripts/generate_architecture_diagram.py
 ## Schema Sync Pattern
 
 Backend Pydantic and Frontend TypeScript schemas must match. When adding fields:
-1. `backend/api/models/schemas.py` - Add to Pydantic model
-2. `frontend/lib/api.ts` - Add to TypeScript interface
-3. Example: `JobResponse` (backend) ↔ `JobDetails` (frontend)
+1. `backend/api/models/schemas.py` - Add to Pydantic model (e.g., `JobResponse`)
+2. `frontend/lib/api.ts` - Add to TypeScript interface (e.g., `JobDetails`)
+3. Key pairs: `JobResponse` ↔ `JobDetails`, `DatasetMetadata` ↔ `DatasetMetadata`, `TrainResponse` ↔ `TrainResponse`
 
 ## Debugging
 
 - Lambda logs: `/aws/lambda/automl-lite-{env}-api`
 - Batch logs: `/aws/batch/automl-lite-{env}-training`
 - Local API: `http://localhost:8000/docs` (Swagger UI)
 - Env var mismatch: Compare `batch_service.py` containerOverrides with `train.py` os.getenv()
+- Training issues: Check `dropped_columns` in preprocessing_info for filtered features
 
 ## Utility Scripts
 
@@ -131,9 +138,19 @@ Backend Pydantic and Frontend TypeScript schemas must match. When adding fields:
 | `scripts/predict.py` | Make predictions with trained models (Docker) |
 | `scripts/generate_architecture_diagram.py` | Generate AWS architecture diagrams |
 
+## CI/CD Workflows (`.github/workflows/`)
+
+| Workflow | Trigger | Purpose |
+|----------|---------|---------|
+| `deploy-lambda-api.yml` | Push to main/dev | Deploy FastAPI to Lambda |
+| `deploy-training-container.yml` | Push to main/dev | Build & push training image to ECR |
+| `deploy-infrastructure.yml` | Manual | Terraform apply |
+| `ci-terraform.yml` | PR | Terraform validate & plan |
+
 ## Key Docs
 
-- `docs/LESSONS_LEARNED.md` - Critical debugging insights
+- `docs/LESSONS_LEARNED.md` - Critical debugging insights (read this first for troubleshooting)
 - `docs/QUICKSTART.md` - Deployment guide
 - `.github/SETUP_CICD.md` - CI/CD with GitHub Actions
 - `infrastructure/terraform/ARCHITECTURE_DECISIONS.md` - Why Lambda + Batch split
+- `.github/git-commit-messages-instructions.md` - Commit message conventions
@@ -435,9 +435,9 @@ Modified files (6):
 Configuration:
 - vCPU: 2, Memory: 4GB
 - Max runtime: 60 minutes
-- Spot pricing: ~$0.05/job vs $0.17/job on-demand
+- Spot pricing: ~$0.017/job (70% savings vs on-demand)
 
-Cost estimate: $3/month for 20 training jobs
+Cost estimate: ~$0.34/month for 20 training jobs (Fargate compute only)
 ```
 
 ---
 
@@ -2,6 +2,16 @@
 
 A lightweight, cost-effective AutoML platform built on AWS serverless architecture. Upload CSV files, automatically detect problem types, and train machine learning models with just a few clicks.
 
+## 🔄 CI/CD Status
+
+| Workflow | Main | Dev |
+|----------|------|-----|
+| CI Terraform | [![CI](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml) | [![CI](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/ci-terraform.yml) |
+| Deploy Infrastructure | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-infrastructure.yml) |
+| Deploy Lambda API | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-lambda-api.yml) |
+| Deploy Training Container | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-training-container.yml) |
+| Deploy Frontend | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml/badge.svg?branch=main)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml) | [![Deploy](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml/badge.svg?branch=dev)](https://github.com/cristofima/AWS-AutoML-Lite/actions/workflows/deploy-frontend.yml) |
+
 ## 🚀 Features
 
 - **Smart Problem Detection**: Automatically classifies tasks as regression or classification based on data characteristics
 
@@ -421,8 +421,10 @@ Total: ~$10-25/month
 
 **Comparison:**
 - SageMaker with real-time endpoint: ~$150-300/month (ml.c5.xlarge 24/7)
-- This solution: $12-15/month
-- **Savings: ~80-95%**
+- This solution: $10-25/month
+- **Savings: ~85-95%** (vs SageMaker with endpoints)
+
+> Note: SageMaker training alone costs ~$0.68-3.20/month for 20 jobs—comparable to this solution. The significant savings come from avoiding always-on inference endpoints.
 
 ---
 
 
@@ -81,9 +81,9 @@ For datasets >10K rows, Lambda timeout is insufficient.
 - Cost: ~$0.04 per vCPU-hour (Spot)
 
 **Training job cost comparison (10 min job):**
-- Lambda (10GB): ~$0.17
-- Fargate Spot (2 vCPU, 4GB): ~$0.013
-- **Savings: 92%**
+- Lambda (10GB): ~$0.10
+- Fargate Spot (2 vCPU, 4GB): ~$0.017
+- **Savings: 83%**
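
The revised figures can be checked with a few lines of arithmetic. The per-job costs are the ones stated in this change; the helper names and the 20-jobs-per-month figure (taken from the commit message example elsewhere in this change) are used here only for illustration.

```python
def savings_percent(baseline_cost: float, new_cost: float) -> float:
    """Percentage saved when moving from baseline_cost to new_cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (1 - new_cost / baseline_cost) * 100


def monthly_cost(jobs_per_month: int, cost_per_job: float) -> float:
    """Compute-only monthly cost for a given number of training jobs."""
    return jobs_per_month * cost_per_job


LAMBDA_COST_PER_JOB = 0.10         # 10 min job, 10GB Lambda (from the doc)
FARGATE_SPOT_COST_PER_JOB = 0.017  # 10 min job, 2 vCPU / 4GB (from the doc)

print(round(savings_percent(LAMBDA_COST_PER_JOB, FARGATE_SPOT_COST_PER_JOB)))  # 83
print(round(monthly_cost(20, FARGATE_SPOT_COST_PER_JOB), 2))  # 0.34
```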
 
 ---
 
@@ -130,7 +130,7 @@ For datasets >10K rows, Lambda timeout is insufficient.
 - ✅ No cold starts
 
 **Cons:**
-- ❌ Always running (higher cost: ~$25/month)
+- ❌ Always running (higher cost: ~$12-15/month)
 - ❌ Not ideal for batch jobs
 - ❌ Overkill for intermittent training
 
@@ -297,7 +297,7 @@ If you want to eliminate Docker for demo purposes, you can:
 **Containers are used ONLY in the training component, where they are technically required due to:**
 - Dependency size constraints (265MB > 250MB Lambda limit)
 - Runtime duration (can exceed 15 min Lambda limit)
-- Cost optimization (Batch Spot 92% cheaper)
+- Cost optimization (Batch Spot ~83% cheaper than Lambda for training)
 
 **This is not "unnecessary Docker complexity" - it's the right tool for the job.**