Commit b1519b1

Merge pull request #27 from cristofima/dev - perf: implement ETag-based caching and job details handling improvements

- ETag-based conditional requests with 304 Not Modified responses for job details
- In-memory caching of S3 presigned URLs with 80% TTL to reduce API calls
- Frontend utility to preserve valid presigned URLs when merging job state updates
2 parents 7b9b428 + 113ed91 commit b1519b1

18 files changed

Lines changed: 491 additions & 77 deletions

CHANGELOG.md

Lines changed: 21 additions & 2 deletions
@@ -25,7 +25,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Regression predictions show value with ± RMSE error margin (e.g., 0.0991 ± 0.002)
   - R² displayed as coefficient (0-1) per ML standards, not percentage
   - Target column name shown in prediction results panel
-  - Cost comparison panel: Lambda ($0 idle) vs SageMaker (~$50-100/month)
+  - Cost comparison panel: Lambda ($0 idle) vs SageMaker (~$36-171/month)
   - ONNX Runtime >=1.16.3 for serverless inference (uses 1.16.3 on Lambda, 1.20.x locally)

 - **Dark Mode Support** - Full dark/light/system theme support across all pages
@@ -99,6 +99,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `best_estimator` field now stored in DynamoDB metrics to track which algorithm was selected
   - Updated `LESSONS_LEARNED.md` with bug timeline and resolution details

+- **S3 Cache Reliability** - Use lazy cleanup in `S3Service` to fix Lambda freezing issues
+  - Lazy cleanup prevents execution environment freezing in Lambda
+  - Ensures reliable cache eviction without background threads
+
+- **Deployment Consistency** - Enforced `ConsistentRead=True` for model deployment status checks
+  - Prevents race conditions during deployment status polling
+  - Ensures strong consistency for critical state changes
+
 ### Dependencies
 - **Dependency Audit & Version Updates** - Production-stable versions with flexible ranges
   - FastAPI upgraded from 0.109.0 to >=0.115.0 (fixes ReDoc CDN issue with `redoc@next`)
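
The lazy cleanup described in the S3 Cache Reliability entry can be pictured roughly as follows. This is a minimal sketch, not the project's `S3Service`: the class name `PresignedUrlCache`, the `generate` callback, the injected `clock`, and the 3600-second default expiry are assumptions made for illustration. Only the 80% TTL factor and the absence of background threads come from the commit description.

```python
import time
from typing import Callable, Dict, Tuple


class PresignedUrlCache:
    """In-memory cache that evicts expired entries lazily, on access."""

    def __init__(
        self,
        generate: Callable[[str, str, int], str],
        expires_in: int = 3600,
        ttl_factor: float = 0.8,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._generate = generate            # hypothetical signer, e.g. a boto3 wrapper
        self._expires_in = expires_in        # lifetime requested for each URL, in seconds
        self._ttl = expires_in * ttl_factor  # reuse a URL for 80% of its lifetime
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def _evict_expired(self) -> None:
        # Lazy cleanup: runs inside the caller's request, so no timer or
        # background thread has to survive a frozen Lambda environment.
        now = self._clock()
        expired = [k for k, (_, deadline) in self._entries.items() if deadline <= now]
        for key in expired:
            del self._entries[key]

    def get(self, bucket: str, key: str) -> str:
        self._evict_expired()
        cached = self._entries.get((bucket, key))
        if cached is not None:
            return cached[0]
        url = self._generate(bucket, key, self._expires_in)
        self._entries[(bucket, key)] = (url, self._clock() + self._ttl)
        return url
```

Because eviction happens at the start of every `get`, an entry is never served past 80% of its lifetime, which leaves the remaining 20% as a safety margin for the client to actually use the URL.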
@@ -122,6 +130,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Regression correctly detected for continuous numerical targets
   - Fixes "The least populated class in y has only 1 member" FLAML error

+- **Job Caching & Deletion** - Resolved stale data issues
+  - `DELETE /jobs/{id}` now returns aggressive `no-store` cache headers to prevent access to deleted jobs
+  - `GET /jobs/{id}` forces revalidation (`max-age=0`) to immediately reflect deployment status changes
+  - Frontend `getJobDetails` skips browser cache to ensure 404s are respected
+  - Fixed issue where clearing notes/tags didn't save due to DynamoDB empty string constraints (now uses `REMOVE` operation)
+  - Implemented `mergeJobPreservingUrls` to prevent presigned URL expiration during polling updates
+
+- **API 304 Compliance** - Fixed `get_job_status` to return empty body for 304 Not Modified responses
+  - Complies with HTTP RFC 7232 standard
+  - Prevents client-side parsing errors
+
 ### Removed
 - **Unused Frontend Dependencies** - Cleaned up packages that were never used in codebase
   - `aws-sdk` (~15 MB) - Frontend uses backend API endpoints, not direct AWS SDK calls
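
The notes/tags fix above switches cleared fields from `SET` to `REMOVE`. The service code is not part of this diff, so the following is only a sketch of how such an update expression might be assembled; the function name `build_metadata_update` and the attribute names `tags` and `notes` are assumptions.

```python
from typing import Any, Dict, List, Optional


def build_metadata_update(tags: Optional[List[str]], notes: Optional[str]) -> Dict[str, Any]:
    """Build update_item keyword arguments: SET non-empty values, REMOVE cleared ones."""
    set_parts: List[str] = []
    remove_parts: List[str] = []
    names: Dict[str, str] = {}
    values: Dict[str, Any] = {}

    for attr, value in (("tags", tags), ("notes", notes)):
        if value is None:
            continue  # field not supplied: leave the stored value untouched
        names[f"#{attr}"] = attr
        if value:  # non-empty list or string
            set_parts.append(f"#{attr} = :{attr}")
            values[f":{attr}"] = value
        else:  # [] or "" means the user cleared the field
            remove_parts.append(f"#{attr}")

    clauses: List[str] = []
    if set_parts:
        clauses.append("SET " + ", ".join(set_parts))
    if remove_parts:
        clauses.append("REMOVE " + ", ".join(remove_parts))

    kwargs: Dict[str, Any] = {}
    if clauses:
        kwargs["UpdateExpression"] = " ".join(clauses)
        kwargs["ExpressionAttributeNames"] = names
    if values:
        # Only included when non-empty; DynamoDB rejects an empty value map.
        kwargs["ExpressionAttributeValues"] = values
    return kwargs
```

The returned dictionary is shaped so it could be unpacked into a boto3 `Table.update_item(Key=..., **kwargs)` call. Distinguishing `None` (not supplied) from an empty value (cleared) is what lets a client erase notes without touching tags.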
@@ -233,7 +252,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Version badges in README

 ### Cost Optimization
-- ~$10-25/month total cost for moderate usage
+- ~$2-15/month total cost for moderate usage ($0 when idle)
 - Fargate Spot pricing (70% discount)
 - No always-on infrastructure
 - Training cost: ~$0.02/job

backend/README.md

Lines changed: 31 additions & 2 deletions
@@ -263,6 +263,14 @@ Or mount `~/.aws` when using Docker (already configured in docker-compose.yml).

 4. **CORS Errors**: The API includes CORS middleware. Check `api/main.py` if issues persist.

+### Caching Strategy
+
+The API implements strict caching controls to ensure UI consistency:
+
+- **GET /jobs/{id}**: `Cache-Control: private, max-age=0, must-revalidate`. Forces browsers to validate ETag on every request, ensuring deployment status changes are seen immediately. Uses DynamoDB Strong Consistency (`ConsistentRead=True`) to generate accurate ETags.
+- **DELETE /jobs/{id}**: Returns `Cache-Control: no-store, no-cache, must-revalidate, max-age=0` to immediately invalidate client caches.
+- **Consistency**: Critical operations (`update_job_metadata`, `deploy_model`) use DynamoDB Strong Consistency (`ConsistentRead=True`) to guarantee read-after-write accuracy.
+
 ## 🧪 Testing

 The backend includes comprehensive unit and integration tests for both API and Training modules. Tests run automatically in CI/CD pipelines before deployment.
@@ -272,14 +280,14 @@ The backend includes comprehensive unit and integration tests for both API and T
 ```
 backend/tests/
 ├── pytest.ini                       # Pytest configuration
-├── api/                             # API tests (104 tests, 69% coverage)
+├── api/                             # API tests (109 tests, 71% coverage)
 │   ├── conftest.py                  # Shared fixtures
 │   ├── test_endpoints.py            # Endpoint tests (39 tests)
 │   ├── test_schemas.py              # Pydantic validation tests (23 tests)
 │   ├── test_dynamo_service.py       # DynamoDB service tests
 │   ├── test_s3_service.py           # S3 service tests
 │   └── test_services_integration.py # moto-based integration tests (21 tests)
-└── training/                        # Training tests (159 tests, 53% coverage)
+└── training/                        # Training tests (159 tests, 63% coverage)
     ├── conftest.py                  # Shared fixtures
     ├── unit/                        # Pure unit tests
     │   ├── test_preprocessor.py
@@ -387,3 +395,24 @@ from training.utils.detection import (
 This follows the DRY principle - logic is defined once and reused across `core/preprocessor.py` and `reports/eda.py`.

 This detection is performed both in the API (for UI display) and in the training container (for model training).
+
+## 💰 Cost Analysis (Inference)
+
+Based on official [AWS SageMaker Pricing](https://aws.amazon.com/sagemaker/ai/pricing/) and [Lambda Pricing](https://aws.amazon.com/lambda/pricing/) for `us-east-1`:
+
+| Component | Serverless (Lambda + ONNX) | SageMaker ml.t3.medium | SageMaker ml.c5.xlarge |
+| :--- | :--- | :--- | :--- |
+| **Idle Cost** | **$0.00 / month** | ~$36.00 / month | ~$171.36 / month |
+| **Hourly Rate** | N/A (Pay-per-req) | $0.05 / hour | $0.238 / hour |
+| **Per Prediction** | ~$0.000004 | Included | Included |
+| **Break-even** | **Best for < 9M reqs** | Better for 9M-40M reqs | Better for > 42M reqs |
+
+### Real-world Scenario (100k predictions/mo)
+- **Serverless**: **$0.40** (Virtually free)
+- **SageMaker (t3.medium)**: $36.00 (Fixed cost)
+- **Savings**: **98.8%** cost reduction for low-to-moderate workloads.
+
+> [!TIP]
+> This project is designed to be **"Side Project Friendly"**. By using Serverless Inference, you avoid the $432-$2,056 yearly cost of keeping a SageMaker endpoint running 24/7.
+
+
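
The figures in the Cost Analysis table can be re-derived with a few lines of arithmetic. The sketch below assumes a 720-hour month (30 days), which is the value that reproduces the table's monthly costs; the function names are illustrative and the rates are copied from the table rather than fetched from AWS.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours, matching the table's monthly figures

LAMBDA_COST_PER_PREDICTION = 0.000004  # USD, "Per Prediction" row
HOURLY_RATE = {"ml.t3.medium": 0.05, "ml.c5.xlarge": 0.238}  # USD per hour, "Hourly Rate" row


def monthly_endpoint_cost(instance: str) -> float:
    """Fixed monthly cost of an always-on SageMaker endpoint."""
    return HOURLY_RATE[instance] * HOURS_PER_MONTH


def serverless_cost(predictions: int) -> float:
    """Pay-per-request cost of the Lambda + ONNX path."""
    return predictions * LAMBDA_COST_PER_PREDICTION


def break_even_requests(instance: str) -> float:
    """Monthly request volume at which serverless cost equals the endpoint cost."""
    return monthly_endpoint_cost(instance) / LAMBDA_COST_PER_PREDICTION


def savings_fraction(predictions: int, instance: str) -> float:
    """Share of the endpoint cost saved by going serverless at this volume."""
    return 1 - serverless_cost(predictions) / monthly_endpoint_cost(instance)
```

Under these assumptions the break-even points come out at about 9 million requests per month for `ml.t3.medium` and about 42.8 million for `ml.c5.xlarge`, and the 100k-prediction scenario saves roughly 98.9% (the README's 98.8% is the same figure truncated). Twelve months of the two endpoint costs give $432 and about $2,056, matching the yearly range in the tip.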

backend/api/routers/models.py

Lines changed: 68 additions & 26 deletions
@@ -1,5 +1,6 @@
-from fastapi import APIRouter, HTTPException, status, Query
+from fastapi import APIRouter, HTTPException, status, Query, Response, Request
 from typing import Dict, Optional, Any
+import hashlib
 from ..models.schemas import (
     JobListResponse, JobResponse, JobStatus, ProblemType, JobUpdateRequest,
     DeployRequest, DeployResponse, PreprocessingInfo, JobSummary
@@ -13,12 +14,14 @@


 @router.get("/{job_id}", response_model=JobResponse)
-async def get_job_status(job_id: str) -> JobResponse:
+async def get_job_status(job_id: str, response: Response, request: Request) -> JobResponse:
     """
-    Get the status and results of a training job
+    Get the status and results of a training job.
+    Implements ETag-based caching with must-revalidate for accurate state after deploy/undeploy.
     """
     try:
-        job = dynamodb_service.get_job(job_id)
+        # Use consistent read to ensure we generate ETag from the absolute latest state
+        job = dynamodb_service.get_job(job_id, consistent_read=True)
         if not job:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
@@ -51,7 +54,7 @@ async def get_job_status(job_id: str) -> JobResponse:
                 target_mapping=job['preprocessing_info'].get('target_mapping')
             )

-        response = JobResponse(
+        job_response = JobResponse(
             job_id=job['job_id'],
             dataset_id=job.get('dataset_id', ''),
             status=JobStatus(job['status']),
@@ -77,7 +80,7 @@ async def get_job_status(job_id: str) -> JobResponse:
             # Extract bucket and key from s3:// path
             model_path = job['model_path'].replace('s3://', '')
             bucket, key = model_path.split('/', 1)
-            response.model_download_url = s3_service.generate_presigned_download_url(
+            job_response.model_download_url = s3_service.generate_presigned_download_url_cached(
                 bucket=bucket,
                 key=key
             )
@@ -86,7 +89,7 @@ async def get_job_status(job_id: str) -> JobResponse:
         if job.get('onnx_model_path'):
             onnx_path = job['onnx_model_path'].replace('s3://', '')
             bucket, key = onnx_path.split('/', 1)
-            response.onnx_model_download_url = s3_service.generate_presigned_download_url(
+            job_response.onnx_model_download_url = s3_service.generate_presigned_download_url_cached(
                 bucket=bucket,
                 key=key
             )
@@ -96,20 +99,45 @@ async def get_job_status(job_id: str) -> JobResponse:
         if eda_path:
             report_path = eda_path.replace('s3://', '')
             bucket, key = report_path.split('/', 1)
-            url = s3_service.generate_presigned_download_url(bucket=bucket, key=key)
-            response.report_download_url = url  # Backward compatibility
-            response.eda_report_download_url = url
+            url = s3_service.generate_presigned_download_url_cached(bucket=bucket, key=key)
+            job_response.report_download_url = url  # Backward compatibility
+            job_response.eda_report_download_url = url

         # Training Report
         if job.get('training_report_path'):
             training_path = job['training_report_path'].replace('s3://', '')
             bucket, key = training_path.split('/', 1)
-            response.training_report_download_url = s3_service.generate_presigned_download_url(
+            job_response.training_report_download_url = s3_service.generate_presigned_download_url_cached(
                 bucket=bucket,
                 key=key
             )

-        return response
+        # ============================================================================
+        # HTTP Cache Strategy with ETag for accurate state after deploy/undeploy
+        # ============================================================================
+
+        # 1. Generate ETag based on mutable fields (updated_at, deployed, deployed_at)
+        # This changes whenever the job state changes (including deploy/undeploy)
+        etag_source = f"{job.get('updated_at', '')}-{job.get('deployed', False)}-{job.get('deployed_at', '')}"
+        etag = f'"{hashlib.md5(etag_source.encode()).hexdigest()}"'
+        response.headers["ETag"] = etag
+
+        # 2. Check If-None-Match header for conditional requests (304 Not Modified)
+        if_none_match = request.headers.get("If-None-Match")
+        if if_none_match == etag:
+            # Resource hasn't changed - return 304 (browser will use cached version)
+            # IMPORTANT: 304 responses MUST NOT have a body
+            return Response(status_code=304, headers={"ETag": etag, "Cache-Control": "private, max-age=0, must-revalidate"})
+
+        # 3. Always force revalidation
+        # We used to have adaptive TTLs, but deployment status changes need to be reflected immediately.
+        # max-age=0 + must-revalidate ensures the browser ALWAYS validates the ETag with the server.
+        # Server (consistent read) -> Calculates ETag -> 304 if same, 200 if changed.
+        # This is the most robust way to handle state changes like 'Deployed' vs 'Undeployed'.
+        response.headers["Cache-Control"] = "private, max-age=0, must-revalidate"
+        response.headers["Vary"] = "Authorization"  # Vary by auth header if auth is added later
+
+        return job_response

     except HTTPException:
         raise
@@ -121,7 +149,7 @@ async def get_job_status(job_id: str) -> JobResponse:


 @router.delete("/{job_id}")
-async def delete_job(job_id: str, delete_data: bool = True) -> Dict[str, Any]:
+async def delete_job(job_id: str, response: Response, delete_data: bool = True) -> Dict[str, Any]:
     """
     Delete a training job and optionally all associated data (model, report, dataset)
     """
@@ -189,6 +217,9 @@ async def delete_job(job_id: str, delete_data: bool = True) -> Dict[str, Any]:
         # Delete job record from DynamoDB
         dynamodb_service.delete_job(job_id)

+        # Ensure client caches are invalidated immediately
+        response.headers["Cache-Control"] = "no-store, no-cache, must-revalidate, max-age=0"
+
         return {
             "message": "Job deleted successfully",
             "job_id": job_id,
@@ -205,30 +236,31 @@ async def delete_job(job_id: str, delete_data: bool = True) -> Dict[str, Any]:


 @router.patch("/{job_id}", response_model=JobResponse)
-async def update_job_metadata(job_id: str, request: JobUpdateRequest) -> JobResponse:
+async def update_job_metadata(job_id: str, update_request: JobUpdateRequest, response: Response, request: Request) -> JobResponse:
     """
     Update job metadata (tags and notes) for experiment tracking.
     Tags can be used to categorize jobs (e.g., "experiment-1", "baseline", "production").
     Notes can store observations or comments about the training run.
     """
     try:
         # Verify job exists
-        job = dynamodb_service.get_job(job_id)
+        # Use consistent read to ensure we have the absolute latest state before validating and updating
+        job = dynamodb_service.get_job(job_id, consistent_read=True)
         if not job:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
                 detail="Job not found"
             )

         # Validate tags if provided
-        if request.tags is not None:
-            if len(request.tags) > 10:
+        if update_request.tags is not None:
+            if len(update_request.tags) > 10:
                 raise HTTPException(
                     status_code=status.HTTP_400_BAD_REQUEST,
                     detail="Maximum 10 tags allowed per job"
                 )
             # Validate individual tag length
-            for tag in request.tags:
+            for tag in update_request.tags:
                 if not tag.strip():
                     raise HTTPException(
                         status_code=status.HTTP_400_BAD_REQUEST,
@@ -241,7 +273,7 @@ async def update_job_metadata(job_id: str, request: JobUpdateRequest) -> JobResp
                     )

         # Validate notes length if provided (defense-in-depth, Pydantic also validates)
-        if request.notes is not None and len(request.notes) > 1000:
+        if update_request.notes is not None and len(update_request.notes) > 1000:
             raise HTTPException(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 detail="Notes must be 1000 characters or less"
@@ -250,12 +282,12 @@ async def update_job_metadata(job_id: str, request: JobUpdateRequest) -> JobResp
         # Update job metadata in DynamoDB
         dynamodb_service.update_job_metadata(
             job_id=job_id,
-            tags=request.tags,
-            notes=request.notes
+            tags=update_request.tags,
+            notes=update_request.notes
         )

-        # Return updated job
-        return await get_job_status(job_id)
+        # Return updated job (pass response and request for HTTP headers + ETag)
+        return await get_job_status(job_id, response, request)

     except HTTPException:
         raise
@@ -274,7 +306,8 @@ async def deploy_model(job_id: str, request: DeployRequest) -> DeployResponse:
     """
     try:
         # Verify job exists
-        job = dynamodb_service.get_job(job_id)
+        # Use consistent read to ensure we have the absolute latest state before deploying
+        job = dynamodb_service.get_job(job_id, consistent_read=True)
         if not job:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
@@ -298,6 +331,14 @@ async def deploy_model(job_id: str, request: DeployRequest) -> DeployResponse:
         # Update deployed status
         dynamodb_service.update_job_deployed(job_id, request.deploy)

+        # IMPORTANT: Invalidate HTTP cache for this job
+        # Force clients to fetch fresh data with updated deployed/deployed_at fields
+        # Note: This does NOT invalidate S3 presigned URL cache (those remain valid)
+        from fastapi import Response
+        response = Response()
+        response.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
+        response.headers["X-Cache-Invalidated"] = "deploy-status-changed"
+
         action = "deployed" if request.deploy else "undeployed"
         return DeployResponse(
             job_id=job_id,
@@ -340,7 +381,8 @@ async def list_jobs(
         # Convert to JobSummary (lightweight) instead of full JobResponse
         jobs = []
         for job in raw_jobs:
-            metrics = job.get('metrics', {})
+            # Safely handle None/null metrics (happens when jobs fail before completion)
+            metrics = job.get('metrics') or {}

             # Extract primary metric (accuracy for classification, r2_score for regression)
             problem_type = job.get('problem_type')
@@ -350,7 +392,7 @@ async def list_jobs(
             elif problem_type == 'regression' and metrics.get('r2_score'):
                 primary_metric = float(metrics['r2_score'])

-            # Extract training time and best estimator
+            # Extract training time and best estimator (safely handle None)
             training_time = float(metrics['training_time']) if metrics.get('training_time') else None
             best_estimator = str(metrics['best_estimator']) if metrics.get('best_estimator') else None
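
The ETag flow added to `get_job_status` can be exercised without FastAPI. The sketch below reuses the same ETag recipe (MD5 over `updated_at`, `deployed`, `deployed_at`) but is otherwise an illustration: `compute_etag`, `if_none_match_hits`, and `conditional_get` are hypothetical names, and the header matching is deliberately broader than the strict string equality used in the commit, since `If-None-Match` may carry a comma-separated list, weak validators (`W/`), or `*`.

```python
import hashlib
from typing import Any, Dict, Mapping, Optional, Tuple

CACHE_CONTROL = "private, max-age=0, must-revalidate"


def compute_etag(job: Mapping[str, Any]) -> str:
    """Quoted MD5 of the mutable job fields, as in the handler above."""
    source = f"{job.get('updated_at', '')}-{job.get('deployed', False)}-{job.get('deployed_at', '')}"
    return f'"{hashlib.md5(source.encode()).hexdigest()}"'


def _opaque(tag: str) -> str:
    # Weak comparison ignores the W/ prefix.
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def if_none_match_hits(header: Optional[str], etag: str) -> bool:
    """Weak comparison of an If-None-Match header against the current ETag."""
    if not header:
        return False
    if header.strip() == "*":
        return True
    # Simplification: splitting on commas is safe for hex-digest ETags like these,
    # but would not be for arbitrary entity tags that contain commas.
    return any(_opaque(candidate) == _opaque(etag) for candidate in header.split(","))


def conditional_get(
    job: Mapping[str, Any], request_headers: Mapping[str, str]
) -> Tuple[int, Dict[str, str], Optional[Mapping[str, Any]]]:
    """Return (status, headers, body); the body is None for 304 Not Modified."""
    etag = compute_etag(job)
    headers = {"ETag": etag, "Cache-Control": CACHE_CONTROL}
    if if_none_match_hits(request_headers.get("If-None-Match"), etag):
        return 304, headers, None
    return 200, headers, job
```

A client that echoes back the `ETag` it last received gets a body-less 304 until one of the three hashed fields changes, for example after a deploy or undeploy, at which point the same request yields a 200 with the fresh job and a new `ETag`.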
