Commit 2e3b20a

Initial ML platform implementation
0 parents  commit 2e3b20a

33 files changed

Lines changed: 1832 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
name: ci

on:
  push:
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[dev]
      - name: Lint
        run: ruff check src tests
      - name: Test
        run: pytest
      - name: Training sanity check
        run: python -m mlplatform.scripts.training_sanity_check
      - name: Build Docker image
        run: docker build -t mlplatform:ci .
      - name: Simulate deployment
        run: python -m mlplatform.scripts.simulate_deploy

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
.pytest_cache/
.testdata/
__pycache__/
*.pyc
.venv/
venv/
dist/
build/
*.egg-info/

Dockerfile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

WORKDIR /app

RUN addgroup --system app && adduser --system --ingroup app app

COPY pyproject.toml README.md /app/
COPY src /app/src

RUN pip install --upgrade pip && pip install .

USER app

EXPOSE 8000

CMD ["uvicorn", "mlplatform.api.app:create_app", "--factory", "--host", "0.0.0.0", "--port", "8000"]

README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# MLPlatform

MLPlatform is a production-style machine learning platform simulation that covers the full model lifecycle:
training orchestration, experiment tracking, model registry, automated promotion, inference serving, and observability.

## Architecture

```text
                 +--------------------+
                 |   CLI / FastAPI    |
                 +---------+----------+
                           |
                           v
                 +--------------------+
                 | Job Queue + Worker |
                 | async training exec|
                 +---------+----------+
                           |
        +------------------+--------------------+
        |                  |                    |
        v                  v                    v
+---------------+ +----------------+ +---------------------+
| Experiment DB | | Model Registry | | Observability Store |
|SQLite/Postgres| | version/stage  | | latency/drift/usage |
+-------+-------+ +--------+-------+ +----------+----------+
        |                  |                    |
        v                  v                    v
+---------------+ +------------------+ +------------------+
| Run metadata  | | Artifact store   | | Monitoring APIs  |
+---------------+ +------------------+ +------------------+
                           |
                           v
                 +--------------------+
                 |  FastAPI Serving   |
                 | multi-version load |
                 +--------------------+
```

## Folder Structure

```text
src/mlplatform/
  api/           FastAPI REST surface
  serving/       online inference service
  training/      config-driven training and queueing
  *.py           database, registry, tracking, promotion, observability

tests/
  end-to-end and component tests

.github/workflows/
  CI pipeline
```

## Core Capabilities

- Async training submission through CLI and REST API.
- YAML-driven experiments with reproducible seeding.
- Persistent experiment tracking in SQLite or Postgres via SQLAlchemy.
- Automatic model versioning and stage management.
- Rule-based promotion checks using accuracy gain and latency thresholds.
- FastAPI inference service with multiple model versions loaded simultaneously.
- Basic observability for latency, request volume, and drift signals.
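
The rule-based promotion check listed above could look roughly like the following. This is a minimal sketch, not the implementation in this commit: the `PromotionPolicy` class, the `check_promotion` function, the threshold values, and the metric names `accuracy` and `latency_ms` are all assumptions made for illustration. Treating a missing metric as a blocking condition is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionPolicy:
    # Hypothetical thresholds; the platform's real values are not shown here.
    min_accuracy_gain: float = 0.01
    max_latency_ms: float = 50.0


def check_promotion(
    candidate: dict[str, float],
    baseline: dict[str, float],
    policy: PromotionPolicy = PromotionPolicy(),
) -> tuple[bool, list[str]]:
    """Return (approved, reasons); any failed rule blocks promotion."""
    # A candidate without the required metrics is rejected before any comparison.
    reasons = [
        f"missing metric: {name}"
        for name in ("accuracy", "latency_ms")
        if name not in candidate
    ]
    if reasons:
        return False, reasons

    # Rule 1: the candidate must beat the baseline accuracy by a minimum margin.
    gain = candidate["accuracy"] - baseline.get("accuracy", 0.0)
    if gain < policy.min_accuracy_gain:
        reasons.append(f"accuracy gain {gain:.4f} is below {policy.min_accuracy_gain}")

    # Rule 2: the candidate must stay under an absolute latency ceiling.
    if candidate["latency_ms"] > policy.max_latency_ms:
        reasons.append(
            f"latency {candidate['latency_ms']} ms exceeds {policy.max_latency_ms} ms"
        )

    return not reasons, reasons
```

Returning the list of reasons alongside the decision keeps a blocked promotion explainable, which fits the explicit-rollback design described elsewhere in this README.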

## Local Usage

```bash
pip install -e .[dev]
mlplatform init-db
uvicorn mlplatform.api.app:create_app --factory --reload
```

Run the training sanity check:

```bash
python -m mlplatform.scripts.training_sanity_check
```

Run tests:

```bash
pytest
```

## Design Decisions

- SQLAlchemy + SQLite by default for portability; the schema is compatible with Postgres.
- Queueing is simulated with a durable job table plus an in-process worker so the system behaves like an internal platform without needing Redis.
- The trainer supports PyTorch when available, but the platform remains operable in minimal environments through a deterministic fallback trainer.
- Model artifacts are versioned on disk with registry metadata in the database, which keeps deployment simple and rollback explicit.
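
The durable job table mentioned above can be sketched with the standard library's `sqlite3` module. The platform itself uses SQLAlchemy; `sqlite3` is swapped in here only to keep the example free of dependencies. The table name, its columns, and the helper functions are hypothetical, and the claim step assumes a single worker because the select and update are not atomic across processes.

```python
import json
import sqlite3
import uuid


def init_jobs(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS jobs (
            id TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'queued',
            error TEXT
        )
        """
    )
    conn.commit()


def enqueue(conn: sqlite3.Connection, payload: dict) -> str:
    """Persist a job so it survives a process restart; return its id."""
    job_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO jobs (id, payload) VALUES (?, ?)", (job_id, json.dumps(payload))
    )
    conn.commit()
    return job_id


def claim_next(conn: sqlite3.Connection):
    """Mark the oldest queued job as running and return (id, payload), or None."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'queued' ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row[0], json.loads(row[1])


def finish(conn: sqlite3.Connection, job_id: str, error=None) -> None:
    """Record the outcome; a non-empty error marks the job as failed."""
    status = "failed" if error else "succeeded"
    conn.execute(
        "UPDATE jobs SET status = ?, error = ? WHERE id = ?", (status, error, job_id)
    )
    conn.commit()
```

Because every state change is committed to the table, a restarted worker can see which jobs were still queued or running when the process stopped.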

## Failure Handling

- Training failures are persisted with stack traces and the run status is marked failed.
- Promotion is blocked when safety checks fail, including missing metrics or latency regressions.
- Serving returns structured errors for unknown model versions or missing artifacts.
- Observability ingestion is non-blocking and never takes down the serving path.
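
One way to keep observability ingestion off the serving path, as the last point describes, is a bounded in-memory buffer that drops events rather than blocking the caller. The sketch below is an assumption about how this could work, not the store shipped in this commit; `MetricsBuffer`, `record`, and `drain` are hypothetical names.

```python
from __future__ import annotations

import queue


class MetricsBuffer:
    """Bounded event buffer whose record() call never blocks or raises."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._events: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # approximate count of events discarded under pressure

    def record(self, event: dict) -> bool:
        """Try to buffer an event; return False if it had to be dropped."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            # Losing a metric is preferred over slowing down a prediction.
            self.dropped += 1
            return False

    def drain(self) -> list[dict]:
        """Remove and return all buffered events, e.g. from a background flusher."""
        events: list[dict] = []
        while True:
            try:
                events.append(self._events.get_nowait())
            except queue.Empty:
                return events
```

The trade-off is deliberate: under load the buffer loses data instead of adding latency, so the `dropped` counter is worth exporting as a metric of its own.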

pyproject.toml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "mlplatform"
version = "0.1.0"
description = "Production-grade ML platform simulation with training, tracking, registry, serving, and observability"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "fastapi>=0.110",
    "uvicorn[standard]>=0.27",
    "pydantic>=2.6",
    "PyYAML>=6.0",
    "SQLAlchemy>=2.0",
    "httpx>=0.26",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "ruff>=0.6",
]
torch = [
    "torch>=2.2",
]

[project.scripts]
mlplatform = "mlplatform.cli:main"

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]

[tool.ruff]
line-length = 100
src = ["src", "tests"]

[tool.pytest.ini_options]
addopts = "-q"
testpaths = ["tests"]
pythonpath = ["src"]

src/mlplatform/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
"""MLPlatform package."""

from mlplatform.config import Settings
from mlplatform.db import init_db

__all__ = ["Settings", "init_db"]

src/mlplatform/api/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""API package."""

src/mlplatform/api/app.py

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
from __future__ import annotations

from fastapi import FastAPI

from mlplatform.db import init_db
from mlplatform.api.routes import router
from mlplatform.queue import AsyncJobQueue
from mlplatform.training.service import TrainingService
from mlplatform.serving.service import ServingService


def create_app() -> FastAPI:
    init_db()
    app = FastAPI(title="MLPlatform", version="0.1.0")
    training_service = TrainingService()
    serving_service = ServingService()
    job_queue = AsyncJobQueue()

    def handle_job(job_id: str, payload: dict[str, object]) -> dict[str, object]:
        return training_service.run_job(job_id, payload)

    job_queue.start(handle_job)
    app.state.training_service = training_service
    app.state.serving_service = serving_service
    app.state.job_queue = job_queue
    app.include_router(router)

    @app.on_event("shutdown")
    def shutdown_queue() -> None:
        job_queue.stop()

    return app
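
`create_app` relies on an `AsyncJobQueue` that exposes `start(handler)` and `stop()`; its implementation is not part of this excerpt. The following is a minimal sketch of an in-process worker with that shape, assuming a single background thread and an in-memory queue. `InProcessJobQueue`, `submit`, and `results` are hypothetical names, and the real class may differ substantially.

```python
from __future__ import annotations

import queue
import threading
from typing import Callable

# Same call shape as handle_job above: (job_id, payload) -> result dict.
Handler = Callable[[str, dict], dict]


class InProcessJobQueue:
    """Hypothetical stand-in for AsyncJobQueue: one worker thread, FIFO order."""

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self._thread: threading.Thread | None = None
        self.results: dict[str, dict] = {}

    def start(self, handler: Handler) -> None:
        """Launch the background worker that feeds jobs to the handler."""
        self._thread = threading.Thread(target=self._run, args=(handler,), daemon=True)
        self._thread.start()

    def submit(self, job_id: str, payload: dict) -> None:
        self._jobs.put((job_id, payload))

    def stop(self) -> None:
        """Let already-queued jobs finish, then wait for the worker to exit."""
        if self._thread is not None:
            self._jobs.put(None)  # sentinel processed after all earlier jobs
            self._thread.join()
            self._thread = None

    def _run(self, handler: Handler) -> None:
        while True:
            item = self._jobs.get()
            if item is None:
                return
            job_id, payload = item
            try:
                self.results[job_id] = handler(job_id, payload)
            except Exception as exc:
                # A failing job is recorded instead of killing the worker thread.
                self.results[job_id] = {"status": "failed", "error": repr(exc)}
```

Passing the handler to `start` rather than the constructor matches how `create_app` wires `handle_job` as a closure over `training_service`, so the queue itself needs no knowledge of training.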
