@@ -42,9 +42,11 @@ This project predicts **French departmental crime rates** by category using offi
4242---
4343
4444## 📁 Project Structure
45+
46+
4547oasis-security/
4648├── .github/
47- │ └── workflows/ # GitHub Act ions CI/CD
49+ │ └── workflows/ # GitHub Actions CI/CD
4850├── data/ # Cleaned datasets (.parquet)
4951├── docs/
5052│ └── crime_predictor/ # Technical documentation
@@ -66,7 +68,7 @@ oasis-security/
6668├── app.py # Main Streamlit dashboard
6769├── script_crimes_et_delits.py # Data collection & cleaning
6870├── Dockerfile # Multi-stage build (train → production)
69- ├── docker-compose.yml # Full stack (MLflow + Postgres + API)
71+ ├── docker-compose.yml # Full stack (MLflow + Postgres + API)
7072├── requirements.txt
7173└── README.md
7274
@@ -84,27 +86,28 @@ pip install -r requirements.txt
8486
8587### 2. Download and Clean Data
8688python script_crimes_et_delits.py
87- # → generates data/crimes_clean.parquet
89+ → generates data/crimes_clean.parquet
8890
89- ### 3. Train the Model
91+ ### 3. Train the Model
9092python models/crime_predictor/src/train.py
91- # → compares 4 models, logs to MLflow, saves best model
92- # → generates models/crime_predictor/models/crime_predictor.pkl
93- # → generates models/crime_predictor/models/metrics.json
93+ → compares 4 models, logs to MLflow, saves best model
94+ → generates models/crime_predictor/models/crime_predictor.pkl
95+ → generates models/crime_predictor/models/metrics.json
9496
95- ### 4. Launch the Dashboard
97+ ### 4. Launch the Dashboard
9698streamlit run app.py
97- # → http://localhost:8501
99+ → http://localhost:8501
98100
99101### 5. Launch the API
100102uvicorn models.crime_predictor.src.predict:app --reload --port 8000
101- # → http://localhost:8000/docs
103+ → http://localhost:8000/docs
102104
103105---
104- 🔄 Data Pipeline
105- data.gouv.fr (SSMSI)
106+
107+ ## 🔄 Data Pipeline
108+ data.gouv.fr (SSMSI)
106109 ↓
107- script_crimes_et_delits.py
110+ script_crimes_et_delits.py
108111 ├── Download CSV (requests)
109112 ├── Normalize column names (snake_case)
110113 ├── Remove duplicates
@@ -114,43 +117,46 @@ script_crimes_et_delits.py
114117 │ ├── annual_rate_change (pct_change by dep × category)
115118 │ └── year_norm (normalized to [0, 1])
116119 └── Save Parquet (Snappy)
117-
120+ ↓
118121 data/crimes_clean.parquet
119- Raw data: 8 columns, ~ 50,000 rows
120- After cleaning: 10 columns, ~ 49,000 rows (<2% loss)
122+
123+ Raw data: 8 columns, ~50,000 rows
124+ After cleaning: 10 columns, ~49,000 rows (<2% loss)
121125
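The feature-engineering steps in the pipeline above can be sketched without third-party dependencies. This is a minimal illustration, not the code in `script_crimes_et_delits.py`: the record layout (`dep`, `category`, `year`, `rate`) and the helper names are assumptions, and only the output names `annual_rate_change` and `year_norm` come from the diagram.

```python
import re
from collections import defaultdict


def to_snake_case(name: str) -> str:
    """Normalize a column name, e.g. 'Taux Pour Mille' -> 'taux_pour_mille'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


def add_features(rows):
    """Deduplicate rows, then add annual_rate_change and year_norm.

    rows: list of dicts with the hypothetical keys dep, category, year, rate.
    """
    # Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = (row["dep"], row["category"], row["year"], row["rate"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    years = [r["year"] for r in unique]
    y_min, y_max = min(years), max(years)
    span = (y_max - y_min) or 1  # avoid division by zero with a single year

    # Group by (department, category) so changes stay within one series.
    groups = defaultdict(list)
    for row in unique:
        groups[(row["dep"], row["category"])].append(row)

    for series in groups.values():
        series.sort(key=lambda r: r["year"])
        previous = None
        for row in series:
            # Fractional change versus the previous year; None for the first year.
            if previous is None or previous["rate"] == 0:
                row["annual_rate_change"] = None
            else:
                row["annual_rate_change"] = (
                    (row["rate"] - previous["rate"]) / previous["rate"]
                )
            row["year_norm"] = (row["year"] - y_min) / span
            previous = row
    return unique
```

Grouping before computing the change matters: without it, the first year of one department would be compared against the last year of another.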
122126---
123- 🤖 Modeling & Results
127+
128+ ## 🤖 Modeling & Results
124129Features
125- | Feature | Description |
126- | ------------- | ----------------------------- |
127- | `annee` | Year (int) |
128- | `dep_encoded` | Department (LabelEncoded) |
129- | `cat_encoded` | Crime category (LabelEncoded) |
130- | `annee_norm` | Normalized year [0, 1] |
130+ Feature Description
131+ annee Year (int)
132+ dep_encoded Department (LabelEncoded)
133+ cat_encoded Crime category (LabelEncoded)
134+ annee_norm Normalized year [0, 1]
131135
132136Target: tauxpour100000hab (crime rate per 100,000 inhabitants)
133137Split: 80% train / 20% test — random seed 42
134138Validation: 5-fold cross-validation on training set
135139
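The split and validation scheme above can be illustrated with a dependency-free sketch. The function names are hypothetical, and `train.py` may well rely on scikit-learn utilities instead; only the 80/20 ratio, the seed 42, and the 5 folds come from this README.

```python
import random


def train_test_split_indices(n_rows, test_ratio=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_ratio))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists; each fold validates exactly once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


train_idx, test_idx = train_test_split_indices(100)
for fit_idx, val_idx in k_fold_indices(train_idx):
    # Validation rows never overlap with the rows used for fitting.
    assert not set(fit_idx) & set(val_idx)
```

Cross-validation runs on the training indices only, so the test set stays untouched until the final comparison.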
136- Model Comparison (Test Set)
137140| Model | R² test | RMSE | MAE | CV R² (±std) |
138141| ----------------- | -------- | -------- | -------- | --------------- |
139142| Ridge | 0.71 | 87.4 | 62.1 | 0.69 ± 0.03 |
140143| Random Forest | 0.89 | 54.2 | 38.7 | 0.87 ± 0.02 |
141144| Gradient Boosting | 0.88 | 56.1 | 40.2 | 0.86 ± 0.02 |
142145| **XGBoost** ✅ | **0.91** | **49.8** | **35.3** | **0.90 ± 0.01** |
143146
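For reference, the three metrics in the table can be computed as follows. This is a plain-Python sketch of the standard definitions, not the project's evaluation code, and the sample values are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up rates per 100,000 inhabitants, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

RMSE and MAE are expressed in the target's unit (incidents per 100,000 inhabitants), while R² is unitless.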
147+
144148 Best model: XGBoost — R²=0.91 on test set
145149 Low train/test gap → no significant overfitting
146150 Low CV variance → confirmed robustness
147151
148152MLflow Tracking
149153mlflow ui --backend-store-uri models/crime_predictor/mlruns
150- # → http://localhost:5000
154+ → http://localhost:5000
151155
152156---
153- 📊 Streamlit Dashboard
157+
158+ ## 📊 Streamlit Dashboard
159+
1541605 interactive pages:
155161| Page | Content |
156162| -------------------- | ----------------------------------------------------- |
@@ -161,7 +167,8 @@ mlflow ui --backend-store-uri models/crime_predictor/mlruns
161167| Ethics & Limitations | Biases and usage limits |
162168
163169---
164- 🌐 FastAPI Endpoints
170+
171+ ## 🌐 FastAPI Endpoints
165172Available Endpoints
166173| Method | Endpoint | Description |
167174| ------ | ---------- | --------------------------------- |
@@ -173,7 +180,6 @@ Example Request
173180curl -X POST http://localhost:8000/predict \
174181 -H "Content-Type: application/json" \
175182 -d '{"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}'
176-
177183{
178184 "predicted_rate": 312.47,
179185 "unit": "incidents per 100,000 inhabitants",
@@ -182,7 +188,8 @@ curl -X POST http://localhost:8000/predict \
182188}
183189
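The same request can be sent from Python with the standard library. This is a minimal sketch: `build_predict_request` and `predict` are hypothetical helpers, and the URL and payload fields are taken from the curl example above.

```python
import json
import urllib.request


def build_predict_request(annee, dep_encoded, cat_encoded, annee_norm,
                          base_url="http://localhost:8000"):
    """Build a POST request for /predict with a JSON body."""
    payload = {
        "annee": annee,
        "dep_encoded": dep_encoded,
        "cat_encoded": cat_encoded,
        "annee_norm": annee_norm,
    }
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request(2025, 5, 0, 1.0)
```

Calling `predict(request)` only works while the API is running locally on port 8000.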
184190---
185- 🧪 Tests
191+
192+ ## 🧪 Tests
186193# Run all tests
187194pytest models/crime_predictor/tests/ -v
188195
@@ -196,61 +203,66 @@ Test coverage:
196203| `TestModel` | Shape, type, positivity, R², determinism (7 assertions) |
197204| `TestSerialization` | Joblib serialization, metrics.json structure (2 assertions) |
198205
199- ---
200- 🐳 Docker & CI/CD
201- Multi-stage Docker
202- # Build (training → production)
206+ ---
207+
208+ ## 🐳 Docker & CI/CD
209+ # Multi-stage Docker
210+ Build (training → production)
203211docker build -t oasis-security:latest .
204212
205213# Run API
206214docker run -p 8000:8000 oasis-security:latest
207215
208- Full Stack (MLflow + Postgres + API)
216+ # Full Stack (MLflow + Postgres + API)
209217docker-compose up -d
210- # MLflow UI → http://localhost:5000
211- # API → http://localhost:8000/docs
218+ MLflow UI → http://localhost:5000
219+ API → http://localhost:8000/docs
220+
221+ # GitHub Actions CI/CD
212222
213- GitHub Actions CI/CD
214223.github/workflows/ci-cd.yml triggers on each push:
215224
216- Linting (flake8)
217- Unit tests (pytest)
218- Docker build
219- Push Docker image to GHCR
225+ 1. Linting (flake8)
226+ 2. Unit tests (pytest)
227+ 3. Docker build
228+ 4. Push Docker image to GHCR
220229
221- ---
222- ⚠️ Ethics & Limitations
230+ ---
231+
232+ ## ⚠️ Ethics & Limitations
223233
224234This model is a statistical exploration tool, not an operational decision system.
225235
226236Data limitations:
227237
228- Covers only recorded crimes (dark figure estimated 50–80%)
229- Recording practices vary by department
230- No infra-departmental data
238+ - Covers only recorded crimes (dark figure estimated 50–80%)
239+ - Recording practices vary by department
240+ - No infra-departmental data
231241
232242Model biases:
233243
234- Reflects reporting biases
235- Correlation ≠ causation
236- Not suitable for external shocks (COVID, economic crises)
244+ - Reflects reporting biases
245+ - Correlation ≠ causation
246+ - Not suitable for external shocks (COVID, economic crises)
237247
238248Prohibited use:
239249
240- Predictive targeting of individuals or geographic areas
241- Judicial or penal decision-making
250+ - Predictive targeting of individuals or geographic areas
251+ - Judicial or penal decision-making
242252
243253Compliance: aggregated anonymized open data — no personal data used.
244254
245255---
256+
246257📜 License
247- MIT — see LICENSE
258+ MIT — see LICENSE
248259
249260---
261+
250262👤 Author
251- Frédéric Tellier — Data Scientist
252- LinkedIN: https://www.linkedin.com/in/fr%C3%A9d%C3%A9ric-tellier-8a9170283/ ; Portfolio: https://github.com/Dreipfelt/
253263
254- Project developed as part of CDSD certification — 2025
264+ Frédéric Tellier — Data Scientist
265+ [LinkedIn](https://www.linkedin.com/in/fr%C3%A9d%C3%A9ric-tellier-8a9170283/)
266+ | [Portfolio](https://github.com/Dreipfelt/)
255267
256- ---
268+ Project developed as part of CDSD certification — 2026