| 1 | +# 🛡️ OASIS Security — Crime & Delinquency Analysis in France |
| 2 | + |
| 3 | +> **CDSD Certification Project — RNCP35288** |
| 4 | +> Data Science Designer & Developer |
| 5 | + |
| 6 | +[Hugging Face Space](https://huggingface.co/spaces/Dreipfelt/oasis-security) |
| 7 | +[Python](https://python.org) |
| 8 | +[Streamlit](https://streamlit.io) |
| 9 | +[MLflow runs](./mlruns) |
| 10 | +[Dockerfile](./models/crime_predictor/Dockerfile) |
| 11 | +[data.gouv.fr](https://www.data.gouv.fr) |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## 📌 Context & Business Problem |
| 16 | + |
| 17 | +Recorded crime and delinquency data in France is publicly available but rarely |
| 18 | +surfaced in an accessible, analytical format. Law enforcement agencies, local |
| 19 | +authorities, and researchers require tools to identify trends, compare regions, |
| 20 | +and anticipate future developments. |
| 21 | + |
| 22 | +**OASIS Security** addresses this gap by delivering a complete, production-grade |
| 23 | +data science pipeline — from raw government CSV to interactive forecasting |
| 24 | +dashboard and REST inference API — covering all 18 administrative regions of |
| 25 | +metropolitan and overseas France. |
| 26 | + |
| 27 | +**Key question:** |
| 28 | +> *Can we accurately model regional crime trends in France from 2016 to 2025, |
| 29 | +> and forecast them to 2030, using recorded Police Nationale and |
| 30 | +> Gendarmerie Nationale statistics?* |
| 31 | + |
| 32 | +**Answer:** Yes. Our best model (Gradient Boosting) achieves **R² = 0.979** |
| 33 | +on the held-out test set (2024–2025), with a cross-validated R² of |
| 34 | +**0.978 ± 0.002**, which suggests the model generalises well across folds. |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## 🏆 Model Performance Summary |
| 39 | + |
| 40 | +| Model | R² Test | RMSE Test | MAE Test | CV R² Mean | CV R² Std | |
| 41 | +|---|---|---|---|---|---| |
| 42 | +| **Gradient Boosting** ✅ | **0.9793** | **48.84** | **29.95** | **0.9777** | **0.0022** | |
| 43 | +| XGBoost | 0.9781 | 50.21 | 30.90 | 0.9766 | 0.0028 | |
| 44 | +| Random Forest | 0.9724 | 56.33 | 39.72 | 0.9684 | 0.0026 | |
| 45 | +| Ridge | 0.0218 | 335.48 | 249.28 | 0.0065 | 0.0458 | |
| 46 | + |
| 47 | +> All experiments tracked with **MLflow** — see `mlruns/` for full run history, |
| 48 | +> parameters, and artefacts. |
| 49 | + |
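For readers who want to recompute the columns of the table above, here is a minimal sketch of the three test metrics using their standard definitions. The function name `regression_metrics` is illustrative and is not taken from the repository.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # R² compares the model's squared error with that of a constant
    # prediction equal to the mean of y_true.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }
```

By this definition, an R² close to 0 (as for Ridge) means the model does little better than predicting the mean of the test targets.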
| 50 | +--- |
| 51 | + |
| 52 | +## 🗂️ Dataset |
| 53 | + |
| 54 | +| Property | Details | |
| 55 | +|---|---| |
| 56 | +| **Source** | [data.gouv.fr](https://www.data.gouv.fr) | |
| 57 | +| **Publisher** | Police Nationale & Gendarmerie Nationale | |
| 58 | +| **Scope** | All 18 French administrative regions (INSEE 2025) | |
| 59 | +| **Period** | 2016–2025 | |
| 60 | +| **Granularity** | Region × Crime category × Year | |
| 61 | +| **Format** | CSV (semicolon-delimited, UTF-8) | |
| 62 | +| **Update frequency** | Annual | |
| 63 | + |
| 64 | +The dataset is loaded dynamically at runtime from its canonical URL on |
| 65 | +`static.data.gouv.fr`, ensuring the application always reflects the latest |
| 66 | +published figures without manual intervention. |
| 67 | + |
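The loading step might look roughly like the sketch below. `load_data()` is named in the architecture diagram, but its signature here is an assumption, and the column names and values in the sample are made up for illustration only.

```python
import io

import pandas as pd


def load_data(source):
    """Read the semicolon-delimited, UTF-8 regional CSV into a DataFrame.

    `source` may be a URL, a local path, or a file-like object.
    """
    return pd.read_csv(source, sep=";", encoding="utf-8")


# Tiny in-memory sample: hypothetical column names, made-up values.
sample = io.StringIO(
    "annee;region;indicateur;taux_100k\n"
    "2016;11;Vols avec violence;120.5\n"
    "2017;11;Vols avec violence;118.0\n"
)
df = load_data(sample)
```

In the application, `source` would be the dataset URL on `static.data.gouv.fr` rather than an in-memory buffer.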
| 68 | +--- |
| 69 | + |
| 70 | +## 🏗️ Architecture |
| 71 | + |
| 72 | +``` |
| 73 | +┌─────────────────────────────────────────────────────────────┐ |
| 74 | +│ DATA PIPELINE │ |
| 75 | +│ │ |
| 76 | +│ data.gouv.fr ──► load_data() ──► detect_columns() │ |
| 77 | +│ │ │ |
| 78 | +│ ┌────────▼────────┐ │ |
| 79 | +│ │ Preprocessing │ │ |
| 80 | +│ │ · Type casting │ │ |
| 81 | +│ │ · Null handling│ │ |
| 82 | +│ │ · Label mapping│ │ |
| 83 | +│ └────────┬────────┘ │ |
| 84 | +└─────────────────────────────────────────┼───────────────────┘ |
| 85 | + │ |
| 86 | +┌─────────────────────────────────────────▼───────────────────┐ |
| 87 | +│ FEATURE ENGINEERING │ |
| 88 | +│ │ |
| 89 | +│ · Cyclic temporal features (year_sin, year_cos) │ |
| 90 | +│ · Trend normalisation (year_trend) │ |
| 91 | +│ · Lag features (lag1, lag2) │ |
| 92 | +│ · Rolling mean (roll_mean_3) │ |
| 93 | +│ · Regional aggregates (region_mean) │ |
| 94 | +│ · Categorical encoding (ind_code, reg_code) │ |
| 95 | +└─────────────────────────────────────────┬───────────────────┘ |
| 96 | + │ |
| 97 | +┌─────────────────────────────────────────▼───────────────────┐ |
| 98 | +│ MODELLING LAYER │ |
| 99 | +│ │ |
| 100 | +│ ┌───────────────────┐ ┌───────────────────────────┐ │ |
| 101 | +│ │ Train set │ │ Test set (held out) │ │ |
| 102 | +│ │ 2016 → 2023 │─────►│ 2024–2025 │ │ |
| 103 | +│ └─────────┬─────────┘ └───────────────────────────┘ │ |
| 104 | +│ │ │ |
| 105 | +│ ┌─────────▼──────────────────────────────────────────┐ │ |
| 106 | +│ │ Gradient Boosting · XGBoost · Random Forest │ │ |
| 107 | +│ │ Ridge · LightGBM · Prophet · Holt-Winters │ │ |
| 108 | +│ └─────────────────────────┬──────────────────────────┘ │ |
| 109 | +│ │ │ |
| 110 | +│ TimeSeriesSplit cross-validation (n=3) │ |
| 111 | +│ MLflow experiment tracking (12 runs) │ |
| 112 | +│ → Champion: Gradient Boosting (R²=0.979) │ |
| 113 | +└─────────────────────────────────────────┬───────────────────┘ |
| 114 | + │ |
| 115 | +┌─────────────────────────────────────────▼───────────────────┐ |
| 116 | +│ SERVING LAYER │ |
| 117 | +│ │ |
| 118 | +│ ┌────────────────────────┐ ┌────────────────────────┐ │ |
| 119 | +│ │ Streamlit Dashboard │ │ FastAPI REST API │ │ |
| 120 | +│ │ (Hugging Face Spaces) │ │ (Docker container) │ │ |
| 121 | +│ │ streamlit/app.py │ │ models/.../predict.py │ │ |
| 122 | +│ └────────────────────────┘ └────────────────────────┘ │ |
| 123 | +└─────────────────────────────────────────────────────────────┘ |
| 124 | +``` |
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## 🤖 Modelling Approach |
| 129 | + |
| 130 | +### Problem framing |
| 131 | +Each (region, crime category) pair forms an independent supervised regression |
| 132 | +problem. The target variable is the annual number of recorded offences per |
| 133 | +100,000 inhabitants (`taux_100k`). |
| 134 | + |
| 135 | +### Feature engineering |
| 136 | +The following features are constructed for each observation: |
| 137 | + |
| 138 | +- **Cyclic temporal encoding** — `year_sin` and `year_cos` capture periodicity |
| 139 | + without imposing linearity on the year variable |
| 140 | +- **Lag features** — `lag1` and `lag2` provide the model with recent history |
| 141 | + per (indicator, region) group |
| 142 | +- **Rolling mean** — `roll_mean_3` smooths short-term volatility |
| 143 | +- **Regional aggregates** — `region_mean` contextualises each series within |
| 144 | + its regional baseline |
| 145 | +- **Categorical encoding** — indicators and regions are ordinally encoded |
| 146 | + |
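The features listed above could be built with pandas along the following lines. This is a sketch under assumptions, not the project's `features.py`: the input column names (`region`, `indicator`, `year`, `taux_100k`), the cyclic period of 10 years, and the choice to compute the rolling mean on past values only are all illustrative.

```python
import numpy as np
import pandas as pd


def add_features(df, period=10):
    """Add the engineered columns to a long-format frame.

    Expects hypothetical columns: region, indicator, year, taux_100k.
    """
    out = df.sort_values(["indicator", "region", "year"]).copy()
    first_year = out["year"].min()
    span = max(out["year"].max() - first_year, 1)

    # Cyclic encoding of the year (the period length is an assumption).
    angle = 2 * np.pi * (out["year"] - first_year) / period
    out["year_sin"] = np.sin(angle)
    out["year_cos"] = np.cos(angle)
    # Linear trend rescaled to [0, 1].
    out["year_trend"] = (out["year"] - first_year) / span

    # History per (indicator, region) series. Shifting keeps the current
    # year's target out of its own features.
    series = out.groupby(["indicator", "region"])["taux_100k"]
    out["lag1"] = series.shift(1)
    out["lag2"] = series.shift(2)
    out["roll_mean_3"] = series.transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )

    # Regional baseline and ordinal codes.
    out["region_mean"] = out.groupby("region")["taux_100k"].transform("mean")
    out["ind_code"] = out["indicator"].astype("category").cat.codes
    out["reg_code"] = out["region"].astype("category").cat.codes
    return out
```

Note that `region_mean` is computed here over every year in the frame. To avoid leaking test-period values into training, it would need to be computed on the training years only.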
| 147 | +### Validation strategy |
| 148 | +A `TimeSeriesSplit` with 3 folds is used throughout, respecting the temporal |
| 149 | +ordering of observations and preventing data leakage from future to past. |
| 150 | + |
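As an illustration of this validation scheme, the sketch below scores a scikit-learn `GradientBoostingRegressor` with a 3-fold `TimeSeriesSplit`. The helper name `time_series_cv_r2` is hypothetical, and the model uses scikit-learn's default hyperparameters rather than the project's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def time_series_cv_r2(X, y, n_splits=3, random_state=0):
    """Cross-validated R² where each fold trains only on earlier rows.

    X and y must already be sorted chronologically.
    """
    model = GradientBoostingRegressor(random_state=random_state)
    cv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    return scores.mean(), scores.std()


# Print the fold boundaries for 8 chronologically ordered rows.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(np.zeros((8, 1))):
    print(train_idx, test_idx)
```

`TimeSeriesSplit` works on row positions, so the rows must be sorted by year first. With several series per year, a fold boundary can fall inside a year. Splitting on the sorted unique years and mapping back to rows is a stricter variant.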
| 151 | +### Experiment tracking |
| 152 | +All model runs are logged with **MLflow**, including: |
| 153 | + |
| 154 | +- Hyperparameters (`model`, `n_estimators`, `learning_rate`, etc.) |
| 155 | +- Metrics (`r2_train`, `r2_test`, `rmse_test`, `mae_test`, `cv_r2_mean`, `cv_r2_std`) |
| 156 | +- Model artefacts (serialised `.pkl` files) |
| 157 | +- Git commit hash for full reproducibility |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +## 🛠️ Technical Stack |
| 162 | + |
| 163 | +| Layer | Technology | Version | |
| 164 | +|---|---|---| |
| 165 | +| Language | Python | 3.11 | |
| 166 | +| Dashboard | Streamlit | 1.45 | |
| 167 | +| Visualisation | Plotly Express & Graph Objects | ≥ 5.18 | |
| 168 | +| Data processing | Pandas, NumPy | ≥ 2.0, ≥ 1.24 | |
| 169 | +| ML — Boosting | LightGBM, XGBoost, GradientBoosting (scikit-learn) | ≥ 4.3, ≥ 1.7, ≥ 1.3 | |
| 170 | +| ML — Forecasting | Prophet, Statsmodels (Holt-Winters) | 1.1, ≥ 0.14 | |
| 171 | +| ML — Utilities | Scikit-learn (TimeSeriesSplit, metrics) | ≥ 1.3 | |
| 172 | +| Experiment tracking | MLflow | ≥ 2.12 | |
| 173 | +| REST API | FastAPI + Uvicorn | ≥ 0.110 | |
| 174 | +| Containerisation | Docker (multi-stage build) | — | |
| 175 | +| Deployment | Hugging Face Spaces (Streamlit SDK) | — | |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +## 🐳 MLOps & Containerisation |
| 180 | + |
| 181 | +The inference pipeline is fully containerised using a **multi-stage Docker |
| 182 | +build**, cleanly separating the training environment from the production image. |
| 183 | + |
| 184 | +``` |
| 185 | +Stage 1 — trainer |
| 186 | + · Installs full ML stack (LightGBM, XGBoost, Prophet, statsmodels…) |
| 187 | + · Receives DATA_URL as a build argument |
| 188 | + · Runs train.py → serialises crime_predictor.pkl |
| 189 | + |
| 190 | +Stage 2 — production |
| 191 | + · Copies only the serialised artefact from Stage 1 |
| 192 | + · Installs minimal serving dependencies (fastapi, uvicorn, pandas, numpy) |
| 193 | + · Exposes port 8000 with HEALTHCHECK |
| 194 | + · Runs as non-root user (security best practice) |
| 195 | +``` |
| 196 | + |
| 197 | +```bash |
| 198 | +# Build |
| 199 | +docker build \ |
| 200 | + --build-arg DATA_URL="https://static.data.gouv.fr/.../donnee-reg.csv" \ |
| 201 | + -t oasis-security:latest \ |
| 202 | + ./models/crime_predictor/ |
| 203 | + |
| 204 | +# Run |
| 205 | +docker run -p 8000:8000 oasis-security:latest |
| 206 | + |
| 207 | +# Health check |
| 208 | +curl http://localhost:8000/health |
| 209 | + |
| 210 | +# Inference |
| 211 | +curl -X POST http://localhost:8000/predict \ |
| 212 | + -H "Content-Type: application/json" \ |
| 213 | + -d '{"region": "11", "crime_category": "Vols avec violence", "horizon": 5}' |
| 214 | +``` |
| 215 | + |
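The same inference call can be made from Python with the standard library. This is a minimal sketch: the payload fields mirror the `curl` example above, while the helper names are hypothetical. The response schema is not documented here, so the body is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_predict_request(region, crime_category, horizon,
                          base_url="http://localhost:8000"):
    """Build the POST /predict request shown in the curl example."""
    payload = {
        "region": region,
        "crime_category": crime_category,
        "horizon": horizon,
    }
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(region, crime_category, horizon,
            base_url="http://localhost:8000", timeout=10):
    """Send the request and return the decoded JSON body."""
    request = build_predict_request(region, crime_category, horizon, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `predict("11", "Vols avec violence", 5)` sends the same request as the `curl` command.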
| 216 | +--- |
| 217 | + |
| 218 | +## 📁 Repository Structure |
| 219 | + |
| 220 | +``` |
| 221 | +oasis-security/ |
| 222 | +│ |
| 223 | +├── README.md # This file |
| 224 | +├── LICENSE |
| 225 | +├── .gitignore |
| 226 | +├── requirements.txt # Top-level dependencies |
| 227 | +├── Dockerfile # Root-level compose target |
| 228 | +├── docker-compose.yml |
| 229 | +│ |
| 230 | +├── data/ |
| 231 | +│ ├── raw/ # Source files (never modified) |
| 232 | +│ ├── processed/ # Cleaned, model-ready CSVs |
| 233 | +│ ├── geo/ # Geospatial files (GeoJSON) |
| 234 | +│ └── docs/ # Dataset documentation |
| 235 | +│ |
| 236 | +├── notebooks/ |
| 237 | +│ ├── 01_exploration_crimes.ipynb # Data exploration & EDA |
| 238 | +│ ├── 02_benchmark_modeles.ipynb # Model comparison & selection |
| 239 | +│ └── 03_analyse_departements.ipynb # Departmental deep-dive |
| 240 | +│ |
| 241 | +├── pipeline/ # Reusable data pipeline modules |
| 242 | +│ ├── preprocess.py |
| 243 | +│ ├── features.py |
| 244 | +│ ├── train.py |
| 245 | +│ └── predict.py |
| 246 | +│ |
| 247 | +├── models/ |
| 248 | +│ └── crime_predictor/ |
| 249 | +│ ├── Dockerfile # Multi-stage build (train → serve) |
| 250 | +│ ├── artifacts/ |
| 251 | +│ │ ├── crime_predictor.pkl # Serialised champion model |
| 252 | +│ │ └── metrics.json # Benchmark results (R²=0.979) |
| 253 | +│ ├── src/ |
| 254 | +│ │ ├── config.yaml # Hyperparameters & data config |
| 255 | +│ │ ├── model.py # CrimeRatePredictor class |
| 256 | +│ │ ├── train.py # Training pipeline |
| 257 | +│ │ └── predict.py # FastAPI inference endpoint |
| 258 | +│ └── tests/ |
| 259 | +│ └── test_model.py |
| 260 | +│ |
| 261 | +├── mlruns/ # MLflow tracking (12 runs logged) |
| 262 | +│ |
| 263 | +├── images/ # Visuals for documentation |
| 264 | +│ |
| 265 | +└── streamlit/ # Hugging Face Space |
| 266 | + ├── app.py |
| 267 | + └── requirements.txt |
| 268 | +``` |
| 269 | + |
| 270 | +--- |
| 271 | + |
| 272 | +## 🚀 Running Locally |
| 273 | + |
| 274 | +### Dashboard |
| 275 | + |
| 276 | +```bash |
| 277 | +git clone https://github.com/Data-Science-Designer-and-Developer/oasis-security.git |
| 278 | +cd oasis-security |
| 279 | + |
| 280 | +pip install -r requirements.txt |
| 281 | +streamlit run streamlit/app.py |
| 282 | +``` |
| 283 | + |
| 284 | +### Inference API |
| 285 | + |
| 286 | +```bash |
| 287 | +cd models/crime_predictor |
| 288 | + |
| 289 | +docker build \ |
| 290 | + --build-arg DATA_URL="https://static.data.gouv.fr/.../donnee-reg.csv" \ |
| 291 | + -t oasis-security:latest . |
| 292 | + |
| 293 | +docker run -p 8000:8000 oasis-security:latest |
| 294 | +``` |
| 295 | + |
| 296 | +### MLflow UI |
| 297 | + |
| 298 | +```bash |
| 299 | +mlflow ui --backend-store-uri ./mlruns |
| 300 | +# Open http://localhost:5000 |
| 301 | +``` |
| 302 | + |
| 303 | +--- |
| 304 | + |
| 305 | +## ⚖️ Ethics & Data Privacy |
| 306 | + |
| 307 | +The data used throughout this project is: |
| 308 | + |
| 309 | +- **Publicly available** — published by French government authorities under |
| 310 | + Licence Ouverte v2.0 |
| 311 | +- **Aggregated** — figures are presented at regional level only; no |
| 312 | + individual-level records are processed or stored |
| 313 | +- **Non-identifiable** — no re-identification of persons is possible from |
| 314 | + the published aggregates |
| 315 | + |
| 316 | +This project is intended solely for informational, educational, and analytical |
| 317 | +purposes. Forecasts are indicative and subject to the inherent limitations of |
| 318 | +statistical modelling on short time series. The analysis carries no |
| 319 | +discriminatory intent with respect to geographical areas or populations. |
| 320 | + |
| 321 | +Data processing complies with the principles of the **GDPR** (Regulation (EU) |
| 322 | +2016/679), in particular data minimisation, purpose limitation, and storage |
| 323 | +limitation. |
| 324 | + |
| 325 | +> ⚠️ Recorded crime figures reflect offences *registered* by police and |
| 326 | +> gendarmerie services — not actual crime rates. Under-reporting, changes in |
| 327 | +> classification practices, and variations in policing intensity may all |
| 328 | +> influence the figures independently of true crime levels. |
| 329 | + |
| 330 | +--- |
| 331 | + |
| 332 | +## 📜 Licence |
| 333 | + |
| 334 | +Data: [Licence Ouverte v2.0](https://www.etalab.gouv.fr/licence-ouverte-open-licence) |
| 335 | +— © Police Nationale & Gendarmerie Nationale / data.gouv.fr |
| 336 | + |
| 337 | +Code: MIT |
| 338 | + |
| 339 | +--- |
| 340 | + |
| 341 | +*CDSD Certification Project — Data Science Designer & Developer (RNCP35288)* |