Commit 2d64855

docs: add comprehensive GitHub README for CDSD presentation
1 parent fecbe3e commit 2d64855

1 file changed

Lines changed: 341 additions & 0 deletions

README_github.md

@@ -0,0 +1,341 @@
# 🛡️ OASIS Security: Crime & Delinquency Analysis in France

> **CDSD Certification Project (RNCP35288)**
> Data Science Designer & Developer

[![Hugging Face](https://img.shields.io/badge/🤗%20HF%20Space-Live%20Demo-blue)](https://huggingface.co/spaces/Dreipfelt/oasis-security)
[![Python](https://img.shields.io/badge/Python-3.11-3776AB?logo=python&logoColor=white)](https://python.org)
[![Streamlit](https://img.shields.io/badge/Streamlit-1.45-FF4B4B?logo=streamlit&logoColor=white)](https://streamlit.io)
[![MLflow](https://img.shields.io/badge/MLflow-tracked-0194E2?logo=mlflow&logoColor=white)](./mlruns)
[![Docker](https://img.shields.io/badge/Docker-multi--stage-2496ED?logo=docker&logoColor=white)](./models/crime_predictor/Dockerfile)
[![License](https://img.shields.io/badge/Data-data.gouv.fr-green)](https://www.data.gouv.fr)

---

## 📌 Context & Business Problem

Recorded crime and delinquency data in France is publicly available but rarely
surfaced in an accessible, analytical format. Law enforcement agencies, local
authorities, and researchers require tools to identify trends, compare regions,
and anticipate future developments.

**OASIS Security** addresses this gap with a complete, end-to-end data science
pipeline, from the raw government CSV to an interactive forecasting dashboard
and a REST inference API, covering all 18 administrative regions of
metropolitan and overseas France.

**Key question:**
> *Can we accurately model regional crime trends in France over 2016–2025 and
> forecast them to 2030 using recorded Police Nationale and Gendarmerie
> Nationale statistics?*

**Answer:** Yes. The best model (Gradient Boosting) achieves **R² = 0.979**
on the held-out 2024–2025 test set, with a cross-validated R² of
**0.978 ± 0.002** (mean ± standard deviation over 3 time-ordered folds),
which indicates stable generalisation across folds.

---

## 🏆 Model Performance Summary

| Model | R² Test | RMSE Test | MAE Test | CV R² Mean | CV R² Std |
|---|---|---|---|---|---|
| **Gradient Boosting** | **0.9793** | **48.84** | **29.95** | **0.9777** | **0.0022** |
| XGBoost | 0.9781 | 50.21 | 30.90 | 0.9766 | 0.0028 |
| Random Forest | 0.9724 | 56.33 | 39.72 | 0.9684 | 0.0026 |
| Ridge | 0.0218 | 335.48 | 249.28 | 0.0065 | 0.0458 |

> All experiments are tracked with **MLflow**; see `mlruns/` for the full run
> history, parameters, and artefacts.

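For reference, the three test metrics in the table can be computed as in the minimal sketch below. It uses plain NumPy and invented numbers; the function name `regression_metrics` and the sample values are illustrative assumptions, not the project's code or results.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float(np.sum(residuals ** 2))                   # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Toy rates per 100,000 inhabitants (invented for illustration).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```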

---

## 🗂️ Dataset

| Property | Details |
|---|---|
| **Source** | [data.gouv.fr](https://www.data.gouv.fr) |
| **Publisher** | Police Nationale & Gendarmerie Nationale |
| **Scope** | All 18 French administrative regions (INSEE 2025) |
| **Period** | 2016–2025 |
| **Granularity** | Region × Crime category × Year |
| **Format** | CSV (semicolon-delimited, UTF-8) |
| **Update frequency** | Annual |

The dataset is loaded at runtime from its canonical URL on
`static.data.gouv.fr`, so the application picks up newly published figures
without manual intervention.

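A minimal sketch of this loading step, assuming pandas is used for reading (it is listed in the technical stack). The helper name `load_regional_csv`, the column names other than `taux_100k`, and the inline sample are hypothetical; in the application the source would be the CSV URL on `static.data.gouv.fr`, whose full path is not given here.

```python
import io

import pandas as pd


def load_regional_csv(source):
    """Read a semicolon-delimited, UTF-8 CSV from a URL, path, or file-like object."""
    return pd.read_csv(source, sep=";", encoding="utf-8")


# Inline sample standing in for the remote file (column names are invented).
sample = io.StringIO(
    "annee;region;indicateur;taux_100k\n"
    "2016;11;Vols avec violence;250.5\n"
    "2017;11;Vols avec violence;240.1\n"
)
df = load_regional_csv(sample)
print(df.dtypes)
```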

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                        │
│                                                             │
│  data.gouv.fr ──► load_data() ──► detect_columns()          │
│                                         │                   │
│                                ┌────────▼────────┐          │
│                                │  Preprocessing  │          │
│                                │ · Type casting  │          │
│                                │ · Null handling │          │
│                                │ · Label mapping │          │
│                                └────────┬────────┘          │
└─────────────────────────────────────────┼───────────────────┘

┌─────────────────────────────────────────▼───────────────────┐
│                     FEATURE ENGINEERING                     │
│                                                             │
│  · Cyclic temporal features  (year_sin, year_cos)           │
│  · Trend normalisation       (year_trend)                   │
│  · Lag features              (lag1, lag2)                   │
│  · Rolling mean              (roll_mean_3)                  │
│  · Regional aggregates       (region_mean)                  │
│  · Categorical encoding      (ind_code, reg_code)           │
└─────────────────────────────────────────┬───────────────────┘

┌─────────────────────────────────────────▼───────────────────┐
│                       MODELLING LAYER                       │
│                                                             │
│  ┌───────────────────┐      ┌───────────────────────────┐   │
│  │     Train set     │      │    Test set (held out)    │   │
│  │    2016 → 2023    │─────►│         2024–2025         │   │
│  └─────────┬─────────┘      └───────────────────────────┘   │
│            │                                                │
│  ┌─────────▼──────────────────────────────────────────┐     │
│  │  Gradient Boosting · XGBoost · Random Forest       │     │
│  │  Ridge · LightGBM · Prophet · Holt-Winters         │     │
│  └─────────────────────────┬──────────────────────────┘     │
│                            │                                │
│  TimeSeriesSplit cross-validation (n=3)                     │
│  MLflow experiment tracking (12 runs)                       │
│  → Champion: Gradient Boosting (R²=0.979)                   │
└─────────────────────────────────────────┬───────────────────┘

┌─────────────────────────────────────────▼───────────────────┐
│                        SERVING LAYER                        │
│                                                             │
│  ┌────────────────────────┐     ┌────────────────────────┐  │
│  │  Streamlit Dashboard   │     │  FastAPI REST API      │  │
│  │  (Hugging Face Spaces) │     │  (Docker container)    │  │
│  │  streamlit/app.py      │     │  models/.../predict.py │  │
│  └────────────────────────┘     └────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

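The train/test boundary in the diagram amounts to a split by year rather than a random split. A minimal sketch, assuming a pandas DataFrame with a year column; the column name `annee`, the helper `split_by_year`, and the toy values are hypothetical.

```python
import pandas as pd


def split_by_year(df, year_col="annee", last_train_year=2023):
    """Split rows into train (<= last_train_year) and test (> last_train_year)."""
    train = df[df[year_col] <= last_train_year]
    test = df[df[year_col] > last_train_year]
    return train, test


# One toy series covering 2016-2025 (values invented for illustration).
toy = pd.DataFrame({"annee": range(2016, 2026), "taux_100k": range(10)})
train, test = split_by_year(toy)
print(train["annee"].max(), test["annee"].min())
```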

---

## 🤖 Modelling Approach

### Problem framing
Each (region, crime category) pair forms an independent supervised regression
problem. The target variable is the annual number of recorded offences per
100,000 inhabitants (`taux_100k`).

### Feature engineering
The following features are constructed for each observation:

- **Cyclic temporal encoding**: `year_sin` and `year_cos` capture periodicity
  without imposing linearity on the year variable
- **Lag features**: `lag1` and `lag2` provide the model with recent history
  per (indicator, region) group
- **Rolling mean**: `roll_mean_3` smooths short-term volatility
- **Regional aggregates**: `region_mean` contextualises each series within
  its regional baseline
- **Categorical encoding**: indicators and regions are ordinally encoded
  (`ind_code`, `reg_code`)

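The feature list above could be built along the following lines. This is a sketch under stated assumptions, not the project's `features.py`: the input column names (`annee`, `region`, `indicateur`), the 10-year period used for the cyclic encoding, the min-max form of `year_trend`, and the choice to compute `roll_mean_3` and `region_mean` from past values only are all illustrative.

```python
import numpy as np
import pandas as pd


def add_features(df, period=10):
    """Add temporal, lag, rolling and categorical features to a long-format frame."""
    out = df.sort_values(["indicateur", "region", "annee"]).reset_index(drop=True)

    # Cyclic encoding of the year (the period is an assumption).
    angle = 2 * np.pi * (out["annee"] - out["annee"].min()) / period
    out["year_sin"] = np.sin(angle)
    out["year_cos"] = np.cos(angle)

    # Year rescaled to [0, 1].
    span = max(out["annee"].max() - out["annee"].min(), 1)
    out["year_trend"] = (out["annee"] - out["annee"].min()) / span

    # Lags and rolling mean per (indicator, region) series, using past values only.
    series = out.groupby(["indicateur", "region"])["taux_100k"]
    out["lag1"] = series.shift(1)
    out["lag2"] = series.shift(2)
    out["roll_mean_3"] = series.transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )

    # Regional baseline: mean of lag1 across indicators for the same region and year.
    out["region_mean"] = out.groupby(["region", "annee"])["lag1"].transform("mean")

    # Ordinal codes for the two categorical columns.
    out["ind_code"] = out["indicateur"].astype("category").cat.codes
    out["reg_code"] = out["region"].astype("category").cat.codes
    return out


# Two toy series (values invented for illustration).
toy = pd.DataFrame({
    "annee": [2016, 2017, 2018, 2019] * 2,
    "region": ["11"] * 4 + ["84"] * 4,
    "indicateur": ["Vols avec violence"] * 8,
    "taux_100k": [100.0, 110.0, 120.0, 130.0, 50.0, 55.0, 60.0, 65.0],
})
features = add_features(toy)
print(features[["region", "annee", "lag1", "lag2", "roll_mean_3"]])
```

Rows whose lags are undefined (the first years of each series) contain `NaN` and would need to be dropped or imputed before fitting.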
### Validation strategy
A `TimeSeriesSplit` with 3 folds is used throughout, respecting the temporal
ordering of observations and preventing data leakage from future to past.

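A minimal illustration of how scikit-learn's `TimeSeriesSplit` builds three folds. The ten-element array is a toy stand-in for observations already sorted by year; with several series per year, the rows would need to be sorted chronologically first, because the splitter works on row positions only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations, one per year, already in chronological order.
years = np.arange(2016, 2026)

splitter = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, val_idx in splitter.split(years):
    folds.append((years[train_idx], years[val_idx]))
    print("train:", years[train_idx], "validate:", years[val_idx])
```

Each training window ends strictly before its validation window begins, and the training window grows from fold to fold.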
### Experiment tracking
All model runs are logged with **MLflow**, including:

- Hyperparameters (`model`, `n_estimators`, `learning_rate`, etc.)
- Metrics (`r2_train`, `r2_test`, `rmse_test`, `mae_test`, `cv_r2_mean`, `cv_r2_std`)
- Model artefacts (serialised `.pkl` files)
- Git commit hash for full reproducibility


---

## 🛠️ Technical Stack

| Layer | Technology | Version |
|---|---|---|
| Language | Python | 3.11 |
| Dashboard | Streamlit | 1.45 |
| Visualisation | Plotly Express & Graph Objects | ≥ 5.18 |
| Data processing | Pandas, NumPy | ≥ 2.0, ≥ 1.24 |
| ML: Boosting | LightGBM, XGBoost, GradientBoosting | ≥ 4.3, ≥ 1.7 |
| ML: Forecasting | Prophet, Statsmodels (Holt-Winters) | 1.1, ≥ 0.14 |
| ML: Utilities | Scikit-learn (TimeSeriesSplit, metrics) | ≥ 1.3 |
| Experiment tracking | MLflow | ≥ 2.12 |
| REST API | FastAPI + Uvicorn | ≥ 0.110 |
| Containerisation | Docker (multi-stage build) | n/a |
| Deployment | Hugging Face Spaces (Streamlit SDK) | n/a |

---

## 🐳 MLOps & Containerisation

The inference pipeline is fully containerised using a **multi-stage Docker
build**, cleanly separating the training environment from the production image.

```
Stage 1: trainer
  · Installs full ML stack (LightGBM, XGBoost, Prophet, statsmodels…)
  · Receives DATA_URL as a build argument
  · Runs train.py → serialises crime_predictor.pkl

Stage 2: production
  · Copies only the serialised artefact from Stage 1
  · Installs minimal serving dependencies (fastapi, uvicorn, pandas, numpy)
  · Exposes port 8000 with HEALTHCHECK
  · Runs as non-root user (security best practice)
```

```bash
# Build
docker build \
  --build-arg DATA_URL="https://static.data.gouv.fr/.../donnee-reg.csv" \
  -t oasis-security:latest \
  ./models/crime_predictor/

# Run
docker run -p 8000:8000 oasis-security:latest

# Health check
curl http://localhost:8000/health

# Inference
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"region": "11", "crime_category": "Vols avec violence", "horizon": 5}'
```

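The same inference call can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` example; the helper names `build_predict_request` and `predict` are hypothetical, and because the response schema is not documented here it is returned as parsed JSON without further interpretation. Sending the request requires the container to be running, so the final call is left commented out.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"


def build_predict_request(region, crime_category, horizon, url=API_URL):
    """Build the POST request without sending it."""
    payload = {"region": region, "crime_category": crime_category, "horizon": horizon}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(region, crime_category, horizon, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(region, crime_category, horizon)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request("11", "Vols avec violence", 5)
print(request.get_method(), request.full_url)
# predict("11", "Vols avec violence", 5)  # uncomment once the API is up
```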

---

## 📁 Repository Structure

```
oasis-security/

├── README.md                          # This file
├── LICENSE
├── .gitignore
├── requirements.txt                   # Top-level dependencies
├── Dockerfile                         # Root-level compose target
├── docker-compose.yml

├── data/
│   ├── raw/                           # Source files (never modified)
│   ├── processed/                     # Cleaned, model-ready CSVs
│   ├── geo/                           # Geospatial files (GeoJSON)
│   └── docs/                          # Dataset documentation

├── notebooks/
│   ├── 01_exploration_crimes.ipynb    # Data exploration & EDA
│   ├── 02_benchmark_modeles.ipynb     # Model comparison & selection
│   └── 03_analyse_departements.ipynb  # Departmental deep-dive

├── pipeline/                          # Reusable data pipeline modules
│   ├── preprocess.py
│   ├── features.py
│   ├── train.py
│   └── predict.py

├── models/
│   └── crime_predictor/
│       ├── Dockerfile                 # Multi-stage build (train → serve)
│       ├── artifacts/
│       │   ├── crime_predictor.pkl    # Serialised champion model
│       │   └── metrics.json           # Benchmark results (R²=0.979)
│       ├── src/
│       │   ├── config.yaml            # Hyperparameters & data config
│       │   ├── model.py               # CrimeRatePredictor class
│       │   ├── train.py               # Training pipeline
│       │   └── predict.py             # FastAPI inference endpoint
│       └── tests/
│           └── test_model.py

├── mlruns/                            # MLflow tracking (12 runs logged)

├── images/                            # Visuals for documentation

└── streamlit/                         # Hugging Face Space
    ├── app.py
    └── requirements.txt
```

---

## 🚀 Running Locally

### Dashboard

```bash
git clone https://github.com/Data-Science-Designer-and-Developer/oasis-security.git
cd oasis-security

pip install -r requirements.txt
streamlit run streamlit/app.py
```

### Inference API

```bash
cd models/crime_predictor

docker build \
  --build-arg DATA_URL="https://static.data.gouv.fr/.../donnee-reg.csv" \
  -t oasis-security:latest .

docker run -p 8000:8000 oasis-security:latest
```

### MLflow UI

```bash
mlflow ui --backend-store-uri ./mlruns
# Open http://localhost:5000
```

---

## ⚖️ Ethics & Data Privacy

The data used throughout this project is:

- **Publicly available**: published by French government authorities under
  Licence Ouverte v2.0
- **Aggregated**: figures are presented at regional level only; no
  individual-level records are processed or stored
- **Non-identifiable**: no re-identification of persons is possible from
  the published aggregates

This project is intended solely for informational, educational, and analytical
purposes. Forecasts are indicative and subject to the inherent limitations of
statistical modelling on short time series. The analysis carries no
discriminatory intent with respect to geographical areas or populations.

Data processing complies with the principles of the **GDPR** (Regulation (EU)
2016/679), in particular data minimisation, purpose limitation, and storage
limitation.

> ⚠️ Recorded crime figures reflect offences *registered* by police and
> gendarmerie services, not actual crime levels. Under-reporting, changes in
> classification practices, and variations in policing intensity may all
> influence the figures independently of true crime levels.

---

## 📜 Licence

Data: [Licence Ouverte v2.0](https://www.etalab.gouv.fr/licence-ouverte-open-licence),
© Police Nationale & Gendarmerie Nationale / data.gouv.fr

Code: MIT

---

*CDSD Certification Project: Data Science Designer & Developer (RNCP35288)*
