@@ -44,83 +44,223 @@ This project predicts **French departmental crime rates** by category using offi
4444## 📁 Project Structure
4545
4646
47- oasis-security/
48- ├── .github/
49- │ └── workflows/ # GitHub Actions CI/CD
50- ├── data/ # Cleaned datasets (.parquet)
51- ├── docs/
52- │ └── crime_predictor/ # Technical documentation
53- ├── images/ # Visualizations and plots
54- ├── models/
55- │ └── crime_predictor/
56- │ ├── src/
57- │ │ ├── train.py # Training pipeline (model comparison)
58- │ │ └── predict.py # FastAPI endpoint definitions
59- │ ├── models/
60- │ │ ├── crime_predictor.pkl # Serialized model
61- │ │ └── metrics.json # Train/test metrics
62- │ ├── mlruns/ # MLflow experiment tracking
63- │ └── tests/
64- │ └── test_model.py # Unit tests
65- ├── notebooks/ # Exploration & EDA notebooks
66- ├── pipeline/ # Automation scripts
67- ├── streamlit/ # Streamlit supplementary assets
68- ├── app.py # Main Streamlit dashboard
69- ├── script_crimes_et_delits.py # Data collection & cleaning
70- ├── Dockerfile # Multi-stage build (train → production)
71- ├── docker-compose.yml # Full stack (MLflow + Postgres + API)
72- ├── requirements.txt
73- └── README.md
47+ oasis-security/
48+ ├── .github/
49+ │ └── workflows/ # GitHub Actions CI/CD
50+ ├── data/ # Cleaned datasets (.parquet)
51+ ├── docs/
52+ │ └── crime_predictor/ # Technical documentation
53+ ├── images/ # Visualizations and plots
54+ ├── models/
55+ │ └── crime_predictor/
56+ │ ├── src/
57+ │ │ ├── train.py # Training pipeline (model comparison)
58+ │ │ └── predict.py # FastAPI endpoint definitions
59+ │ ├── models/
60+ │ │ ├── crime_predictor.pkl # Serialized model
61+ │ │ └── metrics.json # Train/test metrics
62+ │ ├── mlruns/ # MLflow experiment tracking
63+ │ └── tests/
64+ │ └── test_model.py # Unit tests
65+ ├── notebooks/ # Exploration & EDA notebooks
66+ ├── pipeline/ # Automation scripts
67+ ├── streamlit/ # Streamlit supplementary assets
68+ ├── app.py # Main Streamlit dashboard
69+ ├── script_crimes_et_delits.py # Data collection & cleaning
70+ ├── Dockerfile # Multi-stage build (train → production)
71+ ├── docker-compose.yml # Full stack (MLflow + Postgres + API)
72+ ├── requirements.txt
73+ └── README.md
7474
7575
7676---
7777
7878## ⚙️ Installation & Running
7979
8080### 1. Clone & Install Dependencies
81-
82- ``` bash
8381git clone https://github.com/Data-Science-Designer-and-Developer/oasis-security.git
8482cd oasis-security
8583python3.11 -m venv .venv
8684source .venv/bin/activate # Windows: .venv\Scripts\activate
8785pip install -r requirements.txt
88- 2. Download and Clean Data
86+
87+ ### 2. Download and Clean Data
8988python script_crimes_et_delits.py
9089# → generates data/crimes_clean.parquet
91- 3. Train the Model
90+
91+ ### 3. Train the Model
9292python models/crime_predictor/src/train.py
9393# → compares 4 models, logs to MLflow, saves best model
9494# → generates models/crime_predictor/models/crime_predictor.pkl
9595# → generates models/crime_predictor/models/metrics.json
96- 4. Launch the Dashboard
96+
97+ ### 4. Launch the Dashboard
9798streamlit run app.py
9899# → http://localhost:8501
99- 5. Launch the API
100+
101+ ### 5. Launch the API
100102uvicorn models.crime_predictor.src.predict:app --reload --port 8000
101103# → http://localhost:8000/docs
104+
105+ ---
106+
102107🔄 Data Pipeline
103108data.gouv.fr (SSMSI)
104109 ↓
105110script_crimes_et_delits.py
106- ├── Download CSV (requests)
107- ├── Normalize column names (snake_case)
108- ├── Remove duplicates
109- ├── Convert numeric types
110- ├── Remove outlier rates (< 0)
111- ├── Feature engineering
112- │ ├── annual_rate_change (pct_change by dep × category)
113- │ └── year_norm (normalized [0, 1])
114- └── Save Parquet (Snappy)
115- ↓
116- data/crimes_clean.parquet
117-
118- Raw data: 8 columns, ~ 50,000 rows
119- After cleaning: 10 columns, ~ 49,000 rows (< 2% loss)
111+ ├── Download CSV (requests)
112+ ├── Normalize column names (snake_case)
113+ ├── Remove duplicates
114+ ├── Convert numeric types
115+ ├── Remove outlier rates (<0)
116+ ├── Feature engineering
117+ │ ├── annual_rate_change (pct_change by dep × category)
118+ │ └── year_norm (normalized [0, 1])
119+ └── Save Parquet (Snappy)
120+ ↓
121+ data/crimes_clean.parquet
122+
123+ Raw data: 8 columns, ~ 50,000 rows
124+ After cleaning: 10 columns, ~ 49,000 rows (<2% loss)
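
The cleaning and feature-engineering steps listed above can be sketched in plain Python. This is only an illustration: the real `script_crimes_et_delits.py` is not shown in this diff and presumably operates on a DataFrame. The function name `clean_and_engineer`, the row keys (`dep`, `category`, `annee`, `rate`), and the use of min-max scaling for `year_norm` are assumptions made for the example.

```python
from collections import defaultdict


def clean_and_engineer(rows):
    """Deduplicate, drop negative rates, and add two derived features.

    `rows` is a list of dicts with hypothetical keys:
    dep, category, annee (int), rate (float).
    """
    # Remove exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for r in rows:
        key = (r["dep"], r["category"], r["annee"], r["rate"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))

    # Drop rows whose rate is negative (treated as invalid).
    kept = [r for r in unique if r["rate"] >= 0]
    if not kept:
        return []

    # year_norm: min-max scaling of the year to [0, 1] (assumed method).
    y_min = min(r["annee"] for r in kept)
    y_max = max(r["annee"] for r in kept)
    span = (y_max - y_min) or 1  # avoid division by zero for a single year
    for r in kept:
        r["year_norm"] = (r["annee"] - y_min) / span

    # annual_rate_change: relative change from the previous year,
    # computed separately for each (department, category) pair.
    groups = defaultdict(list)
    for r in kept:
        groups[(r["dep"], r["category"])].append(r)
    for group in groups.values():
        group.sort(key=lambda r: r["annee"])
        prev = None
        for r in group:
            if prev is None or prev == 0:
                r["annual_rate_change"] = None  # undefined for the first year
            else:
                r["annual_rate_change"] = (r["rate"] - prev) / prev
            prev = r["rate"]
    return kept
```

The first year of each department and category pair has no previous value, so its change is left as `None`; a DataFrame-based `pct_change` would leave a missing value in the same position.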
125+
126+ ---
120127
121128🤖 Modeling & Results
122129Features
123130Feature Description
124131annee Year (int)
132+ dep_encoded Department (LabelEncoded)
133+ cat_encoded Crime category (LabelEncoded)
134+ annee_norm Normalized year [0, 1]
135+
136+ Target: tauxpour100000hab (crime rate per 100,000 inhabitants)
137+ Split: 80% train / 20% test — random seed 42
138+ Validation: 5-fold cross-validation on training set
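
As a rough illustration of the encoding and the 80/20 split described above, here is a dependency-free sketch. It is not the code in `train.py`: the helper names `label_encode` and `train_test_split_indices` are hypothetical, the shuffle uses Python's `random` module rather than scikit-learn, and the 5-fold cross-validation step is not covered.

```python
import math
import random


def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values.

    This mirrors how a label encoder typically assigns codes:
    classes are sorted, then numbered from 0.
    """
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]
```

Fixing the seed makes the split reproducible, which is what allows the reported test metrics to be regenerated from the same data.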
139+
140+ | Model | R² test | RMSE | MAE | CV R² (±std) |
141+ | ----------------- | -------- | -------- | -------- | --------------- |
142+ | Ridge | 0.71 | 87.4 | 62.1 | 0.69 ± 0.03 |
143+ | Random Forest | 0.89 | 54.2 | 38.7 | 0.87 ± 0.02 |
144+ | Gradient Boosting | 0.88 | 56.1 | 40.2 | 0.86 ± 0.02 |
145+ | **XGBoost** ✅ | **0.91** | **49.8** | **35.3** | **0.90 ± 0.01** |
146+
147+
148+ Best model: XGBoost — R²=0.91 on test set
149+ Low train/test gap → no significant overfitting
150+ Low CV variance → confirmed robustness
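
For readers unfamiliar with the three metrics in the table, the sketch below computes them from their standard definitions. The function name `regression_metrics` is hypothetical and this is not taken from `train.py`, which may rely on a library for the same calculations. RMSE and MAE are expressed in the unit of the target, incidents per 100,000 inhabitants.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

Comparing these values between the training set and the test set is one way to check the overfitting claim made above.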
151+
152+ MLflow Tracking
153+ mlflow ui --backend-store-uri models/crime_predictor/mlruns
154+ # → http://localhost:5000
155+
156+ ---
157+
158+ 📊 Streamlit Dashboard
159+
160+ 5 interactive pages:
161+ | Page | Content |
162+ | -------------------- | ----------------------------------------------------- |
163+ | Overview | KPIs, boxplots by category, top 10 departments |
164+ | Department Analysis | Multi-department comparison, heatmaps |
165+ | Temporal Trends | 2016–2023 evolution, base-100 index, annual variation |
166+ | ML Prediction | Interactive simulator with historical graph |
167+ | Ethics & Limitations | Biases and usage limits |
168+
169+ 🌐 FastAPI Endpoints
170+ Available Endpoints
171+ | Method | Endpoint | Description |
172+ | ------ | ---------- | --------------------------------- |
173+ | GET | `/health` | API status + model metrics |
174+ | POST | `/predict` | Predict crime rate |
175+ | GET | `/docs` | Interactive Swagger documentation |
176+
177+ Example Request
178+ curl -X POST http://localhost:8000/predict \
179+ -H "Content-Type: application/json" \
180+ -d '{"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}'
181+ {
182+ "predicted_rate": 312.47,
183+ "unit": "incidents per 100,000 inhabitants",
184+ "model_used": "XGBoost",
185+ "r2_test": 0.91
186+ }
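
The same request can be issued from Python using only the standard library. This is a minimal sketch, assuming the API is running locally on port 8000 as in the `curl` example; the helper names `build_predict_request` and `predict` are hypothetical and not part of the project.

```python
import json
import urllib.request


def build_predict_request(annee, dep_encoded, cat_encoded, annee_norm,
                          base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for the /predict endpoint."""
    payload = {
        "annee": annee,
        "dep_encoded": dep_encoded,
        "cat_encoded": cat_encoded,
        "annee_norm": annee_norm,
    }
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(**fields):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(**fields)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the server started, `predict(annee=2025, dep_encoded=5, cat_encoded=0, annee_norm=1.0)` sends the same payload as the `curl` command.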
187+
188+ ---
189+
190+ 🧪 Tests
191+ # Run all tests
192+ pytest models/crime_predictor/tests/ -v
193+
194+ # With coverage
195+ pytest models/crime_predictor/tests/ -v --cov=models/crime_predictor/src --cov-report=term-missing
196+
197+ Test coverage:
198+ | Class | Tests |
199+ | ------------------- | ----------------------------------------------------------- |
200+ | `TestData` | DataFrame integrity (6 assertions) |
201+ | `TestModel` | Shape, type, positivity, R², determinism (7 assertions) |
202+ | `TestSerialization` | Joblib serialization, metrics.json structure (2 assertions) |
203+
204+ ---
205+
206+ 🐳 Docker & CI/CD
207+ Multi-stage Docker
208+ # Build (training → production)
209+ docker build -t oasis-security:latest .
210+
211+ # Run API
212+ docker run -p 8000:8000 oasis-security:latest
213+
214+ Full Stack (MLflow + Postgres + API)
215+ docker-compose up -d
216+ # MLflow UI → http://localhost:5000
217+ # API → http://localhost:8000/docs
218+
219+ GitHub Actions CI/CD
220+
221+ .github/workflows/ci-cd.yml triggers on each push:
222+
223+ 1. Linting (flake8)
224+ 2. Unit tests (pytest)
225+ 3. Docker build
226+ 4. Push Docker image to GHCR
227+
228+ ---
229+
230+ ⚠️ Ethics & Limitations
231+
232+ This model is a statistical exploration tool, not an operational decision system.
233+
234+ Data limitations:
235+
236+ - Covers only recorded crimes (the dark figure of unreported crime is estimated at 50–80%)
237+ - Recording practices vary by department
238+ - No sub-departmental data
239+
240+ Model biases:
241+
242+ - Reflects reporting biases
243+ - Correlation ≠ causation
244+ - Cannot anticipate external shocks (COVID, economic crises)
245+
246+ Prohibited use:
247+
248+ - Predictive targeting of individuals or geographic areas
249+ - Judicial or penal decision-making
250+
251+ Compliance: aggregated anonymized open data — no personal data used.
252+
253+ ---
254+
255+ 📜 License
256+ MIT — see LICENSE
257+
258+ ---
259+
260+ 👤 Author
261+
262+ Frédéric Tellier — Data Scientist
263+ [LinkedIn](https://www.linkedin.com/in/fr%C3%A9d%C3%A9ric-tellier-8a9170283/)
264+ | [Portfolio](https://github.com/Dreipfelt/)
125265
126266Project developed as part of CDSD certification — 2025