
Commit 2895caa

updated README
1 parent 26f1a3b commit 2895caa

1 file changed

Lines changed: 256 additions & 0 deletions

1_README.md

@@ -0,0 +1,256 @@
# 🛡️ Oasis Security — Crime Predictor

[![CI/CD](https://github.com/Data-Science-Designer-and-Developer/oasis-security/actions/workflows/ci-cd.yml/badge.svg)](https://github.com/Data-Science-Designer-and-Developer/oasis-security/actions)
[![Docker](https://img.shields.io/badge/Docker-GHCR-blue)](https://ghcr.io/Data-Science-Designer-and-Developer)
[![Python](https://img.shields.io/badge/Python-3.11-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green)](LICENSE)

> **Predictive model for crime rates in France** (per 100,000 inhabitants)
> Full ML pipeline: data collection → cleaning → modeling → API → dashboard
> Official source: SSMSI / [data.gouv.fr](https://www.data.gouv.fr), 2016–2023

---

## 📋 Table of Contents

1. [Context & Goals](#-context--goals)
2. [Project Structure](#-project-structure)
3. [Installation & Running](#-installation--running)
4. [Data Pipeline](#-data-pipeline)
5. [Modeling & Results](#-modeling--results)
6. [Streamlit Dashboard](#-streamlit-dashboard)
7. [FastAPI Endpoints](#-fastapi-endpoints)
8. [Tests](#-tests)
9. [Docker & CI/CD](#-docker--cicd)
10. [Ethics & Limitations](#-ethics--limitations)

---

## 🎯 Context & Goals

This project predicts **French departmental crime rates** by category using official police and gendarmerie data.

**Primary use case**: a statistical exploration tool for journalists, social science researchers, and public policy makers.

**Technical objectives**:

- Build a fully reproducible end-to-end ML pipeline
- Compare multiple regression algorithms using MLflow tracking
- Deploy a prediction API (FastAPI) and interactive dashboard (Streamlit)
- Follow MLOps best practices: versioning, testing, CI/CD, Docker

---

## 📁 Project Structure

    oasis-security/
    ├── .github/
    │   └── workflows/                  # GitHub Actions CI/CD
    ├── data/                           # Cleaned datasets (.parquet)
    ├── docs/
    │   └── crime_predictor/            # Technical documentation
    ├── images/                         # Visualizations and plots
    ├── models/
    │   └── crime_predictor/
    │       ├── src/
    │       │   ├── train.py            # Training pipeline (model comparison)
    │       │   └── predict.py          # FastAPI endpoint definitions
    │       ├── models/
    │       │   ├── crime_predictor.pkl # Serialized model
    │       │   └── metrics.json        # Train/test metrics
    │       ├── mlruns/                 # MLflow experiment tracking
    │       └── tests/
    │           └── test_model.py       # Unit tests
    ├── notebooks/                      # Exploration & EDA notebooks
    ├── pipeline/                       # Automation scripts
    ├── streamlit/                      # Streamlit supplementary assets
    ├── app.py                          # Main Streamlit dashboard
    ├── script_crimes_et_delits.py      # Data collection & cleaning
    ├── Dockerfile                      # Multi-stage build (train → production)
    ├── docker-compose.yml              # Full stack (MLflow + Postgres + API)
    ├── requirements.txt
    └── README.md

---

## ⚙️ Installation & Running

### 1. Clone & Install Dependencies

    git clone https://github.com/Data-Science-Designer-and-Developer/oasis-security.git
    cd oasis-security
    python3.11 -m venv .venv
    source .venv/bin/activate  # Windows: .venv\Scripts\activate
    pip install -r requirements.txt

### 2. Download and Clean Data

    python script_crimes_et_delits.py
    # → generates data/crimes_clean.parquet

### 3. Train the Model

    python models/crime_predictor/src/train.py
    # → compares 4 models, logs to MLflow, saves best model
    # → generates models/crime_predictor/models/crime_predictor.pkl
    # → generates models/crime_predictor/models/metrics.json

### 4. Launch the Dashboard

    streamlit run app.py
    # → http://localhost:8501

### 5. Launch the API

    uvicorn models.crime_predictor.src.predict:app --reload --port 8000
    # → http://localhost:8000/docs

---

## 🔄 Data Pipeline

    data.gouv.fr (SSMSI)
            ↓
    script_crimes_et_delits.py
    ├── Download CSV (requests)
    ├── Normalize column names (snake_case)
    ├── Remove duplicates
    ├── Convert numeric types
    ├── Remove invalid negative rates (< 0)
    ├── Feature engineering
    │   ├── annual_rate_change (pct_change by dep × category)
    │   └── year_norm (normalized [0, 1])
    └── Save Parquet (Snappy)
            ↓
    data/crimes_clean.parquet

- Raw data: 8 columns, ~50,000 rows
- After cleaning: 10 columns, ~49,000 rows (<2% of rows dropped)
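
The cleaning and feature-engineering steps above can be sketched without any dependency. This is a minimal illustration, not the contents of `script_crimes_et_delits.py`: the row keys (`dep`, `category`, `rate`) and the `add_features` helper are assumptions made for the example, and the plain-Python loop stands in for the pandas `pct_change` call named in the pipeline. Unlike pandas, this sketch returns `None` rather than `inf` when the previous rate is zero.

```python
from collections import defaultdict


def add_features(rows):
    """Add annual_rate_change and year_norm to a list of row dicts.

    Hypothetical row keys: "dep", "category", "annee", "rate".
    """
    # Drop invalid negative rates, as in the cleaning step.
    valid = [dict(row) for row in rows if row["rate"] >= 0]
    if not valid:
        return []

    # year_norm: min-max scaling of the year to [0, 1].
    years = [row["annee"] for row in valid]
    year_min, year_max = min(years), max(years)
    span = (year_max - year_min) or 1  # avoid dividing by zero for a single year

    # Group by department x category so a change never crosses two series.
    groups = defaultdict(list)
    for row in valid:
        row["year_norm"] = (row["annee"] - year_min) / span
        groups[(row["dep"], row["category"])].append(row)

    for series in groups.values():
        series.sort(key=lambda row: row["annee"])
        previous = None
        for row in series:
            # The first year of a series has no predecessor, so no change.
            if previous is None or previous["rate"] == 0:
                row["annual_rate_change"] = None
            else:
                row["annual_rate_change"] = (
                    row["rate"] - previous["rate"]
                ) / previous["rate"]
            previous = row
    return valid


sample = [
    {"dep": "75", "category": "vols", "annee": 2016, "rate": 100.0},
    {"dep": "75", "category": "vols", "annee": 2017, "rate": 110.0},
    {"dep": "13", "category": "vols", "annee": 2017, "rate": -5.0},  # dropped
    {"dep": "13", "category": "vols", "annee": 2016, "rate": 200.0},
]
features = add_features(sample)
```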

---

## 🤖 Modeling & Results

### Features

| Feature       | Description                   |
| ------------- | ----------------------------- |
| `annee`       | Year (int)                    |
| `dep_encoded` | Department (LabelEncoded)     |
| `cat_encoded` | Crime category (LabelEncoded) |
| `annee_norm`  | Normalized year [0, 1]        |

- **Target**: `tauxpour100000hab` (crime rate per 100,000 inhabitants)
- **Split**: 80% train / 20% test, random seed 42
- **Validation**: 5-fold cross-validation on the training set
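
To make the encoding and splitting scheme concrete, here is a dependency-free sketch. It mirrors the documented behavior of scikit-learn's `LabelEncoder` (labels mapped to integers in sorted order) and an 80/20 split followed by 5 folds, but the helper names are assumptions for illustration, and the shuffled order will not match what `train_test_split(random_state=42)` produces in `train.py`.

```python
import random


def label_encode(values):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[value] for value in values], mapping


def train_test_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Department codes become integers: "13" -> 0, "69" -> 1, "75" -> 2.
encoded, mapping = label_encode(["75", "13", "69", "13", "75"])

# 100 rows -> 80 for training, 20 held out; the 80 are then cut into 5 folds.
train_idx, test_idx = train_test_indices(100)
folds = list(k_fold(train_idx, k=5))
```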
### Model Comparison (Test Set)

| Model             | R² (test) | RMSE     | MAE      | CV R² (± std)   |
| ----------------- | --------- | -------- | -------- | --------------- |
| Ridge             | 0.71      | 87.4     | 62.1     | 0.69 ± 0.03     |
| Random Forest     | 0.89      | 54.2     | 38.7     | 0.87 ± 0.02     |
| Gradient Boosting | 0.88      | 56.1     | 40.2     | 0.86 ± 0.02     |
| **XGBoost**       | **0.91**  | **49.8** | **35.3** | **0.90 ± 0.01** |

RMSE and MAE are expressed in the unit of the target (incidents per 100,000 inhabitants).

**Best model**: XGBoost, with R² = 0.91 on the test set.

- The test R² is close to the cross-validated R² (0.91 vs. 0.90), which suggests no significant overfitting.
- The low variance across CV folds (± 0.01) indicates stable performance.
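
For readers unfamiliar with the three metrics in the table, the sketch below computes them from scratch using their standard definitions. The function name and the toy values are illustrative only; the project itself presumably relies on a library implementation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [true - pred for true, pred in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(error * error for error in errors)        # residual sum of squares
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)  # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(error) for error in errors) / n,
    }


# Toy example: rates per 100,000 inhabitants (made-up values).
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```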
### MLflow Tracking

    mlflow ui --backend-store-uri models/crime_predictor/mlruns
    # → http://localhost:5000

---

## 📊 Streamlit Dashboard

The dashboard has five interactive pages:

| Page                 | Content                                               |
| -------------------- | ----------------------------------------------------- |
| Overview             | KPIs, boxplots by category, top 10 departments        |
| Department Analysis  | Multi-department comparison, heatmaps                 |
| Temporal Trends      | 2016–2023 evolution, base-100 index, annual variation |
| ML Prediction        | Interactive simulator with historical graph           |
| Ethics & Limitations | Biases and usage limits                               |
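
The "base-100 index" and "annual variation" shown on the Temporal Trends page are standard transformations. The sketch below shows one way to compute them; the function names, the choice of 2016 as base year, and the sample rates are assumptions for illustration and are not taken from `app.py`.

```python
def base_100_index(rates_by_year, base_year):
    """Rescale a yearly series so that the base year equals 100."""
    base = rates_by_year[base_year]
    if base == 0:
        raise ValueError("base-year rate must be non-zero")
    return {year: 100 * rate / base for year, rate in sorted(rates_by_year.items())}


def annual_variation(rates_by_year):
    """Year-over-year change in percent; the first year has no value."""
    years = sorted(rates_by_year)
    return {
        year: 100 * (rates_by_year[year] - rates_by_year[prev]) / rates_by_year[prev]
        for prev, year in zip(years, years[1:])
    }


# Made-up rates per 100,000 inhabitants for a single department and category.
rates = {2016: 250.0, 2017: 275.0, 2018: 220.0}
index = base_100_index(rates, base_year=2016)
variation = annual_variation(rates)
```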

---

## 🌐 FastAPI Endpoints

### Available Endpoints

| Method | Endpoint   | Description                       |
| ------ | ---------- | --------------------------------- |
| GET    | `/health`  | API status + model metrics        |
| POST   | `/predict` | Predict crime rate                |
| GET    | `/docs`    | Interactive Swagger documentation |

### Example Request

    curl -X POST http://localhost:8000/predict \
      -H "Content-Type: application/json" \
      -d '{"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}'

Example response:

    {
      "predicted_rate": 312.47,
      "unit": "incidents per 100,000 inhabitants",
      "model_used": "XGBoost",
      "r2_test": 0.91
    }
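
The same request can be issued from Python with the standard library alone. This is a sketch under the assumption that the API runs locally on port 8000 as described above; `build_predict_request` is a hypothetical helper, and the network call is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request


def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same payload as the curl example.
payload = {"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}
request = build_predict_request(payload)

# With the API running, sending the request would look like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.loads(response.read())
#     print(result["predicted_rate"], result["unit"])
```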

---

## 🧪 Tests

    # Run all tests
    pytest models/crime_predictor/tests/ -v

    # With coverage
    pytest models/crime_predictor/tests/ -v --cov=models/crime_predictor/src --cov-report=term-missing

What the tests cover:

| Class               | Tests                                                       |
| ------------------- | ----------------------------------------------------------- |
| `TestData`          | DataFrame integrity (6 assertions)                          |
| `TestModel`         | Shape, type, positivity, R², determinism (7 assertions)     |
| `TestSerialization` | Joblib serialization, metrics.json structure (2 assertions) |
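
As an illustration of what a "metrics.json structure" check might look like, here is a small pytest-style sketch. It is not the content of `test_model.py`: apart from `r2_test`, which appears in the API response, the key names (`rmse`, `mae`) and the `validate_metrics` helper are hypothetical.

```python
import json

# Hypothetical key names: only "r2_test" is documented in this README.
REQUIRED_KEYS = {"r2_test", "rmse", "mae"}


def validate_metrics(raw_json):
    """Parse a metrics.json payload and check its basic structure."""
    metrics = json.loads(raw_json)
    missing = REQUIRED_KEYS - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for key in REQUIRED_KEYS:
        if not isinstance(metrics[key], (int, float)):
            raise TypeError(f"{key} must be numeric")
    if metrics["r2_test"] > 1.0:
        raise ValueError("R² cannot exceed 1")
    return metrics


def test_metrics_structure():
    metrics = validate_metrics('{"r2_test": 0.91, "rmse": 49.8, "mae": 35.3}')
    assert metrics["r2_test"] == 0.91


def test_missing_key_is_rejected():
    try:
        validate_metrics('{"r2_test": 0.91}')
    except ValueError as error:
        assert "mae" in str(error)
    else:
        raise AssertionError("expected a ValueError")
```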

---

## 🐳 Docker & CI/CD

### Multi-stage Docker

    # Build (training → production)
    docker build -t oasis-security:latest .

    # Run API
    docker run -p 8000:8000 oasis-security:latest

### Full Stack (MLflow + Postgres + API)

    docker-compose up -d
    # MLflow UI → http://localhost:5000
    # API → http://localhost:8000/docs

### GitHub Actions CI/CD

`.github/workflows/ci-cd.yml` runs on each push and performs:

1. Linting (flake8)
2. Unit tests (pytest)
3. Docker build
4. Push Docker image to GHCR

---

## ⚠️ Ethics & Limitations

This model is a statistical exploration tool, not an operational decision system.

**Data limitations**:

- Covers only recorded crimes (the dark figure of unreported crime is estimated at 50–80%)
- Recording practices vary by department
- No data below the department level

**Model biases**:

- Reflects reporting biases
- Correlation ≠ causation
- Does not account for external shocks (COVID, economic crises)

**Prohibited use**:

- Predictive targeting of individuals or geographic areas
- Judicial or penal decision-making

**Compliance**: aggregated, anonymized open data; no personal data is used.

---

## 📜 License

MIT, see [LICENSE](LICENSE)

---

## 👤 Author

Frédéric Tellier, Data Scientist

- LinkedIn: https://www.linkedin.com/in/fr%C3%A9d%C3%A9ric-tellier-8a9170283/
- Portfolio: https://github.com/Dreipfelt/

Project developed as part of the CDSD certification (2025).

---
