Commit 50e67cc

Revise project structure and installation steps
Updated project structure and installation instructions in README.
1 parent 61422eb commit 50e67cc

1 file changed

Lines changed: 187 additions & 47 deletions

README.md

@@ -44,83 +44,223 @@ This project predicts **French departmental crime rates** by category using offi
 ## 📁 Project Structure


-oasis-security/
-├── .github/
-│ └── workflows/ # GitHub Actions CI/CD
-├── data/ # Cleaned datasets (.parquet)
-├── docs/
-│ └── crime_predictor/ # Technical documentation
-├── images/ # Visualizations and plots
-├── models/
-│ └── crime_predictor/
-│ ├── src/
-│ │ ├── train.py # Training pipeline (model comparison)
-│ │ └── predict.py # FastAPI endpoint definitions
-│ ├── models/
-│ │ ├── crime_predictor.pkl # Serialized model
-│ │ └── metrics.json # Train/test metrics
-│ ├── mlruns/ # MLflow experiment tracking
-│ └── tests/
-│ └── test_model.py # Unit tests
-├── notebooks/ # Exploration & EDA notebooks
-├── pipeline/ # Automation scripts
-├── streamlit/ # Streamlit supplementary assets
-├── app.py # Main Streamlit dashboard
-├── script_crimes_et_delits.py # Data collection & cleaning
-├── Dockerfile # Multi-stage build (train → production)
-├── docker-compose.yml # Full stack (MLflow + Postgres + API)
-├── requirements.txt
-└── README.md
+oasis-security/
+├── .github/
+│ └── workflows/ # GitHub Actions CI/CD
+├── data/ # Cleaned datasets (.parquet)
+├── docs/
+│ └── crime_predictor/ # Technical documentation
+├── images/ # Visualizations and plots
+├── models/
+│ └── crime_predictor/
+│ ├── src/
+│ │ ├── train.py # Training pipeline (model comparison)
+│ │ └── predict.py # FastAPI endpoint definitions
+│ ├── models/
+│ │ ├── crime_predictor.pkl # Serialized model
+│ │ └── metrics.json # Train/test metrics
+│ ├── mlruns/ # MLflow experiment tracking
+│ └── tests/
+│ └── test_model.py # Unit tests
+├── notebooks/ # Exploration & EDA notebooks
+├── pipeline/ # Automation scripts
+├── streamlit/ # Streamlit supplementary assets
+├── app.py # Main Streamlit dashboard
+├── script_crimes_et_delits.py # Data collection & cleaning
+├── Dockerfile # Multi-stage build (train → production)
+├── docker-compose.yml # Full stack (MLflow + Postgres + API)
+├── requirements.txt
+└── README.md


 ---

 ## ⚙️ Installation & Running

 ### 1. Clone & Install Dependencies
-
-```bash
 git clone https://github.com/Data-Science-Designer-and-Developer/oasis-security.git
 cd oasis-security
 python3.11 -m venv .venv
 source .venv/bin/activate # Windows: .venv\Scripts\activate
 pip install -r requirements.txt
-2. Download and Clean Data
+
+### 2. Download and Clean Data
 python script_crimes_et_delits.py
 # → generates data/crimes_clean.parquet
-3. Train the Model
+
+### 3. Train the Model
 python models/crime_predictor/src/train.py
 # → compares 4 models, logs to MLflow, saves best model
 # → generates models/crime_predictor/models/crime_predictor.pkl
 # → generates models/crime_predictor/models/metrics.json
-4. Launch the Dashboard
+
+### 4. Launch the Dashboard
 streamlit run app.py
 # http://localhost:8501
-5. Launch the API
+
+### 5. Launch the API
 uvicorn models.crime_predictor.src.predict:app --reload --port 8000
 # http://localhost:8000/docs
+
+---
+
 🔄 Data Pipeline
 data.gouv.fr (SSMSI)

 script_crimes_et_delits.py
-├── Download CSV (requests)
-├── Normalize column names (snake_case)
-├── Remove duplicates
-├── Convert numeric types
-├── Remove outlier rates (<0)
-├── Feature engineering
-│ ├── annual_rate_change (pct_change by dep × category)
-│ └── year_norm (normalized [0, 1])
-└── Save Parquet (Snappy)
-
-data/crimes_clean.parquet
-
-Raw data: 8 columns, ~50,000 rows
-After cleaning: 10 columns, ~49,000 rows (<2% loss)
+├── Download CSV (requests)
+├── Normalize column names (snake_case)
+├── Remove duplicates
+├── Convert numeric types
+├── Remove outlier rates (<0)
+├── Feature engineering
+│ ├── annual_rate_change (pct_change by dep × category)
+│ └── year_norm (normalized [0, 1])
+└── Save Parquet (Snappy)
+
+data/crimes_clean.parquet
+
+Raw data: 8 columns, ~50,000 rows
+After cleaning: 10 columns, ~49,000 rows (<2% loss)
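
The cleaning and feature-engineering steps listed in the pipeline can be sketched in plain Python. This is only an illustration under stated assumptions: the row keys (`dep`, `category`, `year`, `rate`) and both function names are hypothetical, and the actual `script_crimes_et_delits.py` may be organized differently (the pipeline tree suggests a pandas `pct_change`, which yields NaN where this sketch uses `None`).

```python
from collections import defaultdict


def clean_rows(rows):
    """Drop exact duplicates, non-numeric rates, and negative rates."""
    seen, cleaned = set(), []
    for row in rows:
        key = (row["dep"], row["category"], row["year"], row["rate"])
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        try:
            rate = float(row["rate"])
        except (TypeError, ValueError):
            continue  # rate cannot be converted to a number
        if rate < 0:
            continue  # negative rates are treated as outliers
        cleaned.append({
            "dep": row["dep"],
            "category": row["category"],
            "year": int(row["year"]),
            "rate": rate,
        })
    return cleaned


def add_features(rows):
    """Add year_norm in [0, 1] and annual_rate_change per (dep, category)."""
    if not rows:
        return rows
    years = [r["year"] for r in rows]
    y_min, y_max = min(years), max(years)
    span = (y_max - y_min) or 1  # avoid dividing by zero for a single year
    groups = defaultdict(list)
    for r in rows:
        groups[(r["dep"], r["category"])].append(r)
    for series in groups.values():
        series.sort(key=lambda r: r["year"])
        prev = None
        for r in series:
            r["year_norm"] = (r["year"] - y_min) / span
            if prev is None or prev["rate"] == 0:
                r["annual_rate_change"] = None  # no previous value to compare
            else:
                r["annual_rate_change"] = (r["rate"] - prev["rate"]) / prev["rate"]
            prev = r
    return rows
```

Grouping by department and category before computing the change keeps one series from leaking into the next, which is the point of the "by dep × category" note in the pipeline.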
+
+---

 🤖 Modeling & Results
 Features
 Feature Description
 annee Year (int)
+dep_encoded Department (LabelEncoded)
+cat_encoded Crime category (LabelEncoded)
+annee_norm Normalized year [0,1]
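
As a rough illustration of what "LabelEncoded" means for `dep_encoded` and `cat_encoded`, the sketch below assigns integer codes in sorted order of the distinct labels, which is how scikit-learn's `LabelEncoder` orders its classes. The helper name and the sample department codes are assumptions made for this example, not values taken from the project.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical department codes, used only for illustration.
departments = ["75", "13", "2A", "13"]
mapping = fit_label_encoder(departments)
encoded = [mapping[d] for d in departments]
```

The codes carry no meaning beyond identity: department 2 is not "between" departments 1 and 3, so the integer order should not be read as a ranking.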
+
+Target: tauxpour100000hab (crime rate per 100,000 inhabitants)
+Split: 80% train / 20% test — random seed 42
+Validation: 5-fold cross-validation on training set
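
The split and validation scheme can be pictured with index arithmetic alone. The sketch below is an assumption-laden stand-in: it uses Python's `random` module with seed 42, so it will not reproduce the exact partition produced by whatever splitting utility `train.py` actually calls, and both function names are hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test share."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation
```

Cross-validation runs only on the training indices, so the held-out 20% is never seen during model comparison.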
+
+| Model | R² test | RMSE | MAE | CV R² (±std) |
+| ----------------- | -------- | -------- | -------- | --------------- |
+| Ridge | 0.71 | 87.4 | 62.1 | 0.69 ± 0.03 |
+| Random Forest | 0.89 | 54.2 | 38.7 | 0.87 ± 0.02 |
+| Gradient Boosting | 0.88 | 56.1 | 40.2 | 0.86 ± 0.02 |
+| **XGBoost** | **0.91** | **49.8** | **35.3** | **0.90 ± 0.01** |
+
+
+Best model: XGBoost — R²=0.91 on test set
+Low train/test gap → no significant overfitting
+Low CV variance → confirmed robustness
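
For readers unfamiliar with the three metrics in the table, here is a minimal sketch of how R², RMSE and MAE are defined. The function name and the sample values are made up for illustration; the project presumably relies on a library implementation rather than this code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up rates per 100,000 inhabitants, not project data.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

RMSE and MAE are expressed in the target's unit (incidents per 100,000 inhabitants), while R² is unitless.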
+
+MLflow Tracking
+mlflow ui --backend-store-uri models/crime_predictor/mlruns
+# http://localhost:5000
+
+---
+
+📊 Streamlit Dashboard
+
+5 interactive pages:
+| Page | Content |
+| -------------------- | ----------------------------------------------------- |
+| Overview | KPIs, boxplots by category, top 10 departments |
+| Department Analysis | Multi-department comparison, heatmaps |
+| Temporal Trends | 2016–2023 evolution, base-100 index, annual variation |
+| ML Prediction | Interactive simulator with historical graph |
+| Ethics & Limitations | Biases and usage limits |
+
+🌐 FastAPI Endpoints
+Available Endpoints
+| Method | Endpoint | Description |
+| ------ | ---------- | --------------------------------- |
+| GET | `/health` | API status + model metrics |
+| POST | `/predict` | Predict crime rate |
+| GET | `/docs` | Interactive Swagger documentation |
+
+Example Request
+curl -X POST http://localhost:8000/predict \
+-H "Content-Type: application/json" \
+-d '{"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}'
+{
+"predicted_rate": 312.47,
+"unit": "incidents per 100,000 inhabitants",
+"model_used": "XGBoost",
+"r2_test": 0.91
+}
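
The same request can be issued from Python with the standard library only. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the local default shown in the installation steps, and the response fields are taken from the example above rather than from the API source.

```python
import json
import urllib.request


def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build a POST request for the /predict endpoint with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, base_url="http://localhost:8000", timeout=10):
    """Send the request and decode the JSON response (the API must be running)."""
    request = build_predict_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payload as the curl example.
payload = {"annee": 2025, "dep_encoded": 5, "cat_encoded": 0, "annee_norm": 1.0}
request = build_predict_request(payload)
```

With the API started via `uvicorn`, `predict(payload)["predicted_rate"]` should give the predicted rate, assuming the response has the shape shown in the example.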
+
+---
+
+🧪 Tests
+# Run all tests
+pytest models/crime_predictor/tests/ -v
+
+# With coverage
+pytest models/crime_predictor/tests/ -v --cov=models/crime_predictor/src --cov-report=term-missing
+
+Test coverage:
+| Class | Tests |
+| ------------------- | ----------------------------------------------------------- |
+| `TestData` | DataFrame integrity (6 assertions) |
+| `TestModel` | Shape, type, positivity, R², determinism (7 assertions) |
+| `TestSerialization` | Joblib serialization, metrics.json structure (2 assertions) |
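
To give a feel for the kind of checks the `TestModel` row describes (shape, type, positivity, determinism), here is a pytest-style sketch. Everything in it is hypothetical: `StubModel` is a stand-in with made-up coefficients, not the project's predictor, and the real `test_model.py` is not shown in this commit.

```python
class StubModel:
    """Hypothetical stand-in for the serialized model, used only in this sketch."""

    def predict(self, rows):
        # Each row is (annee, dep_encoded, cat_encoded, annee_norm).
        return [
            max(0.0, 50.0 + 2.0 * dep + 10.0 * cat + 5.0 * year_norm)
            for (_, dep, cat, year_norm) in rows
        ]


SAMPLE_ROWS = [(2023, 5, 0, 0.875), (2025, 12, 3, 1.0)]


def test_shape_and_type():
    predictions = StubModel().predict(SAMPLE_ROWS)
    assert len(predictions) == len(SAMPLE_ROWS)
    assert all(isinstance(p, float) for p in predictions)


def test_positivity():
    # A crime rate per 100,000 inhabitants cannot be negative.
    assert all(p >= 0 for p in StubModel().predict(SAMPLE_ROWS))


def test_determinism():
    model = StubModel()
    assert model.predict(SAMPLE_ROWS) == model.predict(SAMPLE_ROWS)
```

Saved as a `test_*.py` file, these functions would be collected by the `pytest` commands listed above.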
+
+---
+
+🐳 Docker & CI/CD
+Multi-stage Docker
+# Build (training → production)
+docker build -t oasis-security:latest .
+
+# Run API
+docker run -p 8000:8000 oasis-security:latest
+
+Full Stack (MLflow + Postgres + API)
+docker-compose up -d
+# MLflow UI → http://localhost:5000
+# API → http://localhost:8000/docs
+
+GitHub Actions CI/CD
+
+.github/workflows/ci-cd.yml triggers on each push:
+
+1. Linting (flake8)
+2. Unit tests (pytest)
+3. Docker build
+4. Push Docker image to GHCR
+
+---
+
+⚠️ Ethics & Limitations
+
+This model is a statistical exploration tool, not an operational decision system.
+
+Data limitations:
+
+- Covers only recorded crimes (dark figure estimated 50–80%)
+- Recording practices vary by department
+- No infra-departmental data
+
+Model biases:
+
+- Reflects reporting biases
+- Correlation ≠ causation
+- Not suitable for external shocks (COVID, economic crises)
+
+Prohibited use:
+
+- Predictive targeting of individuals or geographic areas
+- Judicial or penal decision-making
+
+Compliance: aggregated anonymized open data — no personal data used.
+
+---
+
+📜 License
+MIT — see LICENSE
+
+---
+
+👤 Author
+
+Frédéric Tellier — Data Scientist
+[LinkedIn](https://www.linkedin.com/in/fr%C3%A9d%C3%A9ric-tellier-8a9170283/)
+| [Portfolio](https://github.com/Dreipfelt/)

 Project developed as part of CDSD certification — 2025
