Commit bdd6d0b

Added Directory Structure
1 parent 4a0f395 commit bdd6d0b

14 files changed

Lines changed: 87 additions & 167 deletions

README.md

Lines changed: 1 addition & 2 deletions
@@ -47,7 +47,6 @@ See full Makefile usage [here](#makefile-usage) — from setup to linting, testi
 ├── mlruns/           MLflow experiment tracking artifacts
 ├── infra/            Terraform IaC for provisioning MLflow container
 ├── github_pipeline/  Feature engineering, inference, monitoring scripts
-├── scripts/          Utility or automation scripts (e.g., setup, cleanup)
 ├── tests/            Pytest-based unit/integration tests
 ├── reports/          Data drift reports (JSON/HTML) from Evidently
 ├── alerts/           Alert log dumps (e.g., triggered drift/anomaly alerts)

@@ -177,7 +176,7 @@ You can explore experiment runs and models in the MLflow UI.
 The model (Isolation Forest) is trained on actor-wise event features:

 ```bash
-python scripts/train_model.py
+python github_pipeline/train_model.py
 ```
 The latest parquet file is used automatically. Model and scaler are saved to models/.

dags/daily_github_inference.py

Lines changed: 14 additions & 0 deletions
@@ -7,6 +7,20 @@
 from github_pipeline.inference import run_inference
 from github_pipeline.cleanup import run_cleanup

+"""
+DAG: daily_github_inference
+
+This Airflow DAG orchestrates the daily GitHub anomaly detection pipeline.
+
+Tasks:
+- 📥 Download and parse GitHub Archive logs for the previous day at 15:00 UTC
+- 🛠️ Perform feature engineering on the ingested data
+- 🧠 Run inference using the latest trained Isolation Forest model
+- 🧹 Clean up old data files, keeping only the latest relevant timestamps
+
+Each step uses PythonOperators to invoke modular functions from github_pipeline.
+"""

 def get_yesterday_fixed_hour(ds: str, hour: int = 15) -> str:
     """

dags/daily_monitoring_dag.py

Lines changed: 13 additions & 0 deletions
@@ -7,6 +7,19 @@
 from glob import glob
 import sys

+"""
+DAG: daily_monitoring_dag
+
+This Airflow DAG runs daily monitoring on GitHub features.
+
+Tasks:
+- 🧪 Compare the two latest feature files to detect data drift using Evidently
+- 🚨 Trigger alerts via Slack and Email if drift or anomaly spikes are detected
+- 📊 Save drift reports as JSON and HTML in the reports/ directory
+
+This DAG complements the daily inference DAG by validating data quality and stability.
+"""
+
 # Add github_pipeline to PYTHONPATH
 sys.path.append(str(Path(__file__).resolve().parents[1] / "github_pipeline"))
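The first task in the docstring, comparing the two latest feature files, implies a selection step along these lines. This is a sketch with a hypothetical helper name; the DAG's actual implementation is outside this hunk, and the lexicographic sort relies on the pipeline writing one file per day at a fixed hour:

```python
from glob import glob

def get_two_latest_feature_files(features_dir: str = "data/features"):
    """Return (reference, current) paths for the two most recent
    actor_features parquet files. Lexicographic order matches
    chronological order here only because the hour is fixed (assumption)."""
    files = sorted(glob(f"{features_dir}/actor_features_*.parquet"))
    if len(files) < 2:
        raise ValueError("Need at least two feature files to compare drift.")
    return files[-2], files[-1]
```

The reference file then feeds `run_monitoring` as the baseline and the newest file as the current window.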

dags/retraining_dag.py

Lines changed: 14 additions & 0 deletions
@@ -5,6 +5,20 @@
 import sys
 import os

+"""
+DAG: weekly_model_retraining
+
+This Airflow DAG retrains the Isolation Forest model every week using the latest feature data.
+
+Tasks:
+- 🧠 Load the most recent actor features from data/features/
+- 🛠️ Train an Isolation Forest model on the engineered features
+- 🗃️ Log model metrics and artifacts to MLflow
+- 💾 Save the model and anomaly scores in models/ and data/features/
+
+This DAG keeps the anomaly detection system up to date with evolving GitHub activity patterns.
+"""
+
 # Ensure github_pipeline is in the PYTHONPATH
 sys.path.append(
     os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "github_pipeline"))

github_pipeline/cleanup.py

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ def run_cleanup(verbose: bool = True):
     - Last 2 timestamps
     - The timestamp used in latest model training (stored in models/last_trained.txt)
     """
+
     folders = [
         "data/raw",
         "data/processed",

github_pipeline/feature_engineering.py

Lines changed: 8 additions & 0 deletions
@@ -8,6 +8,14 @@


 def run_feature_engineering(timestamp: str):
+    """
+    Generates actor-level hourly features from GitHub event logs
+    for a given timestamp and saves them as a Parquet file.
+
+    Input:  data/processed/{timestamp}.parquet
+    Output: data/features/actor_features_{timestamp}.parquet
+    """
+
     input_file = f"data/processed/{timestamp}.parquet"
     output_file = f"data/features/actor_features_{timestamp}.parquet"
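The docstring above only pins down the I/O paths. As an illustration of what actor-level aggregation might look like, here is a pandas sketch: `event_count` matches the feature list visible later in monitor.py, while the other aggregations and input column names are assumptions:

```python
import pandas as pd

def build_actor_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw GitHub events into one row per actor.
    Only 'event_count' is confirmed by this diff; the remaining
    columns are illustrative."""
    return (
        events.groupby("actor")
        .agg(
            event_count=("event_type", "size"),          # total events per actor
            unique_event_types=("event_type", "nunique"),
            unique_repos=("repo", "nunique"),
        )
        .reset_index()
    )
```

The real function would read the processed parquet, apply an aggregation like this, and write the result to data/features/.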

github_pipeline/inference.py

Lines changed: 7 additions & 0 deletions
@@ -9,6 +9,13 @@


 def run_inference(timestamp: str):
+    """
+    Runs anomaly inference on actor-level features using Isolation Forest.
+
+    Input:  data/features/actor_features_{timestamp}.parquet
+    Output: data/features/actor_predictions_{timestamp}.parquet
+    """
+
     print(f"[INFO] Running inference for timestamp: {timestamp}")

     input_file = f"data/features/actor_features_{timestamp}.parquet"
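The scoring step run_inference performs can be sketched with scikit-learn as follows. `score_actors` is a hypothetical helper: in the real module, the model and scaler would be loaded from models/ and the features read from the parquet file, whereas here they are passed in directly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def score_actors(model: IsolationForest, scaler: StandardScaler, X: np.ndarray):
    """Scale features and return (anomaly_score, is_anomaly) arrays."""
    X_scaled = scaler.transform(X)
    scores = model.decision_function(X_scaled)   # lower = more anomalous
    flags = model.predict(X_scaled) == -1        # sklearn marks anomalies as -1
    return scores, flags
```

The flags column is what the prediction parquet would carry forward to the monitoring DAG's anomaly-spike check.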

github_pipeline/monitor.py

Lines changed: 13 additions & 0 deletions
@@ -13,6 +13,19 @@


 def run_monitoring(reference_ts: str, current_ts: str):
+    """
+    Checks for data drift (via Evidently) and anomaly spikes between two timestamps.
+    Triggers Slack and Email alerts if drift or anomalies are detected.
+
+    Inputs:
+    - actor_features_{reference_ts}.parquet
+    - actor_features_{current_ts}.parquet
+    - actor_predictions_{current_ts}.parquet (optional for anomaly alert)
+
+    Outputs:
+    - JSON and HTML drift reports in the 'reports/' directory
+    """
+
     # === Feature Columns ===
     feature_cols = [
         "event_count",

github_pipeline/train_model.py

Lines changed: 16 additions & 0 deletions
@@ -12,6 +12,22 @@


 def run_training(timestamp: str):
+    """
+    Trains an Isolation Forest model for anomaly detection using actor features for a given timestamp.
+
+    Steps:
+    - Loads the feature parquet file.
+    - Scales the data and trains the model.
+    - Computes anomaly scores and flags.
+    - Logs parameters, metrics, and artifacts to MLflow.
+    - Saves the model and prediction results to disk and the MLflow registry.
+
+    Outputs:
+    - models/isolation_forest.pkl
+    - models/last_trained.txt
+    - data/features/actor_predictions_{timestamp}.parquet
+    """
+
     print("[INFO] Starting model training...")

     input_file = f"data/features/actor_features_{timestamp}.parquet"
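The steps listed in the docstring can be condensed into a sketch. Hyperparameters here are illustrative, and the MLflow/joblib persistence calls are indicated only as comments, since their exact arguments are outside this hunk:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def train_isolation_forest(X: np.ndarray, contamination: float = 0.05):
    """Scale features, fit an Isolation Forest, and return the fitted
    scaler, model, per-row anomaly scores, and boolean anomaly flags.
    Illustrative sketch, not the actual run_training body."""
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)
    model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    ).fit(X_scaled)
    scores = model.decision_function(X_scaled)
    flags = model.predict(X_scaled) == -1
    # In run_training, roughly (exact calls/paths not shown in this diff):
    #   mlflow.log_param("contamination", contamination)
    #   mlflow.log_metric("anomaly_rate", flags.mean())
    #   joblib.dump(model, "models/isolation_forest.pkl")
    return scaler, model, scores, flags
```

With a fixed contamination, roughly that fraction of training rows is flagged, which gives the retraining DAG a stable anomaly rate to log to MLflow.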

scripts/.gitkeep

Whitespace-only changes.
