11# 🛠️ GitHub Anomaly Detection Pipeline
22
3+ ## 💡 Motivation & Use Case
4+
5+ GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real time is essential for identifying unusual or malicious behavior, such as bots, misuse, or suspicious spikes in contributions.
6+
7+ This project aims to build a **production-grade anomaly detection system** to:
8+
9+ - Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
10+ - Alert maintainers and admins in real time via Slack or email
11+ - Serve anomaly scores via API and support continuous retraining
12+ - Visualize trends, drift, and recent activity using an interactive dashboard
13+
14+ ---
15+
316A production-grade anomaly detection system for GitHub user behavior using:
417
518- **Apache Airflow** for orchestration
619- **Pandas + Scikit-learn (Isolation Forest)** for modeling and anomaly detection
720- **Alerts: Email & Slack** alerting mechanisms for anomaly spikes and data drift
821- **FastAPI** for real-time inference
922- **Pytest, Black, Flake8** for testing and linting
10- - **Pre-commit + GitHub Actions** for CI/CD and code quality
11- - **Streamlit UI** for visualization
23+ - **Pre-commit + GitHub Actions** for CI/CD and code quality
24+ - **Streamlit UI** for visualization
25+ - **Terraform** for infrastructure-as-code provisioning (MLflow)
1226
1327---
1428
15- ## 📦 Project Structure
16-
17- # To Do
29+ ## 🤖 Too lazy for copy-pasting commands?
1830
31+ If you're like me and hate typing out commands... good news!
32+ Just use the **Makefile** to do all the boring stuff for you:
1933
20- ---
34+ ``` bash
35+ make help
36+ ```
2137
22- ## 📈 Use Case
38+ See the full Makefile usage [here](#makefile-usage): from setup to linting, testing, API, Airflow, and Terraform infra!
2339
24- The pipeline detects anomalies in GitHub user behavior on an hourly basis and can:
40+ ## 📦 Project Structure
2541
26- - Alert on suspicious activity (e.g., bot-like behavior)
27- - Serve anomaly scores via API
28- - Continuously retrain and monitor model health
42+ # Coming Soon
2943
3044---
3145
32- ## ⚙️ Setup
46+ ## ⚙️ Setup Instructions
3347
3448### 1. Clone and install dependencies
3549
@@ -45,6 +59,34 @@ pipenv shell
4559pip install -r requirements.txt
4660```
4761
62+ ### 📄 .env Configuration (Required)
63+
64+ Before running Airflow, you must create a `.env` file in the project root with at least this line:
65+
66+ ``` env
67+ AIRFLOW_UID=50000
68+ ```
69+
70+ This is required for Docker to set correct permissions inside the Airflow containers.
71+
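As a rough illustration of this step, the sketch below writes a minimal `.env` from Python. The helper name `write_env_file` is hypothetical and not part of this repository, and the `make create-env` target may well work differently. It leaves an existing file untouched so that alert credentials already stored there are not lost.

```python
from pathlib import Path


def write_env_file(path: str = ".env", airflow_uid: int = 50000) -> bool:
    """Create a minimal .env file; return False if one already exists.

    Hypothetical helper for illustration only (not the project's own code).
    """
    env_path = Path(path)
    if env_path.exists():
        # Never overwrite a file that may already hold alert credentials.
        return False
    # AIRFLOW_UID is the one variable the README marks as required.
    env_path.write_text(f"AIRFLOW_UID={airflow_uid}\n", encoding="utf-8")
    return True
```

Calling `write_env_file()` from the project root would create the file on the first run and report `False` on later runs.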
72+ #### Optional (For Email & Slack Alerts)
73+
74+ If you'd like to enable alerts, you can also include the following variables:
75+
76+ ``` env
77+ # Slack Alerts
78+ SLACK_API_TOKEN=xoxb-...
79+ SLACK_CHANNEL=#your-channel
80+
81+ # Email Alerts
82+ EMAIL_SENDER=your_email@example.com
83+ EMAIL_PASSWORD=your_email_app_password
84+ EMAIL_RECEIVER=receiver@example.com
85+ EMAIL_SMTP=smtp.gmail.com
86+ EMAIL_PORT=587
87+ ```
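To show how the `EMAIL_*` variables above might be consumed, here is a minimal sketch using only the Python standard library. The function names `build_alert_email` and `send_alert_email` are assumptions for illustration; the project's actual alerting code is not shown in this diff and may differ. The sketch also assumes the `.env` values have already been loaded into the process environment (for example by Docker Compose).

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(subject: str, body: str, env=os.environ) -> EmailMessage:
    """Build an alert message from the EMAIL_* variables (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = env["EMAIL_SENDER"]
    msg["To"] = env["EMAIL_RECEIVER"]
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert_email(msg: EmailMessage, env=os.environ) -> None:
    """Send the message over SMTP, upgrading the connection with STARTTLS."""
    host = env["EMAIL_SMTP"]
    port = int(env.get("EMAIL_PORT", "587"))  # 587 matches the example above
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(env["EMAIL_SENDER"], env["EMAIL_PASSWORD"])
        server.send_message(msg)
```

Keeping message construction separate from sending makes the first function easy to test without a mail server; passing a plain dictionary as `env` works for that purpose.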
88+ ---
89+
4890### 2. ⚙️ Airflow + 📈 MLflow Integration
4991
5092This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.
@@ -317,17 +359,7 @@ This removes the MLflow container provisioned by Terraform.
317359
318360### 7. 🧭 Architecture
319361
320- To Do
321-
322- [ GitHub Archive Logs]
323- ↓
324- [ Airflow DAG]
325- ↓
326- [ Feature Engineering]
327- ↓
328- [ Isolation Forest Model]
329- ↓ ↘
330- [ API: FastAPI] [ Alerts / Drift Monitor]
362+ ![Architecture](assets/architecture.png)
331363
332364### 8. 🧹 Clean Code
333365
@@ -351,6 +383,7 @@ make lint
351383
352384``` bash
353385make install # Install all dependencies via Pipenv (both runtime and dev)
386+ make create-env # Create .env file with required AIRFLOW_UID and alert config placeholders
354387make clean # Remove all __pycache__ folders and .pyc files
355388```
356389