ilearncoding1/AI-CYBERSECURITY

AI Cybersecurity: Phishing Email Detection (NLP + IMAP)

A practical machine learning project that detects phishing/spam emails using classic NLP techniques. Train a classifier from a labeled dataset, save reusable artifacts, then test predictions on real mailbox messages via IMAP.


Overview

This repository provides an end-to-end baseline phishing/spam detection pipeline:

  1. Train a text classifier on a labeled dataset (ham vs spam)
  2. Evaluate performance (accuracy + classification report)
  3. Persist the trained model and TF‑IDF vectorizer for reuse
  4. Connect via IMAP to fetch recent emails and predict whether each message looks like phishing/spam or safe

Model: TF‑IDF (unigrams + bigrams) → Multinomial Naive Bayes


Key Features

  • Baseline phishing/spam classifier trained from emails_dataset.csv
  • TF‑IDF vectorization with unigrams and bigrams for better phrase detection
  • Multinomial Naive Bayes (fast, strong baseline for text)
  • Saves artifacts:
    • phishing_model.joblib
    • vectorizer.joblib
  • IMAP mailbox testing (inbox + spam/junk folders when available)
  • Clear workflow for retraining as you add new labeled examples

Tech Stack

  • Python 3.9+
  • pandas
  • scikit-learn
  • joblib

Project Structure

.
├── emails_dataset.csv
├── train_with_email_dataset.py
├── fetch_and_test_emails.py
├── phishing_model.joblib
└── vectorizer.joblib
  • emails_dataset.csv: Training dataset with columns: label, text
  • train_with_email_dataset.py: Training + evaluation script; saves model artifacts
  • fetch_and_test_emails.py: Fetches email via IMAP and runs predictions
  • phishing_model.joblib: Saved trained model (generated after training)
  • vectorizer.joblib: Saved TF‑IDF vectorizer (generated after training)

Getting Started

1) Clone the repo

git clone <your-repo-url>
cd <your-repo-folder>

2) Create and activate a virtual environment (recommended)

macOS/Linux

python -m venv .venv
source .venv/bin/activate

Windows (PowerShell)

python -m venv .venv
.\.venv\Scripts\Activate.ps1

3) Install dependencies

pip install pandas scikit-learn joblib

Dataset Format

Your emails_dataset.csv must contain:

  • label: ham or spam
  • text: full email content (or a meaningful excerpt)

Example:

label | text
------|-----
ham   | Team meeting moved to 3 PM.
spam  | Urgent: Verify your account now to avoid suspension.

Notes

  • Keep labels consistent: only ham and spam.
  • Dataset quality (and variety) strongly affects performance.
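
A quick check of the dataset before training can catch label typos early. The sketch below assumes a comma-separated file with a header row. `load_dataset` and `ALLOWED_LABELS` are hypothetical names and are not part of the repository's scripts.

```python
import io

import pandas as pd

ALLOWED_LABELS = {"ham", "spam"}


def load_dataset(source):
    """Load a labeled dataset and verify its columns and label values."""
    df = pd.read_csv(source)
    missing = {"label", "text"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Normalize labels so that "Spam " and "spam" are treated alike.
    df["label"] = df["label"].astype(str).str.strip().str.lower()
    unexpected = set(df["label"]) - ALLOWED_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    # Rows without text cannot be vectorized, so drop them.
    return df.dropna(subset=["text"])


# In-memory stand-in for emails_dataset.csv, using the example rows above.
sample = io.StringIO(
    "label,text\n"
    "ham,Team meeting moved to 3 PM.\n"
    'spam,"Urgent: Verify your account now to avoid suspension."\n'
)
df = load_dataset(sample)
print(df["label"].value_counts())
```

Passing a file path such as "emails_dataset.csv" instead of the in-memory sample should work the same way. The printed label counts also give a first look at class balance.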

Train the Model

Run:

python train_with_email_dataset.py

This script will:

  • preprocess/clean text (as implemented in the script)
  • split dataset into train/test sets
  • train the classifier
  • print metrics (accuracy + classification report)
  • save artifacts:
    • phishing_model.joblib
    • vectorizer.joblib
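
The steps listed above might look roughly like the sketch below. It is not the repository's training script: `train_and_save`, the 75/25 split, and the toy data are all assumptions made for illustration. The demo writes its artifacts to a temporary directory so it does not overwrite real ones.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_and_save(df, out_dir="."):
    """Split, train, report metrics, and persist the model and vectorizer."""
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.25, random_state=42, stratify=df["label"]
    )
    # Fit the vectorizer on training text only, to avoid leaking test data.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    model = MultinomialNB()
    model.fit(vectorizer.fit_transform(X_train), y_train)

    y_pred = model.predict(vectorizer.transform(X_test))
    print("accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))

    model_path = os.path.join(out_dir, "phishing_model.joblib")
    vectorizer_path = os.path.join(out_dir, "vectorizer.joblib")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    return model_path, vectorizer_path


# Toy data invented for illustration; real training reads emails_dataset.csv.
toy = pd.DataFrame({
    "label": ["ham"] * 4 + ["spam"] * 4,
    "text": [
        "Team meeting moved to 3 PM",
        "Lunch tomorrow at noon?",
        "Here are the notes from today's call",
        "Can you review my pull request?",
        "Urgent: verify your account now to avoid suspension",
        "You won a prize, claim it now",
        "Your password expires today, click here",
        "Confirm your bank details immediately",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    model_path, vectorizer_path = train_and_save(toy, out_dir=tmp)
    # Reload both artifacts, as a separate prediction script would.
    reloaded_model = joblib.load(model_path)
    reloaded_vectorizer = joblib.load(vectorizer_path)
    reloaded_prediction = reloaded_model.predict(
        reloaded_vectorizer.transform(["Verify your account now"])
    )[0]
print(reloaded_prediction)
```

Both artifacts have to be saved because prediction needs the exact vocabulary and weights the vectorizer learned during training.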

Test Real Emails via IMAP

Run:

python fetch_and_test_emails.py

You will be prompted for:

  • Email address
  • Password (use an app password when required by your provider)
  • IMAP server (example: imap.gmail.com)

The script will fetch recent messages (inbox and spam/junk where available) and print a prediction for each email.

Common IMAP servers

  • Gmail: imap.gmail.com
  • Outlook/Office365: outlook.office365.com
  • Yahoo: imap.mail.yahoo.com

Folder names vary by provider (e.g., Spam, Junk, Bulk Mail). If your provider uses different names, you may need to adjust the script.
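
For orientation, here is a minimal sketch of how fetching and classifying could work with the standard-library imaplib and email modules. `extract_text`, `fetch_recent`, and `classify` are hypothetical helpers and may not match fetch_and_test_emails.py. Only `extract_text` runs in the demo, because the other two need a live mailbox or trained artifacts.

```python
import email
import imaplib
from email import policy
from email.message import EmailMessage


def extract_text(raw_bytes):
    """Return the subject plus plain-text body of a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = msg["subject"] or ""
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return f"{subject}\n{body}".strip()


def fetch_recent(host, user, password, folder="INBOX", limit=10):
    """Yield raw bytes for the newest `limit` messages in one folder."""
    # Folder names containing spaces typically need surrounding double quotes.
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        status, _ = conn.select(folder, readonly=True)  # read-only: no flag changes
        if status != "OK":
            return  # folder does not exist for this provider
        status, data = conn.search(None, "ALL")
        if status != "OK":
            return
        for num in data[0].split()[-limit:]:
            status, parts = conn.fetch(num, "(RFC822)")
            if status == "OK" and parts and isinstance(parts[0], tuple):
                yield parts[0][1]


def classify(text, model, vectorizer):
    """Predict a label for one message with previously loaded artifacts."""
    return model.predict(vectorizer.transform([text]))[0]


# Offline demo: build a message locally instead of contacting a server.
sample = EmailMessage()
sample["Subject"] = "Urgent: Verify your account"
sample["From"] = "alerts@example.com"
sample.set_content("Click the link to avoid suspension.")
text = extract_text(sample.as_bytes())
print(text)
```

Opening the folder with readonly=True is a deliberate choice here, so that scanning a mailbox does not mark messages as read.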


Security & Privacy Notes

  • Never commit credentials (email/password/app password) into this repository.
  • Prefer environment variables or a local secrets file that is excluded by .gitignore.
  • If credentials were ever exposed, rotate them immediately.
  • Treat downloaded email content as sensitive data (avoid logging/storing unnecessarily).
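
One way to follow the environment-variable advice is sketched below. The variable names (PHISH_EMAIL, PHISH_EMAIL_PASSWORD, PHISH_IMAP_SERVER) and `load_credentials` are hypothetical; the repository's script prompts interactively and may not read any of them.

```python
import getpass
import os


def load_credentials(env=os.environ):
    """Read IMAP settings from the environment, prompting only for gaps."""
    address = env.get("PHISH_EMAIL") or input("Email address: ")
    server = env.get("PHISH_IMAP_SERVER") or input("IMAP server: ")
    # getpass does not echo the password to the terminal while it is typed.
    password = env.get("PHISH_EMAIL_PASSWORD") or getpass.getpass("Password: ")
    return address, password, server


# Demo with a fake environment, so nothing is prompted and no secret is used.
fake_env = {
    "PHISH_EMAIL": "user@example.com",
    "PHISH_EMAIL_PASSWORD": "app-password-placeholder",
    "PHISH_IMAP_SERVER": "imap.example.com",
}
address, password, server = load_credentials(fake_env)
print(address, server)  # the password is deliberately not printed
```

Because the values come from the environment or a prompt, no secret has to appear in source files or shell history.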

Typical Workflow

  1. Add new labeled examples to emails_dataset.csv
  2. Retrain:
    python train_with_email_dataset.py
  3. Re-test your mailbox:
    python fetch_and_test_emails.py

Limitations

  • Baseline text-only model (no URL reputation, domain checks, header analysis, sender reputation, attachment scanning, etc.)
  • Performance depends on dataset quality, size, language coverage, and class balance
  • IMAP providers/folder conventions vary; some accounts restrict IMAP access by default

Roadmap / Future Improvements

  • Expand and diversify labeled dataset
  • Try additional models:
    • Logistic Regression, Linear SVM
    • transformer-based classifiers (e.g., BERT variants)
  • Add explanations (top contributing words/features per prediction)
  • Add URL extraction + lightweight safety checks
  • Build a small web UI / API for easier use
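
As a possible starting point for the URL extraction item, the sketch below pulls links out of message text and applies a few simple heuristics. The regular expression, the flag names, and both functions are illustrative assumptions. They are not a reputation check and will miss many real attacks.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Matches http(s) links up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_urls(text):
    """Return the http(s) URLs in text, trimming trailing punctuation."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


def url_flags(url):
    """Return a list of simple warning flags for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    flags = []
    if parts.scheme.lower() != "https":
        flags.append("not-https")
    try:
        ipaddress.ip_address(host)
        flags.append("ip-address-host")  # raw IP instead of a domain name
    except ValueError:
        pass
    if host.startswith("xn--") or ".xn--" in host:
        flags.append("punycode-host")  # possible look-alike domain
    if "@" in parts.netloc:
        flags.append("userinfo-in-url")  # text before "@" can disguise the host
    return flags


message = "Verify at http://192.168.10.5/login now, or visit https://example.com."
urls = extract_urls(message)
report = {url: url_flags(url) for url in urls}
print(report)
```

Flags like these could later be added as extra features next to the TF‑IDF text features, or simply shown beside each prediction.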

License

Add your preferred license before publishing publicly (e.g., MIT).

Suggested: create a LICENSE file and include the license name here.

About

πŸ›‘οΈ AI-powered phishing email detection using Python, TF-IDF & Naive Bayes. Train a classifier on labeled email data and test it on real inbox messages via IMAP. Built for the EFM Cybersecurity program.
