A practical machine learning project that detects phishing/spam emails using classic NLP techniques. Train a classifier from a labeled dataset, save reusable artifacts, then test predictions on real mailbox messages via IMAP.
- Overview
- Key Features
- Tech Stack
- Project Structure
- Getting Started
- Dataset Format
- Train the Model
- Test Real Emails via IMAP
- Security & Privacy Notes
- Typical Workflow
- Limitations
- Roadmap / Future Improvements
- License
This repository provides an end-to-end baseline phishing/spam detection pipeline:
- Train a text classifier on a labeled dataset (`ham` vs `spam`)
- Evaluate performance (accuracy + classification report)
- Persist the trained model and TF-IDF vectorizer for reuse
- Connect via IMAP to fetch recent emails and predict whether each message looks like phishing/spam or safe
Model: TF-IDF (unigrams + bigrams) → Multinomial Naive Bayes
- Baseline phishing/spam classifier trained from `emails_dataset.csv`
- TF-IDF vectorization with unigrams and bigrams for better phrase detection
- Multinomial Naive Bayes (fast, strong baseline for text)
- Saves artifacts:
  - `phishing_model.joblib`
  - `vectorizer.joblib`
- IMAP mailbox testing (inbox + spam/junk folders when available)
- Clear workflow for retraining as you add new labeled examples
- Python 3.9+
- pandas
- scikit-learn
- joblib
.
├── emails_dataset.csv
├── train_with_email_dataset.py
├── fetch_and_test_emails.py
├── phishing_model.joblib
└── vectorizer.joblib
- `emails_dataset.csv`: Training dataset with columns `label`, `text`
- `train_with_email_dataset.py`: Training + evaluation script; saves model artifacts
- `fetch_and_test_emails.py`: Fetches email via IMAP and runs predictions
- `phishing_model.joblib`: Saved trained model (generated after training)
- `vectorizer.joblib`: Saved TF-IDF vectorizer (generated after training)
git clone <your-repo-url>
cd <your-repo-folder>
macOS/Linux:
python -m venv .venv
source .venv/bin/activate
Windows (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
Install dependencies:
pip install pandas scikit-learn joblib
Your `emails_dataset.csv` must contain:
- `label`: `ham` or `spam`
- `text`: full email content (or a meaningful excerpt)
Example:
| label | text |
|---|---|
| ham | Team meeting moved to 3 PM. |
| spam | Urgent: Verify your account now to avoid suspension. |
Notes
- Keep labels consistent: only `ham` and `spam`.
- Dataset quality (and variety) strongly affects performance.
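
A quick check of the format described above can catch label typos before training. The following is a minimal sketch using only the standard library; the helper name `validate_rows` and the decision to treat labels as case-sensitive are assumptions, not part of the project's scripts.

```python
import csv
import io

ALLOWED_LABELS = {"ham", "spam"}


def validate_rows(csv_file):
    """Return a list of problems found in a label,text CSV file object."""
    reader = csv.DictReader(csv_file)
    missing = {"label", "text"} - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {sorted(missing)}"]
    problems = []
    # Row 1 is the header, so data rows are counted from 2.
    for row_number, row in enumerate(reader, start=2):
        label = (row["label"] or "").strip()
        text = (row["text"] or "").strip()
        if label not in ALLOWED_LABELS:
            problems.append(f"line {row_number}: unexpected label {label!r}")
        if not text:
            problems.append(f"line {row_number}: empty text")
    return problems


# In-memory sample so the sketch runs without the real dataset.
sample = io.StringIO(
    "label,text\n"
    "ham,Team meeting moved to 3 PM.\n"
    "Spam,Urgent: Verify your account now to avoid suspension.\n"
)
print(validate_rows(sample))
```

To check the real file, open it with `open("emails_dataset.csv", newline="", encoding="utf-8")` and pass the file object to the helper; the UTF-8 encoding is an assumption about how the dataset was saved.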
Run:
python train_with_email_dataset.py
This script will:
- preprocess/clean text (as implemented in the script)
- split dataset into train/test sets
- train the classifier
- print metrics (accuracy + classification report)
- save artifacts:
  - `phishing_model.joblib`
  - `vectorizer.joblib`
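
The steps above can be sketched roughly as follows. This is not the actual contents of `train_with_email_dataset.py`: the function name `train_and_save`, the lowercase-and-strip cleaning, the 25% test split, and the random seed are all assumptions chosen for illustration, and the toy dataset exists only so the sketch runs on its own.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_and_save(df, output_dir, test_size=0.25, random_state=42):
    """Train TF-IDF + MultinomialNB on a label/text DataFrame and save both artifacts."""
    texts = df["text"].astype(str).str.lower().str.strip()  # assumed minimal cleaning
    x_train, x_test, y_train, y_test = train_test_split(
        texts, df["label"], test_size=test_size,
        random_state=random_state, stratify=df["label"],
    )
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    model = MultinomialNB()
    # Fit the vectorizer on training text only, then reuse it for the test split.
    model.fit(vectorizer.fit_transform(x_train), y_train)

    predictions = model.predict(vectorizer.transform(x_test))
    print("accuracy:", accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions, zero_division=0))

    model_path = os.path.join(output_dir, "phishing_model.joblib")
    vectorizer_path = os.path.join(output_dir, "vectorizer.joblib")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    return model_path, vectorizer_path


# Tiny made-up dataset; real training would use pd.read_csv("emails_dataset.csv").
toy = pd.DataFrame({
    "label": ["ham"] * 4 + ["spam"] * 4,
    "text": [
        "Team meeting moved to 3 PM.",
        "Here are the notes from today's call.",
        "Can you review my pull request?",
        "Lunch tomorrow with the project team?",
        "Urgent: Verify your account now to avoid suspension.",
        "You won a prize, click here to claim it.",
        "Your password expires today, verify your account now.",
        "Claim your free gift card now.",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    model_path, vectorizer_path = train_and_save(toy, tmp)
    # Reload both artifacts, as a separate prediction script would.
    reloaded_model = joblib.load(model_path)
    reloaded_vectorizer = joblib.load(vectorizer_path)
    result = reloaded_model.predict(
        reloaded_vectorizer.transform(["verify your account now"])
    )[0]
    print(result)
```

Saving the vectorizer alongside the model matters because new text must be transformed with the same vocabulary and weights the classifier was trained on.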
Run:
python fetch_and_test_emails.py
You will be prompted for:
- Email address
- Password (use an app password when required by your provider)
- IMAP server (example: `imap.gmail.com`)
The script will fetch recent messages (inbox and spam/junk where available) and print a prediction for each email.
Common IMAP servers
- Gmail: `imap.gmail.com`
- Outlook/Office365: `outlook.office365.com`
- Yahoo: `imap.mail.yahoo.com`
Folder names vary by provider (e.g., `Spam`, `Junk`, `Bulk Mail`). If your provider uses different names, you may need to adjust the script.
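
For orientation, fetching recent messages from one folder and turning each into classifier input might look like the sketch below. It uses only the standard library (`imaplib`, `email`) and is not the actual contents of `fetch_and_test_emails.py`; the function names `extract_text` and `fetch_recent`, the read-only folder selection, and the choice to join subject and body are assumptions.

```python
import email
import imaplib
from email import policy
from email.message import EmailMessage


def extract_text(raw_bytes):
    """Return subject plus body text from raw RFC 822 bytes."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body_part = message.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    subject = message["subject"] or ""
    return f"{subject}\n{body}".strip()


def fetch_recent(host, user, password, folder="INBOX", limit=10):
    """Fetch up to `limit` of the latest messages from one folder, read-only."""
    texts = []
    with imaplib.IMAP4_SSL(host) as connection:
        connection.login(user, password)
        status, _ = connection.select(folder, readonly=True)
        if status != "OK":
            return texts  # folder missing or not selectable on this provider
        status, data = connection.search(None, "ALL")
        if status != "OK":
            return texts
        for message_id in data[0].split()[-limit:]:
            status, parts = connection.fetch(message_id, "(RFC822)")
            if status == "OK" and parts and isinstance(parts[0], tuple):
                texts.append(extract_text(parts[0][1]))
    return texts


# Offline demo: build a message locally instead of contacting a server.
demo = EmailMessage()
demo["Subject"] = "Verify your account"
demo.set_content("Click the link now.")
print(extract_text(demo.as_bytes()))
```

To cover spam folders, `fetch_recent` could be called once per candidate name (for example `Spam` or `Junk`), skipping any that return nothing. Folder names containing spaces, such as `Bulk Mail`, may need to be passed wrapped in double quotes.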
- Never commit credentials (email/password/app password) into this repository.
- Prefer environment variables or a local secrets file that is excluded by `.gitignore`.
- If credentials were exposed at any point: rotate them immediately.
- Treat downloaded email content as sensitive data (avoid logging/storing unnecessarily).
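
One way to follow the environment-variable advice above is sketched here. The variable names `IMAP_HOST`, `IMAP_USER`, and `IMAP_PASSWORD` and the helper `load_credentials` are hypothetical; the project's script prompts interactively and may not read these variables.

```python
import getpass
import os


def load_credentials(environ=None, ask=input, ask_secret=getpass.getpass):
    """Read IMAP settings from the environment, prompting only for missing values."""
    environ = os.environ if environ is None else environ
    host = environ.get("IMAP_HOST") or ask("IMAP server: ")
    user = environ.get("IMAP_USER") or ask("Email address: ")
    # getpass hides the typed password instead of echoing it to the terminal.
    password = environ.get("IMAP_PASSWORD") or ask_secret("Password or app password: ")
    return host, user, password


# Demo with a plain dict standing in for the real environment.
host, user, _ = load_credentials({
    "IMAP_HOST": "imap.gmail.com",
    "IMAP_USER": "me@example.com",
    "IMAP_PASSWORD": "not-a-real-password",
})
print(host, user)  # the password is deliberately never printed
```

Keeping secrets out of the source file means nothing sensitive ends up in version control, and nothing needs rotating after a push.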
- Add new labeled examples to `emails_dataset.csv`
- Retrain:
python train_with_email_dataset.py
- Re-test your mailbox:
python fetch_and_test_emails.py
- Baseline text-only model (no URL reputation, domain checks, header analysis, sender reputation, attachment scanning, etc.)
- Performance depends on dataset quality, size, language coverage, and class balance
- IMAP providers/folder conventions vary; some accounts restrict IMAP access by default
- Expand and diversify labeled dataset
- Try additional models:
- Logistic Regression, Linear SVM
- transformer-based classifiers (e.g., BERT variants)
- Add explanations (top contributing words/features per prediction)
- Add URL extraction + lightweight safety checks
- Build a small web UI / API for easier use
Add your preferred license before publishing publicly (e.g., MIT).
Suggested: create a `LICENSE` file and include the license name here.