Thank you for your interest in contributing to NLP-Email-Categorizer! We welcome contributions from the community, including bug reports, feature requests, code improvements, and documentation enhancements. This guide outlines how to get involved and help improve this open-source text classification pipeline.
All contributors are expected to adhere to the Code of Conduct. This ensures a respectful and inclusive environment for everyone involved in the project.
If you encounter a bug in NLP-Email-Categorizer:
- Check Existing Issues: Search the Issues page to see if the bug has already been reported.
- Create a New Issue: If the bug is new, open a new issue and provide:
- A clear title and description of the bug.
- Steps to reproduce the issue (e.g., specific dataset, notebook cell).
- Expected and actual behavior.
- Screenshots, error logs, or stack traces, if applicable.
- Your environment (e.g., Python version, OS, Jupyter version).
- Use the Bug Report Template: Follow the template provided in the issue creation form for consistency.
Example Bug Report:
- Title: "GUI Prediction Fails with Empty Input"
- Description: Entering an empty subject in the GUI causes an error.
- Steps: Open Naive_Bayes_Text_Classification.ipynb, run all cells, input "" in GUI, click Predict.
- Expected: Error message for empty input.
- Actual: ValueError in vectorizer.
- Environment: Python 3.8, Jupyter Notebook 6.4, Windows 11.
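A bug like the one in this example report could be addressed by checking the subject before it reaches the vectorizer. The sketch below is only an illustration of that idea: `predict_subject`, `fake_classify`, and the message text are hypothetical names chosen for this guide, not functions from the notebooks.

```python
def predict_subject(subject, classify):
    """Return a category for `subject`, or a friendly message if it is empty.

    `classify` is a hypothetical callable standing in for the notebook's
    vectorizer + model prediction step.
    """
    # Treat None, "" and whitespace-only strings as empty input.
    if subject is None or not subject.strip():
        return "Please enter a non-empty email subject."
    return classify(subject.strip())


def fake_classify(text):
    """Stand-in classifier used only for this demonstration."""
    return "Work" if "meeting" in text.lower() else "Other"


print(predict_subject("", fake_classify))
print(predict_subject("Team meeting at 3pm", fake_classify))
```

Including a small reproduction like this in a bug report, when possible, makes the expected behavior easier for maintainers to confirm.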
We welcome ideas to enhance NLP-Email-Categorizer! To suggest a feature:
- Check Existing Requests: Review the Issues page to avoid duplicates.
- Submit a Feature Request: Open a new issue and include:
- A clear title and detailed description of the feature.
- The problem it solves or the benefit it provides (e.g., adds TF-IDF, improves GUI).
- Any relevant examples, code snippets, or references to similar tools.
- Use the Feature Request Template: Follow the provided template to structure your suggestion.
Example Feature Request:
- Title: "Add TF-IDF Vectorizer Option"
- Description: Include TF-IDF as an alternative to CountVectorizer for feature extraction.
- Benefit: May improve classification by giving more weight to distinctive terms and less to words that appear in most subjects.
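A short code snippet can make a request like this one more concrete. The sketch below uses scikit-learn's `CountVectorizer` and `TfidfVectorizer` with their default settings; the three subjects are made up for illustration, and this is not how the notebooks are currently wired.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up subjects: "meeting" appears in every one, the other words only once.
subjects = [
    "meeting schedule update",
    "meeting invoice attached",
    "meeting discount offer",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(subjects)
tfidf = tfidf_vec.fit_transform(subjects)

# Compare how the two vectorizers score words in the first subject.
count_row = counts[0].toarray().ravel()
tfidf_row = tfidf[0].toarray().ravel()

meeting_count = count_row[count_vec.vocabulary_["meeting"]]
schedule_count = count_row[count_vec.vocabulary_["schedule"]]
meeting_weight = tfidf_row[tfidf_vec.vocabulary_["meeting"]]
schedule_weight = tfidf_row[tfidf_vec.vocabulary_["schedule"]]

print("raw counts:", meeting_count, schedule_count)
print("tf-idf weights:", round(meeting_weight, 3), round(schedule_weight, 3))
```

With raw counts both words score the same, while TF-IDF gives the word shared by every subject a lower weight than the word unique to one subject.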
To contribute code or documentation:
- Fork the Repository:
- Fork the NLP-Email-Categorizer repository.
- Clone your fork to your local machine:
git clone https://github.com/YOUR_USERNAME/NLP-Email-Categorizer.git
- Create a Branch:
- Create a new branch for your changes:
git checkout -b feature/your-feature-name
- Use descriptive branch names (e.g., fix/gui-error, feature/tfidf-vectorizer).
- Make Changes:
- Modify the notebooks or add new files as needed.
- Follow the Code Style Guidelines below.
- Test your changes in a Jupyter environment (local or Colab).
- Commit Changes:
- Write clear, concise commit messages:
git commit -m "Add TF-IDF vectorizer option to advanced notebook"
- Reference related issues (e.g., Fixes #123).
- Push and Create a Pull Request:
- Push your branch to your fork:
git push origin feature/your-feature-name
- Open a pull request (PR) against the main branch of the original repository.
- Use the PR template and provide:
- A description of the changes.
- The issue number(s) addressed (if any).
- Screenshots or outputs for notebook changes (e.g., new visualizations).
- Testing performed (e.g., environments tested, datasets used).
- Code Review:
- Respond to feedback from maintainers.
- Make requested changes and update your PR as needed.
- Your PR will be merged once approved.
To set up a development environment for NLP-Email-Categorizer:
- Prerequisites:
- Python 3.6+ installed.
- Jupyter Notebook or JupyterLab installed.
- Git for version control.
- A code editor (e.g., VS Code, PyCharm).
- Clone the Repository:
git clone https://github.com/VoxDroid/NLP-Email-Categorizer.git
cd NLP-Email-Categorizer
- Set Up a Virtual Environment:
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
- Install Dependencies:
pip install pandas numpy scikit-learn nltk matplotlib seaborn joblib ipywidgets
- Install NLTK Data:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
- Prepare a Test Dataset:
- Create a sample CSV or TSV file with Subject and Category columns (see Installation in README).
- Place it in the project directory.
- Run the Notebooks:
jupyter notebook
Open Naive_Bayes_Text_Classification.ipynb or Text_Classification_Pipeline_for_Email_Subjects.ipynb and test your changes.
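If you do not have a dataset at hand, a tiny one can be generated for the "Prepare a Test Dataset" step. The following is a minimal sketch using only the Python standard library; the rows, the `write_sample_dataset` helper, and the `sample_emails.csv` file name are invented for illustration, and only the Subject and Category column names come from this guide.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows for local testing only; not taken from any real dataset.
SAMPLE_ROWS = [
    ("Team meeting moved to 3pm", "Work"),
    ("Your invoice for March", "Finance"),
    ("50% off all shoes this weekend", "Promotions"),
    ("Password reset request", "Account"),
]


def write_sample_dataset(path):
    """Write a small CSV with Subject and Category columns and return its path."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Subject", "Category"])
        writer.writerows(SAMPLE_ROWS)
    return path


# Demo: write into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = write_sample_dataset(Path(tmp_dir) / "sample_emails.csv")
    with open(out_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(len(rows), "rows; columns:", list(rows[0].keys()))
```

To use it with the notebooks, call `write_sample_dataset` with a path inside the project directory instead of the temporary one.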
To maintain consistency in the codebase:
- Python:
- Follow PEP 8 guidelines (e.g., 4-space indentation, 79-character line limit).
- Use descriptive variable names (e.g., X_train_vectorized, preprocess_text).
- Add docstrings for functions and comments for complex logic.
- Example:
def preprocess_text(text):
    """Convert text to lowercase, remove punctuation, tokenize, and remove stopwords."""
    text = text.lower()
    # ...
- Notebooks:
- Use markdown cells for clear section headings and explanations.
- Keep code cells focused (e.g., one task per cell: loading data, training model).
- Include logging or print statements for user feedback.
- Use @title annotations for cell titles (as in original notebooks).
- File Structure:
- Keep notebooks in the root directory.
- Store datasets in a data/ folder (not tracked in Git).
- Save models and outputs in an outputs/ folder (not tracked).
- Reproducibility:
- Set random seeds (e.g., random_state=42) for consistent results.
- Document dataset requirements (e.g., column names, format).
- Dependencies:
- Use only the libraries specified in the notebooks unless adding new functionality.
- Update dependency installation instructions in the README if new libraries are added.
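To illustrate the reproducibility guideline, here is a small sketch of how a fixed `random_state` makes a train/test split repeatable. It uses scikit-learn's `train_test_split`; the data and the `split_subjects` helper are placeholders invented for this example and do not come from the notebooks.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 fake subjects with alternating labels.
subjects = [f"subject {i}" for i in range(20)]
labels = ["Work" if i % 2 == 0 else "Other" for i in range(20)]


def split_subjects(seed):
    """Split the placeholder data with a fixed seed.

    Returns X_train, X_test, y_train, y_test (lists, because the inputs are lists).
    """
    return train_test_split(subjects, labels, test_size=0.25, random_state=seed)


first = split_subjects(42)
second = split_subjects(42)

# With the same seed, the held-out subjects and labels are identical.
same_split = first[1] == second[1] and first[3] == second[3]
print("same test split with the same seed:", same_split)
```

Passing the same seed wherever a function accepts `random_state` lets reviewers re-run a notebook and compare their numbers with yours.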
Before submitting a pull request:
- Manual Testing:
- Run both notebooks end-to-end with a sample dataset.
- Verify data loading, preprocessing, training, evaluation, and prediction steps.
- Test the GUI in the advanced notebook with various inputs (e.g., valid, empty, long subjects).
- Check augmentation output for correctness (advanced notebook).
- Environment Testing:
- Test in multiple environments (e.g., local Jupyter, Google Colab).
- Verify compatibility with Python 3.6+ and listed dependencies.
- Edge Cases:
- Test with empty or malformed datasets (e.g., missing columns, NaN values).
- Test with unusual inputs in the GUI (e.g., special characters, very long text).
- Verify model performance with small or imbalanced datasets.
- Performance:
- Ensure preprocessing and training are efficient for datasets up to ~10,000 rows.
- Check visualization rendering (e.g., confusion matrix, bar plots).
- Reproducibility:
- Confirm results are consistent with the same dataset and random seed.
- Validate saved models load correctly and produce expected predictions.
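Some of the checks above can be scripted. The sketch below shows one possible approach, assuming pandas, scikit-learn, and joblib from the dependency list: a hypothetical `find_dataset_problems` helper flags missing columns and NaN values, and a small `CountVectorizer` plus `MultinomialNB` pipeline is saved and reloaded to confirm that predictions match. The helper, the sample rows, and the pipeline are illustrative and may differ from what the notebooks actually build.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

REQUIRED_COLUMNS = ("Subject", "Category")


def find_dataset_problems(df):
    """Return a list of problems found in df; an empty list means it looks usable."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # Row-level checks are not possible without the required columns.
        return [f"missing columns: {missing}"]
    problems = []
    if df.empty:
        problems.append("dataset has no rows")
    n_nan = int(df[list(REQUIRED_COLUMNS)].isna().any(axis=1).sum())
    if n_nan:
        problems.append(f"{n_nan} row(s) contain NaN values")
    return problems


# Made-up data: one usable frame and one malformed frame.
good = pd.DataFrame({
    "Subject": ["Team meeting at 3pm", "Invoice for March",
                "Project meeting notes", "Invoice overdue"],
    "Category": ["Work", "Finance", "Work", "Finance"],
})
bad = pd.DataFrame({"Subject": ["Hello", None]})

print(find_dataset_problems(good))
print(find_dataset_problems(bad))

# Save/load round trip: the reloaded model should predict the same labels.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(good["Subject"], good["Category"])

probe = ["meeting tomorrow", "invoice attached"]
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

same_predictions = list(model.predict(probe)) == list(reloaded.predict(probe))
print("reloaded model matches:", same_predictions)
```

A check of this kind is quick to run before opening a pull request and can be extended with the edge cases relevant to your change.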
Join the NLP-Email-Categorizer community:
- GitHub Discussions: Share ideas or ask questions in the Discussions section.
- Issues: Report bugs or suggest features on the Issues page.
- GitHub Stars: Show your support by starring the repository.
Thank you for contributing to NLP-Email-Categorizer! Your efforts help make this tool better for data scientists and NLP practitioners worldwide.