Contributing to NLP-Email-Categorizer

Thank you for your interest in contributing to NLP-Email-Categorizer! We welcome contributions from the community, including bug reports, feature requests, code improvements, and documentation enhancements. This guide outlines how to get involved and help improve this open-source text classification pipeline.

Code of Conduct

All contributors are expected to adhere to the Code of Conduct. This ensures a respectful and inclusive environment for everyone involved in the project.

How to Contribute

Reporting Bugs

If you encounter a bug in NLP-Email-Categorizer:

  1. Check Existing Issues: Search the Issues page to see if the bug has already been reported.
  2. Create a New Issue: If the bug is new, open a new issue and provide:
    • A clear title and description of the bug.
    • Steps to reproduce the issue (e.g., specific dataset, notebook cell).
    • Expected and actual behavior.
    • Screenshots, error logs, or stack traces, if applicable.
    • Your environment (e.g., Python version, OS, Jupyter version).
  3. Use the Bug Report Template: Follow the template provided in the issue creation form for consistency.

Example Bug Report:

  • Title: "GUI Prediction Fails with Empty Input"
  • Description: Entering an empty subject in the GUI causes an error.
  • Steps: Open Naive_Bayes_Text_Classification.ipynb, run all cells, input "" in GUI, click Predict.
  • Expected: Error message for empty input.
  • Actual: ValueError in vectorizer.
  • Environment: Python 3.8, Jupyter Notebook 6.4, Windows 11.
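A fix for a report like this one usually comes down to validating the input before it reaches the vectorizer. The following is a minimal sketch of that idea; `validate_subject` and `safe_predict` are hypothetical names that do not exist in the notebooks, and `predict_fn` stands in for whatever function the GUI actually calls.

```python
def validate_subject(subject):
    """Return a stripped subject, or raise ValueError with a readable message.

    Hypothetical helper for illustration; not part of the project's notebooks.
    """
    if subject is None or not str(subject).strip():
        raise ValueError("Subject is empty. Please enter an email subject.")
    return str(subject).strip()


def safe_predict(subject, predict_fn):
    """Validate the input, then delegate to predict_fn (a stand-in for the model)."""
    try:
        cleaned = validate_subject(subject)
    except ValueError as exc:
        # Surface a friendly message instead of a stack trace in the GUI.
        return f"Error: {exc}"
    return predict_fn(cleaned)
```

With this in place, an empty string yields a message starting with `Error:` rather than an exception raised from inside the vectorizer.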

Suggesting Features

We welcome ideas to enhance NLP-Email-Categorizer! To suggest a feature:

  1. Check Existing Requests: Review the Issues page to avoid duplicates.
  2. Submit a Feature Request: Open a new issue and include:
    • A clear title and detailed description of the feature.
    • The problem it solves or the benefit it provides (e.g., adds TF-IDF, improves GUI).
    • Any relevant examples, code snippets, or references to similar tools.
  3. Use the Feature Request Template: Follow the provided template to structure your suggestion.

Example Feature Request:

  • Title: "Add TF-IDF Vectorizer Option"
  • Description: Include TF-IDF as an alternative to CountVectorizer for feature extraction.
  • Benefit: Improves classification by weighting rare terms.
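To make the stated benefit concrete, here is a small standard-library sketch of how TF-IDF down-weights terms that appear in every subject. It uses the textbook formula `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The `tfidf` function and the sample subjects are illustrative assumptions, not project code.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical subjects: "meeting" occurs everywhere, the other terms are rare.
subjects = ["meeting invoice", "meeting agenda", "meeting reminder"]
weights = tfidf(subjects)
```

Here `meeting` gets a weight of zero because it appears in every subject, while `invoice`, `agenda`, and `reminder` receive positive weights. A raw count, as produced by `CountVectorizer`, would treat all of these terms equally.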

Submitting Pull Requests

To contribute code or documentation:

  1. Fork the Repository:

    • Fork the repository on GitHub to create a copy under your own account.
  2. Create a Branch:

    • Create a new branch for your changes:
      git checkout -b feature/your-feature-name
    • Use descriptive branch names (e.g., fix/gui-error, feature/tfidf-vectorizer).
  3. Make Changes:

    • Modify the notebooks or add new files as needed.
    • Follow the Code Style Guidelines below.
    • Test your changes in a Jupyter environment (local or Colab).
  4. Commit Changes:

    • Write clear, concise commit messages:
      git commit -m "Add TF-IDF vectorizer option to advanced notebook"
    • Reference related issues (e.g., Fixes #123).
  5. Push and Create a Pull Request:

    • Push your branch to your fork:
      git push origin feature/your-feature-name
    • Open a pull request (PR) against the main branch of the original repository.
    • Use the PR template and provide:
      • A description of the changes.
      • The issue number(s) addressed (if any).
      • Screenshots or outputs for notebook changes (e.g., new visualizations).
      • Testing performed (e.g., environments tested, datasets used).
  6. Code Review:

    • Respond to feedback from maintainers.
    • Make requested changes and update your PR as needed.
    • Your PR will be merged once approved.

Development Setup

To set up a development environment for NLP-Email-Categorizer:

  1. Prerequisites:

    • Python 3.6+ installed.
    • Jupyter Notebook or JupyterLab installed.
    • Git for version control.
    • A code editor (e.g., VS Code, PyCharm).
  2. Clone the Repository:

    git clone https://github.com/VoxDroid/NLP-Email-Categorizer.git
    cd NLP-Email-Categorizer
  3. Set Up a Virtual Environment:

    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    venv\Scripts\activate     # Windows
  4. Install Dependencies:

    pip install pandas numpy scikit-learn nltk matplotlib seaborn joblib ipywidgets
  5. Install NLTK Data:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
  6. Prepare a Test Dataset:

    • Create a sample CSV or TSV file with Subject and Category columns (see Installation in README).
    • Place it in the project directory.
  7. Run the Notebooks:

    jupyter notebook

    Open Naive_Bayes_Text_Classification.ipynb or Text_Classification_Pipeline_for_Email_Subjects.ipynb and test your changes.
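For step 6, a test dataset can be generated with a few lines of Python. This is a minimal sketch: the rows, category labels, and helper names are made up for illustration, and only the Subject and Category column names come from this guide.

```python
import csv
import io
from pathlib import Path

# Hypothetical sample rows; replace them with subjects from your own data.
SAMPLE_ROWS = [
    ("Invoice for March services", "Finance"),
    ("Team meeting moved to 3 PM", "Meetings"),
    ("Reset your account password", "Support"),
    ("Quarterly budget review", "Finance"),
]


def build_sample_csv():
    """Return CSV text with a Subject,Category header and the sample rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Subject", "Category"])
    writer.writerows(SAMPLE_ROWS)
    return buffer.getvalue()


def write_sample_dataset(path):
    """Write the sample CSV to the given path and return it as a Path."""
    path = Path(path)
    path.write_text(build_sample_csv(), encoding="utf-8")
    return path
```

Calling `write_sample_dataset("sample_emails.csv")` from the project directory creates the file; the file name is an arbitrary choice.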

Code Style Guidelines

To maintain consistency in the codebase:

  • Python:
    • Follow PEP 8 guidelines (e.g., 4-space indentation, 79-character line limit).
    • Use descriptive variable names (e.g., X_train_vectorized, preprocess_text).
    • Add docstrings for functions and comments for complex logic.
    • Example:
      def preprocess_text(text):
          """Lowercase text, strip punctuation, tokenize, and remove stopwords."""
          text = text.lower()
          # ...
  • Notebooks:
    • Use markdown cells for clear section headings and explanations.
    • Keep code cells focused (e.g., one task per cell: loading data, training model).
    • Include logging or print statements for user feedback.
    • Use Colab-style # @title annotations for cell titles (as in the original notebooks).
  • File Structure:
    • Keep notebooks in the root directory.
    • Store datasets in a data/ folder (not tracked in Git).
    • Save models and outputs in an outputs/ folder (not tracked).
  • Reproducibility:
    • Set random seeds (e.g., random_state=42) for consistent results.
    • Document dataset requirements (e.g., column names, format).
  • Dependencies:
    • Use only the libraries specified in the notebooks unless adding new functionality.
    • Update dependency installation instructions in the README if new libraries are added.
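The reproducibility guideline can be illustrated with a seeded shuffle-and-split. This sketch uses only the standard library, and `seeded_split` is a hypothetical helper; in scikit-learn, passing `random_state=42` to `train_test_split` serves the same purpose.

```python
import random


def seeded_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed and split into (train, test)."""
    rng = random.Random(seed)  # local generator; leaves global random state alone
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"subject {i}" for i in range(10)]
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
# Both calls use the same seed, so the two splits are identical.
```

Using a local `random.Random` instance rather than `random.seed` keeps the split deterministic even if other cells consume random numbers in between.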

Testing

Before submitting a pull request:

  • Manual Testing:
    • Run both notebooks end-to-end with a sample dataset.
    • Verify data loading, preprocessing, training, evaluation, and prediction steps.
    • Test the GUI in the advanced notebook with various inputs (e.g., valid, empty, long subjects).
    • Check augmentation output for correctness (advanced notebook).
  • Environment Testing:
    • Test in multiple environments (e.g., local Jupyter, Google Colab).
    • Verify compatibility with Python 3.6+ and listed dependencies.
  • Edge Cases:
    • Test with empty or malformed datasets (e.g., missing columns, NaN values).
    • Test with unusual inputs in the GUI (e.g., special characters, very long text).
    • Verify model performance with small or imbalanced datasets.
  • Performance:
    • Ensure preprocessing and training are efficient for datasets up to ~10,000 rows.
    • Check visualization rendering (e.g., confusion matrix, bar plots).
  • Reproducibility:
    • Confirm results are consistent with the same dataset and random seed.
    • Validate saved models load correctly and produce expected predictions.
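Some of the edge cases above, such as missing columns and empty values, can be checked before a notebook ever loads the data. The following is a rough sketch using the standard `csv` module; `find_dataset_problems` is a hypothetical helper, and the required column names are taken from the Development Setup section.

```python
import csv
import io

REQUIRED_COLUMNS = ("Subject", "Category")


def find_dataset_problems(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    rows = list(reader)
    problems = []
    if not rows:
        problems.append("dataset has no rows")
    # Rows are numbered from 1, counting data rows only (header excluded).
    for row_no, row in enumerate(rows, start=1):
        for col in REQUIRED_COLUMNS:
            # Short rows yield None for missing fields, hence the "or".
            if not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col}")
    return problems
```

An empty list means the file passed these basic checks. It says nothing about class balance or model quality, which still need the manual tests listed above.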

Community

Join the NLP-Email-Categorizer community:

  • GitHub Discussions: Share ideas or ask questions in the Discussions section.
  • Issues: Report bugs or suggest features on the Issues page.
  • GitHub Stars: Show your support by starring the repository.

Thank you for contributing to NLP-Email-Categorizer! Your efforts help make this tool better for data scientists and NLP practitioners worldwide.