Add documentation for data preparation and multi-label classification process

syedsufyan-coder · syedsufyan-coder · commit 3ea182533880 · 2026-05-18T13:56:21.000+05:00
diff --git a/documentation/notebook/Pre-Training Documentation.md b/documentation/notebook/Pre-Training Documentation.md
@@ -0,0 +1,260 @@
+# 🧼📦 Data Preparation (Pre-Training) — GitHub Issue/PR Multi‑Label Dataset
+
+> Scope: this document covers **everything done before model training** in the notebook `PR_CLassifier.ipynb` (Cells 1–5).  
+> It includes dataset loading, label normalization, multi-label extraction, filtering, export, and quick sanity-check EDA.
+
+---
+
+## 🎯 Goal
+
+Turn the Hugging Face dataset **`sharjeelyunus/github-issues-dataset`** into a clean, model-ready table for **multi-label classification** using only:
+
+- Text features: `title`, `body`
+- Target labels (5): `bug`, `enhancement`, `documentation`, `test`, `request`
+
+Each target label is represented as a binary column (`0/1`). Rows that do not match any of the 5 targets are dropped.
+
+---
+
+## 🧠 Why multi-label?
+
+GitHub issues/PRs commonly carry **multiple labels** (e.g., a single PR can be tagged both `bug` and `test`). The dataset is therefore prepared in a **multi-label** format:
+
+- One example can map to many labels
+- Targets are stored as 5 binary columns
+
+---
+
+## 🗂️ What comes from the raw dataset?
+
+The dataset is loaded via `datasets.load_dataset(...)` and converted into a Pandas DataFrame.
+
+Key raw fields used:
+
+| Field | Used for | Notes |
+|---|---|---|
+| `title` | input text | may be empty/NaN |
+| `body` | input text | may be empty/NaN |
+| `labels` | supervision source | noisy, multi-valued, mixed formatting |
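+
+A minimal loading sketch, assuming the `train` split named in the pipeline overview and the `Dataset.to_pandas()` conversion; the notebook's exact call may differ. The helper name `load_issues_frame` is hypothetical.
+
+```python
+def load_issues_frame(split="train"):
+    """Load the raw issues dataset and return it as a pandas DataFrame."""
+    # Imported lazily so that defining the helper needs neither the
+    # `datasets` package nor network access; calling it needs both.
+    from datasets import load_dataset
+
+    dataset = load_dataset("sharjeelyunus/github-issues-dataset", split=split)
+    return dataset.to_pandas()
+```
+
+Calling `load_issues_frame()` downloads the dataset on first use, so it is best run once per session.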
+
+---
+
+## 🧾 Label taxonomy (fixed 5 labels)
+
+The notebook compresses many possible raw label strings into **5 normalized targets** using substring matching.
+
+### ✅ Mapping table
+
+| Target column | Substrings searched in raw `labels` |
+|---|---|
+| `bug` | `bug`, `fix` |
+| `enhancement` | `enhancement`, `feature` |
+| `documentation` | `doc` |
+| `test` | `test` |
+| `request` | `request` |
+
+This mapping is implemented exactly as:
+
+```python
+label_map = {
+    'bug': ['bug', 'fix'],
+    'enhancement': ['enhancement', 'feature'],
+    'documentation': ['doc'],
+    'test': ['test'],
+    'request': ['request']
+}
+```
+
+---
+
+## 🧼 Multi-label cleaning strategy
+
+### The core idea
+
+For each row:
+
+1. Start with all targets set to `0`
+2. If `labels` is missing/empty → return all zeros
+3. Convert `labels` to lowercase text
+4. For each target, build a regex that matches **any** of its substrings
+5. If a match is found → set that target column to `1`
+
+### 🔑 Key implementation (exact notebook code)
+
+```python
+def clean_multi_labels(row_labels):
+    extracted = {key: 0 for key in label_map.keys()}
+
+    if pd.isna(row_labels) or row_labels == "":
+        return pd.Series(extracted)
+
+    # Convert label list/string to lower case for matching
+    label_text = str(row_labels).lower()
+
+    for target_column, substrings in label_map.items():
+        pattern = r'(' + '|'.join(map(re.escape, substrings)) + r')'
+        if re.search(pattern, label_text):
+            extracted[target_column] = 1
+
+    return pd.Series(extracted)
+```
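+
+To see the matching behavior without pandas, the same loop can be run on a few label strings. This is a sketch: `match_labels` is a hypothetical dict-returning variant of the function above, and the sample strings are made up rather than taken from the dataset.
+
+```python
+import re
+
+label_map = {
+    'bug': ['bug', 'fix'],
+    'enhancement': ['enhancement', 'feature'],
+    'documentation': ['doc'],
+    'test': ['test'],
+    'request': ['request'],
+}
+
+
+def match_labels(label_text):
+    """Return a {target: 0/1} dict using the same substring regex logic."""
+    text = str(label_text).lower()
+    flags = {}
+    for target, substrings in label_map.items():
+        pattern = '|'.join(map(re.escape, substrings))
+        flags[target] = 1 if re.search(pattern, text) else 0
+    return flags
+
+
+# Illustrative inputs only (not rows from the real dataset).
+samples = ["Type: Bug, needs tests", "feature request", "question", "latest release"]
+for raw in samples:
+    hits = [name for name, flag in match_labels(raw).items() if flag]
+    print(f"{raw!r} -> {hits}")
+```
+
+The last sample sets `test` because `latest` contains the substring `test`, which shows how permissive the matching is.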
+
+### ✅ Why substring + regex?
+
+The raw `labels` field can contain multiple comma-separated values and inconsistent naming. Substring matching makes the pipeline robust to variations like:
+
+- `bug`, `type: bug`, `bugfix`, `fix needed`
+- `feature`, `enhancement`, etc.
+
+---
+
+## 🔗 Building the final training table
+
+### 1) Apply the cleaner
+
+```python
+new_label_cols = df['labels'].apply(clean_multi_labels)
+```
+
+### 2) Keep only needed columns + new targets
+
+```python
+final_df = pd.concat([df[['title', 'body']], new_label_cols], axis=1)
+```
+
+### 3) Drop rows with no target labels
+
+This ensures every example has **at least one** of the 5 labels.
+
+```python
+target_cols = list(label_map.keys())
+final_df = final_df[(final_df[target_cols] != 0).any(axis=1)].reset_index(drop=True)
+```
+
+### 4) Add a sequential integer ID
+
+```python
+final_df.insert(0, 'id', range(1, len(final_df) + 1))
+```
+
+### 5) Export the cleaned dataset
+
+```python
+output_file = '../data/cleaned_github_data.csv'
+final_df.to_csv(output_file, index=False)
+```
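+
+Steps 2 to 4 can be tried on a toy frame before running them on the full dataset. This is a sketch that assumes hand-written rows in place of the real `df` and of the cleaner's output; `toy_df`, `text_part`, and `label_part` are illustrative names, and the export step is left out so nothing is written to disk.
+
+```python
+import pandas as pd
+
+target_cols = ['bug', 'enhancement', 'documentation', 'test', 'request']
+
+# Toy stand-ins for df[['title', 'body']] and for the cleaner's output.
+text_part = pd.DataFrame({
+    'title': ['Crash on start', 'How do I install?', 'Add dark mode'],
+    'body': ['Stack trace attached', None, 'Would be nice to have'],
+})
+label_part = pd.DataFrame(
+    [[1, 0, 0, 0, 0],   # bug only
+     [0, 0, 0, 0, 0],   # no target matched, so this row is dropped
+     [0, 1, 0, 0, 1]],  # enhancement + request
+    columns=target_cols,
+)
+
+toy_df = pd.concat([text_part, label_part], axis=1)
+
+# Keep rows with at least one positive target, then number them from 1.
+toy_df = toy_df[(toy_df[target_cols] != 0).any(axis=1)].reset_index(drop=True)
+toy_df.insert(0, 'id', range(1, len(toy_df) + 1))
+
+print(toy_df[['id', 'title'] + target_cols])
+```
+
+Because the ID is assigned after filtering, it follows the surviving row order and changes if the filter or the source data changes.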
+
+---
+
+## ✅ Resulting dataset (post-clean)
+
+From the exported CSV (`data/cleaned_github_data.csv`) the cleaned dataset shape is:
+
+- Rows: **65,055**
+- Columns: **8**
+
+### Final schema
+
+| Column | Type | Description |
+|---|---|---|
+| `id` | int | sequential id starting at 1 |
+| `title` | text | issue/PR title |
+| `body` | text | issue/PR body |
+| `bug` | 0/1 | target label |
+| `enhancement` | 0/1 | target label |
+| `documentation` | 0/1 | target label |
+| `test` | 0/1 | target label |
+| `request` | 0/1 | target label |
+
+---
+
+## 📊 Quick EDA / sanity checks
+
+The notebook does a small amount of EDA to verify the dataset is usable.
+
+### 1) Label frequency
+
+Counts of positives per label (multi-label, so totals can exceed row count):
+
+| Label | Positive count |
+|---|---:|
+| `bug` | 35,684 |
+| `enhancement` | 21,732 |
+| `request` | 8,810 |
+| `documentation` | 5,569 |
+| `test` | 4,944 |
+
+This matches the visual bar chart shown in the notebook.
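+
+With the counts above, the total is 76,739 positives over 65,055 rows, or about 1.18 labels per row. The counting itself can be sketched as a column sum; the frame below is a toy example, and the notebook's own plotting code is not reproduced here.
+
+```python
+import pandas as pd
+
+target_cols = ['bug', 'enhancement', 'documentation', 'test', 'request']
+
+# Toy label matrix: three rows, five positives in total.
+toy_df = pd.DataFrame(
+    [[1, 0, 0, 1, 0],
+     [1, 0, 0, 0, 0],
+     [0, 1, 0, 0, 1]],
+    columns=target_cols,
+)
+
+# Summing a 0/1 column counts its positives.
+label_counts = toy_df[target_cols].sum().sort_values(ascending=False)
+total_positives = int(label_counts.sum())
+
+print(label_counts)
+print(f"{total_positives} positives over {len(toy_df)} rows")
+```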
+
+### 2) Co-occurrence (correlation heatmap)
+
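+The notebook computes the pairwise correlation matrix of the target columns (`target_labels` is assumed to hold the same five names as `target_cols` above):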
+
+```python
+correlation = final_df[target_labels].corr()
+```
+
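+`DataFrame.corr()` defaults to Pearson correlation, which on **binary** (0/1) columns equals the phi coefficient. It is a fast way to check whether two labels tend to appear together (positive values) or to exclude each other (negative values).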
+
+Notable pattern from the plotted heatmap:
+
+- `bug` vs `enhancement` shows a strong negative correlation (≈ -0.74)
+- `enhancement` vs `request` shows a moderate positive correlation (≈ 0.36)
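+
+For two 0/1 columns, Pearson correlation reduces to the phi coefficient of their 2x2 contingency table. The sketch below checks this on made-up vectors (not dataset values); `pearson` and `phi` are illustrative helper names.
+
+```python
+from math import sqrt
+
+
+def pearson(xs, ys):
+    """Plain Pearson correlation of two equal-length sequences."""
+    n = len(xs)
+    mean_x, mean_y = sum(xs) / n, sum(ys) / n
+    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
+    var_x = sum((x - mean_x) ** 2 for x in xs)
+    var_y = sum((y - mean_y) ** 2 for y in ys)
+    return cov / sqrt(var_x * var_y)
+
+
+def phi(xs, ys):
+    """Phi coefficient computed from the 2x2 contingency counts."""
+    pairs = list(zip(xs, ys))
+    n11 = pairs.count((1, 1))
+    n10 = pairs.count((1, 0))
+    n01 = pairs.count((0, 1))
+    n00 = pairs.count((0, 0))
+    numerator = n11 * n00 - n10 * n01
+    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
+    return numerator / denominator
+
+
+# Toy columns: the two labels rarely appear on the same row.
+bug = [1, 1, 1, 0, 0, 0, 1, 0]
+enhancement = [0, 0, 0, 1, 1, 1, 1, 0]
+
+print(pearson(bug, enhancement), phi(bug, enhancement))
+```
+
+Both functions return the same negative value here, because rows with exactly one of the two labels outnumber rows where the labels agree.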
+
+### 3) Text length distribution (title + body)
+
+The notebook combines text and measures word counts:
+
+```python
+final_df['combined_text'] = final_df['title'].fillna('') + " " + final_df['body'].fillna('')
+final_df['text_len'] = final_df['combined_text'].apply(lambda x: len(x.split()))
+```
+
+Summary stats (computed from the cleaned CSV):
+
+| Metric | Value |
+|---|---:|
+| Average words / record | 241.98 |
+| Median words / record | 170 |
+| Min words | 2 |
+| Max words | 26,381 |
+
+The notebook also sets `plt.xlim(0, 500)` so the histogram focuses on the common range; the long tail of outliers (up to 26,381 words) is cropped from the plot only, not removed from the data.
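+
+The summary statistics can be reproduced with the standard library alone. This sketch uses made-up records, with `None` standing in for NaN and `or ""` mirroring `fillna('')`; `word_count` is an illustrative helper name.
+
+```python
+from statistics import mean, median
+
+# Toy records; None plays the role of a missing title or body.
+records = [
+    {"title": "Crash on start", "body": "Stack trace attached below"},
+    {"title": "Add dark mode", "body": None},
+    {"title": None, "body": "Docs typo in the install section of README"},
+]
+
+
+def word_count(record):
+    """Whitespace-token count of title + body, treating None as empty."""
+    combined = (record["title"] or "") + " " + (record["body"] or "")
+    return len(combined.split())
+
+
+lengths = [word_count(r) for r in records]
+print("mean:", mean(lengths), "median:", median(lengths))
+print("min:", min(lengths), "max:", max(lengths))
+```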
+
+---
+
+## 🧩 Pipeline overview (Mermaid)
+
+```mermaid
+flowchart TD
+  A["Load dataset from Hugging Face\nsharjeelyunus/github-issues-dataset"] --> B["Convert train split to DataFrame"]
+  B --> C["Define label_map\n5 targets + substrings"]
+  C --> D["clean_multi_labels(labels)\n→ 5 binary columns"]
+  D --> E["Concat: title, body + label columns"]
+  E --> F["Filter: keep rows with any label == 1"]
+  F --> G["Insert sequential id"]
+  G --> H["Export cleaned_github_data.csv"]
+  H --> I["EDA: label frequency, correlation, text length"]
+```
+
+---
+
+## 🧷 Notebook navigation (what to look at)
+
+- **Cell 2**: dataset load + label mapping + cleaning + export
+- **Cell 3**: quick preview (`final_df.head()`)
+- **Cell 4**: label frequency plot + correlation heatmap + text length histogram
+- **Cell 5**: explicit marker: **“Model Training starts from here”**
+
+---
+
+## ⚠️ Notes / caveats (pre-training)
+
+- `from google.colab import files` is Colab-specific; it is only needed when downloading files from a Colab session.
+- The matching is substring-based and intentionally permissive, so that naming variations still map to a target. The cost is over-matching: `test` also hits `latest`, `fix` hits `prefix`, `doc` hits `docker`, and `request` hits any label containing `pull request`.
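+
+One possible tightening, which the notebook does not use, is to require that a hit starts at a word boundary. The sketch below is hypothetical (`match_labels_strict` is an illustrative name) and only shows the trade-off: it rejects `latest` and `prefix`, but it also misses `hotfix` and labels joined with underscores such as `needs_tests`, and it still accepts `docker`.
+
+```python
+import re
+
+label_map = {
+    'bug': ['bug', 'fix'],
+    'enhancement': ['enhancement', 'feature'],
+    'documentation': ['doc'],
+    'test': ['test'],
+    'request': ['request'],
+}
+
+
+def match_labels_strict(label_text):
+    """Like the substring matcher, but a hit must begin at a word boundary."""
+    text = str(label_text).lower()
+    flags = {}
+    for target, substrings in label_map.items():
+        # \b anchors the start only, so 'tests' and 'bugfix' still match.
+        pattern = r'\b(?:' + '|'.join(map(re.escape, substrings)) + r')'
+        flags[target] = 1 if re.search(pattern, text) else 0
+    return flags
+
+
+for raw in ["latest release", "unit-tests", "bugfix", "prefix handling", "docker"]:
+    hits = [name for name, flag in match_labels_strict(raw).items() if flag]
+    print(f"{raw!r} -> {hits}")
+```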
+
+---
+
+## ✅ Output artifact
+
+- `data/cleaned_github_data.csv` — the cleaned, filtered multi-label dataset used for training.