| 1 | +# 🧼📦 Data Preparation (Pre-Training) — GitHub Issue/PR Multi‑Label Dataset
| 2 | +
| 3 | +> Scope: this document covers **everything done before model training** in the notebook `PR_CLassifier.ipynb` (Cells 1–5).
| 4 | +> It includes dataset loading, label normalization, multi-label extraction, filtering, export, and quick sanity-check EDA.
| 5 | +
| 6 | +--- |
| 7 | + |
| 8 | +## 🎯 Goal |
| 9 | + |
| 10 | +Turn the Hugging Face dataset **`sharjeelyunus/github-issues-dataset`** into a clean, model-ready table for **multi-label classification** using only: |
| 11 | + |
| 12 | +- Text features: `title`, `body` |
| 13 | +- Target labels (5): `bug`, `enhancement`, `documentation`, `test`, `request` |
| 14 | + |
| 15 | +Each target label is represented as a binary column (`0/1`). Rows that do not match any of the 5 targets are dropped. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## 🧠 Why multi-label? |
| 20 | + |
| 21 | +GitHub issues and PRs often carry **multiple labels** (e.g., one issue can be tagged both `bug` and `test`). The dataset is therefore prepared in a **multi-label** format:
| 22 | + |
| 23 | +- One example can map to many labels |
| 24 | +- Targets are stored as 5 binary columns |
| 25 | + |
| 26 | +--- |
| 27 | + |
| 28 | +## 🗂️ What comes from the raw dataset? |
| 29 | + |
| 30 | +The dataset is loaded via `datasets.load_dataset(...)` and converted into a Pandas DataFrame. |
| 31 | + |
| 32 | +Key raw fields used: |
| 33 | + |
| 34 | +| Field | Used for | Notes | |
| 35 | +|---|---|---| |
| 36 | +| `title` | input text | may be empty/NaN | |
| 37 | +| `body` | input text | may be empty/NaN | |
| 38 | +| `labels` | supervision source | noisy, multi-valued, mixed formatting | |
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## 🧾 Label taxonomy (fixed 5 labels) |
| 43 | + |
| 44 | +The notebook compresses many possible raw label strings into **5 normalized targets** using substring matching. |
| 45 | + |
| 46 | +### ✅ Mapping table |
| 47 | + |
| 48 | +| Target column | Substrings searched in raw `labels` | |
| 49 | +|---|---| |
| 50 | +| `bug` | `bug`, `fix` | |
| 51 | +| `enhancement` | `enhancement`, `feature` | |
| 52 | +| `documentation` | `doc` | |
| 53 | +| `test` | `test` | |
| 54 | +| `request` | `request` | |
| 55 | + |
| 56 | +This mapping is implemented exactly as: |
| 57 | + |
| 58 | +```python |
| 59 | +label_map = { |
| 60 | + 'bug': ['bug', 'fix'], |
| 61 | + 'enhancement': ['enhancement', 'feature'], |
| 62 | + 'documentation': ['doc'], |
| 63 | + 'test': ['test'], |
| 64 | + 'request': ['request'] |
| 65 | +} |
| 66 | +``` |
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## 🧼 Multi-label cleaning strategy |
| 71 | + |
| 72 | +### The core idea |
| 73 | + |
| 74 | +For each row: |
| 75 | + |
| 76 | +1. Start with all targets set to `0` |
| 77 | +2. If `labels` is missing/empty → return all zeros |
| 78 | +3. Convert `labels` to lowercase text |
| 79 | +4. For each target, build a regex that matches **any** of its substrings |
| 80 | +5. If a match is found → set that target column to `1` |
| 81 | + |
| 82 | +### 🔑 Key implementation (exact notebook code) |
| 83 | + |
| 84 | +```python |
| 85 | +def clean_multi_labels(row_labels): |
| 86 | + extracted = {key: 0 for key in label_map.keys()} |
| 87 | + |
| 88 | + if pd.isna(row_labels) or row_labels == "": |
| 89 | + return pd.Series(extracted) |
| 90 | + |
| 91 | + # Convert label list/string to lower case for matching |
| 92 | + label_text = str(row_labels).lower() |
| 93 | + |
| 94 | + for target_column, substrings in label_map.items(): |
| 95 | + pattern = r'(' + '|'.join(map(re.escape, substrings)) + r')' |
| 96 | + if re.search(pattern, label_text): |
| 97 | + extracted[target_column] = 1 |
| 98 | + |
| 99 | + return pd.Series(extracted) |
| 100 | +``` |
| 101 | + |
| 102 | +### ✅ Why substring + regex? |
| 103 | + |
| 104 | +The raw `labels` field can contain multiple comma-separated values and inconsistent naming. Substring matching makes the pipeline robust to variations like: |
| 105 | + |
| 106 | +- `bug`, `type: bug`, `bugfix`, `fix needed` |
| 107 | +- `feature`, `enhancement`, etc. |
| 108 | + |
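To see how the substring matching behaves on the variations listed above, here is a minimal pandas-free sketch. It restates the notebook's matching logic as a hypothetical helper, `match_targets`, that returns a plain dict instead of a `pd.Series`. The helper name, the pre-compiled `patterns` dict, and the sample strings are assumptions made for illustration; only `label_map` comes from the notebook.

```python
import re

# Same mapping as the notebook's label_map.
label_map = {
    'bug': ['bug', 'fix'],
    'enhancement': ['enhancement', 'feature'],
    'documentation': ['doc'],
    'test': ['test'],
    'request': ['request'],
}

# One alternation pattern per target, compiled once (illustrative choice).
patterns = {
    target: re.compile('|'.join(map(re.escape, substrings)))
    for target, substrings in label_map.items()
}

def match_targets(raw_labels):
    """Hypothetical helper: return {target: 0/1} for one raw labels value."""
    if raw_labels is None or raw_labels == "":
        return {target: 0 for target in patterns}
    text = str(raw_labels).lower()
    return {target: int(bool(p.search(text))) for target, p in patterns.items()}

# Sample label strings, invented for illustration.
for raw in ["bug", "type: bug", "bugfix", "fix needed", "Feature, Documentation"]:
    hits = [target for target, flag in match_targets(raw).items() if flag]
    print(f"{raw!r:28} -> {hits}")
```

All four `bug` variants map to the same target, and a comma-separated value such as `Feature, Documentation` sets two targets at once.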
| 109 | +--- |
| 110 | + |
| 111 | +## 🔗 Building the final training table |
| 112 | + |
| 113 | +### 1) Apply the cleaner |
| 114 | + |
| 115 | +```python |
| 116 | +new_label_cols = df['labels'].apply(clean_multi_labels) |
| 117 | +``` |
| 118 | + |
| 119 | +### 2) Keep only needed columns + new targets |
| 120 | + |
| 121 | +```python |
| 122 | +final_df = pd.concat([df[['title', 'body']], new_label_cols], axis=1) |
| 123 | +``` |
| 124 | + |
| 125 | +### 3) Drop rows with no target labels |
| 126 | + |
| 127 | +This ensures every example has **at least one** of the 5 labels. |
| 128 | + |
| 129 | +```python |
| 130 | +target_cols = list(label_map.keys()) |
| 131 | +final_df = final_df[(final_df[target_cols] != 0).any(axis=1)].reset_index(drop=True) |
| 132 | +``` |
| 133 | + |
| 134 | +### 4) Add a sequential integer ID
| 135 | + |
| 136 | +```python |
| 137 | +final_df.insert(0, 'id', range(1, len(final_df) + 1)) |
| 138 | +``` |
| 139 | + |
| 140 | +### 5) Export the cleaned dataset |
| 141 | + |
| 142 | +```python |
| 143 | +output_file = '../data/cleaned_github_data.csv' |
| 144 | +final_df.to_csv(output_file, index=False) |
| 145 | +``` |
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +## ✅ Resulting dataset (post-clean) |
| 150 | + |
| 151 | +From the exported CSV (`data/cleaned_github_data.csv`) the cleaned dataset shape is: |
| 152 | + |
| 153 | +- Rows: **65,055** |
| 154 | +- Columns: **8** |
| 155 | + |
| 156 | +### Final schema |
| 157 | + |
| 158 | +| Column | Type | Description | |
| 159 | +|---|---|---| |
| 160 | +| `id` | int | sequential id starting at 1 | |
| 161 | +| `title` | text | issue/PR title | |
| 162 | +| `body` | text | issue/PR body | |
| 163 | +| `bug` | 0/1 | target label | |
| 164 | +| `enhancement` | 0/1 | target label | |
| 165 | +| `documentation` | 0/1 | target label | |
| 166 | +| `test` | 0/1 | target label | |
| 167 | +| `request` | 0/1 | target label | |
| 168 | + |
| 169 | +--- |
| 170 | + |
| 171 | +## 📊 Quick EDA / sanity checks |
| 172 | + |
| 173 | +The notebook does a small amount of EDA to verify the dataset is usable. |
| 174 | + |
| 175 | +### 1) Label frequency |
| 176 | + |
| 177 | +Counts of positives per label (multi-label, so totals can exceed row count): |
| 178 | + |
| 179 | +| Label | Positive count | |
| 180 | +|---|---:| |
| 181 | +| `bug` | 35,684 | |
| 182 | +| `enhancement` | 21,732 | |
| 183 | +| `request` | 8,810 | |
| 184 | +| `documentation` | 5,569 | |
| 185 | +| `test` | 4,944 | |
| 186 | + |
| 187 | +This matches the visual bar chart shown in the notebook. |
| 188 | + |
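The counts in the table can be turned into per-label prevalence and an average number of labels per row. The sketch below uses only the figures reported above (65,055 rows and the five positive counts); the variable names are illustrative.

```python
# Figures copied from the tables in this document.
rows = 65_055
positives = {
    "bug": 35_684,
    "enhancement": 21_732,
    "request": 8_810,
    "documentation": 5_569,
    "test": 4_944,
}

# Total positive labels across all rows; exceeds the row count
# because one row can carry several labels.
total_positives = sum(positives.values())

# Average number of target labels per row.
cardinality = total_positives / rows

# Share of rows that carry each label.
prevalence = {label: count / rows for label, count in positives.items()}

for label, share in prevalence.items():
    print(f"{label:<14} {share:6.1%}")
print(f"total positives: {total_positives:,}")
print(f"labels per row:  {cardinality:.2f}")
```

The total is 76,739 positives over 65,055 rows, or about 1.18 labels per row, so most rows carry a single target label.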
| 189 | +### 2) Co-occurrence (correlation heatmap) |
| 190 | + |
| 191 | +The notebook computes: |
| 192 | + |
| 193 | +```python |
| 194 | +correlation = final_df[target_labels].corr() |
| 195 | +``` |
| 196 | + |
| 197 | +This is the Pearson correlation (the pandas default) over **binary** (0/1) columns, used as a quick check of which label pairs tend to co-occur (positive values) or to exclude each other (negative values).
| 198 | + |
| 199 | +Notable pattern from the plotted heatmap: |
| 200 | + |
| 201 | +- `bug` vs `enhancement` shows a strong negative correlation (≈ -0.74) |
| 202 | +- `enhancement` vs `request` shows a moderate positive correlation (≈ 0.36) |
| 203 | + |
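For two 0/1 columns, the Pearson correlation equals the phi coefficient of their 2×2 contingency table. The sketch below checks this on a small made-up pair of columns. The data and the helper names `pearson` and `phi` are assumptions for illustration and are not taken from the notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def phi(xs, ys):
    """Phi coefficient from the 2x2 contingency table of two 0/1 sequences."""
    pairs = list(zip(xs, ys))
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))
    n01 = pairs.count((0, 1))
    n00 = pairs.count((0, 0))
    numerator = n11 * n00 - n10 * n01
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator

# Toy columns (invented): the two labels rarely appear on the same row.
bug         = [1, 1, 1, 1, 0, 0, 0, 1]
enhancement = [0, 0, 0, 0, 1, 1, 1, 1]

print(f"pearson = {pearson(bug, enhancement):.4f}")
print(f"phi     = {phi(bug, enhancement):.4f}")
```

Both functions return the same negative value on this toy data, which is why a strongly negative cell in the heatmap can be read as "these two labels rarely appear on the same row".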
| 204 | +### 3) Text length distribution (title + body) |
| 205 | + |
| 206 | +The notebook combines text and measures word counts: |
| 207 | + |
| 208 | +```python |
| 209 | +final_df['combined_text'] = final_df['title'].fillna('') + " " + final_df['body'].fillna('') |
| 210 | +final_df['text_len'] = final_df['combined_text'].apply(lambda x: len(x.split())) |
| 211 | +``` |
| 212 | + |
| 213 | +Summary stats (computed from the cleaned CSV): |
| 214 | + |
| 215 | +| Metric | Value | |
| 216 | +|---|---:| |
| 217 | +| Average words / record | 241.98 | |
| 218 | +| Median words / record | 170 | |
| 219 | +| Min words | 2 | |
| 220 | +| Max words | 26,381 | |
| 221 | + |
| 222 | +The notebook also sets `plt.xlim(0, 500)` so the histogram focuses on the common range; records longer than 500 words (up to the 26,381-word maximum) fall outside the plotted window but remain in the data.
| 223 | + |
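The gap between the mean (241.98) and the median (170) is what a long right tail produces. The following pandas-free sketch mirrors the word-count logic on three invented records and shows the same effect. The helper `word_count` and the sample records are hypothetical, and missing fields are represented as `None` here rather than as pandas `NaN`.

```python
from statistics import mean, median

def word_count(title, body):
    """Whitespace-token count of title + body; None is treated as empty."""
    combined = (title or "") + " " + (body or "")
    return len(combined.split())

# Invented records: two short ones and one long outlier.
records = [
    ("Crash on start", "App crashes."),
    ("Add docs", None),
    (None, "word " * 50),
]

lengths = [word_count(title, body) for title, body in records]

print("lengths:", lengths)
print("mean:   ", mean(lengths))
print("median: ", median(lengths))
```

A single long record pulls the mean well above the median, which is why the median is the better guide to a typical record length.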
| 224 | +--- |
| 225 | + |
| 226 | +## 🧩 Pipeline overview (Mermaid) |
| 227 | + |
| 228 | +```mermaid |
| 229 | +flowchart TD |
| 230 | + A["Load dataset from Hugging Face\nsharjeelyunus/github-issues-dataset"] --> B["Convert train split to DataFrame"] |
| 231 | + B --> C["Define label_map\n5 targets + substrings"] |
| 232 | + C --> D["clean_multi_labels(labels)\n→ 5 binary columns"] |
| 233 | + D --> E["Concat: title, body + label columns"] |
| 234 | + E --> F["Filter: keep rows with any label == 1"] |
| 235 | + F --> G["Insert sequential id"] |
| 236 | + G --> H["Export cleaned_github_data.csv"] |
| 237 | + H --> I["EDA: label frequency, correlation, text length"] |
| 238 | +``` |
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +## 🧷 Notebook navigation (what to look at) |
| 243 | + |
| 244 | +- **Cell 2**: dataset load + label mapping + cleaning + export |
| 245 | +- **Cell 3**: quick preview (`final_df.head()`) |
| 246 | +- **Cell 4**: label frequency plot + correlation heatmap + text length histogram |
| 247 | +- **Cell 5**: explicit marker: **“Model Training starts from here”** |
| 248 | + |
| 249 | +--- |
| 250 | + |
| 251 | +## ⚠️ Notes / caveats (pre-training) |
| 252 | + |
| 253 | +- `from google.colab import files` is Colab-specific; it’s not required unless downloading from Colab. |
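| 254 | +- The matching is substring-based. It is intentionally permissive so that label variants are caught, but it can over-match on unintended substring hits (for example, `doc` also matches `docker`, `test` also matches `latest`, and `fix` also matches `wontfix`).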
| 255 | + |
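The over-matching caveat can be made concrete by comparing the notebook's substring rule with a stricter whole-word rule. The whole-word variant below is a hypothetical alternative, not something the notebook does, and the helper `match` and its `whole_word` flag are names introduced only for this sketch.

```python
import re

# Same mapping as the notebook's label_map.
label_map = {
    'bug': ['bug', 'fix'],
    'enhancement': ['enhancement', 'feature'],
    'documentation': ['doc'],
    'test': ['test'],
    'request': ['request'],
}

def match(label_text, whole_word=False):
    """Return the targets hit by label_text.

    whole_word=False reproduces the substring rule;
    whole_word=True (hypothetical) requires word boundaries around the keyword.
    """
    text = label_text.lower()
    hits = []
    for target, substrings in label_map.items():
        alternation = '|'.join(map(re.escape, substrings))
        pattern = rf'\b(?:{alternation})\b' if whole_word else alternation
        if re.search(pattern, text):
            hits.append(target)
    return hits

# Sample label strings, invented for illustration.
for raw in ["docker", "wontfix", "latest", "type: bug", "bugfix", "documentation"]:
    print(f"{raw!r:16} substring={match(raw)}  whole_word={match(raw, whole_word=True)}")
```

The whole-word rule removes the `docker`, `wontfix`, and `latest` hits, but with the same keyword list it also misses `bugfix` and `documentation`. Tightening the match would therefore require extending the keyword list as well, which is the trade-off behind the permissive choice.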
| 256 | +--- |
| 257 | + |
| 258 | +## ✅ Output artifact |
| 259 | + |
| 260 | +- `data/cleaned_github_data.csv` — the cleaned, filtered multi-label dataset used for training. |