Commit 3ea1825

Add documentation for data preparation and multi-label classification process
1 parent 1baa63d commit 3ea1825

1 file changed

Lines changed: 260 additions & 0 deletions

@@ -0,0 +1,260 @@
# 🧼📦 Data Preparation (Pre-Training): GitHub Issue/PR Multi‑Label Dataset

> Scope: this document covers **everything done before model training** in the notebook `PR_CLassifier.ipynb` (Cells 1–5).
> It includes dataset loading, label normalization, multi-label extraction, filtering, export, and quick sanity-check EDA.

---

## 🎯 Goal

Turn the Hugging Face dataset **`sharjeelyunus/github-issues-dataset`** into a clean, model-ready table for **multi-label classification** using only:

- Text features: `title`, `body`
- Target labels (5): `bug`, `enhancement`, `documentation`, `test`, `request`

Each target label is represented as a binary column (`0/1`). Rows that do not match any of the 5 targets are dropped.

---

## 🧠 Why multi-label?

GitHub issues/PRs commonly carry **multiple labels** (e.g., an issue can be both a `bug` and include `test` changes). The dataset is therefore prepared in a **multi-label** format:

- One example can map to many labels
- Targets are stored as 5 binary columns

---

## 🗂️ What comes from the raw dataset?

The dataset is loaded via `datasets.load_dataset(...)` and converted into a Pandas DataFrame.

Key raw fields used:

| Field | Used for | Notes |
|---|---|---|
| `title` | input text | may be empty/NaN |
| `body` | input text | may be empty/NaN |
| `labels` | supervision source | noisy, multi-valued, mixed formatting |

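
A minimal sketch of the loading step, assuming the `datasets` library is installed and network access is available. The `train` split name follows the pipeline overview in this document; the helper name `load_issues_dataframe` is hypothetical and does not come from the notebook.

```python
def load_issues_dataframe(repo_id="sharjeelyunus/github-issues-dataset", split="train"):
    """Download one split of the Hub dataset and return it as a pandas DataFrame."""
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)  # a datasets.Dataset for that split
    return dataset.to_pandas()


# Usage (downloads the data on first call):
# df = load_issues_dataframe()
# print(df[["title", "body", "labels"]].head())
```
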

---

## 🧾 Label taxonomy (fixed 5 labels)

The notebook compresses many possible raw label strings into **5 normalized targets** using substring matching.

### ✅ Mapping table

| Target column | Substrings searched in raw `labels` |
|---|---|
| `bug` | `bug`, `fix` |
| `enhancement` | `enhancement`, `feature` |
| `documentation` | `doc` |
| `test` | `test` |
| `request` | `request` |

This mapping is implemented exactly as:

```python
label_map = {
    'bug': ['bug', 'fix'],
    'enhancement': ['enhancement', 'feature'],
    'documentation': ['doc'],
    'test': ['test'],
    'request': ['request']
}
```

---

## 🧼 Multi-label cleaning strategy

### The core idea

For each row:

1. Start with all targets set to `0`
2. If `labels` is missing/empty → return all zeros
3. Convert `labels` to lowercase text
4. For each target, build a regex that matches **any** of its substrings
5. If a match is found → set that target column to `1`

### 🔑 Key implementation (exact notebook code)

The function relies on `re` and `pandas` (as `pd`) having been imported earlier in the notebook, and on the `label_map` defined above.

```python
def clean_multi_labels(row_labels):
    extracted = {key: 0 for key in label_map.keys()}

    if pd.isna(row_labels) or row_labels == "":
        return pd.Series(extracted)

    # Convert label list/string to lower case for matching
    label_text = str(row_labels).lower()

    for target_column, substrings in label_map.items():
        pattern = r'(' + '|'.join(map(re.escape, substrings)) + r')'
        if re.search(pattern, label_text):
            extracted[target_column] = 1

    return pd.Series(extracted)
```

### ✅ Why substring + regex?

The raw `labels` field can contain multiple comma-separated values and inconsistent naming. Substring matching makes the pipeline robust to variations like:

- `bug`, `type: bug`, `bugfix`, `fix needed`
- `feature`, `enhancement`, etc.

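
To make the effect of substring matching concrete, the sketch below applies the same `label_map` substrings to a few label strings. The sample strings are invented for illustration, and `match_targets` is a hypothetical helper, not a function from the notebook.

```python
import re

# Same substrings as the notebook's label_map.
label_map = {
    "bug": ["bug", "fix"],
    "enhancement": ["enhancement", "feature"],
    "documentation": ["doc"],
    "test": ["test"],
    "request": ["request"],
}


def match_targets(raw_labels):
    """Return the targets whose substrings occur in the lowercased label text."""
    text = str(raw_labels).lower()
    return [
        target
        for target, substrings in label_map.items()
        if re.search("|".join(map(re.escape, substrings)), text)
    ]


# Made-up label strings for illustration only.
samples = ["type: bug", "bugfix", "Fix needed", "feature, docs", "question"]
results = {sample: match_targets(sample) for sample in samples}
for sample, targets in results.items():
    print(f"{sample!r:18} -> {targets}")
# 'type: bug', 'bugfix' and 'Fix needed' all map to ['bug'];
# 'feature, docs' maps to ['enhancement', 'documentation']; 'question' maps to [].
```
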

---

## 🔗 Building the final training table

### 1) Apply the cleaner

```python
new_label_cols = df['labels'].apply(clean_multi_labels)
```

### 2) Keep only needed columns + new targets

```python
final_df = pd.concat([df[['title', 'body']], new_label_cols], axis=1)
```

### 3) Drop rows with no target labels

This ensures every example has **at least one** of the 5 labels.

```python
target_cols = list(label_map.keys())
final_df = final_df[(final_df[target_cols] != 0).any(axis=1)].reset_index(drop=True)
```

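
As a rough illustration of the filter, the sketch below runs the same expression on a three-row toy frame. The rows and the names `toy_df`, `has_label`, and `kept` are invented for this example and are not part of the notebook.

```python
import pandas as pd

target_cols = ["bug", "enhancement", "documentation", "test", "request"]

# Toy stand-in for final_df before filtering; the rows are invented.
toy_df = pd.DataFrame(
    {
        "title": ["Crash on start", "How do I install?", "Add dark mode"],
        "body": ["App exits immediately", "Need help", "Would be nice"],
        "bug": [1, 0, 0],
        "enhancement": [0, 0, 1],
        "documentation": [0, 0, 0],
        "test": [0, 0, 0],
        "request": [0, 0, 1],
    }
)

# Same filter as the notebook: keep rows where any target column is non-zero.
has_label = (toy_df[target_cols] != 0).any(axis=1)
kept = toy_df[has_label].reset_index(drop=True)

print(has_label.tolist())      # [True, False, True]
print(kept["title"].tolist())  # ['Crash on start', 'Add dark mode']
```

The middle row has all five targets at `0`, so it is dropped, and `reset_index(drop=True)` renumbers the remaining rows from 0.
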
### 4) Add a sequential integer ID

```python
final_df.insert(0, 'id', range(1, len(final_df) + 1))
```

### 5) Export the cleaned dataset

```python
output_file = '../data/cleaned_github_data.csv'
final_df.to_csv(output_file, index=False)
```

---

## ✅ Resulting dataset (post-clean)

From the exported CSV (`data/cleaned_github_data.csv`) the cleaned dataset shape is:

- Rows: **65,055**
- Columns: **8**

### Final schema

| Column | Type | Description |
|---|---|---|
| `id` | int | sequential id starting at 1 |
| `title` | text | issue/PR title |
| `body` | text | issue/PR body |
| `bug` | 0/1 | target label |
| `enhancement` | 0/1 | target label |
| `documentation` | 0/1 | target label |
| `test` | 0/1 | target label |
| `request` | 0/1 | target label |

---

## 📊 Quick EDA / sanity checks

The notebook does a small amount of EDA to verify the dataset is usable.

### 1) Label frequency

Counts of positives per label (multi-label, so totals can exceed row count):

| Label | Positive count |
|---|---:|
| `bug` | 35,684 |
| `enhancement` | 21,732 |
| `request` | 8,810 |
| `documentation` | 5,569 |
| `test` | 4,944 |

This matches the visual bar chart shown in the notebook.

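
A quick arithmetic check on the reported numbers, as a sketch: the counts are copied from the table above and the row total from the "Resulting dataset" section. The variable names are illustrative only.

```python
# Counts copied from the label frequency table; row total from the dataset shape.
positive_counts = {
    "bug": 35_684,
    "enhancement": 21_732,
    "request": 8_810,
    "documentation": 5_569,
    "test": 4_944,
}
n_rows = 65_055

total_positives = sum(positive_counts.values())
labels_per_row = total_positives / n_rows
shares = {label: count / n_rows for label, count in positive_counts.items()}

print(total_positives)           # 76739
print(round(labels_per_row, 2))  # 1.18
for label, share in shares.items():
    print(f"{label:14} {share:.1%}")
```

The positives sum to more than the 65,055 rows, which is consistent with some rows carrying more than one label (about 1.18 labels per row on average).
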
### 2) Co-occurrence (correlation heatmap)

The notebook computes:

```python
correlation = final_df[target_labels].corr()
```

This is a Pearson correlation (the pandas default) over **binary** (0/1) columns, used as a fast way to spot whether labels tend to co-occur or to exclude each other.

Notable patterns in the plotted heatmap:

- `bug` vs `enhancement` shows a strong negative correlation (≈ -0.74)
- `enhancement` vs `request` shows a moderate positive correlation (≈ 0.36)

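
For 0/1 columns, Pearson correlation is the phi coefficient of the 2×2 co-occurrence table. The sketch below shows how the value follows from the four cell counts; `phi_coefficient` is a hypothetical helper and the toy columns are invented, so it illustrates the measure rather than reproducing the notebook's numbers.

```python
import math


def phi_coefficient(x, y):
    """Pearson correlation of two equal-length 0/1 sequences via the 2x2 table."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        return float("nan")  # undefined when either column is constant
    return (n11 * n00 - n10 * n01) / denominator


# Invented toy columns: the two labels never appear on the same row.
bug = [1, 1, 1, 0, 0, 1]
enhancement = [0, 0, 0, 1, 1, 0]
print(phi_coefficient(bug, enhancement))  # -1.0

# Invented toy columns with no association between the two labels.
print(phi_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

A value near -1 means one label is almost always present exactly when the other is absent; a value near 0 means knowing one label says little about the other.
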
### 3) Text length distribution (title + body)

The notebook combines text and measures word counts:

```python
final_df['combined_text'] = final_df['title'].fillna('') + " " + final_df['body'].fillna('')
final_df['text_len'] = final_df['combined_text'].apply(lambda x: len(x.split()))
```

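
The same word-count logic can be sketched without pandas on a few invented records. Here `None` stands in for a missing title or body (the notebook handles these with `fillna('')`), and `word_count` is a hypothetical helper.

```python
from statistics import mean, median

# Invented records; None stands in for a missing title or body.
records = [
    {"title": "Crash on start", "body": "The app exits immediately after launch."},
    {"title": "Add dark mode", "body": None},
    {"title": None, "body": "word " * 40},
]


def word_count(record):
    """Join title and body (missing values become empty) and count whitespace-separated tokens."""
    combined = (record["title"] or "") + " " + (record["body"] or "")
    return len(combined.split())


lengths = [word_count(record) for record in records]
print(lengths)          # [9, 3, 40]
print(median(lengths))  # 9
print(mean(lengths) > median(lengths))  # True: one long record pulls the mean up
```
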
Summary stats (computed from the cleaned CSV):

| Metric | Value |
|---|---:|
| Average words / record | 241.98 |
| Median words / record | 170 |
| Min words | 2 |
| Max words | 26,381 |

The notebook also sets `plt.xlim(0, 500)` to focus the histogram on the common range while acknowledging outliers.

---

## 🧩 Pipeline overview (Mermaid)

```mermaid
flowchart TD
    A["Load dataset from Hugging Face\nsharjeelyunus/github-issues-dataset"] --> B["Convert train split to DataFrame"]
    B --> C["Define label_map\n5 targets + substrings"]
    C --> D["clean_multi_labels(labels)\n→ 5 binary columns"]
    D --> E["Concat: title, body + label columns"]
    E --> F["Filter: keep rows with any label == 1"]
    F --> G["Insert sequential id"]
    G --> H["Export cleaned_github_data.csv"]
    H --> I["EDA: label frequency, correlation, text length"]
```

---

## 🧷 Notebook navigation (what to look at)

- **Cell 2**: dataset load + label mapping + cleaning + export
- **Cell 3**: quick preview (`final_df.head()`)
- **Cell 4**: label frequency plot + correlation heatmap + text length histogram
- **Cell 5**: explicit marker: **“Model Training starts from here”**

---

## ⚠️ Notes / caveats (pre-training)

- `from google.colab import files` is Colab-specific; it is only needed when downloading the exported file from a Colab session.
- The matching is substring-based. It is intentionally permissive so that naming variants are caught, but it can over-match: for example, `fix` also occurs inside `prefix`, and `test` inside `latest`.

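
The over-matching caveat can be seen on a few invented label strings. The sketch below compares the permissive substring rule with a stricter variant that requires the substring to start a word. The stricter variant is an assumption offered for comparison, not something the notebook does, and it has its own cost: it would miss labels such as `hotfix`. Both helper names are hypothetical.

```python
import re

substrings = {"bug": ["bug", "fix"], "test": ["test"]}  # subset of label_map


def substring_hits(label_text):
    """Permissive matching, as in the notebook: any substring occurrence counts."""
    text = label_text.lower()
    return [t for t, subs in substrings.items() if any(s in text for s in subs)]


def prefix_hits(label_text):
    """Stricter variant (an assumption, not notebook code): a substring must start a word."""
    text = label_text.lower()
    return [
        t
        for t, subs in substrings.items()
        if re.search(r"\b(?:" + "|".join(map(re.escape, subs)) + ")", text)
    ]


# Invented labels; the first two are not about bugs or tests.
for label in ["latest", "prefix-cleanup", "bugfix", "needs tests"]:
    print(f"{label!r:18} substring={substring_hits(label)} prefix={prefix_hits(label)}")
# 'latest' and 'prefix-cleanup' are hit by the substring rule but not by the prefix rule;
# 'bugfix' and 'needs tests' are hit by both.
```
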

---

## ✅ Output artifact

- `data/cleaned_github_data.csv`: the cleaned, filtered multi-label dataset used for training.
