Commit 873a32f

Merge pull request #31 from Quantum-Software-Development/FabianaCampanari-patch-1
Update README.md
2 parents 5350f98 + b9ad993 commit 873a32f

1 file changed: README.md (106 additions & 22 deletions)

@@ -148,61 +148,97 @@ feature_groups = {
148148
}
149149
```
150150

151+
<br>
152+
151153
This makes it easier to:
154+
152155
1. Apply specific transformations to each group
153156
2. Feed organized data to LLMs
154157
3. Understand your dataset structure
155158
4. Create modular and maintainable code
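
As a minimal sketch of the first point, the snippet below applies a different transformation to each group. The data, the contents of `feature_groups`, and the two helper functions are illustrative assumptions, not code from this repository.

```python
import pandas as pd

# Hypothetical data and grouping; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000],
    "city": [" Lisbon", "PORTO ", "braga"],
})
feature_groups = {
    "numeric": ["age", "salary"],
    "text": ["city"],
}

def min_max_scale(frame):
    """Rescale every column to the [0, 1] range."""
    return (frame - frame.min()) / (frame.max() - frame.min())

def normalize_text(frame):
    """Strip surrounding whitespace and lowercase every column."""
    return frame.apply(lambda col: col.str.strip().str.lower())

# One transformation per group name (assumed pairing).
transformers = {"numeric": min_max_scale, "text": normalize_text}

# Transform each group separately, then stitch the pieces back together.
parts = []
for group_name, columns in feature_groups.items():
    parts.append(transformers[group_name](df[columns]))
processed = pd.concat(parts, axis=1)

print(processed)
```

Keeping the group-to-transformer pairing in its own dictionary means a new group only needs one new entry, not a change to the loop.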
156159

157-
---
158160

159-
## 🎯 Why Use This Technique?
161+
162+
<br><br>
163+
164+
165+
166+
## Why Use This Technique?
167+
168+
<br>
160169

161170
### **For Traditional ML**
162171
- 📦 **Organized Feature Engineering**: Group numerical, categorical, and text features separately
163172
- ⚛️ **Pipeline Efficiency**: Apply different transformers to different feature groups
164173
- 🧠 **Better Understanding**: Know which features belong together conceptually
165174

175+
176+
<br>
177+
178+
166179
### **For LLM Integration**
167180
- 🤖 **Semantic Context**: Grouping related features gives an LLM clearer context than an unordered list of columns
168181
- 💬 **Prompt Engineering**: Create structured prompts with organized feature groups
169182
- 🔗 **Hybrid Models**: Combine tabular data with LLM embeddings effectively
170183
- 🚀 **Feature Generation**: Use LLMs to create new features from grouped columns
171184

172-
---
185+
186+
<br><br>
187+
173188

174189
## 📝 Key Concepts
175190

176191
### 1. **Pandas GroupBy**
177192
Core pandas functionality for splitting data into groups, applying a function to each group, and combining the results:
193+
194+
<br>
195+
196+
```python
df.groupby('category').agg({'value': 'mean'})
```
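
To make the split, apply, and combine steps concrete, here is a small sketch that runs the same kind of call on made-up data. The rows and the output column names (`mean_value`, `max_value`, `n_rows`) are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data; the column names mirror the one-liner above.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [10, 20, 30, 40, 50],
})

# Split by 'category', apply several aggregations to 'value',
# and combine the results into one summary table.
summary = df.groupby("category").agg(
    mean_value=("value", "mean"),
    max_value=("value", "max"),
    n_rows=("value", "count"),
)

print(summary)
```

Named aggregation, as used here, keeps the output column names explicit, which helps when the summary is later merged back into the original table.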
181200

201+
<br>
202+
182203
### 2. **Dictionary Mapping**
183204
Using dictionaries to define feature relationships:
205+
206+
<br>
207+
```python
column_mapping = {
    'group_name': ['col1', 'col2', 'col3']
}
```
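
A mapping like this is easiest to trust when it is validated once up front. The sketch below, using hypothetical group and column names, inverts the mapping into a column-to-group lookup and reports mapped columns that a dataset does not contain. Both helper functions are illustrative and not part of this repository.

```python
# Hypothetical mapping in the same shape as the block above.
column_mapping = {
    "personal": ["name", "age"],
    "professional": ["salary", "department"],
}

def invert_mapping(mapping):
    """Return a column -> group lookup, rejecting columns listed in two groups."""
    lookup = {}
    for group_name, columns in mapping.items():
        for column in columns:
            if column in lookup:
                raise ValueError(
                    f"{column!r} appears in both {lookup[column]!r} and {group_name!r}"
                )
            lookup[column] = group_name
    return lookup

def missing_columns(mapping, available):
    """List mapped columns that are absent from the available column names."""
    available = set(available)
    return [c for cols in mapping.values() for c in cols if c not in available]

column_to_group = invert_mapping(column_mapping)
print(column_to_group["salary"])
print(missing_columns(column_mapping, ["name", "age", "salary"]))
```

With a DataFrame, `missing_columns(column_mapping, df.columns)` would perform the same check before any group is selected.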
189213

214+
<br>
215+
190216
### 3. **LLM Feature Engineering**
217+
191218
Leveraging LLMs to:
219+
192220
- Generate text embeddings from grouped text columns
193221
- Create semantic features
194222
- Enrich tabular data with contextual information
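
As a rough sketch of the first bullet, the snippet below prepares one text string per row from a group of text columns, ready to pass to whichever embedding model you use. The records, the contents of `text_groups`, and the `combine_text_columns` helper are hypothetical. The embedding call itself is left out because this README does not name a model or client library.

```python
# Hypothetical records, e.g. as produced by df.to_dict("records").
records = [
    {"title": "Wireless Mouse", "description": "Ergonomic, 2.4 GHz", "price": 19.9},
    {"title": "USB-C Cable", "description": None, "price": 7.5},
]

# Assumed grouping: which text columns belong together.
text_groups = {"product_text": ["title", "description"]}

def combine_text_columns(record, columns, sep=" | "):
    """Join the non-empty text values of the given columns into one string."""
    parts = [str(record[c]).strip() for c in columns if record.get(c)]
    return sep.join(parts)

texts = [combine_text_columns(r, text_groups["product_text"]) for r in records]
print(texts)
```

Skipping empty values avoids sending placeholder strings such as "None" to the model.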
195223

196-
---
197224

198-
## 📦 Installation
225+
<br><br>
226+
227+
228+
229+
## Installation
199230

200231
### Prerequisites
232+
201233
- Python 3.8+
202234
- pip or conda
203235

236+
<br>
237+
204238
### Install Dependencies
205239

240+
<br>
241+
206242
```bash
207243
# Clone the repository
208244
git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git
@@ -212,8 +248,12 @@ cd 16-DataMining_llm-tabular-preprocessing-dict-groups
212248
pip install -r requirements.txt
213249
```
214250

251+
<br>
252+
215253
### Docker Setup (Optional)
216254

255+
<br>
256+
217257
```bash
218258
# Build Docker image
219259
docker build -t dict-groups-preprocessing .
@@ -222,12 +262,16 @@ docker build -t dict-groups-preprocessing .
222262
docker run -p 8888:8888 dict-groups-preprocessing
223263
```
224264

225-
---
226265

227-
## 🚀 Quick Start
266+
<br><br>
267+
268+
269+
## Quick Start
228270

229271
### Basic Example
230272

273+
<br>
274+
231275
```python
232276
import pandas as pd
233277

@@ -254,8 +298,14 @@ for group_name, columns in feature_dict.items():
254298
print(df[columns].head())
255299
```
256300

301+
<br>
302+
303+
257304
### Output:
258-
```
305+
306+
<br>
307+
308+
```text
259309
Processing personal:
260310
name age
261311
0 Alice 25
@@ -275,12 +325,17 @@ Processing professional:
275325
2 70000 IT
276326
```
277327

278-
---
279328

280-
## 💻 Basic Examples
329+
<br><br>
330+
331+
332+
## Basic Examples
281333

282334
### Example 1: Grouping by Data Type
283335

336+
<br>
337+
338+
284339
```python
285340
import pandas as pd
286341
import numpy as np
@@ -301,12 +356,15 @@ type_groups = {
301356
}
302357
```
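
When a table has many columns, a dictionary of this kind can also be derived from the dtypes instead of being typed by hand. The following is a sketch under assumed column names and group keys (`boolean`, `numeric`, `other`). It is not the repository's own `type_groups` definition.

```python
import pandas as pd

# Illustrative frame; the column names are assumptions.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [50000.0, 60000.0, 70000.0],
    "department": ["HR", "Finance", "IT"],
    "is_manager": [False, True, False],
})

def group_columns_by_dtype(frame):
    """Bucket column names into boolean, numeric, and other groups."""
    groups = {"boolean": [], "numeric": [], "other": []}
    for column in frame.columns:
        dtype = frame[column].dtype
        # Check booleans first: pandas also reports bool as a numeric dtype.
        if pd.api.types.is_bool_dtype(dtype):
            groups["boolean"].append(column)
        elif pd.api.types.is_numeric_dtype(dtype):
            groups["numeric"].append(column)
        else:
            groups["other"].append(column)
    return groups

type_groups = group_columns_by_dtype(df)
print(type_groups)
```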
303358

304-
---
359+
<br><br>
360+
305361

306362
## 🤖 Advanced Usage with LLMs
307363

308364
### LLM-Based Feature Generation
309365

366+
<br>
367+
310368
```python
311369
# Example: Using grouped text features for LLM prompts
312370
text_groups = {
@@ -322,20 +380,29 @@ def create_llm_prompt(row, group_dict):
322380
return prompt
323381
```
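
Because the diff elides the body of `create_llm_prompt`, the sketch below shows one possible way to turn a row and a group dictionary into a sectioned prompt. The function name `build_grouped_prompt`, the sample row, the section layout, and the task sentence are all assumptions made for illustration.

```python
# Hypothetical row and grouping; neither is taken from the repository's data.
row = {"name": "Alice", "age": 25, "salary": 50000, "department": "HR"}
group_dict = {
    "personal": ["name", "age"],
    "professional": ["salary", "department"],
}

def build_grouped_prompt(row, group_dict, task="Summarize this employee record."):
    """Render one section per feature group, then append the task instruction."""
    lines = []
    for group_name, columns in group_dict.items():
        lines.append(f"[{group_name}]")
        for column in columns:
            lines.append(f"- {column}: {row[column]}")
        lines.append("")  # blank line between sections
    lines.append(task)
    return "\n".join(lines)

prompt = build_grouped_prompt(row, group_dict)
print(prompt)
```

The function only indexes `row` by column name, so it should work equally with a plain dictionary or a pandas Series taken from `df.iterrows()`.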
324382

325-
---
383+
384+
<br><br>
385+
326386

327387
## 🌐 Real-World Applications
328388

389+
<br>
390+
329391
1. **E-commerce**: Group product features, pricing, and reviews
330392
2. **Healthcare**: Organize patient demographics, vitals, and medical history
331393
3. **Finance**: Separate transaction data, customer info, and risk factors
332394
4. **NLP**: Combine tabular + text data for hybrid models
333395

334-
---
396+
397+
<br><br>
398+
335399

336400
## 📂 Project Structure
337401

338-
```
402+
<br>
403+
404+
405+
```text
339406
16-DataMining_llm-tabular-preprocessing-dict-groups/
340407
341408
├── Codes/
@@ -353,39 +420,56 @@ def create_llm_prompt(row, group_dict):
353420
└── README.pt_BR.md
354421
```
355422

356-
---
357423

358-
## 📓 Notebooks
424+
<br><br>
425+
426+
427+
428+
## Notebooks
429+
430+
<br>
359431

360432
### 1. `notebooks_01_basic_example.ipynb`
433+
361434
- Introduction to dictionary-based grouping
362435
- Basic Pandas operations
363436
- Simple examples with sample data
364437

438+
<br>
439+
440+
365441
### 2. `notebooks_02_llm_preprocessing.ipynb`
442+
366443
- Advanced LLM integration
367444
- Feature generation using grouped data
368445
- Real-world dataset examples
369446

370447
👉 **Open in Colab**: [Basic Example](https://colab.research.google.com) | [LLM Preprocessing](https://colab.research.google.com)
371448

372-
---
373449

374-
## 📊 Dataset Resources
450+
<br><br>
451+
452+
453+
454+
## Dataset Resources
375455

376456
The notebooks use publicly available datasets:
377457

378458
- **UCI Machine Learning Repository**: https://archive.ics.uci.edu/ml/index.php
379459
- **Kaggle Datasets**: https://www.kaggle.com/datasets
380460
- **Hugging Face Datasets**: https://huggingface.co/datasets
381461

382-
---
383462

384-
## 📚 References
385463

386-
- **Chen, X., et al.** (2024). LLM-based feature generation from text for interpretable machine learning. *arXiv preprint*. Retrieved from [arxiv.org/html/2409.07132v2](https://arxiv.org/html/2409.07132v2)
464+
<br><br>
465+
466+
467+
## References
468+
469+
470+
[1] **Chen, X., et al.** (2024). LLM-based feature generation from text for interpretable machine learning. *arXiv preprint*. Retrieved from [arxiv.org/html/2409.07132v2](https://arxiv.org/html/2409.07132v2)
387471

388-
- **DataCamp.** (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from [datacamp.com/tutorial/pandas-groupby](https://www.datacamp.com/tutorial/pandas-groupby)
472+
[2] **DataCamp.** (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from [datacamp.com/tutorial/pandas-groupby](https://www.datacamp.com/tutorial/pandas-groupby)
389473

390474
[3] **GeeksforGeeks.** (2024). Pandas dataframe.groupby() Method. Retrieved from [geeksforgeeks.org](https://www.geeksforgeeks.org/pandas-groupby/)
391475
