@@ -148,61 +148,97 @@ feature_groups = {
148148}
149149```
150150
151+ <br >
152+
This makes it easier to:
154+
1. Apply specific transformations to each group
2. Feed organized data to LLMs
3. Understand your dataset structure
4. Create modular and maintainable code
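
As a minimal sketch of the first point, the snippet below applies a different transformation to each group. The sample data, the group names `numeric` and `text`, and the `transforms` mapping are illustrative assumptions, not code taken from this project.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000],
    "city": [" Paris", "ROME ", "berlin"],
})

# Dictionary mapping each group name to the columns it contains.
feature_groups = {
    "numeric": ["age", "salary"],
    "text": ["city"],
}

# One transformation per group (assumed choices for this sketch).
transforms = {
    # Min-max scale each numeric column to the [0, 1] range.
    "numeric": lambda g: (g - g.min()) / (g.max() - g.min()),
    # Strip whitespace and lowercase each text column.
    "text": lambda g: g.apply(lambda col: col.str.strip().str.lower()),
}

processed = df.copy()
for group_name, columns in feature_groups.items():
    processed[columns] = transforms[group_name](df[columns])

print(processed)
```

Keeping the transformations in a second dictionary keyed by group name means a new group only needs one entry in each dictionary, and the loop stays unchanged.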
156159
157- ---
158160
159- ## 🎯 Why Use This Technique?
161+
162+ <br ><br >
163+
164+
165+
166+ ## Why Use This Technique?
167+
168+ <br >
160169
### **For Traditional ML**
- 📦 **Organized Feature Engineering**: Group numerical, categorical, and text features separately
- ⚛️ **Pipeline Efficiency**: Apply different transformers to different feature groups
- 🧠 **Better Understanding**: Know which features belong together conceptually
165174
175+
176+ <br >
177+
178+
### **For LLM Integration**
- 🤖 **Semantic Context**: LLMs tend to perform better when features are semantically grouped
- 💬 **Prompt Engineering**: Create structured prompts with organized feature groups
- 🔗 **Hybrid Models**: Combine tabular data with LLM embeddings effectively
- 🚀 **Feature Generation**: Use LLMs to create new features from grouped columns
171184
172- ---
185+
186+ <br ><br >
187+
173188
## 📝 Key Concepts

### 1. **Pandas GroupBy**
Core pandas functionality for splitting data into groups, applying a function to each group, and combining the results:
193+
194+ <br >
195+
196+
```python
df.groupby('category').agg({'value': 'mean'})
```
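
To make the split-apply-combine idea concrete, here is a small sketch. The `category` and `value` column names follow the one-liner above; the data itself is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "value": [10, 20, 30, 40, 50],
})

# Split rows by 'category', apply the mean to 'value', combine into one result.
means = df.groupby("category").agg({"value": "mean"})
print(means)

# Named aggregation computes several statistics per group in one call.
summary = df.groupby("category").agg(
    value_mean=("value", "mean"),
    value_max=("value", "max"),
)
print(summary)
```

The result is indexed by the grouping key, so a single group's statistic can be read with `means.loc["a", "value"]`.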
181200
201+ <br >
202+
### 2. **Dictionary Mapping**
Using dictionaries to define feature relationships:
205+
206+ <br >
207+
```python
column_mapping = {
    'group_name': ['col1', 'col2', 'col3']
}
```
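
A mapping like this is easier to trust if it is checked before use. The sketch below shows one possible way to do that in plain Python; the helper names `missing_columns` and `invert_mapping`, the second group, and the `available` list are hypothetical and not part of the project.

```python
from typing import Dict, List

# Same shape as the mapping above: group name -> list of column names.
column_mapping: Dict[str, List[str]] = {
    "group_name": ["col1", "col2", "col3"],
    "other_group": ["col4"],  # hypothetical second group
}


def missing_columns(mapping: Dict[str, List[str]], available: List[str]) -> List[str]:
    """Return mapped columns that are absent from the available columns."""
    known = set(available)
    return [col for cols in mapping.values() for col in cols if col not in known]


def invert_mapping(mapping: Dict[str, List[str]]) -> Dict[str, str]:
    """Build a column -> group lookup, rejecting columns listed in two groups."""
    lookup: Dict[str, str] = {}
    for group, cols in mapping.items():
        for col in cols:
            if col in lookup:
                raise ValueError(f"{col!r} is in both {lookup[col]!r} and {group!r}")
            lookup[col] = group
    return lookup


available = ["col1", "col2", "col3", "col5"]  # e.g. list(df.columns)
missing = missing_columns(column_mapping, available)
lookup = invert_mapping(column_mapping)
print(missing)
print(lookup["col2"])
```

Checking for missing columns up front turns a late `KeyError` during preprocessing into an explicit list of names to fix.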
189213
214+ <br >
215+
### 3. **LLM Feature Engineering**
217+
Leveraging LLMs to:
219+
- Generate text embeddings from grouped text columns
- Create semantic features
- Enrich tabular data with contextual information
195223
196- ---
197224
198- ## 📦 Installation
225+ <br ><br >
226+
227+
228+
229+ ## Installation
199230
### Prerequisites
232+
- Python 3.8+
- pip or conda
203235
236+ <br >
237+
### Install Dependencies

240+ <br >
241+
```bash
# Clone the repository
git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git
@@ -212,8 +248,12 @@ cd 16-DataMining_llm-tabular-preprocessing-dict-groups
pip install -r requirements.txt
```
214250
251+ <br >
252+
### Docker Setup (Optional)

255+ <br >
256+
```bash
# Build Docker image
docker build -t dict-groups-preprocessing .
@@ -222,12 +262,16 @@ docker build -t dict-groups-preprocessing .
docker run -p 8888:8888 dict-groups-preprocessing
```
224264
225- ---
226265
227- ## 🚀 Quick Start
266+ <br ><br >
267+
268+
269+ ## Quick Start
228270
### Basic Example

273+ <br >
274+
```python
import pandas as pd

@@ -254,8 +298,14 @@ for group_name, columns in feature_dict.items():
    print(df[columns].head())
```
256300
301+ <br >
302+
303+
257304### Output:
258- ```
305+
306+ <br >
307+
308+ ``` python
259309Processing personal:
260310 name age
2613110 Alice 25
@@ -275,12 +325,17 @@ Processing professional:
2753252 70000 IT
276326```
277327
278- ---
279328
280- ## 💻 Basic Examples
329+ <br ><br >
330+
331+
332+ ## Basic Examples
281333
### Example 1: Grouping by Data Type

336+ <br >
337+
338+
```python
import pandas as pd
import numpy as np
@@ -301,12 +356,15 @@ type_groups = {
}
```
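
One way to build type-based groups without listing column names by hand is `DataFrame.select_dtypes`. The sketch below is an assumption about how such a dictionary could be derived, not necessarily how `type_groups` is defined in this project; the sample data and the group keys are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with mixed column types.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "score": [0.5, 0.7, 0.9],
    "is_active": [True, False, True],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})

# Derive each group from the column dtypes instead of hard-coding names.
type_groups = {
    "numerical": df.select_dtypes(include="number").columns.tolist(),
    "boolean": df.select_dtypes(include="bool").columns.tolist(),
    "datetime": df.select_dtypes(include="datetime").columns.tolist(),
}

for group_name, columns in type_groups.items():
    print(group_name, columns)
```

Deriving the groups from dtypes keeps the dictionary in sync when columns are added, at the cost of losing any conceptual grouping that the dtypes alone cannot express.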
303358
304- ---
359+ <br ><br >
360+
305361
## 🤖 Advanced Usage with LLMs

### LLM-Based Feature Generation

366+ <br >
367+
```python
# Example: Using grouped text features for LLM prompts
text_groups = {
@@ -322,20 +380,29 @@ def create_llm_prompt(row, group_dict):
    return prompt
```
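
Since the diff elides most of the block above, here is a separate, self-contained sketch of the same idea: rendering one record as a prompt with a labeled section per feature group. The function name `build_group_prompt`, the group and column names, the sample record, and the instruction text are all hypothetical; this is not the project's `create_llm_prompt`.

```python
from typing import Dict, List, Mapping

# Hypothetical groups and record; the names are illustrative only.
text_groups: Dict[str, List[str]] = {
    "product": ["title", "description"],
    "feedback": ["review"],
}

row = {
    "title": "Desk lamp",
    "description": "LED lamp with adjustable arm",
    "review": "Bright and sturdy",
    "price": 29.9,  # not in any group, so it is left out of the prompt
}


def build_group_prompt(row: Mapping[str, object], group_dict: Dict[str, List[str]]) -> str:
    """Render one record as a prompt with a labeled section per feature group."""
    sections = []
    for group_name, columns in group_dict.items():
        # Skip columns that the record does not contain.
        lines = [f"- {col}: {row[col]}" for col in columns if col in row]
        if lines:
            sections.append(f"[{group_name}]\n" + "\n".join(lines))
    return "Summarize this record.\n\n" + "\n\n".join(sections)


prompt = build_group_prompt(row, text_groups)
print(prompt)
```

The function only relies on `in` and `[]` lookups by column name, so it should also work with a pandas row obtained from `df.iterrows()`.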
324382
325- ---
383+
384+ <br ><br >
385+
326386
## 🌐 Real-World Applications

389+ <br >
390+
1. **E-commerce**: Group product features, pricing, and reviews
2. **Healthcare**: Organize patient demographics, vitals, and medical history
3. **Finance**: Separate transaction data, customer info, and risk factors
4. **NLP**: Combine tabular + text data for hybrid models
333395
334- ---
396+
397+ <br ><br >
398+
335399
336400## 📂 Project Structure
337401
338- ```
402+ <br >
403+
404+
405+ ``` bash
16-DataMining_llm-tabular-preprocessing-dict-groups/
│
├── Codes/
@@ -353,39 +420,56 @@ def create_llm_prompt(row, group_dict):
└── README.pt_BR.md
```
355422
356- ---
357423
358- ## 📓 Notebooks
424+ <br ><br >
425+
426+
427+
428+ ## Notebooks
429+
430+ <br >
359431
### 1. `notebooks_01_basic_example.ipynb`
433+
- Introduction to dictionary-based grouping
- Basic Pandas operations
- Simple examples with sample data
438+ <br >
439+
440+
### 2. `notebooks_02_llm_preprocessing.ipynb`
442+
- Advanced LLM integration
- Feature generation using grouped data
- Real-world dataset examples

👉 **Open in Colab**: [Basic Example](https://colab.research.google.com) | [LLM Preprocessing](https://colab.research.google.com)
371448
372- ---
373449
374- ## 📊 Dataset Resources
450+ <br ><br >
451+
452+
453+
454+ ## Dataset Resources
375455
The notebooks use publicly available datasets:

- **UCI Machine Learning Repository**: https://archive.ics.uci.edu/ml/index.php
- **Kaggle Datasets**: https://www.kaggle.com/datasets
- **Hugging Face Datasets**: https://huggingface.co/datasets
382- ---
383462
384- ## 📚 References
385463
386- - ** Chen, X., et al.** (2024). LLM-based feature generation from text for interpretable machine learning. * arXiv preprint* . Retrieved from [ arxiv.org/html/2409.07132v2] ( https://arxiv.org/html/2409.07132v2 )
464+ <br ><br >
465+
466+
467+ ## References
468+
469+
470+ [1](). **Chen, X., et al.** (2024). LLM-based feature generation from text for interpretable machine learning. *arXiv preprint*. Retrieved from [arxiv.org/html/2409.07132v2](https://arxiv.org/html/2409.07132v2)
387471
388- - ** DataCamp.** (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from [ datacamp.com/tutorial/pandas-groupby] ( https://www.datacamp.com/tutorial/pandas-groupby )
472+ [2](). **DataCamp.** (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from [datacamp.com/tutorial/pandas-groupby](https://www.datacamp.com/tutorial/pandas-groupby)
389473
- **GeeksforGeeks.** (2024). Pandas dataframe.groupby() Method. Retrieved from [geeksforgeeks.org](https://www.geeksforgeeks.org/pandas-groupby/)
391475