Commit 1975dde

Revise README for clarity and structure updates
Update README to reflect new repository structure and add details on multi-task training and evaluation.
1 parent 8c68652 commit 1975dde

1 file changed

Lines changed: 139 additions & 32 deletions

README.md

@@ -4,68 +4,175 @@
44

55
Large Language Models (LLMs) have proven highly effective at automating software engineering tasks, bridging natural language and code semantics to achieve notable results in code generation and summarization. However, their scale incurs substantial computational costs, making full fine-tuning impractical. Parameter-Efficient Fine-Tuning (PEFT) methods such as QLoRA enable efficient specialization with lower resource demands. Recent studies show that QLoRA-optimized Large Code Models (LCMs) perform strongly across diverse tasks, yet it remains unclear whether this effectiveness persists when a single model is QLoRA fine-tuned for multiple code-related tasks. The interaction between Multi-task fine-tuning and QLoRA optimization, and how transfer learning affects the correctness and quality of generated artifacts, remains largely unexplored. We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. We evaluate functional correctness through execution-based and similarity-based metrics, complemented by a comprehensive code quality analysis, an aspect largely overlooked in prior work. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance relative to both Single-task QLoRA and Multi-task full fine-tuning. Larger models demonstrate a more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues.
66

7-
This repository contains experiments on **code summarization**, **code generation**, **code translation**, and **multitask training** using two training strategies:
8-
9-
- **FFT** (Full Fine-Tuning)
10-
- **QLoRA** (Quantized Low-Rank Adaptation)
11-
12-
Each task has its own folder with separate scripts for FFT and QLoRA.
13-
147
---
158

169
## Repository Structure
10+
1711
```
1812
.
19-
├── summarization/ # fft_train.py, qlora_train.py
20-
├── generation/ # fft_train.py, qlora_train.py
21-
├── translation/ # fft_train.py, qlora_train.py
22-
├── multitask/ # fft_train.py, qlora_train.py
23-
├── requirements.txt
13+
├── generation/ # Code generation scripts
14+
├── summarization/ # Code summarization scripts
15+
├── translation/ # Code translation scripts
16+
├── multitask/ # Multi-task training scripts
17+
├── non_functional_analysis/ # Static analysis tools (PMD, Pylint, SonarCloud, etc.)
18+
├── statistical_tests/ # Statistical analysis (Wilcoxon signed-rank tests)
19+
│ ├── R-scripts/ # R scripts and results
20+
│ └── per-instance-value/ # Per-instance metric values
21+
├── results/ # Experimental results
22+
├── dataset/ # Training datasets
23+
├── requirements.txt # Python dependencies
2424
└── README.md
2525
```
2626

2727
---
2828

2929
## Setup
30-
Install dependencies:
30+
3131
```bash
3232
pip install -r requirements.txt
3333
```
3434

35+
Set environment variables:
36+
37+
```bash
38+
export HF_TOKEN="your-huggingface-token"
39+
export OPENAI_API_KEY="your-openai-api-key" # For LLM-as-judge evaluation
40+
```
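
Before launching a long training or evaluation run, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes that an empty value should count as missing. `OPENAI_API_KEY` is only needed for the LLM-as-judge evaluation, so it can be dropped from the tuple for training-only runs.

```python
import os

# Variables named in the Setup section above.
REQUIRED_VARS = ("HF_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching real credentials.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Illustrative environment with a placeholder token and an empty API key.
example_env = {"HF_TOKEN": "placeholder-token", "OPENAI_API_KEY": ""}
print(missing_env_vars(environ=example_env))  # ['OPENAI_API_KEY']
```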
41+
3542
---
3643

37-
## Usage Examples
44+
## Task Overview
45+
46+
| Task | Folder | FFT Script | QLoRA Script |
47+
|------|--------|------------|--------------|
48+
| Code Generation | `generation/` | `fft_train.py` | `qlora_train.py` |
49+
| Code Summarization | `summarization/` | `fft_train.py` | `qlora_train.py` |
50+
| Code Translation | `translation/` | `fft_train.py` | `qlora_train.py` |
51+
| Multi-Task | `multitask/` | `fft_train.py` | `qlora_train.py` |
52+
53+
Each task folder contains a **README.md** with detailed instructions for training, inference, and evaluation.
54+
55+
---
56+
57+
## Single-Task Training & Evaluation
58+
59+
See the README in each task folder:
60+
61+
- [generation/README.md](generation/README.md)
62+
- [summarization/README.md](summarization/README.md)
63+
- [translation/README.md](translation/README.md)
64+
65+
---
66+
67+
## Multi-Task Training
68+
69+
### FFT (Full Fine-Tuning)
70+
71+
```bash
72+
python multitask/fft_train.py \
73+
--base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
74+
--device_batch_size 2 \
75+
--gradient_accumulation_steps 16 \
76+
--sample_size -1 \
77+
--val_sample_size -1 \
78+
--eval_samples 400 \
79+
--num_train_epochs 5
80+
```
81+
82+
### QLoRA (Quantized Low-Rank Adaptation)
3883

39-
### Multitask — FFT
4084
```bash
41-
python multitask/fft_train.py --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct --device_batch_size 2 --gradient_accumulation_steps 16 --save_processed_data True --sample_size -1 --val_sample_size -1 --eval_samples 400 --early_stopping_patience 3 --early_stopping_threshold 0.001 --num_train_epochs 10
85+
python multitask/qlora_train.py \
86+
--base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
87+
--device_batch_size 2 \
88+
--gradient_accumulation_steps 16 \
89+
--sample_size -1 \
90+
--val_sample_size -1 \
91+
--eval_samples 400 \
92+
--num_train_epochs 5
4293
```
4394

44-
For a **generic FFT run**, remove the early-stopping flags:
45-
- `--early_stopping_patience`
46-
- `--early_stopping_threshold`
95+
**Arguments:**
96+
- `--base_model_name`: Base model (`Qwen/Qwen2.5-Coder-0.5B-Instruct`, `Qwen/Qwen2.5-Coder-1.5B-Instruct`, `Qwen/Qwen2.5-Coder-3B-Instruct`)
97+
- `--device_batch_size`: Per-device batch size (default: 2)
98+
- `--gradient_accumulation_steps`: Gradient accumulation steps (default: 16; with a per-device batch size of 2, effective batch size = 2 × 16 = 32)
99+
- `--sample_size`: Training samples per task (-1 for full dataset)
100+
- `--val_sample_size`: Validation samples per task (-1 for full dataset)
101+
- `--eval_samples`: Samples for evaluation during training
102+
- `--num_train_epochs`: Number of training epochs
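
The effective batch size quoted in the argument list follows from multiplying the per-device batch size by the number of accumulation steps. A small sketch of that arithmetic is shown here; the function name and the `num_devices` parameter are assumptions added for illustration (the README's figure of 32 corresponds to a single device), and nothing here is taken from the training scripts themselves.

```python
def effective_batch_size(device_batch_size=2,
                         gradient_accumulation_steps=16,
                         num_devices=1):
    """Number of samples contributing to one optimizer update.

    `num_devices` is a hypothetical extra knob for data-parallel runs;
    with the default of 1 the result matches the README's value.
    """
    return device_batch_size * gradient_accumulation_steps * num_devices


print(effective_batch_size())  # 32
```

Halving `--device_batch_size` while doubling `--gradient_accumulation_steps` keeps this product unchanged, which is the usual way to trade memory for wall-clock time.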
103+
104+
### Multi-Task Evaluation
105+
106+
After training, evaluate multi-task models using the inference scripts in each task folder:
107+
- **Code Generation**: `generation/codereval/infer_generation_*.py`
108+
- **Code Summarization**: `summarization/codereval/infer_summarization_*.py`
109+
- **Code Translation**: `translation/infer_translation_*.py`
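
Because the inference scripts are referenced by glob pattern rather than by exact name, a quick listing can show which ones exist in a local checkout. This is a hedged sketch using only the three patterns given above; the helper `find_inference_scripts` is hypothetical, and the script suffixes it finds depend on the repository contents.

```python
from pathlib import Path

# Glob patterns copied from the list above, keyed by task.
INFERENCE_PATTERNS = {
    "Code Generation": "generation/codereval/infer_generation_*.py",
    "Code Summarization": "summarization/codereval/infer_summarization_*.py",
    "Code Translation": "translation/infer_translation_*.py",
}


def find_inference_scripts(repo_root="."):
    """Map each task to the sorted, repo-relative paths matching its pattern."""
    root = Path(repo_root)
    return {
        task: sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
        for task, pattern in INFERENCE_PATTERNS.items()
    }


# Run from the repository root; tasks with no matching script map to [].
for task, scripts in find_inference_scripts(".").items():
    print(f"{task}: {scripts}")
```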
47110

48111
---
49112

50-
### Multitask — QLoRA
51-
```bash
52-
python multitask/qlora_train.py --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct --device_batch_size 2 --gradient_accumulation_steps 16 --save_processed_data True --sample_size -1 --val_sample_size -1 --eval_samples 400 --num_train_epochs 10
113+
## Non-Functional Analysis
114+
115+
The `non_functional_analysis/` folder contains scripts for static code analysis:
116+
117+
| Script | Purpose |
118+
|--------|---------|
119+
| `analysis_PMD_checkstyle.py` | PMD and Checkstyle analysis (Java) |
120+
| `analysis_pylint_flake8.py` | Pylint and Flake8 analysis (Python) |
121+
| `lizard_analysis.py` | Cyclomatic complexity analysis |
122+
| `RoslynAnalyzer/` | C# code analysis (translation task) |
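
To make the cyclomatic complexity metric concrete, here is a simplified, Python-only approximation built on the standard `ast` module. It is not how `lizard_analysis.py` or Lizard itself computes the value (Lizard has its own multi-language parser and counting rules); the sketch only illustrates the underlying idea of "one plus the number of decision points", and the set of node types counted is an assumption.

```python
import ast

# Node types treated as one decision point each (a simplifying assumption).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor,
                   ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity of a Python source string."""
    complexity = 1  # a single straight-line path to start with
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit branches
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # the implicit loop plus each `if` filter
            complexity += 1 + len(node.ifs)
    return complexity


SAMPLE = '''
def classify(values):
    result = []
    for v in values:
        if v > 0 and v % 2 == 0:
            result.append(v)
    return result
'''

# 1 (base) + 1 (for) + 1 (if) + 1 (and) = 4
print(cyclomatic_complexity(SAMPLE))  # 4
```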
123+
124+
---
125+
126+
## Statistical Tests
127+
128+
The `statistical_tests/` folder contains Wilcoxon signed-rank tests comparing single-task vs. multi-task performance.
129+
130+
```
131+
statistical_tests/
132+
├── R-scripts/ # R scripts and results
133+
└── per-instance-value/ # Raw per-instance metric values
134+
    ├── pass1/
135+
    ├── bleu_meteor_rouge_chrf_bertscore_side/summarization/
136+
    ├── codebleu/translation/
137+
    ├── llm_judge/
138+
    ├── pmd/
139+
    ├── pylint/generation/
140+
    ├── sonarcloud/
141+
    ├── lizard/
142+
    └── roslyn/java_cs/
53143
```
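
The repository runs these tests in R. For readers who want to see what the Wilcoxon signed-rank test does with paired per-instance values, the following pure-Python sketch may help. It is an illustration only, not a port of the R scripts: it drops zero differences, assigns average ranks to ties, and uses the plain normal approximation with no tie or continuity correction, so its p-values are rough for small or heavily tied samples. The scores in the example are made up.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # discard zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |differences|, giving tied values the average of their positions.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation to the null distribution of min(W+, W-).
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical per-instance scores for two configurations of the same model.
single_task = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49]
multi_task = [0.47, 0.53, 0.66, 0.45, 0.74, 0.58]
print(wilcoxon_signed_rank(multi_task, single_task))
```

For reporting results, an established implementation (such as R's `wilcox.test`, which the repository's scripts presumably rely on) is preferable to this sketch.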
54144

55145
---
56146

57-
## Task Overview
147+
## Models
148+
149+
| Model | Parameters | HuggingFace |
150+
|-------|------------|-------------|
151+
| Qwen2.5-Coder-0.5B-Instruct | 0.5B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) |
152+
| Qwen2.5-Coder-1.5B-Instruct | 1.5B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) |
153+
| Qwen2.5-Coder-3B-Instruct | 3B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) |
154+
155+
---
156+
157+
## Datasets
58158

59-
| Task | FFT Script | QLoRA Script |
60-
|----------------|-----------------------------|--------------------------------|
61-
| Summarization | `summarization/fft_train.py` | `summarization/qlora_train.py` |
62-
| Generation | `generation/fft_train.py` | `generation/qlora_train.py` |
63-
| Translation | `translation/fft_train.py` | `translation/qlora_train.py` |
64-
| Multitask | `multitask/fft_train.py` | `multitask/qlora_train.py` |
159+
| Task | Dataset | Source |
160+
|------|---------|--------|
161+
| Code Generation | CodeXGLUE | [Link](https://github.com/microsoft/CodeXGLUE) |
162+
| Code Summarization | CodeXGLUE Code-to-Text | [Link](https://github.com/microsoft/CodeXGLUE) |
163+
| Code Translation | CodeXGLUE Java-C# | [Link](https://github.com/microsoft/CodeXGLUE) |
164+
| Evaluation | CoderEval | [Link](https://github.com/CoderEval/CoderEval) |
65165

66166
---
67167

68168
## Notes
69-
- **Batch size** = 2, **Grad accumulation** = 16 (default across all tasks).
70-
- **Validation samples**: recommendation, 5–10% of the validation set. To avoid OOM, capped to **250–400** depending on the task.
71-
- **Evaluation steps** are computed as *steps per epoch*, i.e., dataset size ÷ effective batch size.
169+
170+
- **Batch size**: 2, **Gradient accumulation**: 16 (default across all tasks, effective batch size = 32)
171+
- **Validation samples**: 5–10% of the validation set is recommended; the count is capped at 250–400, depending on the task, to avoid out-of-memory (OOM) errors
172+
- **Evaluation steps**: Set to the number of steps per epoch, computed as dataset size ÷ effective batch size
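
The two rules of thumb in these notes can be written down as small helpers. This is a sketch under stated assumptions rather than the scripts' actual logic: rounding up a partial final batch, the 5% default fraction, and the cap of 400 are choices made here for illustration, and the dataset sizes in the example are hypothetical.

```python
import math


def steps_per_epoch(dataset_size, effective_batch_size=32):
    """Optimizer steps in one epoch; a partial last batch counts as a step."""
    return math.ceil(dataset_size / effective_batch_size)


def eval_sample_count(val_size, fraction=0.05, cap=400):
    """Take a fraction of the validation set, but never more than `cap`."""
    return min(cap, max(1, round(val_size * fraction)))


print(steps_per_epoch(10_000))    # 313
print(eval_sample_count(20_000))  # 400 (5% would be 1000, so the cap applies)
print(eval_sample_count(3_000))   # 150
```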
173+
174+
---
175+
176+
## License
177+
178+
This project is licensed under the MIT License.
