|
4 | 4 |
|
5 | 5 | Large Language Models (LLMs) have proven highly effective in automating software engineering tasks, bridging natural language and code semantics to achieve notable results in code generation and summarization. However, their scale incurs substantial computational costs, making full fine-tuning impractical. Parameter-Efficient Fine-Tuning (PEFT) methods like QLoRA enable efficient specialization with lower resource demands. Recent studies show QLoRA-optimized Large Code Models (LCMs) perform strongly across diverse tasks, yet it remains unclear whether this effectiveness persists when a single model is QLoRA fine-tuned for multiple code-related tasks. The interaction between Multi-task fine-tuning and QLoRA optimization, and the effect of transfer learning on the correctness and quality of generated artifacts, remain largely unexplored. We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. We evaluate functional correctness through execution-based and similarity-based metrics, complemented by a comprehensive code quality analysis, an aspect largely overlooked in prior work. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance relative to both Single-task QLoRA and Multi-task full fine-tuning. Larger models demonstrate a more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues. |
6 | 6 |
|
7 | | -This repository contains experiments on **code summarization**, **code generation**, **code translation**, and **multitask training** using two training strategies: |
8 | | - |
9 | | -- **FFT** (Full Fine-Tuning) |
10 | | -- **QLoRA** (Quantized Low-Rank Adaptation) |
11 | | - |
12 | | -Each task has its own folder with separate scripts for FFT and QLoRA. |
13 | | - |
14 | 7 | --- |
15 | 8 |
|
16 | 9 | ## Repository Structure |
| 10 | + |
17 | 11 | ``` |
18 | 12 | . |
19 | | -├── summarization/ # fft_train.py, qlora_train.py |
20 | | -├── generation/ # fft_train.py, qlora_train.py |
21 | | -├── translation/ # fft_train.py, qlora_train.py |
22 | | -├── multitask/ # fft_train.py, qlora_train.py |
23 | | -├── requirements.txt |
| 13 | +├── generation/ # Code generation scripts |
| 14 | +├── summarization/ # Code summarization scripts |
| 15 | +├── translation/ # Code translation scripts |
| 16 | +├── multitask/ # Multi-task training scripts |
| 17 | +├── non_functional_analysis/ # Static analysis tools (PMD, Pylint, SonarCloud, etc.) |
| 18 | +├── statistical_tests/ # Statistical analysis (Wilcoxon signed-rank tests) |
| 19 | +│ ├── R-scripts/ # R scripts and results |
| 20 | +│ └── per-instance-value/ # Per-instance metric values |
| 21 | +├── results/ # Experimental results |
| 22 | +├── dataset/ # Training datasets |
| 23 | +├── requirements.txt # Python dependencies |
24 | 24 | └── README.md |
25 | 25 | ``` |
26 | 26 |
|
27 | 27 | --- |
28 | 28 |
|
29 | 29 | ## Setup |
30 | | -Install dependencies: |
| 30 | + |
31 | 31 | ```bash |
32 | 32 | pip install -r requirements.txt |
33 | 33 | ``` |
34 | 34 |
|
| 35 | +Set environment variables: |
| 36 | + |
| 37 | +```bash |
| 38 | +export HF_TOKEN="your-huggingface-token" |
| 39 | +export OPENAI_API_KEY="your-openai-api-key" # For LLM-as-judge evaluation |
| 40 | +``` |
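
Before launching a long training or evaluation run, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes only the two variable names shown in the setup block above.

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_VARS = ("HF_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`.

    `env` is any mapping of variable names to values, e.g. os.environ.
    This helper is hypothetical and only illustrates a pre-flight check.
    """
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If only the training scripts are used, `OPENAI_API_KEY` may not be needed, since the setup comment ties it to the LLM-as-judge evaluation; in that case pass a shorter `required` tuple.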
| 41 | + |
35 | 42 | --- |
36 | 43 |
|
37 | | -## Usage Examples |
| 44 | +## Task Overview |
| 45 | + |
| 46 | +| Task | Folder | FFT Script | QLoRA Script | |
| 47 | +|------|--------|------------|--------------| |
| 48 | +| Code Generation | `generation/` | `fft_train.py` | `qlora_train.py` | |
| 49 | +| Code Summarization | `summarization/` | `fft_train.py` | `qlora_train.py` | |
| 50 | +| Code Translation | `translation/` | `fft_train.py` | `qlora_train.py` | |
| 51 | +| Multi-Task | `multitask/` | `fft_train.py` | `qlora_train.py` | |
| 52 | + |
| 53 | +Each task folder contains a **README.md** with detailed instructions for training, inference, and evaluation. |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +## Single-Task Training & Evaluation |
| 58 | + |
| 59 | +See the README in each task folder: |
| 60 | + |
| 61 | +- [generation/README.md](generation/README.md) |
| 62 | +- [summarization/README.md](summarization/README.md) |
| 63 | +- [translation/README.md](translation/README.md) |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## Multi-Task Training |
| 68 | + |
| 69 | +### FFT (Full Fine-Tuning) |
| 70 | + |
| 71 | +```bash |
| 72 | +python multitask/fft_train.py \ |
| 73 | + --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \ |
| 74 | + --device_batch_size 2 \ |
| 75 | + --gradient_accumulation_steps 16 \ |
| 76 | + --sample_size -1 \ |
| 77 | + --val_sample_size -1 \ |
| 78 | + --eval_samples 400 \ |
| 79 | + --num_train_epochs 5 |
| 80 | +``` |
| 81 | + |
| 82 | +### QLoRA (Quantized Low-Rank Adaptation) |
38 | 83 |
|
39 | | -### Multitask — FFT |
40 | 84 | ```bash |
41 | | -python multitask/fft_train.py --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct --device_batch_size 2 --gradient_accumulation_steps 16 --save_processed_data True --sample_size -1 --val_sample_size -1 --eval_samples 400 --early_stopping_patience 3 --early_stopping_threshold 0.001 --num_train_epochs 10 |
| 85 | +python multitask/qlora_train.py \ |
| 86 | + --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \ |
| 87 | + --device_batch_size 2 \ |
| 88 | + --gradient_accumulation_steps 16 \ |
| 89 | + --sample_size -1 \ |
| 90 | + --val_sample_size -1 \ |
| 91 | + --eval_samples 400 \ |
| 92 | + --num_train_epochs 5 |
42 | 93 | ``` |
43 | 94 |
|
44 | | -For a **generic FFT run**, remove the early-stopping flags: |
45 | | -- `--early_stopping_patience` |
46 | | -- `--early_stopping_threshold` |
| 95 | +**Arguments:** |
| 96 | +- `--base_model_name`: Base model (`Qwen/Qwen2.5-Coder-0.5B-Instruct`, `Qwen/Qwen2.5-Coder-1.5B-Instruct`, `Qwen/Qwen2.5-Coder-3B-Instruct`) |
| 97 | +- `--device_batch_size`: Per-device batch size (default: 2) |
| 98 | +- `--gradient_accumulation_steps`: Gradient accumulation steps (default: 16, effective batch size = 32) |
| 99 | +- `--sample_size`: Training samples per task (-1 for full dataset) |
| 100 | +- `--val_sample_size`: Validation samples per task (-1 for full dataset) |
| 101 | +- `--eval_samples`: Samples for evaluation during training |
| 102 | +- `--num_train_epochs`: Number of training epochs |
| 103 | + |
| 104 | +### Multi-Task Evaluation |
| 105 | + |
| 106 | +After training, evaluate multi-task models using the inference scripts in each task folder: |
| 107 | +- **Code Generation**: `generation/codereval/infer_generation_*.py` |
| 108 | +- **Code Summarization**: `summarization/codereval/infer_summarization_*.py` |
| 109 | +- **Code Translation**: `translation/infer_translation_*.py` |
47 | 110 |
|
48 | 111 | --- |
49 | 112 |
|
50 | | -### Multitask — QLoRA |
51 | | -```bash |
52 | | -python multitask/qlora_train.py --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct --device_batch_size 2 --gradient_accumulation_steps 16 --save_processed_data True --sample_size -1 --val_sample_size -1 --eval_samples 400 --num_train_epochs 10 |
| 113 | +## Non-Functional Analysis |
| 114 | + |
| 115 | +The `non_functional_analysis/` folder contains scripts for static code analysis: |
| 116 | + |
| 117 | +| Script | Purpose | |
| 118 | +|--------|---------| |
| 119 | +| `analysis_PMD_checkstyle.py` | PMD and Checkstyle analysis (Java) | |
| 120 | +| `analysis_pylint_flake8.py` | Pylint and Flake8 analysis (Python) | |
| 121 | +| `lizard_analysis.py` | Cyclomatic complexity analysis | |
| 122 | +| `RoslynAnalyzer/` | C# code analysis (translation task) | |
| 123 | + |
| 124 | +--- |
| 125 | + |
| 126 | +## Statistical Tests |
| 127 | + |
| 128 | +The `statistical_tests/` folder contains Wilcoxon signed-rank tests comparing single-task vs. multi-task performance. |
| 129 | + |
| 130 | +``` |
| 131 | +statistical_tests/ |
| 132 | +├── R-scripts/ # R scripts and results |
| 133 | +└── per-instance-value/ # Raw per-instance metric values |
| 134 | + ├── pass1/ |
| 135 | + ├── bleu_meteor_rouge_chrf_bertscore_side/summarization/ |
| 136 | + ├── codebleu/translation/ |
| 137 | + ├── llm_judge/ |
| 138 | + ├── pmd/ |
| 139 | + ├── pylint/generation/ |
| 140 | + ├── sonarcloud/ |
| 141 | + ├── lizard/ |
| 142 | + └── roslyn/java_cs/ |
53 | 143 | ``` |
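
For readers unfamiliar with the test, the sketch below illustrates how the Wilcoxon signed-rank statistic is formed from paired per-instance values such as those stored under `per-instance-value/`. It is an illustration only: the repository's actual tests are the R scripts, the function name `signed_rank_statistic` is hypothetical, and the sketch assumes zero differences are discarded and tied absolute differences receive average ranks. It computes the rank sums, not a p-value.

```python
def signed_rank_statistic(single_task, multi_task):
    """Return (W_plus, W_minus) for two paired samples.

    W_plus sums the ranks of pairs where the multi-task value is higher,
    W_minus the ranks of pairs where the single-task value is higher.
    """
    if len(single_task) != len(multi_task):
        raise ValueError("paired samples must have the same length")

    # Paired differences; pairs with no difference are dropped (assumption).
    diffs = [m - s for s, m in zip(single_task, multi_task) if m != s]

    # Indices of the differences, ordered by absolute magnitude.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))

    # Assign 1-based ranks, averaging over runs of tied magnitudes.
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

The two sums always add up to n(n+1)/2, where n is the number of non-zero differences, which makes a quick sanity check on any implementation.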
54 | 144 |
|
55 | 145 | --- |
56 | 146 |
|
57 | | -## Task Overview |
| 147 | +## Models |
| 148 | + |
| 149 | +| Model | Parameters | HuggingFace | |
| 150 | +|-------|------------|-------------| |
| 151 | +| Qwen2.5-Coder-0.5B-Instruct | 0.5B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) | |
| 152 | +| Qwen2.5-Coder-1.5B-Instruct | 1.5B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) | |
| 153 | +| Qwen2.5-Coder-3B-Instruct | 3B | [Link](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) | |
| 154 | + |
| 155 | +--- |
| 156 | + |
| 157 | +## Datasets |
58 | 158 |
|
59 | | -| Task | FFT Script | QLoRA Script | |
60 | | -|----------------|-----------------------------|--------------------------------| |
61 | | -| Summarization | `summarization/fft_train.py` | `summarization/qlora_train.py` | |
62 | | -| Generation | `generation/fft_train.py` | `generation/qlora_train.py` | |
63 | | -| Translation | `translation/fft_train.py` | `translation/qlora_train.py` | |
64 | | -| Multitask | `multitask/fft_train.py` | `multitask/qlora_train.py` | |
| 159 | +| Task | Dataset | Source | |
| 160 | +|------|---------|--------| |
| 161 | +| Code Generation | CodeXGLUE | [Link](https://github.com/microsoft/CodeXGLUE) | |
| 162 | +| Code Summarization | CodeXGLUE Code-to-Text | [Link](https://github.com/microsoft/CodeXGLUE) | |
| 163 | +| Code Translation | CodeXGLUE Java-C# | [Link](https://github.com/microsoft/CodeXGLUE) | |
| 164 | +| Evaluation | CoderEval | [Link](https://github.com/CoderEval/CoderEval) | |
65 | 165 |
|
66 | 166 | --- |
67 | 167 |
|
68 | 168 | ## Notes |
69 | | -- **Batch size** = 2, **Grad accumulation** = 16 (default across all tasks). |
70 | | -- **Validation samples**: recommendation, 5–10% of the validation set. To avoid OOM, capped to **250–400** depending on the task. |
71 | | -- **Evaluation steps** are computed as *steps per epoch*, i.e., dataset size ÷ effective batch size. |
| 169 | + |
| 170 | +- **Batch size**: 2, **Gradient accumulation**: 16 (default across all tasks, effective batch size = 32) |
| 171 | +- **Validation samples**: 5–10% of validation set recommended; capped at 250–400 to avoid OOM |
| 172 | +- **Evaluation steps**: Computed as dataset size ÷ effective batch size (steps per epoch) |
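
The arithmetic behind these defaults can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, a single device is assumed (consistent with 2 × 16 = 32), and rounding a partial final batch up to a full step is an assumption, since the note only says "dataset size ÷ effective batch size".

```python
import math


def effective_batch_size(device_batch_size, gradient_accumulation_steps,
                         num_devices=1):
    """Samples contributing to one optimizer step (single device assumed)."""
    return device_batch_size * gradient_accumulation_steps * num_devices


def steps_per_epoch(dataset_size, effective_batch):
    """Optimizer steps in one epoch; a partial last batch counts as a step
    (rounding up is an assumption of this sketch)."""
    return math.ceil(dataset_size / effective_batch)


if __name__ == "__main__":
    batch = effective_batch_size(2, 16)    # README defaults
    # 10_000 is a hypothetical dataset size, used only for illustration.
    print(batch, steps_per_epoch(10_000, batch))
```

With the default settings and a hypothetical dataset of 10,000 samples, this gives an effective batch size of 32 and 313 steps per epoch.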
| 173 | + |
| 174 | +--- |
| 175 | + |
| 176 | +## License |
| 177 | + |
| 178 | +This project is licensed under the MIT License. |