# Source Code Summarization

LLM-based Python code summarization with AST-aware evaluation.

## Overview

This project fine-tunes small code LLMs (1-3B parameters) via LoRA to generate
docstrings for Python functions and evaluates them using an AST-aware benchmark
that tests structural understanding beyond surface-level text metrics.

### Architecture

```
Seed Dataset (C2NL, 92k examples)
        |
        v
[convert_seed.py] --> HuggingFace Dataset
        |
        v
[expand_with_distilabel.py] --> Expanded Dataset (teacher LLM generates more examples)
        |
        v
[train_lora.py] --> LoRA-adapted Code LLM
        |
        v
[serve.py] --> FastAPI Inference Server (localhost:8000)
        |
        v
VS Code Extension (calls /generate endpoint)
```

Evaluation runs independently via the AST-aware benchmark:

```
Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
                                          |
                                          v
                     Standard (BLEU, ROUGE) + AST-aware metrics
```

## Components

### Data Preparation (`src/data/`)

- **`convert_seed.py`** - Converts the C2NL parallel-file dataset (`code.original` +
  `javadoc.original`) into HuggingFace instruction-tuning format, applying heuristic
  detokenization to make the code readable for LLMs.

- **`expand_with_distilabel.py`** - Uses distilabel to expand the seed dataset by
  sending code to a teacher LLM for higher-quality docstring generation.
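
The detokenization step in `convert_seed.py` can be sketched roughly as follows (an illustrative regex-based version, not the project's exact implementation):

```python
import re

def detokenize(tokenized_code: str) -> str:
    """Heuristically re-join space-separated code tokens into readable Python."""
    code = tokenized_code
    code = re.sub(r"\s*\.\s*", ".", code)   # tighten attribute access
    code = re.sub(r"\s*\(\s*", "(", code)   # no space inside call parens
    code = re.sub(r"\s*\)", ")", code)
    code = re.sub(r"\s*,\s*", ", ", code)   # normalize commas
    code = re.sub(r"\s*:\s*", ": ", code)   # normalize colons
    return code

print(detokenize("def add ( a , b ) : return a + b"))
# -> def add(a, b): return a + b
```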

### Training (`src/training/`)

- **`train_lora.py`** - LoRA fine-tuning using the HuggingFace Trainer and PEFT. Supports
  QLoRA (4-bit quantization) so training fits on one or two A100 GPUs.

- **`serve.py`** - FastAPI inference server that loads the fine-tuned model and
  serves docstring generation via HTTP.
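
Conceptually, the low-rank update that `train_lora.py` learns adds a small trainable term to each frozen weight matrix: `y = Wx + (alpha/r) * B(Ax)`. A toy, dependency-free sketch of that forward pass (illustrative dimensions; the real script delegates all of this to PEFT):

```python
# Toy LoRA forward pass: frozen weight W plus a rank-r update (alpha/r) * B @ A.
# Only A and B would be trained; W stays fixed.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # B @ (A @ x), rank-r bottleneck
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Frozen 2x2 weight, rank-1 adapters A (1x2) and B (2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B = [[1.0], [0.0]]
print(lora_forward(W, A, B, [2.0, 2.0], alpha=2, r=1))  # -> [6.0, 2.0]
```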

### Evaluation (`src/evaluation/`)

- **`benchmark.py`** - Benchmark runner that evaluates docstring quality using both
  standard and AST-aware metrics.

- **`metrics/standard.py`** - BLEU and ROUGE-L wrappers built on HuggingFace `evaluate`.

- **`metrics/ast_aware.py`** - Novel metrics that parse the source code's AST and
  check whether generated docstrings correctly reference identifiers, control-flow
  patterns, and function parameters.
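
The core idea behind the AST-aware metrics can be illustrated with the standard `ast` module (a simplified sketch with a hypothetical `param_coverage` helper; the real metrics also cover identifiers and control flow):

```python
import ast

def param_coverage(source: str, docstring: str) -> float:
    """Fraction of the function's parameters that the docstring mentions."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    params = [arg.arg for arg in func.args.args]
    if not params:
        return 1.0
    mentioned = sum(1 for p in params if p in docstring)
    return mentioned / len(params)

src = "def scale(vector, factor):\n    return [v * factor for v in vector]"
print(param_coverage(src, "Multiply each element of vector by factor."))  # -> 1.0
print(param_coverage(src, "Scale the input."))                            # -> 0.0
```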

### AST Utilities (`src/ast_utils/`)

Migrated from the original Python150k preprocessing pipeline:

- **`parse_python3.py`** - Converts Python source code to a JSON AST representation.
- **`ast_conversion.py`** - Transforms the AST with value-node splitting and DFS traversal.
- **`processor_ast.py`** - Text preprocessing for code, comments, and docstrings.
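
The kind of conversion `parse_python3.py` performs can be approximated with the standard library (a sketch; the project's actual node schema differs):

```python
import ast
import json

def to_json_ast(source: str) -> str:
    """Serialize a Python module's AST as JSON: node type plus children."""
    def convert(node):
        entry = {"type": type(node).__name__}
        children = [convert(c) for c in ast.iter_child_nodes(node)]
        if children:
            entry["children"] = children
        # Keep leaf values (identifiers, constants) so the tree stays informative.
        if isinstance(node, ast.Name):
            entry["value"] = node.id
        elif isinstance(node, ast.Constant):
            entry["value"] = repr(node.value)
        return entry

    return json.dumps(convert(ast.parse(source)))

tree = json.loads(to_json_ast("x = 1"))
print(tree["type"])                 # -> Module
print(tree["children"][0]["type"])  # -> Assign
```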

## Quick Start

```bash
# Install dependencies
pip install -e ".[dev]"

# Convert to HuggingFace format (requires dataset access, see below)
python -m src.data.convert_seed \
    --input-dir data/raw/python-method \
    --output-dir data/processed/python-method
```

## Dataset

The seed dataset comes from the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
project (ACL 2020): 92,545 Python function-docstring pairs split into train/dev/test.

### Dataset Access

The python-method dataset was previously available via a Google Drive download script
(`data/raw/python-method/get_data.sh`). That script has been removed because the Google Drive
link (file ID: `1XPE1txk9VI0aOT_TdqbAeI58Q8puKVl2`) is no longer accessible.

To obtain the dataset, you can:

1. Contact the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum) authors
2. Download from the original source, if it is still available at the project repository
3. Use the alternative Python150k dataset from the [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)

## Acknowledgments

- Original C2NL dataset: [A Transformer-based Approach for Source Code Summarization](https://arxiv.org/abs/2005.00653)
- Python150k dataset: [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)
- Tree Transformer: [nxphi47/tree_transformer](https://github.com/nxphi47/tree_transformer)