
Commit 9302246

Merge pull request #7 from martysai/pedantic-bhaskara
Pivot to LLM-based code summarization with AST-aware evaluation
2 parents bea1037 + cf29f1b commit 9302246

File tree

2,393 files changed: +642 −285434 lines (large commit; most file contents hidden by default)

.gitignore

Lines changed: 29 additions & 23 deletions
@@ -1,25 +1,31 @@
-python150k/*tar*
-python150k/*txt
-python150k/data/
-python150k/data/*
-.vscode
-.vscode/*
-transcoder/model_cpp_python.pth
-python150k/sc*
-tree_transformer_info/*
-raw_code_data/*
-code_data_fairseq/*
-tree_transformer/__pycache__
-train_tree_transformer/*
-notebooks/.ipynb_checkpoints
-notebooks/__pycache__
-notebooks/new_files
-python150k/process_data_nadya.py
+# Data artifacts
+data/processed/
+data/expanded/
+data/raw/python-method/train/
+data/raw/python-method/dev/
+data/raw/python-method/test/
+data/raw/python-method/*.zip
 
+# Training outputs
+outputs/
+wandb/
+*.pt
+*.bin
+*.safetensors
 
-# Test data
-python150k/examples/
-python150k/examples/*
-notebooks/0xcite/fingerping/
-notebooks/0xcite/fingerping/*
-notebooks/varmisuse_example.ipynb
+# Python
+__pycache__/
+*.egg-info/
+dist/
+build/
+.ruff_cache/
+.venv/
+
+# IDE
+.vscode/
+
+# Jupyter
+notebooks/.ipynb_checkpoints/
+
+# Environment
+.env

README.md

Lines changed: 101 additions & 15 deletions
@@ -1,24 +1,110 @@
 # Source Code Summarization
 
-Currently observed approaches:
+LLM-based Python code summarization with AST-aware evaluation.
 
-| Method | Source | Paper |
-| :---: | :---: | :---: |
-| Neural Code Sum | [repo](https://github.com/wasiahmad/NeuralCodeSum) | [arxiv](https://arxiv.org/abs/2005.00653) |
-| Tree Transformer | [repo](https://github.com/nxphi47/tree_transformer) | [openreview](https://openreview.net/forum?id=HJxK5pEYvr) |
-| TransCoder | [repo](https://github.com/facebookresearch/TransCoder) | [arxiv](https://arxiv.org/abs/2006.03511) |
+## Overview
+
+This project fine-tunes small code LLMs (1-3B parameters) via LoRA to generate
+docstrings for Python functions, and evaluates them using an AST-aware benchmark
+that tests structural understanding beyond surface-level text metrics.
+
+### Architecture
 
-Environment setup:
-```
-conda create -n scs python=3.7
-conda activate scs
-pip install -r requirements.txt
 ```
-Install linter with:
+Seed Dataset (C2NL, 92k examples)
+        |
+        v
+[convert_seed.py] --> HuggingFace Dataset
+        |
+        v
+[expand_with_distilabel.py] --> Expanded Dataset (teacher LLM generates more examples)
+        |
+        v
+[train_lora.py] --> LoRA-adapted Code LLM
+        |
+        v
+[serve.py] --> FastAPI Inference Server (localhost:8000)
+        |
+        v
+VS Code Extension (calls /generate endpoint)
 ```
-pip install flake8
+
+Evaluation runs independently via the AST-aware benchmark:
+
 ```
-To run formatter execute from the source folder:
+Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
+                                          |
+                     Standard (BLEU, ROUGE) + AST-aware metrics
 ```
-bash scripts/yapf.sh
+
+## Components
+
+### Data Preparation (`src/data/`)
+
+- **`convert_seed.py`** - Converts the C2NL parallel-file dataset (code.original +
+  javadoc.original) into HuggingFace instruction-tuning format. Applies heuristic
+  detokenization to make code readable for LLMs.
+
+- **`expand_with_distilabel.py`** - Uses distilabel to expand the seed dataset by
+  sending code to a teacher LLM for higher-quality docstring generation.
+
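The heuristic detokenization mentioned for `convert_seed.py` could look like the following minimal sketch. The function name and the specific rules are illustrative only, not the actual implementation: the idea is just to rejoin space-separated code tokens into readable source.

```python
import re

def detokenize(tokens: str) -> str:
    """Heuristically join space-separated code tokens into readable code.

    Illustrative sketch only; the real convert_seed.py rules may differ.
    """
    code = tokens
    # Drop spaces before closing brackets and punctuation.
    code = re.sub(r"\s+([)\].,:;])", r"\1", code)
    # Drop spaces after opening brackets.
    code = re.sub(r"([(\[])\s+", r"\1", code)
    # No space between a name and its call parenthesis (a rough heuristic).
    code = re.sub(r"(\w)\s+\(", r"\1(", code)
    return code

print(detokenize("def foo ( x , y ) :"))  # -> def foo(x, y):
```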
+### Training (`src/training/`)
+
+- **`train_lora.py`** - LoRA fine-tuning using HuggingFace Trainer + PEFT. Supports
+  QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.
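The LoRA idea behind `train_lora.py` can be shown with a self-contained sketch (this is not the project's code, which presumably goes through PEFT; here the low-rank update is written out by hand to make the mechanics visible):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is small-random, B is zero-initialized, so training starts
        # from the unmodified pretrained behavior.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only `A` and `B` (a few percent of the base layer's parameters at typical ranks) receive gradients, which is what makes 1-3B-parameter models trainable on one or two GPUs.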
+
+- **`serve.py`** - FastAPI inference server that loads the fine-tuned model and
+  serves docstring generation via HTTP.
+
+### Evaluation (`src/evaluation/`)
+
+- **`benchmark.py`** - Benchmark runner that evaluates docstring quality using both
+  standard and AST-aware metrics.
+
+- **`metrics/standard.py`** - BLEU and ROUGE-L wrappers via HuggingFace evaluate.
+
+- **`metrics/ast_aware.py`** - Novel metrics that parse the source code's AST and
+  check whether generated docstrings correctly reference identifiers, control-flow
+  patterns, and function parameters.
+
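One AST-aware metric of the kind `metrics/ast_aware.py` describes could be sketched like this (the function name and scoring rule are hypothetical): parse the function, collect its parameter names, and measure what fraction the generated docstring mentions.

```python
import ast

def parameter_coverage(source: str, docstring: str) -> float:
    """Fraction of a function's parameter names mentioned in the docstring.

    Illustrative sketch; the actual ast_aware.py metrics may differ.
    """
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = [a.arg for a in func.args.args if a.arg not in ("self", "cls")]
    if not params:
        return 1.0  # nothing to mention
    text = docstring.lower()
    mentioned = sum(1 for p in params if p.lower() in text)
    return mentioned / len(params)
```

Unlike BLEU or ROUGE, a score like this is insensitive to paraphrase but directly penalizes a summary that ignores the function's actual interface.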
+### AST Utilities (`src/ast_utils/`)
+
+Migrated from the original Python150k preprocessing pipeline:
+
+- **`parse_python3.py`** - Converts Python source code to a JSON AST representation.
+- **`ast_conversion.py`** - Transforms AST with value-node splitting and DFS traversal.
+- **`processor_ast.py`** - Text preprocessing for code, comments, and docstrings.
+
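A source-to-JSON-AST conversion in the spirit of `parse_python3.py` can be sketched with the standard library (the actual output schema of `parse_python3.py` may differ):

```python
import ast
import json

def to_json_ast(node: ast.AST) -> dict:
    """Recursively convert an AST node into a JSON-serializable dict."""
    entry = {"type": type(node).__name__}
    # Keep leaf values (identifiers, constants) as a "value" field.
    if isinstance(node, ast.Name):
        entry["value"] = node.id
    elif isinstance(node, ast.Constant):
        entry["value"] = repr(node.value)
    children = [to_json_ast(c) for c in ast.iter_child_nodes(node)]
    if children:
        entry["children"] = children
    return entry

tree = to_json_ast(ast.parse("x = 1"))
print(json.dumps(tree, indent=2))
```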
+## Quick Start
+
+```bash
+# Install dependencies
+pip install -e ".[dev]"
+
+# Convert to HuggingFace format (requires dataset access, see below)
+python -m src.data.convert_seed \
+    --input-dir data/raw/python-method \
+    --output-dir data/processed/python-method
 ```
+
+## Dataset
+
+The seed dataset comes from the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
+project (ACL 2020): 92,545 Python function-docstring pairs split into train/dev/test.
+
+### Dataset Access
+
+The python-method dataset was previously available via a Google Drive download script
+(`data/raw/python-method/get_data.sh`). This script has been removed as the Google Drive
+link (file ID: `1XPE1txk9VI0aOT_TdqbAeI58Q8puKVl2`) is no longer accessible.
+
+To obtain the dataset, you can:
+1. Contact the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum) authors
+2. Download from the original source if available at the project repository
+3. Use the alternative python150k dataset from [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)
+
+## Acknowledgments
+
+- Original C2NL dataset: [A Transformer-based Approach for Source Code Summarization](https://arxiv.org/abs/2005.00653)
+- Python150k dataset: [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)
+- Tree Transformer: [nxphi47/tree_transformer](https://github.com/nxphi47/tree_transformer)

data/raw/python-method/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Python Method Dataset
+
+Source: [A Transformer-based Approach for Source Code Summarization](https://arxiv.org/abs/2005.00653) (ACL 2020)
+
+Original repository: [wasiahmad/NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
+
+## Dataset Statistics
+
+| Split | Examples |
+|-------|----------|
+| Train | 55,538 |
+| Dev   | 18,505 |
+| Test  | 18,502 |
+| Total | 92,545 |
+
+## Format
+
+Each split contains parallel files:
+- `code.original` - Space-separated code tokens (one function per line)
+- `code.original_subtoken` - Subtoken-split version (camelCase aware)
+- `javadoc.original` - Space-separated summary tokens (one docstring per line)
+
+## Download
+
+The dataset was previously downloaded and extracted via `get_data.sh` from Google Drive; that script has been removed and the link is no longer accessible (see Dataset Access in the top-level README for alternatives).
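The camelCase-aware subtokenization that produces `code.original_subtoken` from `code.original` could be illustrated as follows (a hypothetical sketch, not the dataset's original preprocessing script):

```python
import re

def split_subtokens(token: str) -> list[str]:
    """Split a code token on underscores and lowercase-to-uppercase boundaries."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p for p in parts if p]

print(split_subtokens("getUserName"))  # -> ['get', 'User', 'Name']
print(split_subtokens("max_len"))      # -> ['max', 'len']
```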

neural_code_sum/LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

neural_code_sum/README.md

Lines changed: 0 additions & 109 deletions
This file was deleted.
