
Commit 9302246

Merge pull request #7 from martysai/pedantic-bhaskara
Pivot to LLM-based code summarization with AST-aware evaluation
2 parents bea1037 + cf29f1b commit 9302246

File tree

2,393 files changed: +642 −285434 lines (large commit; most file contents hidden by default)

.gitignore

Lines changed: 29 additions & 23 deletions
@@ -1,25 +1,31 @@
-python150k/*tar*
-python150k/*txt
-python150k/data/
-python150k/data/*
-.vscode
-.vscode/*
-transcoder/model_cpp_python.pth
-python150k/sc*
-tree_transformer_info/*
-raw_code_data/*
-code_data_fairseq/*
-tree_transformer/__pycache__
-train_tree_transformer/*
-notebooks/.ipynb_checkpoints
-notebooks/__pycache__
-notebooks/new_files
-python150k/process_data_nadya.py
+# Data artifacts
+data/processed/
+data/expanded/
+data/raw/python-method/train/
+data/raw/python-method/dev/
+data/raw/python-method/test/
+data/raw/python-method/*.zip
 
+# Training outputs
+outputs/
+wandb/
+*.pt
+*.bin
+*.safetensors
 
-# Test data
-python150k/examples/
-python150k/examples/*
-notebooks/0xcite/fingerping/
-notebooks/0xcite/fingerping/*
-notebooks/varmisuse_example.ipynb
+# Python
+__pycache__/
+*.egg-info/
+dist/
+build/
+.ruff_cache/
+.venv/
+
+# IDE
+.vscode/
+
+# Jupyter
+notebooks/.ipynb_checkpoints/
+
+# Environment
+.env

README.md

Lines changed: 101 additions & 15 deletions
@@ -1,24 +1,110 @@
 # Source Code Summarization
 
-Currently observed approaches:
+LLM-based Python code summarization with AST-aware evaluation.
 
-| Method | Source | Paper |
-| :---: | :---: | :---: |
-| Neural Code Sum | [repo](https://github.com/wasiahmad/NeuralCodeSum) | [arxiv](https://arxiv.org/abs/2005.00653) |
-| Tree Transformer | [repo](https://github.com/nxphi47/tree_transformer) | [openreview](https://openreview.net/forum?id=HJxK5pEYvr) |
-| TransCoder | [repo](https://github.com/facebookresearch/TransCoder) | [arxiv](https://arxiv.org/abs/2006.03511) |
+## Overview
+
+This project fine-tunes small code LLMs (1-3B parameters) via LoRA to generate
+docstrings for Python functions, and evaluates them using an AST-aware benchmark
+that tests structural understanding beyond surface-level text metrics.
+
+### Architecture
 
-Environment setup:
-```
-conda create -n scs python=3.7
-conda activate scs
-pip install -r requirements.txt
 ```
-Install linter with:
+Seed Dataset (C2NL, 92k examples)
+        |
+        v
+[convert_seed.py] --> HuggingFace Dataset
+        |
+        v
+[expand_with_distilabel.py] --> Expanded Dataset (teacher LLM generates more examples)
+        |
+        v
+[train_lora.py] --> LoRA-adapted Code LLM
+        |
+        v
+[serve.py] --> FastAPI Inference Server (localhost:8000)
+        |
+        v
+VS Code Extension (calls /generate endpoint)
 ```
-pip install flake8
+
+Evaluation runs independently via the AST-aware benchmark:
+
 ```
-To run formatter execute from the source folder:
+Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
+                                          |
+                     Standard (BLEU, ROUGE) + AST-aware metrics
 ```
-bash scripts/yapf.sh
+
+## Components
+
+### Data Preparation (`src/data/`)
+
+- **`convert_seed.py`** - Converts the C2NL parallel-file dataset (code.original +
+  javadoc.original) into HuggingFace instruction-tuning format. Applies heuristic
+  detokenization to make code readable for LLMs.
+
+- **`expand_with_distilabel.py`** - Uses distilabel to expand the seed dataset by
+  sending code to a teacher LLM for higher-quality docstring generation.
+
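The heuristic detokenization mentioned for `convert_seed.py` could look like the following minimal sketch. The function name and the specific rules are illustrative only, not the actual implementation: the idea is just to rejoin space-separated code tokens into readable source.

```python
import re

def detokenize(tokens: str) -> str:
    """Heuristically join space-separated code tokens into readable code.

    Illustrative sketch only; the real convert_seed.py rules may differ.
    """
    code = tokens
    # Drop spaces before closing brackets and punctuation.
    code = re.sub(r"\s+([)\].,:;])", r"\1", code)
    # Drop spaces after opening brackets.
    code = re.sub(r"([(\[])\s+", r"\1", code)
    # No space between a name and its call parenthesis (a rough heuristic).
    code = re.sub(r"(\w)\s+\(", r"\1(", code)
    return code

print(detokenize("def foo ( x , y ) :"))  # -> def foo(x, y):
```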
+### Training (`src/training/`)
+
+- **`train_lora.py`** - LoRA fine-tuning using HuggingFace Trainer + PEFT. Supports
+  QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.
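The LoRA idea behind `train_lora.py` can be shown with a self-contained sketch (this is not the project's code, which presumably goes through PEFT; here the low-rank update is written out by hand to make the mechanics visible):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is small-random, B is zero-initialized, so training starts
        # from the unmodified pretrained behavior.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only `A` and `B` (a few percent of the base layer's parameters at typical ranks) receive gradients, which is what makes 1-3B-parameter models trainable on one or two GPUs.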
+
+- **`serve.py`** - FastAPI inference server that loads the fine-tuned model and
+  serves docstring generation via HTTP.
+
+### Evaluation (`src/evaluation/`)
+
+- **`benchmark.py`** - Benchmark runner that evaluates docstring quality using both
+  standard and AST-aware metrics.
+
+- **`metrics/standard.py`** - BLEU and ROUGE-L wrappers via HuggingFace evaluate.
+
+- **`metrics/ast_aware.py`** - Novel metrics that parse the source code's AST and
+  check whether generated docstrings correctly reference identifiers, control-flow
+  patterns, and function parameters.
+
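One AST-aware metric of the kind `metrics/ast_aware.py` describes could be sketched like this (the function name and scoring rule are hypothetical): parse the function, collect its parameter names, and measure what fraction the generated docstring mentions.

```python
import ast

def parameter_coverage(source: str, docstring: str) -> float:
    """Fraction of a function's parameter names mentioned in the docstring.

    Illustrative sketch; the actual ast_aware.py metrics may differ.
    """
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = [a.arg for a in func.args.args if a.arg not in ("self", "cls")]
    if not params:
        return 1.0  # nothing to mention
    text = docstring.lower()
    mentioned = sum(1 for p in params if p.lower() in text)
    return mentioned / len(params)
```

Unlike BLEU or ROUGE, a score like this is insensitive to paraphrase but directly penalizes a summary that ignores the function's actual interface.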
+### AST Utilities (`src/ast_utils/`)
+
+Migrated from the original Python150k preprocessing pipeline:
+
+- **`parse_python3.py`** - Converts Python source code to a JSON AST representation.
+- **`ast_conversion.py`** - Transforms AST with value-node splitting and DFS traversal.
+- **`processor_ast.py`** - Text preprocessing for code, comments, and docstrings.
+
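A source-to-JSON-AST conversion in the spirit of `parse_python3.py` can be sketched with the standard library (the actual output schema of `parse_python3.py` may differ):

```python
import ast
import json

def to_json_ast(node: ast.AST) -> dict:
    """Recursively convert an AST node into a JSON-serializable dict."""
    entry = {"type": type(node).__name__}
    # Keep leaf values (identifiers, constants) as a "value" field.
    if isinstance(node, ast.Name):
        entry["value"] = node.id
    elif isinstance(node, ast.Constant):
        entry["value"] = repr(node.value)
    children = [to_json_ast(c) for c in ast.iter_child_nodes(node)]
    if children:
        entry["children"] = children
    return entry

tree = to_json_ast(ast.parse("x = 1"))
print(json.dumps(tree, indent=2))
```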
+## Quick Start
+
+```bash
+# Install dependencies
+pip install -e ".[dev]"
+
+# Convert to HuggingFace format (requires dataset access, see below)
+python -m src.data.convert_seed \
+    --input-dir data/raw/python-method \
+    --output-dir data/processed/python-method
 ```
+
+## Dataset
+
+The seed dataset comes from the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
+project (ACL 2020): 92,545 Python function-docstring pairs split into train/dev/test.
+
+### Dataset Access
+
+The python-method dataset was previously available via a Google Drive download script
+(`data/raw/python-method/get_data.sh`). This script has been removed as the Google Drive
+link (file ID: `1XPE1txk9VI0aOT_TdqbAeI58Q8puKVl2`) is no longer accessible.
+
+To obtain the dataset, you can:
+1. Contact the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum) authors
+2. Download from the original source if available at the project repository
+3. Use the alternative python150k dataset from [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)
+
+## Acknowledgments
+
+- Original C2NL dataset: [A Transformer-based Approach for Source Code Summarization](https://arxiv.org/abs/2005.00653)
+- Python150k dataset: [ETH Zurich SRI Lab](https://www.sri.inf.ethz.ch/py150)
+- Tree Transformer: [nxphi47/tree_transformer](https://github.com/nxphi47/tree_transformer)

data/raw/python-method/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Python Method Dataset
+
+Source: [A Transformer-based Approach for Source Code Summarization](https://arxiv.org/abs/2005.00653) (ACL 2020)
+
+Original repository: [wasiahmad/NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
+
+## Dataset Statistics
+
+| Split | Examples |
+|-------|----------|
+| Train | 55,538 |
+| Dev   | 18,505 |
+| Test  | 18,502 |
+| Total | 92,545 |
+
+## Format
+
+Each split contains parallel files:
+- `code.original` - Space-separated code tokens (one function per line)
+- `code.original_subtoken` - Subtoken-split version (camelCase aware)
+- `javadoc.original` - Space-separated summary tokens (one docstring per line)
+
+## Download
+
+The dataset was previously downloaded and extracted via `get_data.sh` from Google Drive; that script has been removed and the link is no longer accessible (see Dataset Access in the top-level README for alternatives).
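The camelCase-aware subtokenization that produces `code.original_subtoken` from `code.original` could be illustrated as follows (a hypothetical sketch, not the dataset's original preprocessing script):

```python
import re

def split_subtokens(token: str) -> list[str]:
    """Split a code token on underscores and lowercase-to-uppercase boundaries."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p for p in parts if p]

print(split_subtokens("getUserName"))  # -> ['get', 'User', 'Name']
print(split_subtokens("max_len"))      # -> ['max', 'len']
```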

neural_code_sum/LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

neural_code_sum/README.md

Lines changed: 0 additions & 109 deletions
This file was deleted.
