Code Summarization

This folder contains scripts for training, inference, and evaluation of code summarization models.

Folder Structure

code-summarization/
├── fft_train.py                 # Full fine-tuning training script
├── qlora_train.py               # QLoRA training script
├── codereval/
│   ├── infer_summarization_fft.py       # Inference for FFT models
│   ├── infer_summarization_qlora.py     # Inference for QLoRA models
│   ├── evaluate_summarization_metrics.py # Compute BLEU, ROUGE, METEOR, etc.
│   ├── evaluate_summarization_llm_judge.py  # LLM-as-judge evaluation (GPT-5-mini)
│   ├── aggregate_llm_judge_scores.py    # Aggregate LLM judge scores using mean
│   ├── cs_codereval_eval_dataset_java_v2.jsonl  # CoderEval Java benchmark
│   └── cs_codereval_eval_dataset_py_v2.jsonl    # CoderEval Python benchmark
└── dataset/
    └── code_x_glue_ct_code_to_text/
        ├── java/
        └── python/

Step 1: Training

Full Fine-Tuning (FFT)

python fft_train.py \
    --base_model_name Qwen/Qwen2.5-Coder-0.5B-Instruct \
    --device_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --sample_size -1 \
    --val_sample_size -1 \
    --eval_samples 500 \
    --num_train_epochs 5

QLoRA

python qlora_train.py \
    --base_model_name Qwen/Qwen2.5-Coder-0.5B-Instruct \
    --device_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --sample_size -1 \
    --val_sample_size -1 \
    --eval_samples 500 \
    --num_train_epochs 5

Arguments:

--base_model_name: Base model from HuggingFace
--device_batch_size: Per-device batch size (default: 2)
--gradient_accumulation_steps: Gradient accumulation steps (default: 16)
--sample_size: Training samples per language (-1 for full dataset)
--val_sample_size: Validation samples per language (-1 for full dataset)
--eval_samples: Samples for evaluation during training
--num_train_epochs: Number of training epochs (default: 5)
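
The batch size and accumulation arguments combine into the number of samples that contribute to each optimizer step. A minimal sketch of that arithmetic follows; the function name and the `num_devices` parameter are hypothetical and are not taken from the training scripts.

```python
def effective_batch_size(device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step.

    num_devices is an assumption for illustration; the training scripts
    may handle multi-GPU setups differently.
    """
    return device_batch_size * gradient_accumulation_steps * num_devices


# Values from the example commands, assuming a single device.
print(effective_batch_size(2, 16))  # 32
```

If memory is tight, halving --device_batch_size while doubling --gradient_accumulation_steps keeps this product unchanged.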

Step 2: Inference on CoderEval

FFT Models

python codereval/infer_summarization_fft.py \
    --model_path path/to/model \
    --input_jsonl codereval/cs_codereval_eval_dataset_java_v2.jsonl \
    --output_file path/to/output.csv \
    --language java \
    --batch_size 8 \
    --max_new_tokens 128

QLoRA Models

python codereval/infer_summarization_qlora.py \
    --model_path path/to/qlora_adapter \
    --input_jsonl codereval/cs_codereval_eval_dataset_java_v2.jsonl \
    --output_file path/to/output.csv \
    --language java \
    --batch_size 8 \
    --max_new_tokens 128

Arguments:

--model_path: Path to trained model checkpoint
--input_jsonl: Path to CoderEval benchmark file
--output_file: Path to save predictions
--language: Programming language (java or python)
--batch_size: Batch size for inference
--max_new_tokens: Maximum tokens to generate
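
To illustrate how --input_jsonl and --batch_size relate, here is a rough sketch of reading JSONL records and grouping them into batches. It is not the code of the inference scripts; the helper names and the `id` field are hypothetical.

```python
import json


def iter_jsonl(lines):
    """Yield one dict per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# In-memory stand-in for a benchmark file; the "id" field is hypothetical.
sample_lines = ['{"id": 1}\n', '{"id": 2}\n', '{"id": 3}\n']
records = list(iter_jsonl(sample_lines))
for batch in batched(records, batch_size=2):
    print(len(batch))  # prints 2, then 1
```

For a real file, `iter_jsonl(open(path, encoding="utf-8"))` would read it line by line.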

Step 3: Evaluate with Reference-Based Metrics

Compute BLEU, METEOR, ROUGE, chrF, BERTScore, and SIDE (Java only):

python codereval/evaluate_summarization_metrics.py \
    --dataset_path path/to/predictions.jsonl \
    --language java \
    --summary_field generated_summary \
    --output_file path/to/results

Arguments:

--dataset_path: Path to JSONL file with predictions
--language: Programming language (java or python)
--summary_field: Field name containing generated summaries
--output_file: Output path for results (without extension)
--side_checkpoint: Path to SIDE model checkpoint (Java only, default: path/to/SIDE/checkpoint)

Note on SIDE Score: SIDE (Summary alIgnment to coDe sEmantics) is computed only for Java. Download the SIDE model checkpoint from the SIDE repository and pass its path via --side_checkpoint.
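
The inference examples in Step 2 write a .csv file, while --dataset_path here expects a .jsonl file. It is not stated whether the inference scripts also produce JSONL. If a conversion turns out to be necessary, it could look roughly like the sketch below; the function name and the column names in the demonstration are hypothetical.

```python
import csv
import io
import json


def csv_to_jsonl(csv_file, jsonl_file) -> int:
    """Copy rows from an open CSV file to an open JSONL file.

    Each CSV row becomes one JSON object keyed by the header row.
    All values stay strings, as csv.DictReader returns them.
    Returns the number of rows written.
    """
    count = 0
    for row in csv.DictReader(csv_file):
        jsonl_file.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# In-memory demonstration; the column names are hypothetical.
src = io.StringIO("id,generated_summary\n1,Returns the sum of two numbers.\n")
dst = io.StringIO()
csv_to_jsonl(src, dst)
print(dst.getvalue())
```

With real files, open the CSV with `newline=""` and both files with `encoding="utf-8"`.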

Output:

<output_file>.txt: Human-readable results
<output_file>.json: Machine-readable results

Step 4: Evaluate with LLM-as-Judge

This step uses GPT-5-mini to rate each summary on Content Adequacy, Conciseness, and Fluency, each on a 1-5 scale.

Edit configuration variables in evaluate_summarization_llm_judge.py:

INPUT_FILE = "path/to/predictions.jsonl"
OUTPUT_FOLDER = "path/to/output_folder"
LANGUAGE = "java"  # or "python"
SUMMARY_FIELD = "generated_summary"
MODEL_NAME = "model_name"
NUM_RUNS = 5

Set OpenAI API key:

export OPENAI_API_KEY="your-api-key"
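
A missing key usually surfaces only when the first API call fails. As an optional pre-flight check, something like the following could be run first; `require_api_key` is a hypothetical helper and is not part of the judge script.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    env defaults to os.environ; passing a plain dict makes the check
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the judge script.")
    return key
```

Calling `require_api_key()` with no arguments checks the current shell environment.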

Run:

python codereval/evaluate_summarization_llm_judge.py

Output:

Individual run CSVs: <output_folder>/<name>_1.csv, <name>_2.csv, etc.
Merged results: <output_folder>/<name>_FINAL_MERGED.csv

Step 5: Aggregate LLM Judge Scores (Mean)

Recalculate the final scores using the mean across runs instead of voting.

Edit configuration variables in aggregate_llm_judge_scores.py:

INPUT_FOLDER = "path/to/llm_judge_results"
OUTPUT_FILE_NAME = "FINAL_MERGED_MEAN.csv"
NUM_RUNS = 5

Run:

python codereval/aggregate_llm_judge_scores.py
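
To make the difference between the two aggregation strategies concrete, here is a small sketch. It assumes that voting means taking the most frequent score across runs and that ties go to the lowest score; the actual scripts may define voting and tie-breaking differently.

```python
from collections import Counter


def aggregate_by_mean(scores):
    """Arithmetic mean of the per-run scores."""
    return sum(scores) / len(scores)


def aggregate_by_vote(scores):
    """Most frequent score; ties go to the lowest score (an arbitrary choice)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)


# Hypothetical scores for one summary over NUM_RUNS = 5 judge runs.
runs = [5, 5, 3, 3, 3]
print(aggregate_by_vote(runs))  # 3
print(aggregate_by_mean(runs))  # 3.8
```

The mean retains the information that two of the five runs gave a higher rating, which a vote discards.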

Complete Pipeline Example

# 1. Train model
python fft_train.py --base_model_name Qwen/Qwen2.5-Coder-1.5B-Instruct

# 2. Run inference
python codereval/infer_summarization_fft.py \
    --model_path results/Qwen2.5-Coder-1.5B-Instruct_summarization_fft \
    --input_jsonl codereval/cs_codereval_eval_dataset_java_v2.jsonl \
    --output_file results/java_predictions.csv \
    --language java

# 3. Evaluate with reference-based metrics
python codereval/evaluate_summarization_metrics.py \
    --dataset_path results/java_predictions.jsonl \
    --language java \
    --summary_field generated_summary \
    --output_file results/java_metrics

# 4. Evaluate with LLM-as-judge (edit config in script first)
python codereval/evaluate_summarization_llm_judge.py

# 5. Aggregate LLM judge scores (edit config in script first)
python codereval/aggregate_llm_judge_scores.py

Metrics

Reference-Based Metrics

BLEU: N-gram precision against the reference, with a brevity penalty
METEOR: Unigram matching that also accounts for stems and synonyms
ROUGE-1/2/L: Recall-oriented unigram, bigram, and longest-common-subsequence overlap
chrF: Character n-gram F-score
BERTScore: Contextual embedding similarity
SIDE: Semantic similarity for code summaries (Java only)
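
As an illustration of what a reference-based metric computes, below is a minimal ROUGE-L F1 sketch in plain Python. It assumes lowercased whitespace tokenization and is not the implementation used by evaluate_summarization_metrics.py, which may rely on a library with different tokenization and stemming.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 using lowercased whitespace tokens (a simplification)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and generated summary.
score = rouge_l_f1("returns the sum of two numbers", "returns sum of numbers")
print(round(score, 3))  # 0.8
```

Here the candidate's four tokens all appear in order in the six-token reference, so precision is 1.0, recall is 4/6, and F1 is 0.8.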

LLM-as-Judge Metrics

Content Adequacy: How well the summary captures code functionality
Conciseness: Absence of unnecessary information
Fluency: Readability and clarity
