Evaluating Small-Scale Code Models for Code Clone Detection

TL;DR: This repository is a reproducible benchmark suite for the paper Evaluating Small-Scale Code Models for Code Clone Detection. It trains and evaluates a growing registry of compact code models across clone-detection datasets, while capturing the artifacts needed for a rigorous journal submission.

Abstract

Code clone detection is a critical task for software maintenance, plagiarism detection, and refactoring. While large language models have shown promise, their computational cost is prohibitive for many real-time or resource-constrained environments.

This work evaluates transformer-based code models with up to 220M parameters for binary clone-pair classification. The repository provides shared loaders, metrics, training utilities, reproducibility manifests, confidence intervals, and result summarization scripts so that model comparisons can be audited and rerun.

Key Contributions

Systematic comparison of compact code models on established clone-detection benchmarks.
Unified train -> evaluate -> report workflow for all model/dataset pairs.
Public benchmark scripts and a registry runner for classic and expanded benchmarks.
Auditable outputs: file hashes, split/leakage diagnostics, stable example/pair identities, environment metadata, and confidence intervals.
Statistical support for bootstrap intervals, paired bootstrap differences, McNemar tests, and calibration/error-profile metrics.
Lightweight shared library (small_code_models/) for adding models with minimal code.

Model Registry

Model	Parameters	Architecture	Hugging Face Hub ID
CodeBERT	125M	Encoder-only	`microsoft/codebert-base`
GraphCodeBERT	125M	Encoder-only with data-flow pretraining	`microsoft/graphcodebert-base`
PLBART	140M	Encoder-decoder	`uclanlp/plbart-base`
PolyCoder	160M	Decoder-only	`NinedayWang/PolyCoder-160M`
UniXCoder	~200M	Unified encoder-decoder	`microsoft/unixcoder-base`
CodeT5	220M	Encoder-decoder	`Salesforce/codet5-base`
CodeT5 Small	60M	Encoder-decoder	`Salesforce/codet5-small`
CodeT5+ 220M	220M	Encoder-decoder	`Salesforce/codet5p-220m`
CodeGPT Small Python	124M	Decoder-only	`microsoft/CodeGPT-small-py`
CodeGPT Small Java	124M	Decoder-only	`microsoft/CodeGPT-small-java`
CodeBERTa Small	84M	Encoder-only	`huggingface/CodeBERTa-small-v1`
CoTexT 1-CC	220M	Encoder-decoder	`razent/cotext-1-cc`
CoTexT 2-CC	220M	Encoder-decoder	`razent/cotext-2-cc`

SynCoBERT and Code-MVP are also registered, but only as local-checkpoint baselines: the registry does not assume that a stable public Hugging Face sequence-classification checkpoint exists for either model.

Benchmark Registry

Benchmark	Type	Expected local layout
BigCloneBench	Monolingual clone detection	`pair_jsonl`
POJ-104	Program similarity / retrieval	`pair_jsonl`
GCJ	Problem-solution clone detection	`pair_jsonl`
Karnalim	Educational clone detection	`pair_jsonl`
PoolC	Educational clone detection	`pair_jsonl`
CodeXGLUE BCB	Official CodeXGLUE clone detection	`pair_jsonl`
CodeXGLUE POJ-104	Official CodeXGLUE clone retrieval	`pair_jsonl`
Project CodeNet	Large-scale code similarity	problem directories or `pair_jsonl`
SemanticCloneBench	Semantic clone detection	`pair_jsonl`
GPTCloneBench	Semantic and cross-language clone detection	`pair_jsonl`
CLCDSA	Cross-language clone detection	problem directories or `pair_jsonl`
Robustness Suite	Transformation stress test	derived `pair_jsonl`

Quick Start

End-To-End Automation

For a Windows end-to-end run:

run_everything.bat

For Linux, macOS, WSL, Git Bash, or Colab-style shells:

chmod +x run_everything.sh
./run_everything.sh

Both scripts install dependencies, download the automatically retrievable datasets into datasets/, normalize supported local raw datasets, write dataset diagnostics, run the benchmark matrix, summarize results, and compute pairwise comparisons wherever predictions exist. SAMPLE_PCT is a percentage, not a fraction: the default SAMPLE_PCT=1.0 runs each model on a deterministic 1% subsample, and SAMPLE_PCT=100.0 runs on the full data. The scripts auto-detect .venv, the bundled Codex Python runtime, py -3, or python; set PYTHON_CMD only when you want to force a specific interpreter.
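
The percentage convention is easy to misread, so here is a minimal sketch of a seeded subsample in which 1.0 means 1% of the rows. The helper name `subsample` and its rounding rule are illustrative assumptions, not the repository's implementation.

```python
import random


def subsample(rows, sample_pct, seed=42):
    """Return a reproducible subsample of `rows` (hypothetical helper).

    `sample_pct` is a percentage in (0, 100], so 1.0 keeps about 1% of rows.
    """
    if not 0.0 < sample_pct <= 100.0:
        raise ValueError("sample_pct must be in (0, 100]")
    rows = list(rows)
    if not rows or sample_pct == 100.0:
        return rows
    # Keep at least one row so tiny datasets still yield a usable sample.
    k = max(1, round(len(rows) * sample_pct / 100.0))
    # A private Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    # Sorting the drawn indices preserves the original row order.
    indices = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in indices]


pairs = [{"id": i} for i in range(1000)]
small = subsample(pairs, sample_pct=1.0)
```

With the same seed and input, the function returns the same rows on every call, which is the property a deterministic subsample needs.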

A useful override for the shortest possible run (Windows first, then POSIX shells):

set EPOCHS=1
set MODELS=codebert
set BENCHMARKS=bcb poj104
run_everything.bat

EPOCHS=1 MODELS=codebert BENCHMARKS="bcb poj104" ./run_everything.sh

To download and inspect datasets without training (Windows first, then POSIX shells):

set INSTALL_DEPS=0
set RUN_BENCHMARKS=0
set RUN_COMPARISONS=0
run_everything.bat

INSTALL_DEPS=0 RUN_BENCHMARKS=0 RUN_COMPARISONS=0 ./run_everything.sh

To run the workflow step by step instead:

# 1. Clone
git clone https://github.com/jorge-martinez-gil/small-code-models.git
cd small-code-models

# 2. Install
pip install -e ".[dev]"

# 3. Download automatically retrievable datasets
python scripts/download_datasets.py --dataset all --output_root datasets --skip_existing

# 4. Inspect normalized split diagnostics before final runs
python scripts/inspect_dataset.py datasets/bcb --strict_data

# 5. Run CodeBERT on BigCloneBench
python bcb_detection_models/codebert-bcb-01.py \
    --data_dir datasets/bcb \
    --output_dir results/codebert_bcb \
    --seed 42 \
    --bootstrap_resamples 1000

# 6. Run all models on all downloaded/normalized datasets
bash scripts/run_all_benchmarks.sh datasets

# 7. Run any registered model/benchmark pair
python scripts/run_clone_experiment.py \
    --model codet5_small \
    --benchmark codenet \
    --data_dir /path/to/prepared_codenet \
    --output_dir results/codet5_small_codenet

# 8. Prepare CodeNet/CLCDSA-style problem directories as pair_jsonl
python scripts/prepare_pair_dataset.py \
    --source_dir /path/to/problem_directories \
    --output_dir /path/to/prepared_pairs \
    --negative_ratio 1.0 \
    --split_strategy problem

# 9. Inspect the prepared pairs and save the diagnostics to a file
python scripts/inspect_dataset.py /path/to/prepared_pairs \
    --strict_data \
    --output /path/to/prepared_pairs/diagnostics.json

# 10. Summarize completed runs
python scripts/summarize_results.py results

# 11. Compare two models on the same test examples
python scripts/compare_predictions.py \
    results/codebert_bcb/predictions.jsonl \
    results/graphcodebert_bcb/predictions.jsonl

Each benchmark script supports:

Argument	Purpose
`--seed`	Seeds training, sampling, and bootstrap intervals.
`--sample_pct`	Runs a deterministic subsample of the given percentage; defaults to `1.0` (1% of the data) for resource-limited fine-tuning.
`--max_length`	Sets tokenizer truncation length for each code pair.
`--strict_data`	Fails on malformed rows, missing snippets, or invalid labels.
`--bootstrap_resamples`	Controls confidence-interval precision and runtime.
`--no_artifacts`	Disables JSON/JSONL artifact writing.
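
To illustrate what `--bootstrap_resamples` trades off, the following is a minimal sketch of a percentile bootstrap interval for accuracy. The function names and the percentile method are assumptions for illustration; the repository's own statistics code may differ.

```python
import random


def accuracy(labels, preds):
    """Fraction of predictions that match their labels."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def bootstrap_ci(labels, preds, metric=accuracy, resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for `metric` (illustrative helper)."""
    if resamples < 1 or not labels:
        raise ValueError("need at least one resample and one example")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(resamples):
        # Resample example indices with replacement, keeping label/prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    # Read the interval endpoints off the sorted resampled statistics.
    low = stats[int((alpha / 2) * (resamples - 1))]
    high = stats[int((1 - alpha / 2) * (resamples - 1))]
    return low, high


labels = [1, 0] * 50
preds = list(labels)
for i in range(10):  # flip ten predictions so accuracy is 0.9
    preds[i] = 1 - preds[i]
point = accuracy(labels, preds)
low, high = bootstrap_ci(labels, preds, resamples=1000)
```

Runtime grows linearly with the number of resamples, while the interval endpoints become less noisy, which is why the resample count is worth lowering for smoke tests and raising for final runs.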

The full benchmark script also honors these environment variables: PYTHON_BIN, RESULTS_ROOT, SAMPLE_PCT, EPOCHS, SEED, MAX_LENGTH, BOOTSTRAP_RESAMPLES, and STRICT_DATA (set STRICT_DATA=1 to enable strict data checks).

Dataset Download

The downloader stores automatically retrievable datasets under datasets/ by default and writes a dataset_source.json source/conversion manifest in each dataset folder.
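
As a rough picture of what a source manifest can record, the sketch below hashes every file in a dataset folder and writes the result as JSON. The schema (`source`, `files`) and the helper names are hypothetical; the actual contents of dataset_source.json are not specified here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(dataset_dir, source, manifest_name="dataset_source.json"):
    """Write a manifest with one hash per file (hypothetical schema)."""
    dataset_dir = Path(dataset_dir)
    files = {
        path.name: sha256_of(path)
        for path in sorted(dataset_dir.iterdir())
        if path.is_file() and path.name != manifest_name
    }
    manifest = {"source": source, "files": files}
    (dataset_dir / manifest_name).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
    return manifest


# Demonstrate on a throwaway folder rather than a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data.jsonl").write_text('{"idx": "0"}\n', encoding="utf-8")
    manifest = write_manifest(tmp, source="example-source")
```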

python scripts/download_datasets.py --list
python scripts/download_datasets.py \
    --dataset bcb \
    --dataset poj104 \
    --output_root datasets \
    --skip_existing

Use --inspect_after_download only when you want the downloader to run full split diagnostics immediately. For large BCB downloads, it is usually better to run scripts/inspect_dataset.py separately when preparing final runs.

Currently automated sources:

Local key	Source	Notes
`bcb`	Hugging Face `google/code_x_glue_cc_clone_detection_big_clone_bench`	Stored directly as `pair_jsonl`.
`poj104`	Hugging Face `google/code_x_glue_cc_clone_detection_poj104`	Official task is retrieval; downloader builds deterministic binary pairs for this repository.
`poolc`	Hugging Face `PoolC/5-fold-clone-detection-600k-5fold`	Stored as `pair_jsonl`; the Hugging Face `val` split is deterministically divided into validation and test rows.
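
One common way to divide a single split deterministically is to hash a stable row identifier and bucket on the result. The sketch below shows that approach under assumptions: the key name `pair_id`, the 50/50 ratio, and the function name are illustrative, and the downloader's actual rule for the PoolC `val` split may differ.

```python
import hashlib


def split_val_rows(rows, key="pair_id", test_fraction=0.5):
    """Assign each row to validation or test from a hash of its stable ID."""
    validation, test_rows = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test_rows if bucket < test_fraction else validation).append(row)
    return validation, test_rows


rows = [{"pair_id": f"p{i}"} for i in range(200)]
validation, test_rows = split_val_rows(rows)
```

Because the assignment depends only on each row's identifier, it does not change with row order, random seeds, or Python version.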

For POJ-104, use --poj_pairs_per_label all to materialize exhaustive positive pairs. By default, the downloader samples 1000 positive pairs per label per split plus an equal number of negatives, which yields a much smaller dataset suited to smoke tests or resource-constrained runs.
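
The pair construction can be sketched as follows: snippets that share a problem label form positive pairs, and an equal number of negatives pair a snippet with one from a different label. This is a simplified illustration with a hypothetical `build_pairs` helper, not the downloader's code; in particular, it draws negatives with replacement and may repeat a negative pair.

```python
import itertools
import random


def build_pairs(snippets_by_label, pairs_per_label=1000, seed=42):
    """Return (snippet_a, snippet_b, is_clone) triples (illustrative helper)."""
    rng = random.Random(seed)
    labels = sorted(snippets_by_label)
    pairs = []
    for label in labels:
        # Every unordered pair within one problem label is a positive.
        positives = list(itertools.combinations(snippets_by_label[label], 2))
        if pairs_per_label != "all" and len(positives) > pairs_per_label:
            positives = rng.sample(positives, pairs_per_label)
        pairs.extend((a, b, 1) for a, b in positives)
        others = [other for other in labels if other != label]
        if not others:
            continue  # negatives need at least two labels
        # Match the positive count with cross-label negatives.
        for _ in range(len(positives)):
            a = rng.choice(snippets_by_label[label])
            b = rng.choice(snippets_by_label[rng.choice(others)])
            pairs.append((a, b, 0))
    return pairs


toy = {"p1": ["a1", "a2", "a3"], "p2": ["b1", "b2", "b3"]}
sampled = build_pairs(toy, pairs_per_label=2)
exhaustive = build_pairs(toy, pairs_per_label="all")
```

The exhaustive setting grows quadratically with the number of snippets per label, which is why a per-label cap is the practical default for constrained runs.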

GCJ, Karnalim, CodeNet, CLCDSA, SemanticCloneBench, and GPTCloneBench still require manual source acquisition or conversion because no stable public direct-download endpoint is registered in this repository.

Local Dataset Normalization

If you already have raw datasets under datasets/, normalize the supported local formats before training:

python scripts/normalize_local_datasets.py --dataset all --input_root datasets --output_root datasets

Currently supported local conversions:

Dataset	Accepted local files	Output
`gcj`	`train.txt`, `valid.txt`, `test.txt`, plus `googlejam4_src/` files	Adds `data.jsonl`, converts `-1/1` labels to `0/1`, backs up raw splits as `raw_*.txt`.
`karnalim`	`training.json`, `validation.json`, `test.json`	Adds `data.jsonl`, `train.txt`, `valid.txt`, `test.txt`.
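
The `-1/1` to `0/1` label conversion for GCJ amounts to a small mapping. The sketch below assumes each raw split line holds two file identifiers followed by a label, separated by whitespace; that line layout and the function names are assumptions, not a description of the normalizer's parser.

```python
def convert_label(raw_label):
    """Map a raw -1/1 clone label to the 0/1 convention."""
    mapping = {"-1": 0, "1": 1}
    try:
        return mapping[raw_label.strip()]
    except KeyError:
        raise ValueError(f"unexpected label: {raw_label!r}") from None


def convert_line(line):
    """Parse one assumed 'left right label' line into a normalized triple."""
    left, right, raw_label = line.split()
    return left, right, convert_label(raw_label)


converted = [convert_line(line) for line in ["a.java b.java -1", "a.java c.java 1"]]
```

Rejecting unknown labels instead of silently mapping them mirrors the intent of strict data checks.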

CodeNet, CLCDSA, SemanticCloneBench, and GPTCloneBench still need their actual source files or official pair files before they can be normalized.

Run Artifacts

Every run writes these files to --output_dir unless --no_artifacts is set:

File	Contents
`metrics.json`	Trainer metrics plus bootstrap confidence intervals.
`predictions.jsonl`	Per-example label, prediction, score, logits, correctness, example ID, pair ID, snippet IDs, and snippet hashes.
`run_manifest.json`	Model metadata, dataset diagnostics, file hashes, training arguments, package versions, CUDA details, and Git revision.

These artifacts are designed for reviewer-facing replication packages. They let readers verify the exact split files used, reproduce aggregate tables, rerun statistical comparisons, align model outputs by stable example or pair ID, and inspect individual prediction errors. Dataset inspection also reports cross-split pair and snippet overlaps, so generated benchmark subsets can be checked for train/test source-code leakage before final runs.
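
As an illustration of how aligned predictions support a paired comparison, the sketch below joins two sets of prediction rows on a stable ID and runs an exact two-sided McNemar test on the discordant examples. The field names `example_id`, `label`, and `prediction` and both helper names are assumptions; check them against your own predictions.jsonl before reuse.

```python
import json
from math import comb


def load_correctness(lines, key="example_id"):
    """Map each stable ID to whether that prediction was correct."""
    correct = {}
    for line in lines:
        row = json.loads(line)
        correct[row[key]] = row["prediction"] == row["label"]
    return correct


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test over examples present in both runs."""
    shared = sorted(set(correct_a) & set(correct_b))
    # Discordant counts: one model is right where the other is wrong.
    b = sum(correct_a[i] and not correct_b[i] for i in shared)
    c = sum(correct_b[i] and not correct_a[i] for i in shared)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null, discordant outcomes follow Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# In-memory stand-ins for two predictions.jsonl files.
lines_a = [
    json.dumps({"example_id": f"e{i}", "label": 1, "prediction": 0 if i == 0 else 1})
    for i in range(10)
]
lines_b = [
    json.dumps({"example_id": f"e{i}", "label": 1, "prediction": 0 if 1 <= i <= 5 else 1})
    for i in range(10)
]
b, c, p_value = mcnemar_exact(load_correctness(lines_a), load_correctness(lines_b))
```

Joining on the ID rather than on row position keeps the test valid even when the two files list examples in different orders or cover slightly different subsets.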

Repository Structure

small-code-models/
|-- small_code_models/          # Shared Python library
|   |-- artifacts.py            # JSON/JSONL artifact writing
|   |-- data.py                 # Dataset loading and diagnostics
|   |-- metrics.py              # Classification and ranking metrics
|   |-- modeling.py             # Registry-based model loading
|   |-- pair_builder.py         # Problem-directory pair generation
|   |-- registry.py             # Model and benchmark metadata
|   |-- reproducibility.py      # Seeds, environment, and Git metadata
|   |-- statistics.py           # Bootstrap and paired tests
|   `-- trainer.py              # Generic fine-tuning trainer
|-- bcb_detection_models/       # BigCloneBench scripts (x6 models)
|-- gcj_clone_detection_models/ # Google Code Jam scripts
|-- karnalim_clone_detection_models/
|-- poj104_clone_detection_models/
|-- poolc_clone_detection_models/
|-- notebooks/
|   `-- quick_start.ipynb
|-- scripts/
|   |-- run_all_benchmarks.sh
|   |-- compare_predictions.py
|   |-- download_datasets.py
|   |-- inspect_dataset.py
|   |-- prepare_pair_dataset.py
|   |-- run_clone_experiment.py
|   `-- summarize_results.py
|-- docs/
|   |-- REPRODUCIBILITY.md
|   `-- RESULTS.md
|-- run_everything.bat
|-- run_everything.sh
|-- pyproject.toml
`-- requirements.txt

Reproducibility Checklist

For a journal submission, report at minimum:

Dataset source and version, plus the SHA-256 hashes stored in run_manifest.json.
Model checkpoint ID, seed, epoch count, tokenizer max length, and sampling percentage.
Hardware, CUDA, package versions, and Git commit from run_manifest.json.
Accuracy, precision, recall, F1, balanced accuracy, MCC, ROC-AUC, PR-AUC, specificity, false-positive/false-negative rates, Brier score, log loss, and expected calibration error.
Bootstrap confidence intervals from metrics.json.
Paired significance tests over saved predictions.jsonl files, aligned by example_id or pair_id when available, for model comparisons.
Cross-split pair/snippet overlap diagnostics from inspect_dataset_directory, with generated problem-directory corpora prepared using --split_strategy problem.
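
The overlap diagnostic in the last item can be approximated as follows: hash each snippet, then count the snippet hashes and unordered pair identities that two splits share. The whitespace normalization, the report keys, and the function names are assumptions for illustration; the repository's diagnostics may define identity differently.

```python
import hashlib


def snippet_hash(code):
    """Hash a snippet after collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_overlaps(splits):
    """Count shared snippets and pairs for every pair of splits.

    `splits` maps a split name to a list of (code_a, code_b) tuples.
    """
    snippet_sets, pair_sets = {}, {}
    for name, pairs in splits.items():
        snippet_sets[name] = {snippet_hash(code) for pair in pairs for code in pair}
        # frozenset makes (a, b) and (b, a) count as the same pair.
        pair_sets[name] = {frozenset(snippet_hash(code) for code in pair) for pair in pairs}
    names = sorted(splits)
    report = {}
    for index, left in enumerate(names):
        for right in names[index + 1:]:
            report[(left, right)] = {
                "shared_snippets": len(snippet_sets[left] & snippet_sets[right]),
                "shared_pairs": len(pair_sets[left] & pair_sets[right]),
            }
    return report


report = split_overlaps({
    "train": [("int a;", "int b;")],
    "test": [("int  a;", "int c;"), ("int b;", "int a;")],
})
```

Any nonzero count between train and test signals possible leakage; splitting by problem, as --split_strategy problem does for generated corpora, is meant to keep those counts at zero.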

Citing This Work

If this repository or paper helps your research, please cite:

@article{martinezgil2025smallscale,
  author        = {Jorge Martinez-Gil},
  title         = {Evaluating Small-Scale Code Models for Code Clone Detection},
  journal       = {CoRR},
  volume        = {abs/2506.10995},
  year          = {2025},
  url           = {https://doi.org/10.48550/arXiv.2506.10995},
  eprint        = {2506.10995},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE}
}

APA: Martinez-Gil, J. (2025). Evaluating small-scale code models for code clone detection. CoRR, abs/2506.10995. https://doi.org/10.48550/arXiv.2506.10995

License

This project is licensed under the MIT License.
