TL;DR: This repository is the official, reproducible benchmark suite for the paper Evaluating Small-Scale Code Models for Code Clone Detection, with unified scripts to train and evaluate six compact code models across five clone-detection datasets.
Code clone detection is a critical task for software maintenance, plagiarism detection, and refactoring. While Large Language Models (LLMs) have shown promise, their computational cost is prohibitive for many real-time or resource-constrained environments.
This work rigorously evaluates small-scale transformer-based code models (<220M parameters) to determine their efficacy in distinguishing clone pairs. We provide a unified evaluation framework across five benchmark datasets, offering insights into the trade-offs between model size, architecture (Encoder-only vs. Encoder-Decoder), and detection accuracy. All evaluation scripts, pre-processed dataset loaders, and results are publicly available in this repository to facilitate reproducibility and further research.
- First systematic comparison of six <220 M parameter code models on five established clone-detection benchmarks.
- Unified, reproducible evaluation harness (train → evaluate → report F1/Precision/Recall) runnable in a single command.
- Publicly released, pre-configured training scripts for BigCloneBench, POJ104, GCJ, Karnalim, and PoolC.
- Evidence that encoder-only models (CodeBERT, GraphCodeBERT) consistently outperform decoder-only counterparts on clone-detection tasks.
- Lightweight shared library (small_code_models/) enabling researchers to plug in new models with ≤30 lines of code.
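The reporting step of the harness (F1/Precision/Recall) treats clone detection as binary classification over code pairs. As a minimal sketch, assuming labels of 1 for a clone pair and 0 for a non-clone pair, the three metrics could be computed as follows. The function name `clone_metrics` is hypothetical and is not claimed to match the API of small_code_models/metrics.py.

```python
from typing import Dict, Sequence


def clone_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive (clone) class.

    Hypothetical helper for illustration; assumes 1 = clone, 0 = non-clone.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count outcomes for the positive class only.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
```

Calling `clone_metrics(labels, predictions)` on a held-out test split returns a dictionary that can be logged or written to a results file.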
| Model | Parameters | Architecture | HuggingFace Hub ID |
|---|---|---|---|
| CodeBERT | 125 M | Encoder-only | microsoft/codebert-base |
| GraphCodeBERT | 125 M | Encoder-only (Data-Flow) | microsoft/graphcodebert-base |
| PLBART | 140 M | Encoder-Decoder | uclanlp/plbart-base |
| PolyCoder | 160 M | Decoder-only | NinedayWang/PolyCoder-160M |
| UniXcoder | ~200 M | Unified Enc-Dec | microsoft/unixcoder-base |
| Salesforce CodeT5 | 220 M | Encoder-Decoder | Salesforce/codet5-base |
# 1. Clone
git clone https://github.com/jorge-martinez-gil/small-code-models.git
cd small-code-models
# 2. Install
pip install -e ".[dev]"
# 3. Run CodeBERT on BigCloneBench
python bcb_detection_models/codebert-bcb-01.py \
--data_dir /path/to/bcb \
--output_dir results/codebert_bcb
# 4. Run ALL models on ALL datasets (bash)
bash scripts/run_all_benchmarks.sh /path/to/datasets
small-code-models/
├── small_code_models/ # Shared Python library
│ ├── __init__.py
│ ├── data.py # Dataset loading utilities
│ ├── metrics.py # Evaluation metrics
│ └── trainer.py # Generic fine-tuning trainer
├── bcb_detection_models/ # BigCloneBench scripts (×6 models)
├── gcj_clone_detection_models/ # Google Code Jam scripts
├── karnalim_clone_detection_models/
├── poj104_clone_detection_models/
├── poolc_clone_detection_models/
├── notebooks/
│ └── quick_start.ipynb # Interactive demo
├── scripts/
│ └── run_all_benchmarks.sh # Full reproduction script
├── docs/
│ └── RESULTS.md # Detailed results & analysis
├── pyproject.toml
└── requirements.txt
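Given the layout above, a dry run of the full benchmark sweep can be sketched in Python by discovering the scripts in each per-dataset directory and building one command per script. This is only an illustration, not the contents of scripts/run_all_benchmarks.sh. It assumes that every script accepts the same `--data_dir` and `--output_dir` flags as the CodeBERT example, and the dataset sub-directory names (`bcb`, `gcj`, and so on) and the function `build_commands` are hypothetical.

```python
from pathlib import Path
from typing import List

# Script directories come from the repository layout; the dataset
# sub-directory names on the right are assumptions for illustration.
SCRIPT_DIRS = {
    "bcb_detection_models": "bcb",
    "gcj_clone_detection_models": "gcj",
    "karnalim_clone_detection_models": "karnalim",
    "poj104_clone_detection_models": "poj104",
    "poolc_clone_detection_models": "poolc",
}


def build_commands(repo_root: str, datasets_root: str,
                   results_root: str = "results") -> List[List[str]]:
    """Return one argv list per training script found under repo_root."""
    repo = Path(repo_root)
    commands: List[List[str]] = []
    for script_dir, dataset in SCRIPT_DIRS.items():
        directory = repo / script_dir
        if not directory.is_dir():
            continue  # skip directories missing from this checkout
        for script in sorted(directory.glob("*.py")):
            commands.append([
                "python", str(script),
                # Assumed: every script takes the same two flags.
                "--data_dir", str(Path(datasets_root) / dataset),
                "--output_dir", str(Path(results_root) / script.stem),
            ])
    return commands
```

Printing each entry of `build_commands(".", "/path/to/datasets")` shows what would be executed; passing an entry to `subprocess.run` would launch that run.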
If this repository or paper helps your research, please cite:
BibTeX
@article{martinezgil2025smallscale,
author = {Jorge Martinez-Gil},
title = {Evaluating Small-Scale Code Models for Code Clone Detection},
journal = {CoRR},
volume = {abs/2506.10995},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2506.10995},
eprint = {2506.10995},
archivePrefix = {arXiv},
primaryClass = {cs.SE}
}
APA
Martinez-Gil, J. (2025). Evaluating small-scale code models for code clone detection. CoRR, abs/2506.10995. https://doi.org/10.48550/arXiv.2506.10995
A CITATION.cff file is included for one-click citation via GitHub's "Cite this repository" button.
- Ramachandran, R., Vijayan, P., Anilkumar, A., et al. (2025). AI Assisted System for Automated Evaluation of Entity-Relationship Diagram and Schema Diagram Using Large Language Models. Big Data and Cognitive Computing.
- Li, C., Konpang, J., Sirikham, A., & Wang, Y. (2025). Nuanced Code Clone Detection Through LLM-Based Code Revision and AST Graph Modeling. IEEE Access.
- Yang, J., Liu, X., Lv, W., Deng, K., Guo, S., Jing, L., Li, Y., et al. (2025). From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence. arXiv preprint.
This project is licensed under the MIT License.