Evaluating Small-Scale Code Models for Code Clone Detection

TL;DR: This repository is the official, reproducible benchmark suite for the paper Evaluating Small-Scale Code Models for Code Clone Detection, with unified scripts to train and evaluate six compact code models across five clone-detection datasets.

Abstract

Code clone detection is a critical task for software maintenance, plagiarism detection, and refactoring. While Large Language Models (LLMs) have shown promise, their computational cost is prohibitive for many real-time or resource-constrained environments.

This work rigorously evaluates small-scale transformer-based code models (≤220M parameters) to determine their efficacy in distinguishing clone pairs. We provide a unified evaluation framework across five benchmark datasets, offering insights into the trade-offs between model size, architecture (Encoder-only vs. Encoder-Decoder), and detection accuracy. All evaluation scripts, pre-processed dataset loaders, and results are publicly available in this repository to facilitate reproducibility and further research.

Key Contributions

First systematic comparison of six code models with ≤220M parameters on five established clone-detection benchmarks.
Unified, reproducible evaluation harness (train → evaluate → report F1/Precision/Recall) runnable in a single command.
Publicly released, pre-configured training scripts for BigCloneBench, POJ104, GCJ, Karnalim, and PoolC.
Evidence that encoder-only models (CodeBERT, GraphCodeBERT) consistently outperform decoder-only counterparts on clone-detection tasks.
Lightweight shared library (small_code_models/) enabling researchers to plug in new models with ≤30 lines of code.
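The reporting step of the harness (F1/Precision/Recall) can be illustrated with a small, dependency-free sketch. It assumes clone detection is scored as binary classification over code pairs, with label 1 meaning "clone"; the names `PairMetrics` and `clone_pair_metrics` are hypothetical and are not taken from `small_code_models/metrics.py`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairMetrics:
    precision: float
    recall: float
    f1: float


def clone_pair_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> PairMetrics:
    """Precision, recall and F1 for the positive ("clone") class, label 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard every division so degenerate inputs yield 0.0 instead of an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return PairMetrics(precision, recall, f1)


# Toy data (not a result from the paper): 3 clone pairs, 3 non-clone pairs.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(clone_pair_metrics(gold, pred))
```

On the toy data there are two true positives, one false positive, and one false negative, so all three metrics come out to 2/3.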

Models Evaluated

Model	Parameters	Architecture	HuggingFace Hub ID
CodeBERT	125 M	Encoder-only	`microsoft/codebert-base`
GraphCodeBERT	125 M	Encoder-only (Data-Flow)	`microsoft/graphcodebert-base`
PLBART	140 M	Encoder-Decoder	`uclanlp/plbart-base`
PolyCoder	160 M	Decoder-only	`NinedayWang/PolyCoder-160M`
UniXCoder	~200 M	Unified Enc-Dec	`microsoft/unixcoder-base`
Salesforce CodeT5	220 M	Encoder-Decoder	`Salesforce/codet5-base`
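
As a rough illustration of how a Hub ID from this table might be used, the sketch below loads one of the two encoder-only checkpoints with the Hugging Face `transformers` library and tokenizes a code pair as a single input. It assumes clone detection is framed as two-label sequence-pair classification; the helper names `load_pair_classifier` and `encode_pair` are hypothetical, and the repository's own scripts may set things up differently.

```python
# Hub IDs copied from the table; only the two encoder-only models are listed.
ENCODER_ONLY_HUB_IDS = {
    "CodeBERT": "microsoft/codebert-base",
    "GraphCodeBERT": "microsoft/graphcodebert-base",
}


def load_pair_classifier(model_name: str):
    """Return (tokenizer, model) with a freshly initialised 2-way head.

    Assumption: binary pair classification; the classification head is
    untrained and must be fine-tuned before its predictions mean anything.
    """
    # Imported lazily so the mapping can be inspected without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    hub_id = ENCODER_ONLY_HUB_IDS[model_name]
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForSequenceClassification.from_pretrained(hub_id, num_labels=2)
    return tokenizer, model


def encode_pair(tokenizer, code_a: str, code_b: str, max_length: int = 512):
    """Tokenize two snippets as one input, truncated to max_length tokens."""
    return tokenizer(
        code_a,
        code_b,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

Calling `load_pair_classifier("CodeBERT")` downloads the checkpoint on first use, so it needs network access plus `transformers` and PyTorch installed.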

Quick Start

# 1. Clone
git clone https://github.com/jorge-martinez-gil/small-code-models.git
cd small-code-models

# 2. Install
pip install -e ".[dev]"

# 3. Run CodeBERT on BigCloneBench
python bcb_detection_models/codebert-bcb-01.py \
    --data_dir /path/to/bcb \
    --output_dir results/codebert_bcb

# 4. Run ALL models on ALL datasets (bash)
bash scripts/run_all_benchmarks.sh /path/to/datasets
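
For launching a run from Python (for example inside a notebook), step 3 can be wrapped in a thin helper. This is a minimal sketch: `build_command` and `run_script` are hypothetical names, and only the script path and the `--data_dir` / `--output_dir` flags come from the commands above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(script: str, data_dir: str, output_dir: str) -> List[str]:
    """Assemble the same invocation as step 3, using the current interpreter."""
    return [sys.executable, script, "--data_dir", data_dir, "--output_dir", output_dir]


def run_script(script: str, data_dir: str, output_dir: str) -> None:
    """Create the output directory, run the script, raise on a non-zero exit."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(script, data_dir, output_dir), check=True)


# Show the command that would be executed, without running it.
print(" ".join(build_command(
    "bcb_detection_models/codebert-bcb-01.py",
    "/path/to/bcb",
    "results/codebert_bcb",
)))
```

Using `sys.executable` keeps the child process in the same virtual environment as the caller, and `check=True` surfaces a failed training run as a `CalledProcessError` instead of passing silently.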

Repository Structure

small-code-models/
├── small_code_models/          # Shared Python library
│   ├── __init__.py
│   ├── data.py                 # Dataset loading utilities
│   ├── metrics.py              # Evaluation metrics
│   └── trainer.py              # Generic fine-tuning trainer
├── bcb_detection_models/       # BigCloneBench scripts (×6 models)
├── gcj_clone_detection_models/ # Google Code Jam scripts
├── karnalim_clone_detection_models/
├── poj104_clone_detection_models/
├── poolc_clone_detection_models/
├── notebooks/
│   └── quick_start.ipynb       # Interactive demo
├── scripts/
│   └── run_all_benchmarks.sh  # Full reproduction script
├── docs/
│   └── RESULTS.md              # Detailed results & analysis
├── pyproject.toml
└── requirements.txt

Citing this work

If this repository or paper helps your research, please cite:

BibTeX

@article{martinezgil2025smallscale,
  author       = {Jorge Martinez-Gil},
  title        = {Evaluating Small-Scale Code Models for Code Clone Detection},
  journal      = {CoRR},
  volume       = {abs/2506.10995},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2506.10995},
  eprint       = {2506.10995},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE}
}

APA

Martinez-Gil, J. (2025). Evaluating small-scale code models for code clone detection. CoRR, abs/2506.10995. https://doi.org/10.48550/arXiv.2506.10995

A CITATION.cff file is included for one-click citation via GitHub's "Cite this repository" button.

Research citing this work

Ramachandran, R., Vijayan, P., Anilkumar, A., et al. (2025). AI Assisted System for Automated Evaluation of Entity-Relationship Diagram and Schema Diagram Using Large Language Models. Big Data and Cognitive Computing.
Li, C., Konpang, J., Sirikham, A., & Wang, Y. (2025). Nuanced Code Clone Detection Through LLM-Based Code Revision and AST Graph Modeling. IEEE Access.
Yang, J., Liu, X., Lv, W., Deng, K., Guo, S., Jing, L., Li, Y., et al. (2025). From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence. arXiv preprint.

License & Contact

This project is licensed under the MIT License.
