Thank you for your interest in contributing to small-code-models! This project supports the research paper "Evaluating Small-Scale Code Models for Code Clone Detection", and community contributions help make its findings more reproducible and extensible.
If you encounter errors when running the scripts, loading datasets, or reproducing results, please open a GitHub Issue using the Bug Report template.
Please include:
- The script name and dataset you were using (e.g., bcb_detection_models/codebert-bcb-01.py)
- The full error traceback
- Your Python version and GPU/CPU environment
- The versions of key packages (torch, transformers, datasets)
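The environment details requested above can be gathered with a short helper. This is a minimal sketch, not a script shipped with the repository: the function name `collect_environment` and the `KEY_PACKAGES` tuple are hypothetical, and it reads installed versions through the standard-library `importlib.metadata` so that no heavy package has to be imported.

```python
import platform
from importlib import metadata

# Packages named in the bug-report checklist above.
KEY_PACKAGES = ("torch", "transformers", "datasets")


def collect_environment(packages=KEY_PACKAGES):
    """Return a dict with the Python version, platform, and package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Paste the printed lines into the issue. If PyTorch is installed, the result of `torch.cuda.is_available()` is a useful addition for describing the GPU/CPU environment.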
- Pick an existing script from a relevant directory (e.g., bcb_detection_models/codebert-bcb-01.py) as a template.
- Replace the model name, tokenizer class, and model class with the new model's equivalents from the HuggingFace Hub.
- Name the new file following the existing convention: <model>-<dataset>-01.py.
- Verify that the script runs end-to-end and reports F1 / Precision / Recall.
- Open a Pull Request and fill out the checklist below.
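When checking that a new script reports F1 / Precision / Recall, it can help to compare its numbers against an independent calculation. The sketch below is dependency-free and is not taken from the repository's scripts, which may compute metrics differently (for example with a library). The function name `binary_clone_metrics` and the assumption that label `1` marks a clone pair are both illustrative.

```python
def binary_clone_metrics(true_labels, predicted_labels, positive=1):
    """Compute precision, recall, and F1 for the positive (clone) class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Two true clone pairs, one of which is detected, and no false alarms.
    print(binary_clone_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Feeding the script's own predictions and gold labels into this function gives a quick sanity check before opening the Pull Request.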
- Create a new directory at the repository root named after the dataset (e.g., newdataset_clone_detection_models/).
- Add one script per model following the existing naming convention.
- Document the dataset source and format in a short comment block at the top of each script.
- Open a Pull Request and fill out the checklist below.
- Follow PEP 8 for all Python code.
- Use descriptive variable and function names.
- Keep imports at the top of the file, grouped (standard library → third-party → local).
- Add docstrings to all public functions.
- Keep each script self-contained and runnable without extra configuration files.
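The style points above can be illustrated with a small skeleton. It is a hedged sketch rather than an excerpt from the repository: the header fields are placeholders to fill in, the function `summarize_labels` is hypothetical, and the third-party import group is left as a comment so the example runs without torch or transformers installed.

```python
# Dataset: <dataset name>          (placeholder, fill in)
# Source:  <URL or citation>       (placeholder, fill in)
# Format:  <file type and fields>  (placeholder, fill in)

# Standard library
from collections import Counter

# Third-party
# (a real script would import packages such as torch or transformers here)

# Local
# (none: each script is meant to be self-contained)


def summarize_labels(labels):
    """Return a dict mapping each label to the number of times it occurs.

    Args:
        labels: An iterable of hashable labels, e.g. 0/1 clone flags.

    Returns:
        A plain dict of label counts.
    """
    return dict(Counter(labels))


if __name__ == "__main__":
    print(summarize_labels([1, 0, 1, 1]))
```

The comment block at the top doubles as the dataset documentation requested for new dataset scripts.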
- Clone the repository and install dependencies:
    git clone https://github.com/jorge-martinez-gil/small-code-models.git
    cd small-code-models
    pip install -r requirements.txt
- Download the required dataset (see dataset links in the README).
- Update the dataset path inside the script you want to run (the path variables are clearly marked near the top of each main() function).
- Run the script:
    python bcb_detection_models/codebert-bcb-01.py
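A common stumbling block in the steps above is a dataset path that was not updated. The sketch below shows one way a script could fail fast with a clear message. It assumes nothing about how the existing scripts are written: `require_dataset`, the `dataset_path` variable, and the example path are all hypothetical.

```python
from pathlib import Path


def require_dataset(path_str):
    """Return the dataset location as a Path, or raise if it does not exist."""
    path = Path(path_str).expanduser()
    if not path.exists():
        raise FileNotFoundError(
            f"Dataset not found at '{path}'. "
            "Update the path variable near the top of main()."
        )
    return path


def main():
    """Resolve the dataset path and report it (training code would follow)."""
    # --- Dataset path: edit before running (hypothetical variable name) ---
    dataset_path = "data/your-dataset-file"

    dataset = require_dataset(dataset_path)
    print(f"Using dataset at {dataset}")
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard, so a missing dataset is reported before any model is loaded.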
Before opening a PR, please confirm:
- The new or modified script runs without errors.
- The script follows the existing naming convention.
- Code is PEP 8 compliant.
- Docstrings are present for all new functions.
- The PR description explains what was changed and why.
- No hardcoded private paths or credentials are included.
- If a new model is added, the README model table has been updated.
- If a new dataset is added, the README dataset list has been updated.
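The naming item in the checklist above can be checked mechanically. This is a rough sketch under stated assumptions: the pattern treats the model and dataset parts as single lowercase alphanumeric tokens (as in codebert-bcb-01.py) and accepts any two-digit suffix, which is a guess beyond the documented `-01`. Adjust it if the repository uses other forms. The names `SCRIPT_NAME_PATTERN` and `follows_naming_convention` are hypothetical.

```python
import re

# Assumed shape: <model>-<dataset>-NN.py, all lowercase, NN = two digits.
SCRIPT_NAME_PATTERN = re.compile(r"^[a-z0-9]+-[a-z0-9]+-\d{2}\.py$")


def follows_naming_convention(filename):
    """Return True if the file name looks like <model>-<dataset>-NN.py."""
    return SCRIPT_NAME_PATTERN.match(filename) is not None


if __name__ == "__main__":
    for name in ("codebert-bcb-01.py", "CodeBERT_bcb.py"):
        status = "ok" if follows_naming_convention(name) else "rename needed"
        print(f"{name}: {status}")
```

Pass only the file name, not the directory path, since the pattern does not account for path separators.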
Feel free to open an Issue or start a GitHub Discussion if you have any questions.