🤝 Contributing

Thank you for your interest in contributing to small-code-models! This project supports the research paper "Evaluating Small-Scale Code Models for Code Clone Detection", and community contributions help make its findings easier to reproduce and extend.

🐛 Reporting Bugs

If you encounter errors when running the scripts, loading datasets, or reproducing results, please open a GitHub Issue using the Bug Report template.

Please include:

The script name and dataset you were using (e.g., bcb_detection_models/codebert-bcb-01.py)
The full error traceback
Your Python version and GPU/CPU environment
The versions of key packages (torch, transformers, datasets)
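
One way to gather the environment details listed above is a short helper like the following. It is a minimal sketch, not part of the repository: the function names `package_version`, `gpu_summary`, and `environment_report` are hypothetical, and GPU detection assumes a CUDA build of torch.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def gpu_summary():
    """Describe the accelerator torch can see (assumes CUDA is the only GPU backend)."""
    try:
        import torch
    except ImportError:
        return "unknown (torch not installed)"
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    return "CPU only"


def environment_report(packages=("torch", "transformers", "datasets")):
    """Build a plain-text summary suitable for pasting into a bug report."""
    lines = [
        f"Python: {platform.python_version()}",
        f"Platform: {platform.platform()}",
        f"Device: {gpu_summary()}",
    ]
    lines += [f"{name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Pasting the printed output into the issue, together with the script name and the full traceback, covers the last two items in the list.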

➕ Adding a New Model or Dataset

Adding a new model

Pick an existing script from a relevant directory (e.g., bcb_detection_models/codebert-bcb-01.py) as a template.
Replace the model name, tokenizer class, and model class with the new model's equivalents from the Hugging Face Hub.
Name the new file following the existing convention: <model>-<dataset>-01.py.
Verify that the script runs end-to-end and reports F1, precision, and recall.
Open a Pull Request and fill out the checklist below.
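
When checking the metrics a new script reports, it can help to recompute them independently on a handful of predictions. The sketch below assumes binary labels where 1 means "clone" and 0 means "not a clone"; the function name `binary_clone_metrics` is hypothetical, and the existing scripts may compute these values differently.

```python
def binary_clone_metrics(y_true, y_pred):
    """Return precision, recall, and F1 for the positive (clone) class.

    Assumes labels are 1 (clone) and 0 (not a clone).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Two true positives, one false positive, one false negative.
    print(binary_clone_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

If the numbers from this check disagree with what the script prints, the usual suspects are a swapped label convention or metrics averaged over both classes rather than the clone class only.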

Adding a new dataset

Create a new directory at the repository root named after the dataset (e.g., newdataset_clone_detection_models/).
Add one script per model following the existing naming convention.
Document the dataset source and format in a short comment block at the top of each script.
Open a Pull Request and fill out the checklist below.
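
The naming convention `<model>-<dataset>-01.py` can be checked mechanically before opening a PR. The following is a rough sketch under two assumptions that the guide does not state: the model and dataset parts are lowercase and contain no hyphens, and the trailing number always has two digits. `SCRIPT_NAME_PATTERN` and `parse_script_name` are hypothetical names.

```python
import re
from pathlib import Path

# Assumed shape: <model>-<dataset>-<two digits>.py, lowercase, no hyphens inside a part.
SCRIPT_NAME_PATTERN = re.compile(
    r"^(?P<model>[a-z0-9_.]+)-(?P<dataset>[a-z0-9_.]+)-(?P<run>\d{2})\.py$"
)


def parse_script_name(path):
    """Split a script file name into model, dataset, and run number.

    Returns a dict with those three keys, or None if the name does not match.
    """
    match = SCRIPT_NAME_PATTERN.match(Path(path).name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_script_name("bcb_detection_models/codebert-bcb-01.py"))
    print(parse_script_name("newdataset_clone_detection_models/MyModel.py"))
```

A model whose Hub name itself contains a hyphen would need a looser pattern, so treat a `None` result as a prompt to double-check the name rather than as a hard failure.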

🎨 Code Style Conventions

Follow PEP 8 for all Python code.
Use descriptive variable and function names.
Keep imports at the top of the file, grouped (standard library → third-party → local).
Add docstrings to all public functions.
Keep each script self-contained and runnable without extra configuration files.
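
As an illustration of these conventions, here is a small, hypothetical helper with grouped imports and a docstring. The JSON Lines input format and the name `load_code_pairs` are assumptions made for the example, not a description of how the existing scripts read their data.

```python
"""Hypothetical helper illustrating the style conventions above."""

# Standard library imports come first.
import json
from pathlib import Path

# Third-party imports (for example torch or transformers) would follow here,
# and local imports would come last.


def load_code_pairs(jsonl_path):
    """Load code pairs from a JSON Lines file.

    Each non-empty line is assumed to hold one JSON object describing a pair.
    Returns the parsed objects as a list, in file order.
    """
    pairs = []
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                pairs.append(json.loads(line))
    return pairs
```

Because the function takes the path as an argument rather than reading a configuration file, a script that uses it stays self-contained and runnable on its own.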

▶️ Running the Existing Scripts

Clone the repository and install dependencies:

git clone https://github.com/jorge-martinez-gil/small-code-models.git
cd small-code-models
pip install -r requirements.txt

Download the required dataset (see dataset links in the README).
Update the dataset path inside the script you want to run (the path variables are clearly marked near the top of each main() function).

Run the script:

python bcb_detection_models/codebert-bcb-01.py

✅ Pull Request Checklist

Before opening a PR, please confirm:

The new or modified script runs without errors.
The script follows the existing naming convention.
Code is PEP 8 compliant.
Docstrings are present for all new functions.
The PR description explains what was changed and why.
No hardcoded private paths or credentials are included.
If a new model is added, the README model table has been updated.
If a new dataset is added, the README dataset list has been updated.

💬 Questions?

Feel free to open an Issue or start a GitHub Discussion if you have any questions.
