
Commit cea2891

Update README with CoderEval setup and evaluation steps
Added instructions for setting up and evaluating with CoderEval.
1 parent 505a300 commit cea2891

1 file changed

Lines changed: 60 additions & 8 deletions


generation/README.md

@@ -9,14 +9,18 @@ code-generation/
 ├── fft_train.py # Full fine-tuning training script
 ├── qlora_train.py # QLoRA training script
 ├── codereval/
-├── infer_generation_fft.py # Inference for FFT models
-├── infer_generation_qlora.py # Inference for QLoRA models
-├── filter_codereval_ids.py # Filter unreliable test cases
-├── extract_code_from_jsonl.py # Extract code to individual files
-├── add_java_wrappers_cg.py # Add class wrappers for static analysis
-├── ids_to_discard.json # IDs with unreliable tests
-├── CEJavaHumanLabel.jsonl # CoderEval Java benchmark
-└── CEPythonHumanLabel.jsonl # CoderEval Python benchmark
+│   ├── infer_generation_fft.py # Inference for FFT models
+│   ├── infer_generation_qlora.py # Inference for QLoRA models
+│   ├── filter_codereval_ids.py # Filter unreliable test cases
+│   ├── extract_code_from_jsonl.py # Extract code to individual files
+│   ├── add_java_wrappers_cg.py # Add class wrappers for static analysis
+│   ├── ids_to_discard.json # IDs with unreliable tests
+│   ├── CEJavaHumanLabel.jsonl # CoderEval Java benchmark
+│   └── CEPythonHumanLabel.jsonl # CoderEval Python benchmark
+└── dataset/
+    └── codegen_codexglue/
+        ├── java/
+        └── python/
 ```
 
 ## Step 1: Training
@@ -130,6 +134,31 @@ The wrapped files can then be analyzed with:
 - **PMD**: For code quality metrics
 - **SonarCloud**: For static code analysis
 
+## Step 6: Evaluate with CoderEval (Pass@k)
+
+For functional correctness evaluation (Pass@k), we use the [CoderEval](https://github.com/CoderEval/CoderEval) benchmark platform.
+
+### Setup CoderEval Environment
+
+1. Download the Docker environment from [Google Drive](https://drive.google.com/drive/folders/1F8M7e25MgHZ3XJ4RSOGWindFSWC5QOvI?usp=sharing)
+
+2. Import the Docker image:
+```bash
+docker load -i codereval_docker.tar
+```
+
+3. Run the Docker container with your predictions:
+```bash
+docker run -v /path/to/predictions:/data codereval
+```
+
+### CoderEval Resources
+- **Repository**: https://github.com/CoderEval/CoderEval
+- **Benchmark Data**: `CoderEval4Java.json`, `CoderEval4Python.json`
+- **Docker Environment**: Contains pre-configured runtime for 43 Python projects and 10 Java projects
+
+For detailed instructions on running evaluations, refer to the [CoderEval README](https://github.com/CoderEval/CoderEval).
+
 ## Complete Pipeline Example
 
 ```bash
@@ -155,4 +184,27 @@ python codereval/extract_code_from_jsonl.py
 python codereval/add_java_wrappers_cg.py \
     --input_dir results/java_predictions_java_files \
     --output_dir results/java_predictions_java_files_wrapped
+
+# 6. Evaluate with CoderEval Docker (see Step 6 for setup)
+```
+
+## Metrics
+
+### Functional Correctness
+- **Pass@k**: Probability that at least one of k generated samples passes all test cases (computed via [CoderEval](https://github.com/CoderEval/CoderEval))
+
+### Code Quality (Static Analysis)
+- **PMD**: Code quality violations and metrics
+- **SonarCloud**: Code smells, bugs, vulnerabilities
+
+## Requirements
+
+```
+torch
+transformers
+datasets
+trl
+peft
+bitsandbytes
+codebleu
 ```
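The Pass@k definition added in the Metrics section is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021, "Evaluating Large Language Models Trained on Code"): generate n >= k samples per problem, count the c samples that pass all tests, compute pass@k = 1 - C(n-c, k) / C(n, k), and average over problems. The sketch below assumes CoderEval reports Pass@k under this convention; `pass_at_k` and `mean_pass_at_k` are illustrative names, not functions from the CoderEval repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of generated samples, c: samples passing all tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples per problem, with 3, 0 and 10 passing.
    per_problem = [(10, 3), (10, 0), (10, 10)]
    for k in (1, 5, 10):
        print(f"pass@{k} = {mean_pass_at_k(per_problem, k):.4f}")
```

Using the ratio of binomial coefficients averages over every size-k subset of the n samples, so it has lower variance than drawing k samples once and checking whether any of them pass. For k = 1 it reduces to c / n.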

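The updated tree lists `filter_codereval_ids.py` and `ids_to_discard.json` for removing benchmark items with unreliable tests, but the script itself is not part of this commit. The following is a hypothetical sketch of that kind of ID-based filter, assuming the discard file is a JSON array of ID strings and each JSONL record stores its ID under an `_id` key; the key name, `load_discard_ids`, and `filter_jsonl` are assumptions for illustration, not the repository's actual interface.

```python
import json
from pathlib import Path


def load_discard_ids(path: Path) -> set[str]:
    # Assumption: the discard file is a JSON array of ID strings.
    return {str(i) for i in json.loads(path.read_text(encoding="utf-8"))}


def filter_jsonl(src: Path, dst: Path, discard: set[str], id_field: str = "_id") -> tuple[int, int]:
    """Copy JSONL records from src to dst, skipping those whose ID is in discard.

    `id_field` is a hypothetical default. Returns a (kept, dropped) count pair.
    """
    kept = dropped = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # ignore blank lines between records
            record = json.loads(line)
            if str(record.get(id_field)) in discard:
                dropped += 1
                continue
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept, dropped
```

For example, `filter_jsonl(Path("in.jsonl"), Path("out.jsonl"), load_discard_ids(Path("codereval/ids_to_discard.json")))` writes the retained records and returns how many were kept and dropped; `in.jsonl` and `out.jsonl` are placeholder paths.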