@@ -9,14 +9,18 @@ code-generation/
 ├── fft_train.py # Full fine-tuning training script
 ├── qlora_train.py # QLoRA training script
 ├── codereval/
-    ├── infer_generation_fft.py # Inference for FFT models
-    ├── infer_generation_qlora.py # Inference for QLoRA models
-    ├── filter_codereval_ids.py # Filter unreliable test cases
-    ├── extract_code_from_jsonl.py # Extract code to individual files
-    ├── add_java_wrappers_cg.py # Add class wrappers for static analysis
-    ├── ids_to_discard.json # IDs with unreliable tests
-    ├── CEJavaHumanLabel.jsonl # CoderEval Java benchmark
-    └── CEPythonHumanLabel.jsonl # CoderEval Python benchmark
+│   ├── infer_generation_fft.py # Inference for FFT models
+│   ├── infer_generation_qlora.py # Inference for QLoRA models
+│   ├── filter_codereval_ids.py # Filter unreliable test cases
+│   ├── extract_code_from_jsonl.py # Extract code to individual files
+│   ├── add_java_wrappers_cg.py # Add class wrappers for static analysis
+│   ├── ids_to_discard.json # IDs with unreliable tests
+│   ├── CEJavaHumanLabel.jsonl # CoderEval Java benchmark
+│   └── CEPythonHumanLabel.jsonl # CoderEval Python benchmark
+└── dataset/
+    └── codegen_codexglue/
+        ├── java/
+        ├── python/
 ```

 ## Step 1: Training
@@ -130,6 +134,31 @@ The wrapped files can then be analyzed with:
 - **PMD**: For code quality metrics
 - **SonarCloud**: For static code analysis

+## Step 6: Evaluate with CoderEval (Pass@k)
+
+For functional correctness evaluation (Pass@k), we use the [CoderEval](https://github.com/CoderEval/CoderEval) benchmark platform.
+
+### Setup CoderEval Environment
+
+1. Download the Docker environment from [Google Drive](https://drive.google.com/drive/folders/1F8M7e25MgHZ3XJ4RSOGWindFSWC5QOvI?usp=sharing)
+
+2. Import the Docker image:
+   ```bash
+   docker load -i codereval_docker.tar
+   ```
+
+3. Run the Docker container with your predictions:
+   ```bash
+   docker run -v /path/to/predictions:/data codereval
+   ```
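
The container invocation in step 3 can also be assembled from a script. The sketch below is only an illustration and is not part of this repository: it resolves the predictions directory to an absolute path (Docker interprets a non-absolute `-v` source as a named volume rather than a host directory) and builds the same `docker run` argument list. The helper name `build_codereval_command` is hypothetical; the image name `codereval` and the `/data` mount point are taken from the command above.

```python
import shlex
from pathlib import Path
from typing import List


def build_codereval_command(predictions_dir: str, image: str = "codereval") -> List[str]:
    """Build the `docker run` argument list for a predictions directory.

    Hypothetical helper: mirrors `docker run -v <dir>:/data codereval`.
    """
    # Bind mounts need an absolute host path, so resolve relative input first.
    host_path = Path(predictions_dir).expanduser().resolve()
    return ["docker", "run", "-v", f"{host_path}:/data", image]


if __name__ == "__main__":
    cmd = build_codereval_command("results")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To launch the container, pass `cmd` to subprocess.run(cmd, check=True).
```

Printing the command before running it makes it easy to confirm that the mounted path is the directory holding your prediction files.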
+
+### CoderEval Resources
+- **Repository**: https://github.com/CoderEval/CoderEval
+- **Benchmark Data**: `CoderEval4Java.json`, `CoderEval4Python.json`
+- **Docker Environment**: Contains pre-configured runtime for 43 Python projects and 10 Java projects
+
+For detailed instructions on running evaluations, refer to the [CoderEval README](https://github.com/CoderEval/CoderEval).
+
 ## Complete Pipeline Example

 ```bash
@@ -155,4 +184,27 @@ python codereval/extract_code_from_jsonl.py
 python codereval/add_java_wrappers_cg.py \
     --input_dir results/java_predictions_java_files \
     --output_dir results/java_predictions_java_files_wrapped
+
+# 6. Evaluate with CoderEval Docker (see Step 6 for setup)
+```
+
+## Metrics
+
+### Functional Correctness
+- **Pass@k**: Probability that at least one of k generated samples passes all test cases (computed via [CoderEval](https://github.com/CoderEval/CoderEval))
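
For reference, Pass@k is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: with `n` samples generated for a task, of which `c` pass all tests, the per-task value is `1 - C(n - c, k) / C(n, k)`, averaged over tasks. The sketch below illustrates that formula only. Whether the CoderEval Docker harness uses exactly this estimator is an assumption here, and the function name `pass_at_k` and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples passing all tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical per-task results: (samples generated, samples passing).
    results = [(10, 3), (10, 0), (10, 10)]
    score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
    print(f"pass@1 = {score:.3f}")
```

Averaging the per-task estimates, as in the usage example, gives the benchmark-level score.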
+
+### Code Quality (Static Analysis)
+- **PMD**: Code quality violations and metrics
+- **SonarCloud**: Code smells, bugs, vulnerabilities
+
+## Requirements
+
+```
+torch
+transformers
+datasets
+trl
+peft
+bitsandbytes
+codebleu
 ```
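
One quick way to confirm that an environment provides these packages is to probe each one without importing it. The sketch below uses the standard library's `importlib.util.find_spec`. It assumes each package's import name matches the name listed above, and the helper `missing_packages` is illustrative rather than part of this repository.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Package names copied from the requirements list above.
REQUIRED = [
    "torch",
    "transformers",
    "datasets",
    "trl",
    "peft",
    "bitsandbytes",
    "codebleu",
]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Because `find_spec` only locates a package, this check does not verify versions or GPU support.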