This repository contains the code associated with the paper "Evaluating the Performance of LLMs in Drafting NIH Data Management Plans." In the paper, we evaluated the performance of Llama 3.3 and GPT-4.1 in drafting NIH-compliant Data Management Plans (DMPs) using two complementary approaches: automated reference-based evaluation and human expert evaluation.
The repository includes the complete automated and human evaluation workflows. Please refer to the project inventory for all related resources, including the paper.
The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt.
Clone the repository and open it in your editor:

```bash
git clone https://github.com/fairdataihub/nih-dmp-llm-evaluation-paper-code.git
cd dmpchef
code .
```

Create and activate a virtual environment.

Windows (cmd):

```bash
python -m venv venv
venv\Scripts\activate.bat
```

macOS/Linux:

```bash
python -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

This repository supports two complementary evaluation workflows. Use the appropriate notebook depending on the evaluation approach you want to run.
For the automated reference-based evaluation, use `Automated-evaluation.ipynb`.
The Jupyter notebook makes use of files in the dataset associated with the paper. You will need to download the dataset and add it to the input folder (name the dataset folder `dataset`). Please refer to the project inventory for a link to the dataset.
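Before running the notebook, you can sanity-check that the dataset landed in the expected location. This is a minimal sketch, not part of the notebook itself; the `input/dataset` path simply mirrors the placement instructions above, so adjust it if your checkout differs:

```python
from pathlib import Path


def dataset_ready(root: str = ".") -> bool:
    """Return True if the downloaded dataset folder exists at
    input/dataset under the given repository root (folder names
    taken from the setup instructions in this README)."""
    return (Path(root) / "input" / "dataset").is_dir()


if __name__ == "__main__":
    if dataset_ready():
        print("Dataset folder found - ready to run the notebook.")
    else:
        print("Dataset folder missing - see the project inventory for the download link.")
```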
All outputs from both evaluation pipelines (tables and figures) are saved under the `outputs/` directory.
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub Issues to submit feedback, report problems, or suggest improvements. You can also fork the repository and submit a Pull Request with your changes.
If you use this code, please cite this repository by following the instructions in the CITATION.cff file.