AlpacaEval simulates human judgement with a preference model, i.e. an automatic annotator whose behaviour is calibrated against human-labelled preference data.
Here, we use its pairwise evaluation functionality to determine whether the modified output is indeed better than the original.
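As a rough illustration of what pairwise evaluation produces: the judge emits a preference for each (original, modified) pair, and the win rate is the fraction of pairs where the modified output is preferred. A minimal sketch with toy data; the `judge` below is a hypothetical stand-in heuristic, not AlpacaEval's actual annotator:

```python
# Toy sketch of pairwise preference scoring. A real setup calls an
# LLM-based annotator; `judge` here is a placeholder that prefers the
# longer output, just to show the bookkeeping.
def judge(output_a: str, output_b: str) -> int:
    """Return 1 if output_b is preferred over output_a, else 0."""
    return 1 if len(output_b) > len(output_a) else 0

def win_rate(pairs):
    """Fraction of (original, modified) pairs where the modified output wins."""
    wins = sum(judge(orig, mod) for orig, mod in pairs)
    return wins / len(pairs)

pairs = [
    ("short answer", "a longer, more detailed answer"),
    ("a fairly detailed original answer", "terse"),
    ("ok", "a thorough, well-structured answer"),
]
print(win_rate(pairs))  # 2 of the 3 modified outputs win
```

A win rate above 0.5 indicates the modification helped on average; AlpacaEval's real annotator replaces the placeholder heuristic.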
conda create --name alpaca_eval python=3.10
conda activate alpaca_eval
Next, download the alpaca-eval source code from the open-source repository. We need the source (not just the pip package) because we will be making some modifications.
Then, install required packages with:
pip install -r requirements.txt
Replace alpaca_eval/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4_turbo_fn with ./alpaca_eval_gpt4_turbo_fn. This swaps in our modified judging configuration.
After obtaining the prediction JSON files, run the converter to put them into a format that alpaca-eval accepts.
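AlpacaEval expects each example as a JSON object with `instruction`, `output`, and `generator` keys. A minimal converter sketch, assuming the prediction files store `prompt`/`prediction` fields (those key names are placeholders; adjust them to your actual prediction format):

```python
import json

def convert(predictions, generator_name):
    """Map hypothetical {'prompt', 'prediction'} records into the
    {'instruction', 'output', 'generator'} records that alpaca-eval reads."""
    return [
        {
            "instruction": p["prompt"],
            "output": p["prediction"],
            "generator": generator_name,
        }
        for p in predictions
    ]

if __name__ == "__main__":
    # Toy input; in practice, load your prediction JSON file instead.
    preds = [{"prompt": "Say hi", "prediction": "Hello!"}]
    with open("model_outputs.json", "w") as f:
        json.dump(convert(preds, "modified-model"), f, indent=2)
```

Run the converter once per model (original and modified) so each side has its own outputs file for the pairwise comparison.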
export OPENAI_API_KEY=<your_api_key>
./run_scorer.sh
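For reference, run_scorer.sh presumably wraps an alpaca_eval invocation along these lines (a sketch, not the script's verbatim contents; the output file names are placeholders):

```shell
# Sketch only: pairwise comparison of modified outputs against the
# originals using the modified judging config. Adjust paths to your files.
alpaca_eval --model_outputs modified_outputs.json \
            --reference_outputs original_outputs.json \
            --annotators_config alpaca_eval_gpt4_turbo_fn
```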