This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing against various models, including OpenAI and Google Vertex AI. It also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.
Classification-based Metrics:
- Categorizes responses into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
- Computes metrics such as Accuracy, Precision, Recall, and F1 Score.
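As an illustration of the classification-based metrics listed above, the following is a minimal sketch using the standard definitions of Accuracy, Precision, Recall, and F1 Score. The `ConfusionCounts` container and the function names are hypothetical and are not taken from this project's code; how the project itself decides whether a response counts as TP, TN, FP, or FN is not shown here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfusionCounts:
    """Hypothetical container for the four outcome counts."""
    tp: int
    tn: int
    fp: int
    fn: int


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute the standard metrics from TP/TN/FP/FN counts."""
    total = c.tp + c.tn + c.fp + c.fn
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, total),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(compute_metrics(ConfusionCounts(tp=8, tn=5, fp=2, fn=1)))
```

Returning 0.0 for an undefined ratio (for example, Precision when there are no positive predictions) is one common convention; the project may handle that case differently.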
LLMScore:
- Assigns a score between 0 and 1 that rates the quality and accuracy of the generated response against the ground truth.
Create a .env file in the root directory with the following variables:
GOOGLE_API_KEY=your_google_api_key
OPENAI_API_KEY=your_openai_api_key
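Before a long evaluation run, it can help to confirm that both keys are present in the file. The sketch below parses simple `KEY=value` lines with the standard library only. It is an illustration, not the loader this project uses (which may rely on a dedicated package), and the helper names `load_env_file` and `missing_keys` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY")


def load_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str],
                 required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        print("Missing keys:", missing_keys(load_env_file(str(env_file))))
    else:
        print("No .env file found in the current directory.")
```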
Input File:
- data/data.csv: This file should contain the questions to be tested. Ensure it is formatted as a CSV file with the following columns: Question,Answer.
Output File:
- data/data_result.csv: This file will be generated after running the script. It contains the results of the evaluation.
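The expected input format can be checked before running the script. The following sketch reads a CSV with the standard `csv` module and verifies that the Question and Answer columns are present. The `read_questions` helper and the sample rows are hypothetical and only illustrate the column layout described above; the layout of the output file is not documented here, so it is not modelled.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Column names required by the input file, as documented above.
REQUIRED_COLUMNS = ("Question", "Answer")


def read_questions(csv_file: TextIO) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs, failing early on a bad header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Input CSV is missing columns: {missing}")
    return [(row["Question"], row["Answer"]) for row in reader]


if __name__ == "__main__":
    # Placeholder content; real data would come from data/data.csv.
    sample = io.StringIO(
        "Question,Answer\n"
        '"Example question, with a comma?",Example answer\n'
    )
    for question, answer in read_questions(sample):
        print(question, "->", answer)
```

To check a real file, open it with `open("data/data.csv", newline="", encoding="utf-8")` and pass the file object to `read_questions`.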
Activate virtual environment
From the parent directory (evaluation), make sure you have run the command make init before activating the virtual environment. This is needed for this folder to be recognised as a submodule.
Run the Script
Use the following command to execute the script with customizable options:
python main.py --env-path /path/to/.env --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
- --env-path: Path to the .env file.
- --iterations: Number of iterations per question.
- --llms: Comma-separated list of LLMs to test.
- --agent-retrievers: Comma-separated list of agent-retriever names and URLs.
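To make the option formats concrete, here is a sketch of how such arguments could be parsed with `argparse`. It is not the project's actual `main.py`: the helper names are hypothetical, and the defaults (a `.env` file in the project root, 5 iterations) are taken from the default behaviour described in this README.

```python
import argparse
from typing import Dict, List


def parse_llms(spec: str) -> List[str]:
    """Split a comma-separated list of LLM names."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def parse_agent_retrievers(spec: str) -> Dict[str, str]:
    """Turn 'v1=http://a,v2=http://b' into {'v1': 'http://a', 'v2': 'http://b'}."""
    retrievers: Dict[str, str] = {}
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        # Split on the first '=' only, so URLs may contain '=' themselves.
        name, sep, url = item.partition("=")
        if not sep or not name.strip() or not url.strip():
            raise argparse.ArgumentTypeError(f"expected name=url, got {item!r}")
        retrievers[name.strip()] = url.strip()
    return retrievers


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--env-path", default=".env")
    parser.add_argument("--iterations", type=int, default=5)
    # None stands for "test all available LLMs" in this sketch.
    parser.add_argument("--llms", type=parse_llms, default=None)
    parser.add_argument("--agent-retrievers", type=parse_agent_retrievers,
                        default={})
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--iterations", "10",
        "--llms", "base-gemini-1.5-flash,base-gpt-4o",
        "--agent-retrievers", "v1=http://url1.com,v2=http://url2.com",
    ])
    print(args)
```

Because the list is split on commas, this sketch assumes that retriever URLs do not themselves contain commas.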
View Results
Results will be saved in a CSV file named after the input data file with _result appended (for example, data/data.csv produces data/data_result.csv).
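The naming rule can be expressed in a few lines with `pathlib`. This is a sketch of the rule as described, with a hypothetical helper name; the script itself may build the path differently.

```python
from pathlib import Path


def result_path(input_path: str) -> Path:
    """Append '_result' to the file stem, keeping directory and extension."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_result{source.suffix}")


if __name__ == "__main__":
    print(result_path("data/data.csv").as_posix())  # data/data_result.csv
```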
python main.py
- Uses the default .env file in the project root.
- Uses the default data/data.csv as input.
- Runs 5 iterations per question.
- Tests all available LLMs.
- Uses no additional agent-retrievers.
Other example invocations:

python main.py --env-path /path/to/.env

python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"

python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"

python main.py \
--env-path /path/to/.env \
--iterations 10 \
--llms "base-gemini-1.5-flash,base-gpt-4o" \
--agent-retrievers "v1=http://url1.com,v2=http://url2.com"

To view all available command-line options:
python main.py --help

After generating results, you can perform analysis using the provided analysis.py script. To run the analysis, execute the following command:
streamlit run analysis.py
To compare three versions of ORAssistant, use:
python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
Note: Each URL is the endpoint of the ORAssistant backend.
To compare ORAssistant with base-gpt-4o, use:
python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
Note: The URL is the endpoint of the ORAssistant backend.