This directory contains an evaluator for measuring the calibration of LLM classifiers. It calculates:
- Accuracy: Fraction of correct predictions.
- Brier Score: Mean squared error between the predicted probabilities and the true labels. Lower is better.
- ECE (Expected Calibration Error): Weighted average, across confidence bins, of the absolute difference between average confidence and accuracy in each bin. Lower is better.
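
As an illustration of how these three metrics relate, here is a minimal pure-Python sketch. It is not the code in evaluator.py: the function name `calibration_metrics`, the input layout (one list of class probabilities per sample), the 10 equal-width bins, and the multiclass Brier variant that sums squared errors over classes are all assumptions made for the example.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Compute accuracy, Brier score, and ECE.

    probs:  list of per-sample probability lists (one entry per class).
    labels: list of integer class ids, same length as probs.
    """
    n = len(labels)
    if n == 0 or n != len(probs):
        raise ValueError("probs and labels must be non-empty and the same length")

    confidences, correct_flags = [], []
    brier_sum = 0.0
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax class
        confidences.append(p[pred])                   # confidence = top probability
        correct_flags.append(1.0 if pred == y else 0.0)
        # Squared error against the one-hot label, summed over classes.
        brier_sum += sum((pk - (1.0 if k == y else 0.0)) ** 2
                         for k, pk in enumerate(p))

    accuracy = sum(correct_flags) / n
    brier = brier_sum / n

    # ECE: group samples into equal-width confidence bins, then take the
    # size-weighted average of |mean confidence - accuracy| per bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct_flags):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_acc = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - mean_acc)

    return {"accuracy": accuracy, "brier": brier, "ece": ece}
```

A model that is confident and wrong is penalized by both Brier score and ECE, while accuracy alone only records the miss. Some Brier definitions divide by the number of classes, so absolute values are only comparable when the same variant is used.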
- Install dependencies:
      pip install datasets numpy openai python-dotenv
- Set your Fireworks API key in .env or environment variables:
      export FIREWORKS_API_KEY=your_key
- Run the evaluation script:
      python run_calibration.py
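
If the script fails at startup, the API key is a common cause. The sketch below shows one way the key could be read and checked before any requests are made. The helper names `get_fireworks_api_key` and `make_client` are hypothetical, and the base URL is an assumption about Fireworks' OpenAI-compatible endpoint; run_calibration.py may configure its client differently.

```python
import os

# Assumed OpenAI-compatible endpoint for Fireworks; check the script for the real value.
FIREWORKS_BASE_URL = "https://api.fireworks.ai/inference/v1"


def get_fireworks_api_key(env=None):
    """Return FIREWORKS_API_KEY, raising a clear error if it is missing.

    When no mapping is passed, a .env file is loaded first (if python-dotenv
    is installed) and os.environ is used.
    """
    if env is None:
        try:
            from dotenv import load_dotenv
            load_dotenv()  # by default does not override variables already set
        except ImportError:
            pass  # fall back to plain environment variables
        env = os.environ
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set; add it to .env or export it in your shell."
        )
    return key


def make_client():
    """Build an OpenAI SDK client pointed at the assumed Fireworks endpoint."""
    from openai import OpenAI  # imported lazily so the key check works without the SDK
    return OpenAI(api_key=get_fireworks_api_key(), base_url=FIREWORKS_BASE_URL)
```

Failing early with an explicit message is usually easier to debug than an authentication error returned partway through an evaluation run.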
- evaluator.py: Contains the calibration_evaluator batch reward function.
- run_calibration.py: Script to load the AG News dataset and run the evaluation on specified models.
You can modify run_calibration.py to:
- Change the models being evaluated (MODELS list).
- Change the dataset or number of samples.
- Adjust the class mapping if using a different dataset.
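
To make the class-mapping point concrete, here is a hedged sketch of what such a mapping might look like for AG News, whose four labels are 0 (World), 1 (Sports), 2 (Business), and 3 (Sci/Tech). The names `AG_NEWS_LABELS`, `CLASS_LETTERS`, `label_to_letter`, and `build_prompt`, as well as the prompt wording, are hypothetical and may not match what run_calibration.py actually does.

```python
# AG News label ids and their category names.
AG_NEWS_LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Hypothetical single-token answers the model is asked to emit, one per class.
CLASS_LETTERS = ["A", "B", "C", "D"]


def label_to_letter(label_id):
    """Map a dataset label id to the answer letter used in the prompt."""
    return CLASS_LETTERS[label_id]


def build_prompt(text):
    """Build a multiple-choice classification prompt for one article."""
    options = "\n".join(
        f"{CLASS_LETTERS[i]}. {name}" for i, name in sorted(AG_NEWS_LABELS.items())
    )
    return (
        "Classify the following news article into one category.\n\n"
        f"Article: {text}\n\n"
        f"{options}\n\n"
        "Answer with a single letter."
    )
```

When switching to a different dataset, the label dictionary and the list of answer letters need to change together so that each label id still lines up with exactly one answer token.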
You can modify evaluator.py to:
- Change the class tokens (CLASS_TOKENS) if the model uses different tokenization.
- Adjust top_logprobs if needed (note that some models limit this to 5).
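
The two settings above interact: class probabilities can only be recovered for tokens that appear among the returned top log-probabilities. The following sketch shows one plausible way to turn the top log-probabilities of the first answer token into a normalized class distribution. It is an assumption-laden illustration, not the logic in evaluator.py: the `CLASS_TOKENS` values, the function name, the whitespace-tolerant matching, and the uniform fallback are all hypothetical choices.

```python
import math

# Hypothetical class tokens; the real list is defined in evaluator.py.
CLASS_TOKENS = ["A", "B", "C", "D"]


def class_probs_from_top_logprobs(top_logprobs, class_tokens=CLASS_TOKENS):
    """Convert {token: logprob} for the first answer token into class probabilities.

    Tokens that differ only by surrounding whitespace (for example " A" and "A")
    are counted toward the same class. Classes missing from the top-k get 0.
    """
    masses = []
    for tok in class_tokens:
        mass = 0.0
        for candidate, logprob in top_logprobs.items():
            if candidate.strip() == tok:
                mass += math.exp(logprob)
        masses.append(mass)

    total = sum(masses)
    if total == 0.0:
        # No class token was returned at all: fall back to a uniform guess.
        return [1.0 / len(class_tokens)] * len(class_tokens)
    # Renormalize over the class tokens only, discarding unrelated tokens.
    return [m / total for m in masses]
```

With a small top_logprobs limit and many classes, some class tokens may never appear in the returned candidates, so their probability is recorded as zero and the calibration metrics become less reliable.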