Calibration Evaluator

This directory contains an evaluator for measuring the calibration of LLM classifiers. It calculates:

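Accuracy: Fraction of examples where the predicted class matches the true label. Higher is better.
Brier Score: Mean squared error between the predicted class probabilities and the one-hot true labels. Lower is better.
ECE (Expected Calibration Error): Predictions are grouped into confidence bins, and ECE is the average of the absolute gap between confidence and accuracy in each bin, weighted by the fraction of examples in that bin. Lower is better.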
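
As a rough illustration of how these three metrics fit together, here is a minimal NumPy sketch. It is not the code in evaluator.py: the function name `calibration_metrics`, the choice of 10 equal-width bins, and the multi-class Brier convention (squared error summed over classes, then averaged over examples) are all assumptions made for this example.

```python
import numpy as np


def calibration_metrics(probs, labels, n_bins=10):
    """Hypothetical helper: accuracy, Brier score and ECE for a classifier.

    probs  : (n_examples, n_classes) predicted probabilities, rows sum to 1
    labels : (n_examples,) integer class indices
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_classes = probs.shape[1]

    preds = probs.argmax(axis=1)        # predicted class per example
    conf = probs.max(axis=1)            # confidence = top probability
    correct = (preds == labels).astype(float)

    accuracy = correct.mean()

    # Brier score (assumed convention): sum over classes, mean over examples.
    onehot = np.eye(n_classes)[labels]
    brier = np.mean(np.sum((probs - onehot) ** 2, axis=1))

    # ECE with equal-width bins (lo, hi] over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap    # weight by fraction of examples in bin

    return {"accuracy": float(accuracy), "brier": float(brier), "ece": float(ece)}


example = calibration_metrics(
    probs=[[0.95, 0.05], [0.65, 0.35], [0.15, 0.85], [0.75, 0.25]],
    labels=[0, 1, 1, 0],
)
print(example)
```

In this toy input the second example is confidently wrong, which is what drives both the Brier score and the ECE up even though three of the four predictions are correct.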

Usage

Install dependencies:

pip install datasets numpy openai python-dotenv

Set your Fireworks API key in a .env file or as an environment variable:
```
export FIREWORKS_API_KEY=your_key
```
Run the evaluation script:
```
python run_calibration.py
```

evaluator.py: Contains the calibration_evaluator batch reward function.
run_calibration.py: Script that loads the AG News dataset and runs the evaluation on the specified models.

You can modify run_calibration.py to:

You can modify evaluator.py to:

Change the class tokens (CLASS_TOKENS) if the model tokenizes the class labels differently.
Adjust top_logprobs if needed (note that some models cap this at 5, so a class token outside the top 5 will not be returned).
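
To see why these two settings matter, the sketch below shows one plausible way to turn a top_logprobs result into class probabilities. It is an assumption about the general approach, not a copy of evaluator.py: the `CLASS_TOKENS` values, the `class_probs_from_logprobs` helper, and the `floor` fallback for missing tokens are all hypothetical.

```python
import math

# Hypothetical class tokens; the real values live in evaluator.py.
CLASS_TOKENS = ["A", "B", "C", "D"]


def class_probs_from_logprobs(top_logprobs, class_tokens=CLASS_TOKENS, floor=1e-9):
    """Hypothetical helper: renormalize top-k token logprobs over the class tokens.

    top_logprobs : dict mapping token string -> log probability, as returned
                   for the first generated token (at most top_logprobs entries)
    floor        : probability assigned to a class token missing from the top-k
    """
    # Merge variants that differ only by surrounding whitespace (" A" vs "A").
    merged = {}
    for token, logprob in top_logprobs.items():
        key = token.strip()
        merged[key] = merged.get(key, 0.0) + math.exp(logprob)

    # Class tokens that fell outside the top-k get a small floor probability.
    raw = [merged.get(token, floor) for token in class_tokens]
    total = sum(raw)
    return [p / total for p in raw]


# Only two of the four class tokens made it into the returned top-k.
observed = {"A": math.log(0.6), " B": math.log(0.2), "the": math.log(0.2)}
class_probs = class_probs_from_logprobs(observed)
print(class_probs)
```

If top_logprobs is smaller than the number of classes plus any competing tokens, some classes fall back to the floor value, which can distort the Brier score and ECE, so it is worth checking how often that happens for a given model.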