Tip
Many inference parameters can be loaded automatically from a .env file in your working directory. See the example file, .env.example, which includes links to the Inspect documentation on the relevant environment variables.
To create your own .env file based on the example, run cp .env.example .env from the base directory and modify the result as needed.
To choose the model for evaluation, set the INSPECT_EVAL_MODEL variable to a valid model identifier (for example, ollama/llama3.2:1b).
Copied from here.
Run with python examples/security_guide.py or inspect eval examples/security_guide.py
- Download the Drug Reviews (Drugs.com) data set and extract it. Pay attention to its terms:
  - Only use the data for research purposes
  - Don't use the data for any commercial purposes
  - Don't distribute the data to anyone else
  - Cite the data set's authors
- Preprocess the data:
  python examples/eval/uci_drug/prepare_data.py <path/to/extracted/data> <path/to/output>
- Make sure you have configured a model for evaluation (see tip above).
- Run the evaluation:
  python examples/eval/uci_drug/uci_drug.py <path/to/preprocessed/data/dev.csv>
Using an LLM as a binary classifier can be effective, but unlike a traditional classifier, an LLM outputs a discrete label rather than a probability score between 0 and 1. This makes it hard to apply standard classifier evaluation metrics such as AUROC, which need a graded score to rank samples against each other.
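AUROC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The minimal, standard-library sketch below computes it that way, counting ties as half; the auroc helper and the toy labels and scores are illustrative and not part of this project. With hard 0/1 predictions, many positive/negative pairs tie, so the metric only reflects a single decision threshold; graded scores let every pair be ranked.

```python
from itertools import product


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # a tie gives no ranking information
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: gold labels, discrete yes/no answers, and graded scores.
labels = [1, 1, 1, 0, 0, 0]
hard_preds = [1, 1, 0, 1, 0, 0]
soft_scores = [0.9, 0.8, 0.45, 0.5, 0.3, 0.1]

print(round(auroc(labels, hard_preds), 3))   # → 0.667 (6 of 9 pairs, ties count half)
print(round(auroc(labels, soft_scores), 3))  # → 0.889 (8 of 9 pairs ranked correctly)
```

Both prediction sets make one mistake at a 0.5 threshold, but only the graded scores let AUROC credit the correct ordering of the remaining pairs.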
The tournament example in this section demonstrates one way to convert discrete binary predictions into probability scores: run a tournament in which samples from a dataset are put head-to-head to determine which one is 'more positive'. Each match asks the LLM, in essence: "Which of these two samples is more positive based on our criteria?"
Each sample has an Elo rating, which is updated over the course of the tournament as the sample wins or loses matches. When the tournament is over, these Elo ratings can be converted to probabilities.
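The rating update and the final conversion can be sketched with the standard Elo formulas. This is a minimal illustration under stated assumptions, not the cnlp_llm implementation: the starting rating of 1000, the K-factor of 32, random pairing each round, and min-max scaling to [0, 1] are all choices made for this sketch, and judge is a hypothetical stand-in for the LLM call. Samples are used as dictionary keys, so they must be unique and hashable.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings, a, b, a_won, k=32.0):
    """Zero-sum Elo update after one match; a_won is True if A was judged more positive."""
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if a_won else 0.0
    ratings[a] += k * (s_a - e_a)
    ratings[b] -= k * (s_a - e_a)  # B gains exactly what A loses, and vice versa


def run_tournament(samples, judge, rounds=10, seed=0, initial=1000.0):
    """Play `rounds` rounds of randomly paired matches and return the final ratings."""
    rng = random.Random(seed)
    ratings = {s: initial for s in samples}
    for _ in range(rounds):
        order = list(samples)
        rng.shuffle(order)
        # Pair neighbours in the shuffled order; with an odd count, the last sample sits out.
        for a, b in zip(order[0::2], order[1::2]):
            update(ratings, a, b, judge(a, b))
    return ratings


def to_probabilities(ratings):
    """Min-max scale ratings into [0, 1]; one simple, rank-preserving choice."""
    lo, hi = min(ratings.values()), max(ratings.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero if all ratings are equal
    return {s: (r - lo) / span for s, r in ratings.items()}


# Hypothetical toy data: a hidden "positivity" used only by the fake judge.
hidden = {"great": 0.9, "good": 0.7, "okay": 0.5, "meh": 0.3, "awful": 0.1}


def judge(a, b):
    # Stand-in for asking the LLM "which of these two samples is more positive?"
    return hidden[a] > hidden[b]


probs = to_probabilities(run_tournament(list(hidden), judge, rounds=20))
for sample, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{sample:>6}: {p:.2f}")
```

Min-max scaling preserves the ranking of the ratings, which is all AUROC needs; the resulting numbers are not calibrated probabilities. The project's own pairing (for example the graph scheduler selected with --scheduler graph) and its conversion step may work differently.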
The example in examples/eval/cola runs a tournament with the CoLA dataset, classifying sentences as linguistically acceptable or unacceptable. You can preprocess the data with preprocess_cola.py and run the tournament with cola_tournament.py path/to/preprocessed/cola/in_domain_dev.tsv.
Alternatively, you can run the tournament from the command line:
cnlp_llm tournament path/to/preprocessed/cola/in_domain_dev.tsv \
examples/eval/cola/cola.prompt \
--task acceptable \
--pos-label Yes \
--model ollama/llama3.2:1b \
--rounds 10 \
--scheduler graph