Evaluation Guide

Prepare Enviroment

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt (optional: -r requirements-transformers.txt -r requirements-exl2.txt -r requirements-vllm.txt)

Common Options

All interview scripts accept the following common options:

--interview directly run instruct-completion (default: senior)
--input run a pre-prepared interview used for completion and fim

API Evaluations (Instruct)

Evaluate a Remote API (LiteLLM)

python ./interview_litellm.py --model <provider>/<model_id> --apikey <key>

See LiteLLM documentation for the full list of supported providers.

Evaluate a Local/Self-Hosted API (LiteLLM)

python ./interview_litellm.py --model openai/<model_id> --apibase http://<host>:<port>/

If the runtime cannot be inferred from the endpoint, you will be asked to provide --runtime

Evaluate a Local ollama API

ollama serve <model_id>

python ./interview_litellm.py --model ollama_chat/<model_id>

Evaluate a Local llama-server API

llama-server -m /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192 -fa -ngl 99 --host 0.0.0.0 --port 8080

Note: -fa enables flash attention, -ngl 99 enables GPU offloading

python3 ./interview-litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080

Evaluate a Local koboldcpp API

koboldcpp /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --contextsize 8192 --flashattention --gpulayers 99 --usecublas 1 --host 0.0.0.0 --port 8080

Note: --flashattention enables flash attention, --gpulayers 99 --usecublas 1 enables GPU offloading

python3 ./interview-litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080

Evaluate a Local OobaBooga API

See Ooba Docs for how to launch.

python3 ./interview-litellm.py --model openai/<modelid>> --apibase http://127.0.0.1:8080 --runtime oobabooga

Model Evaluations (Instruct)

Evaluate a model with local GPU (CUDA)

The local CUDA executor will use all available GPUs by default, use CUDA_VISIBLE_DEVICES if you have connected accelerators you don't want used.

Note that you can install all 3 backends into a single venv.

Backend: Transformers

pip install -r requirements.txt -r requirements-transformers.txt

python ./interview_cuda.py --model <model> --runtime transformers

Backend: vLLM

pip install -r requirements.txt -r requirements-vllm.txt

python ./interview_cuda.py --model <model> --runtime vllm

Backend: Exllamav2

pip install wheel && pip install -r requirements.txt -r requirements-exl2.txt

python ./interview_cuda.py --model <model> --runtime exllama2

Evaluate a model with remote GPU (Modal)

python ./interview_modal.py --model <model> --runtime <runtime> --gpu <gpu>

See modal docs for valid GPUs.

Model Evaluations (Code Completion)

TODO

Model Evaluations (Fill-in-the-Middle)

TODO

Run Self-Checks and View Results

bulk-eval.sh is a quick and easy way to run the evaluate.py script for all results/interview* it finds.

streamlit run app.py "results/eval*" will then show you local results only.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluation Guide

Prepare Enviroment

Common Options

API Evaluations (Instruct)

Evaluate a Remote API (LiteLLM)

Evaluate a Local/Self-Hosted API (LiteLLM)

Evaluate a Local ollama API

Evaluate a Local llama-server API

Evaluate a Local koboldcpp API

Evaluate a Local OobaBooga API

Model Evaluations (Instruct)

Evaluate a model with local GPU (CUDA)

Backend: Transformers

Backend: vLLM

Backend: Exllamav2

Evaluate a model with remote GPU (Modal)

Model Evaluations (Code Completion)

Model Evaluations (Fill-in-the-Middle)

Run Self-Checks and View Results

FilesExpand file tree

GUIDE.md

Latest commit

History

GUIDE.md

File metadata and controls

Evaluation Guide

Prepare Enviroment

Common Options

API Evaluations (Instruct)

Evaluate a Remote API (LiteLLM)

Evaluate a Local/Self-Hosted API (LiteLLM)

Evaluate a Local ollama API

Evaluate a Local llama-server API

Evaluate a Local koboldcpp API

Evaluate a Local OobaBooga API

Model Evaluations (Instruct)

Evaluate a model with local GPU (CUDA)

Backend: Transformers

Backend: vLLM

Backend: Exllamav2

Evaluate a model with remote GPU (Modal)

Model Evaluations (Code Completion)

Model Evaluations (Fill-in-the-Middle)

Run Self-Checks and View Results