python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
(optional: -r requirements-transformers.txt -r requirements-exl2.txt -r requirements-vllm.txt)
All interview scripts accept the following common options:
--interview: directly run instruct-completion (default: senior)
--input: run a pre-prepared interview, used for completion and fim
python ./interview_litellm.py --model <provider>/<model_id> --apikey <key>
See LiteLLM documentation for the full list of supported providers.
python ./interview_litellm.py --model openai/<model_id> --apibase http://<host>:<port>/
If the runtime cannot be inferred from the endpoint, you will be asked to provide --runtime.
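When scripting several runs against OpenAI-compatible endpoints, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of this repository: the helper name `build_litellm_command` and its parameters are hypothetical, and it only uses the `--model`, `--apibase`, and `--runtime` options shown above.

```python
import shlex


def build_litellm_command(model_id, host, port, runtime=None,
                          script="./interview_litellm.py"):
    """Assemble an argv list for an OpenAI-compatible endpoint.

    Hypothetical helper: only the flags documented above are used, and
    the script path is a parameter so it can be adjusted if needed.
    """
    argv = [
        "python", script,
        "--model", f"openai/{model_id}",
        "--apibase", f"http://{host}:{port}/",
    ]
    # Pass --runtime only when it cannot be inferred from the endpoint.
    if runtime is not None:
        argv += ["--runtime", runtime]
    return argv


if __name__ == "__main__":
    # Print a shell-safe version of the command for copy and paste.
    cmd = build_litellm_command("<model_id>", "127.0.0.1", 8080,
                                runtime="oobabooga")
    print(shlex.join(cmd))
```

The list form can be handed directly to `subprocess.run`, which avoids shell quoting problems with unusual model names.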
ollama serve <model_id>
python ./interview_litellm.py --model ollama_chat/<model_id>
llama-server -m /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192 -fa -ngl 99 --host 0.0.0.0 --port 8080
Note: -fa enables flash attention, -ngl 99 enables GPU offloading
python3 ./interview-litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080
koboldcpp /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --contextsize 8192 --flashattention --gpulayers 99 --usecublas 1 --host 0.0.0.0 --port 8080
Note: --flashattention enables flash attention, --gpulayers 99 --usecublas 1 enables GPU offloading
python3 ./interview-litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080
See Ooba Docs for how to launch.
python3 ./interview-litellm.py --model openai/<model_id> --apibase http://127.0.0.1:8080 --runtime oobabooga
The local CUDA executor uses all available GPUs by default; set CUDA_VISIBLE_DEVICES to exclude any connected accelerators you don't want used.
Note that you can install all 3 backends into a single venv.
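To restrict the CUDA executor to specific GPUs from a wrapper script, the environment variable can be set just for the child process. This is a minimal sketch under stated assumptions: the helper `env_with_visible_gpus` is hypothetical and not part of this repository. CUDA_VISIBLE_DEVICES itself takes a comma-separated list of device indices.

```python
import os


def env_with_visible_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment exposing only the given GPUs.

    Hypothetical helper: copies the current environment (or base_env)
    and sets CUDA_VISIBLE_DEVICES to a comma-separated index list.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


if __name__ == "__main__":
    # Expose only GPUs 0 and 1 to a child process.
    env = env_with_visible_gpus([0, 1])
    print(env["CUDA_VISIBLE_DEVICES"])
    # The result could then be passed to a launcher, for example:
    # subprocess.run(["python", "./interview_cuda.py", "--model", "<model>",
    #                 "--runtime", "vllm"], env=env)
```

Setting the variable on a copied environment leaves the parent shell untouched, so other jobs on the same machine still see every GPU.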
pip install -r requirements.txt -r requirements-transformers.txt
python ./interview_cuda.py --model <model> --runtime transformers
pip install -r requirements.txt -r requirements-vllm.txt
python ./interview_cuda.py --model <model> --runtime vllm
pip install wheel && pip install -r requirements.txt -r requirements-exl2.txt
python ./interview_cuda.py --model <model> --runtime exllama2
python ./interview_modal.py --model <model> --runtime <runtime> --gpu <gpu>
See the Modal documentation for valid GPU types.
TODO
TODO
bulk-eval.sh is a quick way to run the evaluate.py script on every results/interview* file it finds.
streamlit run app.py "results/eval*" will then show you local results only.
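Before running a bulk evaluation, it may be useful to preview which files match the results/interview* pattern. The sketch below is an illustration only: `find_interview_results` is a hypothetical helper, and the real bulk-eval.sh may apply additional filtering that is not reproduced here.

```python
import glob
import os


def find_interview_results(results_dir="results"):
    """List files matching <results_dir>/interview*, sorted by name.

    Hypothetical helper mirroring the results/interview* pattern
    mentioned above; it does not run any evaluation.
    """
    pattern = os.path.join(results_dir, "interview*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Print each interview result file that would be picked up.
    for path in find_interview_results():
        print(path)
```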