Before submitting your PR:
Search for existing PRs to prevent duplicating efforts
llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the examples in the ggml repository: simple shows the bare minimum for using ggml, gpt-2 has minimal implementations of language model inference using GPT-2, and mnist demonstrates how to train and evaluate a simple image classifier. A sketch of building these examples follows this paragraph.
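For orientation, a rough sketch of checking out and building those examples; the repository URL and build layout are assumptions, so adjust to the upstream instructions:

    # Clone the ggml repository and build its bundled examples
    git clone https://github.com/ggml-org/ggml
    cd ggml
    cmake -B build
    cmake --build build --config Release
    # The example binaries (simple, gpt-2, mnist demos) end up under build/bin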
Test your changes:
Execute the full CI locally on your machine before publishing (a sample invocation is sketched after this list)
Verify that the perplexity and the performance are not affected negatively by your changes, using llama-perplexity and llama-bench (example invocations after this list)
If you modified the ggml source, run the test-backend-ops tool to check whether different backend implementations of the ggml operators produce consistent results; this requires access to at least two different ggml backends (see the sketch after this list)
If you modified a ggml operator or added a new one, add the corresponding test cases to test-backend-ops
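The full CI can be run with the repository's ci/run.sh script. A minimal sketch, assuming the invocation documented in ci/README.md (the results and mount directories are placeholders):

    # From the llama.cpp repository root: run the CI pipeline locally.
    # The two arguments are a results directory and a mount/cache directory.
    mkdir tmp
    bash ./ci/run.sh ./tmp/results ./tmp/mnt
    # Environment flags can enable extra backends, e.g. (assumption):
    # GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt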
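For the perplexity and performance checks, example invocations along these lines (model and dataset paths are placeholders):

    # Perplexity over a standard text corpus, before and after your change
    ./build/bin/llama-perplexity -m models/model-f16.gguf -f wikitext-2-raw/wiki.test.raw

    # Throughput for prompt processing and token generation
    ./build/bin/llama-bench -m models/model-f16.gguf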
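And a sketch of the backend consistency check; the test mode and the -o operator filter are assumptions based on the tool's usual interface, so run it without arguments to see the current usage:

    # Compare each ggml operator's output across the available backends
    ./build/bin/test-backend-ops test

    # Restrict the run to a single operator, e.g. matrix multiplication
    ./build/bin/test-backend-ops test -o MUL_MAT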
Create separate PRs for each feature or fix:
Avoid combining unrelated changes in a single PR
For intricate features, consider opening a feature request first to discuss and align expectations
When adding support for a new model or feature, focus on CPU support only in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
In particular, adding new data types (extending the ggml_type enum) carries a disproportionate maintenance burden. As such, to add a new quantization type you will need to meet at minimum the following additional criteria (example commands for these steps are sketched after this list):
convert a small model to GGUF using the new type and upload it to HuggingFace
provide perplexity comparisons to FP16/BF16 (whichever is the native precision) as well as to types of similar size
provide KL divergence data for both the new type and types of similar size, calculated against the FP16/BF16 (whichever is the native precision) version
provide performance data for the new type in comparison to types of similar size on pure CPU
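A rough sketch of the conversion and quantization steps, assuming the current convert_hf_to_gguf.py script and the llama-quantize tool; NEW_TYPE stands in for your hypothetical new type:

    # Convert the original Hugging Face model to GGUF at native precision
    python convert_hf_to_gguf.py path/to/hf-model --outfile model-f16.gguf --outtype f16

    # Quantize to the new type (NEW_TYPE is hypothetical) and to a similar-size baseline
    ./build/bin/llama-quantize model-f16.gguf model-new.gguf NEW_TYPE
    ./build/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0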
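And a sketch of the comparison measurements, assuming llama-perplexity's KL divergence options and llama-bench's -ngl flag; verify the exact flags with --help:

    # Perplexity for the baseline and the new type
    ./build/bin/llama-perplexity -m model-f16.gguf -f wiki.test.raw
    ./build/bin/llama-perplexity -m model-new.gguf -f wiki.test.raw

    # KL divergence vs. the native-precision model:
    # first record the base logits, then compare the quantized model against them
    ./build/bin/llama-perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits.kld
    ./build/bin/llama-perplexity -m model-new.gguf -f wiki.test.raw --kl-divergence-base logits.kld --kl-divergence

    # Pure-CPU performance comparison (-ngl 0 keeps all layers on the CPU)
    ./build/bin/llama-bench -m model-new.gguf -ngl 0
    ./build/bin/llama-bench -m model-q4_0.gguf -ngl 0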
Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
If you are a new contributor, limit your open PRs to 1.
After submitting your PR:
Expect requests for modifications to ensure the code meets llama.cpp's standards for quality and long-term maintainability
Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR
If your PR becomes stale, rebase it on top of the latest master to get the maintainers' attention (see the sketch below)
Consider adding yourself to CODEOWNERS to indicate your availability for fixing related issues and reviewing related PRs (a sample entry is shown below)
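A minimal sketch of refreshing a stale branch, assuming origin points at the upstream repository:

    # Rebase your branch onto the latest upstream master and update the PR
    git fetch origin
    git rebase origin/master
    git push --force-with-lease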
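And a hypothetical CODEOWNERS entry; the path and handle are placeholders, so match the format of the existing .github/CODEOWNERS file:

    # .github/CODEOWNERS: each line maps a path pattern to one or more owners
    /ggml/src/ggml-cuda/ @your-github-handle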