
Commit 80d8770

docs: more extensive RoPE documentation [no ci] (#21953)

* more extensive ggml_rope documentation
* add more docs
* nits

1 parent 8dc530b commit 80d8770

2 files changed: 72 additions & 1 deletion


docs/development/HOWTO-add-model.md

Lines changed: 17 additions & 0 deletions
@@ -130,6 +130,23 @@ Note:
- Adding a model-specific API or CLI is an anti-pattern in `libmtmd`. The goal of `libmtmd` is to provide an easy-to-use, model-agnostic library for multimodal pipelines.
- In most cases, `llama-mtmd-cli` should not be modified. If a model requires a specific prompt, either let the user provide it or bake it into the Jinja chat template.

## Tips and tricks

### Working with ggml_rope_ext

PyTorch implementations usually prefer explicitly calculating the `freq_cis`/`sin`/`cos` components. In llama.cpp, however, most RoPE operations can be handled via `ggml_rope_ext`, which does not require a sin/cos matrix. This saves memory while allowing the GGML RoPE kernel to be fused with other ops.
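To illustrate why no sin/cos matrix is needed: the rotation angle for each pair of dimensions can be derived on the fly from the token position alone. The sketch below is illustrative Python of the underlying math, not GGML code; the helper name `rope_rotate_pair` is hypothetical.

```python
import math

def rope_rotate_pair(x0, x1, pos, i, n_dims, freq_base=10000.0):
    """Rotate one (x0, x1) pair at position `pos` for dimension pair `i`.

    The angle is computed directly from the position, so no precomputed
    sin/cos matrix is required.
    """
    theta = pos * freq_base ** (-2.0 * i / n_dims)
    c, s = math.cos(theta), math.sin(theta)
    return x0 * c - x1 * s, x0 * s + x1 * c

# position 0 leaves the pair unchanged
print(rope_rotate_pair(1.0, 2.0, pos=0, i=0, n_dims=8))  # (1.0, 2.0)
```

Since the rotation only depends on `pos`, `i`, and `freq_base`, the kernel can compute it inline for each element it touches.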

However, since `ggml_rope_ext` only provides a subset of the RoPE implementations that models use, converting models from PyTorch to llama.cpp may require some creative adaptations.

For more information about `ggml_rope_ext`, please refer to the in-code documentation in `ggml.h`.

Examples:
- `libmtmd` implements 2D RoPE with `GGML_ROPE_TYPE_NORMAL` ordering by splitting the input tensor in half, applying `ggml_rope_ext` separately to each half, then joining the halves back together with `ggml_concat`.
- The [Kimi-K2.5](https://github.com/ggml-org/llama.cpp/pull/19170) vision encoder uses vision RoPE with interleaved frequencies. The weights must be permuted during conversion in order to reuse the `build_rope_2d()` function.
- [Gemma 4](https://github.com/ggml-org/llama.cpp/pull/21309) uses "proportional" RoPE. We employ a trick where `rope_freqs` is set to a very large value in the last dimensions to prevent those dimensions from being rotated. See the `Gemma4Model` class in `convert_hf_to_gguf.py`.
- Some models require scaling the input positions, for example `[0, 1, 2, ...]` becomes `[0, 0.5, 1, ...]`. In this case, you can provide the scaling via `freq_scale = 0.5f`.
- Some models use learned RoPE frequencies instead of relying on `powf(freq_base, -2.0 * i / n_dims)`. In this case, provide the learned frequencies via the `rope_freqs` tensor (corresponding to the `c` argument of `ggml_rope_ext`) and set `freq_base = 1.0f`. Note that `rope_freqs` in GGML is the **inverse** factor (`theta = pos[i] / rope_freqs`), so you may need to invert the values during conversion.
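The last two bullet points can be checked numerically against the theta formula documented in `ggml.h`. The sketch below is illustrative Python, not GGML code; the `theta` helper and the `inv_freq` table are hypothetical. It shows that `freq_scale = 0.5f` is equivalent to halving the positions, and that a PyTorch-style learned `inv_freq` table maps onto `rope_freqs` as its elementwise inverse when `freq_base = 1.0f`:

```python
def theta(pos, i, n_dims, freq_base=10000.0, freq_scale=1.0, rope_freqs=None):
    # mirrors the pseudo-code in ggml.h: base angle, optional freq factors,
    # then freq_scale (applied inside rope_yarn; YaRN itself disabled here)
    t = pos * freq_base ** (-2.0 * i / n_dims)
    if rope_freqs is not None:
        t /= rope_freqs[i]  # note: divided, i.e. rope_freqs is the inverse factor
    return t * freq_scale

# freq_scale = 0.5 is equivalent to halving the input positions
assert theta(4, 1, 8, freq_scale=0.5) == theta(2, 1, 8)

# learned frequencies: PyTorch-style theta = pos * inv_freq[i]
inv_freq = [0.5, 0.25]                    # hypothetical learned table
rope_freqs = [1.0 / f for f in inv_freq]  # invert during conversion
for i in range(2):
    assert theta(3, i, 4, freq_base=1.0, rope_freqs=rope_freqs) == 3 * inv_freq[i]
```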
149+
133150
## GGUF specification

https://github.com/ggml-org/ggml/blob/master/docs/gguf.md

ggml/include/ggml.h

Lines changed: 55 additions & 1 deletion
@@ -1773,8 +1773,32 @@ extern "C" {
        int n_dims,
        int mode);

// RoPE operations with extended options
// a is the input tensor to apply RoPE to, shape [n_embd, n_head, n_token]
// b is an int32 vector with size n_token
// c is freq factors (e.g. phi3-128k), (optional)
// mode can be GGML_ROPE_TYPE_NORMAL or NEOX; for the MROPE and VISION modes, use ggml_rope_multi
//
// pseudo-code for computing theta:
//   for i in [0, n_dims/2):
//     theta[i] = b[i] * powf(freq_base, -2.0 * i / n_dims);
//     theta[i] = theta[i] / c[i];          # if c is provided, divide theta by c
//     theta[i] = rope_yarn(theta[i], ...); # note: theta = theta * freq_scale is applied here
//
// the other params are used by YaRN RoPE scaling; these default values disable YaRN:
//   freq_scale  = 1.0f
//   ext_factor  = 0.0f
//   attn_factor = 1.0f
//   beta_fast   = 0.0f
//   beta_slow   = 0.0f
//
// example:
// (marking: c = cos, s = sin, 0 = unrotated)
// given a single head with size = 8     --> [00000000]
// GGML_ROPE_TYPE_NORMAL with n_dims = 4 --> [cscs0000]
// GGML_ROPE_TYPE_NORMAL with n_dims = 8 --> [cscscscs]
// GGML_ROPE_TYPE_NEOX   with n_dims = 4 --> [ccss0000]
// GGML_ROPE_TYPE_NEOX   with n_dims = 8 --> [ccccssss]
GGML_API struct ggml_tensor * ggml_rope_ext(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
@@ -1790,6 +1814,36 @@ extern "C" {
        float beta_fast,
        float beta_slow);
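The cos/sin layouts in the example above can be reproduced with a short sketch. The helper below is illustrative Python, not GGML code, and its name `rope_layout` is hypothetical; it marks which dim holds the cos and sin component of each rotated pair under the two orderings:

```python
def rope_layout(head_size, n_dims, mode):
    """Return a string marking cos ('c'), sin ('s') and unrotated ('0') dims."""
    out = ["0"] * head_size
    for i in range(n_dims // 2):
        if mode == "NORMAL":
            a, b = 2 * i, 2 * i + 1    # adjacent pairs: (0,1), (2,3), ...
        elif mode == "NEOX":
            a, b = i, i + n_dims // 2  # split-half pairs: (0, n/2), (1, n/2+1), ...
        out[a], out[b] = "c", "s"
    return "".join(out)

print(rope_layout(8, 4, "NORMAL"))  # cscs0000
print(rope_layout(8, 8, "NEOX"))    # ccccssss
```

The two modes rotate the same pairs of values; only the memory layout of each (cos, sin) pair differs, which is why converting between them requires splitting the tensor or permuting the weights.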

// multi-dimensional RoPE, for Qwen-VL and similar vision models
// mode can be VISION, MROPE, or IMROPE; it cannot be combined with NORMAL or NEOX
// sections specify how many dimensions to rotate in each section:
//   a section length is the number of cos/sin pairs, NOT the number of dims
//   (i.e. the sum of the 4 sections is expected to be n_dims/2)
//   the last sections can be 0, meaning they are ignored
// all other options are identical to ggml_rope_ext
//
// important notes:
// - NEOX ordering is automatically applied and cannot be disabled for MROPE and VISION
//   if you need normal ordering, there are 2 methods:
//     (1) split the tensor manually using ggml_view
//     (2) permute the weight upon conversion
// - for VISION, n_dims must be head_size/2
//
// example M-RoPE:
// given sections = [t=4, y=2, x=2, 0]
// given a single head with size = 18     --> [000000000000000000]
// GGML_ROPE_TYPE_MROPE  with n_dims = 16 --> [ttttyyxxttttyyxx00] (cos/sin are applied in NEOX ordering)
// GGML_ROPE_TYPE_IMROPE with n_dims = 16 --> [ttyxttyxttyxttyx00] (interleaved M-RoPE, still NEOX ordering)
// note: the theta for each dim is computed the same way as in ggml_rope_ext, regardless of the section
//   in other words, the idx used for theta is [0123456789... up to n_dims/2], not reset for each section
//
// example vision RoPE:
// given sections = [y=4, x=4, 0, 0] (the last 2 sections are ignored)
// given a single head with size = 8     --> [00000000]
// GGML_ROPE_TYPE_VISION with n_dims = 4 --> [yyyyxxxx]
// other values of n_dims are untested and result in undefined behavior
// note: unlike MROPE, the theta for each dim is computed differently for each section
//   in other words, the idx used for theta is [0123] for the y section, then [0123] for the x section
GGML_API struct ggml_tensor * ggml_rope_multi(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
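The M-RoPE section layout from the comment above can be reproduced with a small sketch. The helper below is illustrative Python, not GGML code, and `mrope_layout` is a hypothetical name; it assigns each cos/sin pair to its section sequentially and duplicates the section labels across the two NEOX halves:

```python
def mrope_layout(head_size, n_dims, sections, labels="tyxe"):
    """Mark which section ('t', 'y', 'x', ...) rotates each dim; '0' = unrotated.

    Sections are counted in cos/sin pairs, so their sum must equal n_dims/2.
    NEOX ordering places all cos components in the first half of the rotated
    dims and all sin components in the second half.
    """
    half = n_dims // 2
    pair_labels = []
    for count, label in zip(sections, labels):
        pair_labels += [label] * count
    assert len(pair_labels) == half, "sum of sections must be n_dims/2"
    out = ["0"] * head_size
    for i in range(half):
        out[i] = pair_labels[i]         # cos half
        out[i + half] = pair_labels[i]  # sin half
    return "".join(out)

# matches the GGML_ROPE_TYPE_MROPE example in the comment above
print(mrope_layout(18, 16, [4, 2, 2, 0]))  # ttttyyxxttttyyxx00
```

The interleaved (IMROPE) and VISION layouts distribute the section labels differently, as shown in the comment above; this sketch only covers the sequential MROPE case.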
