Commit 6313acb

docs: add guide on how to add multimodal support (ggml-org#21778)
* docs: add guide on how to add multimodal support
* nits
1 parent ff5ef82 commit 6313acb

1 file changed

Lines changed: 16 additions & 0 deletions

docs/development/HOWTO-add-model.md
@@ -5,6 +5,7 @@ Adding a model requires few steps:
 1. Convert the model to GGUF
 2. Define the model architecture in `llama.cpp`
 3. Build the GGML graph implementation
+4. Optional: Add multimodal encoder implementation

 After following these steps, you can open PR.

@@ -114,6 +115,21 @@ Some `ggml` backends do not support all operations. Backend implementations can

 Note: to debug the inference graph: you can use [llama-eval-callback](/examples/eval-callback/).

+### 4. Optional: Add multimodal encoder implementation
+
+If the new model supports multimodal inputs, you will need to add a new encoder definition in `libmtmd`. You can find more information about llama.cpp's multimodal support in [the docs](../multimodal.md) and in the `tools/mtmd` source directory.
+
+1. In the conversion script, make sure you add a subclass that extends `MmprojModel` or another class that inherits from the same base class.
+2. Add the encoder definition in `clip.cpp`.
+3. Implement the preprocessor in `mtmd.cpp`. In most cases, you can reuse an existing preprocessor.
+4. Implement the encoder GGML graph, either in a dedicated file if the model is truly different from existing ones, or by reusing an existing implementation (for example: siglip, pixtral, or qwen) and adding a model-specific projector.
+
+Note:
+- Many multimodal encoders are based on models that are already supported. Make sure to read the existing encoder definitions in `tools/mtmd/models` before adding a new one. In `libmtmd`, it is generally better to extend an existing model than to duplicate code.
+- To debug the multimodal preprocessor and encoder, you can use [llama-mtmd-debug](tools/mtmd/debug/mtmd-debug.cpp).
+- Adding a model-specific API or CLI is an anti-pattern in `libmtmd`. The goal of `libmtmd` is to provide an easy-to-use, model-agnostic library for multimodal pipeline.
+- In most cases, `llama-mtmd-cli` should not be modified. If a model requires a specific prompt, either let the user provide it or bake it into the Jinja chat template.
+
 ## GGUF specification

 https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
