
Commit 9c56c61

ManuelFay authored

readme updates (#408)

* readme updates
* readme updates

Co-authored-by: ManuelFay <manuel.faysse@epfl.ch>
1 parent 1b72786 commit 9c56c61

2 files changed: 271 additions & 7 deletions

File tree

README.md
docs/add_model_family.md

README.md

Lines changed: 11 additions & 7 deletions
@@ -18,11 +18,11 @@
## Associated Paper

-This repository contains the code used for training the vision retrievers in the [*ColPali: Efficient Document Retrieval with Vision Language Models*](https://arxiv.org/abs/2407.01449) paper. In particular, it contains the code for training the ColPali model, which is a vision retriever based on the ColBERT architecture and the PaliGemma model.
+This repository contains the code used for training and running visual document retrievers introduced by the [*ColPali: Efficient Document Retrieval with Vision Language Models*](https://arxiv.org/abs/2407.01449) paper. It includes the original ColPali model, based on the ColBERT architecture and the PaliGemma model, along with later ColVision and bi-encoder retriever variants.

## Introduction

-With our new model *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
+With *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.

Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.

@@ -38,7 +38,7 @@ Using ColPali removes the need for potentially complex and brittle layout recogn
| [vidore/colpali-v1.3](https://huggingface.co/vidore/colpali-v1.3) | 84.8 | Gemma | • Similar to `vidore/colpali-v1.2`.<br />• Trained with a larger effective batch size of 256 for 3 epochs. ||
| [vidore/colqwen2-v0.1](https://huggingface.co/vidore/colqwen2-v0.1) | 87.3 | Apache 2.0 | • Based on `Qwen/Qwen2-VL-2B-Instruct`.<br />• Supports dynamic resolution.<br />• Trained using 768 image patches per page and an effective batch size of 32. ||
| [vidore/colqwen2-v1.0](https://huggingface.co/vidore/colqwen2-v1.0) | 89.3 | Apache 2.0 | • Similar to `vidore/colqwen2-v0.1`, but trained with more powerful GPUs and a larger effective batch size (256). ||
-| [vidore/colqwen2.5-v0.1](https://huggingface.co/vidore/colqwen2.5-v0.1) | 88.8 | Apache 2.0 | • Based on `Qwen/Qwen2 5-VL-3B-Instruct`<br />• Supports dynamic resolution.<br />• Trained using 768 image patches per page and an effective batch size of 32. ||
+| [vidore/colqwen2.5-v0.1](https://huggingface.co/vidore/colqwen2.5-v0.1) | 88.8 | Apache 2.0 | • Based on `Qwen/Qwen2.5-VL-3B-Instruct`.<br />• Supports dynamic resolution.<br />• Trained using 768 image patches per page and an effective batch size of 32. ||
| [vidore/colqwen2.5-v0.2](https://huggingface.co/vidore/colqwen2.5-v0.2) | 89.4 | Apache 2.0 | • Similar to `vidore/colqwen2.5-v0.1`, but trained with slightly different hyperparameters. ||
| [TomoroAI/tomoro-colqwen3-embed-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 90.6 | Apache 2.0 | • Based on the Qwen3-VL backbone.<br />• 320-dim ColBERT-style embeddings with dynamic resolution.<br />• Trained for multi-vector document retrieval. ||
| [athrael-soju/colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 90.9 | Apache 2.0 | • Based on `Qwen/Qwen3.5-4B` (hybrid GatedDeltaNet + full-attention).<br />• 320-dim ColBERT-style embeddings.<br />• 4.5B params, LoRA-trained. ||
@@ -49,10 +49,10 @@ Using ColPali removes the need for potentially complex and brittle layout recogn
## Setup

-We used Python 3.11.6 and PyTorch 2.4 to train and test our models, but the codebase is compatible with Python >=3.10 and recent PyTorch versions. To install the package, run:
+The codebase is compatible with Python >=3.10,<3.15 and recent PyTorch versions. To install the package, run:

```bash
-pip install colpali-engine # from PyPi
+pip install colpali-engine # from PyPI
pip install git+https://github.com/illuin-tech/colpali # from source
```

@@ -61,6 +61,10 @@ Mac users using MPS with the ColQwen models have reported errors with torch 2.6.
> [!WARNING]
> For ColPali versions above v1.0, make sure to install the `colpali-engine` package from source or with a version above v0.2.0.

+## Development docs
+
+- [Adding a new model family](docs/add_model_family.md)
+
## Usage

### Quick start
@@ -224,7 +228,7 @@ For a more detailed example, you can refer to the interpretability notebooks fro
[Token pooling](https://doi.org/10.48550/arXiv.2409.14683) is a CRUDE-compliant method (document addition/deletion-friendly) that aims at reducing the sequence length of multi-vector embeddings. For ColPali, many image patches share redundant information, e.g. white background patches. By pooling these patches together, we can reduce the number of embeddings while retaining most of the page's signal. Retrieval performance with hierarchical mean token pooling on image embeddings can be found in the [ColPali paper](https://doi.org/10.48550/arXiv.2407.01449). In our experiments, we found that a pool factor of 3 offered the optimal trade-off: the total number of vectors is reduced by $66.7\%$ while $97.8\%$ of the original performance is maintained.

-To use token pooling, you can use the `HierarchicalEmbeddingPooler` class from the `colpali-engine` package:
+To use token pooling, use the `HierarchicalTokenPooler` class from the `colpali-engine` package:

<details>
<summary><strong>🔽 Click to expand code snippet</strong></summary>
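A minimal sketch of typical usage, assuming `HierarchicalTokenPooler` is importable from `colpali_engine.compression.token_pooling` and exposes a `pool_embeddings` method with `pool_factor`, `padding`, and `padding_side` arguments; check the README snippet for the exact signature:

```python
import torch

from colpali_engine.compression.token_pooling import HierarchicalTokenPooler  # import path assumed

# Toy document embeddings: (batch_size, sequence_length, dim).
embeddings = torch.randn(2, 768, 128)

pooler = HierarchicalTokenPooler()
# pool_factor=3 reduces the number of vectors by roughly two thirds.
pooled = pooler.pool_embeddings(embeddings, pool_factor=3, padding=True, padding_side="left")
```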
@@ -343,7 +347,7 @@ When your PR is ready, ping one of the repository maintainers. We will do our be
## Community Projects

-Several community projects and ressources have been developed around ColPali to facilitate its usage. Feel free to reach out if you want to add your project to this list!
+Several community projects and resources have been developed around ColPali to facilitate its usage. Feel free to reach out if you want to add your project to this list!

<details>
<summary><strong>🔽 Libraries 📚</strong></summary>

docs/add_model_family.md

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
# Adding a new model family

This guide describes the expected steps for adding a new model family to `colpali_engine.models`. A model family is a backbone-specific package such as `qwen3`, `gemma3`, `idefics3`, or `paligemma` that exposes one or more retriever variants.

Most families contain:

- A `Col*` late-interaction model that returns one normalized vector per token and is scored with MaxSim (sketched below).
- Optionally, a `Bi*` dense retrieval model that pools to one normalized vector per input.
- One processor per variant, responsible for image/text formatting and scoring.
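For reference, late-interaction scoring reduces to a few lines. The following is a minimal sketch of the standard ColBERT-style MaxSim score for one query/document pair; names are illustrative, and `score_multi_vector` in `colpali-engine` provides the batched equivalent:

```python
import torch


def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim for one query/document pair.

    q: (num_query_tokens, dim), d: (num_doc_tokens, dim); both L2-normalized,
    so the dot products below are cosine similarities.
    """
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()
```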
## 1. Choose the package layout

Create a family directory under `colpali_engine/models`:

```text
colpali_engine/models/<family>/
    __init__.py
    col<family>/
        __init__.py
        modeling_col<family>.py
        processing_col<family>.py
    bi<family>/
        __init__.py
        modeling_bi<family>.py
        processing_bi<family>.py
```

Only add `bi<family>/` if the family supports a dense bi-encoder variant. Follow the naming style used by nearby families. For example, Qwen variants use `ColQwen3`, `ColQwen3Processor`, `BiQwen3`, and `BiQwen3Processor`.
## 2. Implement the Col model

The `Col*` class should usually inherit from the corresponding Transformers backbone model, not from a generic wrapper. See `colpali_engine/models/qwen3/colqwen3/modeling_colqwen3.py` and `colpali_engine/models/gemma3/colgemma3/modeling_colgemma.py` for current patterns.

The class should define:

- `main_input_name = "doc_input_ids"` for Transformers compatibility.
- A retrieval projection layer, usually `self.custom_text_proj`.
- `self.dim`, the embedding size returned by the retriever head.
- `self.padding_side`, matching the processor and backbone requirements.
- `self.mask_non_image_embeddings` when the family supports image-only masking.
- A `forward` method that returns a `torch.Tensor` shaped `(batch_size, sequence_length, dim)`.

The forward pass should (see the sketch after this list):

1. Accept the batch produced by the processor for both images and text.
2. Adapt processor-specific image tensors before calling the backbone when the backbone expects a flattened visual-token layout.
3. Call the parent model with `use_cache=False`, `output_hidden_states=True`, and `return_dict=True`.
4. Project the last hidden states with `custom_text_proj`.
5. L2-normalize the projected embeddings on the last dimension.
6. Multiply by `attention_mask.unsqueeze(-1)` so padding tokens score as zero.
7. If `mask_non_image_embeddings=True`, zero non-image token embeddings for image batches.
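A minimal sketch of such a forward pass, assuming a hypothetical `NewFamilyModel` backbone with a `config.image_token_id` attribute; the attributes used below are defined in `__init__`, and step 2 (processor-specific tensor adaptation) is omitted because it varies per backbone:

```python
import torch


class ColNewFamily(NewFamilyModel):  # hypothetical Transformers backbone class
    main_input_name = "doc_input_ids"

    def forward(self, *args, **kwargs) -> torch.Tensor:
        kwargs.pop("output_hidden_states", None)
        outputs = super().forward(
            *args, **kwargs, use_cache=False, output_hidden_states=True, return_dict=True
        )
        last_hidden_states = outputs.hidden_states[-1]  # (batch_size, seq_len, hidden_size)

        proj = self.custom_text_proj(last_hidden_states)      # (batch_size, seq_len, dim)
        proj = proj / proj.norm(dim=-1, keepdim=True)         # L2-normalize each token vector
        proj = proj * kwargs["attention_mask"].unsqueeze(-1)  # zero out padding tokens

        if self.mask_non_image_embeddings and kwargs.get("pixel_values") is not None:
            # Keep only image-token embeddings for image batches.
            image_mask = (kwargs["input_ids"] == self.config.image_token_id).unsqueeze(-1)
            proj = proj * image_mask
        return proj
```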
Expose the patch metadata needed by the interpretability utilities when the backbone supports it:

```python
@property
def patch_size(self) -> int:
    return self.visual.config.patch_size

@property
def spatial_merge_size(self) -> int:
    return self.visual.config.spatial_merge_size
```

Adjust the properties to match the backbone config. Some models only expose `patch_size`.
## 3. Handle checkpoint key mappings

Adapter checkpoints often contain PEFT or backbone-specific prefixes that do not match the retriever class. Add a `_checkpoint_conversion_mapping` to the model when needed:

```python
_checkpoint_conversion_mapping = {
    r"^base_model\.model\.custom_text_proj": "custom_text_proj",
    r"^model\.visual": "visual",
    r"^model\.language_model": "language_model",
    r"^model\.": "",
}
```

Override `from_pretrained` to pass the mapping through `key_mapping`:

```python
@classmethod
def from_pretrained(cls, *args, **kwargs):
    key_mapping = kwargs.pop("key_mapping", None)
    if key_mapping is None:
        key_mapping = dict(getattr(super(), "_checkpoint_conversion_mapping", {}))
        key_mapping.update(getattr(cls, "_checkpoint_conversion_mapping", {}))
    return super().from_pretrained(*args, **kwargs, key_mapping=key_mapping)
```

If Transformers requires registration for the model type, register the mapping with `register_checkpoint_conversion_mapping`, as in the Qwen and ModernVBert implementations.

Add tests to `tests/models/test_checkpoint_key_mappings.py` for every custom mapping that rewrites adapter keys; a sketch of such a test follows.
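A minimal sketch, assuming a hypothetical `ColNewFamily` class; real tests should mirror the patterns already in `tests/models/test_checkpoint_key_mappings.py`, and the exact rewrite semantics inside Transformers may differ:

```python
import re

from colpali_engine.models import ColNewFamily  # hypothetical family


def test_checkpoint_conversion_mapping_rewrites_adapter_keys():
    # Apply each regex rewrite in declaration order, approximating key_mapping.
    key = "base_model.model.custom_text_proj.weight"
    for pattern, replacement in ColNewFamily._checkpoint_conversion_mapping.items():
        key = re.sub(pattern, replacement, key)
    assert key == "custom_text_proj.weight"
```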
## 4. Implement the processor

Processors should inherit from `BaseVisualRetrieverProcessor` and the matching Transformers processor:

```python
class ColNewFamilyProcessor(BaseVisualRetrieverProcessor, NewFamilyProcessor):
    ...
```

The processor must implement the following (a sketch follows the list):

- `process_images(self, images)`: converts PIL images to model-ready batches.
- `process_texts(self, texts)`: converts text inputs to model-ready batches.
- `score(self, qs, ps, device=None, **kwargs)`: delegates to `score_multi_vector` for `Col*` models.
- `get_n_patches(...)`: returns `(n_patches_x, n_patches_y)` for interpretability.
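A minimal sketch of the image/text entry points and the scoring delegate. `NewFamilyProcessor`, `visual_prompt_prefix`, `query_prefix`, the base-class import path, and the call signature of the underlying Transformers processor are assumptions that vary per backbone:

```python
from typing import List

from PIL import Image
from transformers import BatchFeature

from colpali_engine.utils.processing_utils import BaseVisualRetrieverProcessor  # path assumed


class ColNewFamilyProcessor(BaseVisualRetrieverProcessor, NewFamilyProcessor):  # hypothetical names
    def process_images(self, images: List[Image.Image]) -> BatchFeature:
        # Pair each image with the visual prompt and tokenize everything together.
        texts = [self.visual_prompt_prefix] * len(images)
        return self(text=texts, images=images, padding="longest", return_tensors="pt")

    def process_texts(self, texts: List[str]) -> BatchFeature:
        # Queries get the query prefix; no image inputs are involved.
        return self(
            text=[self.query_prefix + text for text in texts],
            padding="longest",
            return_tensors="pt",
        )

    def score(self, qs, ps, device=None, **kwargs):
        # Col* processors delegate late-interaction (MaxSim) scoring.
        return self.score_multi_vector(qs, ps, device=device, **kwargs)
```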
Set prompt and token attributes when the backbone needs them:

```python
visual_prompt_prefix = "..."
query_prefix = "..."
query_augmentation_token = "..."
image_token = "..."
```

Use the backbone's chat template or special tokens consistently with the checkpoint used for training. Also set `self.tokenizer.padding_side` in `__init__` when the model requires left or right padding.

If the processor pads per-image visual tensors for distributed training, the model forward pass must undo that padding before calling the backbone. The Qwen processors and models are the reference pattern for this.
## 5. Implement an optional Bi model

Add a `Bi*` model when the family needs dense single-vector retrieval. The class normally shares the same backbone and processor conventions, but the forward pass returns `(batch_size, hidden_size_or_dim)` instead of per-token embeddings.

Support the pooling styles used elsewhere in the codebase when possible (sketched below):

- `cls`: first token.
- `last`: last token.
- `mean`: attention-mask-weighted mean.
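A minimal sketch of these pooling styles, assuming right-padded inputs (the `last` branch must change for left padding):

```python
import torch


def pool_embeddings(hidden: torch.Tensor, attention_mask: torch.Tensor, strategy: str) -> torch.Tensor:
    # hidden: (batch_size, seq_len, dim); attention_mask: (batch_size, seq_len).
    if strategy == "cls":
        pooled = hidden[:, 0]
    elif strategy == "last":
        # Index of the last non-padding token per sequence (right padding assumed).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
    elif strategy == "mean":
        mask = attention_mask.unsqueeze(-1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    else:
        raise ValueError(f"Unknown pooling strategy: {strategy}")
    # Normalize the pooled embedding before returning it.
    return pooled / pooled.norm(dim=-1, keepdim=True)
```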
Normalize the pooled embedding before returning it, as in the sketch above. The processor's `score` method should delegate to `score_single_vector`.
## 6. Export the new classes

Wire the imports at every package level:

```python
# colpali_engine/models/<family>/col<family>/__init__.py
from .modeling_col<family> import ColNewFamily
from .processing_col<family> import ColNewFamilyProcessor

# colpali_engine/models/<family>/__init__.py
from .col<family> import ColNewFamily, ColNewFamilyProcessor
from .bi<family> import BiNewFamily, BiNewFamilyProcessor

# colpali_engine/models/__init__.py
from .<family> import BiNewFamily, BiNewFamilyProcessor, ColNewFamily, ColNewFamilyProcessor
```

Keep these exports stable because users import models directly from `colpali_engine.models`.
## 7. Add tests

Create tests under `tests/models/<family>/<variant>/`, following existing model families.

Processor tests should verify:

- `from_pretrained` returns the custom processor class.
- `process_images` returns expected keys and tensor batch dimensions.
- `process_texts` returns text tensors with the expected batch size.
- `process_queries` remains compatible with the legacy evaluator path.

Model tests should verify (see the sketch after this list):

- `from_pretrained` returns the custom model class.
- Image forward pass returns a tensor with shape `(batch_size, sequence_length, model.dim)` for `Col*`.
- Query forward pass returns the same embedding dimension.
- Retrieval smoke tests rank matching image/query pairs correctly when a small public checkpoint is available.

Use `@pytest.mark.slow` for tests that download or run full checkpoints.
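A minimal sketch of a model shape test, assuming hypothetical `ColNewFamily` classes and a placeholder checkpoint id:

```python
import pytest
import torch
from PIL import Image

from colpali_engine.models import ColNewFamily, ColNewFamilyProcessor  # hypothetical


@pytest.mark.slow
def test_col_model_image_forward_shape():
    checkpoint = "org/col-new-family"  # placeholder checkpoint id
    model = ColNewFamily.from_pretrained(checkpoint).eval()
    processor = ColNewFamilyProcessor.from_pretrained(checkpoint)

    batch = processor.process_images([Image.new("RGB", (448, 448), "white")])
    with torch.no_grad():
        embeddings = model(**batch)

    # Col* models return one vector per token: (batch_size, sequence_length, dim).
    assert embeddings.ndim == 3
    assert embeddings.shape[0] == 1
    assert embeddings.shape[-1] == model.dim
```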
Run the targeted tests before opening a PR:

```bash
pytest tests/models/<family>
pytest tests/models/test_checkpoint_key_mappings.py
```

Run the linter before submitting:

```bash
ruff check .
```
## 8. Add training and example entry points when needed

If the family is trainable from this repository, add a config under `scripts/configs/<family>/` and update any training scripts that need to import the new classes. Keep the config names aligned with the model class names, for example `train_col<family>_model.py` or `train_col<family>_model.yaml`.

If interpretability is supported, add an example under `examples/interpretability/<variant>/` and make sure `get_n_patches` plus `get_image_mask` return masks in the same token order as the model embeddings.
## 9. Update user-facing documentation

When the checkpoint is public and supported, update the model table in `README.md` with:

- The Hugging Face model id.
- The base backbone.
- The license.
- Notes about dynamic resolution, embedding dimension, or masking behavior.
- Whether the model is currently supported.

Add usage snippets only if loading or preprocessing differs from the existing quick start pattern.
## Review checklist

Before submitting the change, check that (a smoke-check sketch follows the list):

- The model and processor can be imported from `colpali_engine.models`.
- `process_images`, `process_texts`, and `process_queries` all work.
- `model(**processor.process_images(...))` and `model(**processor.process_queries(...))` return normalized tensors.
- Padding embeddings are zeroed for `Col*` outputs.
- Checkpoint mappings load LoRA or adapter checkpoints without manual key edits.
- Slow tests are marked, and fast tests do not download large checkpoints unless the existing family tests already do the same.
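A quick manual smoke check covering the normalization and padding items; the class names and checkpoint id are placeholders:

```python
import torch
from PIL import Image

from colpali_engine.models import ColNewFamily, ColNewFamilyProcessor  # hypothetical

checkpoint = "org/col-new-family"  # placeholder checkpoint id
model = ColNewFamily.from_pretrained(checkpoint).eval()
processor = ColNewFamilyProcessor.from_pretrained(checkpoint)

batch = processor.process_images([Image.new("RGB", (448, 448), "white")])
with torch.no_grad():
    embeddings = model(**batch)

# Non-padding token embeddings should be unit-norm; padding rows should be zero.
norms = embeddings.norm(dim=-1)
mask = batch["attention_mask"].bool()
assert torch.allclose(norms[mask], torch.ones_like(norms[mask]), atol=1e-3)
assert torch.all(norms[~mask] == 0)
```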
