The Embedder module in gemma/gm/nn/_modules.py currently implements an encode_vision method which acts as the critical "bridge" for multimodal inference. This method projects visual features (e.g., from SigLiP) into the Transformer's unified embedding space using RMSNorm and an Einsum projection.
Currently, there are no dedicated unit tests for this path, as noted by the TODO at line 74 of gemma/gm/nn/_modules_test.py.
Goal:
- Implement a robust test suite for encode_vision.
- Verify that initializing the Embedder with vision_proj_dim correctly creates the mm_input_projection and mm_soft_embedding_norm parameters.
- Ensure that visual tokens are correctly projected to the model's embed_dim.
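To make the shape expectations concrete, the projection path can be sketched in plain NumPy: an RMSNorm over the vision feature axis followed by an einsum projection from vision_proj_dim into embed_dim. The function names, dimension sizes, and einsum layout below are illustrative assumptions for the test design, not the gemma API itself.

```python
import numpy as np

def rms_norm(x, scale, eps=1e-6):
    # Normalize by the root-mean-square over the last axis, then rescale.
    var = np.mean(np.square(x), axis=-1, keepdims=True)
    return x * scale / np.sqrt(var + eps)

def encode_vision_sketch(soft_tokens, norm_scale, proj):
    # soft_tokens: [batch, num_tokens, vision_proj_dim]
    # norm_scale:  [vision_proj_dim]  (mm_soft_embedding_norm analogue)
    # proj:        [vision_proj_dim, embed_dim]  (mm_input_projection analogue)
    normed = rms_norm(soft_tokens, norm_scale)
    return np.einsum("...td,de->...te", normed, proj)

rng = np.random.default_rng(0)
vision_proj_dim, embed_dim = 8, 16
tokens = rng.normal(size=(2, 4, vision_proj_dim))
scale = np.ones(vision_proj_dim)
proj = rng.normal(size=(vision_proj_dim, embed_dim))

out = encode_vision_sketch(tokens, scale, proj)
# The core assertion the real test should make: visual tokens land in embed_dim.
assert out.shape == (2, 4, embed_dim)
```

The actual test in gemma/gm/nn/_modules_test.py would initialize the Embedder with vision_proj_dim set, check that mm_input_projection and mm_soft_embedding_norm appear in the parameter tree with the shapes above, and assert that encode_vision output has the trailing embed_dim axis.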