Implement Joycaption as a custom Llava model #659
Hey @nArn0, a couple of nits:
Oh! Thanks for your feedback! I'll look into it. JoyCaption looks like a classic LLaVA, but since it uses SigLIP2, I had to take your implementation from mlx-embeddings for the vision part, because it didn't fit the CLIP vision tower that classic LLaVA uses. Indeed, a notebook would help for the example. The only issue I really couldn't find a fix for is that if torchvision is present, torch.nn.functional.interpolate fails with a cryptic error.
My pleasure! In that case, just import all the common components from llava (inherit from them) and add the new ones. For instance, inherit Model and override vision_tower. Check mistral3: https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/mistral3/mistral3.py#L6
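A rough sketch of that inherit-and-override pattern, in plain Python so it runs without mlx-vlm installed. Every class here (`LlavaModel`, `ClipVisionTower`, `Siglip2VisionTower`, `JoyCaptionModel`) and the `encode_image` method are hypothetical stand-ins for illustration; the real base classes and constructor signatures in mlx-vlm may well differ.

```python
# Hypothetical stand-ins only: none of these are the real mlx-vlm classes.

class ClipVisionTower:
    """Placeholder for the CLIP-style vision tower of classic LLaVA."""

    def __call__(self, pixels):
        return f"clip({pixels})"


class Siglip2VisionTower:
    """Placeholder for a SigLIP2 vision tower (e.g. one ported from mlx-embeddings)."""

    def __call__(self, pixels):
        return f"siglip2({pixels})"


class LlavaModel:
    """Placeholder for the shared LLaVA model that owns the common components."""

    def __init__(self, config):
        self.config = config
        self.vision_tower = ClipVisionTower()

    def encode_image(self, pixels):
        # Shared logic: it only relies on vision_tower being callable.
        return self.vision_tower(pixels)


class JoyCaptionModel(LlavaModel):
    """Reuses everything from LlavaModel and swaps only the vision tower."""

    def __init__(self, config):
        super().__init__(config)
        # Override the single component that differs from classic LLaVA.
        self.vision_tower = Siglip2VisionTower()


model = JoyCaptionModel({"model_type": "joycaption"})
print(model.encode_image("img"))
```

With this shape, the JoyCaption-specific code stays limited to the vision part, and everything else is inherited from the shared LLaVA implementation.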
Implement JoyCaption as a specific LLaVA model. All information and a quantized JoyCaption model are available here: https://huggingface.co/n-Arno/joycaption-mlx-mxfp4
I know this PR may not fit perfectly in mlx-vlm, but just in case, I prefer to propose it even if it may be rejected.
Thanks for your work on mlx-vlm!