model: Add LanguageBind video and audio model wrapper #4557
myang333 wants to merge 16 commits into embeddings-benchmark:main
Conversation
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
AdnanElAssadi56 left a comment:
We can remove uv.lock, I guess.
@Michelleyyy333 How do you install languagebind? I don't see it in the logs.
No, we shouldn't delete it.
Hey! LanguageBind doesn't actually have a setup.py or pyproject.toml in its repo, so it can't be pip-installed. When I was testing it on GPU, I just cloned the repo and added it to PYTHONPATH.
Yes. Can you add setup instructions to the class docstring?
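A docstring along these lines could cover the setup (the class name is illustrative, not the PR's actual code; the GitHub URL is the LanguageBind upstream repo):

```python
class LanguageBindWrapper:
    """Wrapper for the LanguageBind video/audio embedding models.

    Setup: LanguageBind is not pip-installable (no setup.py or
    pyproject.toml), so clone the repo and add it to PYTHONPATH::

        git clone https://github.com/PKU-YuanGroup/LanguageBind
        export PYTHONPATH="$PYTHONPATH:/path/to/LanguageBind"

    After that, the ``languagebind`` package can be imported from the
    cloned checkout.
    """
```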
@myang333 Can you run these models on some tasks to see the scores?
CI tests are failing due to a timeout in the setup step (Launchpad PPA connection timeout), which doesn't seem related to my code changes. Should I retrigger the workflow, or will it resolve on its own?
I think we need to wait a bit, and then I'll restart CI.
@myang333 Can you run these models on the tasks from the paper and try to reproduce their results?
Sure! I'll check the paper for their reported benchmarks and try to reproduce them.
device: str = "cuda" if torch.cuda.is_available() else "cpu",
fps: float | None = None,
max_frames: int | None = None,
num_frames: int | None = 8,
Did you verify this is the number it expects (used in training, or recommended)?
if has_text:
    text_emb = self.video_model.get_text_embeddings(
        inputs, prompt_type=prompt_type, **kwargs
    )
    embeddings = text_emb if embeddings is None else embeddings + text_emb
if has_image:
    image_emb = self.image_model.get_image_embeddings(inputs, **kwargs)
    embeddings = image_emb if embeddings is None else embeddings + image_emb
if has_audio:
    audio_emb = self.audio_model.get_audio_embeddings(inputs, **kwargs)
    embeddings = audio_emb if embeddings is None else embeddings + audio_emb
if has_video:
    video_emb = self.video_model.get_video_embeddings(inputs, **kwargs)
    embeddings = video_emb if embeddings is None else embeddings + video_emb
Is this how they fuse modalities in the original implementation?
No, they're separate models. I implemented it similarly to how we implemented SpeechT5.
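The accumulation pattern being discussed can be sketched standalone (toy NumPy vectors, not the real model calls): each modality that is present contributes an embedding, and when more than one is present the embeddings are summed.

```python
import numpy as np


def fuse_embeddings(text_emb=None, image_emb=None, audio_emb=None, video_emb=None):
    """Sum whichever per-modality embeddings are present (None = absent).

    Mirrors the accumulate-by-addition pattern in the diff above;
    this is an illustrative sketch, not the PR's actual code.
    """
    embeddings = None
    for emb in (text_emb, image_emb, audio_emb, video_emb):
        if emb is None:
            continue
        embeddings = emb if embeddings is None else embeddings + emb
    return embeddings


text = np.array([1.0, 2.0])
video = np.array([0.5, 0.5])
print(fuse_embeddings(text_emb=text, video_emb=video))  # [1.5 2.5]
```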

Hey! This adds model integration for LanguageBind (ICLR 2024) — a multimodal embedding model that aligns video, audio, and text into a shared embedding space.
Models added:
- LanguageBind/LanguageBind_Video_FT (video + text, MIT license)
- LanguageBind/LanguageBind_Audio_FT (audio + text, MIT license)

Implementation notes:
- LanguageBind uses its own library (needs to be cloned from GitHub; built on OpenCLIP)
- Video and audio models are loaded separately but share the same text encoder
- Embedding dim: 768
Results:
- Ran evaluation on the VideoRetrieval task; results JSON attached
Refs: Paper | HuggingFace | Related to #4130
Model checklist:
- I have filled out the ModelMeta object to the extent possible
- I have ensured that my model can be loaded using mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision)
- I have tested that the implementation works on a representative set of tasks
- The model is public, i.e., it is available either as an API or the weights are publicly available to download
- I reproduced results from the original paper (if applicable) on at least one benchmark, and I am including the results in the PR description

(Note on the last item: LanguageBind's original paper doesn't include MTEB benchmarks, so there are no paper results to reproduce against.)
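The loading check from the checklist could be scripted roughly as below. This is a sketch: it assumes an mteb version where the model is registered, the revision is left as a caller-supplied argument, and the import is deferred inside the function so the file itself doesn't require mteb to be installed.

```python
def load_languagebind_video(revision=None):
    """Load the video+text LanguageBind model via mteb's model registry.

    Sketch only; the model name comes from this PR, and passing
    revision=None falls back to the registered default revision.
    """
    import mteb  # deferred so this module imports without mteb installed

    meta = mteb.get_model_meta("LanguageBind/LanguageBind_Video_FT", revision)
    model = mteb.get_model("LanguageBind/LanguageBind_Video_FT", revision)
    return meta, model
```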