
model: Add LanguageBind video and audio model wrapper#4557

Open
myang333 wants to merge 16 commits into embeddings-benchmark:main from myang333:main

Conversation

@myang333

Hey! This adds model integration for LanguageBind (ICLR 2024) — a multimodal embedding model that aligns video, audio, and text into a shared embedding space.
Models added:

LanguageBind/LanguageBind_Video_FT (video + text, MIT license)
LanguageBind/LanguageBind_Audio_FT (audio + text, MIT license)

Implementation notes:

LanguageBind uses its own library (needs to be cloned from GitHub, built on OpenCLIP)
Video and audio models are loaded separately but share the same text encoder
Embedding dim: 768
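
Since text, video, and audio land in one shared space, retrieval reduces to ranking candidates by similarity to the query embedding. A minimal sketch of that idea, assuming cosine similarity and using toy 3-dim vectors in place of real 768-dim embeddings; `cosine_similarity` and `rank_candidates` are hypothetical helpers for illustration, not part of LanguageBind or mteb:

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query: list, candidates: list) -> list:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins (assumed values) for one text query and two video embeddings.
text_query = [1.0, 0.0, 0.0]
video_embeddings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
ranking = rank_candidates(text_query, video_embeddings)
```

Here the second video is closest to the query, so it is ranked first.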

Results:

Ran evaluation on VideoRetrieval task, results JSON attached

Refs: Paper | HuggingFace | Related to #4130

Model checklist:
I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
mteb.get_model(model_name, revision) and
mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.
The model is public, i.e., is available either as an API or the weights are publicly available to download
I reproduced results from the original paper (if applicable) on at least one benchmark, and I am including the results in the PR description. (Note on the last item: LanguageBind's original paper doesn't include MTEB benchmarks, so there are no paper results to reproduce against.)

Comment thread VideoRetrieval.json Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
@Samoed Samoed added the new model and video labels Apr 29, 2026
@myang333 myang333 mentioned this pull request Apr 30, 2026
72 tasks
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Contributor

@AdnanElAssadi56 AdnanElAssadi56 left a comment

we can remove uv.lock I guess

@Samoed
Member

Samoed commented May 1, 2026

@Michelleyyy333 How do you install languagebind? In the logs:

Updating https://github.com/PKU-YuanGroup/LanguageBind.git (HEAD)
    Updated https://github.com/PKU-YuanGroup/LanguageBind.git (7070c53375661cdb235801176b564b45f96f0648)
  × Failed to download and build `languagebind @
  │ git+https://github.com/PKU-YuanGroup/LanguageBind.git`
  ╰─▶ /home/runner/work/_temp/setup-uv-cache/git-v0/checkouts/a994bd661ba96114/7070c53
      does not appear to be a Python project, as neither `pyproject.toml` nor
      `setup.py` are present in the directory

> we can remove uv.lock I guess

No, we shouldn't delete it

@myang333
Author

myang333 commented May 1, 2026

@Samoed

Hey! The LanguageBind repo doesn't actually have a setup.py or pyproject.toml, so it can't be pip-installed. When I was testing it on GPU, I just cloned the repo and added it to PYTHONPATH:
git clone https://github.com/PKU-YuanGroup/LanguageBind.git
export PYTHONPATH="/path/to/LanguageBind:$PYTHONPATH"
Not sure what the best way to handle this in the pyproject.toml is — should I just remove the languagebind entry from optional deps and set extra_requirements_groups=[]?

@Samoed
Member

Samoed commented May 1, 2026

> should I just remove the languagebind entry from optional deps and set extra_requirements_groups=[]?

Yes. Can you add an instruction with setup in class docstring?
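
A sketch of what such a docstring plus an early import check could look like, reusing the clone-and-PYTHONPATH steps from this thread. This is not the code in the PR: `LanguageBindWrapperSketch` and `require_module` are hypothetical names, and the module name `languagebind` is an assumption about what the cloned repo exposes.

```python
import importlib.util

SETUP_HELP = (
    "LanguageBind is not pip-installable. Clone it and add it to PYTHONPATH:\n"
    "  git clone https://github.com/PKU-YuanGroup/LanguageBind.git\n"
    '  export PYTHONPATH="/path/to/LanguageBind:$PYTHONPATH"'
)


def require_module(name: str) -> None:
    """Raise ImportError with setup instructions if top-level module `name` is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"Could not import '{name}'. {SETUP_HELP}")


class LanguageBindWrapperSketch:
    """Hypothetical wrapper skeleton (illustration only).

    Setup:
        git clone https://github.com/PKU-YuanGroup/LanguageBind.git
        export PYTHONPATH="/path/to/LanguageBind:$PYTHONPATH"
    """

    def __init__(self, module_name: str = "languagebind") -> None:
        # Fail early, with the setup steps in the message, before loading weights.
        require_module(module_name)
```

Checking at construction time means a missing clone surfaces as one readable error instead of a bare `ModuleNotFoundError` deep inside model loading.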

@Samoed Samoed requested a review from KennethEnevoldsen May 1, 2026 06:06
Comment thread uv.lock
@Samoed
Member

Samoed commented May 1, 2026

@myang333 Can you run these models on some tasks to see scores?

@myang333
Author

myang333 commented May 1, 2026

CI tests are failing due to a timeout in the setup step (Launchpad PPA connection timeout), which doesn't seem related to my code changes. Should I retrigger the workflow or will it resolve on its own?

@Samoed
Member

Samoed commented May 1, 2026

I think we need to wait a bit, and I'll restart CI

@myang333
Author

myang333 commented May 1, 2026

gpu_result
| Task | Type | Model | Split | Main Score |
|---|---|---|---|---|
| VideoRetrieval | Retrieval | LanguageBind_Video_FT | dev | 0.089 |
| CREMA_D | AudioClassification | LanguageBind_Audio_FT | train | 0.334 |
| ESC50_Zeroshot | AudioZeroshotClassification | LanguageBind_Audio_FT | train | 0.822 |
| CIFAR10 | ImageClassification | LanguageBind_Image | test | 0.985 |
| CIFAR100ZeroShot | ZeroShotClassification | LanguageBind_Image | test | 0.852 |

Also found and fixed a bug in `_transform_audio`: the audio processor expects a (waveform, sample_rate) tuple. Pushed the fix.
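
For illustration, the kind of input normalization this implies might look like the sketch below. It is not the code from this PR: `to_waveform_and_rate` is a hypothetical helper, and the dict layout with `array` and `sampling_rate` keys is an assumption about how audio samples arrive.

```python
from __future__ import annotations

from typing import Any


def to_waveform_and_rate(audio: Any) -> tuple:
    """Coerce an audio sample into the (waveform, sample_rate) tuple form."""
    # Already in the expected form: pass it through, normalizing the rate to int.
    if isinstance(audio, tuple) and len(audio) == 2:
        waveform, sample_rate = audio
        return waveform, int(sample_rate)
    # Assumed dict layout: {"array": ..., "sampling_rate": ...}.
    if isinstance(audio, dict) and "array" in audio and "sampling_rate" in audio:
        return audio["array"], int(audio["sampling_rate"])
    raise TypeError(f"Unsupported audio input type: {type(audio).__name__}")


# Toy sample (assumed values) in the dict layout described above.
sample = {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000}
waveform, sample_rate = to_waveform_and_rate(sample)
```

Raising on unknown input types, rather than passing them through, keeps a shape mismatch from reaching the processor silently.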

Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py Outdated
Comment thread mteb/models/model_implementations/language_bind_models.py
@Samoed Samoed removed the request for review from KennethEnevoldsen May 1, 2026 12:35
myang333 and others added 2 commits May 4, 2026 20:38
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@Samoed
Member

Samoed commented May 5, 2026

@myang333 Can you run these models on tasks from the paper and try to reproduce their results?

@myang333
Author

myang333 commented May 5, 2026

Sure! I'll check the paper for their reported benchmarks and try to reproduce.

@Samoed Samoed added the audio and image labels May 5, 2026
device: str = "cuda" if torch.cuda.is_available() else "cpu",
fps: float | None = None,
max_frames: int | None = None,
num_frames: int | None = 8,
Contributor

Did you verify this is the number it expects (used in training or recommended)?
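
For context on what this parameter controls: a fixed `num_frames` is commonly turned into frame indices by sampling uniformly across the clip. The sketch below shows that idea only. `sample_frame_indices` is a hypothetical helper, and whether LanguageBind samples 8 frames this way is exactly the open question here.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames.

    Each index is taken at the midpoint of one of `num_frames` equal segments.
    Clips shorter than `num_frames` yield repeated indices.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    step = total_frames / num_frames
    return [min(int((i + 0.5) * step), total_frames - 1) for i in range(num_frames)]


indices = sample_frame_indices(80, 8)  # -> [5, 15, 25, 35, 45, 55, 65, 75]
```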

Comment on lines +458 to +471
if has_text:
    text_emb = self.video_model.get_text_embeddings(
        inputs, prompt_type=prompt_type, **kwargs
    )
    embeddings = text_emb if embeddings is None else embeddings + text_emb
if has_image:
    image_emb = self.image_model.get_image_embeddings(inputs, **kwargs)
    embeddings = image_emb if embeddings is None else embeddings + image_emb
if has_audio:
    audio_emb = self.audio_model.get_audio_embeddings(inputs, **kwargs)
    embeddings = audio_emb if embeddings is None else embeddings + audio_emb
if has_video:
    video_emb = self.video_model.get_video_embeddings(inputs, **kwargs)
    embeddings = video_emb if embeddings is None else embeddings + video_emb
Copy link
Copy Markdown
Contributor

Is this how they fuse modalities in original implementation?

Member

@Samoed Samoed May 5, 2026

No, they're separate models. I implemented it similarly to how we implemented speech t5
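
To make the behaviour in the diff concrete: when an input carries several modalities, the per-modality embeddings are accumulated by element-wise sum. A toy sketch of that pattern with plain lists; `fuse_by_sum` is a hypothetical helper for illustration, not the wrapper's API:

```python
from __future__ import annotations


def fuse_by_sum(modality_embeddings: list) -> list | None:
    """Sum the embeddings that are present; entries of None mean 'modality absent'."""
    fused = None
    for emb in modality_embeddings:
        if emb is None:
            continue
        # First present modality initializes the result; later ones are added in.
        fused = list(emb) if fused is None else [a + b for a, b in zip(fused, emb)]
    return fused


# Toy 2-dim stand-ins (assumed values): text present, image absent, video present.
fused = fuse_by_sum([[1.0, 2.0], None, [0.5, 0.5]])
```

If the downstream similarity is cosine, summing and averaging rank candidates identically, because the two differ only by a positive scale factor.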

Contributor

@AdnanElAssadi56 AdnanElAssadi56 left a comment

Looks good!

Labels

audio, image, new model, video


3 participants