This is an official implementation of the multi-modalities pre-training model in InternVideo, which is responsible for multi-modal tasks including zero-shot action recognition, zero-shot multiple choice, zero-shot retrieval, video question answering, and video-text retrieval, and is also one of the components of the final InternVideo model.
We currently provide the B/16 model; please download it from Aliyun and place it under the models folder. The model uses UniformerV2 as its backbone and was trained for 12 days on 128 NVIDIA A100 GPUs.
To classify the demo video of an airplane taking off, run python demo.py; you should see results like the following (from the L/14 model):
Label probs:
an airplane is taking off : 0.9562
an airplane is flying : 0.0438
a dog is chasing a ball : 0.0000
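The label probabilities above follow the usual CLIP-style zero-shot scoring: encode the video and each candidate label text, L2-normalize the embeddings, and softmax the scaled cosine similarities. A minimal sketch of that scoring step (the embeddings and temperature here are hypothetical stand-ins for real encoder outputs, not InternVideo's actual API):

```python
import numpy as np

def zero_shot_probs(video_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one video embedding
    and several text (label) embeddings -- CLIP-style scoring."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ v) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy random embeddings standing in for encoder outputs (hypothetical).
rng = np.random.default_rng(0)
video = rng.normal(size=512)
labels = rng.normal(size=(3, 512))  # one row per candidate label
probs = zero_shot_probs(video, labels)
```

With real encoder outputs, the highest probability would correspond to the best-matching label, as in the airplane example above.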
This folder provides a minimal inference implementation for easier usage. For training and fine-tuning on downstream tasks, please refer to the other task-specific folders.
If you intend to use InternVideo for your own video-language tasks: for alignment tasks such as retrieval, use only the video encoder and text encoder; if your task involves modality fusion, such as video question answering, also use the features from the cross-modality decoder.
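For alignment-only tasks such as retrieval, the two encoders' embeddings can be compared directly, with no fusion step: rank gallery items by cosine similarity to the query embedding. A minimal sketch under that assumption (the embeddings below are hypothetical placeholders, not real encoder outputs):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=2):
    """Rank gallery items (e.g. videos) by cosine similarity to a
    query (e.g. a text embedding); return indices and similarities."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    order = np.argsort(-sims)[:top_k]  # best matches first
    return order, sims[order]

# Hypothetical embeddings standing in for encoder outputs.
rng = np.random.default_rng(1)
text_query = rng.normal(size=256)
videos = rng.normal(size=(5, 256))
top_idx, top_sims = retrieve(text_query, videos)
```

The same ranking works in either direction (text-to-video or video-to-text), which is why retrieval needs only the aligned encoders.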
The training code and the L/14 model are on their way.