olivesgatech/MoIR

MoIR: Multi-modal Information Router

Reference implementation of MoIR: Multi-modal Information Router, an information-level fusion method for mitigating modality dominance in Vision-Language Models.

MoIR sits between the modality-specific encoders and the LLM decoder. It applies a truncated SVD to each modality's token sequence to identify the less-informative channels, then routes complementary information from the other modality into those channels through learnable, per-channel gates. Because this rebalancing happens before fusion, MoIR shifts modality dominance by changing the information available in each input, rather than relying on attention alone.

                    +-------------------------------------+
                    |             LLM Decoder             |
                    +-------------------+-----------------+
                                        ^
                          [text tokens] | [vision tokens]
                                        |
                    +-------------------+-----------------+
                    |       MoIR (Information Router)     |
                    |   - per-channel SVD informativeness |
                    |   - bottom-k' channel selection     |
                    |   - learnable α gates per channel   |
                    +-------------------+-----------------+
                          ^                          ^
                          |                          |
                  text encoder                vision encoder
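
The routing steps in the diagram can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code in moir/router.py: the function names, the energy-based informativeness score, the mean pooling over source tokens, and the convex blend are all assumptions made for the example, and the gates are a fixed array here rather than learnable parameters. It also assumes both modalities have already been projected to the same channel dimension.

```python
import numpy as np

def channel_informativeness(tokens: np.ndarray, rank: int) -> np.ndarray:
    """Score each channel by its energy in a rank-`rank` truncated SVD.

    tokens has shape (num_tokens, num_channels); returns (num_channels,).
    The scoring rule is an assumption, not taken from the paper.
    """
    # Thin SVD: tokens = U @ diag(s) @ vt, singular values sorted descending.
    _, s, vt = np.linalg.svd(tokens, full_matrices=False)
    r = min(rank, s.shape[0])
    # Squared column norm of the rank-r approximation: sum_i (s_i * vt[i, j])^2.
    return np.sum((s[:r, None] * vt[:r]) ** 2, axis=0)

def route(target, source, gate, exchange_ratio=0.10, rank=8):
    """Blend `source` into the least-informative channels of `target`."""
    num_channels = target.shape[1]
    k = max(1, int(exchange_ratio * num_channels))
    scores = channel_informativeness(target, rank)
    low = np.argsort(scores)[:k]  # indices of the bottom-k' channels
    # Assumption: pool the source over tokens so sequence lengths need not match.
    pooled = source.mean(axis=0)
    routed = target.copy()
    # Assumption: per-channel convex blend controlled by the gate.
    routed[:, low] = (1.0 - gate[low]) * target[:, low] + gate[low] * pooled[low]
    return routed, low

rng = np.random.default_rng(0)
text = rng.normal(size=(12, 32))    # 12 text tokens, 32 channels
vision = rng.normal(size=(50, 32))  # 50 vision tokens, 32 channels
gate = np.full(32, 0.5)             # gate init of 0.5, as in the README defaults

text_routed, text_low = route(text, vision, gate)
vision_routed, vision_low = route(vision, text, gate)
```

Only the selected channels change; every other channel of each modality passes through untouched, which is what keeps the exchange limited to the fraction set by k'.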

Repository Layout

MoIR/
├── moir/
│   ├── __init__.py
│   ├── router.py        # MoIR module (Section 3.2 - 3.3 of the paper)
│   ├── inject.py        # Injection into LLaVA-1.5 and Qwen2.5-VL
│   ├── model.py         # Backbone loaders + LoRA wiring
│   ├── dataloader.py    # ScienceQA / VizWiz / MMBench-Video datasets
│   ├── metrics.py       # MDI, AEI, Effective Rank (Δ_I, Δ_T)
│   ├── trainer.py       # Generic LoRA + MoIR training loop
│   └── utils.py         # HF cache, seeding, prompt builders
├── configs/
│   ├── scienceqa_llava7b.yaml
│   ├── scienceqa_llava13b.yaml
│   ├── vizwiz_llava7b.yaml
│   └── mmbench_video_qwen.yaml
├── scripts/
│   ├── train_scienceqa.sh
│   ├── train_vizwiz.sh
│   └── train_mmbench_video.sh
├── train.py             # train entry point
├── inference.py         # generate predictions
├── eval.py              # MDI / AEI / EffRank evaluation
└── main.py              # train + eval pipeline
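
As a rough guide to the effective-rank metric listed for metrics.py, the sketch below uses one common definition: the exponential of the Shannon entropy of the normalized singular values. Whether metrics.py uses exactly this definition is an assumption, as is the reading of Δ_I and Δ_T as the change in effective rank of the image and text tokens; the function name and the `eps` parameter are hypothetical.

```python
import numpy as np

def effective_rank(tokens: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank as exp(entropy) of the normalized singular values."""
    s = np.linalg.svd(tokens, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values to a distribution
    p = p[p > eps]            # drop numerically zero entries before the log
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# A matrix with equal singular values has effective rank equal to its rank,
# while a rank-1 matrix has effective rank close to 1.
full = effective_rank(np.eye(4))
rank_one = effective_rank(np.outer(np.arange(1.0, 6.0), np.ones(3)))

# Assumed reading of a rank delta: effective rank after routing minus before.
rng = np.random.default_rng(0)
before = rng.normal(size=(20, 16))
after = before + 0.1 * rng.normal(size=(20, 16))
delta = effective_rank(after) - effective_rank(before)
```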

Installation

pip install -r requirements.txt

Usage

Train MoIR on ScienceQA with LLaVA-1.5-7B (attention-layer LoRA placement):

python train.py --config configs/scienceqa_llava7b.yaml --placement llm_attn

Evaluate the trained checkpoint (MDI / AEI / Rank Δ):

python eval.py --config configs/scienceqa_llava7b.yaml \
               --checkpoint outputs/scienceqa_llava7b/llm_attn/final

Run inference only:

python inference.py --config configs/scienceqa_llava7b.yaml \
                    --checkpoint outputs/scienceqa_llava7b/llm_attn/final \
                    --output predictions.jsonl

End-to-end (train then evaluate):

python main.py --config configs/scienceqa_llava7b.yaml --placement llm_attn

Hyperparameters

The defaults follow the paper:

Parameter              Value
LoRA rank r            16
LoRA α                 32
LoRA dropout           0.05
Optimizer              AdamW
Learning rate          2e-4
Weight decay           0
Batch size             8 (effective, via gradient accumulation)
Epochs                 10
Exchange ratio k'      0.10
Routing gate init α    0.5
Max sequence length    2048
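
To make a few of these values concrete, the sketch below collects the defaults in a small dataclass and derives three quantities from them. The class and field names are hypothetical and the keys in configs/*.yaml may differ. The floor rounding of k' and the micro-batch size of 2 are assumptions for illustration; 4096 is the hidden size of the 7B LLaMA-family decoder used by LLaVA-1.5-7B.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoIRDefaults:
    """Defaults from the table above; names are illustrative only."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    weight_decay: float = 0.0
    effective_batch_size: int = 8
    epochs: int = 10
    exchange_ratio: float = 0.10
    gate_init: float = 0.5
    max_seq_len: int = 2048

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank

    def exchanged_channels(self, hidden_dim: int) -> int:
        # Assumption: the exchange ratio is floored to a whole channel count.
        return int(self.exchange_ratio * hidden_dim)

    def accumulation_steps(self, micro_batch_size: int) -> int:
        # Gradient accumulation steps needed to reach the effective batch size.
        if self.effective_batch_size % micro_batch_size != 0:
            raise ValueError("micro-batch size must divide the effective batch size")
        return self.effective_batch_size // micro_batch_size

defaults = MoIRDefaults()
scaling = defaults.lora_scaling                # 32 / 16
channels = defaults.exchanged_channels(4096)   # 10% of 4096 channels, floored
steps = defaults.accumulation_steps(2)         # hypothetical micro-batch of 2
```

With these defaults the LoRA update is scaled by 2.0, roughly 409 of 4096 channels would be exchanged per modality, and a micro-batch of 2 would need 4 accumulation steps.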

Citation

@article{kim2026moir,
  title  = {Information Router for Mitigating Modality Dominance in Vision-Language Models},
  author = {Kim, Seulgi and Prabhushankar, Mohit and AlRegib, Ghassan},
  year   = {2026}
}

About

Information Router for Mitigating Modality Dominance in Vision-Language Models (ICIP 2026)
