Our lab: https://alregib.ece.gatech.edu/
Reference implementation of MoIR: Multi-modal Information Router, an information-level fusion method for mitigating modality dominance in Vision-Language Models.
MoIR sits between modality-specific encoders and the LLM decoder. It uses a truncated SVD on each modality's token sequence to identify less-informative channels and then routes complementary information from the other modality into those channels through learnable, per-channel gates. By rebalancing information before fusion, MoIR shifts modality dominance through the information availability of inputs rather than purely through attention.

    +-------------------------------------+
    |             LLM Decoder             |
    +-------------------+-----------------+
                        ^
        [text tokens]   |   [vision tokens]
                        |
    +-------------------+-----------------+
    |      MoIR (Information Router)      |
    |  - per-channel SVD informativeness  |
    |  - bottom-k' channel selection      |
    |  - learnable α gates per channel    |
    +-------------------+-----------------+
             ^                    ^
             |                    |
       text encoder         vision encoder

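The three steps named in the diagram (SVD-based informativeness, bottom-k' selection, gated routing) can be sketched in NumPy. This is an illustration under stated assumptions, not the code in `moir/router.py`: the informativeness score (the energy a channel retains in the top-`rank` singular subspace), the mean-pooling of the source modality, and the names `channel_scores` and `route` are all hypothetical. In the actual module the gates are learnable; here they are a fixed array set to the 0.5 initial value listed in the defaults.

```python
import numpy as np

def channel_scores(tokens, rank):
    """Assumed score: energy each channel keeps in the top-`rank` SVD subspace."""
    # tokens has shape (num_tokens, num_channels).
    _, s, vt = np.linalg.svd(tokens, full_matrices=False)
    r = min(rank, s.shape[0])
    # Column d of diag(s[:r]) @ vt[:r] holds channel d's loadings on the top-r components.
    return np.sum((s[:r, None] * vt[:r]) ** 2, axis=0)

def route(target, source, alpha, rank=4, ratio=0.10):
    """Blend a pooled summary of `source` into the least-informative channels of `target`."""
    scores = channel_scores(target, rank)
    k = max(1, int(round(ratio * target.shape[1])))
    low = np.argsort(scores)[:k]       # indices of the bottom-k' channels
    pooled = source.mean(axis=0)       # assumption: summarize the other modality per channel
    gate = alpha[low]                  # one gate value per selected channel
    out = target.copy()
    out[:, low] = (1.0 - gate) * target[:, low] + gate * pooled[low]
    return out, low

# Toy usage: 6 text tokens and 9 vision tokens sharing 20 channels.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(6, 20))
vision_tokens = rng.normal(size=(9, 20))
gates = np.full(20, 0.5)
routed_text, selected = route(text_tokens, vision_tokens, gates)
```

Only the selected channels of the target change; every other channel passes through untouched, which is what keeps the exchange confined to the low-information part of the representation in this sketch.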
MoIR/
├── moir/
│ ├── __init__.py
│ ├── router.py # MoIR module (Section 3.2 - 3.3 of the paper)
│ ├── inject.py # Injection into LLaVA-1.5 and Qwen2.5-VL
│ ├── model.py # Backbone loaders + LoRA wiring
│ ├── dataloader.py # ScienceQA / VizWiz / MMBench-Video datasets
│ ├── metrics.py # MDI, AEI, Effective Rank (Δ_I, Δ_T)
│ ├── trainer.py # Generic LoRA + MoIR training loop
│ └── utils.py # HF cache, seeding, prompt builders
├── configs/
│ ├── scienceqa_llava7b.yaml
│ ├── scienceqa_llava13b.yaml
│ ├── vizwiz_llava7b.yaml
│ └── mmbench_video_qwen.yaml
├── scripts/
│ ├── train_scienceqa.sh
│ ├── train_vizwiz.sh
│ └── train_mmbench_video.sh
├── train.py # train entry point
├── inference.py # generate predictions
├── eval.py # MDI / AEI / EffRank evaluation
└── main.py # train + eval pipeline
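`moir/metrics.py` is listed as computing Effective Rank. A common definition is the exponential of the Shannon entropy of the normalized singular values; the sketch below assumes that definition and may differ from what the repository implements. The function name `effective_rank` is hypothetical, and MDI and AEI are not sketched because this README does not define them.

```python
import numpy as np

def effective_rank(tokens, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution of `tokens`."""
    s = np.linalg.svd(tokens, compute_uv=False)
    p = s / (s.sum() + eps)      # normalize singular values to a distribution
    p = p[p > eps]               # drop numerically zero entries before taking logs
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# Toy usage: compare a token matrix before and after some transformation.
rng = np.random.default_rng(0)
before = rng.normal(size=(16, 8))
after = before.copy()
after[:, 4:] = 0.0               # zeroing channels can only reduce the matrix rank
delta = effective_rank(after) - effective_rank(before)
```

Under this definition a matrix with n equal nonzero singular values has effective rank n, and a rank-1 matrix has effective rank 1.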

Install the dependencies:

    pip install -r requirements.txt

Train MoIR on ScienceQA with LLaVA-1.5-7B (attention-layer LoRA placement):

    python train.py --config configs/scienceqa_llava7b.yaml --placement llm_attn

Evaluate the trained checkpoint (MDI / AEI / Rank Δ):

    python eval.py --config configs/scienceqa_llava7b.yaml \
        --checkpoint outputs/scienceqa_llava7b/llm_attn/final

Run inference only:

    python inference.py --config configs/scienceqa_llava7b.yaml \
        --checkpoint outputs/scienceqa_llava7b/llm_attn/final \
        --output predictions.jsonl

End-to-end (train then evaluate):

    python main.py --config configs/scienceqa_llava7b.yaml --placement llm_attn

The defaults follow the paper:

| Parameter | Value |
|---|---|
| LoRA rank r | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Weight decay | 0 |
| Batch size | 8 (effective via grad accumulation) |
| Epochs | 10 |
| Exchange ratio k' | 0.10 |
| Routing gate init α | 0.5 |
| Max sequence length | 2048 |
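
To make the table easier to reason about, the defaults can be collected in one place along with a few quantities derived from them. The class and field names below are hypothetical and do not reflect the keys used in `configs/*.yaml`. Two assumptions are worth flagging: that the exchange ratio k' is applied to the hidden size of the token embeddings, and that the batch size of 8 is reached by dividing evenly across per-device batches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoIRDefaults:
    """Hypothetical container mirroring the defaults table above."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    weight_decay: float = 0.0
    effective_batch_size: int = 8
    epochs: int = 10
    exchange_ratio: float = 0.10
    gate_init: float = 0.5
    max_seq_len: int = 2048

    def lora_scale(self):
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank

    def routed_channels(self, hidden_size):
        # Assumption: k' is a fraction of the hidden (channel) dimension.
        return max(1, round(self.exchange_ratio * hidden_size))

    def accumulation_steps(self, per_device_batch, num_devices=1):
        # Gradient-accumulation steps needed to reach the effective batch size.
        per_step = per_device_batch * num_devices
        if self.effective_batch_size % per_step != 0:
            raise ValueError("effective batch size must be a multiple of the per-step batch")
        return self.effective_batch_size // per_step

defaults = MoIRDefaults()
scale = defaults.lora_scale()             # 32 / 16
channels = defaults.routed_channels(4096) # 4096 is the hidden size of a 7B LLaMA-family decoder
steps = defaults.accumulation_steps(per_device_batch=2)
```

With a per-device batch of 2 on a single GPU, this sketch yields 4 accumulation steps per optimizer update.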
@article{kim2026moir,
title = {Information Router for Mitigating Modality Dominance in Vision-Language Models},
author = {Kim, Seulgi and Prabhushankar, Mohit and AlRegib, Ghassan},
year = {2026}
}