Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence.
However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages.
Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations.
Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
We introduce a dedicated non-causal token to aggregate degradation signals via <i>unidirectional attention</i>, producing a degradation representation: \( Z_{deg} \).
Patch tokens do not attend back to this token, preventing degradation leakage into semantic representations.
</p>
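The unidirectional attention described above can be sketched as a boolean attention mask. This is a minimal illustration, not the authors' implementation: the function name `build_attention_mask`, the placement of the degradation token at the last index, and the choice to let that token attend to itself are all assumptions made for the example.

```python
def build_attention_mask(num_patches: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout (hypothetical): indices 0..num_patches-1 are patch tokens,
    and index num_patches is the single non-causal degradation token.
    """
    deg = num_patches          # index of the degradation token
    size = num_patches + 1
    mask = []
    for q in range(size):
        row = []
        for k in range(size):
            if q == deg:
                # The degradation token reads from every patch (and itself).
                row.append(True)
            else:
                # Patch tokens attend to each other bidirectionally,
                # but never to the degradation token.
                row.append(k != deg)
        mask.append(row)
    return mask


mask = build_attention_mask(3)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In the resulting mask the last column is blocked for every patch row, so no information from the degradation token can flow back into the patch representations, while the last row is fully open so the token can aggregate from all patches.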
-<hr class="dashed-line">
+<!-- <hr class="dashed-line"> -->
<p class="content has-text-justified">
<b>Non-Causal Distortion Modeling (NCDM):</b> To structure the degradation subspace, we apply a triplet contrastive objective:
The causal branch aggregates patch tokens bidirectionally to produce purified semantic embeddings: \( Z_{sem} \).
This path is isolated from degradation tokens and is the only representation used at inference.
</p>
-<hr class="dashed-line">
+<!-- <hr class="dashed-line"> -->
<p class="content has-text-justified">
<b>Causal Semantic Alignment (CSA):</b> To ensure degradation-invariant semantics, we align degraded semantic embeddings with their clean counterparts while enforcing independence between semantic and degradation representations: