Commit 0cc3395

committed
update
1 parent 4d82475 commit 0cc3395

4 files changed

Lines changed: 69 additions & 18 deletions

index.html

Lines changed: 69 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -219,19 +219,14 @@ <h1 class="title is-1 publication-title">
219219
</body>
220220
</html> -->
221221

222-
<section class="hero teaser">
223-
<div class="img-container">
224-
<div class="image-with-caption">
225-
<img class="regular-gif" src="./robustvisrag_files/Overview.png" style="width: 60%; height: auto;">
226-
</div>
227-
</div>
222+
223+
<section class="section">
228224
<div class="container is-max-desktop">
229-
<div class="hero-body">
230-
<!-- <video autoplay muted loop playsinline style="width: 100%; height: auto; border-radius: 12px;">
231-
<source src="" type="video/mp4">
232-
Your browser does not support the video tag.
233-
</video> -->
234-
<div class="content has-text-justified">
225+
<!-- Abstract. -->
226+
<div class="columns is-centered has-text-centered">
227+
<div class="column is-four-fifths">
228+
<h2 class="title is-3">Abstract</h2>
229+
<div class="content has-text-justified">
235230
<p>
236231
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence.
237232
However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages.
@@ -241,10 +236,11 @@ <h1 class="title is-1 publication-title">
241236
Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations.
242237
Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
243238
</p>
239+
</div>
244240
</div>
245241
</div>
246242
</div>
247-
</section>
243+
</section>
248244

249245
<style>
250246
.module-block {
@@ -268,9 +264,13 @@ <h1 class="title is-1 publication-title">
268264
<div class="container is-max-desktop">
269265
<div class="content has-text-justified">
270266
<h2 class="title is-3 has-text-centered">Proposed Method</h2>
271-
<p>
272-
RobustVisRAG enhances Vision-based Retrieval-Augmented Generation (VisRAG) under visual degradations through causality-guided semantic–degradation disentanglement. By explicitly separating degradation and semantic factors inside the vision encoder, our framework suppresses degradation-induced bias while preserving task-relevant representations — without introducing additional inference cost.
273-
</div>
267+
<div class="has-text-centered">
268+
<img style="width: 100%;" src="./robustvisrag_files/Overview.png"
269+
alt="Overview of RobustVisRAG."/>
270+
<div class="content has-text-justified">
271+
<p>
272+
RobustVisRAG enhances Vision-based Retrieval-Augmented Generation (VisRAG) under visual degradations through causality-guided semantic–degradation disentanglement. By explicitly separating degradation and semantic factors inside the vision encoder, our framework suppresses degradation-induced bias while preserving task-relevant representations — without introducing additional inference cost.
273+
</div>
274274

275275
<h3 class="title is-4 has-text-centered">Preliminary</h3>
276276
<div class="columns is-multiline is-variable is-6">
@@ -324,7 +324,7 @@ <h3 class="title is-4 has-text-centered">RobustVisRAG</h3>
324324
We introduce a dedicated non-causal token to aggregate degradation signals via <i>unidirectional attention</i>, producing a degradation representation \( Z_{deg} \).
325325
Patch tokens do not attend back to this token, preventing degradation leakage into semantic representations.
326326
</p>
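The one-way attention pattern described in this paragraph can be illustrated with a boolean mask. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_unidirectional_mask`, the choice to place the non-causal token at index 0, and the convention that `True` means "may attend" are all hypothetical and do not come from the page.

```python
from typing import List


def build_unidirectional_mask(num_patches: int, deg_index: int = 0) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout (hypothetical): one non-causal degradation token at
    `deg_index`, with `num_patches` patch tokens in the remaining positions.
    """
    size = num_patches + 1  # patch tokens plus the single non-causal token
    mask = [[True] * size for _ in range(size)]
    for q in range(size):
        if q != deg_index:
            # Patch queries never read from the degradation token,
            # so degradation signals cannot leak into patch features.
            mask[q][deg_index] = False
    return mask


mask = build_unidirectional_mask(num_patches=3)
# Row 0 (degradation token) sees every token; rows 1..3 (patches) skip column 0.
```

In such a layout the degradation token still aggregates information from every patch, while the patch-to-patch attention is left unchanged.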
327-
<hr class="dashed-line">
327+
<!-- <hr class="dashed-line"> -->
328328

329329
<p class="content has-text-justified">
330330
<b>Non-Causal Distortion Modeling (NCDM):</b> To structure the degradation subspace, we apply a triplet contrastive objective:
@@ -346,7 +346,7 @@ <h3 class="title is-4 has-text-centered">RobustVisRAG</h3>
346346
The causal branch aggregates patch tokens bidirectionally to produce purified semantic embeddings \( Z_{sem} \).
347347
This path is isolated from degradation tokens and is the only representation used at inference.
348348
</p>
349-
<hr class="dashed-line">
349+
<!-- <hr class="dashed-line"> -->
350350

351351
<p class="content has-text-justified">
352352
<b>Causal Semantic Alignment (CSA):</b> To ensure degradation-invariant semantics, we align degraded semantic embeddings with their clean counterparts while enforcing independence between semantic and degradation representations:
@@ -388,6 +388,57 @@ <h2 class="title is-3 has-text-centered">Distortion-VisRAG Dataset</h2>
388388
</div>
389389
</div>
390390
</div>
391+
</section>
392+
393+
<hr/>
394+
<section class="section">
395+
<div class="container is-max-desktop">
396+
<div class="content has-text-justified">
397+
<h2 class="title is-3 has-text-centered">Quantitative Results</h2>
398+
<p>
399+
We evaluate RobustVisRAG across retrieval, generation, and end-to-end settings
400+
under clean, synthetic, and real-world degradations.
401+
Our method consistently improves robustness without additional inference cost.
402+
</p>
403+
<div class="img-container">
404+
<div class="column is-half">
405+
<div class="image-with-caption">
406+
<div class="caption"><b>Overall retrieval performance (MRR@10).</b></div>
407+
<img style="width: 100%;" src="./robustvisrag_files/Exp_Ret.png"/>
408+
</div>
409+
<div class="image-with-caption">
410+
<div class="caption"><b>End-to-end retrieval–generation performance.</b></div>
411+
<img style="width: 100%;" src="./robustvisrag_files/Exp_E2E.png"/>
412+
</div>
413+
</div>
414+
415+
<div class="column is-half">
416+
<div class="image-with-caption">
417+
<div class="caption"><b>Overall generation performance (Accuracy).</b></div>
418+
<img style="width: 100%;" src="./robustvisrag_files/Exp_Gen.png"/>
419+
</div>
420+
</div>
421+
</div>
422+
</div>
423+
</div>
424+
</section>
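For readers unfamiliar with the retrieval metric named in the caption above, MRR@10 averages the reciprocal rank of the first relevant document within the top 10 results of each query. The sketch below follows that standard definition. The evaluation code behind the reported figures is not shown on this page, so the function name `mrr_at_k` and the input layout are assumptions made for illustration.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant item within the top k.

    rankings[i] is the ranked document ids returned for query i;
    relevant[i] is the set of ids judged relevant for that query.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, gold in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank  # only the first hit counts
                break
    # Queries with no relevant item in the top k contribute 0.
    return total / len(rankings)


score = mrr_at_k(
    rankings=[["a", "b", "c"], ["x", "y", "z"], ["p", "q"]],
    relevant=[{"a"}, {"z"}, {"missing"}],
)
# Reciprocal ranks are 1, 1/3 and 0, so the mean is 4/9.
```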
425+
426+
<hr/>
427+
<section class="section" id="BibTeX">
428+
<div class="container content is-max-desktop">
429+
<h2 class="title">BibTeX</h2>
430+
<pre><code>@misc{chen2026robustvisragcausalityawarevisionbasedretrievalaugmented,
431+
title={RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations},
432+
author={I-Hsiang Chen and Yu-Wei Liu and Tse-Yu Wu and Yu-Chien Chiang and Jen-Chien Yang and Wei-Ting Chen},
433+
year={2026},
434+
eprint={2602.22013},
435+
archivePrefix={arXiv},
436+
primaryClass={cs.CV},
437+
url={https://arxiv.org/abs/2602.22013},
438+
}
439+
</code></pre>
440+
</div>
441+
</section>
391442

392443
<script type="text/javascript" src="./static/slick/slick.min.js"></script>
393444
</body>

robustvisrag_files/Exp_E2E.png

118 KB

robustvisrag_files/Exp_Gen.png

290 KB

robustvisrag_files/Exp_Ret.png

263 KB
