Commit b57889b

fix(paper_reviewer): include real paper body + bibliography in prompt; normalize score (#197)
Reviewers were issuing "no LaTeX source" / "no bibliography" verdicts on arXiv-intake papers because they literally never saw the paper content:

* _concat_tex sorted .tex files alphabetically with a 60KB budget. For a typical arXiv tarball (extra_pkgs.tex ≈ 3KB sorts first; main.tex ≈ 250KB sorts later), the budget got consumed by package declarations and main.tex was always skipped. The reviewer's prompt contained 3KB of \usepackage lines and a "(truncated; remaining files: 2)" footer: no abstract, no methods, no results.
* state/citations/<PROJ>.yaml is never populated for arXiv-intake papers, so the bibliography section was always "(no citations recorded)", even when paper/source/ref.bib was right there with 100+ entries.
* One specialist per project (~1/13) failed pydantic validation because the LLM picked "accept" verdict but wrote score=0.0 (or "minor_revision" with score=0.5). The score is purely derived from the verdict, so normalize on parse instead of losing a substantive review to a numeric formatting slip.

Fixes:

1. _concat_tex now promotes the entry-point file (containing \documentclass) to the front of the ordering, truncates IT to fit budget if necessary (vs. silently skipping it), and the default budget grew from 60KB → 180KB (~45K tokens, leaves room for the response in a 128K context).
2. _summarize_bibfile fallback: when state/citations is empty, inline paper/source/*.bib (capped at 30KB) so the reviewer can see what's cited and judge the reference set.
3. handle_response normalizes score from verdict before validation.

Verified against 8 previously-failing projects (PROJ-564, 565, 566, 568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist reviews instead of crashing or emitting boilerplate "no source provided" verdicts.

Aggregate verdicts:

* accept : PROJ-564, 565, 566, 576
* minor_revision : PROJ-568, 570, 571
* major_revision_sci: PROJ-578 (correctly flagged "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names)

Reviews now reference specific Algorithms, Tables, Figures, and hyperparameters by name: the LLM is reading and reasoning about the actual paper, not the package preamble.

Adds 9 new unit tests (17 total in test_paper_reviewer_arxiv_intake). Full unit suite (395 tests) passes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
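For orientation, fixes 1 and 3 can be sketched roughly as below. This is an illustration written from the commit message alone, not the code in this commit: the function names `concat_tex` and `normalize_score`, the `files` dict input, the `% --- name ---` separators, and counting the budget in characters are all assumptions. The two entries in `VERDICT_SCORES` mirror the `score`/`verdict` pairs visible in the review front matter added by this commit; the real mapping presumably covers more verdicts.

```python
DEFAULT_BUDGET = 180_000  # characters here (assumption); the commit states the budget in KB

# Hypothetical table: only the two pairs visible in this commit's review files.
VERDICT_SCORES = {"accept": 0.5, "minor_revision": 0.0}


def concat_tex(files: dict[str, str], budget: int = DEFAULT_BUDGET) -> str:
    """Concatenate .tex sources, entry point first, within a size budget."""
    names = sorted(files)
    # Promote any file containing \documentclass ahead of the alphabetical rest.
    entry_points = [n for n in names if "\\documentclass" in files[n]]
    ordered = entry_points + [n for n in names if n not in entry_points]

    parts: list[str] = []
    remaining = budget
    skipped = 0
    for name in ordered:
        text = files[name]
        if len(text) > remaining:
            if name in entry_points and remaining > 0:
                # Truncate the entry point instead of silently skipping it.
                text = text[:remaining]
            else:
                skipped += 1
                continue
        # The separator line is not counted against the budget in this sketch.
        parts.append(f"% --- {name} ---\n{text}")
        remaining -= len(text)
    if skipped:
        parts.append(f"(truncated; remaining files: {skipped})")
    return "\n".join(parts)


def normalize_score(review: dict) -> dict:
    """Return a copy whose score is derived from the verdict when it is known."""
    fixed = dict(review)
    verdict = fixed.get("verdict")
    if verdict in VERDICT_SCORES:
        fixed["score"] = VERDICT_SCORES[verdict]
    return fixed
```

In this sketch `normalize_score` would run on the parsed LLM response before it is handed to the validating model, so a mismatched numeric score never reaches validation; verdicts missing from the table are passed through unchanged.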
1 parent aa223ec commit b57889b

130 files changed

Lines changed: 3576 additions & 28 deletions

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: High-compression VAE with SOTA reconstruction and novel text benchmark;
+  publication-ready.
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T14:36:52.393991Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer
+score: 0.5
+verdict: accept
+---
+
+# Free-form review body
+
+## Strengths
+- **Architectural Innovation**: The introduction of Global Skip Connections (GSC) and the asymmetric encoder-decoder design effectively addresses the information bottleneck in high-compression ($f16/f32$) regimes, preserving fine-grained details that are typically lost.
+- **Benchmark Contribution**: OmniDoc-TokenBench fills a critical gap in evaluating text-rich image reconstruction, moving beyond pixel metrics (PSNR/SSIM) to semantic legibility (NED). The construction methodology is transparent and reproducible.
+- **Empirical Rigor**: Extensive quantitative comparisons against strong baselines (FLUX, Hunyuan, Cosmos) across multiple compression tiers demonstrate state-of-the-art performance. The inclusion of downstream DiT convergence experiments validates the "diffusability" claims.
+- **Training Strategy**: The multi-stage training paradigm (resolution curriculum, text data infusion, semantic alignment calibration) is well-justified and aligns with the architectural goals.
+- **Writing Quality**: The technical report is well-structured, clearly articulating the trade-offs between compression, fidelity, and generation efficiency.
+
+## Concerns
+- **Citation Verification**: While the bibliography is complete, the internal pipeline's `verification_status` for citations is not populated in the input metadata. For strict adherence to internal `accept` rules, this should be flagged for administrative verification, though it does not impact scientific merit.
+- **Training Scale Specifics**: The claim of "billions of images" is standard for foundation models but lacks specific dataset composition details (e.g., exact sources, deduplication ratios) which could aid full reproducibility of the training curve.
+- **Future-Dated Citations**: Several references (e.g., `2025`, `2026`) reflect the simulation context of the arXiv ingestion. In a real-world setting, ensure these correspond to valid pre-prints or publications.
+
+## Recommendation
+The paper presents a significant advancement in high-compression VAEs, resolving the traditional tripartite trade-off between compression ratio, reconstruction fidelity, and diffusability. The proposed OmniDoc-TokenBench is a valuable community resource. The methodology is sound, results are compelling, and the writing is clear. I recommend **accept** for publication, subject to minor administrative completion of citation verification flags in the internal system.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: ''
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T14:42:33.699367Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer_claim_accuracy
+score: 0.0
+verdict: minor_revision
+---
+
+The review identifies specific discrepancies between factual claims and their supporting citations or internal consistency.
+
+1. **OCR Version Mismatch**: In Section 5.2 (Benchmark Construction) and Section 3 (Data), the text claims the use of "PP-OCRv5" while citing `cui2025paddleocr30technicalreport`. The bibliography entry title explicitly reads "PaddleOCR 3.0 Technical Report". This version mismatch (v5 vs 3.0) constitutes a factual inaccuracy regarding the tool version used, which impacts reproducibility and claim validity. Authors must align the text with the correct citation or update the citation if v5 exists separately.
+
+2. **GAN Loss Attribution**: In Section 4.1 (Training Loss), the paper states GAN loss is "conventionally used to sharpen visual detail" and cites `gan` (Isola et al., 2017). While Isola et al. introduced conditional GANs (Pix2Pix), the specific application of GAN loss for VAE reconstruction sharpening is more commonly associated with VAE-GAN (Larsen et al.) or LSGAN (Mao et al.) literature. Citing Pix2Pix for general VAE sharpening is imprecise and reduces claim accuracy regarding the methodological lineage.
+
+3. **Baseline Coverage for "First" Claim**: Section 6.1.2 asserts, "To the best of our knowledge, this is the first f16 autoencoder to achieve text fidelity exceeding f8 VAEs." While qualified, this strong claim depends on the comprehensiveness of Table 3. The table omits several potential f16 baselines (e.g., specific video VAEs or non-public models). Authors should ensure the baseline selection is exhaustive or soften the claim to "among evaluated baselines" to maintain accuracy.
+
+4. **Semantic Alignment Citation**: Section 4.2 attributes the finding that "channel expansion... results in an over-complex and unstructured latent distribution" to `qiu2025image`. This specific mechanism should be explicitly verified in the cited work, as titles often generalize.
+
+These issues require correction to ensure all factual claims are strictly supported by the provided evidence or citations.
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: ''
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T14:49:24.958764Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer_code_quality_paper
+score: 0.0
+verdict: minor_revision
+---
+
+The provided manuscript for PROJ-564 (Qwen-Image-VAE-2.0) is a complete technical report in LaTeX format, detailing architectural innovations and benchmark results. However, from the perspective of code quality and reproducibility, the submission is critically incomplete. The input package contains only LaTeX source files and bibliography; no source code artifacts (e.g., `train.py`, `model.py`, `config.yaml`, `requirements.txt`) were provided. Consequently, I cannot assess code readability, modularity, test coverage, dependency hygiene, or the ability to reproduce the results from scratch.
+
+For a technical report claiming state-of-the-art performance on Variational Autoencoder architectures, the absence of the implementation code prevents verification of critical components. Specifically, the Global Skip Connection (GSC) implementation, the semantic alignment loss using DINOv2 features, and the OmniDoc-TokenBench evaluation pipeline are described textually but lack executable verification. The training strategy (multi-stage resolution, synthetic data rendering) also requires code to confirm the curriculum learning logic.
+
+To meet the `code_quality_paper` standards, the authors must provide a public repository link or attach the code artifacts in the next revision. Specifically, the training loop, model definition, and evaluation scripts need to be modularized. Dependency hygiene requires a `requirements.txt` or `environment.yml` file to ensure consistent environments. Additionally, a `Dockerfile` would aid reproducibility for large-scale training environments. Without these, the reproducibility claim remains unverified, and the code quality cannot be evaluated. The absence of test suites for the benchmark metrics further reduces confidence in the reported NED scores. Please include the full codebase in the revised submission.
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: ''
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T14:50:14.853634Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer_data_quality_paper
+score: 0.0
+verdict: minor_revision
+---
+
+**Data Quality Review: Missing Provenance, Licensing, and Versioning**
+
+This review focuses exclusively on data quality aspects: provenance, licensing, schema, missing-data handling, version control, and link rot of external sources. Several critical gaps require minor revision before publication.
+
+**1. Training Data Provenance & Licensing (sec/data.tex, lines 1-15)**
+The paper states training on "billions of images" covering "various categories, resolutions and aspect ratios" but provides **no license information**, **no source attribution**, and **no data card** for this corpus. Without explicit licensing (e.g., CC-BY, public domain, or commercial licenses), downstream users cannot legally reuse the model or verify compliance with data rights. This is a significant omission for a technical report claiming billion-scale training.
+
+**2. Dataset Versioning (sec/experiment.tex, Table 1-2)**
+Benchmarks ImageNet~\citep{deng2009imagenet} and FFHQ~\citep{Karras2018ASG} lack version specifications. ImageNet has multiple splits (e.g., 2012 validation, 2015 challenge); FFHQ has versions (e.g., 70k, 100k). Without version numbers, results cannot be reproduced. Similarly, OmniDocBench~\citep{Ouyang2024OmniDocBenchBD} requires a version tag or commit hash for the benchmark construction pipeline.
+
+**3. Benchmark Construction Transparency (sec/bench.tex, lines 1-30)**
+OmniDoc-TokenBench construction references OmniDocBench but omits:
+- The exact version of OmniDocBench used
+- License status of derived benchmark data
+- Character count thresholds ([200, 600] for Chinese, [300, 600] for English) lack justification for reproducibility
+- Human inspection criteria are qualitative ("blurred, visually redundant") without objective metrics
+
+**4. External Link Stability (bibliography)**
+Multiple GitHub URLs (e.g., `hunyuanimage2.1`, `flux2`) are subject to link rot. arXiv URLs are stable but should include access dates. Consider adding DOIs where available.
+
+**5. Data Privacy & Consent**
+No mention of privacy safeguards for the billion-scale corpus or human inspection of OmniDoc-TokenBench samples.
+
+**Required Actions:**
+- Add dataset licenses and sources for training corpus (sec/data.tex)
+- Specify version numbers for all benchmark datasets (Table 1-2 captions)
+- Provide a data card or datasheet link for OmniDoc-TokenBench
+- Add access dates to external URLs in bibliography
+
+These revisions are necessary for reproducibility and legal compliance without affecting technical claims.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: ''
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T15:05:18.569204Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer_figure_critic
+score: 0.0
+verdict: minor_revision
+---
+
+## Figure Quality Review
+
+This review examines all figures in the manuscript for clarity, accessibility, and whether they effectively support the paper's claims.
+
+### Figures Identified
+
+1. **Figure 1 (`fig:vae_arch`)** – Architecture comparison (NSC/LSC/GSC)
+2. **Figure 2 (`fig:vae_bench`)** – OmniDoc-TokenBench illustration
+3. **Figure 3 (`fig:text_recon_comparison`)** – Qualitative text reconstruction (f16/f32)
+4. **Figure 4 (`fig:sample_images`)** – Generated ImageNet samples
+
+### Issues Requiring Attention
+
+**1. Missing Alt Text (Accessibility)**
+None of the figures include alt text specifications. Per accessibility standards, all figures should have descriptive alt text for screen reader users. Add `\alttext{...}` to each `\includegraphics` command or provide equivalent accessibility metadata.
+
+**2. Caption Typo (Line ~485, `sec/experiment.tex`)**
+The caption for Figure 3 reads: "Qualitative comparison of text reconstruction on **Ours** OmniDoc-TokenBench." This should be corrected to "on OmniDoc-TokenBench."
+
+**3. Color Accessibility Not Documented**
+The manuscript uses color highlighting (e.g., `\colorbox{blue!5}` in tables, `\textcolor[RGB]{215,36,36}` in captions). No colorblindness considerations or grayscale reproduction guidelines are mentioned. Ensure figures remain interpretable in grayscale.
+
+**4. Figure 2 Caption Insufficient**
+`fig:vae_bench` caption: "OmniDoc-TokenBench, a curated collection of ~3K text-rich images." This is too minimal. It should describe what specific elements are shown (e.g., sample categories, representative document types, or the benchmark construction pipeline).
+
+**5. Figure 3 Subfigure Labels**
+The subfigures (Figure 3a/3b) are labeled "f16 Compression VAEs" and "f32 Compression VAEs" but the caption mentions "Top/Middle/Bottom" row structure. This row-level explanation should appear in the caption, not just the subfigure captions, to aid readers viewing single-column prints.
+
+**6. Print Scale Legibility**
+Figure 3 contains zoomed-in word patches. Verify at 100% print scale that text remains legible. The current LaTeX uses `width=1.0\textwidth` for full-width figures; consider ensuring critical details (character-level crops) remain readable when printed in single-column format.
+
+### Positive Observations
+
+- Figure 1 effectively illustrates the GSC architectural contribution with ablation context.
+- Figure 3 directly supports the paper's central claim about text reconstruction superiority.
+- All figures are properly referenced in the text with appropriate `\ref{}` commands.
+- Figure 4 provides visual validation of diffusability claims.
+
+### Recommended Actions
+
+1. Add alt text to all figure environments
+2. Correct the "Ours" typo in Figure 3 caption
+3. Expand Figure 2 caption with descriptive content
+4. Document color accessibility considerations
+5. Verify print-scale legibility of text patches in Figure 3
+
+These revisions will improve accessibility and ensure figures fully support their intended claims without requiring external interpretation.
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
+artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
+backend: dartmouth
+feedback: ''
+github_authenticated: false
+model_name: qwen.qwen3.5-122b
+prompt_version: 1.0.0
+reviewed_at: '2026-05-17T15:07:33.265062Z'
+reviewer_kind: llm
+reviewer_name: paper_reviewer_jargon_police
+score: 0.0
+verdict: minor_revision
+---
+
+This review flags excessive jargon and undefined acronyms that hinder accessibility for non-specialist readers. Several critical acronyms appear before their definition. In the **Introduction**, "PSNR" and "SSIM" are used without expansion; they are not defined until **Section 6.1**. "OCR" appears in the **Abstract** without defining "Optical Character Recognition." Similarly, **Section 5** introduces "MAE," "PE-Spatial," "DINOv3," and "PP-OCRv5" without explaining what these models/tools are (e.g., Masked Autoencoders, OCR engine). **Section 6.2** uses "SiT," "FID," and "gFID" without full expansion (Scalable Interpolant Transformers, Fréchet Inception Distance).
+
+Beyond acronyms, the text relies on specialized vocabulary that can be simplified. The term "native" in "native high-resolution synthesis" (**Introduction**) is industry jargon; "direct" or "inherent" is clearer. "Tripartite trade-off" (**Introduction**, **Conclusion**) should be "three-way trade-off." "Backbone" (**Model Architecture**, **Training**) is standard ML slang but "core architecture" is more accessible. "Paradigm" (**Training**, **Conclusion**) is overused; "approach" or "method" suffices. "Semantic manifold" (**Training**) is dense mathematical jargon; "semantic structure" or "space" is plainer. "Curriculum-based" (**Training**) can be "progressive." "Data infusion" (**Training**) should be "data integration." "Logographic" (**Data**) is technical; "character-based" (for Chinese) is clearer. "Multi-granularity supervision" (**Data**) should be "multi-level supervision." "Generation-friendly" (**Training**) is coined jargon; "suitable for generation" is better. "Open-vocabulary conditioning" (**Conclusion**) can be "flexible text conditioning."
+
+Finally, the term "diffusability" is defined only in a footnote (**Abstract**), yet it is a central coined term used throughout. While footnotes help, defining it in the main text upon first use is standard practice. The mathematical formulation of loss functions in **Section 5.2** assumes familiarity with cosine similarity and distance matrices; adding a brief plain-English summary of what these terms measure (e.g., "aligning feature directions") would aid non-experts. Please revise to define all acronyms at first mention and replace opaque jargon with plain English equivalents.
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
---
2+
artifact_hash: 815458de8568b35ab5a02599bda9f602ed2dc04d545bca014bc4749f57af838e
3+
artifact_path: projects/PROJ-564-qwen-image-vae-2-0-technical-report/paper/metadata.json
4+
backend: dartmouth
5+
feedback: ''
6+
github_authenticated: false
7+
model_name: qwen.qwen3.5-122b
8+
prompt_version: 1.0.0
9+
reviewed_at: '2026-05-17T14:40:31.437663Z'
10+
reviewer_kind: llm
11+
reviewer_name: paper_reviewer_logical_consistency
12+
score: 0.0
13+
verdict: minor_revision
14+
---
15+
16+
The paper presents a coherent argument for high-compression VAEs using Global Skip Connections (GSC) and semantic alignment. However, a significant logical inconsistency exists between the claim of superior diffusability and the empirical evidence provided in the results section.
17+
18+
In Section 6.1.3 ("Performance of Diffusability", `sec/experiment.tex`), the authors state: "Qwen-Image-VAE-2.0 demonstrates superior latent space diffusability, consistently outperforming existing high-compression baselines in overall generation quality." This claim is not fully supported by Table 1 (`sec/experiment.tex`). Specifically, within the "$f16$ Compression VAEs" block, VAVAE (f16c32) achieves an Inception Score (IS) of 129.80 and gFID of 6.03. In contrast, Qwen-Image-VAE-2.0-f16c128 (f16c128) achieves an IS of 92.42 and gFID of 10.29. Since VAVAE is categorized under the same compression tier and exhibits significantly better generation metrics, the claim of "consistently outperforming" is logically invalid based on the presented data.
19+
20+
Additionally, in Table 2 (`sec/experiment.tex`), FLUX.2-dev (f16c128) achieves a lower FID (0.73) compared to Qwen-Image-VAE-2.0-f16c128 (0.79) on OmniDoc-TokenBench. While the text focuses on SSIM/PSNR for reconstruction fidelity, the inclusion of FID in the table without qualification creates ambiguity regarding the "superior" claim across all metrics.
21+
22+
To restore logical consistency, the claim in Section 6.1.3 should be revised to acknowledge VAVAE's superior generation scores or clarify the distinction (e.g., channel dimension vs. compression ratio). The conclusion that the model achieves "superior diffusability" relative to specific high-channel baselines (like FLUX.2-dev) is supported, but the generalization to "all baselines" is contradicted by the evidence. Please revise the text to accurately reflect the comparative performance shown in Table 1.
