Hi, thanks for releasing ReconViaGen. I’m studying the training code and had a question about the local per-view condition used for SLAT Flow.
In the paper, Section 3.2 and Figure 2 seem to describe the SLAT/PVC path as using a Condition Net similar to the SS/global path: random/learnable per-view tokens are updated by cross-attention blocks over VGGT features, producing per-view token lists T_k. Figure 2 also shows per-view cross-attention in SLAT Flow followed by a learned/weighted fusion.
In the released code, the SLAT conditioner appears to be simpler:
- `ModulatedSLATMultiViewCond` in `trellis/models/structured_latent_flow.py` starts from DINO `image_cond`, not learned/random per-view query tokens.
- It applies four `Linear(ctx_channels -> channels) + ReLU` blocks over concatenated VGGT features and the current condition.
- I do not see self-attention or cross-attention blocks inside the SLAT conditioner, unlike the SS conditioner.
- In `trellis/modules/transformer/modulated.py`, when the SLAT DiT receives a list of per-view conditions, it seems to average per-view cross-attention outputs with `/ len(context)` rather than using a learned weighted fusion MLP as described in the paper.
- In the `v0.5` branch I noticed `fuse_blocks` are declared in `ModulatedSLATMultiViewCond`, but they do not appear to be used in `forward`.
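
To make the fusion point concrete, here is a minimal sketch of the two behaviors I am contrasting. It is not the repository's code: the function names `mean_fusion` and `weighted_fusion` are hypothetical, plain Python lists stand in for the per-view cross-attention outputs, and the per-view `scores` are passed in by hand where a learned fusion MLP would presumably produce them.

```python
import math

def mean_fusion(view_outputs):
    """Uniform average over views, i.e. sum(outputs) / len(context)."""
    n = len(view_outputs)
    dim = len(view_outputs[0])
    return [sum(v[d] for v in view_outputs) / n for d in range(dim)]

def weighted_fusion(view_outputs, scores):
    """Softmax-weighted sum over views.

    `scores` holds one scalar per view. In a learned variant these would
    come from a small network (an assumption here, not the released code).
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(view_outputs[0])
    return [sum(w * v[d] for w, v in zip(weights, view_outputs))
            for d in range(dim)]

# Two views, each with a 2-dimensional output for a single token.
views = [[1.0, 0.0], [3.0, 2.0]]

uniform = mean_fusion(views)                     # every view counts equally
equal_scores = weighted_fusion(views, [0.0, 0.0])  # reduces to the mean
skewed = weighted_fusion(views, [0.0, 10.0])       # dominated by the second view
```

With equal scores the weighted version collapses to the plain average, so the `/ len(context)` path behaves like a weighted fusion whose weights are fixed and uniform. My question is whether the per-view weights were ever learned in the version used for the paper.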
Could you clarify whether this released implementation is the one used for the paper’s SLAT/PVC results? If so, was the simpler MLP-based SLAT conditioner chosen for stability, memory, or ease of training compared with the cross-attention Condition Net shown in the paper? Or is the paper diagram describing an earlier/internal variant?
I’m asking because the SS/global conditioner in code closely matches the paper, while the SLAT/PVC path looks architecturally different. Any explanation of the intended design tradeoff would be very helpful.