
VAR reconstruction in training #2

@minrui-liu


Hi authors, thanks for the great work!

I have a question regarding the image reconstruction pipeline. The paper mentions:

"The predicted representations are then passed through the VAR model to reconstruct pixel observations"

However, in the data preparation code, future_imgs are converted into discrete VAR tokens offline:

    for img_path in future_imgs:
        var_path = ipath_to_varpath(img_path)
        scale = scale_schedule[0]
        arr = torch.load(var_path, map_location='cpu')[scale].tolist()
        var_token = [f"<|{token_id}|>" for token_id in arr]
        var_token = "".join(var_token)

It looks like the training phase only performs standard next-token prediction on these offline tokens and text. I couldn't find the code where representations are fed back into the VAR decoder for pixel-level reconstruction during training. Could you please clarify this or point me to the relevant code?
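To make the question concrete, here is a minimal sketch of the round trip I would expect around the snippet above. It assumes the `<|id|>` string format is the only interface between the VAR tokenizer and the language model. The helper names `ids_to_var_string` and `var_string_to_ids` are hypothetical and not taken from the repository. The step that would feed the recovered ids into the VAR decoder is deliberately left out, since that is the code I could not locate.

```python
import re

# Matches one serialized VAR token such as "<|4095|>" and captures the integer id.
TOKEN_PATTERN = re.compile(r"<\|(\d+)\|>")


def ids_to_var_string(token_ids):
    """Serialize integer ids the same way the data preparation snippet does."""
    return "".join(f"<|{token_id}|>" for token_id in token_ids)


def var_string_to_ids(text):
    """Hypothetical inverse: recover integer ids from predicted token text."""
    return [int(match) for match in TOKEN_PATTERN.findall(text)]


# Example ids are made up for illustration only.
ids = [12, 4095, 7]
encoded = ids_to_var_string(ids)
decoded = var_string_to_ids(encoded)
```

If this matches the intended design, pixel reconstruction would only need the decoded ids at inference time, and training would never touch the VAR decoder.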
