VAR reconstruction in training

Hi authors, thanks for the great work!

I have a question regarding the image reconstruction pipeline. The paper mentions:
> "The predicted representations are then passed through the VAR model to reconstruct pixel observations"

However, in the data preparation code, `future_imgs` are converted into discrete VAR tokens offline:
```python
for img_path in future_imgs:
    var_path = ipath_to_varpath(img_path)
    scale = scale_schedule[0]
    arr = torch.load(var_path, map_location='cpu')[scale].tolist()
    var_token = [f"<|{token_id}|>" for token_id in arr]
    var_token = "".join(var_token)
```

It looks like the training phase only performs standard next-token prediction on these offline `<var>` tokens and the `<answer>` text. I couldn't find any code where the predicted representations are fed back into the VAR decoder for pixel-level reconstruction during training. Could you please clarify whether that step exists, or point me to the relevant code?
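To make sure I am reading the preprocessing correctly, here is a minimal sketch of the token round trip as I understand it. The first function mirrors the string formatting in the snippet above; the second is a hypothetical inverse I wrote for illustration (it is not from the repo) showing what would be needed to turn generated text back into ids before any VAR decoding. Both function names are my own, and I am assuming `arr` is a flat list of integer ids.

```python
import re


def ids_to_var_string(token_ids):
    """Mirror the quoted preprocessing: wrap each id as <|id|> and concatenate."""
    return "".join(f"<|{token_id}|>" for token_id in token_ids)


def var_string_to_ids(var_string):
    """Hypothetical inverse (not from the repo): recover integer ids from the string."""
    return [int(match) for match in re.findall(r"<\|(\d+)\|>", var_string)]


# Assumed example ids; real ids would come from the saved VAR token files.
example_ids = [12, 4095, 7]
encoded = ids_to_var_string(example_ids)
decoded = var_string_to_ids(encoded)
```

If this matches the intended design, then the only place pixels could be reconstructed is after a step like `var_string_to_ids`, which is the part I could not locate in the training code.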
