Hi authors, thanks for the great work!
I have a question regarding the image reconstruction pipeline. The paper mentions:
"The predicted representations are then passed through the VAR model to reconstruct pixel observations"
However, in the data preparation code, future_imgs are converted into discrete VAR tokens offline:

    for img_path in future_imgs:
        var_path = ipath_to_varpath(img_path)
        scale = scale_schedule[0]
        arr = torch.load(var_path, map_location='cpu')[scale].tolist()
        var_token = [f"<|{token_id}|>" for token_id in arr]
        var_token = "".join(var_token)

It looks like the training phase only performs standard next-token prediction on these offline tokens and text. I couldn't find the code where representations are fed back into the VAR decoder for pixel-level reconstruction during training. Could you please clarify this or point me to the relevant code?
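
To make the question concrete, here is a minimal sketch of how I currently understand the training objective. It is my own assumption, not code from the repository: `to_var_token_string` and `next_token_loss` are hypothetical names, the token ids are toy values, and the uniform logits stand in for the model's output. The point is that the loss is computed purely over token ids, and nothing in it decodes tokens back to pixels.

```python
import math


def to_var_token_string(token_ids):
    # Mirrors the data-preparation snippet: one "<|id|>" marker per VAR token.
    return "".join(f"<|{token_id}|>" for token_id in token_ids)


def next_token_loss(logits, target_ids):
    """Mean cross-entropy of logits[t] against the next token target_ids[t + 1]."""
    steps = len(target_ids) - 1
    total = 0.0
    for t in range(steps):
        row = logits[t]
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target_ids[t + 1]]
    return total / steps


# Hypothetical toy data: 4 offline VAR token ids over a vocabulary of size 8.
token_ids = [3, 1, 4, 1]
vocab_size = 8

# Placeholder for model outputs: uniform logits at every position (assumption).
uniform_logits = [[0.0] * vocab_size for _ in token_ids]

prompt = to_var_token_string(token_ids)
loss = next_token_loss(uniform_logits, token_ids)
```

If training really looks like this, the VAR decoder would only be needed at inference time to turn predicted tokens into images. Is that the intended reading of the sentence in the paper?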