Hi,
I've noticed that the tokenizer is configured with tokenizer.pad_token = tokenizer.eos_token.
This poses a significant problem for models like Llama 3 that use eos_token (e.g., <|eot_id|>) as a semantic delimiter to separate turns in chat templates. During batch processing of multi-turn dialogues, the real eos_token marking the end of a turn gets incorrectly masked out in the attention_mask.
Problem:
When padding shorter sequences in a batch, the tokenizer's logic sets the attention_mask to False at every position whose token ID equals pad_token_id. Because pad_token_id is the same as eos_token_id, this also masks the eos_tokens that are part of the actual input, not just the padding. The result can be an attention_mask like [...True, True, False, True...], where the False corresponds to a meaningful eos_token.
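A minimal sketch of the collision, assuming the mask is derived by comparing token IDs against pad_token_id. The token IDs and the helper name mask_from_ids are hypothetical and only illustrate the effect; they are not taken from this project's code.

```python
# Hypothetical token IDs, chosen only for illustration.
EOS_ID = 2   # stands in for an end-of-turn token such as <|eot_id|>
PAD_ID = 0   # stands in for a dedicated padding token


def mask_from_ids(input_ids, pad_id):
    """Build an attention mask by marking every position equal to pad_id as False."""
    return [token != pad_id for token in input_ids]


# Two dialogue turns, each terminated by EOS_ID.
turns = [5, 6, EOS_ID, 7, 8, EOS_ID]

# Case 1: pad_token == eos_token. The two trailing positions are padding.
padded_with_eos = turns + [EOS_ID, EOS_ID]
collided = mask_from_ids(padded_with_eos, pad_id=EOS_ID)
print(collided)
# [True, True, False, True, True, False, False, False]
# The False values at index 2 and index 5 hide real end-of-turn tokens.

# Case 2: a distinct pad token. Only the padding is masked.
padded_with_pad = turns + [PAD_ID, PAD_ID]
clean = mask_from_ids(padded_with_pad, pad_id=PAD_ID)
print(clean)
# [True, True, True, True, True, True, False, False]
```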
Suggested Solution:
Would you consider either adding a new, distinct padding token (e.g., ) to the tokenizer and resizing the model's token embeddings, or using the pad_token that each model already defines (such as <|finetune_right_pad_id|> in Llama 3)? Either option would resolve the ambiguity and ensure correct attention masking during training.
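A rough sketch of what this could look like, assuming a Hugging Face transformers tokenizer and model are passed in. The helper name ensure_distinct_pad_token is hypothetical, and the pad token string is left to the caller; add_special_tokens, resize_token_embeddings, and config.pad_token_id are the existing transformers calls this relies on.

```python
def ensure_distinct_pad_token(tokenizer, model, pad_token):
    """Give the tokenizer a pad token that differs from its EOS token.

    Hypothetical helper: `tokenizer` and `model` are assumed to follow the
    Hugging Face transformers interfaces. Returns True if the pad token
    was changed, False if it was already distinct.
    """
    already_distinct = (
        tokenizer.pad_token_id is not None
        and tokenizer.pad_token_id != tokenizer.eos_token_id
    )
    if already_distinct:
        return False

    # add_special_tokens returns how many tokens were new to the vocabulary.
    # It returns 0 when pad_token already exists in the vocabulary.
    num_added = tokenizer.add_special_tokens({"pad_token": pad_token})

    # The embedding matrix only needs to grow when the vocabulary grew.
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))

    # Keep the model config in sync with the tokenizer.
    model.config.pad_token_id = tokenizer.pad_token_id
    return True
```

For a model that already ships a padding token, such as <|finetune_right_pad_id|>, passing that string as pad_token should leave the vocabulary size unchanged, so no resize would be triggered.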
Thanks for your great work on this project!