AttributeError: 'MegatronHFTokenizer' object has no attribute 'additional_special_tokens' — all T5 / BertWordPieceCase nightlies broken #4824

@balasaajay

Description

Every T5 functional test in the nightly pipeline currently dies during dataset construction with an AttributeError on the MegatronHFTokenizer wrapper. The training process never reaches step 0; the failure is deterministic and reproduces on every rank.

The error is raised from T5MaskedWordPieceDatasetConfig.__post_init__ when it dereferences tokenizer.additional_special_tokens_ids. That property delegates to the MegatronHFTokenizer property of the same name, which reads self.additional_special_tokens on the wrapper, an attribute the wrapper never sets in __init__. The attribute used to be set as a side effect of add_special_tokens() copying attributes named in the underlying HuggingFace tokenizer's SPECIAL_TOKENS_ATTRIBUTES list, but that side effect no longer fires under the transformers version installed in the current mcore-pyt-dev / mcore-pyt-lts containers.
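The mechanism described above can be illustrated with a small stand-alone sketch. Every class and function here (ToyHFTokenizer, ToyWrapper, probe) is a hypothetical stand-in written for this illustration, not the Megatron-LM or transformers code; the only thing it mirrors is the shape of the failure: a property that reads an attribute which is only ever assigned as a side effect of a copy loop.

```python
# Toy stand-ins only; these are NOT the Megatron-LM or transformers classes.


class ToyHFTokenizer:
    """Hypothetical underlying tokenizer."""

    def __init__(self, attribute_names):
        # Stand-in for the list of attribute names the wrapper copies from.
        # An empty list mimics a library version where nothing gets copied.
        self.SPECIAL_TOKENS_ATTRIBUTES = list(attribute_names)
        self.additional_special_tokens = ["<extra_id_0>", "<extra_id_1>"]
        self.vocab = {"<extra_id_0>": 100, "<extra_id_1>": 101}


class ToyWrapper:
    """Hypothetical wrapper that relies on a side effect to gain attributes."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Note: additional_special_tokens is never assigned here.

    def add_special_tokens(self):
        # Side effect: copy whichever attributes the underlying list names.
        for name in self.tokenizer.SPECIAL_TOKENS_ATTRIBUTES:
            setattr(self, name, getattr(self.tokenizer, name))

    def token_to_id(self, token):
        return self.tokenizer.vocab[token]

    @property
    def additional_special_tokens_ids(self):
        # Fails with AttributeError if the copy loop never set the attribute.
        return [self.token_to_id(t) for t in self.additional_special_tokens]


def probe(attribute_names):
    """Return the ids, or the AttributeError message if the lookup fails."""
    wrapper = ToyWrapper(ToyHFTokenizer(attribute_names))
    wrapper.add_special_tokens()
    try:
        return wrapper.additional_special_tokens_ids
    except AttributeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(probe(["additional_special_tokens"]))  # attribute copied, ids resolve
    print(probe([]))                             # nothing copied, lookup fails
```

In this toy, the wrapper works only while the copy loop happens to name the attribute; once the list stops naming it, the property fails on first access, which matches the symptom reported here.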

Affected scope

- All T5 pretraining functional tests (e.g. t5_11b_mcore_tp4_pp1, t5_mcore_tp1_pp1_vp1, t5_mcore_tp4_pp1, t5_mcore_te_tp4_pp1, etc.).
- Any other recipe whose training path goes through MegatronHFTokenizer and reaches additional_special_tokens_ids, primarily anything that uses --tokenizer-type BertWordPieceCase (T5 sentinel tokens) or BertWordPieceLowerCase.
- BERT pretraining recipes are not affected because they don't read additional_special_tokens_ids during dataset init.

Reproducer

- Run any T5 pretraining functional test on the current ci-nightly ref using the current mcore-pyt-dev (or mcore-pyt-lts) container, for example t5_11b_mcore_tp4_pp1 on dgxh100_coreweave or t5_mcore_tp1_pp1_vp1 on dgxa100_dracooci-ord.
- The job fails on rank 0 (and on all other ranks) immediately after tokenizer initialization, before any training step. The outer job runs for roughly 20-30 minutes, most of it spent in JET pipeline retries.

Stack trace (verbatim from rank 0 stderr)

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/megatron-lm/pretrain_t5.py", line 275, in <module>
[rank0]:     pretrain(
[rank0]:   File "/opt/megatron-lm/megatron/training/training.py", line 1151, in pretrain
[rank0]:     build_train_valid_test_data_iterators(train_valid_test_dataset_provider)
[rank0]:   File "/opt/megatron-lm/megatron/training/training.py", line 3941, in build_train_valid_test_data_iterators
[rank0]:     train_dataloader, valid_dataloaders, test_dataloader = build_train_valid_test_data_loaders(
[rank0]:   File "/opt/megatron-lm/megatron/training/training.py", line 3898, in build_train_valid_test_data_loaders
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(build_train_valid_test_datasets_provider)
[rank0]:   File "/opt/megatron-lm/megatron/training/training.py", line 3844, in build_train_valid_test_datasets
[rank0]:     return build_train_valid_test_datasets_provider(train_valid_test_num_samples)
[rank0]:   File "/opt/megatron-lm/pretrain_t5.py", line 208, in train_valid_test_datasets_provider
[rank0]:     config = T5MaskedWordPieceDatasetConfig(
[rank0]:   File "<string>", line 26, in __init__
[rank0]:   File "/opt/megatron-lm/megatron/core/datasets/t5_dataset.py", line 45, in __post_init__
[rank0]:     assert len(self.tokenizer.additional_special_tokens_ids) > 0
[rank0]:   File "/opt/megatron-lm/megatron/core/tokenizers/text/text_tokenizer.py", line 195, in additional_special_tokens_ids
[rank0]:     return self._tokenizer.additional_special_tokens_ids
[rank0]:   File "/opt/megatron-lm/megatron/core/tokenizers/text/libraries/huggingface_tokenizer.py", line 218, in additional_special_tokens_ids
[rank0]:     return [self.token_to_id(token) for token in self.additional_special_tokens]
[rank0]: AttributeError: 'MegatronHFTokenizer' object has no attribute 'additional_special_tokens'. Did you mean: 'ad
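The last frame of the trace shows the wrapper property reading self.additional_special_tokens directly. One possible mitigation, sketched here under assumptions, is for the property to stop depending on a copied attribute and fall back to the underlying tokenizer when the wrapper has none. ToyHFTokenizer, BareTokenizer and GuardedWrapper are hypothetical stand-ins, not the actual Megatron-LM classes, and this report does not establish whether the real underlying tokenizer still exposes additional_special_tokens under the installed transformers version.

```python
# Toy stand-ins only; these are NOT the Megatron-LM or transformers classes.


class ToyHFTokenizer:
    """Hypothetical underlying tokenizer that knows its sentinel tokens."""

    def __init__(self):
        self.additional_special_tokens = ["<extra_id_0>", "<extra_id_1>"]
        self.vocab = {"<extra_id_0>": 100, "<extra_id_1>": 101}


class BareTokenizer:
    """Hypothetical underlying tokenizer with no sentinel tokens at all."""

    def __init__(self):
        self.vocab = {}


class GuardedWrapper:
    """Hypothetical wrapper whose property does not rely on a copied attribute."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def token_to_id(self, token):
        return self.tokenizer.vocab[token]

    @property
    def additional_special_tokens_ids(self):
        # Prefer an attribute set on the wrapper, if some code path set one.
        tokens = getattr(self, "additional_special_tokens", None)
        if tokens is None:
            # Assumption: the underlying tokenizer may expose the same name.
            tokens = getattr(self.tokenizer, "additional_special_tokens", [])
        return [self.token_to_id(t) for t in tokens]


if __name__ == "__main__":
    print(GuardedWrapper(ToyHFTokenizer()).additional_special_tokens_ids)
    print(GuardedWrapper(BareTokenizer()).additional_special_tokens_ids)
```

With a guard of this kind, a tokenizer that genuinely has no sentinel tokens would yield an empty list, so the existing assert in __post_init__ would fail with an AssertionError that points at the real condition instead of an AttributeError from inside the wrapper.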

Labels: bug
