Summary
ProcessorMixin._load_tokenizer_from_pretrained in v5.x unconditionally appends the
sub-processor attribute name as a subfolder when loading a non-primary tokenizer:
# src/transformers/processing_utils.py, ~L1457-1467
is_primary = sub_processor_type == "tokenizer"
if is_primary:
tokenizer = AutoTokenizer.from_pretrained(path, subfolder=subfolder, **kwargs)
else:
tokenizer_subfolder = os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
tokenizer = AutoTokenizer.from_pretrained(path, subfolder=tokenizer_subfolder, **kwargs)
This breaks loading of processors whose tokenizer files live at the repo root but whose
sub-processor attribute is named anything other than tokenizer — e.g.
UniversalActionProcessor from physical-intelligence/fast,
which uses bpe_tokenizer as the attribute name.
Under transformers v4.x this case worked because the loader could fall back to root.
Repro
# transformers==5.2.0
from transformers import AutoProcessor
proc = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
Fails with ValueError: Couldn't instantiate the backend tokenizer ....
AutoTokenizer.from_pretrained("physical-intelligence/fast") (no subfolder arg) works,
confirming the issue is specifically the forced subfolder.
Suggested fix
When the subfolder lookup returns no files (or all .no_exist), fall back to loading from
root with a deprecation warning. This restores v4.x behavior while preserving the v5 intent
of supporting multiple sub-tokenizers in subfolders.
The cleanest fix is probably to try root first and only escalate to subfolder if the
attribute-named subdirectory exists in the repo file listing.
Summary
ProcessorMixin._load_tokenizer_from_pretrainedin v5.x unconditionally appends thesub-processor attribute name as a subfolder when loading a non-primary tokenizer:
This breaks loading of processors whose tokenizer files live at the repo root but whose
sub-processor attribute is named anything other than
tokenizer— e.g.UniversalActionProcessorfromphysical-intelligence/fast,which uses
bpe_tokenizeras the attribute name.Under transformers v4.x this case worked because the loader could fall back to root.
Repro
Fails with
ValueError: Couldn't instantiate the backend tokenizer ....AutoTokenizer.from_pretrained("physical-intelligence/fast")(no subfolder arg) works,confirming the issue is specifically the forced subfolder.
Suggested fix
When the subfolder lookup returns no files (or all
.no_exist), fall back to loading fromroot with a deprecation warning. This restores v4.x behavior while preserving the v5 intent
of supporting multiple sub-tokenizers in subfolders.
The cleanest fix is probably to try root first and only escalate to subfolder if the
attribute-named subdirectory exists in the repo file listing.