Regression in v5.x: ProcessorMixin._load_tokenizer_from_pretrained forces subfolder for non-primary sub-tokenizers, breaking repos that put tokenizer files at root #46153

@HanfengLiao

Summary

ProcessorMixin._load_tokenizer_from_pretrained in v5.x unconditionally appends the
sub-processor attribute name as a subfolder when loading a non-primary tokenizer:

# src/transformers/processing_utils.py, ~L1457-1467
is_primary = sub_processor_type == "tokenizer"
if is_primary:
    tokenizer = AutoTokenizer.from_pretrained(path, subfolder=subfolder, **kwargs)
else:
    tokenizer_subfolder = os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
    tokenizer = AutoTokenizer.from_pretrained(path, subfolder=tokenizer_subfolder, **kwargs)

This breaks loading of processors whose tokenizer files live at the repo root but whose
sub-processor attribute is named anything other than tokenizer — e.g.
UniversalActionProcessor from physical-intelligence/fast,
which uses bpe_tokenizer as the attribute name.
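
To make the effect concrete, here is a minimal sketch that reproduces only the path computation from the branch quoted above. The helper name forced_tokenizer_subfolder is hypothetical and exists just for illustration; it is not part of transformers.

```python
import os


def forced_tokenizer_subfolder(sub_processor_type, subfolder=""):
    """Hypothetical helper mirroring the v5.x branch quoted in this issue.

    Returns the subfolder that would be passed to AutoTokenizer.from_pretrained.
    """
    if sub_processor_type == "tokenizer":
        # Primary tokenizer: the caller's subfolder is used unchanged.
        return subfolder
    # Non-primary tokenizer: the attribute name is always appended.
    return os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type


# The primary tokenizer is looked up where the caller asked.
print(repr(forced_tokenizer_subfolder("tokenizer")))
# An attribute named bpe_tokenizer is looked up under a bpe_tokenizer/ subfolder,
# even when the repo keeps its tokenizer files at the root.
print(repr(forced_tokenizer_subfolder("bpe_tokenizer")))
```

So for a repo with tokenizer files at the root, any attribute name other than tokenizer sends the lookup to a directory that does not exist.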

Under transformers v4.x this case worked because the loader could fall back to root.

Repro

# transformers==5.2.0
from transformers import AutoProcessor
proc = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

Fails with ValueError: Couldn't instantiate the backend tokenizer ....
AutoTokenizer.from_pretrained("physical-intelligence/fast") (no subfolder arg) works,
confirming the issue is specifically the forced subfolder.

Suggested fix

When the subfolder lookup returns no files (or all .no_exist), fall back to loading from
root with a deprecation warning. This restores v4.x behavior while preserving the v5 intent
of supporting multiple sub-tokenizers in subfolders.

Alternatively, and probably cleanest: default to root and only escalate to the subfolder if
the attribute-named subdirectory exists in the repo file listing.
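
A rough sketch of the listing-based option, under some assumptions: resolve_tokenizer_subfolder is a hypothetical helper, not an existing transformers function, and repo_files is assumed to be a list of repo-relative paths with forward slashes (for a Hub repo this could come from huggingface_hub.list_repo_files; a local directory would need its own walk). The file names in the example are made up for illustration.

```python
import posixpath


def resolve_tokenizer_subfolder(repo_files, sub_processor_type, subfolder=""):
    """Hypothetical helper: pick the subfolder to pass to AutoTokenizer.from_pretrained.

    Uses the attribute-named subdirectory only when the file listing shows
    it exists; otherwise falls back to the caller's subfolder (root by default).
    """
    if sub_processor_type == "tokenizer":
        return subfolder
    candidate = posixpath.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
    prefix = candidate + "/"
    # Escalate to the subfolder only if at least one file lives under it.
    if any(path.startswith(prefix) for path in repo_files):
        return candidate
    return subfolder


# Illustrative listings (file names are assumptions, not taken from a real repo).
root_layout = ["tokenizer.json", "tokenizer_config.json", "processor_config.json"]
nested_layout = ["tokenizer.json", "bpe_tokenizer/tokenizer.json"]

print(repr(resolve_tokenizer_subfolder(root_layout, "bpe_tokenizer")))
print(repr(resolve_tokenizer_subfolder(nested_layout, "bpe_tokenizer")))
```

Deciding from the listing avoids a failed load attempt followed by a retry, at the cost of one extra listing call for remote repos.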
