Commit 749eeef

Authored by kcz358, github-actions[bot], KemingWu, charlesswu, and mwxely

[WIP] feat: Transformers 5.0 compatibility (#142)
* feat(models): add transformers 5.0 compatibility

  Conditionally import models incompatible with transformers >= 5.0:
  - dream_dllm, qwen3_dllm, llada_dllm require transformers < 5.0
  - llava_onevision1_5 requires transformers < 5.0
  - Dynamically update __all__ based on the transformers version
  - Prevents ImportError when using transformers 5.0+

* fix(train): add group_by_length for backward compatibility

  Add the group_by_length parameter to TrainingArguments to maintain compatibility with existing training configurations.

* feat(deps): allow transformers >= 4.57.1

  Update the transformers dependency from an exact pin to a minimum version to support transformers 5.0+ while maintaining backward compatibility.

* style: auto-fix lint (black + isort)

* refactor(processor): replace additional_special_tokens with all_special_tokens

  Use all_special_tokens for transformers >= 5.0 compatibility while maintaining backward compatibility with transformers < 5.0. Changes:
  - Add a special_tokens property to all processor classes
  - Use all_special_tokens if available (transformers >= 5.0)
  - Fall back to additional_special_tokens (transformers < 5.0)
  - Add <|im_start|> and <|im_end|> tokens to the special_tokens list
  - Cache special_tokens as an instance attribute for performance

  Affected processors:
  - AeroDataProcessor (base class)
  - BaseQwen2_5_DataProcessor (inherits from AeroDataProcessor)
  - Qwen2VLDataProcessor
  - Qwen2DataProcessor
  - LLaVADataProcessor
  - LLaVAVideoDataProcessor (inherits from LLaVADataProcessor)
  - NanovlmDataProcessor
  - Qwen3_VLDataProcessor (inherits from BaseQwen2_5_DataProcessor)

* style: auto-fix lint (black + isort)

* refactor(processor): unify apply_chat_template usage

  Use processor.apply_chat_template with tokenize=True consistently across all processors instead of mixing it with processor.tokenizer calls.
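The special_tokens fallback described above can be sketched as a small helper. This is a hypothetical mirror of the behavior the commit describes, not the actual `lmms_engine` implementation: prefer `all_special_tokens` when the tokenizer exposes it, fall back to `additional_special_tokens`, and append the chat-marker tokens.

```python
def get_special_tokens(tokenizer, extra_tokens=None):
    """Collect special tokens in a transformers-version-agnostic way (sketch)."""
    if hasattr(tokenizer, "all_special_tokens"):
        tokens = list(tokenizer.all_special_tokens)  # transformers >= 5.0 path
    else:
        tokens = list(tokenizer.additional_special_tokens)  # transformers < 5.0 path
    # Append chat markers such as <|im_start|> / <|im_end|> without duplicating
    for tok in extra_tokens or []:
        if tok not in tokens:
            tokens.append(tok)
    return tokens
```

Caching the result on the instance (as the commit does via a `special_tokens` property) avoids recomputing the list on every sample.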
  Changes:
  - aero_processor: use processor.apply_chat_template(tokenize=True)[0]
  - base_qwen2_5_processor: use processor.apply_chat_template(tokenize=True)[0]
  - qwen2_vl_processor: use processor.apply_chat_template(tokenize=True)
  - qwen3_vl_processor: use processor.apply_chat_template(tokenize=True)[0]

  This ensures all processors return token IDs directly during data preparation, improving consistency and reducing confusion.

* feat(models): add common_ops for transformer-agnostic rope index

  Extract the rope index calculation functions into common_ops/rope.py to ensure consistent behavior across transformers versions. Changes:
  - Add common_ops/rope.py with qwen2_5_vl_rope_index and qwen3_vl_get_rope_index
  - Update qwen2_5_vl_ops.py to use qwen2_5_vl_rope_index
  - Update qwen3_vl_ops.py to use qwen3_vl_get_rope_index
  - Update qwen3_vl_moe_ops.py to use qwen3_vl_get_rope_index

  This keeps rope index calculations stable even when transformers internal implementations change.

* fix(utils): add B200/B300 GPU FLOPS support

  Add NVIDIA B200/B300 GPU FLOPS (2.25e15) to get_device_flops() to fix the MFU calculation returning 0 on B200 GPUs. Previously, unknown GPU types returned inf FLOPS, causing MFU to always be 0.
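The FLOPS-lookup fix above can be sketched as a name-based table. The B200/B300 value comes from the commit message; the other entries and the function shape are illustrative assumptions, not the actual `get_device_flops()` implementation:

```python
def get_device_flops(device_name: str) -> float:
    """Peak FLOPS per GPU family (sketch). Unknown devices return inf,
    which makes MFU = achieved / peak degrade to 0 rather than crash."""
    peak_flops = {
        "B200": 2.25e15,  # value from the commit message
        "B300": 2.25e15,  # value from the commit message
        "H100": 9.89e14,  # illustrative BF16 dense figure
        "A100": 3.12e14,  # illustrative BF16 dense figure
    }
    for key, flops in peak_flops.items():
        if key in device_name:  # substring match on e.g. "NVIDIA B200"
            return flops
    return float("inf")
```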
* Lint

* fix(models): qwen2_5_vl transformers 5.0 compatibility
  - Fix the vision_model variable reference in the liger kernel patch
  - Support nested text_config in lce_forward
  - Handle rope_scaling/rope_parameters for transformers 5.0+
  - Add qwen2_5_vl to the FlopsCounter model type mapping

* refactor(processor): use DataUtilities.apply_chat_template for transformers 5.0 compatibility
  - Add an apply_chat_template utility method to DataUtilities
  - Handle dict-like return values (BatchEncoding) with a use_key param
  - Handle nested list wrapping from some processors
  - Update all processors to use the unified method

* feat(launch): add filter_training_args for transformers 5.0 compatibility

  Filter unsupported TrainingArguments parameters by inspecting the transformers.TrainingArguments.__init__ signature, avoiding errors from deprecated or removed parameters in newer versions.

* fix(models): add parse_visual_output for transformers 5.0 compatibility

  Visual model methods (get_image_features, get_video_features, visual()) may return tuples OR dataclass objects (BaseModelOutputWithPooling, BaseModelOutputWithDeepstackFeatures) in transformers 5.0+. Add parse_visual_output() to transparently handle both return types.
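The filter_training_args approach mentioned above (drop config keys that the installed TrainingArguments no longer accepts, by inspecting its `__init__` signature) can be sketched as follows. The function body is an assumption; only the technique is taken from the commit message:

```python
import inspect

def filter_training_args(args_cls, config: dict) -> dict:
    """Keep only the keys that args_cls.__init__ actually accepts (sketch)."""
    accepted = set(inspect.signature(args_cls.__init__).parameters)
    accepted.discard("self")
    return {k: v for k, v in config.items() if k in accepted}
```

Because `transformers.TrainingArguments` is a dataclass, `inspect.signature` on its `__init__` lists every accepted field, so parameters removed in a newer release are silently dropped instead of raising `TypeError`.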
* [feat] Support Qwen3.5 Training (#143)
  * [feat] Support Qwen3_5 Training
  * style: auto-fix lint (black + isort)
  * [feat] Support Qwen3.5 Training
  * Optimize the Qwen3.5 dataset processing logic
  * Leave the flop function empty

  Co-authored-by: charlesswu <charlesswu@tencent.com>
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix(processor): remove duplicate special_tokens property in qwen2_vl_processor

* fix(models): remove duplicate .to() calls in qwen2_5_omni_liger

* fix(models): define input_ids_rmpad in the inputs_embeds branch to avoid a NameError

* refactor(models): extract parse_visual_output to common_ops/visual.py

* refactor(processor): extract the special_tokens logic to DataUtilities.get_special_tokens

* style: auto-fix lint (black + isort)

* docs: add Transformers 5.0 migration guide

  Add a comprehensive migration guide for transformers 5.0 compatibility, including a compatibility matrix, installation instructions, and troubleshooting for Qwen3.5 (requires >= 5.3.0) and legacy models (requires < 5.0.0).

---

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: wukeming <108406625+KemingWu@users.noreply.github.com>
Co-authored-by: charlesswu <charlesswu@tencent.com>
Co-authored-by: mwxely <yang0756@e.ntu.edu.sg>
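The parse_visual_output() idea referenced in the commit message can be sketched like this. Visual towers may return a plain tuple (older transformers) or an output dataclass such as BaseModelOutputWithPooling (transformers 5.0+); the field name used here is an assumption for illustration, not the actual `lmms_engine` code:

```python
def parse_visual_output(output):
    """Return the feature tensor regardless of the container type (sketch)."""
    if isinstance(output, tuple):
        return output[0]                 # tuple return: features come first
    if hasattr(output, "last_hidden_state"):
        return output.last_hidden_state  # dataclass-style model output
    return output                        # already a bare tensor
```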
1 parent 5ff50c0 commit 749eeef

31 files changed

Lines changed: 1359 additions & 73 deletions

docs/index.rst

Lines changed: 6 additions & 0 deletions

```diff
@@ -62,6 +62,12 @@ Welcome to the LMMs Engine documentation! LMMs Engine is a flexible and extensib
    models/qwen3_moe
    models/qwen3_omni_moe
 
+.. toctree::
+   :maxdepth: 1
+   :caption: Troubleshooting
+
+   troubleshoot/index
+
 Indices and tables
 ==================
```
docs/troubleshoot/index.rst

Lines changed: 9 additions & 0 deletions

```diff
@@ -0,0 +1,9 @@
+Troubleshooting
+===============
+
+Common issues and solutions for LMMs Engine.
+
+.. toctree::
+   :maxdepth: 2
+
+   transformers_5_migration
```
docs/troubleshoot/transformers_5_migration.md

Lines changed: 129 additions & 0 deletions
# Transformers 5.0 Migration Guide

This guide helps you migrate to transformers 5.0 while maintaining backward compatibility with older models.

## Overview

LMMs Engine now supports `transformers >= 5.0` while maintaining backward compatibility with `transformers 4.x`. This enables training with the latest models like Qwen3.5 while preserving support for existing models.

## Compatibility Matrix

| Model Family | transformers < 5.0 | transformers >= 5.0 | Minimum Version |
|-------------|-------------------|---------------------|-----------------|
| Qwen2.5-VL | ✅ | ✅ | - |
| Qwen3-VL | ✅ | ✅ | - |
| Qwen3 | ✅ | ✅ | - |
| **Qwen3.5** | ❌ | ✅ | **>= 5.3.0** |
| LLaVA-OneVision1.5 | ✅ | ❌ | < 5.0.0 |
| DLLM models (DreamDLLM, Qwen3DLLM, LLaDADLLM) | ✅ | ❌ | < 5.0.0 |
## Installation

### For Qwen3.5 Training (New Feature)

Qwen3.5 requires transformers 5.3.0 or higher:

```bash
pip install "transformers>=5.3.0"
```

Or with uv:

```bash
uv pip install "transformers>=5.3.0"
```

### For Legacy Models (LLaVA-OneVision1.5, DLLM)

If you need to use LLaVA-OneVision1.5 or DLLM models, install transformers 4.x:

```bash
pip install "transformers<5.0.0"
```

Or with uv:

```bash
uv pip install "transformers<5.0.0"
```
## Verified Compatibilities

The following models have been tested and verified:

### Tested with transformers >= 5.0

- **Qwen2.5-VL** - Fully compatible
- **Qwen3-VL** - Fully compatible
- **Qwen3** - Fully compatible

### Tested with transformers < 5.0

- **Qwen2.5-VL** - Fully compatible
- **Qwen3-VL** - Fully compatible
- **Qwen3** - Fully compatible
- **LLaVA-OneVision1.5** - Only compatible with < 5.0
- **DLLM models** - Only compatible with < 5.0
## How It Works

LMMs Engine automatically detects your transformers version and:

1. **With transformers >= 5.0**: Loads Qwen3.5 and all compatible models. Legacy models (LLaVA-OneVision1.5, DLLM) are excluded from imports.

2. **With transformers < 5.0**: Loads all legacy models. Qwen3.5 is not available.

The version check is performed at import time using `is_transformers_version_greater_or_equal_to()`.
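A plausible implementation of such a version gate is shown below. The real helper lives in `lmms_engine.utils.import_utils`; this sketch is an assumption, including the optional `installed` parameter (added here for testability) and the simplification that versions are plain `X.Y.Z` strings:

```python
from importlib.metadata import PackageNotFoundError, version

def is_transformers_version_greater_or_equal_to(target, installed=None):
    """Sketch of an import-time version gate. `installed` overrides the
    detected version (an assumption made here to keep the sketch testable)."""
    if installed is None:
        try:
            installed = version("transformers")
        except PackageNotFoundError:
            return False
    # Compare numerically, assuming plain dotted release strings like "4.57.1"
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(installed) >= parse(target)
```

Comparing parsed tuples rather than raw strings matters: string comparison would rank "4.57.1" above "5.0.0" because "5" < "57" lexicographically in the second component.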
## Troubleshooting

### Error: "Module not found" for Qwen3.5

**Symptom**: Trying to use Qwen3.5 but getting import errors.

**Solution**: Qwen3.5 requires transformers >= 5.3.0. Install the correct version:

```bash
pip install "transformers>=5.3.0"
```

### Error: "Module not found" for LLaVA-OneVision1.5 or DLLM

**Symptom**: Trying to use LLaVA-OneVision1.5 or DLLM models but they're not available.

**Solution**: These models are incompatible with transformers >= 5.0. Downgrade to transformers 4.x:

```bash
pip install "transformers<5.0.0"
```

### Error: ImportError when importing models

**Symptom**: `ImportError` or `ModuleNotFoundError` when importing specific models.

**Solution**: Check your transformers version and consult the compatibility matrix above. Ensure you're using the correct transformers version for your target model.
## Implementation Details

The compatibility is implemented through conditional imports in `src/lmms_engine/models/__init__.py`:

```python
from lmms_engine.utils.import_utils import is_transformers_version_greater_or_equal_to

is_transformers_5 = is_transformers_version_greater_or_equal_to("5.0.0")

# Models that work with both versions are always imported
from .qwen2_5_vl import apply_liger_kernel_to_qwen2_5_vl
from .qwen3_vl import apply_liger_kernel_to_qwen3_vl
from .qwen3 import apply_liger_kernel_to_qwen3

# Models only compatible with transformers < 5.0
if not is_transformers_5:
    from .llava_onevision1_5 import LLaVAOneVision1_5_ForConditionalGeneration
    from .dream_dllm import DreamDLLMForMaskedLM
    # ... other legacy models
```
## Related Resources

- [Qwen-VL Training Guide](../models/qwenvl.md)
- [Data Preparation Guide](../user_guide/data_prep.md)
- [Training Configuration](../getting_started/train.md)
Lines changed: 166 additions & 0 deletions

```yaml
trainer_type: fsdp2_trainer
dataset_config:
  extra_kwargs: {}
  dataset_type: qwen3_vl_iterable
  dataset_format: yaml
  processor_config:
    processor_name: Qwen/Qwen3-VL-8B-Instruct
    processor_type: qwen3_vl
  dataset_path: data/video/debug.yaml
  datasets: null
  shuffle: true
  eval_dataset_path: null
  object_storage: none
  bucket_name: null
  packing: false
  packing_strategy: first_fit
  packing_length: 51200
  filter_overlong: true
  filter_overlong_workers: 8
  max_length: null
  video_sampling_strategy: fps
  video_max_pixels: 50176
  video_max_frames: 512
  frame_num: 64
  fps: 1
  video_backend: qwen_vl_utils
trainer_args:
  output_dir: ./output/qwen3_5_training
  do_train: false
  do_eval: false
  do_predict: false
  eval_strategy: 'no'
  prediction_loss_only: false
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 8
  gradient_accumulation_steps: 1
  eval_accumulation_steps: null
  eval_delay: 0
  torch_empty_cache_steps: null
  learning_rate: 0.0002
  weight_decay: 0.0
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1.0e-08
  max_grad_norm: 1.0
  num_train_epochs: 1
  max_steps: 1000
  lr_scheduler_type: cosine
  lr_scheduler_kwargs: {}
  warmup_ratio: 0.1
  warmup_steps: 0
  log_level: passive
  log_level_replica: warning
  log_on_each_node: true
  logging_dir: ./output/qwen3_5_training/runs
  logging_strategy: steps
  logging_first_step: false
  logging_steps: 1
  logging_nan_inf_filter: true
  save_strategy: steps
  save_steps: 1000
  save_total_limit: 1
  save_on_each_node: false
  save_only_model: false
  restore_callback_states_from_checkpoint: false
  use_cpu: false
  seed: 42
  data_seed: null
  bf16: true
  fp16: false
  bf16_full_eval: false
  fp16_full_eval: false
  tf32: null
  local_rank: 0
  ddp_backend: null
  debug: []
  dataloader_drop_last: false
  eval_steps: null
  dataloader_num_workers: 0
  dataloader_prefetch_factor: null
  run_name: qwen3_5_debug
  disable_tqdm: false
  remove_unused_columns: true
  label_names: null
  load_best_model_at_end: false
  metric_for_best_model: null
  greater_is_better: null
  ignore_data_skip: false
  fsdp: []
  fsdp_config:
    transformer_layer_cls_to_wrap:
    - Qwen3_5DecoderLayer
    reshard_after_forward: false
    min_num_params: 0
    xla: false
    xla_fsdp_v2: false
    xla_fsdp_grad_ckpt: false
  accelerator_config:
    split_batches: false
    dispatch_batches: null
    even_batches: true
    use_seedable_sampler: true
    non_blocking: false
    gradient_accumulation_kwargs: null
  parallelism_config: null
  deepspeed: null
  label_smoothing_factor: 0.0
  optim: adamw_torch_fused
  optim_args: null
  length_column_name: length
  report_to: []
  project: huggingface
  trackio_space_id: trackio
  ddp_find_unused_parameters: null
  ddp_bucket_cap_mb: null
  ddp_broadcast_buffers: null
  dataloader_pin_memory: true
  dataloader_persistent_workers: false
  skip_memory_metrics: true
  push_to_hub: false
  resume_from_checkpoint: null
  hub_model_id: null
  hub_strategy: every_save
  hub_token: <HUB_TOKEN>
  hub_private_repo: null
  hub_always_push: false
  hub_revision: null
  gradient_checkpointing: true
  gradient_checkpointing_kwargs: null
  include_for_metrics: []
  eval_do_concat_batches: true
  auto_find_batch_size: false
  full_determinism: false
  ddp_timeout: 1800
  torch_compile: false
  torch_compile_backend: null
  torch_compile_mode: null
  include_num_input_tokens_seen: 'no'
  neftune_noise_alpha: null
  optim_target_modules: null
  batch_eval_metrics: false
  eval_on_start: false
  use_liger_kernel: true
  liger_kernel_config: null
  eval_use_gather_object: false
  average_tokens_across_devices: true
  use_muon: false
  freeze_modules: null
  use_rmpad: true
  fsdp2: true
  sp_ulysses_degree: 1
  reduce_dtype: bfloat16
  output_dtype: bfloat16
  print_batch_input_steps: 5
  enable_profiler: false
  profiler_config:
    start_step: 1
    end_step: 3
model_config:
  extra_kwargs: {}
  load_from_pretrained_path: Qwen/Qwen3.5-VL-8B-Instruct
  load_from_config: null
  attn_implementation: flash_attention_2
  overwrite_config: null
  monkey_patch_kwargs: null
extra_kwargs: null
```

pyproject.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -16,7 +16,7 @@ license = { text = "Apache-2.0" }
 dependencies = [
     "datasets",
     "hf_transfer",
-    "transformers==4.57.1",
+    "transformers>=4.57.1",
     "accelerate",
     "pillow",
     "peft",
```

src/lmms_engine/datasets/processor/aero_processor.py

Lines changed: 14 additions & 6 deletions

```diff
@@ -5,6 +5,7 @@
 from PIL import Image
 
 from lmms_engine.mapping_func import register_processor
+from lmms_engine.utils import DataUtilities
 
 from ...models.aero.processing_aero import AeroProcessor, AeroProcessorKwargs
 from .config import ProcessorConfig
@@ -19,6 +20,14 @@ def build(self):
         self.processor = self._build_processor()
         self.processor.chat_template = self.chat_template_no_system
 
+    @property
+    def special_tokens(self):
+        if not hasattr(self, "_special_tokens"):
+            self._special_tokens = DataUtilities.get_special_tokens(
+                self.processor.tokenizer, extra_tokens=["<|im_start|>", "<|im_end|>"]
+            )
+        return self._special_tokens
+
     def _build_processor(self):
         processor = AeroProcessor.from_pretrained(self.config.processor_name)
         return processor
@@ -88,9 +97,7 @@ def get_qwen_template_labels(
         system_message: str = "You are a helpful assistant",
         add_system_prompt: bool = True,
     ):
-        special_tokens = self.processor.tokenizer.additional_special_tokens
-        special_tokens.extend(["<|im_start|>", "<|im_end|>"])
-        unmask_tokens_idx = [self.processor.tokenizer.convert_tokens_to_ids(t) for t in special_tokens]
+        unmask_tokens_idx = [self.processor.tokenizer.convert_tokens_to_ids(t) for t in self.special_tokens]
         input_id, target = [], []
         # The purpose of start from is to record which mm token we are at. Supposing the format is interleaved
         # Then we need to record this so that the mm token can be expanded correctly per conversation
@@ -100,12 +107,13 @@ def get_qwen_template_labels(
         video_start_from = 0
 
         if add_system_prompt and hf_messages[0]["role"] != "system":
-            input_id += self.processor.tokenizer.apply_chat_template([{"role": "system", "content": system_message}])
+            input_id += DataUtilities.apply_chat_template(
+                self.processor, [{"role": "system", "content": [{"type": "text", "text": system_message}]}]
+            )
             target += [-100] * len(input_id)
         for message in hf_messages:
             role = message["role"]
-            # Cautions, qwen2_5 vl tokenizer wrap into a list
-            encode_id = self.processor.apply_chat_template([message], tokenize=True)[0]
+            encode_id = DataUtilities.apply_chat_template(self.processor, [message])
             if self.audio_token_id in encode_id:
                 encode_id, used_audio = self._expand_encode_id_audio_tokens(
                     encode_id, num_audio_tokens, audio_start_from
```