ValueError: There is no module or parameter named 'base_model' in Qwen2_5_VLForConditionalGeneration #418

@luyouqi233

The error is as follows:

Traceback (most recent call last):
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 33, in <module>
    main()
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 29, in main
    pipeline.run()
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/fs/fast/ROLL/roll/pipeline/rlvr/rlvr_custom_vlm_pipeline.py", line 471, in run
    model_update_metrics: Dict = self.model_update(global_step)
  File "/fs/fast/ROLL/roll/pipeline/base_pipeline.py", line 74, in model_update
    metrics.update(model_update_group.model_update(global_step))
  File "/fs/fast/ROLL/roll/distributed/executor/model_update_group.py", line 35, in model_update
    dataprotos: list[DataProto] = ray.get(
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ActorWorker.start_model_update() (pid=46829, ip=10.0.0.2, actor_id=2f74ad18013a46002d87ddc201000000, repr=ActorWorker(actor_train-0-G5))
  File "/fs/fast/ROLL/roll/distributed/executor/worker.py", line 188, in start_model_update
    exec_metrics: Dict = self.strategy.model_update(*args, **kwargs)
  File "/fs/fast/ROLL/roll/distributed/strategy/deepspeed_strategy.py", line 593, in model_update
    return self.weight_updaters[model_update_name].model_update()
  File "/fs/fast/ROLL/roll/third_party/deepspeed/model_update.py", line 79, in model_update
    return self._colocated_model_update()
  File "/fs/fast/ROLL/roll/third_party/deepspeed/model_update.py", line 167, in _colocated_model_update
    ray.get(refs)
ray.exceptions.RayTaskError: ray::InferWorker.update_parameter_in_bucket() (pid=47376, ip=10.0.0.2, actor_id=c8f1bcaf9f05bbffadab9c2f01000000, repr=InferWorker(actor_infer-0-G45))
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/fs/fast/ROLL/roll/pipeline/base_worker.py", line 473, in update_parameter_in_bucket
    await self.strategy.update_parameter_in_bucket(*args, **kwargs)
  File "/fs/fast/ROLL/roll/distributed/strategy/vllm_strategy.py", line 348, in update_parameter_in_bucket
    await self.model.update_parameter_in_bucket(serialized_named_tensors, is_lora)
  File "/fs/fast/ROLL/roll/third_party/vllm/async_llm.py", line 22, in update_parameter_in_bucket
    await self.engine_core.collective_rpc_async(method="update_parameter_in_bucket", args=(serialized_named_tensors, is_lora))
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 747, in collective_rpc_async
    return await self.call_utility_async("collective_rpc", method, timeout,
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 678, in call_utility_async
    return await self._call_utility_async(method,
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 691, in _call_utility_async
    return await future
Exception: Call to collective_rpc method failed: ray::RayWorkerWrapper.execute_method() (pid=49104, ip=10.0.0.2, actor_id=380289d821e6833eeceacd0003000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x14da328ebb50>)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 621, in execute_method
    raise e
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 612, in execute_method
    return run_method(self, method, args, kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/utils.py", line 2378, in run_method
    return func(*args, **kwargs)
  File "/fs/fast/ROLL/roll/third_party/vllm/worker.py", line 144, in update_parameter_in_bucket
    self.load_weights([(name, weight) for name, weight in named_params])
  File "/fs/fast/ROLL/roll/third_party/vllm/worker.py", line 74, in load_weights
    self.model_runner.model.load_weights(weights=weights)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1116, in load_weights
    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 250, in _load_module
    raise ValueError(msg)
ValueError: There is no module or parameter named 'base_model' in Qwen2_5_VLForConditionalGeneration
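
One possible reading of the last frames: the weight names handed to `load_weights` still start with `base_model`, which is the prefix PEFT adds when it wraps a model for LoRA, while the vLLM model expects plain Hugging Face style names. The sketch below only illustrates that naming mismatch. The helpers `strip_peft_prefix` and `is_adapter_weight` and the sample names are hypothetical; they are not part of ROLL, vLLM, or PEFT, and are not taken from the log.

```python
# Hypothetical helpers for illustration only; not part of ROLL, vLLM, or PEFT.
PEFT_PREFIX = "base_model.model."  # prefix PEFT adds when it wraps a model


def strip_peft_prefix(name: str) -> str:
    """Map a PEFT-wrapped parameter name back to the plain HF-style name."""
    if name.startswith(PEFT_PREFIX):
        name = name[len(PEFT_PREFIX):]
    # LoRA-wrapped linear layers keep the original weight under `.base_layer`.
    return name.replace(".base_layer.", ".")


def is_adapter_weight(name: str) -> bool:
    """True for LoRA adapter tensors, which have no counterpart in the base model."""
    return ".lora_A." in name or ".lora_B." in name


# Sample names as a PEFT-wrapped model might report them (assumed, not from the log).
wrapped_names = [
    "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.input_layernorm.weight",
]

base_names = [strip_peft_prefix(n) for n in wrapped_names if not is_adapter_weight(n)]
print(base_names)
```

Renaming alone would not be a complete fix: the base weights would still need the LoRA delta merged in, or the adapter tensors would have to be loaded through whatever LoRA path the inference side supports.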

The YAML config is as follows:

defaults:
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "qwen2_5_vl_7B_custom_config2"
seed: 42
logging_dir: ./output2/logs
output_dir: ./output2
system_envs:
  # Note: On ARM with cuDNN + TE <= 2.10, fused attention may segfault. Disable it to avoid the issue.
  # Disable TE Fuse_attention 
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:False
  NVTE_FLASH_ATTN: '1'
  NVTE_FUSED_ATTN: '0'
  NVTE_UNFUSED_ATTN: '0'
  # Disable Flash_mla 
  NVTE_FLASH_MLA: '0'

checkpoint_config:
  type: file_system
  output_dir: /fs/fast/ROLL/playground/models/${exp_name}

track_with: wandb
tracker_kwargs:
  log_dir: /fs/fast/ROLL/playground/wandb/roll_exp/rlvr_custom

num_nodes: 1
num_gpus_per_node: 4

save_steps: 20
logging_steps: 1
resume_from_checkpoint: false

adv_estimator: "grpo"
loss_agg_mode: "token-mean"

rollout_batch_size: 64
num_return_sequences_in_group: 8
is_num_return_sequences_expand: true
prompt_length: 1536
response_length: 2048
async_generation_ratio: 1

ppo_epochs: 1
pg_clip: 0.2
pg_clip_low: 0.2
pg_clip_high: 0.28
use_pg_clip_range: true

value_clip: 0.5
reward_clip: 10
advantage_clip: 10.0
whiten_advantages: false

init_kl_coef: 0.0
use_kl_loss: true
kl_loss_coef: 1.0e-3

# lora
lora_target: 'model\.layers\.\d+\.(mlp\.(down_proj|gate_proj|up_proj)|self_attn\.(k_proj|q_proj|v_proj|o_proj))'
lora_rank: 64
lora_alpha: 128

pretrain: /fs/fast/pretrained_models/Qwen2.5-VL-7B-Instruct

actor_train:
  model_args:
    flash_attn: fa2
    attn_implementation: fa2
    # Recomputed tensor size does not match for LoRA with Zero3 when activating checkpointing, See https://github.com/huggingface/transformers/issues/34928 for details
    disable_gradient_checkpointing: true
    dtype: bf16
    lora_target: ${lora_target}
    lora_rank: ${lora_rank}
    lora_alpha: ${lora_alpha}
    model_type: ~
  training_args:
    learning_rate: 2.0e-5
    min_lr: 1.0e-6
    weight_decay: 1.0e-2
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_ratio: 0.1
    num_train_epochs: 1
    lr_scheduler_type: cosine_with_min_lr
  data_args:
    template: qwen2-vl
    dataset_type: parquet
    dataset_dir: /fs/fast/data_filtering_mcq/files/hard/
    file_name: 
      - emoset_parquet/train-00000-of-00001.parquet
      - pisc_parquet/train-00000-of-00001.parquet
      - illusory_parquet/train-00000-of-00001.parquet
      - hateful_memes_parquet/train-00000-of-00001.parquet
      - sketchy_parquet/train-00000-of-00001.parquet
    domain_interleave_probs:
      llm_judge: 1.0
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(0,4))
  infer_batch_size: 2

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    lora_target: ${lora_target}
    lora_rank: ${lora_rank}
    lora_alpha: ${lora_alpha}
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 1.0
    top_k: -1
    num_beams: 1
    temperature: 1.0
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen2-vl
  strategy_args:
    strategy_name: vllm
    strategy_config:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 2560
      enable_prefix_caching: false
      disable_mm_preprocessor_cache: true # RAM leak: https://github.com/vllm-project/vllm/issues/15085
  device_mapping: list(range(0,4))
  infer_batch_size: 2

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    # In transformers>=4.50.0, if model.from_pretrained with auto device_map, None
    # tp_plan (and tp_plan of model is not None) and WORLD_SIZE>1, TP would be used.
    # Thus using device_map=0 to disable HF transformers parallel, otherwise use
    # zero3 for reference model
    device_map: "cuda:0"
    model_type: ~
  data_args:
    template: qwen2-vl
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,4))
  infer_batch_size: 2

rewards:
  llm_judge:
    worker_cls: roll.pipeline.rlvr.rewards.custom_reward_worker.CustomRewardWorker
    judge_model_type: inference
    judge_num_per_query: 4
    model_args:
      model_name_or_path: /fs/fast/pretrained_models/Qwen3-4B-Instruct-2507
      attn_implementation: fa2
      dtype: bf16
      model_type: ~
    data_args:
      template: qwen3
    strategy_args:
      strategy_name: hf_infer
      strategy_config: ~
    device_mapping: list(range(0,4))
    infer_batch_size: 2
