Enable DeepSpeed ZeRO-3 for rollout script #38

@XiangZhang-zx

Hi team,

When running sample/llada_rl_rollout.py with DeepSpeed ZeRO-3, LoRA, and modules_to_save, I hit an out-of-memory (OOM) error.
It seems that the current rollout script may not fully support ZeRO-3, or may require additional configuration to handle the increased memory footprint introduced by modules_to_save (e.g., keeping wte and ff_out trainable).

Could you please confirm whether ZeRO-3 is officially supported for rollout (and if so, what the correct setup is)?
If not currently supported, it would be great to include guidance or example configs for using llada_rl_rollout.py with ZeRO-3 in the documentation.
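
To make the question concrete, here is a minimal sketch of the kind of ZeRO-3 plus LoRA configuration being asked about. It is not taken from the repo and is not a confirmed working setup: the CPU offload settings, batch size, accumulation steps, LoRA rank, and the `target_modules` names are assumptions for illustration. Only `modules_to_save` (`wte`, `ff_out`) comes from the setup described above. The DeepSpeed keys are standard config fields, and `lora_kwargs` mirrors the arguments of PEFT's `LoraConfig`.

```python
import json


def build_zero3_config(offload_to_cpu: bool = True) -> dict:
    """Return a DeepSpeed config dict with ZeRO stage 3 enabled.

    All numeric values here are illustrative assumptions, not tuned settings.
    """
    zero = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the sharded 16-bit weights into full tensors when saving.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_to_cpu:
        # Offloading trades GPU memory for host memory and transfer time.
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        "train_micro_batch_size_per_gpu": 1,  # assumption
        "gradient_accumulation_steps": 8,     # assumption
    }


# Keyword arguments one might pass as peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 16,            # assumption
    "lora_alpha": 32,   # assumption
    # Hypothetical module names; the real ones depend on the model code.
    "target_modules": ["q_proj", "v_proj"],
    # From the issue: these modules are kept fully trainable, which adds
    # full-size trainable copies on top of the LoRA adapters.
    "modules_to_save": ["wte", "ff_out"],
}

if __name__ == "__main__":
    print(json.dumps(build_zero3_config(), indent=2))
```

Whether sample/llada_rl_rollout.py can consume a config like this, and which of these settings matter for the rollout path, is the open question.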
