Replies: 2 comments 1 reply
@jisngprk I am also new to DeepSpeed, so I may be wrong, but this is what works for many people. As far as checkpoints go, I think one is the checkpoint itself, and the other is the full model with state dict + optimizer state + other stuff. At least this is how it works with torch. For example: Save: Load:
And if you want to use just the CPU, you usually just specify it, i.e. model.cpu() (model.cuda() does the opposite and moves the model to the GPU), or pass map_location="cpu" to torch.load. Hope this is helpful.
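The save and load steps mentioned in this reply might look roughly like the following in plain PyTorch. This is only a minimal sketch, not the code from the original reply: the toy `nn.Linear` model, the dictionary keys (`"model"`, `"optimizer"`, `"step"`), and the temporary file path are all assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical toy model and optimizer standing in for the real ones.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical path; a temporary directory keeps the sketch self-contained.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save: bundle model weights, optimizer state, and a step counter in one file.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": 100,
    },
    ckpt_path,
)

# Load: map_location="cpu" places every tensor on the CPU,
# even if the checkpoint was written by a GPU process.
checkpoint = torch.load(ckpt_path, map_location="cpu")

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model"])
restored.cpu()  # no-op here; would move a GPU model back to the CPU
```

The important part for CPU-only loading is `map_location="cpu"`. Without it, `torch.load` tries to restore tensors to the device they were saved from, which fails on a machine with no GPU.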
Saving and loading DeepSpeed checkpoints to/from CPU is important for scenarios where GPU memory is limited. Here's what works:

Saving:

    # In your training loop
    if step % save_interval == 0:
        # Save checkpoint - DeepSpeed handles sharding automatically
        model_engine.save_checkpoint(
            save_dir="./checkpoints",
            tag=f"step_{step}",
            # Leave frozen parameters out of the checkpoint to keep it smaller
            exclude_frozen_parameters=True,
        )

Loading checkpoint:

    import deepspeed

    # During initialization
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # Load from checkpoint
    _, client_state = model_engine.load_checkpoint(
        load_dir="./checkpoints",
        tag="step_1000",
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )

CPU checkpoint consolidation (this step runs offline on the CPU and needs no distributed GPU setup):

    python -c "
    from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict
    # The first argument is the directory that contains the tag folder, and the tag is passed separately
    convert_zero_checkpoint_to_fp32_state_dict('./checkpoints', './checkpoints/fp32_model.pt', tag='step_1000')
    "

Depending on the DeepSpeed version, the second argument is treated as an output file or as an output directory, so check the docstring of the installed version. The consolidated weights can then be loaded on the CPU with torch.load(..., map_location="cpu").

Common pitfall:

For distributed LLM serving:
I have some questions.
I am using two GPUs in one node.
I want to load the model on the CPU.
How can I load the model checkpoints with a normal torch.load or the DeepSpeed engine, without setting up a distributed GPU environment?
If there is an example in the DeepSpeedExamples repo, please let me know.
When I save the checkpoint, it is saved in two separate directories that are named with the loss. Is it normal for the checkpoint to be saved separately? I am guessing it is because the loss is in the directory name.
Thank you!
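On the separate directories: as far as I know, each `save_checkpoint` call writes into a subdirectory named after its `tag`, and inside it there is one file per rank for the sharded states. So two directories named after loss values would most likely be two separate saves with different tags, not one directory per GPU. A small stdlib-only sketch for inspecting such a layout is below. The file names are what ZeRO runs typically produce and may differ across versions and settings. The `step_1000` tag and the fake empty files are purely illustrative assumptions, and `describe_checkpoint` is a hypothetical helper, not part of DeepSpeed.

```python
from pathlib import Path
import tempfile


def describe_checkpoint(save_dir):
    """Return (tag, sorted file names) for the most recent checkpoint in save_dir."""
    save_dir = Path(save_dir)
    # DeepSpeed records the most recent tag in a plain-text file named "latest".
    tag = (save_dir / "latest").read_text().strip()
    files = sorted(p.name for p in (save_dir / tag).iterdir() if p.is_file())
    return tag, files


# Build a fake two-GPU layout for illustration only (empty placeholder files).
root = Path(tempfile.mkdtemp())
tag_dir = root / "step_1000"  # hypothetical tag
tag_dir.mkdir()
for name in (
    "mp_rank_00_model_states.pt",
    "zero_pp_rank_0_mp_rank_00_optim_states.pt",
    "zero_pp_rank_1_mp_rank_00_optim_states.pt",
):
    (tag_dir / name).touch()
(root / "latest").write_text("step_1000")

tag, files = describe_checkpoint(root)
print(tag)
for name in files:
    print(" ", name)
```

Pointing `describe_checkpoint` at a real `save_dir` should show which tag is considered the latest and how many per-rank shard files it contains.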