feat: add ckpt conversion script fp32-bf16 #614
Merged
ashokponkumar merged 4 commits on Sep 30, 2025
Signed-off-by: yashasvi <yashasvi@ibm.com>
Thanks for making a pull request! 😃
Collaborator
Can we make it generic, say checkpoint_utils.py or something, and add features like removing optimizer files, etc.?
882adc5 to 35f0908
Signed-off-by: yashasvi <yashasvi@ibm.com>
7d4b464 to 9c2bf84
Signed-off-by: yashasvi <yashasvi@ibm.com>
c2c2506 to ea8ac4c
ashokponkumar approved these changes Sep 30, 2025
Description of the change
fms-hf saves both intermediate and final checkpoints using HF APIs, based on settings such as save_strategy and save_model_dir.
Training runs on a multi-node setup with mixed precision (the default in granite dot build, because it performs better) are saved in fp32. As a result, the saved checkpoints are large, and intermediate checkpoints are even larger because they also contain optimizer state.
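The size difference can be estimated with simple arithmetic. The sketch below is illustrative only and is not part of the PR: the function name `checkpoint_gib` is hypothetical, and it assumes an Adam-style optimizer that keeps two FP32 moment tensors per parameter, which the PR does not state.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def checkpoint_gib(num_params: int, dtype: str, with_adam_state: bool = False) -> float:
    """Rough checkpoint size in GiB for a model with num_params parameters.

    Hypothetical helper: ignores metadata, buffers, and scheduler/RNG state.
    """
    size_bytes = num_params * BYTES_PER_PARAM[dtype]
    if with_adam_state:
        # Assumption: Adam/AdamW stores two FP32 moments per parameter.
        size_bytes += num_params * 2 * BYTES_PER_PARAM["fp32"]
    return size_bytes / 2**30


if __name__ == "__main__":
    params = 8_000_000_000  # hypothetical model size, for illustration only
    for dtype in ("fp32", "bf16"):
        print(dtype, "weights only:", round(checkpoint_gib(params, dtype), 2), "GiB")
    print("fp32 + optimizer:", round(checkpoint_gib(params, "fp32", True), 2), "GiB")
```

Under these assumptions, casting the model weights to BF16 halves the weight files, while dropping the optimizer state removes the largest part of an intermediate checkpoint.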
This PR adds a script, checkpoint_utils.py, to manage checkpoints, including converting model checkpoints from fp32 to bf16. The script can then be used as and when required.
Utilities for managing model checkpoints with an optional in-place mode.
- INPUT -> OUTPUT unchanged
- --convert-model-to-bf16: cast model FP32 tensors to BF16 (optimizer tensors remain FP32)
- --no-optimizer: drop optimizer artifacts when writing outputs (merged with --drop-files)
- --drop-files: comma-separated extra file/dir names to drop (works with --no-optimizer and --inplace)
- --inplace: perform conversion and/or dropping directly in INPUT (destructive)
Defaults
Optimizer-related files dropped by --no-optimizer: optimizer.pt, optimizer, optimizer_0, optimizer_1
The script supports converting:
- .pt / .pth files
- .safetensors files
- … .safetensors)
Features
- … --drop-files
Usage:
This script provides several operations on checkpoints: copying, converting FP32 tensors to BF16, dropping optimizer states, and modifying checkpoints in place.
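As an illustration of how the copy and drop operations could fit together, here is a minimal sketch using only the standard library. It is not the code from checkpoint_utils.py: the helper names `build_drop_set` and `copy_checkpoint` are hypothetical, and matching names at every directory level is an assumption about the intended behaviour. Only the default optimizer file names come from the description above.

```python
import shutil

# Default names dropped by --no-optimizer, as listed in the PR description.
OPTIMIZER_ARTIFACTS = {"optimizer.pt", "optimizer", "optimizer_0", "optimizer_1"}


def build_drop_set(no_optimizer: bool, drop_files: str = "") -> set:
    """Merge the optimizer defaults with a comma-separated --drop-files value."""
    names = {name.strip() for name in drop_files.split(",") if name.strip()}
    if no_optimizer:
        names |= OPTIMIZER_ARTIFACTS
    return names


def copy_checkpoint(src: str, dst: str, drop: set) -> None:
    """Copy src to a new directory dst, skipping entries whose name is in drop."""

    def ignore(_directory, entries):
        # shutil.copytree calls this for each directory it visits, so a
        # matching name is skipped at any depth (an assumption of this sketch).
        return [entry for entry in entries if entry in drop]

    shutil.copytree(src, dst, ignore=ignore)
```

With an empty drop set this reduces to the plain INPUT -> OUTPUT copy. An in-place variant would delete the matching entries from INPUT instead of copying, which is why that mode is destructive.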
Related issue number
How to verify the PR
Was the PR tested