fix: directly save final ckpt in save_model_dir#626
Conversation
|
Thanks for making a pull request! 😃 |
dushyantbehl
left a comment
There was a problem hiding this comment.
Rather than doing this can we pass the save_dir directly with hf_converted_output_dir attached if intermediate else not if this is final from the call to this function?
This would reduce the code and have a clean function.
Yeah, the path-based approach sounds simpler, but I had thought that it could easily lead to inconsistent directory handling. I did consider it, but I think the flag-based version is better since it keeps all folder-structure logic in one place. That way, if we ever change stuff (like renaming |
Can you explain what you mean by |
With inconsistent I had meant passing different paths outside the checkpoint function, the current version contains all the logic within the checkpoint function and simply abstracts the intermediate vs final checkpoint logic within the function |
93bda6b to
a346f92
Compare
Signed-off-by: yashasvi <yashasvi@ibm.com>
a346f92 to
e3f0b8a
Compare
1e844d3
into
foundation-model-stack:main

Description of the change
With this PR the the fms_hf_tuning stack will directly save final ckpt in save_model_dir rather than nesting it further within the
/{save_model_dir}/hf_converted_checkpointWhen using
SHARDED_STATE_DICT, each intermediate checkpoint produces distributed shard files that must be converted into Hugging Face–compatible format.which are saved within
but for the final checkpoint nesting it again under
hf_converted_checkpointis unnecessary and hence the change..Related issue number
How to verify the PR
Following is the screenshot of the job with the PR changes.
hf_converted_checkpointsubdirWas the PR tested