Currently, FSDP skips the intermediate reconstruction of the tree. This results in the following issues:
- the grouping of parameters may be non-deterministic, so the loaded state may be incorrect
- the state is tied to the number of groups, and therefore to the communication size, which is a design flaw
The current strategy works deterministically as long as everything is fixed from run to run, but an improvement is needed.
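
To illustrate the second issue, here is a minimal sketch in plain Python, not taken from the FSDP code. All names (`group_params`, `state_by_group`, `state_by_name`) and the toy parameter values are hypothetical. It shows that state keyed by group index changes shape when the group size changes, while state keyed by parameter name does not.

```python
from typing import Dict, List


def group_params(names: List[str], group_size: int) -> List[List[str]]:
    """Split parameter names into communication groups (hypothetical helper)."""
    # Sorting makes the grouping independent of the input iteration order.
    ordered = sorted(names)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def state_by_group(groups: List[List[str]], values: Dict[str, float]) -> Dict[int, List[float]]:
    """State keyed by group index: its layout depends on the group size."""
    return {idx: [values[n] for n in group] for idx, group in enumerate(groups)}


def state_by_name(groups: List[List[str]], values: Dict[str, float]) -> Dict[str, float]:
    """State keyed by parameter name: its layout ignores the grouping."""
    return {n: values[n] for group in groups for n in group}


# Toy per-parameter state (assumed values, for illustration only).
values = {
    "layer1.weight": 1.0,
    "layer1.bias": 2.0,
    "layer0.weight": 3.0,
    "layer0.bias": 4.0,
}
names = list(values)

small_groups = group_params(names, group_size=2)
large_groups = group_params(names, group_size=4)

# Group-keyed state differs between the two communication sizes ...
assert state_by_group(small_groups, values) != state_by_group(large_groups, values)
# ... while name-keyed state is identical.
assert state_by_name(small_groups, values) == state_by_name(large_groups, values)
```

A checkpoint saved in the name-keyed form could be loaded under a different group size, which is the property the current group-keyed state lacks.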