NNX migration prep (5/N): enable NNX by default #3526
Draft
ecnal-cienet wants to merge 5 commits into main from
- Add utils to manipulate the NNX shardings with the abstract state of a model
  - also add unit tests for the utils
- Extract the mesh creation function to maxtext_utils.get_mesh_from_config()
  - also add unit tests for this function
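A get_mesh_from_config()-style helper can be sketched as follows. This is an illustrative stand-in, not the real maxtext_utils code: the real function reads axis sizes from the MaxText config, while here they are passed as a plain dict.

```python
import numpy as np
import jax

def make_mesh(axis_sizes: dict[str, int]) -> jax.sharding.Mesh:
    """Hypothetical sketch: build a device mesh from named axis sizes.

    The real maxtext_utils.get_mesh_from_config() derives these values from
    the MaxText config; the dict argument here is an illustrative stand-in.
    """
    # Arrange all visible devices into an ndarray whose shape matches the
    # requested parallelism axes (their product must equal the device count).
    devices = np.asarray(jax.devices()).reshape(tuple(axis_sizes.values()))
    return jax.sharding.Mesh(devices, axis_names=tuple(axis_sizes.keys()))
```

Extracting this into one helper lets the Linen and NNX paths share a single mesh-construction point instead of duplicating the reshape logic.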
Note:
flax v0.12 emits a DeprecationWarning in multiple places:
- DeprecationWarning: '.value' access is now deprecated. Use variable.get_value() or variable[...] (for [Array]).
- DeprecationWarning: 'VariableState' was removed, this is just an alias to 'Variable'. Please use 'Variable' directly instead.
But since the code needs to work with post-training, which currently requires flax v0.11, we did not change code for these warnings.
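Since the call sites stay on the flax v0.11-compatible APIs, one way to keep logs quiet under flax v0.12 is to filter just these two messages. This is an illustrative sketch, not code from the PR:

```python
import warnings

def silence_flax_variable_deprecations() -> None:
    """Hypothetical helper: suppress only the two flax v0.12 deprecation
    messages quoted above, leaving all other warnings visible."""
    warnings.filterwarnings(
        "ignore",
        message=r".*'\.value' access is now deprecated.*",
        category=DeprecationWarning,
    )
    warnings.filterwarnings(
        "ignore",
        message=r".*'VariableState' was removed.*",
        category=DeprecationWarning,
    )
```

Matching on the exact message text keeps the filter from hiding unrelated DeprecationWarnings that should still surface in CI.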
- Add TrainStateNNX (layers/train_state_nnx.py) with checkpoint and unit tests
- Refactor model_creation_utils with create_nnx_abstract_model(); add NNX support to muon_utils
- Add get_abstract_state_nnx() and get_nnx_named_sharding_with_scan_axis() to maxtext_utils.py
- Wire NNX train state into train.py and train_utils.py with pure_nnx dispatch
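The role of a TrainStateNNX-style container can be illustrated with a minimal stand-in. The name, fields, and method below are illustrative assumptions; the real layers/train_state_nnx.py additionally carries checkpointing support:

```python
from dataclasses import dataclass, replace
from typing import Any

@dataclass(frozen=True)
class TrainStateSketch:
    """Minimal stand-in for an NNX train state: a step counter plus the
    pytrees the training loop threads through each update."""
    step: int
    params: Any      # model parameter pytree (an nnx.State in the real code)
    opt_state: Any   # optimizer state pytree

    def next_state(self, new_params: Any, new_opt_state: Any) -> "TrainStateSketch":
        # Return a fresh state with the step advanced; the frozen dataclass
        # enforces the functional-update style that jax transformations expect.
        return replace(self, step=self.step + 1, params=new_params, opt_state=new_opt_state)
```

Keeping the state immutable is what lets the same object flow through jit-compiled update steps without hidden mutation.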
…ison utility
- modify print_shardings_params to support NNX (maxtext_utils.py)
- add --pure_nnx flag to run_sharding_dump.py
- add bidirectional Linen<->NNX checkpoint conversion utility (linen_nnx_converter.py)
- add checkpoint comparison utility for Linen vs NNX validation (compare_linen_nnx_checkpoint.py)
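One structural difference such a conversion utility must bridge is the top-level "params" wrapper that Linen checkpoints use around the weight tree. A toy sketch of just that one aspect follows; the function names are hypothetical, and the real linen_nnx_converter.py also has to handle scan axes, RNG state, and metadata, none of which is shown:

```python
from typing import Any

def linen_to_nnx(linen_tree: dict[str, Any]) -> dict[str, Any]:
    """Toy sketch: drop the top-level "params" wrapper used by Linen
    checkpoints to obtain an NNX-style state dict."""
    return dict(linen_tree["params"])

def nnx_to_linen(nnx_tree: dict[str, Any]) -> dict[str, Any]:
    """Toy sketch of the reverse direction: re-wrap the tree under "params"
    so Linen-format consumers (and tests) keep working."""
    return {"params": dict(nnx_tree)}
```

A comparison utility can then canonicalize both checkpoints into one of the two layouts before diffing leaf arrays.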
NNX Migration Route Map
1. pure_nnx flag, init_state_fn, TrainStateNNX, NNX utils. Linen workflow unchanged. (PR #3427)
2. get_abstract_state_nnx, get_named_sharding_nnx, set_named_sharding_nnx, get_partition_spec_nnx, get_mesh_from_config. (PR #3470)
3. TrainStateNNX, model creation, gradient accumulation, checkpointing, and training loop dispatch. (PR #3500)

Description
Config change
src/MaxText/configs/base.yml — three flags flipped to True:

Test fixes
- src/maxtext/layers/nnx_decoders.py: add multimodal_input=None to NNXDecoder.__call__ and unpack it into individual fields. Transformer.__call__ passes a unified MultimodalInput object, but NNXDecoder previously only accepted the fields individually.
- src/maxtext/utils/muon_utils.py: (1) in get_muon_weight_dimension_numbers, return the nnx.State directly (preserving .value attribute access for unit tests); (2) in get_model_mdn, normalize the NNX output via nnx.to_pure_dict plus a {"params": ...} wrapper so the test_model_integration expected values (written in Linen format) stay valid.
- src/maxtext/trainers/post_train/distillation/distillation_utils.py: gate the optimizer_state restore on whether it exists in the checkpoint. PeftTrainer.save() only saves model_params, so restoring optimizer_state unconditionally caused a KeyError.
- tests/integration/gradient_accumulation_test.py: move test_sft_grad_accumulate_same_loss from the deprecated SFT loop to the NNX-native train_sft.py. The deprecated loop always passed nextrng as a 3rd positional arg, mismatching the 2-element NNX in_shardings.
- tests/integration/decode_tests.py, generate_param_only_checkpoint_test.py, smoke/inference_microbenchmark_smoke_test.py: add pure_nnx=False enable_nnx=False pure_nnx_decoder=False to all inference test configs. maxengine (the inference engine) does not yet support NNX, so decode tests must explicitly declare the Linen path.
- tests/unit/sharding_compare_test.py: filter abstract_state.model leaves to floating-point only before asserting dtype == float32. The NNX model state includes RNG state variables (uint32/key) that are not weight parameters.
- src/maxtext/layers/nnx_decoders.py (scan update): replace self.layers = nnx.merge(...) with nnx.update(self.layers, nnx.state(...)) in _apply_layers_sequentially callers. Reassigning self.layers inside nnx.value_and_grad mutates the NNX graph structure, triggering ValueError: cached_partial graph structure mutated.
- src/maxtext/utils/gradient_accumulation.py (shard_mode=explicit): jax.lax.scan traces its body with an AbstractMesh where all axis types are Auto, which rejects reduced/unreduced PartitionSpec in scan carry tensors. Fix: use plain params_shardings in the scan carry, and apply the unreduced annotation to gradients after the scan to trigger the all-reduce across data-parallel devices. Also add copy=True to nnx.merge inside the scan body to avoid TraceContextError from reused Variable objects.
- src/maxtext/layers/pipeline.py (test_full_train_non_circular): nnx.vmap's extract.to_tree checks Variable._can_update (via _trace_state.is_valid()). Variables created by nnx.merge inside jax.value_and_grad have _trace_state at the grad trace level; when nnx.vmap enters a deeper trace level, _can_update returns False and raises ValueError: Cannot extract graph node from different trace level. Fix: wrap the vmapped function with nnx.to_pure_dict(state) before the nnx.vmap call. A pure dict of arrays has no Variable objects, so extract.to_tree skips the trace check, and nnx.merge(graph, pure_dict) inside the vmap body creates fresh Variables valid at the current trace level.

Tests
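The pipeline.py fix above follows a general JAX pattern: carry only pure arrays (a plain dict) across a transform boundary and rebuild any richer structure inside the transformed function. A minimal sketch with plain jax.vmap standing in for nnx.vmap; the single dense layer here is an illustrative toy, not the MaxText pipeline code:

```python
import jax
import jax.numpy as jnp

def layer_apply(pure_params: dict, x: jnp.ndarray) -> jnp.ndarray:
    # Inside the transform, rebuild whatever structure is needed from the
    # pure dict; this dense layer stands in for nnx.merge(graph, pure_dict).
    return x @ pure_params["w"] + pure_params["b"]

# A pure dict of arrays is safe to close over across jax.vmap: there are no
# stateful Variable objects pinned to an outer trace level.
params = {"w": jnp.ones((3, 3)), "b": jnp.zeros((3,))}
batched = jax.vmap(lambda x: layer_apply(params, x))(jnp.ones((4, 3)))
```

Because the vmapped function receives only arrays, any trace-level bookkeeping attached to stateful wrappers simply never crosses the boundary.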