Li/fix zero param level sharding#57
Closed
donglixp wants to merge 2042 commits into
fix: cache dir is a str, which has no exists() function
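A minimal sketch of this kind of fix, assuming the cache directory can arrive as either a `str` or a `pathlib.Path`. The helper name `cache_dir_exists` is hypothetical and not taken from the nnscaler code base:

```python
import tempfile
from pathlib import Path


def cache_dir_exists(cache_dir):
    """Return True if cache_dir exists; accepts a str or a Path.

    A plain str has no .exists() method, so normalize to Path first.
    (Hypothetical helper, for illustration only.)
    """
    return Path(cache_dir).exists()


with tempfile.TemporaryDirectory() as tmp:
    # TemporaryDirectory yields the directory name as a plain str.
    assert isinstance(tmp, str) and not hasattr(tmp, "exists")
    found_while_alive = cache_dir_exists(tmp)

# The directory is removed when the context manager exits.
found_after_cleanup = cache_dir_exists(tmp)
```

Normalizing once at the boundary keeps the rest of the code free to use `Path` methods regardless of how the caller passed the directory.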
After this PR, the `follow` logic in autodist is:
- the father (followed) op should not contain a sum dim (linear is defined as a sum op, since it has a sum dim in its computation)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the first producer if the producers are in the same `follow region`.

Update the test case to elaborate this PR. Fix the bug in the dp solver when computing the in edges for a dp node.
…contains attributes parity alert passed 
The index in `train_mem2in_idx` is the original index of the operator's input; this change adds a mapping from the original index to the pure tensor index. The bug was exposed by functions that do not put their tensor inputs first, e.g., `torch.gather`.
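The index mapping can be sketched as follows. This is an illustration under assumptions, not the actual nnscaler code: `FakeTensor` is a stand-in for a tensor object and `build_tensor_index_map` is a hypothetical helper name.

```python
class FakeTensor:
    """Stand-in for a tensor argument (assumption, for illustration only)."""


def build_tensor_index_map(args):
    """Map each tensor's original argument index to its tensor-only index.

    Non-tensor arguments (ints, floats, ...) are skipped, so the tensor-only
    index counts tensors in the order they appear.
    """
    mapping = {}
    for orig_idx, arg in enumerate(args):
        if isinstance(arg, FakeTensor):
            mapping[orig_idx] = len(mapping)
    return mapping


# torch.gather(input, dim, index): the tensors sit at positions 0 and 2,
# with the non-tensor `dim` in between.
gather_args = (FakeTensor(), 1, FakeTensor())
gather_map = build_tensor_index_map(gather_args)
```

For the `torch.gather`-style argument list, the original index 2 corresponds to tensor index 1, which is the case that an identity mapping gets wrong.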
parser: never fold getattr node 'self.training' unit test pass parity check pass
self.training in submodules: hotfix for nightly test unit test pass parity check pass
Add a pipeline to nightly build wheel, and fix packaging for autodist profile data. pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114 repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly
never fold nnscaler runtime functions
- align the memory estimation in the dp solver with the ilp solver, check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine c++ code
- have verified the search result against the ilp solver, with and without recompute, on retnet-3b

NOTE: after this PR, more meta information is introduced into each dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.
1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails
2. dp solver bug 1: segmentation fault when `following candidates` is empty
3. dp solver bug 2: corner case, the newly generated dp state can be illegal, so it needs to be checked when adding it to the new states

tests added
support ConvTranspose1D,Conv2D,ConvTranspose2D
`_TensorBase` was renamed to `TensorBase` in torch 2.3, which affects function calls like `torch.Tensor.view(t, (1, -1))`. This PR fixes the issue.
update doc for v0.1
Added nanoGPT example.

Known causes of parity mismatch:
- grad clip
- param group
- unable to align rng (validation loop consumes random numbers, etc)

Update: Added args `use_nnscaler`, `plan_ngpus`, `runtime_ngpus`. Precision is derived from nanoGPT's arg `dtype`. Doc will be added in another PR.

Other known issue: When using fp16, nanoGPT uses a scaler while this version does not. I think it's not a big deal because by default nanoGPT does not use fp16.
fix ifexpr warning unit test pass parity check pass
ensure the output tensor has dim anno when it is a scalar tensor
Adds a pipeline to upload the release wheel to the devops artifact feed and test.pypi.org. (It has already been run based on the 0.1 tag.) Also updates `version.py` to 0.2, to avoid creating a separate PR.

Pipeline usage:
1. Open the pipeline webpage: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=116
2. Click "Run pipeline"
3. Choose the branch/tag
4. Click "Variables"
5. Click "version" and set the value to something like "v0.1"
6. Confirm update and run (the update will not be saved and must be done every time)
1. Add a flag to IRTensor to indicate whether it was originally a scalar tensor.
2. During graph transformation, treat the tensor as it is.
3. When generating code, check the flag to generate the correct code.

unit test pass parity check pass
MiniTrainer workable version. the parity check is against lightning.
parity matched between lightning version & mini-trainer version
add mixed precision f16 optimizer
Loss is a special tensor in the computation graph:
- requires_grad = True
- the forward graph and backward graph share exactly the same tensor physically

The main branch has a problem when partitioning the loss. Since the loss is a scalar tensor by default, it is partitioned along the value dimension. Assume we have an operator `nll_loss([1024, 2048], [1024]) -> [1]` with annotation `N+ C^, C^ ->1`. In LLM training, `N` is the token dim and `C` is the dictionary dim, so partitioning along `N` partitions the loss along value. In the main branch, the following code will be generated

Although it is runnable and correct, it breaks our definition of `IRSegment`: **the intermediate variable `nll_loss_10138` should not be passed out as an output tensor**. However, removing this sub-tensor directly does not solve the problem, since the real loss tensor is generated by an adapter `nnscaler.runtime.adapter.all_reduce`, which means its `requires_grad` field equals `False` at runtime. In addition, the extra partitioned `nll_loss_10138` disappears under pipeline parallelism in the main branch.

Root causes are:
- when `gen_activations` is called to generate adapters, the returned adapter for the partitioned loss is wrong. It should be a `nnscaler.runtime.adapter.nn.allreduce_identity` instead of `nnscaler.runtime.adapter.all_reduce`
- an additional compiling pass `Grouping` is called for spmd/tp. `Grouping` will dispatch the partitioned graph to each device and build an `IRSegment` for each device.
- in the `create_segment` method, there is an additional check when determining the outputs: `isinstance(otensor, IRSubTensor) and otensor.is_loss()`. This check adds both `nll_loss_10138` and `nll_loss_1955` to the segment's output.
- `nll_loss_1955` is annotated with `requires_grad=False` and `grad=None`, while `nll_loss_10138` is annotated with `requires_grad=True` and `grad = gtensorxxx`. According to the logic in `get_backward_callsite_io_tensors`, `nll_loss_10138` will be recognized as the real loss by the backward graph.
- However, in the pipeline code generation, there is no `Grouping` pass. The dispatch process (ExeReuseCell -> Segment -> IRCell) strictly follows the assumption that the output of a segment should be a full tensor.

To solve this problem, this PR:
- generates correct adapters when the output loss is used in another operator (like the `.data` operation in fairseq's criterion)
- chooses the segment's output tensor carefully so that the emit process is runnable

parity check passed

1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam-like optimizers
2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin.

--------- Co-authored-by: zyeric <cheerforwhy@gmail.com>
1. Fix tracing error for pytorch 2.9+ 2. Fix state dict related logic for Muon. 3. Add MASTER_PORT (also used as PYTEST_RUN_ID) env for tests, so we can run multiple pytest sessions on the same machine.
1. Add a new flag `zero_param_level_sharding` so users can control the granularity of parameter sharding in ZeRO.
2. Add a muon mixin to flatten/unflatten parameters for Muon.
3. Refine the logic for flattened buffers: we now record the position (start/end) of each parameter in the flattened buffer instead of calculating it on the fly, so we have more freedom to add padding.
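Recording each parameter's start/end position in a flattened buffer could look roughly like the sketch below. The names `ParamSlot`, `build_layout` and the `align` argument are assumptions made for illustration; the real nnscaler data structures may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSlot:
    """Recorded position of one parameter inside the flattened buffer."""
    name: str
    start: int  # inclusive element offset
    end: int    # exclusive element offset


def build_layout(param_sizes, align=1):
    """Assign a [start, end) range to every parameter.

    param_sizes: list of (name, numel) pairs.
    align: each parameter starts at a multiple of `align`; the gap after
           the previous parameter is padding (hypothetical knob).
    Returns (slots, total_numel_including_padding).
    """
    slots = []
    offset = 0
    for name, numel in param_sizes:
        start, end = offset, offset + numel
        slots.append(ParamSlot(name, start, end))
        # Round the next start up to a multiple of `align`.
        offset = -(-end // align) * align
    return slots, offset


slots, total = build_layout([("w", 10), ("b", 3), ("g", 5)], align=4)
```

Because the positions are stored rather than recomputed from cumulative sizes, padding can be placed between any two parameters without breaking lookups.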
…ensor parallelism (#18)
1. Try to use all_gather to gather weights from other ranks for TP.
2. If that fails (pp/zero3 are used), fall back to the old way.

--------- Co-authored-by: Xun Wu <138114252+yushuiwx@users.noreply.github.com>
…ll cases (#19) Add a dtensor class to help runtime state dict merging.
This PR extends nnscaler’s mixed-module training/serialization path to support ZeRO-1 for non-parallel parameters, introducing a virtual NonParallelModule wrapper and updating optimizer/state-dict merge logic accordingly.
1. updates and expands the documentation for parallel_module.md and trainer.md, 2. update arg parser to add escape support. 3. update arg parser to better support __type, __value_type, function type.
Remove unnecessary multiref addition when pipeline parallelism is enabled. This change also clarifies the conditions under which multiref is applied. The old code was over-triggering: with pipeline parallelism enabled (len(pp_desc.spmd_descs) > 1), it would add multiref for any replicated parameter, even if that parameter only existed in a single stage. The fix narrows the condition to only add multiref when the parameter is actually consumed in multiple stages, which is the true definition of a cross-stage shared parameter.
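The narrowed condition can be illustrated with a small sketch. It assumes a simplified view where each pipeline stage is described by the set of parameter names it consumes; `needs_multiref` and `consumers_by_stage` are hypothetical names, not nnscaler APIs.

```python
def needs_multiref(param, consumers_by_stage):
    """Return True only if `param` is consumed in more than one stage.

    consumers_by_stage: dict mapping stage id -> set of parameter names
    consumed in that stage (simplified stand-in for the real graph).
    """
    stages = [s for s, params in consumers_by_stage.items() if param in params]
    return len(stages) > 1


consumers_by_stage = {
    0: {"embed.weight", "layer0.weight"},
    1: {"layer1.weight", "embed.weight"},  # tied embedding reused in stage 1
}

shared = needs_multiref("embed.weight", consumers_by_stage)
single_stage = needs_multiref("layer0.weight", consumers_by_stage)
```

Under the old behavior described above, both parameters would have received a multiref simply because more than one stage exists; with the narrowed check only the cross-stage shared parameter does.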
Introduce support for fake functions in tracing, allowing for the registration and use of placeholder functions during model tracing. This is mostly useful for functions that are costly or are not runnable (e.g. communication functions) during tracing. This feature also aligns with `torch.library.custom_op`, which supports registering a fake function, but the interface is not exactly the same. The fake function should return a real tensor, which can be used in concrete tracing later. (Currently there is no `@custom_op.register_fake`-like interface.)
Add profiling capabilities to the CLI, allowing users to monitor CPU and CUDA activities during training.
Implement overlapped scheduler by: 1. add cuda stream context for each segment/adapter/reducer 2. add cuda event wait/record for each segment 3. refine sched code generation for stream/event config.
This PR introduces a “no-grad-reduce” annotation mechanism for custom op shape annotations so that, for specific partition identifiers, nnscaler can skip inserting gradient all-reduce adapters (avoiding incorrect or redundant reductions). This is done by extending ShapeAnno parsing to support : / modifiers (and '/' as a shortcut) to control gradient-reduction behavior during partitioning.
* Merge ring attention implementation into main branch * remove sink since it is no longer needed * remove cp_ranks * add tests
Introduce support for the grad_dtype attribute in parameters, enhancing flexibility in gradient precision management. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>
nns supports 2.11
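When zero_param_level_sharding=True, the bucket builder cannot split by bytes alone; it must also ensure that each bucket contains at least zero_group_size parameters, and in particular it must handle the tail bucket.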
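One way to satisfy this constraint is sketched below, assuming a greedy builder over `(name, nbytes)` pairs. `build_buckets` and its arguments are hypothetical names for illustration and are not the actual nnscaler bucket builder.

```python
def build_buckets(param_sizes, bucket_bytes, zero_group_size):
    """Greedily group parameters into buckets.

    A bucket is closed only when it has reached `bucket_bytes` AND holds at
    least `zero_group_size` parameters. An undersized tail bucket is merged
    into the previous bucket when one exists.
    """
    buckets = []
    current, current_bytes = [], 0
    for name, nbytes in param_sizes:
        current.append(name)
        current_bytes += nbytes
        if current_bytes >= bucket_bytes and len(current) >= zero_group_size:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        if len(current) < zero_group_size and buckets:
            # Tail bucket has too few parameters: fold it into the last one.
            buckets[-1].extend(current)
        else:
            # Either the tail is large enough, or it is the only bucket;
            # the latter case must be handled by the caller.
            buckets.append(current)
    return buckets


params = [("p0", 60), ("p1", 60), ("p2", 10), ("p3", 10), ("p4", 200), ("p5", 10)]
buckets = build_buckets(params, bucket_bytes=100, zero_group_size=2)
```

In this example `p5` would form a one-parameter tail bucket under a bytes-only split, so it is merged into the preceding bucket. If the model has fewer than `zero_group_size` parameters in total, no valid bucket exists and the sketch leaves that case to the caller.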