Li/fix zero param level sharding by donglixp · Pull Request #57 · microsoft/nnscaler

donglixp · 2026-05-04T06:59:29Z

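When zero_param_level_sharding=True, the bucket builder cannot split by bytes alone; it must also guarantee that every bucket holds at least zero_group_size parameters, and the tail bucket in particular needs handling.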
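A minimal sketch of the bucketing constraint described above, not the code in this PR. The function name `build_buckets` and its parameters are hypothetical; `min_params` plays the role of `zero_group_size`. A bucket is closed only once it reaches the byte budget and holds enough parameters, and an undersized tail bucket is merged into the previous one.

```python
from typing import List, Sequence


def build_buckets(param_bytes: Sequence[int], bucket_bytes: int, min_params: int) -> List[List[int]]:
    """Group parameter indices into buckets (hypothetical helper).

    A bucket is closed when it holds at least `bucket_bytes` bytes AND at
    least `min_params` parameters. A tail bucket with fewer than
    `min_params` parameters is merged into the previous bucket.
    """
    if min_params < 1:
        raise ValueError("min_params must be >= 1")
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, nbytes in enumerate(param_bytes):
        current.append(idx)
        current_bytes += nbytes
        # Size alone is not enough: also require the minimum parameter count.
        if current_bytes >= bucket_bytes and len(current) >= min_params:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        if len(current) < min_params and buckets:
            # Undersized tail bucket: fold it into the previous bucket.
            buckets[-1].extend(current)
        else:
            # Either large enough, or the only bucket there is.
            buckets.append(current)
    return buckets
```

If the model has fewer than `min_params` parameters in total, this sketch still returns a single undersized bucket; how the real builder treats that case is not stated in the PR description.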

fix cache dir is a str, no exists() function
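For illustration only, a sketch of the kind of fix this message describes, assuming the cache directory may arrive as either a `str` or a `Path`. The helper name `cache_dir_exists` is hypothetical and not taken from the repository.

```python
import tempfile
from pathlib import Path
from typing import Union


def cache_dir_exists(cache_dir: Union[str, Path]) -> bool:
    """Return True if the cache directory exists.

    A plain str has no exists() method, so normalize to Path first.
    """
    return Path(cache_dir).exists()


# Works for both str and Path inputs.
print(cache_dir_exists(tempfile.gettempdir()))
print(cache_dir_exists(Path(tempfile.gettempdir())))
```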

After this PR, the `follow` logic in autodist is:
- the father (followed) op should not contain a sum dim (linear is defined as a sum op, since it has a sum dim in its computation)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the 1st producer if the producers are in the same `follow region`

Update the test case to elaborate this PR. Fix the bug in the dp solver when computing the in edges for a dp node.

…erging

…contains attributes parity alert passed ![image.png](https://msrasrg.visualstudio.com/bb54e96e-8cc1-46f6-9021-c7048165b5bc/_apis/git/repositories/66b74611-09f4-4d0e-89b7-5ee93c087d3c/pullRequests/2187/attachments/image.png)

The index in `train_mem2in_idx` is the original index of the operator's input; this PR adds a mapping from the original index to the pure tensor index. The bug was exposed by functions that do not place all tensor inputs first, e.g. `torch.gather`.
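A small sketch of such an index mapping, under the assumption that "pure tensor index" means the position of an input among the tensor inputs only. `FakeTensor` and `original_to_tensor_index` are hypothetical names for illustration; the real code works on nnscaler's IR objects.

```python
from typing import Any, Callable, Dict, Sequence


class FakeTensor:
    """Stand-in for a tensor input (hypothetical, for illustration)."""


def original_to_tensor_index(
    inputs: Sequence[Any],
    is_tensor: Callable[[Any], bool] = lambda x: isinstance(x, FakeTensor),
) -> Dict[int, int]:
    """Map each tensor input's original position to its tensor-only position."""
    mapping: Dict[int, int] = {}
    for orig_idx, arg in enumerate(inputs):
        if is_tensor(arg):
            # len(mapping) is the number of tensors seen so far.
            mapping[orig_idx] = len(mapping)
    return mapping


# torch.gather(input, dim, index): tensors sit at original positions 0 and 2.
gather_inputs = [FakeTensor(), 1, FakeTensor()]
print(original_to_tensor_index(gather_inputs))
```

For the `torch.gather`-style argument list, original index 2 maps to tensor index 1, which is the case that breaks if the original index is used directly.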

parser: never fold getattr node 'self.training' unit test pass parity check pass

self.training in submodules: hotfix for nightly test unit test pass parity check pass

Add a pipeline to nightly build wheel, and fix packaging for autodist profile data. pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114 repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly

refine docs

never fold nnscaler runtime functions

- align the memory estimation in the dp solver with the ilp solver; check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine c++ code
- the search result has been verified against ilp, with and without recompute, on retnet-3b

NOTE: after this PR, more meta information is kept in each dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.

1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails
2. dp solver bug 1: segmentation fault when `following candidates` is empty
3. dp solver bug 2: corner case, the newly generated dp state can be illegal and must be checked when adding it to the new states

tests added

support ConvTranspose1D,Conv2D,ConvTranspose2D

`_TensorBase` was renamed to `TensorBase` in torch 2.3. This affects function calls like `torch.Tensor.view(t, (1, -1))`; this PR fixes the issue.

update doc for v0.1

Added nanoGPT example.

Known causes of parity mismatch:
- grad clip
- param group
- unable to align rng (validation loop consumes random numbers, etc)

Update: Added args `use_nnscaler`, `plan_ngpus`, `runtime_ngpus`. Precision is derived from nanoGPT's arg `dtype`. Doc will be added in another PR.

Other known issue: When using fp16, nanoGPT uses a scaler while this version does not. I think it's not a big deal because by default nanoGPT does not use fp16.

fix ifexpr warning unit test pass parity check pass

ensure the output tensor has dim anno when it is a scalar tensor

Adds a pipeline to upload the release wheel to the devops artifact feed and test.pypi.org. (It has already been run based on the 0.1 tag.) Also updates `version.py` to 0.2, to avoid creating a separate PR.

Pipeline usage:
1. Open the pipeline webpage: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=116
2. Click "Run pipeline"
3. Choose the branch/tag
4. Click "Variables"
5. Click "version" and set the value to something like "v0.1"
6. Confirm update and run (the update will not be saved and must be done every time)

1. Add a flag to IRTensor to indicate whether it is originally a scalar tensor.
2. During graph transformation, handle the tensor as it is.
3. When generating code, check the flag to generate correct code.

unit test pass
parity check pass

MiniTrainer workable version. the parity check is against lightning.

parity matched between lightning version & mini-trainer version

add mixed precision f16 optimizer

Loss is a special tensor in the computation graph.
- requires_grad = True
- the forward graph and the backward graph physically share exactly the same tensor

The main branch has a problem when partitioning the loss. Since the loss is a scalar tensor by default, it is partitioned along the value dimension. Assume we have an operator `nll_loss([1024, 2048], [1024]) -> [1]` with annotation `N+ C^, C^ ->1`. In LLM training, `N` is the token dim and `C` is the dictionary dim, so partitioning along `N` partitions the loss along value. In the main branch, the following code is generated:

![image (3).png](https://msrasrg.visualstudio.com/bb54e96e-8cc1-46f6-9021-c7048165b5bc/_apis/git/repositories/66b74611-09f4-4d0e-89b7-5ee93c087d3c/pullRequests/2207/attachments/image%20%283%29.png)

Although it is runnable and correct, it breaks our definition of `IRSegment`: **the intermediate variable `nll_loss_10138` should not be passed out as an output tensor**. However, removing this sub-tensor directly does not solve the problem, since the real loss tensor is generated by an adapter `nnscaler.runtime.adapter.all_reduce`, which means its `requires_grad` field equals `False` at runtime. In addition, the additional partitioned `nll_loss_10138` disappears in the pipeline case in the main branch.

![image (4).png](https://msrasrg.visualstudio.com/bb54e96e-8cc1-46f6-9021-c7048165b5bc/_apis/git/repositories/66b74611-09f4-4d0e-89b7-5ee93c087d3c/pullRequests/2207/attachments/image%20%284%29.png)

Root causes are:
- when `gen_activations` is called to generate adapters, the returned adapter for the partitioned loss is wrong. It should be a `nnscaler.runtime.adapter.nn.allreduce_identity` instead of `nnscaler.runtime.adapter.all_reduce`
- an additional compiling pass `Grouping` is called for spmd/tp. `Grouping` will dispatch the partitioned graph to each device and build an `IRSegment` for each device.
- in the `create_segment` method, there is an additional check when determining the outputs: `isinstance(otensor, IRSubTensor) and otensor.is_loss()`. This check will add both `nll_loss_10138` and `nll_loss_1955` to the segment's output.
- `nll_loss_1955` is annotated with `requires_grad=False` and `grad=None`, while `nll_loss_10138` is annotated with `requires_grad=True` and `grad = gtensorxxx`. According to the logic in `get_backward_callsite_io_tensors`, `nll_loss_10138` will be recognized as the real loss by the backward graph.
- However, in the pipeline code generation, there is no `Grouping` pass. The dispatch process (ExeReuseCell -> Segment -> IRCell) strictly follows the assumption that the output of a segment should be a full tensor.

To solve this problem, this PR:
- generates correct adapters when the output loss is used in another operator (like the `.data` operation in fairseq's criterion)
- chooses the tensor used as the segment's output carefully so that the emit process is runnable

parity check passed

![image.png](https://msrasrg.visualstudio.com/bb54e96e-8cc1-46f6-9021-c7048165b5bc/_apis/git/r...

[Refine] Reduce memory fragment when resuming

…nal einops functions. Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulations. To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.

1. Normalize state_dict device handling 2. replace torch.cat with F.pad

* Add Doc for Autodist Constraints Guide * Revise the description of autodist in the documentation. * fix comment * polish doc --------- Co-authored-by: yileiyang <yileiyang@gmail.com>

* add ci.yml for continuous integration and continuous development * branch self test switch back to conda from uv change back to python3.10 from python 3.12 * add conda-forge channel for tox-conda package * add py lib for tox-conda * change conda to uv because tox-conda is too old * using uv tool for fixing setup issue * pip is needed for tox 3.0, while azure use 3.0 don't need this permission * add all possible commands to allowlist * change back to main branch * back to conda with fixed tox and tox-conda version * change back to uv --------- Co-authored-by: yileiyang <yileiyang@gmail.com>

* Add nightly test to the repo * make parity alignment * fix incoordination shutil.rmtree of rank0 --------- Co-authored-by: yileiyang <yileiyang@gmail.com>

…10) 1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam like optimizers 2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>

1. Fix tracing error for pytorch 2.9+ 2. Fix state dicts related logic for Muon. 3. Add MASTER_PORT (also used as PYTEST_RUN_ID) env for tests, so we can run multiple pytest in the same machine.

1. Add a new flag zero_param_level_sharding so users can control the granularity of sharding parameters in ZeRO. 2. Add a muon mixin to flatten/unflatten parameters for Muon. 3. Refine the logic for flattened buffers. We now record the position (start/end) of each parameter in the flattened buffer (instead of calculating it on the fly), so we have more freedom to add padding.
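The third point can be pictured with a short sketch, assuming positions are measured in elements and padding is added to round each parameter up to an alignment boundary. `layout_flat_buffer` and `align` are hypothetical names; the PR description does not say how padding is chosen.

```python
from typing import Dict, Tuple


def layout_flat_buffer(numels: Dict[str, int], align: int = 1) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Record (start, end) for each parameter in a flattened buffer.

    Positions are stored up front instead of being recomputed from sizes,
    so padding can be inserted between parameters freely.
    Returns the position table and the total buffer length.
    """
    if align < 1:
        raise ValueError("align must be >= 1")
    positions: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for name, numel in numels.items():
        start, end = cursor, cursor + numel
        positions[name] = (start, end)
        # Round the next start up to a multiple of `align` (padding).
        cursor = -(-end // align) * align
    return positions, cursor


positions, total = layout_flat_buffer({"weight": 5, "bias": 3}, align=4)
print(positions, total)
```

With the positions stored explicitly, a reader of the buffer slices `flat[start:end]` and never needs to know how much padding sits between entries.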

…ensor parallelism (#18) 1. try to use all_gather to gather weights from other ranks for TP 2. If fails (pp/zero3 are used), we will fallback to use the old way. --------- Co-authored-by: Xun Wu <138114252+yushuiwx@users.noreply.github.com>

…ll cases (#19) Add a dtensor class to help runtime state dict merging.

This PR extends nnscaler’s mixed-module training/serialization path to support ZeRO-1 for non-parallel parameters, introducing a virtual NonParallelModule wrapper and updating optimizer/state-dict merge logic accordingly.

1. updates and expands the documentation for parallel_module.md and trainer.md, 2. update arg parser to add escape support. 3. update arg parser to better support __type, __value_type, function type.

Remove unnecessary multiref addition when pipeline parallelism is enabled. This change also clarifies the conditions under which multiref is applied. The old code was over-triggering: with pipeline parallelism enabled (len(pp_desc.spmd_descs) > 1), it would add multiref for any replicated parameter, even if that parameter only existed in a single stage. The fix narrows the condition to only add multiref when the parameter is actually consumed in multiple stages, which is the true definition of a cross-stage shared parameter.
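The difference between the two conditions can be sketched as follows. This models only the stage-count condition described above; both function names are hypothetical, and the real check operates on nnscaler graph objects rather than lists of stage ids.

```python
from typing import Iterable


def old_needs_multiref(num_stages: int, is_replicated: bool) -> bool:
    """Old (over-triggering) condition as described: any replicated
    parameter gets multiref whenever pipeline parallelism is enabled."""
    return num_stages > 1 and is_replicated


def new_needs_multiref(consumer_stages: Iterable[int]) -> bool:
    """Narrowed condition: multiref only when the parameter is actually
    consumed in more than one pipeline stage."""
    return len(set(consumer_stages)) > 1


# A replicated parameter used twice, but only inside stage 0 of a 4-stage pipeline.
print(old_needs_multiref(num_stages=4, is_replicated=True))  # old rule triggers
print(new_needs_multiref([0, 0]))                            # new rule does not
```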

Introduce support for fake functions in tracing, allowing placeholder functions to be registered and used during model tracing. This is mostly useful for functions that are costly or not runnable (e.g. communication functions) during tracing. This feature also aligns with `torch.library.custom_op`, which supports registering a fake implementation, although the interface is not exactly the same: the fake function here should return a real tensor, which can be used in concrete tracing later. (There is currently no `@custom_op.register_fake`-like interface.)

Add profiling capabilities to the CLI, allowing users to monitor CPU and CUDA activities during training.

Implement overlapped scheduler by: 1. add cuda stream context for each segment/adapter/reducer 2. add cuda event wait/record for each segment 3. refine sched code generation for stream/event config.

This PR introduces a “no-grad-reduce” annotation mechanism for custom op shape annotations so that, for specific partition identifiers, nnscaler can skip inserting gradient all-reduce adapters (avoiding incorrect or redundant reductions). This is done by extending ShapeAnno parsing to support : / modifiers (and '/' as a shortcut) to control gradient-reduction behavior during partitioning.

* Merge ring attention implementation into main branch * remove sink since it is no longer needed * remove cp_ranks * add tests

Introduce support for the grad_dtype attribute in parameters, enhancing flexibility in gradient precision management. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>

…parameter (#34)

nns supports 2.11

…unit tests

Yilei Yang and others added 30 commits June 24, 2024 03:23

Merged PR 2183: fix cache dir

98d57c8

fix cache dir is a str, no exists() function

Merged PR 2186: hotfix: non-tensor support for consistence check in m…

eeef286

…erging

Merged PR 2184: parser: never fold getattr node 'self.training'

b943f8e

parser: never fold getattr node 'self.training' unit test pass parity check pass

Merged PR 2188: self.training in submodules: hotfix for nightly test

04c608a

self.training in submodules: hotfix for nightly test unit test pass parity check pass

Merged PR 2189: Lightning: refine code/add more tests

ef2586e

Merged PR 2144: Nightly build scripts

a182bcc

Add a pipeline to nightly build wheel, and fix packaging for autodist profile data. pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114 repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly

Merged PR 2194: Reset version to v0.1 and update email

9eebf40

Merged PR 2193: lightning: refine docs about checkpoint

7a2485d

refine docs

Merged PR 2196: never fold nnscaler runtime functions

d6f6c09

never fold nnscaler runtime functions

Merged PR 2169: support conv1d-2d

42f64b1

support ConvTranspose1D,Conv2D,ConvTranspose2D

Merged PR 2200: TensorBase adaption in torch>=2.3

d82882e

`_TensorBase` was renamed to `TensorBase` in torch 2.3. This affects function calls like `torch.Tensor.view(t, (1, -1))`; this PR fixes the issue.

Merged PR 2197: update doc for v0.1

8ee468b

update doc for v0.1

Merged PR 2199: fix ifexpr warning

13ad8ff

fix ifexpr warning unit test pass parity check pass

Merged PR 2205: Fix sum anno bug

d185206

ensure the output tensor has dim anno when it is a scalar tensor

Merged PR 2202: add scalar tensor support

c59f7b1

1. Add a flag to IRTensor to indicate whether it is originally a scalar tensor. 2. During graph transformation, do as it is. 3. When generate code, check the flag to generate correct code. unit test pass parity check pass

Merged PR 2204: fix embedding padding index

106bbf2

Merged PR 2192: minitrainer: refine config

992b945

MiniTrainer workable version. the parity check is against lightning.

Merged PR 2209: Nanogpt with mini-trainer

027fd64

parity matched between lightning version & mini-trainer version

Merged PR 2210: Minitrainer: refine names / precision support

167704a

Merged PR 2212: add mixed precision f16 optimizer

c5b6dfb

add mixed precision f16 optimizer

Merged PR 2214: lightning: add merged checkpoint support

67f0e81

Merged PR 2213: Refine ring flash attn: add llama 3.1's implementation

6a069fa

0xWJ and others added 29 commits January 21, 2026 09:20

add more debug info

9d5b02e

refine comment

c799251

refine code

943b154

add barrier

52d9322

Merge pull request #6 from msrasys/weijiangxu/mem-fragment-in-resume

a77b282

[Refine] Reduce memory fragment when resuming

[Refine] Normalize device handling in state dicts and more (#9)

e39d68b

1. Normalize state_dict device handling 2. replace torch.cat with F.pad

Add Doc Autodist Constraints Guide (#5)

cc97940

* Add Doc for Autodist Constraints Guide * Revise the description of autodist in the documentation. * fix comment * polish doc --------- Co-authored-by: yileiyang <yileiyang@gmail.com>

Add nightly test to the repo (#12)

8ce8b09

* Add nightly test to the repo * make parity alignment * fix incoordination shutil.rmtree of rank0 --------- Co-authored-by: yileiyang <yileiyang@gmail.com>

[Feat] Add Muon Support Phase 1 (dp without zero) (#13)

8719948

1. Fix tracing error for pytorch 2.9+ 2. Fix state dicts related logic for Muon. 3. Add MASTER_PORT (also used as PYTEST_RUN_ID) env for tests, so we can run multiple pytest in the same machine.

Fix set item issue in profiler database (#15)

4ecaa0b

[Test] add tests for cli pipeline (#20)

96dca62

[Runtime] Refine the performance of module state dict gathering for a…

174ef8e

…ll cases (#19) Add a dtensor class to help runtime state dict merging.

[Feat] Add zero1 support for non-parallel parameters. (#21)

da22d5b

This PR extends nnscaler’s mixed-module training/serialization path to support ZeRO-1 for non-parallel parameters, introducing a virtual NonParallelModule wrapper and updating optimizer/state-dict merge logic accordingly.

[Trainer] Document update and arg parser refine (#22)

5a6fbdd

1. updates and expands the documentation for parallel_module.md and trainer.md, 2. update arg parser to add escape support. 3. update arg parser to better support __type, __value_type, function type.

[Feat] CLI: Add profiling support (#28)

aa951a8

Add profiling capabilities to the CLI, allowing users to monitor CPU and CUDA activities during training.

[Feat] Pipeline: add overlapped scheduler support (#27)

8d751d4

Implement overlapped scheduler by: 1. add cuda stream context for each segment/adapter/reducer 2. add cuda event wait/record for each segment 3. refine sched code generation for stream/event config.

Merge ring attention implementation into main branch (#32)

772cb00

* Merge ring attention implementation into main branch * remove sink since it is no longer needed * remove cp_ranks * add tests

[Feat] Add grad_dtype support (#30)

6ddb208

Introduce support for the grad_dtype attribute in parameters, enhancing flexibility in gradient precision management. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>

[Feat] Update shuffle_varlen and unshuffle_varlen to accept cp_ranks …

56a3190

…parameter (#34)

Update torch version constraint in requirements.txt (#35)

b770d32

nns supports 2.11

[Fix] zero parameter-level sharding in Reducer and add corresponding …

d0f91f3

…unit tests

donglixp closed this May 4, 2026



Li/fix zero param level sharding#57
donglixp wants to merge 2042 commits into microsoft:main from msrasys:li/fix-zero_param_level_sharding


5 participants
