Li/fix shuffle varlen by donglixp · Pull Request #56 · microsoft/nnscaler

donglixp · 2026-04-29T08:29:24Z

def shuffle_varlen(t: Tensor, cu_seqlens_padded: Tensor, cp_ranks: List[int], cp_group: dist.ProcessGroup) -> Tensor:

def unshuffle_varlen(t: Tensor, cu_seqlens_padded: Tensor, cp_ranks: List[int], cp_group: dist.ProcessGroup) -> Tensor:

missing cp_ranks (although it's unused in the func)

After this PR, the `follow` logic in autodist is:
- the father (followed) op should not contain a sum dim (linear is defined as a sum op, since it has a sum dim in its computation)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the 1st producer if the producers are in the same `follow region`

Update the test case to illustrate this change. Fix the bug in the dp solver when computing the in edges for a dp node.
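The three follow rules above can be sketched as a small decision function. This is only an illustration: `Op`, `pick_followed`, and the `region` mapping are hypothetical names and are not taken from the autodist code, and a "unary op" is approximated here as an op with exactly one producer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Op:
    # Hypothetical stand-in for a graph operator; not an nnscaler class.
    name: str
    producers: List[str] = field(default_factory=list)
    has_sum_dim: bool = False  # e.g. linear, which reduces over a dimension


def pick_followed(op: Op, ops: Dict[str, Op], region: Dict[str, int]) -> Optional[str]:
    """Return the name of the producer that `op` follows, or None."""
    if not op.producers:
        return None
    first = ops[op.producers[0]]
    # Rule 1: the followed (father) op must not contain a sum dim.
    if first.has_sum_dim:
        return None
    # Rule 2: an op with a single producer tries to follow that producer.
    if len(op.producers) == 1:
        return first.name
    # Rule 3: with several producers, follow the first one only when
    # every producer sits in the same follow region.
    regions = {region.get(p) for p in op.producers}
    if len(regions) == 1 and None not in regions:
        return first.name
    return None


ops = {
    "linear": Op("linear", has_sum_dim=True),
    "gelu": Op("gelu", ["linear"]),
    "dropout": Op("dropout", ["gelu"]),
    "add": Op("add", ["gelu", "dropout"]),
}
region = {"gelu": 0, "dropout": 0}  # assumed region ids, for illustration only

followed_by_gelu = pick_followed(ops["gelu"], ops, region)
followed_by_dropout = pick_followed(ops["dropout"], ops, region)
followed_by_add = pick_followed(ops["add"], ops, region)
```

In this toy graph, `gelu` does not follow `linear` because `linear` has a sum dim, while `dropout` and `add` both follow `gelu`.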

…contains attributes

parity alert passed

![image.png](https://msrasrg.visualstudio.com/bb54e96e-8cc1-46f6-9021-c7048165b5bc/_apis/git/repositories/66b74611-09f4-4d0e-89b7-5ee93c087d3c/pullRequests/2187/attachments/image.png)

The index in `train_mem2in_idx` is the original index of the operator's input. This PR adds a mapping from the original index to the pure tensor index. The bug was exposed by functions that do not put their tensor inputs first, e.g., `torch.gather`.
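A minimal sketch of such an index mapping, assuming the arguments are available as a flat list. `tensor_index_map` and `FakeTensor` are hypothetical names used only for illustration; they are not part of nnscaler.

```python
from typing import Any, Callable, Dict, Sequence


def tensor_index_map(args: Sequence[Any], is_tensor: Callable[[Any], bool]) -> Dict[int, int]:
    """Map each tensor argument's original position to its position
    among the tensor arguments only."""
    mapping: Dict[int, int] = {}
    for orig_idx, arg in enumerate(args):
        if is_tensor(arg):
            mapping[orig_idx] = len(mapping)
    return mapping


class FakeTensor:
    """Placeholder for a tensor so the sketch runs without torch."""


# torch.gather(input, dim, index): tensors sit at positions 0 and 2.
gather_args = [FakeTensor(), 1, FakeTensor()]
mapping = tensor_index_map(gather_args, lambda a: isinstance(a, FakeTensor))
```

For the `torch.gather` argument layout, original index 2 (`index`) maps to pure tensor index 1, which is the case an identity mapping gets wrong.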

- align the memory estimation in the dp solver with the ilp solver; check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine c++ code
- verified the search result against the ilp solver, with and without recompute, on retnet-3b

NOTE: after this PR, more meta information is stored in each dynamic programming state, so the dp solver may be slower than the ilp solver. This needs further optimization.

1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails
2. dp solver bug 1: segmentation fault when `following candidates` is empty
3. dp solver bug 2: corner case where a newly generated dp state can be illegal; it needs to be checked when it is added to the new states

tests added

Added nanoGPT example

Known causes of parity mismatch:
- grad clip
- param group
- unable to align rng (validation loop consumes random numbers, etc)

Update: Added args `use_nnscaler`, `plan_ngpus`, `runtime_ngpus`. Precision will be derived from nanoGPT's `dtype` arg. Doc will be added in another PR.

Other known issue: When using fp16, nanoGPT uses a scaler while this version does not. I think it's not a big deal because by default nanoGPT does not use fp16.

Adds a pipeline to upload the release wheel to the devops artifact feed and test.pypi.org. (It has already been run based on the 0.1 tag.) Also updates `version.py` to 0.2, to avoid creating a separate PR.

Pipeline usage:
1. Open the pipeline webpage: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=116
2. Click "Run pipeline"
3. Choose the branch/tag
4. Click "Variables"
5. Click "version" and set the value to something like "v0.1"
6. Confirm the update and run (the update will not be saved and must be done every time)

…nal einops functions.

Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulations. To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.

* add ci.yml for continuous integration and continuous development
* branch self test switch back to conda from uv change back to python3.10 from python 3.12
* add conda-forge channel for tox-conda package
* add py lib for tox-conda
* change conda to uv because tox-conda is too old
* using uv tool for fixing setup issue
* pip is needed for tox 3.0, while azure use 3.0 don't need this permission
* add all possible commands to allowlist
* change back to main branch
* back to conda with fixed tox and tox-conda version
* change back to uv

---------
Co-authored-by: yileiyang <yileiyang@gmail.com>

…10)

1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam-like optimizers
2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin.

---------
Co-authored-by: zyeric <cheerforwhy@gmail.com>

1. Add a new flag zero_param_level_sharding so users can control the granularity of parameter sharding in ZeRO
2. Add a muon mixin to flatten/unflatten parameters for Muon.
3. Refine the logic for flattened buffers. We now record the position (start/end) of each parameter in the flattened buffer instead of calculating it on the fly, so we have more freedom to add padding.
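The recorded start/end positions from item 3 can be sketched as follows. This is an illustration under assumptions: `layout_flat_buffer` and its `align` parameter are hypothetical and do not reflect the actual nnscaler implementation or its padding policy.

```python
from typing import Dict, Tuple


def layout_flat_buffer(
    numels: Dict[str, int], align: int = 1
) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Assign each parameter a (start, end) slice in one flat buffer.

    Positions are stored explicitly, so padding can be placed between
    parameters without any later offset arithmetic.
    """
    positions: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name, numel in numels.items():
        start, end = offset, offset + numel
        positions[name] = (start, end)
        # Round the next start up to a multiple of `align`; the gap is padding.
        offset = -(-end // align) * align
    return positions, offset


positions, total = layout_flat_buffer({"weight": 10, "bias": 3, "gain": 5}, align=4)
```

With `align=4`, `weight` occupies `[0, 10)`, `bias` starts at the padded offset 12, and `gain` starts at 16, for a total buffer length of 24 elements.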

…ensor parallelism (#18)

1. Try to use all_gather to gather weights from other ranks for TP
2. If that fails (pp/zero3 are used), fall back to the old way.

---------
Co-authored-by: Xun Wu <138114252+yushuiwx@users.noreply.github.com>

Remove unnecessary multiref addition when pipeline parallelism is enabled. This change also clarifies the conditions under which multiref is applied. The old code was over-triggering: with pipeline parallelism enabled (len(pp_desc.spmd_descs) > 1), it would add multiref for any replicated parameter, even if that parameter only existed in a single stage. The fix narrows the condition to only add multiref when the parameter is actually consumed in multiple stages, which is the true definition of a cross-stage shared parameter.
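The difference between the two conditions can be sketched as below. Both function names and the stage-id representation are hypothetical; the real check operates on nnscaler's graph structures rather than plain lists.

```python
from typing import Iterable


def old_needs_multiref(num_stages: int, is_replicated: bool) -> bool:
    # Over-triggering condition: any replicated parameter gets a multiref
    # as soon as pipeline parallelism is enabled.
    return num_stages > 1 and is_replicated


def new_needs_multiref(consumer_stages: Iterable[int]) -> bool:
    # Narrowed condition: only parameters consumed in more than one
    # pipeline stage are cross-stage shared parameters.
    return len(set(consumer_stages)) > 1


single_stage_param = [0, 0]  # consumed twice, but only inside stage 0
shared_param = [0, 3]        # e.g. a weight used in the first and last stage
```

Under the old condition both parameters would receive a multiref in a 4-stage pipeline; under the narrowed one only `shared_param` does.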

Introduce support for fake functions in tracing, allowing placeholder functions to be registered and used during model tracing. This is mostly useful for functions that are costly or not runnable (e.g., communication functions) during tracing. This feature also aligns with `torch.library.custom_op`, which supports registering a fake implementation, but the interface is not exactly the same. The fake function should return a real tensor, which can be used in concrete tracing later. (Currently there is no `@custom_op.register_fake`-like interface.)
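The general idea of swapping in a fake during tracing can be sketched with a plain registry. This is not nnscaler's interface: `register_fake`, `call_during_tracing`, and `all_reduce_sum` are hypothetical names, and a Python list stands in for a tensor so the sketch runs without torch.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a real function to its fake replacement.
_FAKE_IMPLS: Dict[Callable[..., Any], Callable[..., Any]] = {}


def register_fake(real_fn: Callable[..., Any], fake_fn: Callable[..., Any]) -> None:
    """Record `fake_fn` as the stand-in for `real_fn` during tracing."""
    _FAKE_IMPLS[real_fn] = fake_fn


def call_during_tracing(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run the fake if one is registered, otherwise run the function itself."""
    impl = _FAKE_IMPLS.get(fn, fn)
    return impl(*args, **kwargs)


def all_reduce_sum(values):
    # Stands in for a communication op that cannot run while tracing.
    raise RuntimeError("requires an initialized process group")


def fake_all_reduce_sum(values):
    # Returns a real value of the right shape, as the description requires.
    return list(values)


register_fake(all_reduce_sum, fake_all_reduce_sum)
traced_output = call_during_tracing(all_reduce_sum, [1.0, 2.0])
```

The fake returns a concrete value with the expected shape, so later tracing steps can consume it as if the real call had run.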

This PR introduces a “no-grad-reduce” annotation mechanism for custom op shape annotations so that, for specific partition identifiers, nnscaler can skip inserting gradient all-reduce adapters (avoiding incorrect or redundant reductions). This is done by extending ShapeAnno parsing to support : / modifiers (and '/' as a shortcut) to control gradient-reduction behavior during partitioning.

0xWJ and others added 30 commits June 19, 2024 02:38

Merged PR 2111: refine optimizer state dict merge

d0b7e5e

This PR reduces memory usage during merging by combining the ZeRO and TP states.

Merged PR 2180: lightning: fix gradient sync and gradient averaging

4dc1166

Merged PR 2183: fix cache dir

98d57c8

Fix: the cache dir is a str, which has no exists() method.

Merged PR 2186: hotfix: non-tensor support for consistence check in m…

eeef286

…erging

Merged PR 2184: parser: never fold getattr node 'self.training'

b943f8e

parser: never fold getattr node 'self.training'

unit test pass
parity check pass

Merged PR 2188: self.training in submodules: hotfix for nightly test

04c608a

self.training in submodules: hotfix for nightly test

unit test pass
parity check pass

Merged PR 2189: Lightning: refine code/add more tests

ef2586e

Merged PR 2144: Nightly build scripts

a182bcc

Add a pipeline for the nightly wheel build, and fix packaging for autodist profile data.

pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114
repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly

Merged PR 2194: Reset version to v0.1 and update email

9eebf40

Merged PR 2193: lightning: refine docs about checkpoint

7a2485d

refine docs

Merged PR 2196: never fold nnscaler runtime functions

d6f6c09

never fold nnscaler runtime functions

Merged PR 2169: support conv1d-2d

42f64b1

support ConvTranspose1D,Conv2D,ConvTranspose2D

Merged PR 2200: TensorBase adaption in torch>=2.3

d82882e

`_TensorBase` was renamed to `TensorBase` in torch 2.3, which affects function calls like `torch.Tensor.view(t, (1, -1))`. This PR fixes the issue.

Merged PR 2197: update doc for v0.1

8ee468b

update doc for v0.1

Merged PR 2199: fix ifexpr warning

13ad8ff

fix ifexpr warning

unit test pass
parity check pass

Merged PR 2205: Fix sum anno bug

d185206

ensure the output tensor has dim anno when it is a scalar tensor

Merged PR 2202: add scalar tensor support

c59f7b1

1. Add a flag to IRTensor to indicate whether it is originally a scalar tensor.
2. During graph transformation, keep the tensor as it is.
3. When generating code, check the flag to generate the correct code.

unit test pass
parity check pass

Merged PR 2204: fix embedding padding index

106bbf2

Merged PR 2192: minitrainer: refine config

992b945

Workable version of MiniTrainer. The parity check is against the lightning version.

Merged PR 2209: Nanogpt with mini-trainer

027fd64

parity matched between lightning version & mini-trainer version

Merged PR 2210: Minitrainer: refine names / precision support

167704a

Merged PR 2212: add mixed precision f16 optimizer

c5b6dfb

add mixed precision f16 optimizer

Merged PR 2214: lightning: add merged checkpoint support

67f0e81

0xWJ and others added 27 commits January 21, 2026 09:20

add more debug info

9d5b02e

refine comment

c799251

refine code

943b154

add barrier

52d9322

Merge pull request #6 from msrasys/weijiangxu/mem-fragment-in-resume

a77b282

[Refine] Reduce memory fragment when resuming

[Refine] Normalize device handling in state dicts and more (#9)

e39d68b

1. Normalize state_dict device handling
2. Replace torch.cat with F.pad

Add Doc Autodist Constraints Guide (#5)

cc97940

* Add Doc for Autodist Constraints Guide
* Revise the description of autodist in the documentation.
* fix comment
* polish doc

---------
Co-authored-by: yileiyang <yileiyang@gmail.com>

Add nightly test to the repo (#12)

8ce8b09

* Add nightly test to the repo
* make parity alignment
* fix uncoordinated shutil.rmtree on rank0

---------
Co-authored-by: yileiyang <yileiyang@gmail.com>

[Feat] Add Muon Support Phase 1 (dp without zero) (#13)

8719948

1. Fix tracing error for pytorch 2.9+
2. Fix state-dict-related logic for Muon.
3. Add a MASTER_PORT (also used as PYTEST_RUN_ID) env variable for tests, so multiple pytest runs can share the same machine.

Fix set item issue in profiler database (#15)

4ecaa0b

[Test] add tests for cli pipeline (#20)

96dca62

[Runtime] Refine the performance of module state dict gathering for a…

174ef8e

…ll cases (#19) Add a dtensor class to help runtime state dict merging.

[Feat] Add zero1 support for non-parallel parameters. (#21)

da22d5b

This PR extends nnscaler’s mixed-module training/serialization path to support ZeRO-1 for non-parallel parameters, introducing a virtual NonParallelModule wrapper and updating optimizer/state-dict merge logic accordingly.

[Trainer] Document update and arg parser refine (#22)

5a6fbdd

1. Update and expand the documentation for parallel_module.md and trainer.md.
2. Update the arg parser to add escape support.
3. Update the arg parser to better support __type, __value_type, and function type.

[Feat] CLI: Add profiling support (#28)

aa951a8

Add profiling capabilities to the CLI, allowing users to monitor CPU and CUDA activities during training.

[Feat] Pipeline: add overlapped scheduler support (#27)

8d751d4

Implement the overlapped scheduler by:
1. adding a cuda stream context for each segment/adapter/reducer
2. adding cuda event wait/record for each segment
3. refining sched code generation for stream/event config

Merge ring attention implementation into main branch (#32)

772cb00

* Merge ring attention implementation into main branch
* remove sink since it is no longer needed
* remove cp_ranks
* add tests

[Feat] Add grad_dtype support (#30)

6ddb208

Introduce support for the grad_dtype attribute in parameters, enhancing flexibility in gradient precision management.

---------
Co-authored-by: zyeric <cheerforwhy@gmail.com>

[Feat] Update shuffle_varlen and unshuffle_varlen to accept cp_ranks …

8776025

…parameter

donglixp closed this Apr 29, 2026

donglixp deleted the li/fix_shuffle_varlen branch April 29, 2026 08:34

donglixp restored the li/fix_shuffle_varlen branch April 30, 2026 04:27

donglixp wants to merge 2040 commits into microsoft:main from msrasys:li/fix_shuffle_varlen

donglixp commented Apr 29, 2026


5 participants
