
Li/fix shuffle varlen #56

Closed
donglixp wants to merge 2040 commits into microsoft:main from msrasys:li/fix_shuffle_varlen

@donglixp

def shuffle_varlen(t: Tensor, cu_seqlens_padded: Tensor, cp_ranks: List[int], cp_group: dist.ProcessGroup) -> Tensor:

def unshuffle_varlen(t: Tensor, cu_seqlens_padded: Tensor, cp_ranks: List[int], cp_group: dist.ProcessGroup) -> Tensor:

Missing `cp_ranks` (although it is unused in the function).

0xWJ and others added 30 commits June 19, 2024 02:38
This PR is trying to reduce the memory usage when merging by combining zero and tp state together.
Fix: the cache dir is a `str`, which has no `exists()` method
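
A minimal sketch of the failure mode this commit message describes, assuming the cache directory can arrive as a plain string. `cache_dir_exists` is a hypothetical helper written for illustration, not nnscaler's actual code:

```python
import tempfile
from pathlib import Path
from typing import Union


def cache_dir_exists(cache_dir: Union[str, Path]) -> bool:
    """Hypothetical helper: accept either a str or a Path and check existence."""
    # A plain str has no .exists() method, so normalize to Path first.
    return Path(cache_dir).exists()


# The system temp directory exists; an arbitrary made-up path does not.
tmp_ok = cache_dir_exists(tempfile.gettempdir())
missing = cache_dir_exists("/definitely/not/a/real/dir/xyz123")
```

Normalizing at the boundary keeps the rest of the code free to call `Path` methods regardless of what the caller passed in.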
After this PR, the `follow` logic in autodist is
- the father (followed) op should not contain a sum dim (linear is defined as a sum op, since it has a sum dim in computation)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs are from multiple producers (like add, concat), it will follow the 1st producer if the producers are in the same `follow region`.

Update the test case to elaborate this PR.

Fix the bug in dp solver when computing the in edges for a dp node.
The index in `train_mem2in_idx` is the original index of the operator's input; this fix adds a mapping from the original index to the pure tensor index.

This bug was exposed by functions that do not put their tensor inputs first, e.g., `torch.gather`.
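
The index mapping described above can be sketched as follows. This is an illustration under assumptions, not the solver's code: `FakeTensor` and `original_to_tensor_index` are hypothetical names, and a stand-in class is used so the example runs without torch.

```python
from typing import Any, Dict, Sequence


class FakeTensor:
    """Hypothetical stand-in for a tensor input (avoids a torch dependency)."""

    def __init__(self, name: str) -> None:
        self.name = name


def original_to_tensor_index(inputs: Sequence[Any]) -> Dict[int, int]:
    """Map each tensor's original argument position to its position
    among the tensor-only inputs."""
    mapping: Dict[int, int] = {}
    for orig_idx, arg in enumerate(inputs):
        if isinstance(arg, FakeTensor):
            # The next free tensor-only slot is the number of tensors seen so far.
            mapping[orig_idx] = len(mapping)
    return mapping


# torch.gather(input, dim, index): tensors sit at original positions 0 and 2,
# with the non-tensor `dim` argument in between.
gather_inputs = [FakeTensor("input"), 1, FakeTensor("index")]
gather_map = original_to_tensor_index(gather_inputs)
```

For an operator whose tensor inputs all come first, the mapping is the identity, which is why the bug only surfaced for signatures like `torch.gather`.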
parser: never fold getattr node 'self.training'
unit test pass
parity check pass
self.training in submodules: hotfix for nightly test

unit test pass
parity check pass
never fold nnscaler runtime functions
- align the memory estimation in dp solver with ilp solver, check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine c++ code
- have verified the search result compared to the ilp with & without recompute on retnet-3b

NOTE: after this PR, more meta information is stored in each dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.
1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails
2. dp solver bug 1: segmentation fault when `following candidates` is empty
3. dp solver bug 2: corner case, the newly generated dp state can be illegal and needs to be checked when adding it to the new states

tests added
support ConvTranspose1D,Conv2D,ConvTranspose2D
`_TensorBase` was renamed to `TensorBase` in torch 2.3, which affects function calls like `torch.Tensor.view(t, (1, -1))`. This PR fixes the issue.
Added nanoGPT example

Known causes of parity mismatch:

- grad clip
- param group
- unable to align rng (validation loop consumes random numbers, etc)

Update:

Added args `use_nnscaler`, `plan_ngpus`, `runtime_ngpus`.
Precision will derive nanoGPT's arg `dtype`.

Doc will be added in another PR.

Other known issue:
When using fp16, nanoGPT uses a scaler while this version does not.
I think it's not a big deal because by default nanoGPT does not use fp16.
fix ifexpr warning
unit test pass
parity check pass
ensure the output tensor has dim anno when it is a scalar tensor
Adds a pipeline to upload the release wheel to devops artifact and test.pypi.org. (It has already been run based on the 0.1 tag.)

Also updates `version.py` to 0.2, to avoid creating a separate PR.

Pipeline usage:

1. Open the pipeline webpage: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=116
2. Click "Run pipeline"
3. Choose the branch/tag
4. Click "Variables"
5. Click "version" and set the value to something like "v0.1"
6. Confirm update and run (the update will not be saved and must be done every time)
1. Add a flag to IRTensor to indicate whether it is originally a scalar tensor.
2. During graph transformation, do as it is.
3. When generating code, check the flag to generate correct code.

unit test pass
parity check pass
MiniTrainer workable version. the parity check is against lightning.
parity matched between lightning version & mini-trainer version
add mixed precision f16 optimizer
0xWJ and others added 27 commits January 21, 2026 09:20
[Refine] Reduce memory fragment when resuming
…nal einops functions.

Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulations.

To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.
1. Normalize state_dict device handling
2. replace torch.cat with F.pad
* Add Doc for Autodist Constraints Guide

* Revise the description of autodist in the documentation.

* fix comment

* polish doc

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
* add ci.yml for continuous integration and continuous development

* branch self test
switch back to conda from uv
change back to python3.10 from python 3.12

* add conda-forge channel for tox-conda package

* add py lib for tox-conda

* change conda to uv because tox-conda is too old

* using uv tool for fixing setup issue

* pip is needed for tox 3.0, while azure uses 3.0 and doesn't need this permission

* add all possible commands to allowlist

* change back to main branch

* back to conda with fixed tox and tox-conda version

* change back to uv

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
* Add nightly test to the repo

* make parity alignment

* fix uncoordinated shutil.rmtree on rank0

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
…10)

1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam like optimizers
2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin.

---------

Co-authored-by: zyeric <cheerforwhy@gmail.com>
1. Fix tracing error for pytorch 2.9+
2. Fix state dicts related logic for Muon.
3. Add MASTER_PORT (also used as PYTEST_RUN_ID) env for tests, so we can run multiple pytest in the same machine.
1. add a new flag zero_param_level_sharding so user can control the granularity of sharding parameters in ZeRO
2. Add a muon mixin to flatten/unflatten parameters for Muon.
3. Refine the logic for flattened buffers: record the position (start/end) of each parameter in the flattened buffer instead of calculating it on the fly, which gives more freedom to add padding.
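
The idea in item 3 can be sketched as below: compute each parameter's (start, end) once, with optional padding, and store it. This is a toy illustration under assumptions; `plan_flat_buffer` and the `align` parameter are hypothetical and not nnscaler's interface.

```python
from typing import Dict, Tuple


def _round_up(value: int, align: int) -> int:
    """Round `value` up to the nearest multiple of `align`."""
    return -(-value // align) * align


def plan_flat_buffer(
    numels: Dict[str, int], align: int = 1
) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Record (start, end) for each parameter in a flattened buffer.

    Each start offset is padded up to a multiple of `align`; the returned
    total size is padded the same way.
    """
    positions: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name, numel in numels.items():
        start = _round_up(offset, align)
        positions[name] = (start, start + numel)
        offset = start + numel
    return positions, _round_up(offset, align)


# Three parameters with 10, 3 and 5 elements, each aligned to 8 elements.
positions, total = plan_flat_buffer({"w": 10, "b": 3, "g": 5}, align=8)
# positions == {"w": (0, 10), "b": (16, 19), "g": (24, 29)}, total == 32
```

Because the positions are stored rather than recomputed from cumulative sizes, padding can be inserted anywhere without every consumer having to know the padding rule.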
…ensor parallelism (#18)

1. Try to use all_gather to gather weights from other ranks for TP.
2. If that fails (pp/zero3 are used), fall back to the old way.

---------

Co-authored-by: Xun Wu <138114252+yushuiwx@users.noreply.github.com>
…ll cases (#19)

Add a dtensor class to help runtime state dict merging.
This PR extends nnscaler’s mixed-module training/serialization path to support ZeRO-1 for non-parallel parameters, introducing a virtual NonParallelModule wrapper and updating optimizer/state-dict merge logic accordingly.
1. updates and expands the documentation for parallel_module.md and trainer.md,
2. update arg parser to add escape support.
3. update arg parser to better support __type, __value_type, function type.
Remove unnecessary multiref addition when pipeline parallelism is enabled. This change also clarifies the conditions under which multiref is applied.

The old code was over-triggering: with pipeline parallelism enabled (len(pp_desc.spmd_descs) > 1), it would add multiref for any replicated parameter, even if that parameter only existed in a single stage. The fix narrows the condition to only add multiref when the parameter is actually consumed in multiple stages, which is the true definition of a cross-stage shared parameter.
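
The narrowed condition can be illustrated with a toy check. This is a sketch under assumptions: `needs_multiref` and the stage-to-parameters dictionary are hypothetical, and the sketch models only the cross-stage case described above, not any other condition under which multiref is applied.

```python
from typing import Dict, Set


def needs_multiref(
    param: str, consumers_by_stage: Dict[int, Set[str]], num_stages: int
) -> bool:
    """Return True only if `param` is consumed in more than one pipeline stage."""
    if num_stages <= 1:
        # Assumption: without pipeline parallelism this sketch adds nothing.
        return False
    stages_using = [
        stage for stage, params in consumers_by_stage.items() if param in params
    ]
    return len(stages_using) > 1


# A tied embedding is consumed in both stages; layer weights live in one stage.
consumers = {
    0: {"embed.weight", "layer0.weight"},
    1: {"layer1.weight", "embed.weight"},
}
shared = needs_multiref("embed.weight", consumers, num_stages=2)
single = needs_multiref("layer0.weight", consumers, num_stages=2)
```

Under the old behavior as described, `layer0.weight` would also have received a multiref simply because pipeline parallelism was enabled; here only the parameter that actually crosses stages qualifies.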
Introduce support for fake functions in tracing, allowing for the registration and use of placeholder functions during model tracing. This is mostly useful for functions that are costly or are not runnable (i.e. communication functions) during tracing.

This feature also aligns with `torch.library.custom_op`, which supports registering a fake implementation, although the interface is not exactly the same.

The fake function should return a real tensor, which can be used in concrete tracing later. (Currently there is no `@custom_op.register_fake`-like interface.)
Add profiling capabilities to the CLI, allowing users to monitor CPU and CUDA activities during training.
Implement overlapped scheduler by:

1. add cuda stream context for each segment/adapter/reducer
2. add cuda event wait/record for each segment
3. refine sched code generation for stream/event config.
This PR introduces a “no-grad-reduce” annotation mechanism for custom op shape annotations so that, for specific partition identifiers, nnscaler can skip inserting gradient all-reduce adapters (avoiding incorrect or redundant reductions).

This is done by extending ShapeAnno parsing to support : / modifiers (and '/' as a shortcut) to control gradient-reduction behavior during partitioning.
* Merge ring attention implementation into main branch

* remove sink since it is no longer needed

* remove cp_ranks

* add tests
Introduce support for the grad_dtype attribute in parameters, enhancing flexibility in gradient precision management.
---------

Co-authored-by: zyeric <cheerforwhy@gmail.com>
@donglixp donglixp closed this Apr 29, 2026
@donglixp donglixp deleted the li/fix_shuffle_varlen branch April 29, 2026 08:34
@donglixp donglixp restored the li/fix_shuffle_varlen branch April 30, 2026 04:27

5 participants