feat: make mesh accept meshcontext #2266

Open: adil-a wants to merge 7 commits into main from akoumpa/refactor_auto_class_public_api

adil-a (Collaborator) commented May 18, 2026

What does this PR do?

Refactors the distributed public API around MeshContext so that users can initialize the distributed environment once, create a mesh context, and pass it directly to NeMoAutoModelForCausalLM.from_pretrained.

Changelog

  • Add/standardize create_mesh_context as the component-layer API that returns a MeshContext.
  • Rename the recipe YAML adapter from _dist_setup.setup_distributed to _dist_utils.create_mesh_context_from_config.
  • Update NeMoAutoModel*, recipes, diffusion pipeline, docs, and tests to use mesh-context naming.
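
The layering described in the changelog can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the `MeshContext` fields, the `world_size`/`tp_size` parameters, the `mesh` keyword, and `ToyAutoModel` are all hypothetical stand-ins chosen so the example runs without a GPU or the NeMo package. Only the names `MeshContext`, `create_mesh_context`, and `create_mesh_context_from_config` come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class MeshContext:
    """Stand-in for the real MeshContext; it only records parallel sizes."""

    dp_size: int
    tp_size: int = 1

    @property
    def world_size(self) -> int:
        return self.dp_size * self.tp_size


def create_mesh_context(world_size: int, tp_size: int = 1) -> MeshContext:
    """Component-layer constructor (hypothetical signature)."""
    if world_size < 1 or tp_size < 1 or world_size % tp_size != 0:
        raise ValueError(
            "need world_size >= 1, tp_size >= 1, and world_size divisible by tp_size; "
            f"got world_size={world_size}, tp_size={tp_size}"
        )
    return MeshContext(dp_size=world_size // tp_size, tp_size=tp_size)


def create_mesh_context_from_config(cfg: Mapping[str, Any], world_size: int) -> MeshContext:
    """Recipe-layer adapter: map a YAML-style dict onto create_mesh_context."""
    return create_mesh_context(world_size=world_size, tp_size=int(cfg.get("tp_size", 1)))


class ToyAutoModel:
    """Toy loader showing a `mesh` argument that accepts a MeshContext."""

    def __init__(self, name: str, mesh: Optional[MeshContext]) -> None:
        self.name = name
        self.mesh = mesh

    @classmethod
    def from_pretrained(cls, name: str, mesh: Optional[MeshContext] = None) -> "ToyAutoModel":
        return cls(name, mesh)


# Create the context once, then hand the same object to the loader.
ctx = create_mesh_context(world_size=8, tp_size=2)
model = ToyAutoModel.from_pretrained("toy-model", mesh=ctx)
print(model.mesh)
```

The point of the sketch is the split of responsibilities: the component-layer function takes explicit arguments and returns the context, while the recipe adapter only translates config keys into that call, so both paths produce the same object.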

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Validation:

  • uv run ruff format ...
  • uv run ruff check --fix ...
  • uv run pytest tests/unit_tests/distributed/test_mesh_utils.py tests/unit_tests/distributed/test_device_mesh.py tests/unit_tests/recipes/test_dist_utils.py tests/unit_tests/recipes/test_diffusion_train_metrics.py tests/unit_tests/_diffusers/test_auto_diffusion_pipeline.py -q
  • Targeted recipe setup tests passed.

Note: running the full test_train_ft.py and test_finetune_vlm_helpers.py files hit failures in the optional cut_cross_entropy CUDA fused-CE path in this environment; these are unrelated to this change.

Additional Information

Related to distributed public API cleanup.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a (Collaborator, Author) commented May 18, 2026

/ok to test 3dcadfb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa (Contributor) commented May 18, 2026

/ok to test a8b2df6

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa requested review from a team and jgerh as code owners May 19, 2026 21:44

akoumpa (Contributor) commented May 19, 2026

/ok to test d039d24

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa (Contributor) commented May 19, 2026

/ok to test a4876ae

akoumpa (Contributor) commented May 20, 2026

/ok to test d836169

jgerh (Contributor) left a comment

Completed tech pubs review of docs/guides/gradient-checkpointing.md. No changes needed. LGTM.

3 participants