Skip to content

nsight system and torch Profiler step-wise data collection support#9391

Open
arvyanh wants to merge 7 commits into
modelscope:mainfrom
arvyanh:profiler
Open

nsight system and torch Profiler step-wise data collection support#9391
arvyanh wants to merge 7 commits into
modelscope:mainfrom
arvyanh:profiler

Conversation

@arvyanh
Copy link
Copy Markdown

@arvyanh arvyanh commented May 21, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

add step-based nsys/torch-profiler collection control

Experiment results

arvyanh and others added 7 commits May 20, 2026 16:35
- rename nsys.py -> profiler.py with NsysCallback + TorchProfilerCallback
- add profiler_type arg (nsys/torch) to megatron_args
- add profile_rank arg (-1=all, 0=rank0 default)
- TorchProfilerCallback: torch.profiler.profile with step schedule, saves to output_dir/torch_profiler/
- NsysCallback: add check_error on cudaProfilerStart/Stop
- update mapping.py with torch_profiler entry
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces profiling support for Megatron training by implementing NsysCallback and TorchProfilerCallback, along with corresponding configuration arguments to control profiling ranges and target ranks. The review feedback highlights several critical improvements, including a logic fix to ensure the Torch profiler remains disabled when configured with default values, handling potential null values for the output directory, and aligning default step values across different classes. Additionally, suggestions were made to enhance type safety using Literal types and to make the callback selection logic more explicit to avoid errors with unexpected input strings.

Comment on lines +70 to +76
if not self._should_profile():
return
wait = max(0, self.start_step - 1)
active = max(1, self.end_step - self.start_step + 1)
base_dir = self.args.output_dir
trace_dir = os.path.join(base_dir, 'torch_profiler', f'rank{self._local_rank}_node{self._node_rank}')
os.makedirs(trace_dir, exist_ok=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are a few issues here:

  1. If nsys_profile_start is set to -1 (the default), the current logic will calculate wait=0 and active=1, causing the Torch profiler to run for the first step even when it should be disabled. You should explicitly check if profiling is enabled (e.g., start_step > 0).
  2. self.args.output_dir can be None. If so, os.path.join will raise a TypeError. Consider providing a default directory (e.g., '.').
        if not self._should_profile() or self.start_step <= 0:
            return
        wait = self.start_step - 1
        active = self.end_step - self.start_step + 1
        if active <= 0:
            return
        base_dir = self.args.output_dir or '.'
        trace_dir = os.path.join(base_dir, 'torch_profiler', f'rank{self._local_rank}_node{self._node_rank}')
        os.makedirs(trace_dir, exist_ok=True)

callbacks: List[str] = field(default_factory=list)
nsys_profile_start: int = -1 # 1-based; 0 = disabled
nsys_profile_end: int = -1
profiler_type: str = 'none' # nsys or torch
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is recommended to use Literal for profiler_type to provide better type safety and validation of the allowed values ('none', 'nsys', 'torch').

Suggested change
profiler_type: str = 'none' # nsys or torch
profiler_type: Literal['none', 'nsys', 'torch'] = 'none' # nsys or torch

Comment on lines +761 to +763
if self.profiler_type != 'none':
cb = 'torch_profiler' if self.profiler_type == 'torch' else 'nsys'
self.callbacks.append(cb)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic for adding profiler callbacks is a bit loose. If profiler_type is set to an unexpected string (other than 'none' or 'torch'), it will default to adding the 'nsys' callback. It's better to be explicit.

Suggested change
if self.profiler_type != 'none':
cb = 'torch_profiler' if self.profiler_type == 'torch' else 'nsys'
self.callbacks.append(cb)
if self.profiler_type == 'torch':
self.callbacks.append('torch_profiler')
elif self.profiler_type == 'nsys':
self.callbacks.append('nsys')

Comment on lines +57 to +58
self.start_step = getattr(self.args, 'nsys_profile_start', 5)
self.end_step = getattr(self.args, 'nsys_profile_end', 5)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default values in getattr (5) are inconsistent with the defaults defined in MegatronArguments (-1) and those used in NsysCallback (-1). While MegatronArguments provides a default, it's safer and clearer to use the same fallback value here.

Suggested change
self.start_step = getattr(self.args, 'nsys_profile_start', 5)
self.end_step = getattr(self.args, 'nsys_profile_end', 5)
self.start_step = getattr(self.args, 'nsys_profile_start', -1)
self.end_step = getattr(self.args, 'nsys_profile_end', -1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant