nsight system and torch Profiler step-wise data collection support#9391
nsight system and torch Profiler step-wise data collection support#9391arvyanh wants to merge 7 commits into
Conversation
- rename nsys.py -> profiler.py with NsysCallback + TorchProfilerCallback - add profiler_type arg (nsys/torch) to megatron_args - add profile_rank arg (-1=all, 0=rank0 default) - TorchProfilerCallback: torch.profiler.profile with step schedule, saves to output_dir/torch_profiler/ - NsysCallback: add check_error on cudaProfilerStart/Stop - update mapping.py with torch_profiler entry
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces profiling support for Megatron training by implementing NsysCallback and TorchProfilerCallback, along with corresponding configuration arguments to control profiling ranges and target ranks. The review feedback highlights several critical improvements, including a logic fix to ensure the Torch profiler remains disabled when configured with default values, handling potential null values for the output directory, and aligning default step values across different classes. Additionally, suggestions were made to enhance type safety using Literal types and to make the callback selection logic more explicit to avoid errors with unexpected input strings.
| if not self._should_profile(): | ||
| return | ||
| wait = max(0, self.start_step - 1) | ||
| active = max(1, self.end_step - self.start_step + 1) | ||
| base_dir = self.args.output_dir | ||
| trace_dir = os.path.join(base_dir, 'torch_profiler', f'rank{self._local_rank}_node{self._node_rank}') | ||
| os.makedirs(trace_dir, exist_ok=True) |
There was a problem hiding this comment.
There are a few issues here:
- If
nsys_profile_startis set to -1 (the default), the current logic will calculatewait=0andactive=1, causing the Torch profiler to run for the first step even when it should be disabled. You should explicitly check if profiling is enabled (e.g.,start_step > 0). self.args.output_dircan beNone. If so,os.path.joinwill raise aTypeError. Consider providing a default directory (e.g.,'.').
if not self._should_profile() or self.start_step <= 0:
return
wait = self.start_step - 1
active = self.end_step - self.start_step + 1
if active <= 0:
return
base_dir = self.args.output_dir or '.'
trace_dir = os.path.join(base_dir, 'torch_profiler', f'rank{self._local_rank}_node{self._node_rank}')
os.makedirs(trace_dir, exist_ok=True)| callbacks: List[str] = field(default_factory=list) | ||
| nsys_profile_start: int = -1 # 1-based; 0 = disabled | ||
| nsys_profile_end: int = -1 | ||
| profiler_type: str = 'none' # nsys or torch |
There was a problem hiding this comment.
| if self.profiler_type != 'none': | ||
| cb = 'torch_profiler' if self.profiler_type == 'torch' else 'nsys' | ||
| self.callbacks.append(cb) |
There was a problem hiding this comment.
The current logic for adding profiler callbacks is a bit loose. If profiler_type is set to an unexpected string (other than 'none' or 'torch'), it will default to adding the 'nsys' callback. It's better to be explicit.
| if self.profiler_type != 'none': | |
| cb = 'torch_profiler' if self.profiler_type == 'torch' else 'nsys' | |
| self.callbacks.append(cb) | |
| if self.profiler_type == 'torch': | |
| self.callbacks.append('torch_profiler') | |
| elif self.profiler_type == 'nsys': | |
| self.callbacks.append('nsys') |
| self.start_step = getattr(self.args, 'nsys_profile_start', 5) | ||
| self.end_step = getattr(self.args, 'nsys_profile_end', 5) |
There was a problem hiding this comment.
The default values in getattr (5) are inconsistent with the defaults defined in MegatronArguments (-1) and those used in NsysCallback (-1). While MegatronArguments provides a default, it's safer and clearer to use the same fallback value here.
| self.start_step = getattr(self.args, 'nsys_profile_start', 5) | |
| self.end_step = getattr(self.args, 'nsys_profile_end', 5) | |
| self.start_step = getattr(self.args, 'nsys_profile_start', -1) | |
| self.end_step = getattr(self.args, 'nsys_profile_end', -1) |
PR type
PR information
add step-based nsys/torch-profiler collection control
Experiment results