Commit d82d39d

[doc] feat: Add fine-grained profiling tutorial for FSDP and Megatron on Ascend (verl-project#4610)
### What does this PR do?

This PR introduces a comprehensive guide for fine-grained performance profiling on Ascend devices, supporting both FSDP and Megatron backends. To address the challenge of massive data volume during full profiling in large-scale training, this tutorial implements a "Key Path Sampling" strategy.

Key Highlights:

- **Rollout Stage:** Detailed instructions for profiling vLLM and SGLang inference engines using torch_npu.profiler schedules.
- **Training Stage:** Code instrumentation examples for compute_log_prob and update_policy phases to capture specific micro-batches or mini-batches.
- **Backend Specifics:** Differentiated guidance for FSDP (Micro-Batch level control) and Megatron (Mini-Batch level control).

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: Zhen <295632982@qq.com>
1 parent f6ee083 commit d82d39d

2 files changed

Lines changed: 496 additions & 11 deletions

docs/ascend_tutorial/ascend_profiling_en.rst

Lines changed: 246 additions & 5 deletions
@@ -1,7 +1,7 @@
11
Performance data collection based on FSDP or MindSpeed(Megatron) on Ascend devices(en)
22
==========================================================================================
33

4-
Last updated: 08/14/2025.
4+
Last updated: 12/20/2025.
55

66
This is a tutorial for data collection using the GRPO or DAPO algorithm
77
based on FSDP or MindSpeed(Megatron) on Ascend devices.
@@ -11,8 +11,8 @@ Configuration
1111

1212
Leverage two levels of configuration to control data collection:
1313

14-
1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
15-
2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.
14+
- **Global profiler control**: Use parameters in ``verl/trainer/config/ppo_trainer.yaml`` (FSDP) or ``verl/trainer/config/ppo_megatron_trainer.yaml`` (MindSpeed) to control the collection mode and steps.
15+
- **Role profile control**: Use parameters in each role's ``profiler`` field to control that role's collection settings (for example, which ranks and which contents to collect).
1616

1717
Global collection control
1818
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -83,14 +83,16 @@ End-to-End collection
8383
8484
global_profiler:
8585
steps: [1, 2, 5]
86+
save_path: ./outputs/profile
8687
actor_rollout_ref:
87-
actor:
88+
actor: # Set actor role profiler collection configuration parameters
8889
profiler:
8990
enable: True
9091
all_ranks: True
9192
tool_config:
9293
npu:
9394
discrete: False
95+
contents: [npu, cpu] # What to collect. Defaults to cpu and npu; memory, shapes, module, etc. can also be listed.
9496
# rollout & ref follow actor settings
9597
9698
@@ -101,6 +103,7 @@ Discrete Mode Collection
101103
102104
global_profiler:
103105
steps: [1, 2, 5]
106+
save_path: ./outputs/profile
104107
actor_rollout_ref:
105108
actor:
106109
profiler:
@@ -109,6 +112,7 @@ Discrete Mode Collection
109112
tool_config:
110113
npu:
111114
discrete: True
115+
contents: [npu, cpu] # What to collect. Defaults to cpu and npu; memory, shapes, module, etc. can also be listed.
112116
# rollout & ref follow actor settings
113117
114118
@@ -131,4 +135,241 @@ If the analysis parameter is set to False, offline parsing is required after dat
131135
132136
import torch_npu
133137
# Set profiler_path to the parent directory of the "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder
134-
torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
138+
torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
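When a run leaves several collection folders behind, the parent directories can be discovered before parsing. The following is a minimal sketch, assuming the ``*_ascend_pt`` folder naming described in the comment above; ``find_profiler_paths`` and ``analyse_all`` are hypothetical helper names, not part of verl or torch_npu.

```python
from pathlib import Path


def find_profiler_paths(root: str) -> list[str]:
    """Return the parent directory of every ``*_ascend_pt`` folder under root.

    Each returned path is a candidate ``profiler_path`` for offline parsing.
    """
    parents = {p.parent for p in Path(root).rglob("*_ascend_pt") if p.is_dir()}
    return sorted(str(p) for p in parents)


def analyse_all(root: str) -> None:
    """Run offline parsing once per discovered parent directory."""
    # Imported lazily so that the discovery helper also works on hosts without an NPU.
    import torch_npu

    for profiler_path in find_profiler_paths(root):
        torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
```

For example, ``analyse_all("./outputs/profile")`` would parse everything under the ``save_path`` used in the configurations above.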
139+
140+
141+
Advanced Guide: Fine-grained Collection
142+
---------------------------------------
143+
144+
Background and Challenges
145+
~~~~~~~~~~~~~~~~~~~~~~~~~
146+
147+
Although the configuration-based collection described above is convenient, it struggles in training scenarios with **long sequences** (long context) or **large global batch sizes**. Within a single training step, model computation is highly repetitive:
148+
149+
1. **Rollout phase**: Sequence generation (Generate Sequence) is an autoregressive process involving thousands of forward computations of the Decoder model.
150+
2. **Training phase**: To control peak memory usage, verl typically adopts a Micro-Batch strategy, dividing large data streams into multiple micro-batches for computation.
151+
152+
- **compute_log_prob (Actor/Ref)**: Involves multiple rounds of pure forward propagation.
153+
- **update_policy (Actor/Critic)**: Involves multiple rounds of forward and backward propagation.
154+
155+
As a result, full profiling produces a massive number of repetitive operator records, as shown in the image below:
156+
157+
.. image:: https://raw.githubusercontent.com/mengchengTang/verl-data/master/verl_ascend_profiler.png
158+
159+
Even with ``discrete`` mode enabled, performance data files for a single stage can still reach several TB, leading to **parsing failures** or **visualization tool lag**.
160+
161+
Solution: Critical Path Sampling
162+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
163+
164+
To solve these problems, we can adopt a **critical path sampling** strategy: using the API provided by `torch_npu.profiler <https://www.hiascend.com/document/detail/zh/canncommercial/80RC2/devaids/auxiliarydevtool/atlasprofiling_16_0038.html>`_, modify the Python source code directly so that only representative data segments are collected (such as specific decode steps or the first Micro-Batch).
165+
166+
**Important Notes**
167+
168+
1. This chapter involves direct source code modification. It is recommended to back up files before modification and restore them after debugging.
169+
2. When using code instrumentation for collection, be sure to **disable global collection** (``global_profiler: steps: null``) in ``ppo_trainer.yaml`` or ``ppo_megatron_trainer.yaml`` to avoid Profiler conflicts.
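The second note can be checked programmatically before an instrumented run. This is a minimal sketch, assuming the trainer configuration has already been loaded into a plain nested ``dict``; ``global_profiling_disabled`` is a hypothetical helper, and treating a missing ``global_profiler`` section as disabled is an assumption of this sketch.

```python
def global_profiling_disabled(config: dict) -> bool:
    """Return True when ``global_profiler.steps`` is null (None) or absent."""
    section = config.get("global_profiler") or {}
    return section.get("steps") is None


# Configuration that is safe to combine with code instrumentation.
print(global_profiling_disabled({"global_profiler": {"steps": None}}))       # True
# Configuration from the examples above; global collection is still on.
print(global_profiling_disabled({"global_profiler": {"steps": [1, 2, 5]}}))  # False
```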
170+
171+
1. Fine-grained Collection in Rollout Phase
172+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
173+
174+
For the vLLM or SGLang inference engines, the ``schedule`` parameter controls which forward passes (decode steps) are profiled, so performance data can be collected for specific tokens only.
175+
176+
**vLLM Engine**
177+
178+
- **Reference Version**: vLLM v0.11.0, vLLM-Ascend v0.11.0rc1
179+
- **Modified File**: ``vllm-ascend/vllm_ascend/worker/worker_v1.py``
180+
181+
.. code-block:: diff
182+
183+
class NPUWorker(WorkerBase):
184+
185+
def __init__(self, *args, **kwargs):
186+
# ... existing code ...
187+
188+
+ # Initialize profiler
189+
+ import torch_npu
190+
+ experimental_config = torch_npu.profiler._ExperimentalConfig(
191+
+ profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
192+
+ export_type=torch_npu.profiler.ExportType.Db, # You can choose torch_npu.profiler.ExportType.Text format
193+
+ )
194+
+ self.profiler_npu = torch_npu.profiler.profile(
195+
+ activities=[torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU],
196+
+ with_modules=False, # Collect call stack
197+
+ profile_memory=False, # Collect memory
198+
+ experimental_config=experimental_config,
199+
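+ # schedule(wait=1, warmup=1, active=3, repeat=1) skips the first step, warms up for one step, collects 3 steps, and runs this cycle once. To collect later decode steps instead, use e.g. schedule(wait=29, warmup=1, active=30, repeat=1), which skips 29 steps, warms up for 1, and collects the next 30 (steps 31~60, counting from 1).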
200+
+ schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
201+
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./outputs/vllm_profile", analyse_flag=True) # Data save path and whether to parse online
202+
+ )
203+
+ self.profiler_npu.start()
204+
205+
# ... existing code ...
206+
207+
def execute_model(self, scheduler_output=None, intermediate_tensors=None, **kwargs):
208+
# ... existing code ...
209+
output = self.model_runner.execute_model(scheduler_output,
210+
intermediate_tensors)
211+
212+
+ self.profiler_npu.step() # Drive schedule to collect partial decode steps
213+
214+
# ... existing code ...
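To see which steps a given ``schedule`` records, the cycle can be modelled in plain Python. This is a simplified sketch of ``torch.profiler``-style schedule semantics, on the assumption that ``torch_npu.profiler.schedule`` behaves the same way; ``schedule_action`` and ``recorded_steps`` are hypothetical names, steps are counted from 0, and the "record and save" state of the last active step is folded into ``RECORD``.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARMUP = "warmup"
    RECORD = "record"


def schedule_action(step: int, wait: int, warmup: int, active: int, repeat: int = 0) -> Action:
    """Return what the profiler does during the given step (0-indexed)."""
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return Action.NONE  # all requested cycles have finished
    pos = step % cycle
    if pos < wait:
        return Action.NONE
    if pos < wait + warmup:
        return Action.WARMUP
    return Action.RECORD


def recorded_steps(total_steps: int, **schedule) -> list[int]:
    """List the 0-indexed steps that fall inside an active window."""
    return [s for s in range(total_steps) if schedule_action(s, **schedule) is Action.RECORD]


print(recorded_steps(10, wait=1, warmup=1, active=3, repeat=1))  # [2, 3, 4]
```

Under the same model, ``wait=29, warmup=1, active=30, repeat=1`` records steps 30 through 59 when counting from 0, that is, the 31st through 60th calls.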
215+
216+
**SGLang Engine**
217+
218+
- **Reference Version**: SGLang master branch
219+
- **Modified File**: ``sglang/python/sglang/srt/model_executor/model_runner.py``
220+
221+
.. code-block:: diff
222+
223+
# ... existing imports ...
224+
+ import torch_npu
225+
226+
class ModelRunner:
227+
228+
def __init__(self, *args, **kwargs):
229+
# ... existing init code ...
230+
231+
+ # Initialize profiler (same configuration as above, omitted)
232+
+ experimental_config = torch_npu.profiler._ExperimentalConfig(...)
233+
+ self.profiler_npu = torch_npu.profiler.profile(
234+
+ # ...
235+
+ # Skip first step, warmup one step, collect 3 steps, repeat 1 time.
236+
+ schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
237+
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./outputs/sglang_profile", analyse_flag=True)
238+
+ )
239+
+ self.profiler_npu.start()
240+
241+
def forward(self, forward_batch, **kwargs):
242+
# ... existing code ...
243+
244+
+ self.profiler_npu.step() # Drive schedule to collect partial decode steps
245+
return output
246+
247+
2. Fine-grained Collection in compute_log_prob (Actor & Ref) Phase
248+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
249+
250+
This phase computes log-probabilities under the actor policy and the reference policy.
251+
252+
**FSDP Backend**
253+
254+
The FSDP backend allows fine-grained control at the Micro-Batch level.
255+
256+
- **Modified File**: ``verl/workers/actor/dp_actor.py``
257+
258+
.. code-block:: diff
259+
260+
# ... import dependencies ...
261+
+ import torch_npu
262+
263+
class DataParallelPPOActor(BasePPOActor):
264+
265+
def compute_log_prob(self, data: DataProto, calculate_entropy=False) -> torch.Tensor:
266+
267+
+ role = "Ref" if self.actor_optimizer is None else "Actor"
268+
+ # Prepare profiler (same configuration as above, omitted)
269+
+ experimental_config = torch_npu.profiler._ExperimentalConfig(...)
270+
+ self.prof_npu = torch_npu.profiler.profile(
271+
+ # ...
272+
+ # wait=0, warmup=0, active=1: directly collect first micro-batch
273+
+ schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1),
274+
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"./outputs/{role}_compute_log_prob", analyse_flag=True)
275+
+ )
276+
277+
278+
+ # compute_log_prob is shared by Ref and Actor; the role flag tells them apart. To profile the Actor instead, change the condition below to role == "Actor".
279+
+ if role=="Ref":
280+
+ self.prof_npu.start()
281+
282+
for micro_batch in micro_batches:
283+
284+
# ... original computation logic ...
285+
with torch.no_grad():
286+
entropy, log_probs = self._forward_micro_batch(...)
287+
288+
+ # Drive schedule to collect micro batch
289+
+ if role=="Ref":
290+
+ self.prof_npu.step()
291+
292+
# ...
293+
294+
295+
**Megatron Backend**
296+
297+
In the Megatron backend, Micro-Batch scheduling is managed internally by the framework, so fine-grained collection at the Micro-Batch level is not currently possible through simple code instrumentation. Use the global configuration for collection instead.
298+
299+
3. Fine-grained Collection in update_policy (Actor & Critic) Phase
300+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
301+
302+
The update_policy phase runs forward and backward propagation for each Micro-Batch, followed by an optimizer update for each Mini-Batch.
303+
304+
**FSDP Backend**
305+
306+
The FSDP backend supports collection at both Mini-Batch and Micro-Batch granularities.
307+
308+
- **Modified File**: ``verl/workers/actor/dp_actor.py``
309+
310+
.. code-block:: diff
311+
312+
# ... import dependencies ...
313+
+ import torch_npu
314+
315+
class DataParallelPPOActor(BasePPOActor):
316+
317+
def update_policy(self, data: DataProto):
318+
319+
+ # Prepare profiler (same configuration as above, omitted)
320+
+ experimental_config = torch_npu.profiler._ExperimentalConfig(...)
321+
+ self.prof_npu = torch_npu.profiler.profile(
322+
+ # ...
323+
+ # Only collect first Mini Batch (including all Micro-Batch computations and one optimizer update)
324+
+ schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1),
325+
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./outputs/fsdp_actor_update_profile", analyse_flag=True)
326+
+ )
327+
+ self.prof_npu.start()
328+
329+
# ... PPO Epochs loop ...
330+
for _ in range(self.config.ppo_epochs):
331+
# ... Mini Batch loop ...
332+
for batch_idx, mini_batch in enumerate(mini_batches):
333+
# ... mini_batches split ...
334+
335+
for i, micro_batch in enumerate(micro_batches):
336+
# ... Original Forward & Backward logic ...
337+
# ... loss.backward() ...
338+
pass
339+
340+
grad_norm = self._optimizer_step()
341+
342+
+ # Drive the schedule once per Mini-Batch. To collect at Micro-Batch granularity instead, move self.prof_npu.step() inside the micro_batch loop.
343+
+ self.prof_npu.step()
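The effect of where ``step()`` is placed can be illustrated with a small simulation. This is a sketch under the ``schedule(wait=0, warmup=0, active=1, repeat=1)`` setting used above; ``captured_work`` is a hypothetical helper that only counts loop iterations and does not call any profiler.

```python
def captured_work(num_mini: int, num_micro: int, step_at: str, active: int = 1) -> list[tuple]:
    """Return the work items that fall inside the first `active` profiler steps.

    step_at="mini" places the step() call after the optimizer update,
    step_at="micro" places it inside the micro-batch loop.
    """
    if step_at not in ("mini", "micro"):
        raise ValueError("step_at must be 'mini' or 'micro'")
    step_num, captured = 0, []
    for mini in range(num_mini):
        for micro in range(num_micro):
            if step_num < active:
                captured.append(("fwd_bwd", mini, micro))
            if step_at == "micro":
                step_num += 1
        if step_num < active:
            captured.append(("optimizer", mini))
        if step_at == "mini":
            step_num += 1
    return captured


print(captured_work(2, 3, step_at="mini"))
# [('fwd_bwd', 0, 0), ('fwd_bwd', 0, 1), ('fwd_bwd', 0, 2), ('optimizer', 0)]
print(captured_work(2, 3, step_at="micro"))
# [('fwd_bwd', 0, 0)]
```

With the call at Micro-Batch level and ``active=1``, the optimizer update falls outside the recorded window, so ``active`` would need to be raised if that update should also appear in the trace.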
344+
345+
346+
**Megatron Backend**
347+
348+
The Megatron backend supports collection at the Mini-Batch granularity.
349+
350+
- **Modified File**: ``verl/workers/actor/megatron_actor.py``
351+
352+
.. code-block:: diff
353+
354+
class MegatronPPOActor(BasePPOActor):
355+
356+
def update_policy(self, dataloader: Iterable[DataProto]) -> dict:
357+
# ...
358+
+ # Prepare profiler (same configuration as above, omitted)
359+
+ experimental_config = torch_npu.profiler._ExperimentalConfig(...)
360+
+ self.prof_npu = torch_npu.profiler.profile(
361+
+ # ...
362+
+ # Only collect computation of first Mini Batch (including all Micro-Batches) and one optimizer update
363+
+ schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1),
364+
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./outputs/megatron_actor_update_profile", analyse_flag=True)
365+
+ )
366+
+ self.prof_npu.start()
367+
368+
for data in dataloader:
369+
# ... internally calls self.forward_backward_batch for computation ...
370+
# ... metric_micro_batch = self.forward_backward_batch(...)
371+
372+
# ... self.actor_optimizer.step() ...
373+
374+
+ # Drive schedule to collect mini batch
375+
+ self.prof_npu.step()
