[doc] feat: Add fine-grained profiling tutorial for FSDP and Megatron on Ascend (verl-project#4610)
### What does this PR do?
This PR introduces a comprehensive guide for fine-grained performance
profiling on Ascend devices, supporting both FSDP and Megatron backends.
To address the massive data volume produced by full profiling in
large-scale training, the tutorial describes a "Critical Path Sampling"
strategy.
Key Highlights:
- **Rollout Stage:** Detailed instructions for profiling vLLM and SGLang
inference engines using `torch_npu.profiler` schedules.
- **Training Stage:** Code instrumentation examples for the `compute_log_prob`
and `update_policy` phases to capture specific micro-batches or
mini-batches.
- **Backend Specifics:** Differentiated guidance for FSDP (micro-batch
level control) and Megatron (mini-batch level control).
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
---------
Co-authored-by: Zhen <295632982@qq.com>
This is a tutorial for data collection using the GRPO or DAPO algorithm
based on FSDP or MindSpeed(Megatron) on Ascend devices.
@@ -11,8 +11,8 @@ Configuration
Leverage two levels of configuration to control data collection:

- 1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
- 2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.
+ - **Global profiler control**: Use parameters in ``verl/trainer/config/ppo_trainer.yaml`` (FSDP) or ``verl/trainer/config/ppo_megatron_trainer.yaml`` (MindSpeed) to control the collection mode and steps.
+ - **Role profile control**: Use parameters in each role's ``profile`` field to control various parameters.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -83,14 +83,16 @@ End-to-End collection

  global_profiler:
    steps: [1, 2, 5]
+   save_path: ./outputs/profile
  actor_rollout_ref:
-   actor:
+   actor:  # Set actor role profiler collection configuration parameters
      profiler:
        enable: True
        all_ranks: True
        tool_config:
          npu:
            discrete: False
+           contents: [npu, cpu]  # Control collection list, default cpu, npu, can configure memory, shapes, module, etc.
    # rollout & ref follow actor settings

@@ -101,6 +103,7 @@ Discrete Mode Collection

  global_profiler:
    steps: [1, 2, 5]
+   save_path: ./outputs/profile
  actor_rollout_ref:
    actor:
      profiler:
@@ -109,6 +112,7 @@ Discrete Mode Collection
        tool_config:
          npu:
            discrete: True
+           contents: [npu, cpu]  # Control collection list, default cpu, npu, can configure memory, shapes, module, etc.
    # rollout & ref follow actor settings

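The YAML fields in the two hunks above can also be expressed as dotted command-line overrides. The sketch below assumes the trainer is launched with Hydra-style `key=value` overrides; the helper names `format_value` and `to_overrides` are hypothetical and not part of verl.

```python
def format_value(value):
    """Render a Python value the way a Hydra-style override would spell it."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before other scalar types
        return "True" if value else "False"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(format_value(v) for v in value) + "]"
    return str(value)


def to_overrides(cfg, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={format_value(value)}")
    return overrides


# Mirrors the discrete-mode YAML shown in the diff.
profile_cfg = {
    "global_profiler": {"steps": [1, 2, 5], "save_path": "./outputs/profile"},
    "actor_rollout_ref": {
        "actor": {
            "profiler": {
                "enable": True,
                "all_ranks": True,
                "tool_config": {"npu": {"discrete": True, "contents": ["npu", "cpu"]}},
            }
        }
    },
}

overrides = to_overrides(profile_cfg)
for item in overrides:
    print(item)
```

List-valued overrides such as `global_profiler.steps=[1,2,5]` usually need quoting in a shell so the brackets are not interpreted.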
@@ -131,4 +135,241 @@ If the analysis parameter is set to False, offline parsing is required after dat

  import torch_npu
  # Set profiler_path to the parent directory of the "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder
+ Although the configuration-based collection method described above is convenient, it runs into problems in training scenarios with **long sequences (long context)** or **large global batch sizes**. Within a single training step, model computation is highly repetitive:
+
+ 1. **Rollout phase**: Sequence generation is an autoregressive process involving thousands of forward passes of the decoder model.
+ 2. **Training phase**: To control peak memory usage, verl typically adopts a micro-batch strategy, splitting a large batch into multiple micro-batches for computation.
+
+    - **compute_log_prob (Actor/Ref)**: Involves multiple rounds of forward-only computation.
+    - **update_policy (Actor/Critic)**: Involves multiple rounds of forward and backward propagation.
+
+ As a result, full profiling produces a massive number of repetitive operator records, as shown in the image below:
+
+ Even with ``discrete`` mode enabled, performance data files for a single stage can still reach several TB, leading to **parsing failures** or **visualization tool lag**.
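To make the repetition concrete, the following sketch counts how many near-identical passes one training step contains. It is a simplified single-rank estimate: the parameter names and the numbers are hypothetical illustrations, not verl configuration keys, and it ignores data-parallel sharding and multiple PPO epochs.

```python
import math


def passes_per_step(batch_size, mini_batch_size, micro_batch_size, response_length):
    """Estimate repeated passes in one step (simplified, single rank)."""
    # Rollout: at most one decoder forward pass per generated token.
    decode_forwards = response_length
    # compute_log_prob: one forward-only pass per micro-batch.
    log_prob_forwards = math.ceil(batch_size / micro_batch_size)
    # update_policy: each mini-batch is split into micro-batches,
    # and every micro-batch runs one forward and one backward pass.
    mini_batches = math.ceil(batch_size / mini_batch_size)
    micro_per_mini = math.ceil(mini_batch_size / micro_batch_size)
    return {
        "decode_forwards": decode_forwards,
        "log_prob_forwards": log_prob_forwards,
        "update_fwd_bwd": mini_batches * micro_per_mini,
    }


# Hypothetical sizes chosen only for illustration.
counts = passes_per_step(
    batch_size=512, mini_batch_size=128, micro_batch_size=4, response_length=4096
)
print(counts)
```

Under these assumed sizes, a full profile records thousands of decode passes and over a hundred training passes that differ little from one another, which is why sampling one representative pass is usually enough.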
+
+ Solution: Critical Path Sampling
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ To address these problems, adopt a **critical path sampling** strategy: using the API provided by `torch_npu.profiler <https://www.hiascend.com/document/detail/zh/canncommercial/80RC2/devaids/auxiliarydevtool/atlasprofiling_16_0038.html>`_, instrument the Python source code directly so that only representative data segments are collected (such as specific decode steps or the first micro-batch).
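The idea of collecting only a representative segment can be sketched as a guard around the loop body. This is a minimal illustration, not the tutorial's instrumentation: `RecordingProfiler` is a hypothetical stand-in so the snippet runs without Ascend hardware, and in a real run the factory would be expected to construct a `torch_npu.profiler.profile(...)` context instead.

```python
from contextlib import nullcontext


class RecordingProfiler:
    """Hypothetical stand-in for a real profiler context manager."""

    def __init__(self, log, tag):
        self.log = log
        self.tag = tag

    def __enter__(self):
        self.log.append(("start", self.tag))
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append(("stop", self.tag))
        return False  # never swallow exceptions from the profiled code


def maybe_profile(index, targets, make_profiler):
    """Return a profiler context for selected indices, otherwise a no-op."""
    return make_profiler(index) if index in targets else nullcontext()


log = []
processed = []
micro_batches = ["mb0", "mb1", "mb2", "mb3"]

for i, micro_batch in enumerate(micro_batches):
    # Only the first micro-batch (index 0) is profiled.
    with maybe_profile(i, {0}, lambda idx: RecordingProfiler(log, idx)):
        processed.append(micro_batch)  # placeholder for forward/backward

print(log)
```

Every micro-batch is still processed; only the selected one is wrapped in a profiler, so the trace stays small while training behaviour is unchanged.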
+
+ **Important Notes**
+
+ 1. This chapter involves direct source code modification. Back up the files before modifying them and restore them after debugging.
+ 2. When using code instrumentation for collection, be sure to **disable global collection** (``global_profiler: steps: null``) in ``ppo_trainer.yaml`` or ``ppo_megatron_trainer.yaml`` to avoid profiler conflicts.
+
+ 1. Fine-grained Collection in Rollout Phase
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ For vLLM or SGLang inference engines, set the ``schedule`` parameter to collect forward-pass performance data for specific decode steps (tokens).
+ # Skip first step, warmup one step, collect 3 steps, repeat 1 time. To collect a later window instead, e.g. decode steps 31~60, set schedule=torch_npu.profiler.schedule(wait=29, warmup=1, active=30, repeat=1); increase active to cover a longer range.
+ on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./outputs/vllm_profile", analyse_flag=True) # Data save path and whether to parse online
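To check which decode steps a given schedule records, the wait/warmup/active cycle can be simulated in plain Python. The sketch assumes `torch_npu.profiler.schedule` follows the same cycle semantics as `torch.profiler.schedule`, and that "skip one, warm up one, collect three" corresponds to `wait=1, warmup=1, active=3`. The function `recorded_steps` is a hypothetical helper and does not import torch_npu.

```python
def recorded_steps(wait, warmup, active, repeat=0, skip_first=0, total_steps=200):
    """Return zero-based step indices that fall inside an 'active' window.

    Assumption: each cycle runs wait -> warmup -> active, and repeat > 0
    limits the number of cycles (repeat=0 means cycle until total_steps).
    """
    cycle = wait + warmup + active
    recorded = []
    for step in range(skip_first, total_steps):
        pos = step - skip_first
        if repeat > 0 and pos // cycle >= repeat:
            break  # all requested cycles are finished
        if pos % cycle >= wait + warmup:
            recorded.append(step)
    return recorded


short = recorded_steps(wait=1, warmup=1, active=3, repeat=1)
late = recorded_steps(wait=29, warmup=1, active=30, repeat=1)
print(short)
print(late[0], late[-1], len(late))
```

With `wait=29, warmup=1, active=30`, the recorded window is 30 steps long and covers zero-based indices 30 through 59, that is, the 31st through 60th calls to the profiler's step function.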
+ The Micro-Batch scheduling in the Megatron backend is managed internally by the framework and does not currently support fine-grained collection at the Micro-Batch level through simple code instrumentation. It is recommended to use global configuration for collection.
+
3. Fine-grained Collection in update_policy (Actor & Critic) Phase