Commit 1ba8f43
committed
feat(grpo-sync): equivalency fixes + content via TQ object column
Brings the TQ-mediated GRPO trainer (grpo_train_sync) into parity with
the legacy grpo_train path:
* Fix DS print KeyError on repeated_batch['total_reward'] — use the
cumulative unfiltered_rewards tracker instead.
* Wrap scale_rewards/shaping/overlong/baseline-std in
timer.time("reward_calculation") to match legacy timing dashboards.
* Warn when calculate_advantages_on_gpu is set under TQ (no-op since
the slice is CPU-side).
* Add per-step generation-side metric hooks
(snapshot_step_metrics on the first DS iter, clear_logger_metrics
before each rollout) inside SyncRolloutActor.rollout_to_tq via a
new first_iter kwarg.
* Plumb GDPO reward components through the rollout slice so
GDPOAdvantageEstimator and scale_rewards see them.
* Plumb assistant text content through TQ as an object column (verl-
style np.ndarray(dtype=object) → pack_object_array → uint8 jagged
nested tensor); driver fetches it pre-kv_clear alongside input_ids
via read_columns and writes it into train_data_step{N}.jsonl.
Also refactors _apply_dynamic_sampling to use BatchedDataDict's
select_indices / from_batches / slice methods rather than open-coded
helpers — slice_data is now a BatchedDataDict end-to-end.
Verified: tests/data_plane/unit/ 102 passed / 1 xfailed (Slurm 11653849);
GRPO 1B mcore + DS + TQ on simple backend 5/5 steps (Slurm 11653848);
on mooncake_cpu 5/5 steps with raised mooncake defaults (Slurm 11654191).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
feat(data-plane): make mooncake segment/buffer Hydra-overridable, raise defaults
global_segment_size and local_buffer_size were hardcoded at 128 GiB /
16 GiB. Multi-iter DAPO with large message_log object payloads
exhausts mooncake_cpu's internal allocator headroom at those sizes,
manifesting as RuntimeError: batch_get_tensor returned None for
'<idx>@input_ids' partway through training (verified failure JOBID
11653282 on the 1n8g GRPO 1B + DS + TQ + mooncake_cpu recipe).
Both knobs now read from cfg.get(...), defaults raised to 512 GiB and
64 GiB respectively. Override per-recipe via
+data_plane.global_segment_size=<bytes> /
+data_plane.local_buffer_size=<bytes>. Lazy mmap, so RSS stays bounded
by actual traffic.
Verified at the new defaults: 1n8g GRPO 1B + DS + TQ + mooncake_cpu
runs 5/5 steps (JOBID 11654191).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
test(data-plane): add object×backend coverage and mooncake load repro
Closes the gap that hid the mooncake_cpu under-sized-segment failure:
previously the codec round-trip (test_codec_object.py) tested
pack_object_array in-process, and the smoke round-trip
(test_smoke_round_trip_backends) ran tensor-only fields against both
backends. Object fields × mooncake_cpu was untested.
* tests/data_plane/functional/test_tq_lifecycle.py:
- test_object_round_trip_backends: np.ndarray(dtype=object) put →
get → decode equality, parametrized over simple + mooncake_cpu.
- test_object_and_tensor_mixed_round_trip_backends: mixed schema
(tensor + object on the same partition) — regression guard for
co-fetch tensor/object decode in a single read_columns call.
* research/mooncake_object_repro.{py,sbatch}: standalone Slurm-runnable
reproducer that hammers a backend with object-heavy puts/gets in
isolation (no rollout, no policy). Two modes: --mode=load (N iters
× M object fields, fresh partition per iter) and --mode=schema
(single put, mixed tensor + object). Lets us narrow future
storage-layer failures to a tiny artifact for upstream triage.
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>1 parent 9606176 commit 1ba8f43
5 files changed
Lines changed: 286 additions & 74 deletions
File tree
- nemo_rl
- algorithms
- data_plane/adapters
- experience
- tests/data_plane
- functional
- unit
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
57 | 57 | | |
58 | 58 | | |
59 | 59 | | |
| 60 | + | |
60 | 61 | | |
61 | 62 | | |
62 | 63 | | |
| |||
81 | 82 | | |
82 | 83 | | |
83 | 84 | | |
84 | | - | |
| 85 | + | |
85 | 86 | | |
86 | 87 | | |
87 | 88 | | |
| |||
116 | 117 | | |
117 | 118 | | |
118 | 119 | | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
| 120 | + | |
123 | 121 | | |
124 | 122 | | |
125 | 123 | | |
126 | 124 | | |
127 | 125 | | |
128 | 126 | | |
129 | | - | |
130 | | - | |
131 | | - | |
132 | | - | |
133 | | - | |
| 127 | + | |
134 | 128 | | |
135 | 129 | | |
136 | 130 | | |
| |||
150 | 144 | | |
151 | 145 | | |
152 | 146 | | |
153 | | - | |
154 | | - | |
155 | | - | |
156 | | - | |
| 147 | + | |
157 | 148 | | |
158 | 149 | | |
159 | 150 | | |
| |||
255 | 246 | | |
256 | 247 | | |
257 | 248 | | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
258 | 260 | | |
259 | 261 | | |
260 | 262 | | |
| |||
312 | 314 | | |
313 | 315 | | |
314 | 316 | | |
315 | | - | |
| 317 | + | |
316 | 318 | | |
317 | 319 | | |
318 | 320 | | |
| |||
420 | 422 | | |
421 | 423 | | |
422 | 424 | | |
| 425 | + | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
423 | 430 | | |
424 | 431 | | |
425 | | - | |
| 432 | + | |
426 | 433 | | |
427 | 434 | | |
428 | 435 | | |
429 | 436 | | |
430 | 437 | | |
431 | 438 | | |
432 | 439 | | |
| 440 | + | |
433 | 441 | | |
434 | 442 | | |
| 443 | + | |
| 444 | + | |
435 | 445 | | |
436 | 446 | | |
437 | 447 | | |
| |||
450 | 460 | | |
451 | 461 | | |
452 | 462 | | |
453 | | - | |
454 | | - | |
455 | | - | |
456 | | - | |
457 | | - | |
458 | | - | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
459 | 466 | | |
460 | | - | |
461 | | - | |
462 | | - | |
463 | | - | |
464 | | - | |
465 | | - | |
466 | | - | |
467 | | - | |
468 | | - | |
469 | | - | |
470 | | - | |
471 | | - | |
| 467 | + | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
472 | 484 | | |
473 | | - | |
474 | 485 | | |
475 | 486 | | |
476 | 487 | | |
| |||
609 | 620 | | |
610 | 621 | | |
611 | 622 | | |
612 | | - | |
613 | | - | |
614 | | - | |
615 | | - | |
616 | | - | |
| 623 | + | |
| 624 | + | |
| 625 | + | |
| 626 | + | |
| 627 | + | |
| 628 | + | |
617 | 629 | | |
618 | 630 | | |
619 | 631 | | |
620 | 632 | | |
621 | 633 | | |
622 | 634 | | |
623 | 635 | | |
| 636 | + | |
| 637 | + | |
624 | 638 | | |
625 | 639 | | |
626 | 640 | | |
| |||
699 | 713 | | |
700 | 714 | | |
701 | 715 | | |
702 | | - | |
703 | | - | |
704 | | - | |
705 | | - | |
| 716 | + | |
| 717 | + | |
| 718 | + | |
| 719 | + | |
| 720 | + | |
| 721 | + | |
706 | 722 | | |
| 723 | + | |
707 | 724 | | |
708 | | - | |
709 | | - | |
| 725 | + | |
| 726 | + | |
| 727 | + | |
| 728 | + | |
| 729 | + | |
710 | 730 | | |
711 | | - | |
| 731 | + | |
| 732 | + | |
| 733 | + | |
712 | 734 | | |
713 | 735 | | |
714 | 736 | | |
| |||
784 | 806 | | |
785 | 807 | | |
786 | 808 | | |
| 809 | + | |
| 810 | + | |
| 811 | + | |
| 812 | + | |
| 813 | + | |
| 814 | + | |
| 815 | + | |
| 816 | + | |
| 817 | + | |
| 818 | + | |
787 | 819 | | |
788 | 820 | | |
789 | | - | |
790 | | - | |
791 | | - | |
792 | | - | |
793 | | - | |
794 | | - | |
795 | | - | |
796 | | - | |
797 | | - | |
798 | | - | |
799 | | - | |
| 821 | + | |
800 | 822 | | |
801 | 823 | | |
802 | 824 | | |
| |||
937 | 959 | | |
938 | 960 | | |
939 | 961 | | |
940 | | - | |
| 962 | + | |
| 963 | + | |
| 964 | + | |
| 965 | + | |
| 966 | + | |
| 967 | + | |
| 968 | + | |
941 | 969 | | |
942 | 970 | | |
943 | 971 | | |
| |||
950 | 978 | | |
951 | 979 | | |
952 | 980 | | |
953 | | - | |
954 | | - | |
955 | | - | |
956 | | - | |
| 981 | + | |
| 982 | + | |
| 983 | + | |
| 984 | + | |
957 | 985 | | |
958 | 986 | | |
959 | 987 | | |
| |||
1005 | 1033 | | |
1006 | 1034 | | |
1007 | 1035 | | |
1008 | | - | |
1009 | | - | |
1010 | | - | |
| 1036 | + | |
1011 | 1037 | | |
1012 | 1038 | | |
1013 | 1039 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
243 | 243 | | |
244 | 244 | | |
245 | 245 | | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
246 | 254 | | |
247 | 255 | | |
248 | 256 | | |
249 | 257 | | |
250 | 258 | | |
251 | | - | |
252 | | - | |
253 | | - | |
254 | | - | |
255 | | - | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
256 | 265 | | |
257 | 266 | | |
258 | 267 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
176 | 176 | | |
177 | 177 | | |
178 | 178 | | |
| 179 | + | |
179 | 180 | | |
180 | 181 | | |
181 | 182 | | |
| |||
192 | 193 | | |
193 | 194 | | |
194 | 195 | | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
195 | 206 | | |
196 | 207 | | |
197 | 208 | | |
198 | 209 | | |
199 | 210 | | |
200 | 211 | | |
201 | 212 | | |
| 213 | + | |
202 | 214 | | |
203 | 215 | | |
204 | 216 | | |
205 | 217 | | |
206 | 218 | | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
| 229 | + | |
207 | 230 | | |
208 | 231 | | |
209 | 232 | | |
| |||
268 | 291 | | |
269 | 292 | | |
270 | 293 | | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
271 | 299 | | |
272 | 300 | | |
273 | 301 | | |
| |||
302 | 330 | | |
303 | 331 | | |
304 | 332 | | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
305 | 339 | | |
306 | 340 | | |
307 | 341 | | |
| |||
0 commit comments