Commit 6a84d4e
fix(infra): correct staleness capacity inflation after recovery (#1345)
* fix(infra): correct staleness capacity inflation after checkpoint recovery
StalenessManager's accepted counter started at 0 while the version was
restored to a high value by the recovery path. This caused the capacity
formula to yield (max_staleness + recovered_version + 1) * batch_size
instead of the intended (max_staleness + 1) * batch_size, allowing a
burst of rollout submissions and unbounded staleness growth.
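
The inflation described above can be checked with a little arithmetic. This is a minimal sketch, not the project's code: the function name `staleness_capacity` and the sample values are hypothetical, and it assumes the capacity is the version-scaled budget minus the rollouts already counted (accepted plus running).

```python
def staleness_capacity(max_staleness, version, batch_size, accepted, running=0):
    """Rollouts that may still be submitted under the staleness bound.

    Assumed form: (max_staleness + version + 1) * batch_size - (accepted + running).
    """
    return (max_staleness + version + 1) * batch_size - (accepted + running)


# Hypothetical values chosen only for illustration.
max_staleness, batch_size, recovered_version = 2, 32, 100

# After recovery the version is restored but accepted is still 0.
inflated = staleness_capacity(max_staleness, recovered_version, batch_size, accepted=0)

# What the formula is meant to yield right after recovery.
intended = (max_staleness + 1) * batch_size

print(inflated, intended)  # 3296 96
```

With these numbers the manager would admit 3296 rollouts instead of 96, which is the burst the commit message refers to.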
Add on_version_recovered() to StalenessManager and call it from
rl_trainer after recovery completes. The trainer accesses the staleness
manager directly via the known concrete type (RolloutController in
single-controller mode, workflow_executor in SPMD mode).
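
One way such a hook could work is sketched below. Everything here is an assumption made for illustration: the `RolloutStat` fields, the constructor arguments, and in particular the choice to set `accepted` to `version * consumer_batch_size`. The actual diff is not shown on this page, so the real implementation may differ.

```python
import threading
from dataclasses import dataclass


@dataclass
class RolloutStat:
    accepted: int = 0  # rollouts accepted so far
    running: int = 0   # rollouts currently in flight


class StalenessManager:
    """Simplified stand-in for the real class (hypothetical layout)."""

    def __init__(self, max_staleness: int, consumer_batch_size: int):
        self.max_staleness = max_staleness
        self.consumer_batch_size = max(1, consumer_batch_size)
        self.stat = RolloutStat()
        self._lock = threading.Lock()

    def get_capacity(self, current_version: int) -> int:
        with self._lock:
            budget = (self.max_staleness + current_version + 1) * self.consumer_batch_size
            return budget - (self.stat.accepted + self.stat.running)

    def on_version_recovered(self, version: int) -> None:
        """Align the accepted counter with a restored version.

        Assumption: each completed version consumed one full batch, so
        accepted is set to version * consumer_batch_size. version == 0
        leaves the counter untouched.
        """
        if version <= 0:
            return
        with self._lock:
            self.stat.accepted = version * self.consumer_batch_size


manager = StalenessManager(max_staleness=2, consumer_batch_size=32)
manager.on_version_recovered(100)
print(manager.get_capacity(100))  # 96
```

Under this assumption, calling the hook while `running` is nonzero gives a capacity of `(max_staleness + 1) * batch_size - running`, so the in-flight rollouts are still counted against the budget.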
* fix(infra): clarify staleness recovery semantics and use public APIs
Address review feedback on the staleness manager recovery path:
- Document that on_version_recovered is expected to be called with
running == 0 and explain the bound when it is not.
- Reach the manager through the public staleness_manager properties
on RolloutController and WorkflowExecutor instead of the private
_staleness_manager attribute, avoiding coupling to internal layout.
- Extend tests with the version=0 no-op case and a parametrized case
with in-flight rollouts to verify accepted is set correctly.
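
The public-property access pattern from the second bullet might look roughly like this. It is a sketch with stub classes: `RecordingManager` and `notify_version_recovered` are hypothetical names, and the real `RolloutController` and `WorkflowExecutor` carry far more state than shown here.

```python
class RecordingManager:
    """Stub manager that only records the versions it is told about."""

    def __init__(self):
        self.recovered_versions = []

    def on_version_recovered(self, version: int) -> None:
        self.recovered_versions.append(version)


class RolloutController:
    """Stub owner; WorkflowExecutor would expose the same property."""

    def __init__(self, manager: RecordingManager):
        self._staleness_manager = manager  # internal layout, may change

    @property
    def staleness_manager(self) -> RecordingManager:
        # Public read-only accessor, so callers never touch the private field.
        return self._staleness_manager


def notify_version_recovered(owner, version: int) -> None:
    """Hypothetical trainer-side helper, called once recovery has finished."""
    owner.staleness_manager.on_version_recovered(version)


controller = RolloutController(RecordingManager())
notify_version_recovered(controller, 100)
print(controller.staleness_manager.recovered_versions)  # [100]
```

Because the helper depends only on the `staleness_manager` property, the same call works for any owner that exposes it, whichever mode the trainer runs in.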
---------
Co-authored-by: fenghui <dh183333@antgroup.com>
1 parent 9c2ec43 commit 6a84d4e
3 files changed
Lines changed: 75 additions & 0 deletions