prefetch weights while waiting for pending requests to complete #728
@JenniferWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91092833.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #728 +/- ##
==========================================
- Coverage 78.33% 68.73% -9.61%
==========================================
Files 36 42 +6
Lines 4209 4455 +246
==========================================
- Hits 3297 3062 -235
- Misses 912 1393 +481
Summary: Feature parity with v0: allow prefetching weights while waiting for the pending requests to finish.

## Test Plan

Introduced a benchmark that simulates the on-going requests with actual weight sync logic.

Reference Group (V0)

```
================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: False
--------------------------------------------------------------------------------
Metric                         Time (s)     Throughput (GB/s)
--------------------------------------------------------------------------------
Avg push_weights               5.102 s      2.99 GB/s
Avg update_weights             43.738 s     0.35 GB/s
Avg total (push + update)      48.840 s
================================================================================
================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: True
Fetcher procs: 8
--------------------------------------------------------------------------------
Metric                         Time (s)     Throughput (GB/s)
--------------------------------------------------------------------------------
Avg push_weights               5.208 s      2.93 GB/s
Avg update_weights             29.602 s     0.52 GB/s
Avg total (push + update)      34.810 s
================================================================================
```

Test Group (V1)

```
================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: False
--------------------------------------------------------------------------------
Metric                         Time (s)     Throughput (GB/s)
--------------------------------------------------------------------------------
Avg push_weights               5.070 s      3.01 GB/s
Avg update_weights             39.974 s     0.38 GB/s
Avg total (push + update)      45.044 s
================================================================================
================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: True
Fetcher procs: 8
--------------------------------------------------------------------------------
Metric                         Time (s)     Throughput (GB/s)
--------------------------------------------------------------------------------
Avg push_weights               5.055 s      3.02 GB/s
Avg update_weights             28.730 s     0.53 GB/s
Avg total (push + update)      33.784 s
================================================================================
```

## Next Steps

[-] implement the prefetch logic & shared memory
[-] Add metric similar to generator v0
[ ] Perf/Throughput testing compared to generator v0

Differential Revision: D91092833
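For readers skimming the summary, here is a minimal sketch of the overlap idea it describes: start fetching the new weights while in-flight requests are still draining, and only apply them once nothing is using the old ones. This is not the code in this PR. The names `drain_pending`, `prefetch_weights`, `update_weights`, `fetch`, and `apply` are hypothetical, and the sketch uses plain `asyncio` rather than the actual weight sync or shared-memory logic.

```python
import asyncio
from typing import Awaitable, Callable


async def drain_pending(pending: list[asyncio.Task]) -> None:
    """Wait until every in-flight request has finished."""
    if pending:
        await asyncio.gather(*pending)


async def prefetch_weights(
    fetch: Callable[[str], Awaitable[bytes]], keys: list[str]
) -> dict[str, bytes]:
    """Fetch all weight entries concurrently (hypothetical `fetch` callable)."""
    values = await asyncio.gather(*(fetch(k) for k in keys))
    return dict(zip(keys, values))


async def update_weights(
    pending: list[asyncio.Task],
    fetch: Callable[[str], Awaitable[bytes]],
    keys: list[str],
    apply: Callable[[dict[str, bytes]], None],
) -> dict[str, bytes]:
    # Start the prefetch first so it runs while requests are still draining.
    prefetch_task = asyncio.create_task(prefetch_weights(fetch, keys))
    await drain_pending(pending)
    staged = await prefetch_task
    # Swap in the new weights only after no request depends on the old ones.
    apply(staged)
    return staged
```

Without the prefetch, the fetch would start only after the drain finishes, so the two waits add up. With it, the total wait is closer to the longer of the two, which is consistent with the drop in `update_weights` time in the results above.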
0c99d56 to 969b8ab
joecummings left a comment:
I can't tell if this is supposed to be a stacked diff? Contains more than just prefetch information.
"""TitanTrainer with weight modification capabilities for benchmarking."""

@endpoint
async def modify_weights(self, scale: float = 1.1):
super nit: do we need to parameterize this? Can't we just 1) assume it's a floating point and 2) arbitrarily add or scale by X?
I'll simplify this. The rest of the test is quite essential.
logger.info(
    "[ForgeMonarchExecutor] Deserializing TorchStore Controller from environment..."
)
self.torchstore_controller = cloudpickle.loads(
model: str
iterations: int
prefetch_enabled: bool
n_fetcher_procs: int
Can we test how this parameter affects throughput?
n_fetcher_procs -- setting this too high will slightly degrade the overall time.
prefetch_enabled -- this is the major toggle.
Right, but is there a graph of where "too high" is? I imagine that too low will also not be optimal.
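One way to answer the "where is too high" question would be a small sweep over `n_fetcher_procs`. The sketch below is an assumption about how such a sweep could look, not part of this PR's benchmark: `run_once` is a hypothetical callable that runs one push + update cycle with the given number of fetcher procs and returns the total time in seconds, and `baseline_s` is the average total with prefetch disabled.

```python
from statistics import mean
from typing import Callable, Iterable


def sweep_fetcher_procs(
    run_once: Callable[[int], float],
    candidates: Iterable[int],
    baseline_s: float,
    iterations: int = 3,
) -> list[dict]:
    """Time `run_once(n)` for each candidate and compare with a no-prefetch baseline."""
    rows = []
    for n in candidates:
        avg_s = mean(run_once(n) for _ in range(iterations))
        rows.append(
            {
                "n_fetcher_procs": n,
                "avg_total_s": avg_s,
                # Fraction of the baseline time saved; negative means slower.
                "reduction": (baseline_s - avg_s) / baseline_s,
            }
        )
    return rows


def best(rows: list[dict]) -> dict:
    """Return the row with the lowest average total time."""
    return min(rows, key=lambda r: r["avg_total_s"])
```

Plotting `avg_total_s` against `n_fetcher_procs` from these rows would give the graph asked for. For reference, the only data point in the PR description is 8 fetcher procs: in the V1 group the average total goes from 45.044 s without prefetch to 33.784 s with it, a reduction of roughly 25%.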
969b8ab to bfdb049
bfdb049 to 3f869a3
3f869a3 to 3a769bb
3a769bb to 327d32a
6f9290d to f5fe7d6
allenwang28 left a comment:
Review automatically exported from Phabricator review in Meta.