prefetch weights while waiting for pending requests to complete by JenniferWang · Pull Request #728 · meta-pytorch/torchforge

JenniferWang · 2026-01-23T18:16:52Z

Summary:
Feature parity with v0: allow prefetching weights while waiting for the pending requests to finish.
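
As a rough illustration of the idea (not the code in this PR), the overlap can be sketched with `asyncio`: the weight fetch starts as a background task and runs while the pending requests drain. All names here (`prefetch_weights`, `drain_pending_requests`, the dict-based `store`) are hypothetical stand-ins, and the sleeps only simulate latency.

```python
import asyncio


async def drain_pending_requests(pending):
    """Wait for every in-flight request to finish."""
    await asyncio.gather(*pending)


async def prefetch_weights(store, version):
    """Hypothetical stand-in: stage the weights for `version` locally."""
    await asyncio.sleep(0.02)  # simulated fetch latency
    return dict(store[version])


async def update_weights(store, version, pending, prefetch_enabled=True):
    if prefetch_enabled:
        # Start fetching right away so it overlaps with the drain.
        fetch_task = asyncio.create_task(prefetch_weights(store, version))
        await drain_pending_requests(pending)
        return await fetch_task
    # Without prefetch, the fetch only starts after the drain completes.
    await drain_pending_requests(pending)
    return await prefetch_weights(store, version)


async def fake_request(delay):
    """Simulated in-flight generation request."""
    await asyncio.sleep(delay)


async def main():
    store = {1: {"layer.weight": [0.1, 0.2]}}  # made-up weights
    pending = [asyncio.create_task(fake_request(0.02)) for _ in range(3)]
    return await update_weights(store, 1, pending)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With `prefetch_enabled=True` the fetch and the drain run concurrently, so the wait is roughly the longer of the two rather than their sum.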

Test Plan

Introduced a benchmark that simulates ongoing requests while running the actual weight sync logic.

Reference Group (V0)

================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: False
--------------------------------------------------------------------------------
Metric                         Time (s)        Throughput (GB/s)   
--------------------------------------------------------------------------------
Avg push_weights                      5.102 s         2.99 GB/s
Avg update_weights                   43.738 s         0.35 GB/s
Avg total (push + update)            48.840 s
================================================================================

================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: True
Fetcher procs: 8
--------------------------------------------------------------------------------
Metric                         Time (s)        Throughput (GB/s)   
--------------------------------------------------------------------------------
Avg push_weights                      5.208 s         2.93 GB/s
Avg update_weights                   29.602 s         0.52 GB/s
Avg total (push + update)            34.810 s
================================================================================

Test Group (V1)

================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: False
--------------------------------------------------------------------------------
Metric                         Time (s)        Throughput (GB/s)   
--------------------------------------------------------------------------------
Avg push_weights                      5.070 s         3.01 GB/s
Avg update_weights                   39.974 s         0.38 GB/s
Avg total (push + update)            45.044 s
================================================================================

================================================================================
WEIGHT SYNC BENCHMARK RESULTS
================================================================================
Model: Qwen/Qwen3-8B
Model size: 15.26 GB
Iterations: 3
Prefetch enabled: True
Fetcher procs: 8
--------------------------------------------------------------------------------
Metric                         Time (s)        Throughput (GB/s)   
--------------------------------------------------------------------------------
Avg push_weights                      5.055 s         3.02 GB/s
Avg update_weights                   28.730 s         0.53 GB/s
Avg total (push + update)            33.784 s
================================================================================
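
For reading the four tables together, here is a small sketch that recomputes the throughput column and the effect of prefetching from the reported averages. The timings are copied from the tables above. The helper names are illustrative, and the totals are the sum of push and update time.

```python
MODEL_SIZE_GB = 15.26

# (avg push_weights seconds, avg update_weights seconds) from the tables above
RESULTS = {
    ("v0", False): (5.102, 43.738),
    ("v0", True): (5.208, 29.602),
    ("v1", False): (5.070, 39.974),
    ("v1", True): (5.055, 28.730),
}


def throughput_gbps(seconds):
    """Throughput as reported in the tables: model size divided by time."""
    return MODEL_SIZE_GB / seconds


def prefetch_speedup(version):
    """Total (push + update) time without prefetch divided by time with it."""
    off = sum(RESULTS[(version, False)])
    on = sum(RESULTS[(version, True)])
    return off / on


if __name__ == "__main__":
    for version in ("v0", "v1"):
        print(version, f"{prefetch_speedup(version):.2f}x faster with prefetch")
```

On these numbers, prefetching shortens the total by about 1.40x for V0 and about 1.33x for V1, and almost all of the gain comes from `update_weights`, since `push_weights` stays near 5 s in every run.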

Next Steps

[-] implement the prefetch logic & shared memory
[-] Add metric similar to generator v0
[ ] Perf/Throughput testing compared to generator v0

Differential Revision: D91092833

meta-codesync · 2026-01-23T18:16:58Z

@JenniferWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91092833.

codecov-commenter · 2026-01-23T18:26:00Z

Codecov Report

❌ Patch coverage is 0% with 167 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.73%. Comparing base (080770c) to head (0c99d56).
⚠️ Report is 14 commits behind head on main.

Files with missing lines	Patch %	Lines
benchmarks/generator/weight_sync.py	0.00%	167 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #728      +/-   ##
==========================================
- Coverage   78.33%   68.73%   -9.61%     
==========================================
  Files          36       42       +6     
  Lines        4209     4455     +246     
==========================================
- Hits         3297     3062     -235     
- Misses        912     1393     +481
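
The percentages in the diff above follow from hits divided by total lines. This is a quick sketch using the counts from the report, not Codecov's own calculation.

```python
def coverage_pct(hits, lines):
    """Line coverage as a percentage."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff above.
base = coverage_pct(hits=3297, lines=4209)  # main
head = coverage_pct(hits=3062, lines=4455)  # this PR

if __name__ == "__main__":
    print(f"base {base:.2f}%  head {head:.2f}%")
```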



joecummings

I can't tell if this is supposed to be a stacked diff? Contains more than just prefetch information.

joecummings · 2026-01-23T19:21:11Z

+    """TitanTrainer with weight modification capabilities for benchmarking."""
+
+    @endpoint
+    async def modify_weights(self, scale: float = 1.1):


super nit: do we need to parameterize this? Can't we just 1) assume it's a floating point and 2) arbitrarily add or scale by X?

JenniferWang · Jan 26, 2026 (edited)

I'll simplify this. The rest of the test is quite essential.
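
One way the simplification suggested here could look, as a sketch only: drop the `scale` argument and apply a fixed constant to floating-point weights. Plain Python lists stand in for tensors, and `SCALE` and this free-function form of `modify_weights` are assumptions for illustration, not the code that was merged.

```python
SCALE = 1.1  # fixed, arbitrary perturbation; intentionally not a parameter


def modify_weights(state_dict):
    """Scale floating-point entries by SCALE; leave everything else untouched.

    `state_dict` maps names to lists of numbers (a stand-in for tensors).
    """
    for name, values in state_dict.items():
        if values and all(isinstance(v, float) for v in values):
            state_dict[name] = [v * SCALE for v in values]
    return state_dict


if __name__ == "__main__":
    print(modify_weights({"layer.weight": [1.0, 2.0], "step": [3]}))
```

The benchmark only needs the weights to change between iterations so that each sync moves fresh data, which is why any fixed constant would do.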

joecummings · 2026-01-23T19:26:23Z

+            logger.info(
+                "[ForgeMonarchExecutor] Deserializing TorchStore Controller from environment..."
+            )
+            self.torchstore_controller = cloudpickle.loads(


joecummings · 2026-01-23T19:28:26Z

+    model: str
+    iterations: int
+    prefetch_enabled: bool
+    n_fetcher_procs: int


Can we test how this parameter affects throughput?

JenniferWang · Jan 26, 2026

n_fetcher_procs -- setting it too high will slightly degrade the overall time.
prefetch_enabled -- this is the major toggle.

joecummings · Jan 27, 2026

Right, but is there a graph showing where "too high" is? I imagine too low will also be suboptimal.
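
A sweep along these lines could answer the question. The sketch below assumes a hypothetical `measure(n)` callable that runs the benchmark once with `n_fetcher_procs=n` and returns the `update_weights` time in seconds. The `synthetic_cost` function is a made-up cost curve used only to exercise the harness; it is not measured data.

```python
import statistics


def sweep_fetcher_procs(measure, candidates, iterations=3):
    """Return {n_fetcher_procs: mean seconds} over `iterations` runs each."""
    return {
        n: statistics.mean(measure(n) for _ in range(iterations))
        for n in candidates
    }


def best_setting(results):
    """Candidate with the lowest mean time."""
    return min(results, key=results.get)


def synthetic_cost(n):
    """Made-up curve: parallelism helps (16 / n), overhead hurts (+ n)."""
    return 16.0 / n + n


if __name__ == "__main__":
    results = sweep_fetcher_procs(synthetic_cost, [1, 2, 4, 8, 16])
    for n, seconds in results.items():
        print(f"n_fetcher_procs={n:>2}  mean={seconds:.2f}s")
    print("best:", best_setting(results))
```

Plotting `results` from a real `measure` would show both ends of the curve: where too few fetchers leave bandwidth unused and where too many start to add overhead.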


allenwang28

Review automatically exported from Phabricator review in Meta.


meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 23, 2026

meta-codesync Bot added fb-exported meta-exported labels Jan 23, 2026

facebook-github-bot force-pushed the export-D91092833 branch from 0c99d56 to 969b8ab Compare January 23, 2026 18:26

joecummings reviewed Jan 23, 2026


JenniferWang force-pushed the export-D91092833 branch from 969b8ab to bfdb049 Compare January 26, 2026 22:06

facebook-github-bot force-pushed the export-D91092833 branch from bfdb049 to 3f869a3 Compare January 26, 2026 22:06

JenniferWang force-pushed the export-D91092833 branch from 3f869a3 to 3a769bb Compare January 26, 2026 22:09

facebook-github-bot force-pushed the export-D91092833 branch from 3a769bb to 327d32a Compare January 27, 2026 19:22

facebook-github-bot force-pushed the export-D91092833 branch 2 times, most recently from 6f9290d to f5fe7d6 Compare January 27, 2026 19:31

allenwang28 approved these changes Jan 27, 2026


joecummings approved these changes Jan 27, 2026


JenniferWang merged commit 2729bdc into main Jan 27, 2026
11 of 12 checks passed

JenniferWang linked an issue Jan 28, 2026 that may be closed by this pull request

[vLLM v0.13] Re-architect forge's integration with vLLM (generator.py) #669

Closed

2 tasks

HosseinKaviani-H pushed a commit to HosseinKaviani-H/forge that referenced this pull request Feb 9, 2026

prefetch weights while waiting for pending requests to complete (meta…

04551ef

…-pytorch#728)

JenniferWang merged 1 commit into main from export-D91092833


4 participants
