Commit a93f9cd
Per-weight constant cache for CUDA backend
Replace the old update_constants_from_blob + cross-method sharing with a
unified per-weight caching approach. The first method to initialize loads
its constants from the blob and caches them by FQN. Subsequent methods
with matching FQNs reuse cached GPU tensors via
update_user_managed_constant_buffer_pairs, skipping blob loading entirely.
This eliminates duplicate GPU weight allocations for multi-method models
(e.g., prefill/decode), reducing peak GPU memory from ~35 GB to ~17.6 GB
for Qwen 3.5 MoE.
Also adds GPU peak memory reporting to the Qwen3.5 MoE runner and a
CI check (< 20 GB) in test_model_e2e.sh.

22 files changed
Lines changed: 2836 additions & 108 deletions
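The caching scheme described above can be sketched roughly as follows. This is an illustrative Python model only, not the actual ExecuTorch CUDA backend code: the class and function names (`ConstantCache`, `get_or_load`, `fake_blob_load`) are hypothetical, standing in for the real blob-loading path and the `update_user_managed_constant_buffer_pairs` reuse mechanism.

```python
class ConstantCache:
    """Caches constant (weight) tensors keyed by fully qualified name (FQN).

    Hypothetical sketch: the first method to initialize loads each constant
    from the blob; later methods with matching FQNs reuse the cached entry.
    """

    def __init__(self):
        self._by_fqn = {}

    def get_or_load(self, fqn, load_from_blob):
        # First requester of an FQN pays the blob-load (and GPU allocation)
        # cost; subsequent requesters share the same cached tensor.
        if fqn not in self._by_fqn:
            self._by_fqn[fqn] = load_from_blob(fqn)
        return self._by_fqn[fqn]


cache = ConstantCache()
loads = []

def fake_blob_load(fqn):
    # Stand-in for loading a weight from the serialized blob onto the GPU.
    loads.append(fqn)
    return object()  # placeholder for a GPU tensor

# "prefill" initializes first and loads the weight from the blob.
prefill_w = cache.get_or_load("layers.0.attn.wq", fake_blob_load)
# "decode" shares the same FQN, so blob loading is skipped entirely.
decode_w = cache.get_or_load("layers.0.attn.wq", fake_blob_load)

assert prefill_w is decode_w          # one allocation, shared across methods
assert loads == ["layers.0.attn.wq"]  # blob was read only once
```

Because both methods hold the same object rather than separate copies, peak memory for shared weights is paid once, which matches the commit's reported drop from ~35 GB to ~17.6 GB for a two-method (prefill/decode) model.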
File tree
- backends
- aoti
- arm
- _passes
- test/passes
- cuda
- benchmarks
- runtime
- tests
- triton/kernels
- xnnpack/runtime
- docs/source
- examples
- models/qwen3_5_moe
- raspberry_pi/pico2
- runtime/executor