Commit 922adad
Per weight constant cache (#18901)
**Problem**: Multi-method AOTI models (e.g., Qwen3.5 MoE with separate
prefill/decode methods) load the full weight blob independently for each
method, even when they share identical weights. This causes duplicate
GPU allocations -- Qwen3.5 MoE peaked at ~35 GB, making it impossible to
run on a single 24 GB GPU (e.g., 4090).
**Solution**: Introduce a per-weight FQN-keyed constant cache in
`CudaBackend`. The first method loads its constants from the blob and
caches them. Subsequent methods with matching FQNs skip blob loading
entirely and reuse cached GPU tensors via
`update_user_managed_constant_buffer_pairs`. A legacy fallback path is
preserved for older AOTI models without constant management APIs.
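The caching idea above can be sketched in a few lines of Python. This is a toy model, not the `CudaBackend` code: `ConstantCache`, `constants_for`, `make_blob`, and the FQN strings are hypothetical names, and `bytearray` objects stand in for GPU tensors. In the real backend, the reuse step goes through `update_user_managed_constant_buffer_pairs`. How partially overlapping FQN sets are handled here is an assumption.

```python
from typing import Callable, Dict, Iterable


class ConstantCache:
    """Toy FQN-keyed cache; bytearrays stand in for GPU tensors."""

    def __init__(self) -> None:
        self._by_fqn: Dict[str, bytearray] = {}
        self.blob_loads = 0  # number of times a blob was actually read

    def constants_for(
        self,
        fqns: Iterable[str],
        load_blob: Callable[[], Dict[str, bytearray]],
    ) -> Dict[str, bytearray]:
        wanted = list(fqns)
        missing = [f for f in wanted if f not in self._by_fqn]
        if missing:
            # At least one weight is not cached yet: read the blob once
            # and cache only the weights not already held (assumption).
            self.blob_loads += 1
            blob = load_blob()
            for fqn in missing:
                self._by_fqn[fqn] = blob[fqn]
        # Every method gets references to the same cached objects,
        # so no second copy of a weight is ever allocated.
        return {fqn: self._by_fqn[fqn] for fqn in wanted}


def make_blob() -> Dict[str, bytearray]:
    # Hypothetical stand-in for deserializing the weight blob.
    return {"layers.0.weight": bytearray(4), "layers.0.bias": bytearray(2)}


cache = ConstantCache()
names = ["layers.0.weight", "layers.0.bias"]
prefill = cache.constants_for(names, make_blob)  # loads and caches
decode = cache.constants_for(names, make_blob)   # reuses, no blob read
```

After both calls, `cache.blob_loads` is 1 and `prefill` and `decode` point at the same underlying objects, which is the property that removes the duplicate allocation.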
**Results**
Peak GPU memory: 35.4 GB → 17.6 GB (-50%)
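As a quick check on the rounded figure, the arithmetic can be reproduced from the two reported peaks. The 24 GB budget comes from the problem statement; the variable names are illustrative only.

```python
before_gb = 35.4  # reported peak before the change
after_gb = 17.6   # reported peak after the change

# Relative reduction in peak GPU memory.
reduction = (before_gb - after_gb) / before_gb

# Whether the new peak fits the 24 GB card named in the problem statement.
fits_24gb = after_gb < 24.0
```

The reduction works out to roughly 50.3%, in line with the reported -50%, and the new peak is under the 24 GB limit.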
---------
Co-authored-by: gasoonjia <gasoonjia@fb.com>
1 parent 62d8f21 commit 922adad
5 files changed
Lines changed: 428 additions & 163 deletions
File tree
- .ci/scripts
- backends
- aoti
- cuda/runtime
- examples/models/qwen3_5_moe