[ET Device Support] Device-aware memory planning: separate buffers per device type
Pull Request resolved: #18375
Extends memory planning to separate device tensors from CPU tensors into distinct
memory buffers. Non-CPU TensorSpecs (e.g., CUDA) are pre-assigned device-specific
mem_ids before the greedy/naive algorithm runs, ensuring they get planned into
independent memory buffers that never share space with CPU tensors.
ghstack-source-id: 357060891
@exported-using-ghexport
Differential Revision: [D97447105](https://our.internmc.facebook.com/intern/diff/D97447105/)
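The pre-assignment step described above can be pictured with a small sketch. This is not the actual ExecuTorch code (the real pass lives in `exir/memory_planning.py`); `TensorSpec` here is a simplified stand-in, and `preassign_device_mem_ids` is a hypothetical helper that only models the idea: non-CPU specs get device-specific `mem_id`s up front, while CPU specs are left for the greedy/naive planner.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for ExecuTorch's TensorSpec (the real class lives in
# exir/tensor.py and carries many more fields).
@dataclass
class TensorSpec:
    name: str
    device_type: str  # e.g. "cpu" or "cuda"
    mem_id: Optional[int] = None

def preassign_device_mem_ids(specs, first_device_mem_id=2):
    """Give every non-CPU spec a device-specific mem_id before the
    greedy/naive planner runs, so device tensors are planned into
    buffers that never share space with CPU tensors."""
    device_mem_ids = {}
    for spec in specs:
        if spec.device_type == "cpu":
            continue  # CPU specs are assigned by the normal planner later
        if spec.device_type not in device_mem_ids:
            device_mem_ids[spec.device_type] = first_device_mem_id + len(device_mem_ids)
        spec.mem_id = device_mem_ids[spec.device_type]
    return device_mem_ids

specs = [
    TensorSpec("a", "cpu"),
    TensorSpec("b", "cuda"),
    TensorSpec("c", "cuda"),
]
preassign_device_mem_ids(specs)
# "a" keeps mem_id=None (planner assigns it); "b" and "c" share the CUDA mem_id.
```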
`docs/source/compiler-memory-planning.md` (38 additions, 0 deletions)
> **Note:** Custom pool passes that pre-assign `mem_id` are not yet compatible
> with `enable_non_cpu_memory_planning=True`. When per-device planning is
> enabled, device buffers are appended after the CPU buffers in the global
> `bufsizes` array. If a custom pass has already set `mem_id` values (e.g.
> `mem_id=2` or `mem_id=3`), those slots may collide with the device-buffer
> slots, leading to an incorrect memory layout. If both features are enabled
> simultaneously, `apply_algo` raises a `NotImplementedError`.

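The shape of that incompatibility check can be sketched roughly as follows. This is a sketch, not the actual `apply_algo` source; the function name and the exact predicate are invented for illustration, and only the documented behavior (raising `NotImplementedError` when both features are enabled) is taken from the text above.

```python
def check_no_preassigned_mem_ids(specs, enable_non_cpu_memory_planning):
    """Sketch of the guard: per-device planning refuses to run if a custom
    pool pass has already pinned tensors to specific mem_ids, since those
    slots could collide with the appended device-buffer slots."""
    if enable_non_cpu_memory_planning and any(
        getattr(spec, "mem_id", None) is not None for spec in specs
    ):
        raise NotImplementedError(
            "Custom pool passes that pre-assign mem_id are not yet "
            "compatible with enable_non_cpu_memory_planning=True"
        )
```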
Users attempting to write a custom memory planning algorithm should start by looking at [the greedy algorithm's implementation](https://github.com/pytorch/executorch/blob/d62c41ca86435e5316e7ed292b6d68aff27a2fb7/exir/memory_planning.py#L459C1-L459C12).

## Device-Aware Memory Planning

When `enable_non_cpu_memory_planning=True` is set on `ExecutorchBackendConfig`,
the memory planning pass partitions tensor specs by their device type and runs
the planning algorithm independently for each device. This produces separate
memory buffers for each device (e.g. CPU vs. CUDA), ensuring that device memory
and host memory are never mixed.

```python
program = edge_program.to_executorch(
    exir.ExecutorchBackendConfig(
        enable_non_cpu_memory_planning=True,
    )
)
```

The resulting `bufsizes` array layout depends on which devices are present:

| Scenario   | `bufsizes`                 | Description                           |
|------------|----------------------------|---------------------------------------|
| CPU only   | `[0, cpu_size]`            | Same as legacy behavior               |
| CUDA only  | `[0, cuda_size]`           | Buffer 1 is CUDA, no wasted CPU slot  |
| CPU + CUDA | `[0, cpu_size, cuda_size]` | Buffer 1 is CPU, buffer 2 is CUDA     |

+
**Current limitations:**

- Not compatible with custom pool passes that pre-assign `spec.mem_id` (see note above).
- Submodule buffer sizes (from control-flow submodules like `cond`/`while`/`map`)
  are applied only to the CPU partition. This is safe today because on-device
  tensors only appear as delegate blob I/O, never inside control-flow submodules.

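The layout rule behind the `bufsizes` table can be modeled in a few lines. `layout_bufsizes` below is a hypothetical helper, not part of the ExecuTorch API; it captures only the rule the table describes: slot 0 is always 0, the CPU buffer (if any) comes first, and device buffers are appended after it.

```python
def layout_bufsizes(cpu_size, device_sizes):
    """Model the documented bufsizes layout rule.

    cpu_size:     planned CPU buffer size, or None if no CPU tensors exist.
    device_sizes: ordered list of (device, size) pairs, appended after CPU.
    """
    bufsizes = [0]  # slot 0 is always 0, as in every row of the table
    if cpu_size is not None:
        bufsizes.append(cpu_size)
    for _device, size in device_sizes:
        bufsizes.append(size)
    return bufsizes

# The three scenarios from the table:
print(layout_bufsizes(1024, []))                  # CPU only
print(layout_bufsizes(None, [("cuda", 2048)]))    # CUDA only
print(layout_bufsizes(1024, [("cuda", 2048)]))    # CPU + CUDA
```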
## Debugging Tool
Please refer to [Memory Planning Inspection](memory-planning-inspection.md) for a tool to inspect the result of memory planning.