Commit ba83675
Fix OOM in attention sparsity calibration by reducing peak GPU memory
This change removes three sources of unnecessary memory allocation during calibration:
1. flash_skip_softmax.py: In calc_correction_factor_and_p, `p` (a float
tensor the size of the full attention matrix) and `p_larger_than_thresh`
(a boolean tensor of the same size) were both alive simultaneously
alongside blocked_attn. Fuse the subtraction and comparison into a single
expression so `p` is never materialized, and explicitly del block_max,
block_max_larger, block_max_cummax, and p_larger_than_thresh as soon as
each is no longer needed. Applies to both the prefill and decode paths
(a minimal sketch follows this list).
2. calibrate.py (chunked prefill): del outputs after extracting
past_key_values in each chunk, freeing that chunk's logits before the
next one is processed.
3. calibrate.py (decode loop): del outputs after the prefill step to
free the large [B, seqlen, vocab] logits tensor before the decode loop
begins, and del outputs inside each decode step (the second sketch below
covers items 2 and 3).
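
Below is a minimal sketch of the fuse-and-del pattern from item 1. The
tensor names come from the commit message, but the actual math inside
calc_correction_factor_and_p (including block_max_larger, omitted here)
is not reproduced; this only illustrates the allocation pattern.

```python
import torch


def calc_correction_factor_and_p_sketch(blocked_attn: torch.Tensor, thresh: float) -> torch.Tensor:
    # Block-level running max (reduction dims are assumptions; the real
    # routine computes more statistics than this).
    block_max = blocked_attn.amax(dim=-1, keepdim=True)
    block_max_cummax = block_max.cummax(dim=-2).values
    del block_max  # freed as soon as the cumulative max exists

    # Before: p = blocked_attn - block_max_cummax      # full-size float
    #         p_larger_than_thresh = p > thresh        # full-size bool; both alive
    # After: fusing the subtraction and comparison means the float
    # temporary is released right after the comparison instead of
    # staying bound to `p` for the rest of the function.
    p_larger_than_thresh = (blocked_attn - block_max_cummax) > thresh
    del block_max_cummax

    # Hypothetical downstream use of the mask.
    correction_factor = p_larger_than_thresh.float().mean(dim=-1)
    del p_larger_than_thresh
    return correction_factor
```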
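
And a sketch of the del-outputs pattern from items 2 and 3, assuming a
Hugging Face-style causal LM whose forward returns logits and
past_key_values; the surrounding calibrate.py driver is paraphrased, not
reproduced.

```python
import torch


@torch.no_grad()
def calibration_sketch(model, input_ids: torch.Tensor, chunk_size: int, decode_steps: int):
    # Item 2 - chunked prefill: only the KV cache must survive between
    # chunks, so each chunk's logits are dropped as soon as the cache is
    # extracted.
    past_key_values = None
    for start in range(0, input_ids.shape[1], chunk_size):
        outputs = model(
            input_ids[:, start : start + chunk_size],
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = outputs.past_key_values
        del outputs  # frees this chunk's [B, chunk, vocab] logits

    # Item 3 - decode loop (a separate calibration path): keep only the
    # last-token prediction from prefill and release the big logits
    # tensor before decoding starts.
    outputs = model(input_ids, use_cache=True)
    past_key_values = outputs.past_key_values
    next_token = outputs.logits[:, -1:].argmax(dim=-1)
    del outputs  # the [B, seqlen, vocab] prefill logits are never needed again

    for _ in range(decode_steps):
        outputs = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1:].argmax(dim=-1)
        del outputs  # one step's logits freed before the next forward
```

In both paths, del only drops the local reference; the logits are
actually freed because nothing else holds them, which is exactly the
condition these changes establish.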
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Parent: 2f27cfa
2 files changed: 16 additions & 15 deletions
File tree
- modelopt/torch/sparsity/attention_sparsity
  - calibration
  - methods

calibration/calibrate.py: 4 additions & 3 deletions
[Diff body not preserved in this capture; hunks fall around original lines 133-139, 182-196, and 199-205.]
methods/flash_skip_softmax.py: 12 additions & 12 deletions
[Diff body not preserved in this capture; hunks fall around original lines 173-190 and 227-241.]