DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation #18350
psiddh wants to merge 2 commits
    default=False,
    help="Mark KV cache buffers as DYNAMIC_UNBOUND so they are allocated "
    "lazily at runtime instead of at load time. Reduces initial memory "
    "usage when max_context_length is large.",
is this because we do actually touch the full memory during attention?
Missed out on this comment...
Yes, the full max_context_length buffer is allocated on first inference, not at load time. This defers the KV cache allocation from Module.load() to the first generate() call.
Sharing some Test Results:
Concrete KV cache costs for Qwen3-0.6B (28 layers, 8 KV heads, 128 head_dim, fp16):

| max_context_length | KV cache | Without PR | With PR (at load) |
|--------------------|----------|---------------|-------------------|
| 128 (default) | 14 MB | Pre-allocated | 0 MB |
| 1024 | 115 MB | Pre-allocated | 0 MB |
| 2048 (standard) | 229 MB | Pre-allocated | 0 MB |
| 4096 | 459 MB | Pre-allocated | 0 MB |
| 16384 | 1.8 GB | OOM at load | 0 MB |

Note: KV cache sizes above are for fp16. fp32 doubles these values.
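For anyone who wants to sanity-check the sizes above, here is a rough sketch of the usual KV cache size estimate: one K and one V tensor per layer, each of shape max_context_length x KV heads x head_dim. The function name and the batch-size-1 assumption are mine, not taken from the PR, and the result is printed in MiB, so it may differ from the table by a few percent depending on how the units were rounded.

```python
def kv_cache_bytes(max_context_length, n_layers=28, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimate total KV cache size in bytes (batch size 1 assumed).

    Defaults follow the Qwen3-0.6B figures quoted in the comment above;
    bytes_per_elem=2 corresponds to fp16, 4 to fp32.
    """
    # Factor 2 accounts for storing both the K and the V tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * max_context_length


if __name__ == "__main__":
    for ctx in (128, 1024, 2048, 4096, 16384):
        print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**20:8.1f} MiB")
```

Passing bytes_per_elem=4 doubles every value, which matches the fp32 note above.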
With this PR I increased max_context_length to 4096 on Samsung S23 (8GB RAM) and tested 10+ multi-turn conversations with stable RSS:
- Load RSS: ~100-120 MiB (no KV cache)
- First inference RSS: ~1730 MiB (KV cache allocated on demand)
- Subsequent turns: stable, no memory growth
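For reference, RSS numbers like these can be reproduced on Linux or Android by reading /proc/self/statm, which is what the PR description says get_rss_bytes does. The sketch below is a Python illustration of that approach, not the C++ helper from the PR; parse_statm_rss is a hypothetical name introduced here so the parsing can be exercised without a /proc filesystem.

```python
import os


def parse_statm_rss(statm_text, page_size):
    """Return resident set size in bytes from the contents of /proc/<pid>/statm.

    The file holds space-separated page counts; the second field is the
    number of resident pages.
    """
    fields = statm_text.split()
    resident_pages = int(fields[1])
    return resident_pages * page_size


def get_rss_bytes():
    """Read the current process RSS in bytes (Linux-only: needs /proc)."""
    with open("/proc/self/statm") as f:
        return parse_statm_rss(f.read(), os.sysconf("SC_PAGE_SIZE"))
```

Sampling get_rss_bytes() once after load and once after the first generate() call is enough to see the deferred allocation show up as a jump in RSS.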
Key benefits:
- Lower RSS at startup → survives Android LMKD longer
- DynamicAllocator::free() enables freeing the cache on memory pressure (onTrimMemory); this is a future enhancement
- Unlocks larger context lengths (4K-16K) that would have OOM'd at load time without lazy allocation
e823bc9 to b311810
b311810 to 0779286
Moving the diff from Experimental to Needs Review (thoroughly tested) and requesting a formal review.
e56b926 to e8a4154
Adds DYNAMIC_UNBOUND tensor support to ExecuTorch, enabling lazy KV cache allocation that defers memory to first inference instead of model load time.

Export (Python):
- MarkDynamicUnboundPass tags KV cache buffers as DYNAMIC_UNBOUND
- SpecPropPass reads the flag and sets shape_dynamism accordingly
- Memory planner skips DYNAMIC_UNBOUND tensors
- emit_mutable_buffer_names auto-enabled when MarkDynamicUnboundPass detected
- Export flag: --lazy_kv_cache

Runtime (C++):
- DynamicAllocator interface with PalDynamicAllocator (malloc-based) and TrackingDynamicAllocator (with stats) implementations
- TensorImpl gains dynamic_allocator_ and capacity_bytes_ fields behind ET_DYNAMIC_ALLOCATOR_ENABLED compile guard
- DYNAMIC_UNBOUND case in internal_resize_contiguous uses DynamicAllocator with 2x growth policy for amortized resizing
- tensor_parser_portable.cpp: DYNAMIC_UNBOUND tensors start with capacity_bytes=0 and nullptr data (lazy allocation)
- op_update_cache.cpp: maybe_resize_cache checks for null data pointer, triggers DynamicAllocator on first use
- op_sdpa.cpp: same null-data guard before update_cache calls
- method.cpp: FreeCall properly frees DYNAMIC_UNBOUND tensor memory
- MemoryManager accepts optional DynamicAllocator*
- Module::load_method creates PalDynamicAllocator when enabled
- util.h: get_rss_bytes reads /proc/self/statm for current RSS

Build:
- CMake option EXECUTORCH_ENABLE_DYNAMIC_ALLOCATOR adds -DET_DYNAMIC_ALLOCATOR_ENABLED
- All DYNAMIC_UNBOUND code guarded by #ifdef ET_DYNAMIC_ALLOCATOR_ENABLED

Tested on Samsung S23 with Qwen3 0.6B (fp16) and Qwen2.5-Math 1.5B (8da4w):
- Load RSS: ~100 MiB (vs ~2147 MiB without); KV cache not pre-allocated
- First inference: +1.6 GB (KV cache allocated on demand)
- 10+ multi-turn conversations stable, no crashes
- Generation speed unchanged (10-37 tok/s)

Co-authored-by: Claude <noreply@anthropic.com>

# Conflicts:
#	CMakeLists.txt
#	examples/models/llama/export_llama_lib.py
#	extension/module/module.cpp
#	extension/module/module.h
#	runtime/core/portable_type/tensor_impl.h
#	runtime/executor/memory_manager.h
#	runtime/executor/method.cpp
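To illustrate the runtime behaviour described in the commit message (no storage at load, allocation on first use, 2x growth on resize, explicit free), here is a toy model in Python. It is only a sketch of the general technique: LazyBuffer and its fields are hypothetical names, and the exact growth rule used by internal_resize_contiguous in the PR may differ from the max(requested, 2 x capacity) rule assumed here.

```python
class LazyBuffer:
    """Toy model of a lazily allocated, growable buffer (not ExecuTorch code)."""

    def __init__(self):
        self.data = None          # no storage at "load" time
        self.capacity_bytes = 0   # bytes currently reserved
        self.size_bytes = 0       # bytes currently in use
        self.allocations = 0      # number of (re)allocations so far

    def resize(self, nbytes):
        """Make at least nbytes usable, allocating only when capacity is short."""
        if nbytes > self.capacity_bytes:
            # Assumed policy: grow to at least double the old capacity, so a
            # run of small increases needs only O(log n) reallocations.
            new_capacity = max(nbytes, 2 * self.capacity_bytes)
            new_data = bytearray(new_capacity)
            if self.data is not None:
                # Preserve the bytes that were already in use.
                new_data[: self.size_bytes] = self.data[: self.size_bytes]
            self.data = new_data
            self.capacity_bytes = new_capacity
            self.allocations += 1
        self.size_bytes = nbytes

    def free(self):
        """Release the storage, e.g. in response to memory pressure."""
        self.data = None
        self.capacity_bytes = 0
        self.size_bytes = 0
```

Growing such a buffer one byte at a time from 1 to 1024 bytes reallocates 11 times rather than 1024, which is the amortization the 2x policy is meant to buy.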
e8a4154 to 21200ba
See this new Android App & Desktop App (in progress) to validate this PR thoroughly: meta-pytorch/executorch-examples#240
Enable DYNAMIC_UNBOUND tensors in the portable runtime, allowing KV cache buffers to be dynamically managed rather than statically memory-planned. This is the architectural foundation for pay-as-you-go memory allocation in ExecuTorch LLM inference.
Core changes:
Export changes: