
DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation #18350

Draft
psiddh wants to merge 2 commits into main from dynamic_unbound_kv_cache


@psiddh psiddh commented Mar 19, 2026

Enable DYNAMIC_UNBOUND tensors in the portable runtime, allowing KV cache buffers to be dynamically managed rather than statically memory-planned. This is the architectural foundation for pay-as-you-go memory allocation in ExecuTorch LLM inference.

Core changes:

  • DynamicAllocator interface with allocate/reallocate/free
  • PalDynamicAllocator default impl (PAL-backed, 2x growth policy)
  • TrackingDynamicAllocator for memory stats observability
  • MemoryManager gains 4th slot for DynamicAllocator (backward compatible)
  • TensorImpl gains dynamic_allocator_ and capacity_bytes_ fields
  • TensorImpl::internal_resize_contiguous handles DYNAMIC_UNBOUND resize
  • tensor_parser_portable.cpp: remove DYNAMIC_UNBOUND rejection, wire up allocator at load time for tensors with no memory-planned data
  • method.cpp: FreeCall frees dynamic memory; destructor cleans up all
  • Module API auto-creates PalDynamicAllocator (DYNAMIC_UNBOUND just works)
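
To make the lazy first allocation and the 2x growth policy concrete, here is a minimal Python model. It is a sketch under stated assumptions, not the PR's C++ code: `GrowableBuffer`, `ensure_capacity`, and `realloc_count` are hypothetical names, and the real `DynamicAllocator` operates on raw memory through the PAL rather than on a `bytearray`.

```python
class GrowableBuffer:
    """Toy model of a lazily allocated buffer with a 2x growth policy.

    Hypothetical illustration only; it mirrors the idea of a tensor that
    starts with no data and capacity 0, then grows on demand.
    """

    def __init__(self):
        self.data = None          # no storage until first use
        self.capacity_bytes = 0
        self.realloc_count = 0    # how many times storage was (re)allocated

    def ensure_capacity(self, needed_bytes):
        # Fast path: the current allocation is already large enough.
        if needed_bytes <= self.capacity_bytes:
            return
        # Grow to at least double the old capacity so repeated small
        # resizes cost amortized O(1) copies per byte.
        new_capacity = max(needed_bytes, 2 * self.capacity_bytes)
        new_data = bytearray(new_capacity)
        if self.data is not None:
            new_data[: self.capacity_bytes] = self.data  # keep old contents
        self.data = new_data
        self.capacity_bytes = new_capacity
        self.realloc_count += 1

    def free(self):
        # Return to the "not yet allocated" state.
        self.data = None
        self.capacity_bytes = 0


buf = GrowableBuffer()
for needed in range(1, 1025):
    buf.ensure_capacity(needed)
print(buf.capacity_bytes, buf.realloc_count)
```

Growing one byte at a time up to 1024 bytes triggers only a handful of reallocations in this model, which is the point of doubling instead of resizing to the exact requested size.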

Export changes:

  • MarkDynamicUnboundPass marks KV cache buffers as DYNAMIC_UNBOUND
  • --lazy_kv_cache flag for Llama export
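
As a rough sketch, a boolean export flag like this is usually declared with `argparse` as shown below. The flag name comes from this PR; `action="store_true"`, the shortened help string, and the `build_parser` helper are assumptions for illustration, not the actual code in the Llama export script.

```python
import argparse


def build_parser():
    # Hypothetical helper; the real export script defines many more options.
    parser = argparse.ArgumentParser(description="Toy export CLI")
    parser.add_argument(
        "--lazy_kv_cache",
        action="store_true",  # assumption: a plain on/off switch
        default=False,
        help="Mark KV cache buffers as DYNAMIC_UNBOUND so they are "
        "allocated lazily at runtime instead of at load time.",
    )
    return parser


args = build_parser().parse_args(["--lazy_kv_cache"])
print(args.lazy_kv_cache)
```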

pytorch-bot Bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18350

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 48 New Failures, 1 Unrelated Failure, 3 Unclassified Failures

As of commit 0e196c3 with merge base 4741f3a (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2026
@psiddh psiddh changed the title DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation <EXPERIMENTAL - DO NOT REVIEW> DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation Mar 19, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

default=False,
help="Mark KV cache buffers as DYNAMIC_UNBOUND so they are allocated "
"lazily at runtime instead of at load time. Reduces initial memory "
"usage when max_context_length is large.",
Contributor

is this because we do actually touch the full memory during attention?

Contributor Author

@psiddh psiddh May 26, 2026

Missed out on this comment...

Yes, the full max_context_length buffer is allocated on first inference, not at load time. This defers the KV cache allocation from Module.load() to the first generate() call.

Sharing some test results. Concrete KV cache costs for Qwen3-0.6B (28 layers, 8 KV heads, head_dim 128, fp16):

| max_context_length | KV cache | Without PR    | With PR (at load) |
|--------------------|----------|---------------|-------------------|
| 128 (default)      | 14 MB    | Pre-allocated | 0 MB              |
| 1024               | 115 MB   | Pre-allocated | 0 MB              |
| 2048 (standard)    | 229 MB   | Pre-allocated | 0 MB              |
| 4096               | 459 MB   | Pre-allocated | 0 MB              |
| 16384              | 1.8 GB   | OOM at load   | 0 MB              |

Note: the KV cache sizes above are for fp16; fp32 doubles these values.
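
The sizes can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes one K and one V tensor per layer, each holding `max_context_length * n_kv_heads * head_dim` elements; `kv_cache_bytes` is a hypothetical helper, not part of the PR.

```python
def kv_cache_bytes(max_context_length, n_layers=28, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimated KV cache size in bytes (defaults: Qwen3-0.6B, fp16)."""
    # The factor 2 accounts for the separate K and V tensors in each layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * max_context_length


for ctx in (128, 1024, 2048, 4096, 16384):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```

Under these assumptions the cache costs 114,688 bytes per token, which gives 14 MiB at 128 tokens and 448 MiB at 4096. These are close to the table; the small differences from the quoted figures are probably due to unit conventions or rounding.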

With this PR I increased max_context_length to 4096 on Samsung S23 (8GB RAM) and tested 10+ multi-turn conversations with stable RSS:

  • Load RSS: ~100-120 MiB (no KV cache)
  • First inference RSS: ~1730 MiB (KV cache allocated on demand)
  • Subsequent turns: stable, no memory growth
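
For anyone who wants to repeat this kind of measurement on Linux or Android, resident set size can be read from `/proc/self/statm`, the same source the PR's `get_rss_bytes` helper is described as using. The Python function below is a hedged stand-in for that C++ helper, not a port of it; returning 0 when `/proc` is unavailable is an assumption of this sketch.

```python
import os


def get_rss_bytes():
    """Return the current resident set size in bytes, or 0 if unavailable."""
    try:
        with open("/proc/self/statm") as f:
            fields = f.read().split()
    except OSError:
        return 0  # no /proc filesystem (e.g. macOS or Windows)
    # The second field of statm is the resident set size in pages.
    return int(fields[1]) * os.sysconf("SC_PAGE_SIZE")


before = get_rss_bytes()
block = bytearray(64 * 1024 * 1024)  # touching 64 MiB typically raises RSS
after = get_rss_bytes()
print(f"RSS before: {before / 2**20:.1f} MiB, after: {after / 2**20:.1f} MiB")
```

Sampling before load, after load, and after the first `generate()` call is enough to see whether the KV cache allocation moved from load time to first inference.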

Key benefits:

  1. Lower RSS at startup, so the process survives the Android low-memory killer daemon (LMKD) longer.
  2. DynamicAllocator::free() makes it possible to free the cache under memory pressure (for example from onTrimMemory). This is a future enhancement.
  3. Unlocks larger context lengths (4K-16K) that would have OOM'd at load time without lazy allocation.

@psiddh psiddh force-pushed the dynamic_unbound_kv_cache branch 2 times, most recently from e823bc9 to b311810 Compare March 27, 2026 13:53
@psiddh psiddh force-pushed the dynamic_unbound_kv_cache branch from b311810 to 0779286 Compare May 26, 2026 17:27
linux-foundation-easycla Bot commented May 26, 2026

CLA Not Signed

@psiddh psiddh changed the title <EXPERIMENTAL - DO NOT REVIEW> DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation May 26, 2026
psiddh commented May 26, 2026

Moving the diff from Experimental to Needs Review (thoroughly tested) and requesting a formal review.

@psiddh psiddh force-pushed the dynamic_unbound_kv_cache branch 2 times, most recently from e56b926 to e8a4154 Compare May 26, 2026 17:54
Adds DYNAMIC_UNBOUND tensor support to ExecuTorch, enabling lazy KV cache
allocation that defers memory to first inference instead of model load time.

Export (Python):
- MarkDynamicUnboundPass tags KV cache buffers as DYNAMIC_UNBOUND
- SpecPropPass reads the flag and sets shape_dynamism accordingly
- Memory planner skips DYNAMIC_UNBOUND tensors
- emit_mutable_buffer_names auto-enabled when MarkDynamicUnboundPass detected
- Export flag: --lazy_kv_cache

Runtime (C++):
- DynamicAllocator interface with PalDynamicAllocator (malloc-based) and
  TrackingDynamicAllocator (with stats) implementations
- TensorImpl gains dynamic_allocator_ and capacity_bytes_ fields behind
  ET_DYNAMIC_ALLOCATOR_ENABLED compile guard
- DYNAMIC_UNBOUND case in internal_resize_contiguous uses DynamicAllocator
  with 2x growth policy for amortized resizing
- tensor_parser_portable.cpp: DYNAMIC_UNBOUND tensors start with
  capacity_bytes=0 and nullptr data (lazy allocation)
- op_update_cache.cpp: maybe_resize_cache checks for null data pointer,
  triggers DynamicAllocator on first use
- op_sdpa.cpp: same null-data guard before update_cache calls
- method.cpp: FreeCall properly frees DYNAMIC_UNBOUND tensor memory
- MemoryManager accepts optional DynamicAllocator*
- Module::load_method creates PalDynamicAllocator when enabled
- util.h: get_rss_bytes reads /proc/self/statm for current RSS

Build:
- CMake option EXECUTORCH_ENABLE_DYNAMIC_ALLOCATOR adds -DET_DYNAMIC_ALLOCATOR_ENABLED
- All DYNAMIC_UNBOUND code guarded by #ifdef ET_DYNAMIC_ALLOCATOR_ENABLED

Tested on Samsung S23 with Qwen3 0.6B (fp16) and Qwen2.5-Math 1.5B (8da4w):
- Load RSS: ~100 MiB (vs ~2147 MiB without) — KV cache not pre-allocated
- First inference: +1.6 GB (KV cache allocated on demand)
- 10+ multi-turn conversations stable, no crashes
- Generation speed unchanged (10-37 tok/s)

Co-authored-by: Claude <noreply@anthropic.com>

# Conflicts:
#	CMakeLists.txt
#	examples/models/llama/export_llama_lib.py
#	extension/module/module.cpp
#	extension/module/module.h
#	runtime/core/portable_type/tensor_impl.h
#	runtime/executor/memory_manager.h
#	runtime/executor/method.cpp
@psiddh psiddh force-pushed the dynamic_unbound_kv_cache branch from e8a4154 to 21200ba Compare May 26, 2026 17:57
psiddh commented May 26, 2026

See this new Android app and desktop app (in progress), built to validate this PR thoroughly: meta-pytorch/executorch-examples#240

cc @mergennachin


2 participants