llama-mmap: hint THP on mmap'd weights (Linux) #19
Marxist-Leninist wants to merge 1 commit into PrismML-Eng:prism from
Conversation
Heads-up: the PrismML-Eng labeler workflow is failing. The `server/webui` rule can be written either with an `all:` block:

```yaml
server/webui:
  - changed-files:
    - all:
      - any-glob-to-any-file:
        - tools/server/webui/**
```

or in the flat form:

```yaml
server/webui:
  - changed-files:
    - any-glob-to-any-file:
      - tools/server/webui/**
```

Happy to open either fix as a separate PR if you'd like.
Force-pushed from d74dd9b to a4ce593
Issue `madvise(MADV_HUGEPAGE)` on the read-only file mapping used for the model weights on Linux. For a 1 GB model this drops the potential page count from ~262K 4KB pages to ~512 2MB pages, reducing TLB pressure and (more importantly) the number of re-faults when pages get evicted under memory pressure.

No-op on kernels where THP is disabled. In `madvise` mode (the common modern default on desktop distros), THP is opt-in and only applied to regions that explicitly ask for it, which is what this hint does. Guarded by `defined(MADV_HUGEPAGE)` so it compiles cleanly on non-Linux.

Benchmark on a Skylake-SP VM, Bonsai-8B Q1_0, `-fa on -ctk q8_0 -ctv q8_0 -t 12 -ub 128`: neutral on this machine (~9.5 t/s tg128 both before and after) because the VM isn't memory-constrained. The change is intended for systems where the mapping does get evicted and re-faulted under pressure.
Force-pushed from a4ce593 to 036a707
Switched the PR to point at the prism branch; I just cleaned that branch and applied the pending CUDA and x86 PRs.
Pull request overview
Adds a Linux-specific madvise(MADV_HUGEPAGE) hint for the read-only mmap used to map model weights, aiming to encourage THP backing (2MB pages) and reduce TLB pressure / re-fault overhead under memory pressure.
Changes:
- On Linux, call `madvise(..., MADV_HUGEPAGE)` on the weights mapping (skipped when `numa` is enabled).
- Emit a debug log if the hint cannot be applied.
```cpp
#ifdef __linux__
    // Hint the kernel to back this region with 2MB huge pages where possible.
    // For a 1 GB model weights map this can drop the number of pages from ~262K
    // 4KB pages to ~512 2MB pages, reducing TLB pressure and (critically)
    // reducing the number of re-faults when pages get evicted under memory
    // pressure. No-op if THP is not enabled / supported.
    if (!numa) {
        if (madvise(addr, file->size(), MADV_HUGEPAGE)) {
            LLAMA_LOG_DEBUG("note: madvise(.., MADV_HUGEPAGE) not applied: %s\n",
                            strerror(errno));
        }
    }
#endif
```
What is this part doing?
Also, the x86-related code is now in the prism branch, so the only change seems to be this?
I'm not too familiar with this; is there a speed difference if we do this?
I think I will close this. After changing to the new branch, is this just doing a `LLAMA_LOG_DEBUG`?
Issue `madvise(MADV_HUGEPAGE)` on the read-only file mapping used for model weights on Linux. For a 1 GB model this drops the potential page count from ~262K 4KB pages to ~512 2MB pages, reducing TLB pressure and (more importantly) reducing the number of re-faults when pages get evicted under memory pressure.

Linux-only, guarded by `defined(MADV_HUGEPAGE)` and `__linux__`. Skipped when `numa` is set. No-op where THP is disabled.

Bench on a Skylake-SP VM, Bonsai-8B Q1_0, `-fa on -ctk q8_0 -ctv q8_0 -t 12 -ub 128`: neutral (~9.5 t/s tg128 both before and after) because the VM isn't memory-constrained. The change is intended for systems where the mapping does get evicted under pressure (constrained laptops, containers).

Tried the same hint on `ggml_aligned_malloc` for the KV/activation buffers as well; that showed a ~5% regression with no visible AnonHugePages, so I dropped that half.