
llama-mmap: hint THP on mmap'd weights (Linux)#19

Closed
Marxist-Leninist wants to merge 1 commit into PrismML-Eng:prism from Marxist-Leninist:feat/madv-hugepage

Conversation


@Marxist-Leninist Marxist-Leninist commented Apr 8, 2026

Issue madvise(MADV_HUGEPAGE) on the read-only file mapping used for model weights on Linux. For a 1 GB model this drops the potential page count from ~262K 4KB pages to ~512 2MB pages, reducing TLB pressure and (more importantly) reducing the number of re-faults when pages get evicted under memory pressure.

Linux-only, guarded by defined(MADV_HUGEPAGE) and __linux__. Skipped when numa is set. No-op where THP is disabled.

Bench on a Skylake-SP VM, Bonsai-8B Q1_0, -fa on -ctk q8_0 -ctv q8_0 -t 12 -ub 128: neutral (~9.5 t/s tg128 both before and after) because the VM isn't memory-constrained. The change is intended for systems where the mapping does get evicted under pressure (constrained laptops, containers).

Tried the same hint on ggml_aligned_malloc for the KV/activation buffers as well; that showed a ~5% regression with no visible AnonHugePages, so I dropped that half of the change.

@Marxist-Leninist
Author

Heads-up: the failing labeler check here is a pre-existing issue in the fork's workflow config, not related to this patch.

The PrismML-Eng labeler workflow (.github/workflows/labeler.yml) hard-codes repository: "ggml-org/llama.cpp" on its checkout step and reads the labeler config from there. Upstream ggml-org/llama.cpp/.github/labeler.yml still uses the v5 composition syntax:

server/webui:
    - changed-files:
        - all:
            - any-glob-to-any-file:
                - tools/server/webui/**

actions/labeler@v6 removed the all: / any: composition keys, so it errors with 'Unknown config options were under "changed-files": all' on every PR, not just this one. The fix is either:

  1. Bump actions/labeler back to v5 in the workflow, or
  2. Flatten the server/webui entry in upstream's labeler.yml (drop the all: wrapper, since it only contains one child):

server/webui:
    - changed-files:
        - any-glob-to-any-file:
            - tools/server/webui/**

Happy to open either fix as a separate PR if you'd like.

Issue madvise(MADV_HUGEPAGE) on the read-only file mapping used for
model weights on Linux. For a 1 GB model this drops the potential
page count from ~262K 4KB pages to ~512 2MB pages, reducing TLB
pressure and (more importantly) reducing the number of re-faults
when pages get evicted under memory pressure.

No-op on kernels where THP is disabled. On 'madvise' mode (the
common modern default for desktop distros), this is opt-in and
requires the caller to ask. Guarded by defined(MADV_HUGEPAGE) so it
compiles cleanly on non-Linux.

Benchmark on a Skylake-SP VM, Bonsai-8B Q1_0, -fa on -ctk q8_0
-ctv q8_0 -t 12 -ub 128: neutral on this machine (~9.5 t/s tg128
both before and after) because the VM isn't memory-constrained.
The change is intended for systems where the mapping does get
evicted and re-faulted under pressure.
@khosravipasha khosravipasha changed the base branch from master to prism April 13, 2026 23:51
@khosravipasha
Collaborator

Switched the PR to point to the prism branch; I just cleaned that branch and applied the pending CUDA and x86 PRs.
Keeping the master branch here exactly the same as llama.cpp.


Copilot AI left a comment


Pull request overview

Adds a Linux-specific madvise(MADV_HUGEPAGE) hint for the read-only mmap used to map model weights, aiming to encourage THP backing (2MB pages) and reduce TLB pressure / re-fault overhead under memory pressure.

Changes:

  • On Linux, call madvise(..., MADV_HUGEPAGE) on the weights mapping (skipped when numa is enabled).
  • Emit a debug log if the hint cannot be applied.


Comment thread src/llama-mmap.cpp
Comment on lines +454 to +467
#ifdef __linux__
    // Hint the kernel to back this region with 2MB huge pages where possible.
    // For a 1 GB model weights map this can drop the number of pages from ~262K
    // 4KB pages to ~512 2MB pages, reducing TLB pressure and (critically)
    // reducing the number of re-faults when pages get evicted under memory
    // pressure. No-op if THP is not enabled / supported.
    if (!numa) {
        if (madvise(addr, file->size(), MADV_HUGEPAGE)) {
            LLAMA_LOG_DEBUG("note: madvise(.., MADV_HUGEPAGE) not applied: %s\n",
                    strerror(errno));
        }
    }
#endif

Collaborator

What is this part doing?
Also, the x86-related code is now in the prism branch, so the only change seems to be this? I'm not too familiar with this.

Is there a speed difference if we do this?

@khosravipasha
Collaborator

I think I will close this; after changing to the new base branch, is this just doing a LLAMA_LOG_DEBUG?
I guess the CPU changes were already addressed in other PRs.

