
Update llama-mmap to work with 32-bit emscripten #22497

Merged

reeselevine merged 3 commits into ggml-org:master from reeselevine:file-loader-changes on Apr 30, 2026

Conversation

@reeselevine
Contributor

reeselevine commented Apr 29, 2026

Overview

When compiling to 32-bit WebAssembly through Emscripten, std::ftell returns a long and std::fseek takes a long offset, which is a 32-bit signed value there. Any file above 2GB therefore overflows the maximum positive value and the offset wraps negative, producing bad reads. This PR fixes that by delegating to fseeko and ftello in Emscripten builds, which use a 64-bit off_t that is handled correctly in both 32-bit and 64-bit WASM builds.
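
For illustration, here is a minimal sketch of the failure mode and the fix; the helper name file_size is mine, not necessarily what the patch uses:

#include <cstdio>
#include <cstdint>

// Query the size of an open file using 64-bit-safe seek/tell.
int64_t file_size(std::FILE * f) {
#ifdef _WIN32
    _fseeki64(f, 0, SEEK_END); // MSVC's 64-bit seek
    return _ftelli64(f);
#else
    // fseeko/ftello operate on off_t. Emscripten's musl-based libc defines
    // off_t as 64-bit even in wasm32 builds, so a file above 2GB reports
    // its real size, whereas std::ftell returns a 32-bit long there and
    // wraps negative past 2^31 - 1 bytes.
    fseeko(f, 0, SEEK_END);
    return ftello(f);
#endif
}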

Note that ggml does something similar in all cases: https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/gguf.cpp#L25. However, I didn't make that full change here because I'm not sure whether it would cause issues elsewhere.

Additional Information

For a little more context, this, in combination with the origin private file system (OPFS), allows models > 2GB to be loaded by the WebGPU backend in the browser without splitting the models into shards.
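
As an aside, a hedged sketch of how the OPFS side can look from C++ when building with Emscripten's WasmFS (API names from emscripten/wasmfs.h; assumes linking with -sWASMFS and, since OPFS access is proxied to a worker, typically -pthread as well; check your emsdk version):

#include <emscripten/wasmfs.h>

// Mount the origin private file system at /opfs so the ordinary
// fopen/fseeko/ftello path can read a model stored in the browser.
void mount_opfs() {
    backend_t opfs = wasmfs_create_opfs_backend();
    wasmfs_create_directory("/opfs", 0777, opfs);
}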

With 64-bit WebAssembly, the models can be loaded directly without splitting and without this change (theoretically, I haven't tried it), but since some browsers (Safari) don't support 64-bit memory yet, and it's not clear when they will, I think having this as an option is useful.


reeselevine requested a review from ggerganov as a code owner April 29, 2026 04:03
reeselevine requested review from CISC and ngxson April 29, 2026 04:03
@yomaytk
Contributor

yomaytk commented Apr 29, 2026

FYI @reeselevine: I confirmed that we can run a model over 2GiB in my environment using wasm64. The following screenshot shows unsloth/Qwen3.5-35B-A3B-Q3_K_S-GGUF (15.3GiB) running in Chrome (sorry, the log is a bit messy since this is just a toy program).

[Screenshot 2026-04-29 14:33:59: the model running in Chrome]

Also (you may already know this), while wasm64 enables us to use more than 4GiB of memory, the Wasm JS API spec sets the memory size limit to 16GiB, and existing tools such as Emscripten and V8 follow it. So we cannot run a model larger than 16GiB even with wasm64. Some people have requested support for more than 16GiB (e.g., WebAssembly/spec#1892), but it seems a consensus hasn't been reached yet. Given this limitation, I think supporting OPFS is nice for the WebGPU backend.
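
For readers trying this, an illustrative compile-time branch (my sketch, not from the thread): Clang defines __wasm32__/__wasm64__ per target, and Emscripten's -sMEMORY64 flag selects the wasm64 target.

#include <cstdio>

int main() {
#if defined(__EMSCRIPTEN__) && defined(__wasm64__)
    std::puts("wasm64: more than 4GiB of linear memory, but the JS API spec caps it at 16GiB");
#elif defined(__EMSCRIPTEN__)
    std::puts("wasm32: 4GiB address space; files over 2GiB need 64-bit ftello/fseeko");
#else
    std::puts("native build");
#endif
}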

@CISC
Member

CISC commented Apr 29, 2026

Hmmm, shouldn't this whole file be updated to do it like this?

#ifdef _WIN32
# define gguf_ftell _ftelli64
# define gguf_fseek _fseeki64
#else
# define gguf_ftell ftello
# define gguf_fseek fseeko
#endif
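
For what it's worth (my reading, not from the thread): on 32-bit glibc targets, fseeko/ftello still need _FILE_OFFSET_BITS=64 to get a 64-bit off_t, while Emscripten's musl-based libc defines off_t as 64-bit unconditionally, so the #else branch covers both wasm builds.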

@reeselevine
Contributor Author

@yomaytk cool! I'm curious, does ggml-webgpu's get_memory return 16GB or larger on your machine, or did you have to change it to get that model to work?

@CISC sure, I just wasn't sure if that would cause other issues, but I'll make that change

@yomaytk
Contributor

yomaytk commented Apr 30, 2026

> @yomaytk cool! I'm curious, does ggml-webgpu's get_memory return 16GB or larger on your machine, or did you have to change it to get that model to work?

When running the above Qwen model, ggml_backend_webgpu_device_get_memory returns 4294967292 (≈4GiB) and the function is called four times, so it seems that VRAM is allocated up to 16GiB in total. I didn't modify the core libraries or the WebGPU backend at all — I just modified some frontend code for the toy program.

Correction: the number of times ggml_backend_webgpu_device_get_memory is called seems to be unrelated to the model size.

@reeselevine
Contributor Author

@yomaytk ok, this shows that the WebGPU implementation of get_memory really isn't correct right now 😅. We're not actually subtracting the allocated memory from free, so we always report the max buffer size as being free. On the other hand, your example clearly shows that the max buffer size is not the limit on GPUs with more memory. We will have to think more about the best way to handle this in the future (unrelated to this PR though).
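
A hypothetical sketch (not the current ggml-webgpu code) of one way get_memory could account for live allocations:

#include <atomic>
#include <cstdint>

struct webgpu_mem_tracker {
    uint64_t total = 0;                 // e.g. the adapter's reported limit
    std::atomic<uint64_t> allocated{0}; // updated on buffer create/destroy

    void get_memory(uint64_t * free_mem, uint64_t * total_mem) const {
        const uint64_t a = allocated.load(std::memory_order_relaxed);
        *total_mem = total;
        *free_mem  = a < total ? total - a : 0;
    }
};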

@reeselevine
Contributor Author

@CISC @ggerganov if you're happy with the full change I think this can be merged soon.

reeselevine merged commit 5cbfb18 into ggml-org:master Apr 30, 2026
46 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : discard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore