Update llama-mmap to work with 32-bit emscripten #22497
reeselevine merged 3 commits into ggml-org:master
Conversation
FYI @reeselevine : I confirmed that we can run a model over 2GiB in size in my environment using wasm64. The following screenshot shows unsloth/Qwen3.5-35B-A3B-Q3_K_S-GGUF (15.3GiB) running in Chrome (sorry, the log is a bit messy since this is just a toy program).
Also (you may already know this), while wasm64 enables us to use more than 4GiB of memory, the Wasm js-api spec sets the memory size limit to 16GiB, and existing tools such as emscripten and V8 follow this. So we cannot run a model larger than 16GiB even with wasm64. Some people have actually requested support for more than 16GiB (e.g., WebAssembly/spec#1892), but it seems they haven't reached a consensus yet. Considering this limitation, I think supporting OPFS is nice for the WebGPU backend.
Hmmm, shouldn't this whole file be updated to do it like this? (Lines 21 to 27 in 36dafba)
When running the above Qwen model, Correction: the number of times |
@yomaytk ok, this shows that the WebGPU implementation of `get_memory` really isn't correct right now 😅. We're not actually subtracting the allocated memory from free, so we always report the max buffer size as being free. On the other hand, your example clearly shows that the max buffer size is not the limit on GPUs with more memory. We will have to think more about the best way to handle this in the future (unrelated to this PR though).
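For reference, a minimal sketch of what "subtracting the allocated memory from free" could look like. This is purely illustrative and is not the current ggml-webgpu code; the struct and member names here are hypothetical, and the real fix may take a different shape:

```cpp
// Illustrative only: track allocations so get_memory can report
// free = max - allocated instead of always reporting max as free.
#include <atomic>
#include <cstddef>

struct webgpu_mem_tracker {
    std::atomic<size_t> allocated{0};
    size_t max_bytes = 0;                        // e.g. the device's reported limit

    void on_alloc(size_t n) { allocated.fetch_add(n); }
    void on_free (size_t n) { allocated.fetch_sub(n); }

    void get_memory(size_t * free_bytes, size_t * total_bytes) const {
        const size_t used = allocated.load();
        *total_bytes = max_bytes;
        *free_bytes  = used < max_bytes ? max_bytes - used : 0;
    }
};
```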
@CISC @ggerganov if you're happy with the full change I think this can be merged soon. |

Overview
When compiling to 32-bit WebAssembly through Emscripten, `std::fseek` and `std::ftell` operate on a `long`, which is a 32-bit signed value. Unfortunately, this means that any file offset above 2GB overflows the maximum positive value, leading to bad results. This fixes that by delegating to `fseeko` and `ftello` in Emscripten builds, which use a 64-bit `off_t` that can be interpreted correctly in both 32-bit and 64-bit WASM builds. Note that ggml does something similar in all cases: https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/gguf.cpp#L25. However, I didn't make that full change here because I'm not sure whether it would lead to issues in other places.
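To make the mechanics concrete, here is a minimal sketch of the pattern described above. The wrapper names are illustrative, not the actual llama_file members, and the real llama-mmap.cpp code is structured differently:

```cpp
// Hedged sketch: pick 64-bit-safe tell/seek per platform so file offsets
// larger than 2 GiB survive 32-bit builds.
#include <cstdio>
#include <cstdint>

static uint64_t file_tell64(FILE * fp) {
#if defined(_WIN32)
    return (uint64_t) _ftelli64(fp);            // 64-bit tell on Windows
#elif defined(__EMSCRIPTEN__)
    return (uint64_t) ftello(fp);               // off_t is 64-bit under Emscripten,
                                                // even in wasm32 builds
#else
    return (uint64_t) std::ftell(fp);           // plain long elsewhere
#endif
}

static int file_seek64(FILE * fp, uint64_t offset, int whence) {
#if defined(_WIN32)
    return _fseeki64(fp, (__int64) offset, whence);
#elif defined(__EMSCRIPTEN__)
    return fseeko(fp, (off_t) offset, whence);  // takes a 64-bit off_t offset
#else
    return std::fseek(fp, (long) offset, whence);
#endif
}
```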
Additional Information
For a little more context, this change, in combination with the Origin Private File System (OPFS), allows models larger than 2GB to be loaded by the WebGPU backend in the browser without splitting them into shards.
With 64-bit WebAssembly, models can be loaded directly without splitting and without this change (in theory; I haven't tried it), but since some browsers (e.g. Safari) don't support 64-bit memory yet, and it's not clear when they will, I think having this as an option is useful.
Requirements