
Osimons/optimize fused rms norm f32 #337

Merged

Nexesenex merged 11 commits into Nexesenex:lcpp_pr_optimize_fused_rms_norm_f32 from ORippler:osimons/optimize_fused_rms_norm_f32 on Sep 1, 2025

Conversation

@Nexesenex (Owner)

No description provided.

CISC and others added 11 commits August 31, 2025 15:30

Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
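
A minimal usage sketch of the newly exposed splitter, assuming its signature mirrors ggml_backend_sched_reserve(); ggml_backend_sched_reset() and ggml_backend_sched_get_n_splits() are existing ggml-backend.h API, so check the header of your tree before relying on this:

```cpp
#include "ggml-backend.h"

// Sketch: split a graph across backends without allocating compute buffers,
// then inspect the result (e.g. for the automatic Flash Attention check).
// Assumed signature of the newly exposed call:
//   void ggml_backend_sched_split_graph(ggml_backend_sched_t sched,
//                                       struct ggml_cgraph * graph);
static int count_splits_without_alloc(ggml_backend_sched_t sched,
                                      struct ggml_cgraph * graph) {
    ggml_backend_sched_reset(sched);              // drop any previous assignments
    ggml_backend_sched_split_graph(sched, graph); // split only, no compute buffers
    return ggml_backend_sched_get_n_splits(sched);
}
```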
* metal : fix checks for available FA kernels

ggml-ci

* cont : fix comment [no ci]
* server : enable /slots by default and make it secure

ggml-ci

* server : fix tests to pass `--no-slots` when necessary

* server : extend /props with info about enabled endpoints
* sampling : optimize sorting using bucket sort in more places

ggml-ci

* sampling : do not sort in dist sampler

ggml-ci

* sampling : avoid heap allocations for sort buffers

ggml-ci

* common : add option to sort sampling candidates by probability

ggml-ci

* sampling : revert the change for preserving sort buffers

* sampling : use std::copy instead of memcpy

* sampling : clarify purpose of partial sort helpers

ggml-ci

* cont : remove wrong comment [no ci]

* common : update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
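
To illustrate the bucket-sort item above: a minimal partial sort that bins candidates by probability (assumed normalized to [0, 1]) and fully sorts only the buckets needed for the top-k prefix. Names are illustrative, not the actual llama.cpp helpers:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate { int id; float p; };

// Illustrative bucket-based partial sort: bin candidates by probability,
// then fully sort only the buckets needed to produce the top-k prefix.
// Assumes p is normalized to [0, 1].
static void top_k_bucket_sort(std::vector<Candidate> & cands, size_t k) {
    constexpr int nbuckets = 128;
    std::vector<std::vector<Candidate>> buckets(nbuckets);
    for (const Candidate & c : cands) {
        const int b = std::min(nbuckets - 1, (int) (c.p * nbuckets));
        buckets[nbuckets - 1 - b].push_back(c); // bucket 0 holds the highest p
    }
    std::vector<Candidate> out;
    out.reserve(cands.size());
    for (std::vector<Candidate> & b : buckets) {
        if (out.size() >= k) {
            break; // later buckets cannot enter the top-k
        }
        std::sort(b.begin(), b.end(),
                  [](const Candidate & x, const Candidate & y) { return x.p > y.p; });
        out.insert(out.end(), b.begin(), b.end());
    }
    if (out.size() > k) {
        out.resize(k);
    }
    cands = std::move(out);
}
```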
* CANN: fix RoPE cache issue on multi-device

RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.

This commit records the first layer of each device to avoid
the above issues.

* CANN: Optimize first-layer detection method

* CANN: Remove trailing whitespace

* CANN: Only cache the data that can be determined as unchanged through the parameters.

* CANN: Update function comment
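
A hedged sketch of the bookkeeping described above, with hypothetical names rather than the actual CANN code:

```cpp
#include <cstdint>

// Hypothetical per-device state: the RoPE cache is refilled once per token,
// and only by the first layer this device actually computes, so a device
// that starts at layer > 0 never reads a cache it was expecting layer 0
// (on another device) to have filled.
struct rope_cache_state {
    int64_t cached_token = -1;
    int     first_layer  = -1;
};

static bool rope_cache_needs_refill(rope_cache_state & st, int64_t token, int layer) {
    if (st.first_layer < 0) {
        st.first_layer = layer; // record the first layer seen on this device
    }
    if (layer == st.first_layer && token != st.cached_token) {
        st.cached_token = token;
        return true;  // recompute the RoPE cache for this token
    }
    return false;     // reuse the cached values
}
```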
… (ggml-org#15690)

* CUDA: fix build error from ambiguous __half conversions in conv2d

Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.

Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.

Fixes some build errors with half-precision kernels on CUDA.

ggml-ci

* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion

* CUDA: Add missing convert.cuh header

* CUDA: remove unnecessary extension in ggml_cuda_cast

* CUDA: Address review comment, remove second type template argument
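
A self-contained sketch of the original fix: __half2float() is the real CUDA intrinsic, while the helper and kernel names are illustrative (the helper was later folded into ggml_cuda_cast):

```cpp
#include <cuda_fp16.h>

// __half defines implicit conversions to float, int, short, ..., so an
// expression like `x * scale` with a float `scale` can be ambiguous.
// An explicit, templated to_float makes the promotion unambiguous.
template <typename T>
__device__ __forceinline__ float to_float(T x) {
    return (float) x; // pass-through for float and other unambiguous types
}

template <>
__device__ __forceinline__ float to_float<__half>(__half x) {
    return __half2float(x); // explicit half -> float conversion
}

template <typename T>
__global__ void scale_to_f32(const T * src, float * dst, float scale, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = scale * to_float(src[i]); // unambiguous float math
    }
}
```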
Fastdiv is a much faster way to do integer division, which was identified
as a bottleneck in rms_norm_f32.
This makes us more flexible in selecting the optimal thread count w.r.t.
parallelizing across a column vs. the launch overheads of threads and MIO
throttles.
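
A host-side sketch of the round-up fastdiv scheme (division by a fixed divisor via multiply-high, add and shift; see Hacker's Delight, ch. 10). Names are illustrative rather than the exact ggml helpers; on the device the multiply-high would be __umulhi():

```cpp
#include <cassert>
#include <cstdint>

struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

// Precompute once per divisor d >= 1 (e.g. a tensor dimension).
static fastdiv_vals fastdiv_init(uint32_t d) {
    uint32_t l = 0;
    while ((uint64_t{1} << l) < d) {
        ++l;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32)*((uint64_t{1} << l) - d))/d + 1);
    return {mp, l};
}

// n / d with one multiply-high, one add and one shift; on the GPU the
// high 32 bits of the product come from __umulhi(n, f.mp).
static uint32_t fastdiv_apply(uint32_t n, fastdiv_vals f) {
    const uint32_t hi = (uint32_t) (((uint64_t) n*f.mp) >> 32);
    return (uint32_t) (((uint64_t) hi + n) >> f.l);
}

int main() {
    const uint32_t divisors[] = {1, 3, 7, 4096, 11008};
    for (uint32_t d : divisors) {
        const fastdiv_vals f = fastdiv_init(d);
        for (uint32_t n = 0; n < 1000000u; ++n) {
            assert(fastdiv_apply(n, f) == n/d); // matches plain integer division
        }
    }
    return 0;
}
```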
@Nexesenex Nexesenex merged commit 82d1f85 into Nexesenex:lcpp_pr_optimize_fused_rms_norm_f32 Sep 1, 2025
45 of 49 checks passed
