Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf #753
alvarobartt merged 21 commits into huggingface:main

Conversation
Okay, so after some more digging, it seems one of the main reasons not to change this would be speed; see huggingface/candle#1062.
I was able to address the latency issue by raising this PR: huggingface/candle#3168
Wow, awesome work on huggingface/candle#3168 @vrdn-23! I thought the compiler used constant propagation, so I roughly assumed that TEI uses the approximate GeLU (~= new GeLU) for faster inference, while there's only a marginal difference between the variants. We could benchmark the speed by updating the … One minor point for discussion: if the latency stays comparable, it might make sense to keep this implementation. Anyway, looks great to me! I'd appreciate any feedback or thoughts you have!
Sorry for the late response, I was away for Thanksgiving break @kozistr!
I think that might have been true previously, but now that huggingface/candle#3168 has been merged, the gelu_erf (old gelu) implementation is actually faster than the new one. It would also be the most functionally similar in the outputs it produces with the existing models, so since it seems to be a win in terms of both quality and latency, I would argue that we stick to the implementation that is consistent across frameworks. Would love to hear your thoughts @alvarobartt @Narsil
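For reference, the two variants under discussion are the standard definitions: the exact variant uses the Gaussian error function, while the approximation replaces it with a tanh-based polynomial:

```latex
\mathrm{GeLU}_{\mathrm{erf}}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right),
\qquad
\mathrm{GeLU}_{\mathrm{tanh}}(x) = \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)
```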
This commit adapts text-embeddings-inference for NVIDIA Jetson Orin (SM87) and L4 GPU (SM89), and integrates valuable community PRs. Changes:

1. SM87/SM89 CUDA Support
   - Added compute capability 8.7 and 8.9 support
   - Modified Dockerfile-cuda-all for multi-arch builds
   - Updated compute_cap.rs for SM87/89 detection
   Files: Dockerfile-cuda-all, cuda-all-entrypoint.sh, compute_cap.rs

2. PR huggingface#730: Qwen3 Reranker Support
   - Added classification head for Qwen3 reranking
   - Implemented template formatting system for chat-based reranking
   Files: models/qwen3.rs, core/templates.rs, core/lib.rs

3. PR huggingface#787: Batch Notification Performance Optimization (see the sketch after this list)
   - Implemented AtomicUsize counter for batch processing
   - Reduced unnecessary notify_one() calls
   - Only the last request in a batch triggers thread notification
   Files: core/infer.rs, router/http/server.rs, router/grpc/server.rs

4. PR huggingface#753: GeLU Activation Consistency Fix
   - Changed Gelu from approximate (gelu) to exact (gelu_erf)
   - Added NewGelu variant for backward compatibility
   Files: layers/linear.rs

5. PR huggingface#790: StaticEmbedding Model Support
   - Added support for 0_StaticEmbedding/ directory structure
   - Implemented fallback loading for model weights and tokenizer
   - Default to Mean pooling for StaticEmbedding models
   Files: models/static_embedding.rs (new), lib.rs, download.rs, router/lib.rs

6. PR huggingface#746: DebertaV2 Sequence Classification Support
   - Complete DebertaV2 model implementation
   - Support for sequence classification tasks (e.g., Llama Prompt Guard)
   - CPU and CUDA device support
   Files: models/debertav2.rs (new), lib.rs, models/mod.rs

All changes have been tested and compile successfully with: cargo check --all-targets
Compilation verified with CUDA support: cargo install --path router -F candle-cuda

Target Hardware: NVIDIA Jetson Orin AGX (SM87), L4 GPU (SM89)
Date: January 5, 2026
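As a rough illustration of the batch-notification pattern in item 3 above (a sketch with hypothetical names, not the actual core/infer.rs code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokio::sync::Notify;

// Hypothetical sketch: count pending entries with an atomic counter and
// wake the background batching task once per batch, instead of calling
// notify_one() for every enqueued request.
fn enqueue_batch(pending: &Arc<AtomicUsize>, notify: &Arc<Notify>, batch: Vec<String>) {
    let last = batch.len().saturating_sub(1);
    for (i, _request) in batch.into_iter().enumerate() {
        pending.fetch_add(1, Ordering::AcqRel);
        // ... push _request onto the shared queue here ...
        if i == last {
            // Only the last request in the batch triggers a wake-up.
            notify.notify_one();
        }
    }
}
```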
Just wanted to add a note that #784 should probably be merged before this (if accepted), so that the speed-up gained by huggingface/candle#3168 can be utilized.
Changed the title: Default HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers
alvarobartt left a comment
Thanks @vrdn-23, I've included some changes on top, but the PR is ready to be merged once the CI is green!
Also note that I haven't updated candle yet, as there are some issues with the lib linking for cudarc: candle still enforces dynamic-linking, meaning we'd need a "fork" of candle that doesn't enforce it, so that it's up to the user to pick between static-linking, dynamic-linking, or dynamic-loading. It's tricky given that the constraint comes from both candle and cudarc (and also candle-extensions), so updating candle is not straightforward at the moment due to those constraints. Anyway, I'm working internally with the candle team to alleviate that soon, so that there are "unlinked" features, meaning that libraries built with candle and/or cudarc can pick the linking they want (cc @michaelfeil and @kozistr for visibility).
Thanks for the update and for picking up my slack @alvarobartt. My day job has been hectic for the past month, so it's been hard to pick up the work on these PRs. Appreciate you helping with these! <3
Background: I am pretty sure that cuBLASLt cannot fuse any kind of activation function. The difference between using gelu and fast_gelu is low. Can you provide a trained model (non-random init, with in-distribution activations) that produces an embedding with cosine similarity <= 0.99? I assume one using GELU. In commercial inference products I've used, I reverted GELU_exact -> GELU_fast for this exact reason. I'd be skeptical that this is not a perf regression, but am otherwise not biased.
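Not the full-model check requested above, but here's a quick, self-contained sketch of the raw elementwise gap between the two variants (it assumes the libm crate for erf, which std Rust lacks):

```rust
// Compare exact GeLU (erf) against the tanh approximation pointwise.
fn gelu_erf(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / std::f64::consts::SQRT_2))
}

fn gelu_tanh(x: f64) -> f64 {
    let inner = (2.0 / std::f64::consts::PI).sqrt() * (x + 0.044715 * x.powi(3));
    0.5 * x * (1.0 + inner.tanh())
}

fn main() {
    // Sweep x over [-6, 6] and report the largest absolute difference.
    let max_diff = (-600..=600)
        .map(|i| f64::from(i) / 100.0)
        .map(|x| (gelu_erf(x) - gelu_tanh(x)).abs())
        .fold(0.0_f64, f64::max);
    println!("max |GeLU_erf - GeLU_tanh| on [-6, 6]: {max_diff:.6}");
}
```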
Indeed, I'm working on adding BF16 support as we speak; I started with a draft last week in order to support https://huggingface.co/voyageai/voyage-4-nano (and the rest of their recently released embedding models).
I'm also not 100% sure yet on adding this change over, skipping …
The reason I made this PR was because I could get exact numbers for reproducibility with DebertaV2SequenceClassification models and have TEI match Transformers exactly. I am not sure if there is a performance regression when running this on CUDA, but at least on Metal, …
cudarc's cuBLASLt bindings (https://docs.rs/cudarc/latest/x86_64-pc-windows-msvc/cudarc/cublaslt/safe/enum.Activation.html) support Gelu and Relu. Other models will run ~10% slower. Usually, differences are introduced by other modeling issues, not the activation. As said, they will affect snapshot tests, but gelu_exact and gelu will produce the same scores on MTEB and other benchmarks. Deberta was recently added, and multiple other issues could be the cause.
I think the reason here might be that, with a non-Gelu/Relu activation, the activation in cuBLASLt gets dropped? Not sure what happens with these activations, since we just pass them into cuBLASLt.
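To make the fused-vs-unfused distinction concrete, here's an illustrative sketch (not TEI code; the epilogue coverage mirrors the cudarc enum linked above):

```rust
// Illustrative: activations with a cuBLASLt epilogue can be fused into the
// matmul; anything else needs a second elementwise kernel launch.
#[derive(Clone, Copy)]
enum Act {
    Relu,
    Gelu,    // tanh approximation: covered by a cuBLASLt epilogue
    GeluErf, // exact GeLU: no epilogue available
}

fn has_cublaslt_epilogue(act: Act) -> bool {
    matches!(act, Act::Relu | Act::Gelu)
}

fn main() {
    for (name, act) in [("relu", Act::Relu), ("gelu", Act::Gelu), ("gelu_erf", Act::GeluErf)] {
        if has_cublaslt_epilogue(act) {
            println!("{name}: fused into the matmul epilogue (single launch)");
        } else {
            println!("{name}: matmul first, then a separate activation kernel");
        }
    }
}
```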
- Negligible impact on inference quality
- GeLU erf not available on cuBLASLt
- Better performance due to fused GeLU + tanh on CUDA
- Alignment on CUDA, MPS and CPU
Changed the title: Default HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers → Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf
alvarobartt left a comment
Hey guys @vrdn-23, @kozistr and @michaelfeil!
After deliberating for a bit, I've decided that the "safest" approach to maintain compatibility with cuBLASLt is to default to the GeLU + tanh approximation and "ignore" exact GeLU (i.e., GeLU erf).
Despite the differences with Transformers, and hence with Sentence Transformers, it's worth aligning on GeLU + tanh both for efficiency reasons and to make sure that the logits, embeddings, etc. are reproducible across devices (CPU, Metal, CUDA, etc.).
Anyway, I agree with @vrdn-23 that even though the outputs only differ slightly and the impact on quality is negligible (the cosine similarity between embeddings produced with GeLU + tanh and exact GeLU is ~1.0), it's worth adding a tracing::warn! message to state that gelu means GeLU + tanh and not exact GeLU, to help explain the subtle differences (see screenshot below).
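A sketch of what such a warning could look like (the wording is assumed, not the exact message merged in this PR):

```rust
// Hypothetical: emitted when a config's hidden_act of "gelu" is loaded,
// assuming the tracing crate, which TEI already depends on.
fn warn_gelu_is_tanh_approximation() {
    tracing::warn!(
        "`gelu` resolves to the GeLU + tanh approximation, not exact GeLU (erf); \
         logits and embeddings may differ slightly from Transformers"
    );
}
```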
This being said, thanks for the PR @vrdn-23 and for the useful reviews and discussions @kozistr and @michaelfeil! 🤗
P.S. I'll leave this open for the next 24 hours, and then merge if everyone agrees / is aligned.
What does this PR do?
This PR unifies HiddenAct::Gelu to use GeLU + tanh over exact GeLU (aka GeLU erf), despite Transformers deserializing a hidden_act of gelu as GeLU erf, and gelu_pytorch_tanh (and others, see the snippet below) as GeLU + tanh; here, all the gelu variants will deserialize to GeLU + tanh.

This might lead to subtle differences in the produced logits, but it won't impact the performance or inference quality (see the modified snapshots). Additionally, adding GeLU erf would be problematic, as it would lead to differences when running the same model on CPU or Metal versus CUDA, given that cuBLASLt only implements GeLU and ReLU.
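The Transformers snippet referenced above isn't reproduced here; instead, here's a minimal serde sketch of the TEI-side behavior described in this paragraph (the variant and alias names are illustrative, not the exact enum from the codebase):

```rust
use serde::Deserialize;

// Illustrative only: every gelu* spelling in config.json maps to the same
// GeLU + tanh variant, unlike Transformers, which keeps gelu (erf) and
// gelu_pytorch_tanh distinct.
#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum HiddenAct {
    #[serde(alias = "gelu_new", alias = "gelu_fast", alias = "gelu_pytorch_tanh")]
    Gelu, // GeLU + tanh approximation
    Relu,
    Silu,
}

fn main() {
    let act: HiddenAct = serde_json::from_str("\"gelu_pytorch_tanh\"").unwrap();
    assert_eq!(act, HiddenAct::Gelu);
}
```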
Before submitting
- Did you update the insta snapshots?

Who can review?
@Narsil OR @alvarobartt OR @kozistr