Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf #753
alvarobartt merged 21 commits into huggingface:main

Conversation
Okay, so after some more digging, it seems one of the main reasons not to change this would be speed; see huggingface/candle#1062.
I was able to address the latency issue by raising this PR: huggingface/candle#3168
Wow, awesome work on huggingface/candle#3168 @vrdn-23! I thought the compiler used constant propagation, so I roughly assumed that TEI uses the approximate GeLU (~= new GeLU) for faster inference, while there's only a marginal difference between the variants. We could benchmark the speed by updating the … One minor point for discussion: if the latency stays comparable, it might make sense to keep this implementation. Anyway, looks great to me! I'd appreciate any feedback or thoughts you have!
Sorry for the late response, I was away for Thanksgiving break @kozistr!
I think that might have been true previously, but now that huggingface/candle#3168 has been merged, the gelu_erf (old gelu) implementation is actually faster than the new one. It would also be the most functionally similar in the outputs it produces with the existing models, so since it seems to be a win in terms of both quality and latency, I would argue that we stick to the implementation that is consistent across frameworks. Would love to hear your thoughts @alvarobartt @Narsil
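For reference, the two variants under discussion are the standard definitions: the exact variant uses the Gaussian error function, while the approximation replaces it with a tanh-based polynomial:

```latex
\mathrm{GeLU}_{\mathrm{erf}}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right),
\qquad
\mathrm{GeLU}_{\mathrm{tanh}}(x) = \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)
```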
This commit adapts text-embeddings-inference for NVIDIA Jetson Orin (SM87) and L4 GPU (SM89), and integrates valuable community PRs. Changes:

1. SM87/SM89 CUDA Support
   - Added compute capability 8.7 and 8.9 support
   - Modified Dockerfile-cuda-all for multi-arch builds
   - Updated compute_cap.rs for SM87/89 detection
   Files: Dockerfile-cuda-all, cuda-all-entrypoint.sh, compute_cap.rs

2. PR huggingface#730: Qwen3 Reranker Support
   - Added classification head for Qwen3 reranking
   - Implemented template formatting system for chat-based reranking
   Files: models/qwen3.rs, core/templates.rs, core/lib.rs

3. PR huggingface#787: Batch Notification Performance Optimization (see the sketch after this list)
   - Implemented AtomicUsize counter for batch processing
   - Reduced unnecessary notify_one() calls
   - Only the last request in a batch triggers thread notification
   Files: core/infer.rs, router/http/server.rs, router/grpc/server.rs

4. PR huggingface#753: GeLU Activation Consistency Fix
   - Changed Gelu from approximate (gelu) to exact (gelu_erf)
   - Added NewGelu variant for backward compatibility
   Files: layers/linear.rs

5. PR huggingface#790: StaticEmbedding Model Support
   - Added support for 0_StaticEmbedding/ directory structure
   - Implemented fallback loading for model weights and tokenizer
   - Default to Mean pooling for StaticEmbedding models
   Files: models/static_embedding.rs (new), lib.rs, download.rs, router/lib.rs

6. PR huggingface#746: DebertaV2 Sequence Classification Support
   - Complete DebertaV2 model implementation
   - Support for sequence classification tasks (e.g., Llama Prompt Guard)
   - CPU and CUDA device support
   Files: models/debertav2.rs (new), lib.rs, models/mod.rs

All changes have been tested and compile successfully with: cargo check --all-targets
Compilation verified with CUDA support: cargo install --path router -F candle-cuda

Target Hardware: NVIDIA Jetson Orin AGX (SM87), L4 GPU (SM89)
Date: January 5, 2026
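As a rough illustration of the batch-notification pattern in item 3 above (a sketch with hypothetical names, not the actual core/infer.rs code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokio::sync::Notify;

// Hypothetical sketch: count pending entries with an atomic counter and
// wake the background batching task once per batch, instead of calling
// notify_one() for every enqueued request.
fn enqueue_batch(pending: &Arc<AtomicUsize>, notify: &Arc<Notify>, batch: Vec<String>) {
    let last = batch.len().saturating_sub(1);
    for (i, _request) in batch.into_iter().enumerate() {
        pending.fetch_add(1, Ordering::AcqRel);
        // ... push _request onto the shared queue here ...
        if i == last {
            // Only the last request in the batch triggers a wake-up.
            notify.notify_one();
        }
    }
}
```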
Just wanted to add a note that #784 should probably be merged before this (if accepted), so that the speed-up gained by huggingface/candle#3168 can be utilized.
Changed the title: Default HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers
alvarobartt left a comment
Thanks @vrdn-23, I've included some changes on top, but the PR is ready to be merged once the CI is green!
Also note that I haven't updated candle yet, as there are some issues with the lib linking for cudarc: candle still enforces dynamic-linking, meaning we'd need a "fork" of candle that doesn't enforce it, so that it's up to the user to pick between static-linking, dynamic-linking, or dynamic-loading. It's tricky given that the constraint comes from both candle and cudarc (and also candle-extensions), so updating candle is not straightforward at the moment due to those constraints. Anyway, I'm working internally with the candle team to alleviate that soon, so that there are "unlinked" features, meaning that libraries built with candle and/or cudarc can pick the linking they want (cc @michaelfeil and @kozistr for visibility).
Thanks for the update and for picking up my slack @alvarobartt. My day job has been hectic for the past month, so it's been hard to pick up the work on these PRs. Appreciate you helping with these! <3
Background: I am pretty sure that cuBLASLt cannot fuse any kind of activation function. The difference between using gelu and fast_gelu is low. Can you provide a trained model (non-random init, with in-distribution activations) that produces an embedding with cosine similarity <= 0.99? I assume one using GELU. In commercial inference products I've used, I reverted GELU_exact -> GELU_fast for this exact reason. I'd be skeptical that this is not a perf regression, but am otherwise not biased.
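Not the full-model check requested above, but here's a quick, self-contained sketch of the raw elementwise gap between the two variants (it assumes the libm crate for erf, which std Rust lacks):

```rust
// Compare exact GeLU (erf) against the tanh approximation pointwise.
fn gelu_erf(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / std::f64::consts::SQRT_2))
}

fn gelu_tanh(x: f64) -> f64 {
    let inner = (2.0 / std::f64::consts::PI).sqrt() * (x + 0.044715 * x.powi(3));
    0.5 * x * (1.0 + inner.tanh())
}

fn main() {
    // Sweep x over [-6, 6] and report the largest absolute difference.
    let max_diff = (-600..=600)
        .map(|i| f64::from(i) / 100.0)
        .map(|x| (gelu_erf(x) - gelu_tanh(x)).abs())
        .fold(0.0_f64, f64::max);
    println!("max |GeLU_erf - GeLU_tanh| on [-6, 6]: {max_diff:.6}");
}
```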
Indeed, I'm working on adding BF16 support as we speak; I started with a draft last week in order to support https://huggingface.co/voyageai/voyage-4-nano (and the rest of their recently released embedding models).
I'm also not 100% sure yet on adding this change over, skipping …
The reason I made this PR was because I could get exact numbers for reproducibility with DebertaV2SequenceClassification models and have TEI match Transformers exactly. I am not sure if there is a performance regression when running this on CUDA, but at least on Metal, …
cudarc's cuBLASLt bindings (https://docs.rs/cudarc/latest/x86_64-pc-windows-msvc/cudarc/cublaslt/safe/enum.Activation.html) support Gelu and Relu. Other models will run ~10% slower. Usually, differences are introduced by other modeling issues, not the activation. As said, they will affect snapshot tests, but gelu_exact and gelu will produce the same scores on MTEB and other benchmarks. Deberta was recently added, and multiple other issues could be the cause.
I think the reason here might be that, with a non-Gelu/Relu activation, the activation in cuBLASLt gets dropped? Not sure what happens with these activations, since we just pass them into cuBLASLt.
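To make the fused-vs-unfused distinction concrete, here's an illustrative sketch (not TEI code; the epilogue coverage mirrors the cudarc enum linked above):

```rust
// Illustrative: activations with a cuBLASLt epilogue can be fused into the
// matmul; anything else needs a second elementwise kernel launch.
#[derive(Clone, Copy)]
enum Act {
    Relu,
    Gelu,    // tanh approximation: covered by a cuBLASLt epilogue
    GeluErf, // exact GeLU: no epilogue available
}

fn has_cublaslt_epilogue(act: Act) -> bool {
    matches!(act, Act::Relu | Act::Gelu)
}

fn main() {
    for (name, act) in [("relu", Act::Relu), ("gelu", Act::Gelu), ("gelu_erf", Act::GeluErf)] {
        if has_cublaslt_epilogue(act) {
            println!("{name}: fused into the matmul epilogue (single launch)");
        } else {
            println!("{name}: matmul first, then a separate activation kernel");
        }
    }
}
```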
- Negligible impact on inference quality
- GeLU erf not available on cuBLASLt
- Better performance due to fused GeLU + tanh on CUDA
- Alignment on CUDA, MPS and CPU
Changed the title: Default HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers → Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf
alvarobartt left a comment
Hey guys @vrdn-23, @kozistr and @michaelfeil!
After deliberating for a bit, I've decided that the "safest" approach to maintain compatibility with cuBLASLt is to default to the GeLU + tanh approximation and "ignore" exact GeLU (i.e., GeLU erf).
Despite the differences with Transformers, and hence with Sentence Transformers, it's worth aligning on GeLU + tanh both for efficiency reasons and to make sure that the logits, embeddings, etc. are reproducible across devices (CPU, Metal, CUDA, etc.).
Anyway, I agree with @vrdn-23 that even though the outputs only differ slightly and the impact on quality is negligible (the cosine similarity between embeddings produced with GeLU + tanh and exact GeLU is ~1.0), it's worth adding a tracing::warn! message to state that gelu means GeLU + tanh and not exact GeLU, to help explain the subtle differences (see screenshot below).
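A sketch of what such a warning could look like (the wording is assumed, not the exact message merged in this PR):

```rust
// Hypothetical: emitted when a config's hidden_act of "gelu" is loaded,
// assuming the tracing crate, which TEI already depends on.
fn warn_gelu_is_tanh_approximation() {
    tracing::warn!(
        "`gelu` resolves to the GeLU + tanh approximation, not exact GeLU (erf); \
         logits and embeddings may differ slightly from Transformers"
    );
}
```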
This being said, thanks for the PR @vrdn-23 and for the useful reviews and discussions @kozistr and @michaelfeil! 🤗
P.S. I'll leave this open for the next 24 hours, and then merge if everyone agrees / is aligned.
What does this PR do?
This PR unifies HiddenAct::Gelu to use GeLU + tanh over exact GeLU (aka GeLU erf), despite Transformers deserializing a hidden_act of gelu as GeLU erf, and gelu_pytorch_tanh (and others, see the snippet below) as GeLU + tanh; here, all the gelu variants will deserialize to GeLU + tanh.

This might lead to subtle differences in the produced logits, but it won't impact the performance or inference quality (see the modified snapshots). Additionally, adding GeLU erf would be problematic, as it would lead to differences when running the same model on CPU or Metal versus CUDA, given that cuBLASLt only implements GeLU and ReLU.
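The Transformers snippet referenced above isn't reproduced here; instead, here's a minimal serde sketch of the TEI-side behavior described in this paragraph (the variant and alias names are illustrative, not the exact enum from the codebase):

```rust
use serde::Deserialize;

// Illustrative only: every gelu* spelling in config.json maps to the same
// GeLU + tanh variant, unlike Transformers, which keeps gelu (erf) and
// gelu_pytorch_tanh distinct.
#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum HiddenAct {
    #[serde(alias = "gelu_new", alias = "gelu_fast", alias = "gelu_pytorch_tanh")]
    Gelu, // GeLU + tanh approximation
    Relu,
    Silu,
}

fn main() {
    let act: HiddenAct = serde_json::from_str("\"gelu_pytorch_tanh\"").unwrap();
    assert_eq!(act, HiddenAct::Gelu);
}
```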
Before submitting
- Did you update the insta snapshots?

Who can review?
@Narsil OR @alvarobartt OR @kozistr