
Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf #753

Merged
alvarobartt merged 21 commits into huggingface:main from vrdn-23:vrdn-23/fix-gelu-activation on Jan 30, 2026

Conversation

@vrdn-23 (Contributor) commented on Nov 6, 2025

What does this PR do?

This PR unifies HiddenAct::Gelu to use the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf). Note that Transformers deserializes hidden_act = "gelu" as exact GeLU and "gelu_pytorch_tanh" as the tanh approximation (among others, see the snippet below); with this change, all of the gelu variants will deserialize to GeLU + tanh (see the sketch after the snippet).

ACT2CLS = {
    "gelu": GELUActivation,
    "gelu_10": (ClippedGELUActivation, {"min": -10, "max": 10}),
    "gelu_fast": FastGELUActivation,
    "gelu_new": NewGELUActivation,
    "gelu_python": (GELUActivation, {"use_gelu_python": True}),
    "gelu_pytorch_tanh": GELUTanh,
    "gelu_python_tanh": (GELUTanh, {"use_gelu_tanh_python": True}),
    "gelu_accurate": AccurateGELUActivation,
    ...
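
As a rough illustration of what this could look like on the Rust side (a hypothetical sketch, not the PR's actual diff; the variant set and alias list here are assumptions):

use serde::Deserialize;

// Hypothetical sketch: every `gelu*` string in `config.json` deserializes to
// the single `Gelu` variant, implemented as the GeLU + tanh approximation.
#[derive(Debug, Clone, PartialEq, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum HiddenAct {
    #[serde(
        alias = "gelu_fast",
        alias = "gelu_new",
        alias = "gelu_python",
        alias = "gelu_pytorch_tanh"
    )]
    Gelu,
    Relu,
    Silu,
}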

This might lead to subtle differences in the produced logits, but it won't impact performance or inference quality (see the modified snapshots). Additionally, adding GeLU erf would be problematic, as it would lead to differences when running the same model on CPU or Metal versus CUDA, given that cuBLASLt only implements GeLU and ReLU.
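
To make the "subtle differences" concrete, here is a small standalone check of the two formulas (a sketch assuming the libm crate for erf; not code from this PR):

// Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
fn gelu_erf(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / 2f64.sqrt()))
}

// Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_tanh(x: f64) -> f64 {
    let c = (2.0 / std::f64::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x.powi(3))).tanh())
}

fn main() {
    // The maximum absolute difference over [-6, 6] is on the order of 1e-3,
    // which is why the impact on logits/embeddings is negligible.
    let max_abs_diff = (-60..=60)
        .map(|i| i as f64 / 10.0)
        .map(|x| (gelu_erf(x) - gelu_tanh(x)).abs())
        .fold(0.0_f64, f64::max);
    println!("max |gelu_erf - gelu_tanh| on [-6, 6]: {max_abs_diff:.6}");
}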

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

@Narsil OR @alvarobartt OR @kozistr

@vrdn-23 (Contributor, Author) commented on Nov 6, 2025

Okay, so after some more digging, it seems one of the main reasons not to change this would be the speed of gelu_erf() compared to gelu(). I was digging through the candle repository and saw some relevant issues (linked below). I can run some benchmarks later this week to see if there is still a performance loss.

huggingface/candle#1062
huggingface/candle#2418
huggingface/candle#1926 (comment)
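
For reference, a benchmark along those lines could be as simple as the following (a rough micro-benchmark sketch using candle's Tensor::gelu / Tensor::gelu_erf ops; the loop counts and setup are illustrative, not measured results):

use candle_core::{Device, Tensor};
use std::time::Instant;

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1024, 1024), &device)?;

    // Crude timing loops; a real benchmark would use criterion.
    let start = Instant::now();
    for _ in 0..100 {
        let _ = x.gelu()?; // tanh approximation
    }
    println!("gelu (tanh approx): {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..100 {
        let _ = x.gelu_erf()?; // exact GeLU
    }
    println!("gelu_erf (exact):   {:?}", start.elapsed());
    Ok(())
}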

@vrdn-23 (Contributor, Author) commented on Nov 10, 2025

I was able to address the latency issue by raising huggingface/candle#3168.
So hopefully we can make this change without any loss in performance now!

@kozistr (Contributor) commented on Nov 11, 2025

Wow, awesome work on huggingface/candle#3168, @vrdn-23!

I thought the compiler uses constant propagation, so the sqrt(1/2) part doesn't actually execute on every call.

I roughly assume that TEI uses the approximate gelu (~= new gelu) for faster inference, while there's only a marginal difference between the variants.

We could benchmark the speed by updating the candle-nn version here to the latest commit, which includes your candle-nn PR!

One minor point for discussion: if the latency stays comparable, it might make sense to keep this implementation.

Anyway, looks great to me!

I’d appreciate any feedback or thoughts you have!
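
For context, the kind of change being discussed (hoisting the sqrt(1/2) computation out of the per-element path) can be sketched like this; this assumes libm for erf and is not the literal candle-nn diff:

// Before: 1/sqrt(2) is written as a runtime expression; whether it gets
// const-folded depends on the optimizer.
fn gelu_erf_naive(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / 2f64.sqrt()))
}

// After: use the precomputed constant from std, so there is no sqrt per call.
fn gelu_erf_hoisted(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x * std::f64::consts::FRAC_1_SQRT_2))
}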

@vrdn-23 (Contributor, Author) commented on Dec 3, 2025

Sorry for the late response, but I was away for Thanksgiving break, @kozistr!

> I roughly assume that TEI uses the approximate gelu (~= new gelu) for faster inference, while there's only a marginal difference between the variants.

That might have been true previously, but now that huggingface/candle#3168 has been merged, the gelu_erf (old gelu) implementation is actually faster than the new one. It would also be functionally closest to the outputs of existing models, so since it seems to be a win in terms of both quality and latency, I would argue that we stick to the implementation that is consistent across frameworks.

Would love to hear your thoughts @alvarobartt @Narsil

thomas-hiddenpeak added a commit to thomas-hiddenpeak/RMinte-Orin-TEI that referenced this pull request Jan 5, 2026
This commit adapts text-embeddings-inference for NVIDIA Jetson Orin (SM87)
and L4 GPU (SM89), and integrates valuable community PRs.

Changes:

1. SM87/SM89 CUDA Support
   - Added compute capability 8.7 and 8.9 support
   - Modified Dockerfile-cuda-all for multi-arch builds
   - Updated compute_cap.rs for SM87/89 detection
   Files: Dockerfile-cuda-all, cuda-all-entrypoint.sh, compute_cap.rs

2. PR huggingface#730: Qwen3 Reranker Support
   - Added classification head for Qwen3 reranking
   - Implemented template formatting system for chat-based reranking
   Files: models/qwen3.rs, core/templates.rs, core/lib.rs

3. PR huggingface#787: Batch Notification Performance Optimization
   - Implemented AtomicUsize counter for batch processing
   - Reduced unnecessary notify_one() calls
   - Only last request in batch triggers thread notification
   Files: core/infer.rs, router/http/server.rs, router/grpc/server.rs

4. PR huggingface#753: GeLU Activation Consistency Fix
   - Changed Gelu from approximate (gelu) to exact (gelu_erf)
   - Added NewGelu variant for backward compatibility
   Files: layers/linear.rs

5. PR huggingface#790: StaticEmbedding Model Support
   - Added support for 0_StaticEmbedding/ directory structure
   - Implemented fallback loading for model weights and tokenizer
   - Default to Mean pooling for StaticEmbedding models
   Files: models/static_embedding.rs (new), lib.rs, download.rs, router/lib.rs

6. PR huggingface#746: DebertaV2 Sequence Classification Support
   - Complete DebertaV2 model implementation
   - Support for sequence classification tasks (e.g., Llama Prompt Guard)
   - CPU and CUDA device support
   Files: models/debertav2.rs (new), lib.rs, models/mod.rs

All changes have been tested and compile successfully with:
  cargo check --all-targets

Compilation verified with CUDA support:
  cargo install --path router -F candle-cuda

Target Hardware: NVIDIA Jetson Orin AGX (SM87), L4 GPU (SM89)
Date: January 5, 2026
@vrdn-23 (Contributor, Author) commented on Jan 6, 2026

Just wanted to add a note that #784 should probably be merged in before this (if accepted), so that the speed-up gained by huggingface/candle#3168 can be utilized.

alvarobartt previously approved these changes on Jan 21, 2026
alvarobartt changed the title from "[Bugfix] Make Gelu Activations consistent across frameworks" to "Fix HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers" on Jan 23, 2026
alvarobartt previously approved these changes on Jan 23, 2026
@alvarobartt (Member) left a comment

Thanks @vrdn-23, I've included some changes on top, but the PR is ready to be merged once the CI is green!

Also note that I haven't updated candle yet, as there are some issues with the lib linking for cudarc: candle still enforces dynamic-linking, meaning we would need to create a "fork" of candle that doesn't enforce it, so that it's up to the user to pick between static-linking, dynamic-linking, or dynamic-loading. It's tricky given that the constraint comes from both candle and cudarc (and also candle-extensions), so updating candle is not straightforward at the moment. Anyway, I'm working internally with the candle team to alleviate that soon, so that there are "unlinked" features, meaning that libraries built with candle and/or cudarc can pick the linking they want (cc @michaelfeil and @kozistr for visibility).

alvarobartt added this to the v1.9.0 milestone on Jan 23, 2026
@vrdn-23 (Contributor, Author) commented on Jan 23, 2026

Thanks for the update and for picking up my slack, @alvarobartt. My day job has been hectic for the past month, so it's been hard to find time to work on these PRs. Appreciate you helping with them! <3

@michaelfeil (Contributor) commented

Background: I am pretty sure that cuBLASLt cannot fuse any kind of activation functions.

The difference between using gelu and fast_gelu is low; this matters at most in training. Candle has fp16 (no bf16!) support.

Can you provide a trained model (not randomly initialized, with in-distribution activations) that has an embedding cosine similarity <= 0.99? I assume using GELU. In commercial inference products I've used, I reverted GELU_exact -> GELU_fast for this exact reason.

I'd be skeptical that this is not a perf regression, but am otherwise not biased.

@alvarobartt (Member) commented

> The difference between using gelu and fast_gelu is low; this matters at most in training. Candle has fp16 (no bf16!) support.

Indeed, I'm working on adding BF16 support as we speak; I started with a draft last week in order to support https://huggingface.co/voyageai/voyage-4-nano (and the rest of their recently released embedding models).

> I'd be skeptical that this is not a perf regression, but am otherwise not biased.

I'm also not 100% sure yet about adding this change as-is, versus skipping gelu_erf and just aliasing every gelu variant in the config.json to gelu() instead; meaning that we simply drop the HiddenAct::GeluErf variant in favour of HiddenAct::Gelu. Also taking into consideration that gelu is faster (in theory) than gelu_erf, I might revert the GeluErf variant in favour of Gelu. Thoughts, @michaelfeil?

kozistr mentioned this pull request on Jan 25, 2026
@vrdn-23 (Contributor, Author) commented on Jan 26, 2026

The reason I made this PR was so that I could get exact numbers for reproducibility with DebertaV2SequenceClassification models and have TEI match Transformers. I am not sure if there is a performance regression when running this on CUDA, but at least on Metal, gelu_erf should be faster than gelu now. I also think that if we are collapsing everything into a single Gelu activation, we should leave a substantial comment on why that decision was made, and maybe include some benchmarks for the min/max/median/avg difference in embeddings produced by models when using gelu instead of gelu_erf, compared to Transformers.
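
A minimal sketch of such a comparison (hypothetical helper code, not from this PR; it applies the two activations to a stand-in tensor, whereas a real benchmark would run the full model twice, once per activation):

use candle_core::{Device, Tensor};

// Cosine similarity between two flat tensors.
fn cosine_similarity(a: &Tensor, b: &Tensor) -> candle_core::Result<f64> {
    let dot = (a * b)?.sum_all()?.to_scalar::<f32>()? as f64;
    let norm_a = (a.sqr()?.sum_all()?.to_scalar::<f32>()? as f64).sqrt();
    let norm_b = (b.sqr()?.sum_all()?.to_scalar::<f32>()? as f64).sqrt();
    Ok(dot / (norm_a * norm_b))
}

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1, 768), &device)?;
    // Stand-ins for "embedding with gelu" vs "embedding with gelu_erf".
    let e_tanh = x.gelu()?;
    let e_erf = x.gelu_erf()?;
    println!("cosine similarity: {:.6}", cosine_similarity(&e_tanh, &e_erf)?);
    Ok(())
}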

@michaelfeil (Contributor) commented

https://docs.rs/cudarc/latest/x86_64-pc-windows-msvc/cudarc/cublaslt/safe/enum.Activation.html supports Gelu and Relu. Other models will run ~10% slower. Usually, differences will be introduced by other modeling issues, not by the activation.
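
For reference, the linked enum in cudarc's safe cuBLASLt wrapper exposes exactly these two fusable epilogues (reproduced from the docs above):

// cudarc::cublaslt::Activation — the only activations that can be fused
// into the matmul epilogue.
pub enum Activation {
    Relu,
    Gelu,
}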

As said, these will affect snapshot tests, but gelu_exact and gelu will produce the same scores on MTEB and other benchmarks. Deberta was recently added, and multiple other causes could explain the differences.

@michaelfeil (Contributor) commented

I think the reason here might be that, when using a non-Gelu/Relu activation, the activation in cuBLASLt gets dropped? Not sure what happens with these activations, since we just pass them into cuBLASLt.

- Negligible impact on inference quality
- GeLU erf not available in cuBLASLt
- Better performance due to fused GeLU + tanh on CUDA
- Alignment across CUDA, MPS and CPU
alvarobartt changed the title from "Fix HiddenAct::Gelu and align {Gelu,GeluErf} with Transformers" to "Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf" on Jan 28, 2026
@alvarobartt (Member) left a comment

Hey guys @vrdn-23, @kozistr and @michaelfeil!

After deliberating for a bit, I've decided that the "safest" approach to maintain compatibility with cuBLASLt is to default to GeLU + tanh approximation and "ignore" exact GeLU (i.e., GeLU erf).

Despite the differences with Transformers, and hence with Sentence Transformers, it's worth aligning on GeLU + tanh for efficiency reasons but also to make sure that the logits, embeddings, etc. are reproducible across devices (CPU, Metal, CUDA, etc.).

Anyway, I agree with @vrdn-23 that even though the outputs only differ slightly and the impact on quality is negligible (the cosine similarity of embeddings with GeLU + tanh vs. exact GeLU is ~1.0), it's worth adding a tracing::warn! message stating that gelu means GeLU + tanh and not exact GeLU, to help explain the subtle differences (see the screenshot below).
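
For illustration, the warning could look something like this (the wording here is mine, not necessarily the merged diff):

tracing::warn!(
    "`hidden_act=gelu` is interpreted as the GeLU + tanh approximation, not \
     exact GeLU (GeLU erf); outputs may differ slightly from Transformers."
);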

This being said, thanks for the PR @vrdn-23 and for the useful reviews and discussions @kozistr and @michaelfeil! 🤗

P.S. I'll leave this open for the next 24 hours, and then merge if everyone agrees / is aligned.

alvarobartt merged commit cb9de7a into huggingface:main on Jan 30, 2026
2 of 13 checks passed
alvarobartt added a commit to vrdn-23/text-embeddings-inference that referenced this pull request Feb 5, 2026