
Add bidirectional attention and projection layer support for Qwen3-based models#808

Merged
alvarobartt merged 8 commits into huggingface:main from turbopuffer:voyage-4-nano-support
Feb 12, 2026

Conversation

@williambarberjr
Contributor

@williambarberjr williambarberjr commented Jan 30, 2026

What does this PR do?

This PR adds support for voyageai/voyage-4-nano, a Qwen3-based embedding model that uses bidirectional attention and a projection layer.

Changes

1. Bidirectional Attention Support

  • Added use_bidirectional_attention config field (default: false)
  • When true, disables causal masking in the attention mechanism
  • voyage-4-nano and similar embedding models use bidirectional attention so that every token can attend to the full context
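As a rough illustration of what the flag changes (a hypothetical helper, not the actual candle implementation in qwen3.rs / flash_qwen3.rs), the effect on the additive attention mask can be sketched as:

```python
# Illustrative sketch only: how `use_bidirectional_attention` would change
# the additive attention mask. 0.0 means attention is allowed, -inf means
# the position is masked out.

def attention_mask(seq_len: int, bidirectional: bool) -> list[list[float]]:
    """Return a seq_len x seq_len additive attention mask."""
    if bidirectional:
        # Embedding models like voyage-4-nano let every token attend
        # to the full sequence.
        return [[0.0] * seq_len for _ in range(seq_len)]
    # Causal masking: position i may only attend to positions j <= i.
    neg_inf = float("-inf")
    return [
        [0.0 if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]
```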

2. Projection Layer Support

  • Added num_labels config field for output projection dimension
  • When set, loads linear.weight from safetensors root level and applies projection after final normalization
  • voyage-4-nano projects from hidden_size=1024 to output_dim=2048
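A minimal sketch of the projection step (pure Python for illustration; the actual code operates on candle tensors and loads `linear.weight` from the safetensors root level):

```python
# Hypothetical sketch: apply the output projection y = W @ h after the
# final normalization, where W has shape (output_dim, hidden_size) and
# comes from `linear.weight`. For voyage-4-nano, W would be 2048 x 1024.

def project(hidden: list[float], linear_weight: list[list[float]]) -> list[float]:
    """Project a hidden vector (hidden_size) to the output dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in linear_weight]
```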

Model Configuration

Models using these features should include the following in their config.json:

{
  "use_bidirectional_attention": true,
  "num_labels": 2048
}
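For illustration, the backwards-compatible defaults can be sketched as follows (a hypothetical helper, not the actual serde-based Rust config struct):

```python
import json

# Both new fields default to disabled behavior, so existing Qwen3 configs
# keep causal attention and no projection layer.
def parse_extra_fields(raw_config: str) -> dict:
    cfg = json.loads(raw_config)
    return {
        "use_bidirectional_attention": cfg.get("use_bidirectional_attention", False),
        "num_labels": cfg.get("num_labels"),  # None means no projection
    }
```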

Testing

Tested with voyageai/voyage-4-nano:

  • ✅ Output dimension: 2048 (correct)
  • ✅ Cosine similarity vs HuggingFace transformers: 0.999965
  • ✅ Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
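The cosine-similarity parity check above can be reproduced with a simple helper (a sketch; the reported 0.999965 came from comparing actual TEI output against HuggingFace transformers):

```python
import math

# Cosine similarity between two embedding vectors; values near 1.0
# indicate the two implementations agree.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```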

Files Changed

  • backends/candle/src/models/flash_qwen3.rs - CUDA/flash attention implementation
  • backends/candle/src/models/qwen3.rs - CPU/Metal implementation + config struct
  • backends/candle/Cargo.toml - Added cudarc dev-dependency for CUDA tests
  • backends/candle/tests/test_voyage_nano.rs - CPU test with snapshots
  • backends/candle/tests/test_flash_voyage_nano.rs - CUDA test with snapshots
  • README.md - Added voyage-4-nano to supported models table
  • docs/source/en/supported_models.md - Added voyage-4-nano to docs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

@Narsil @alvarobartt - This adds two new config fields to support voyage-4-nano embedding model. The changes are backwards compatible (both fields default to disabled behavior).

@williambarberjr williambarberjr force-pushed the voyage-4-nano-support branch 6 times, most recently from bd2bc16 to 539f322 Compare January 30, 2026 22:59
Add two new config fields to Qwen3 to support voyage-4-nano and similar models:

- `use_bidirectional_attention`: When true, disables causal masking
  for embedding models that use full bidirectional attention
- `num_labels`: When set, loads projection layer from linear.weight
  at safetensors root level (e.g., 1024 -> 2048 for voyage-4-nano)

Both fields are backwards compatible, defaulting to disabled behavior.

Changes:
- backends/candle/src/models/qwen3.rs: Add config fields and CPU impl
- backends/candle/src/models/flash_qwen3.rs: Add CUDA/flash-attn impl
- backends/candle/tests/test_voyage_nano.rs: CPU tests with snapshots
- backends/candle/tests/test_flash_voyage_nano.rs: CUDA tests
- README.md, docs/source/en/supported_models.md: Add voyage-4-nano

Tested with voyageai/voyage-4-nano:
- Output dimension: 2048 (correct)
- Cosine similarity vs transformers: 0.999965
- Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
@alvarobartt alvarobartt self-requested a review January 31, 2026 09:36
Member

@alvarobartt alvarobartt left a comment


Hey @williambarberjr thanks for the PR, indeed I have a PR to add BF16 support for both Metal and CUDA, which I'll merge before this one to make sure that we use the correct dtype for Voyage AI Embedding models 🎉

@alvarobartt
Member

Q: Did you validate the cosine similarity both with and without normalization, or only with normalized embeddings? See e.g. https://huggingface.co/voyageai/voyage-4-nano#via-sentence-transformers, which won't normalize the embeddings since the default value for normalize_embeddings is false; this means that when calling /embed you should set normalize: false, as in Text Embeddings Inference that same parameter defaults to true.
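For example, a request matching sentence-transformers' default (no normalization) would explicitly disable TEI's default; this is a sketch of the payload only, with an example input text, and no request is sent:

```python
import json

# TEI's /embed route normalizes by default (normalize: true), while
# sentence-transformers' encode() defaults to normalize_embeddings=False,
# so a like-for-like comparison must set normalize to false.
payload = json.dumps({
    "inputs": "Which planet is known as the Red Planet?",  # example text
    "normalize": False,
})
```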

@alvarobartt
Member

See #809 for the BF16 support mentioned in the review 🤗

Correct order

Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
@alvarobartt alvarobartt added this to the v1.9.0 milestone Feb 7, 2026
@alvarobartt
Member

Hey @williambarberjr thanks again for the PR, did you have time to read the comments? Would you need any help tackling those? I really want this feature in v1.9.0 releasing (hopefully) soon, so let me know if you'd need help 🤗

@williambarberjr
Contributor Author

williambarberjr commented Feb 8, 2026

Sorry for being so slow after the quick review!

Reviewed and agree with the changes/comments. Implemented them, re-ran the validation comparing TEI vs sentence-transformers (voyageai/voyage-4-nano, trust_remote_code=True) on an A100.

MRL dimensions parity (2048/1024/512/256):

  • Query path (prompt_name="query"): all means ~0.99996+
    • 2048: 0.9999653
    • 1024: 0.9999656
    • 512: 0.9999639
    • 256: 0.9999615
  • Document path (prompt_name="document"): all means ~0.99997+
    • 2048: 0.9999787
    • 1024: 0.9999772
    • 512: 0.9999777
    • 256: 0.9999790

Normalization parity (2048 dim, 5 texts):

  • normalize=false (TEI) vs normalize_embeddings=False (ST): mean cosine 0.9999758
  • normalize=true (TEI) vs normalize_embeddings=True (ST): mean cosine 0.9999753

Extra sanity checks:

  • TEI raw vector norms were non-unit (~18–26), confirming normalize=false returns unnormalized vectors.
  • TEI normalized norms were ~1.0, as expected.
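The norm sanity check can be sketched with a small helper (illustrative only; the reported ~18–26 norms came from actual TEI output):

```python
import math

def l2_norm(vec: list[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))

# With normalize=false, raw embedding norms are non-unit (observed ~18-26
# for voyage-4-nano); with normalize=true they should be ~1.0.
def is_unit_norm(vec: list[float], tol: float = 1e-3) -> bool:
    return abs(l2_norm(vec) - 1.0) < tol
```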

Validation environment:

  • Hardware: NVIDIA A100-SXM4-40GB
  • Driver / CUDA (from nvidia-smi): 580.126.09 / CUDA 13.0
  • Python: 3.10.12
  • torch==2.10.0
  • sentence-transformers==5.2.2
  • transformers==5.1.0
  • TEI binary: text-embeddings-router 1.8.3
  • TEI commit used: 4cb198b
  • TEI launch flags: --model-id voyageai/voyage-4-nano --dtype float16 --pooling mean

@gazb23

gazb23 commented Feb 11, 2026

Please merge this ;D

@alvarobartt
Member

Yes @gazb23, the idea is to merge this today. I still need to test it myself first, but expect it to be merged by EOD today 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@alvarobartt alvarobartt merged commit b59f754 into huggingface:main Feb 12, 2026
2 of 16 checks passed
@michaelfeil
Contributor

@williambarberjr this is mostly a performance bug: for the voyage model the projection layer should be applied after pooling. Since the pooling is activation-free, pooling first and then projecting yields the same result but is much cheaper (2048x faster).
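Since mean pooling is linear, projecting the pooled vector is mathematically equivalent to pooling the per-token projections; a small sketch of that equivalence (hypothetical helpers, not the candle code):

```python
# Mean pooling commutes with a linear projection, so applying the
# projection once after pooling gives the same embedding as projecting
# every token first and pooling afterwards, at a fraction of the cost.

def mean_pool(rows: list[list[float]]) -> list[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def project(vec: list[float], weight: list[list[float]]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

tokens = [[1.0, 2.0], [3.0, 4.0]]          # per-token hidden states
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projection weight (3 x 2)

pool_then_project = project(mean_pool(tokens), W)
project_then_pool = mean_pool([project(t, W) for t in tokens])
```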
