
Add bidirectional attention and projection layer support for Qwen3-based models#808

Merged
alvarobartt merged 8 commits into huggingface:main from turbopuffer:voyage-4-nano-support
Feb 12, 2026

Conversation

@williambarberjr
Contributor

@williambarberjr williambarberjr commented Jan 30, 2026

What does this PR do?

This PR adds support for voyageai/voyage-4-nano, a Qwen3-based embedding model that uses bidirectional attention and a projection layer.

Changes

1. Bidirectional Attention Support

  • Added use_bidirectional_attention config field (default: false)
  • When true, disables causal masking in the attention mechanism
  • voyage-4-nano and similar embedding models use bidirectional attention so that every token can attend to the full context
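As a rough illustration of what the flag changes (a hypothetical helper, not the actual candle implementation in qwen3.rs / flash_qwen3.rs), the effect on the additive attention mask can be sketched as:

```python
# Illustrative sketch only: how `use_bidirectional_attention` would change
# the additive attention mask. 0.0 means attention is allowed, -inf means
# the position is masked out.

def attention_mask(seq_len: int, bidirectional: bool) -> list[list[float]]:
    """Return a seq_len x seq_len additive attention mask."""
    if bidirectional:
        # Embedding models like voyage-4-nano let every token attend
        # to the full sequence.
        return [[0.0] * seq_len for _ in range(seq_len)]
    # Causal masking: position i may only attend to positions j <= i.
    neg_inf = float("-inf")
    return [
        [0.0 if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]
```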

2. Projection Layer Support

  • Added num_labels config field for output projection dimension
  • When set, loads linear.weight from safetensors root level and applies projection after final normalization
  • voyage-4-nano projects from hidden_size=1024 to output_dim=2048
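A minimal sketch of the projection step (pure Python for illustration; the actual code operates on candle tensors and loads `linear.weight` from the safetensors root level):

```python
# Hypothetical sketch: apply the output projection y = W @ h after the
# final normalization, where W has shape (output_dim, hidden_size) and
# comes from `linear.weight`. For voyage-4-nano, W would be 2048 x 1024.

def project(hidden: list[float], linear_weight: list[list[float]]) -> list[float]:
    """Project a hidden vector (hidden_size) to the output dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in linear_weight]
```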

Model Configuration

Models using these features should include the following in their config.json:

{
  "use_bidirectional_attention": true,
  "num_labels": 2048
}
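For illustration, the backwards-compatible defaults can be sketched as follows (a hypothetical helper, not the actual serde-based Rust config struct):

```python
import json

# Both new fields default to disabled behavior, so existing Qwen3 configs
# keep causal attention and no projection layer.
def parse_extra_fields(raw_config: str) -> dict:
    cfg = json.loads(raw_config)
    return {
        "use_bidirectional_attention": cfg.get("use_bidirectional_attention", False),
        "num_labels": cfg.get("num_labels"),  # None means no projection
    }
```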

Testing

Tested with voyageai/voyage-4-nano:

  • ✅ Output dimension: 2048 (correct)
  • ✅ Cosine similarity vs HuggingFace transformers: 0.999965
  • ✅ Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
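The cosine-similarity parity check above can be reproduced with a simple helper (a sketch; the reported 0.999965 came from comparing actual TEI output against HuggingFace transformers):

```python
import math

# Cosine similarity between two embedding vectors; values near 1.0
# indicate the two implementations agree.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```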

Files Changed

  • backends/candle/src/models/flash_qwen3.rs - CUDA/flash attention implementation
  • backends/candle/src/models/qwen3.rs - CPU/Metal implementation + config struct
  • backends/candle/Cargo.toml - Added cudarc dev-dependency for CUDA tests
  • backends/candle/tests/test_voyage_nano.rs - CPU test with snapshots
  • backends/candle/tests/test_flash_voyage_nano.rs - CUDA test with snapshots
  • README.md - Added voyage-4-nano to supported models table
  • docs/source/en/supported_models.md - Added voyage-4-nano to docs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

@Narsil @alvarobartt - This adds two new config fields to support voyage-4-nano embedding model. The changes are backwards compatible (both fields default to disabled behavior).

@williambarberjr williambarberjr force-pushed the voyage-4-nano-support branch 6 times, most recently from bd2bc16 to 539f322 Compare January 30, 2026 22:59
Add two new config fields to Qwen3 to support voyage-4-nano and similar models:

- `use_bidirectional_attention`: When true, disables causal masking
  for embedding models that use full bidirectional attention
- `num_labels`: When set, loads projection layer from linear.weight
  at safetensors root level (e.g., 1024 -> 2048 for voyage-4-nano)

Both fields are backwards compatible, defaulting to disabled behavior.

Changes:
- backends/candle/src/models/qwen3.rs: Add config fields and CPU impl
- backends/candle/src/models/flash_qwen3.rs: Add CUDA/flash-attn impl
- backends/candle/tests/test_voyage_nano.rs: CPU tests with snapshots
- backends/candle/tests/test_flash_voyage_nano.rs: CUDA tests
- README.md, docs/source/en/supported_models.md: Add voyage-4-nano

Tested with voyageai/voyage-4-nano:
- Output dimension: 2048 (correct)
- Cosine similarity vs transformers: 0.999965
- Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
@alvarobartt alvarobartt self-requested a review January 31, 2026 09:36
Member

@alvarobartt alvarobartt left a comment


Hey @williambarberjr thanks for the PR, indeed I have a PR to add BF16 support for both Metal and CUDA, which I'll merge before this one to make sure that we use the correct dtype for Voyage AI Embedding models 🎉

@alvarobartt
Member

Q: Did you validate the cosine similarity both with and without normalization, or only with normalized embeddings? See e.g. https://huggingface.co/voyageai/voyage-4-nano#via-sentence-transformers, which won't normalize the embeddings since the default value for normalize_embeddings is false; this means that when calling /embed you should set normalize: false, as in Text Embeddings Inference that same parameter defaults to true.
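For example, a request matching sentence-transformers' default (no normalization) would explicitly disable TEI's default; this is a sketch of the payload only, with an example input text, and no request is sent:

```python
import json

# TEI's /embed route normalizes by default (normalize: true), while
# sentence-transformers' encode() defaults to normalize_embeddings=False,
# so a like-for-like comparison must set normalize to false.
payload = json.dumps({
    "inputs": "Which planet is known as the Red Planet?",  # example text
    "normalize": False,
})
```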

@alvarobartt
Member

See #809 for the BF16 support mentioned in the review 🤗

Correct order

Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
@alvarobartt alvarobartt added this to the v1.9.0 milestone Feb 7, 2026
@alvarobartt
Member

Hey @williambarberjr thanks again for the PR, did you have time to read the comments? Would you need any help tackling those? I really want this feature in v1.9.0 releasing (hopefully) soon, so let me know if you'd need help 🤗

@williambarberjr
Contributor Author

williambarberjr commented Feb 8, 2026

Sorry for being so slow after the quick review!

Reviewed and agree with the changes/comments. Implemented them, re-ran the validation comparing TEI vs sentence-transformers (voyageai/voyage-4-nano, trust_remote_code=True) on an A100.

MRL dimensions parity (2048/1024/512/256):

  • Query path (prompt_name="query"): all means ~0.99996+
    • 2048: 0.9999653
    • 1024: 0.9999656
    • 512: 0.9999639
    • 256: 0.9999615
  • Document path (prompt_name="document"): all means ~0.99997+
    • 2048: 0.9999787
    • 1024: 0.9999772
    • 512: 0.9999777
    • 256: 0.9999790

Normalization parity (2048 dim, 5 texts):

  • normalize=false (TEI) vs normalize_embeddings=False (ST): mean cosine 0.9999758
  • normalize=true (TEI) vs normalize_embeddings=True (ST): mean cosine 0.9999753

Extra sanity checks:

  • TEI raw vector norms were non-unit (~18–26), confirming normalize=false returns unnormalized vectors.
  • TEI normalized norms were ~1.0, as expected.
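The norm sanity check can be sketched with a small helper (illustrative only; the reported ~18–26 norms came from actual TEI output):

```python
import math

def l2_norm(vec: list[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))

# With normalize=false, raw embedding norms are non-unit (observed ~18-26
# for voyage-4-nano); with normalize=true they should be ~1.0.
def is_unit_norm(vec: list[float], tol: float = 1e-3) -> bool:
    return abs(l2_norm(vec) - 1.0) < tol
```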

Validation environment:

  • Hardware: NVIDIA A100-SXM4-40GB
  • Driver / CUDA (from nvidia-smi): 580.126.09 / CUDA 13.0
  • Python: 3.10.12
  • torch==2.10.0
  • sentence-transformers==5.2.2
  • transformers==5.1.0
  • TEI binary: text-embeddings-router 1.8.3
  • TEI commit used: 4cb198b
  • TEI launch flags: --model-id voyageai/voyage-4-nano --dtype float16 --pooling mean

@gazb23

gazb23 commented Feb 11, 2026

Please merge this ;D

@alvarobartt
Member

Yes @gazb23, the idea is to merge this today. I still need to test it myself first, but expect it to be merged by EOD today 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@alvarobartt alvarobartt merged commit b59f754 into huggingface:main Feb 12, 2026
2 of 16 checks passed
@michaelfeil
Contributor

@williambarberjr this is mostly a performance bug: for the voyage model the projection layer should be applied after pooling. Since the pooling is activation-free, pooling first and then projecting yields the same result but is much cheaper (2048x faster).
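Since mean pooling is linear, projecting the pooled vector is mathematically equivalent to pooling the per-token projections; a small sketch of that equivalence (hypothetical helpers, not the candle code):

```python
# Mean pooling commutes with a linear projection, so applying the
# projection once after pooling gives the same embedding as projecting
# every token first and pooling afterwards, at a fraction of the cost.

def mean_pool(rows: list[list[float]]) -> list[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def project(vec: list[float], weight: list[list[float]]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

tokens = [[1.0, 2.0], [3.0, 4.0]]          # per-token hidden states
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projection weight (3 x 2)

pool_then_project = project(mean_pool(tokens), W)
project_then_pool = mean_pool([project(t, W) for t in tokens])
```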
