[Feature] Add support for Qwen3 Reranker with Sequence Classifier head #730
sigridjineth wants to merge 5 commits into huggingface:main
Conversation
- Added Qwen3ClassificationHead with flexible tensor loading that handles:
  - `score.weight` at the top level (for converted Qwen3 rerankers)
  - `classifier.weight`/`classifier.bias` patterns for standard models
- Updated Qwen3Model and FlashQwen3Model to support classification
- Added predict() method implementations for both model variants
- Extended Qwen3Config with an `id2label` field for classification
- Added a test case for Qwen3 reranker models with snapshot

The implementation supports Qwen3 models converted to sequence classifiers for reranking tasks (e.g., tomaarsen/Qwen3-Reranker-0.6B-seq-cls). The classification head gracefully handles different tensor naming conventions from various conversion approaches. Tested with both embedding and reranking Qwen3 models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
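The flexible tensor loading described above can be sketched as a name-fallback lookup. This is an illustrative stand-in, not the actual TEI implementation: a plain `HashMap` replaces the real safetensors `VarBuilder`, and the function names are hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a checkpoint's weight map: tensor name -> flat f32 data.
type Weights = HashMap<String, Vec<f32>>;

// Try the converted-reranker name first, then the standard classifier names.
fn load_classifier_weight(weights: &Weights) -> Option<(&Vec<f32>, Option<&Vec<f32>>)> {
    if let Some(w) = weights.get("score.weight") {
        // Converted Qwen3 rerankers expose a bias-less `score` projection.
        return Some((w, None));
    }
    if let Some(w) = weights.get("classifier.weight") {
        // Standard sequence-classification checkpoints pair the weight with a bias.
        return Some((w, weights.get("classifier.bias")));
    }
    None
}

fn main() {
    let mut weights = Weights::new();
    weights.insert("score.weight".to_string(), vec![0.1, 0.2]);
    let (w, bias) = load_classifier_weight(&weights).expect("no classifier tensor found");
    println!("weight len = {}, has bias = {}", w.len(), bias.is_some());
}
```

The same first-match-wins pattern extends naturally to further naming conventions by appending more candidate names.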
Signed-off-by: Daniel Chalef <131175+danielchalef@users.noreply.github.com>
- Added optional fields `instruction` and `use_template` to `RerankRequest` for custom instructions and template usage.
- Updated rerank logic to apply templates conditionally based on model type and user input.
- Introduced template formatting for Qwen3 models to improve reranking results.

This update allows for more flexible and context-aware reranking capabilities, particularly for Qwen3 models.
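The conditional template logic above can be sketched roughly as follows. This is a hedged illustration of the idea, not the PR's code: the function name, the default instruction string, and the exact template markup are assumptions, and the real implementation lives in the router's rerank path.

```rust
// Decide whether to wrap a (query, document) pair in a Qwen3-style rerank
// template. `use_template` and `instruction` mirror the new optional
// RerankRequest fields; all names here are illustrative.
fn format_pair(
    model_id: &str,
    use_template: Option<bool>,
    instruction: Option<&str>,
    query: &str,
    doc: &str,
) -> String {
    // Default behavior: apply the template only for Qwen3 reranker models.
    let apply = use_template.unwrap_or_else(|| model_id.to_lowercase().contains("qwen3"));
    if apply {
        // Assumed default instruction; real deployments would pass their own.
        let inst = instruction.unwrap_or("Given a query, judge whether the document answers it");
        format!("<Instruct>: {inst}\n<Query>: {query}\n<Document>: {doc}")
    } else {
        // Non-Qwen3 models keep the plain pair formatting.
        format!("{query}\n{doc}")
    }
}

fn main() {
    let s = format_pair(
        "tomaarsen/Qwen3-Reranker-0.6B-seq-cls",
        None,
        None,
        "what is rust?",
        "Rust is a systems language.",
    );
    println!("{s}");
}
```

A user-supplied `use_template: Some(false)` overrides the model-id heuristic, which keeps the behavior backward compatible for callers that already format inputs themselves.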
…classification head in the Qwen3 model.
Hey @sigridjineth, thanks for re-opening this PR! Given the number of changes, this might land in 1.9.0 rather than 1.8.3, but please let me organize the roadmap and come up with a plan 🤗
Looking forward to seeing this PR merged :)
@alvarobartt Just in case you missed it, is there any reason this PR hasn't been tagged with the v1.9.0 milestone?
This commit adapts text-embeddings-inference for NVIDIA Jetson Orin (SM87) and L4 GPU (SM89), and integrates valuable community PRs.

Changes:

1. SM87/SM89 CUDA Support
   - Added compute capability 8.7 and 8.9 support
   - Modified Dockerfile-cuda-all for multi-arch builds
   - Updated compute_cap.rs for SM87/89 detection
   - Files: Dockerfile-cuda-all, cuda-all-entrypoint.sh, compute_cap.rs

2. PR huggingface#730: Qwen3 Reranker Support
   - Added classification head for Qwen3 reranking
   - Implemented template formatting system for chat-based reranking
   - Files: models/qwen3.rs, core/templates.rs, core/lib.rs

3. PR huggingface#787: Batch Notification Performance Optimization
   - Implemented AtomicUsize counter for batch processing
   - Reduced unnecessary notify_one() calls
   - Only the last request in a batch triggers thread notification
   - Files: core/infer.rs, router/http/server.rs, router/grpc/server.rs

4. PR huggingface#753: GeLU Activation Consistency Fix
   - Changed Gelu from approximate (gelu) to exact (gelu_erf)
   - Added NewGelu variant for backward compatibility
   - Files: layers/linear.rs

5. PR huggingface#790: StaticEmbedding Model Support
   - Added support for the 0_StaticEmbedding/ directory structure
   - Implemented fallback loading for model weights and tokenizer
   - Default to Mean pooling for StaticEmbedding models
   - Files: models/static_embedding.rs (new), lib.rs, download.rs, router/lib.rs

6. PR huggingface#746: DebertaV2 Sequence Classification Support
   - Complete DebertaV2 model implementation
   - Support for sequence classification tasks (e.g., Llama Prompt Guard)
   - CPU and CUDA device support
   - Files: models/debertav2.rs (new), lib.rs, models/mod.rs

All changes have been tested and compile successfully with `cargo check --all-targets`.
Compilation verified with CUDA support: `cargo install --path router -F candle-cuda`

Target Hardware: NVIDIA Jetson Orin AGX (SM87), L4 GPU (SM89)
Date: January 5, 2026
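The batch notification optimization from PR huggingface#787 can be illustrated with a small, self-contained sketch. This is an assumption-laden stand-in for the real code: a closure replaces the actual `notify_one()` call, and the function name is hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of the PR huggingface#787 idea: issue one wakeup per batch instead of
// one per request. `notify` stands in for the real notify_one() call.
fn enqueue_batch(pending: &AtomicUsize, batch_size: usize, mut notify: impl FnMut()) {
    pending.store(batch_size, Ordering::SeqCst);
    for _ in 0..batch_size {
        // ... the request would be pushed onto the shared queue here ...
        if pending.fetch_sub(1, Ordering::SeqCst) == 1 {
            // Only the last request in the batch wakes the batching thread.
            notify();
        }
    }
}

fn main() {
    let pending = AtomicUsize::new(0);
    let mut wakeups = 0;
    enqueue_batch(&pending, 8, || wakeups += 1);
    println!("wakeups = {wakeups}"); // a single wakeup for the whole batch
}
```

Cutting eight `notify_one()` calls down to one avoids spurious wakeups of the batching thread while it is already draining the queue, which is where the performance win comes from.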
    // Apply template if needed for Qwen3 rerankers
    let model_id = info.model_id.clone();
    let use_template = req.use_template.unwrap_or_else(|| {
I am not sure this works, since Qwen defines the rerank probability as P(yes) / (P(yes) + P(no)), which means the classifier of the TEI backend gives you two outputs, while RERANK only knows how to work with one output.
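The scoring the reviewer describes can be written out as a softmax over the two classifier outputs, collapsed to a single relevance score. This is a minimal sketch under the assumption that the two outputs are the yes/no logits; it is not TEI code.

```rust
// Qwen3's rerank score: P(yes) = exp(l_yes) / (exp(l_yes) + exp(l_no)),
// i.e. a two-way softmax reduced to one number. Mathematically this equals
// sigmoid(l_yes - l_no), so the two-output head can be collapsed to the
// single score that RERANK expects.
fn rerank_score(logit_yes: f32, logit_no: f32) -> f32 {
    // Subtract the max before exponentiating for numerical stability.
    let m = logit_yes.max(logit_no);
    let ey = (logit_yes - m).exp();
    let en = (logit_no - m).exp();
    ey / (ey + en)
}

fn main() {
    println!("score = {:.4}", rerank_score(2.0, 0.0));
}
```

Because the score depends only on the difference of the two logits, one option for reconciling this with a single-output rerank path is to fold the yes/no distinction into the head itself (weight row `w_yes - w_no`), though whether that matches the converted checkpoint's behavior would need to be verified.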
    @@ -0,0 +1,110 @@
    use std::fmt::Write;
Great way of templating.
    @@ -1,6 +1,7 @@
    use crate::flash_attn::flash_attn_varlen;
The qwen3 parts LGTM; the templating might be better opened as a separate PR?
@alvarobartt should I update this PR to match the latest commit?
What does this PR do?
Before submitting
insta snapshots?

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.