fix(classify): honor raw_scores flag (return logits, softmax only when raw_scores=False)#662
Anai-Guo wants to merge 1 commit into
…max conditionally The text-classification pipeline applies softmax internally, so the `/classify` endpoint always returned softmax probabilities regardless of the `raw_scores` flag (the post-processing block was a no-op). Pass function_to_apply="none" so the model emits raw logits, then apply softmax in the batch handler only when raw_scores is False, mirroring the rerank path. Fixes michaelfeil#658
Code Review
This pull request updates the classification pipeline to retrieve raw logits from the underlying models by passing function_to_apply="none" to the pipelines, and manually computes the softmax probabilities in batch_handler.py when raw_scores is False. The review feedback highlights a potential breaking change where the default behavior of classify now returns raw logits instead of probabilities, and suggests changing the default value of raw_scores to False to preserve backward compatibility. Additionally, it is recommended to optimize the softmax computation using pure Python math.exp to avoid NumPy overhead on small lists, along with adding defensive checks for empty predictions.
| if raw_scores: | ||
| # perform softmax on scores | ||
| pass | ||
| if not raw_scores: |
The default value of raw_scores in the classify method signature (line 224) is currently True.
Previously, because the pipeline applied softmax internally and the if raw_scores: block was a no-op, calling classify with default arguments returned softmax probabilities. With this change, calling classify with default arguments will now return raw logits, which is a breaking change for the default behavior of the API.
To preserve the original default behavior of returning softmax probabilities, please change the default value of raw_scores in the signature of classify (line 224) to False (matching the behavior of rerank).
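The compatibility concern can be illustrated with a minimal sketch. `classify_scores` below is a hypothetical stand-in for the handler's post-processing step, not the actual `BatchHandler.classify` signature; it only assumes that the backend now returns raw logits and that `raw_scores` defaults to `False`, as suggested above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_scores(logits, raw_scores=False):
    """Hypothetical stand-in for the post-processing step.

    With raw_scores=False as the default, callers that pass no flag
    keep receiving probabilities, as they did before the change.
    """
    return list(logits) if raw_scores else softmax(logits)


logits = [2.0, 0.5, -1.0]
default_out = classify_scores(logits)               # probabilities
raw_out = classify_scores(logits, raw_scores=True)  # untouched logits
```

With a default of `True`, the first call would instead return the logits, which is the behavior change the comment warns about.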
| # the model returns raw logits; convert them to probabilities | ||
| for prediction in classifications: | ||
| logits = np.array([label["score"] for label in prediction]) | ||
| exp = np.exp(logits - logits.max()) | ||
| probs = exp / exp.sum() | ||
| for label, prob in zip(prediction, probs): | ||
| label["score"] = float(prob) |
Using NumPy array operations inside a loop on the main event loop introduces unnecessary overhead, especially since classification tasks typically have a very small number of classes (e.g., 2 to 10). We can optimize this by using pure Python with math.exp, which is significantly faster for small lists and avoids NumPy overhead.
Additionally, we should add defensive checks to handle cases where prediction might be empty to prevent potential runtime errors.

    import math

    # the model returns raw logits; convert them to probabilities
    for prediction in classifications:
        scores = [label["score"] for label in prediction]
        if not scores:
            continue
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        sum_exps = sum(exps)
        for label, exp_val in zip(prediction, exps):
            label["score"] = exp_val / sum_exps if sum_exps > 0 else 0.0

💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94af43e02f
| logits = np.array([label["score"] for label in prediction]) | ||
| exp = np.exp(logits - logits.max()) | ||
| probs = exp / exp.sum() |
Preserve sigmoid probabilities for multi-label classifiers
For multi-label sequence classifiers such as the documented GoEmotions model, labels are independent and the Transformers pipeline previously normalized raw logits with sigmoid rather than across-label softmax. This new default raw_scores=false path always divides by the sum over every label, forcing each prediction's scores to sum to 1 and suppressing valid co-occurring labels whenever more than one class applies, so /classify now returns non-HF probabilities for those models.
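To make the suppression effect concrete, here is a small sketch with made-up logits for a three-label model in which two labels genuinely apply. The logits and the 0.5 decision threshold are illustrative assumptions, not values from any real model.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits: the first two labels both apply, the third does not.
logits = [3.0, 2.5, -4.0]

sigmoid_scores = [sigmoid(x) for x in logits]  # independent per-label scores
softmax_scores = softmax(logits)               # forced to sum to 1

threshold = 0.5  # assumed decision threshold
sigmoid_hits = [s >= threshold for s in sigmoid_scores]
softmax_hits = [s >= threshold for s in softmax_scores]
```

Under sigmoid both applicable labels clear the threshold, while under softmax the second label is pushed below it because the two labels compete for the same probability mass.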
Greptile Summary
This PR fixes a long-standing bug where the
Confidence Score: 3/5
The flag inversion is correct for single-label classifiers, but multi-label models will silently receive softmax-transformed scores instead of the per-label sigmoid the HF pipeline would have applied.
libs/infinity_emb/infinity_emb/inference/batch_handler.py: the softmax applied unconditionally needs to account for multi-label model types
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client
participant Server as infinity_server
participant Engine as BatchHandler.classify
participant Backend as SentenceClassifier / OptimumClassifier
Client->>Server: "POST /classify {input, raw_scores}"
Server->>Engine: classify(sentences, raw_scores)
Engine->>Backend: _schedule(PredictSingle items)
Backend->>Backend: "encode_core() pipeline(function_to_apply=none)"
Note over Backend: Returns raw logits (no activation)
Backend-->>Engine: classifications (raw logits)
alt "raw_scores == False"
Engine->>Engine: stable softmax over logits per prediction
Note over Engine: softmax is wrong for multi-label models (should be sigmoid)
end
Engine-->>Server: classifications, usage
Server-->>Client: JSON response
Reviews (1): Last reviewed commit: "fix(classify): honor raw_scores by retur..."
| if not raw_scores: | ||
| # the model returns raw logits; convert them to probabilities | ||
| for prediction in classifications: | ||
| logits = np.array([label["score"] for label in prediction]) | ||
| exp = np.exp(logits - logits.max()) | ||
| probs = exp / exp.sum() | ||
| for label, prob in zip(prediction, probs): | ||
| label["score"] = float(prob) |
Softmax incorrectly applied to multi-label classifiers
The HF text-classification pipeline applies sigmoid (not softmax) for models with problem_type == "multi_label_classification" or num_labels == 1. By calling function_to_apply="none" in both classifier backends and then unconditionally applying softmax here, a multi-label classifier will produce wrong probabilities — labels are forced to be mutually exclusive (sum to 1) rather than independently scored per-class. The correct fix would check the model's config and apply sigmoid or softmax accordingly, mirroring the pipeline's own default logic.
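One way the handler could mirror that rule is sketched below. The helper names `pick_activation` and `to_probabilities` are hypothetical, and the sketch assumes the model's `problem_type` and `num_labels` are available to the handler; it implements only the sigmoid/softmax rule described in this comment and is not the pipeline's own code.

```python
import math


def stable_sigmoid(x):
    """Logistic function that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def stable_softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_activation(problem_type, num_labels):
    """Hypothetical helper: sigmoid for multi-label or single-output models,
    softmax otherwise, following the rule stated in the review comment."""
    if problem_type == "multi_label_classification" or num_labels == 1:
        return "sigmoid"
    return "softmax"


def to_probabilities(logits, problem_type, num_labels):
    """Convert raw logits to probabilities using the selected activation."""
    if not logits:
        return []
    if pick_activation(problem_type, num_labels) == "sigmoid":
        return [stable_sigmoid(x) for x in logits]
    return stable_softmax(logits)


multi = to_probabilities([3.0, 2.5, -4.0], "multi_label_classification", 3)
single = to_probabilities([3.0, 2.5, -4.0], "single_label_classification", 3)
```

Keeping the selection in a separate function makes the decision testable without loading a model; how the config values would reach the handler is left open here.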
| return_length=False, | ||
| ).encodings | ||
| return [len(t.tokens) for t in tks] | ||
| return [len(t.tokens) for t in tks] No newline at end of file |
Missing newline at end of file. Both changed classifier files (torch.py and optimum.py) lost their trailing newline, which will cause diff noise in future patches and may break some POSIX-compliant tooling.
| return [len(t.tokens) for t in tks] | |
| return [len(t.tokens) for t in tks] |
Good catch on the multi-label case. You're right: HF's text-classification pipeline picks
The clean fix is to stop second-guessing the pipeline: thread
@michaelfeil happy to push that version if you'd prefer it over the current approach. I just want to confirm the threading-through-the-scheduler direction is acceptable before reworking it.
Problem
`/classify` ignores the `raw_scores` parameter: softmax probabilities are always returned (#658).
The classifiers run an HF `text-classification` `pipeline(..., top_k=None)`, which applies softmax internally by default. The handler's post-processing block was a no-op (an `if raw_scores:` branch whose body was only `pass`), so the flag never had any effect, and raw logits were never reachable.
Fix
Mirror the existing `rerank` path (model returns raw scores, activation applied conditionally in the handler):
- `SentenceClassifier` / `OptimumClassifier`: call the pipeline with `function_to_apply="none"` so it emits raw logits.
- `BatchHandler.classify`: apply a numerically stable softmax only when `raw_scores` is `False`.

`raw_scores=True` now returns logits; the default `raw_scores=False` path is unchanged (softmax is monotonic, so the descending order the pipeline produces is preserved, and the resulting probabilities are identical).
🤖 Generated with Claude Code