fix(classify): honor raw_scores flag (return logits, softmax only when raw_scores=False) #662

Open
Anai-Guo wants to merge 1 commit into michaelfeil:main from Anai-Guo:fix/classify-raw-scores

Anai-Guo commented Jun 4, 2026

Problem

/classify ignores the raw_scores parameter — softmax probabilities are always returned (#658).

The classifiers run an HF text-classification pipeline(..., top_k=None), which applies softmax internally by default. The handler's post-processing block was a no-op:

if raw_scores:
    # perform softmax on scores
    pass

So the flag never had any effect, and raw logits were never reachable.

Fix

Mirror the existing rerank path (model returns raw scores, activation applied conditionally in the handler):

  • SentenceClassifier / OptimumClassifier: call the pipeline with function_to_apply="none" so it emits raw logits.
  • BatchHandler.classify: apply a numerically-stable softmax only when raw_scores is False.

raw_scores=True now returns logits; the default raw_scores=False path is unchanged (softmax is monotonic, so the descending order the pipeline produces is preserved, and the resulting probabilities are identical).
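The handler-side step can be illustrated with a minimal sketch in plain Python (the PR itself uses NumPy). The name `stable_softmax` and the sample logits are assumptions made for illustration and are not part of infinity_emb. The sketch shows that subtracting the maximum logit avoids overflow and that the descending order of the inputs carries over to the probabilities.

```python
import math


def stable_softmax(logits: list[float]) -> list[float]:
    """Softmax with max-subtraction so math.exp never sees a large positive input."""
    if not logits:
        return []
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)  # always >= 1.0, because the peak contributes exp(0) == 1
    return [e / total for e in exps]


# Hypothetical logits, already in descending order as a top_k=None pipeline call returns them.
# A naive math.exp(1000.0) would raise OverflowError; the shifted version does not.
logits = [1000.0, 998.5, -3.0]
probs = stable_softmax(logits)
```

Because every shifted exponent is at most zero, the largest term is exactly 1 and the denominator can never be zero, so no extra guard against division by zero is needed.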

🤖 Generated with Claude Code

…max conditionally

The text-classification pipeline applies softmax internally, so the `/classify` endpoint always returned softmax probabilities regardless of the `raw_scores` flag (the post-processing block was a no-op). Pass function_to_apply="none" so the model emits raw logits, then apply softmax in the batch handler only when raw_scores is False, mirroring the rerank path.

Fixes michaelfeil#658

gemini-code-assist Bot left a comment

Code Review

This pull request updates the classification pipeline to retrieve raw logits from the underlying models by passing function_to_apply="none" to the pipelines, and manually computes the softmax probabilities in batch_handler.py when raw_scores is False. The review feedback highlights a potential breaking change where the default behavior of classify now returns raw logits instead of probabilities, and suggests changing the default value of raw_scores to False to preserve backward compatibility. Additionally, it is recommended to optimize the softmax computation using pure Python math.exp to avoid NumPy overhead on small lists, along with adding defensive checks for empty predictions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

-        if raw_scores:
-            # perform softmax on scores
-            pass
+        if not raw_scores:

high

The default value of raw_scores in the classify method signature (line 224) is currently True.

Previously, because the pipeline applied softmax internally and the if raw_scores: block was a no-op, calling classify with default arguments returned softmax probabilities. With this change, calling classify with default arguments will now return raw logits, which is a breaking change for the default behavior of the API.

To preserve the original default behavior of returning softmax probabilities, please change the default value of raw_scores in the signature of classify (line 224) to False (matching the behavior of rerank).

Comment on lines +248 to +254
            # the model returns raw logits; convert them to probabilities
            for prediction in classifications:
                logits = np.array([label["score"] for label in prediction])
                exp = np.exp(logits - logits.max())
                probs = exp / exp.sum()
                for label, prob in zip(prediction, probs):
                    label["score"] = float(prob)

medium

Using NumPy array operations inside a loop on the main event loop introduces unnecessary overhead, especially since classification tasks typically have a very small number of classes (e.g., 2 to 10). We can optimize this by using pure Python with math.exp, which is significantly faster for small lists and avoids NumPy overhead.

Additionally, we should add defensive checks to handle cases where prediction might be empty to prevent potential runtime errors.

            import math
            # the model returns raw logits; convert them to probabilities
            for prediction in classifications:
                scores = [label["score"] for label in prediction]
                if not scores:
                    continue
                max_score = max(scores)
                exps = [math.exp(s - max_score) for s in scores]
                sum_exps = sum(exps)
                for label, exp_val in zip(prediction, exps):
                    label["score"] = exp_val / sum_exps if sum_exps > 0 else 0.0

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94af43e02f

Comment on lines +250 to +252
                logits = np.array([label["score"] for label in prediction])
                exp = np.exp(logits - logits.max())
                probs = exp / exp.sum()

P1 Preserve sigmoid probabilities for multi-label classifiers

For multi-label sequence classifiers such as the documented GoEmotions model, labels are independent and the Transformers pipeline previously normalized raw logits with sigmoid rather than across-label softmax. This new default raw_scores=false path always divides by the sum over every label, forcing each prediction's scores to sum to 1 and suppressing valid co-occurring labels whenever more than one class applies, so /classify now returns non-HF probabilities for those models.
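A small sketch makes the difference concrete. The label names and logit values below are invented for illustration and do not come from the GoEmotions model. The sketch compares per-label sigmoid with across-label softmax on a text where two labels apply equally strongly.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function for a single logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits: list[float]) -> list[float]:
    """Max-subtracted softmax across all labels of one prediction."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: two labels co-occur, one clearly does not apply.
logits = {"joy": 3.0, "excitement": 3.0, "anger": -4.0}

per_label = {name: sigmoid(x) for name, x in logits.items()}
across_labels = dict(zip(logits, softmax(list(logits.values()))))
```

With sigmoid, both co-occurring labels score above 0.95 independently. With softmax they have to share the probability mass, so each lands just under 0.5 even though the model is confident in both.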

greptile-apps Bot commented Jun 4, 2026

Greptile Summary

This PR fixes a long-standing bug where the /classify endpoint's raw_scores flag was a no-op, always returning softmax probabilities. Both classifier backends now pass function_to_apply="none" to get raw logits, and the BatchHandler applies a numerically-stable softmax conditionally when raw_scores=False.

  • torch.py / optimum.py: encode_core now passes function_to_apply="none" to the HF pipeline so all downstream post-processing happens in one place.
  • batch_handler.py: The inverted condition (if not raw_scores) now correctly applies a softmax stabilised by subtracting the maximum logit, mirroring the rerank path. However, the softmax is applied unconditionally for all model types. For multi-label classifiers the HF pipeline would normally apply sigmoid (independent per-label scores), not softmax (mutually-exclusive scores), so those models will silently return incorrect probabilities when raw_scores=False.

Confidence Score: 3/5

The flag inversion is correct for single-label classifiers, but multi-label models will silently receive softmax-transformed scores instead of the per-label sigmoid the HF pipeline would have applied.

The raw_scores flag fix works correctly for single-label text classifiers, which are the common case. However, by bypassing the pipeline's built-in activation-function selection with function_to_apply="none" and always applying softmax in the handler, any multi-label classification model (where problem_type == "multi_label_classification") will silently produce wrong probabilities — softmax forces mutually exclusive outputs, but multi-label tasks require independent sigmoid-scaled scores per class.

libs/infinity_emb/infinity_emb/inference/batch_handler.py — the softmax applied unconditionally needs to account for multi-label model types

Important Files Changed

| Filename | Overview |
| --- | --- |
| libs/infinity_emb/infinity_emb/inference/batch_handler.py | Fixes the classify raw_scores flag with numerically-stable softmax, but unconditionally applies softmax for all model types, breaking multi-label classifiers that require sigmoid instead. |
| libs/infinity_emb/infinity_emb/transformer/classifier/torch.py | Correctly passes function_to_apply="none" to the HF pipeline so raw logits are returned for post-processing; trailing newline removed. |
| libs/infinity_emb/infinity_emb/transformer/classifier/optimum.py | Correctly passes function_to_apply="none" to the ONNX pipeline so raw logits are returned; trailing newline removed. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server as infinity_server
    participant Engine as BatchHandler.classify
    participant Backend as SentenceClassifier / OptimumClassifier

    Client->>Server: "POST /classify {input, raw_scores}"
    Server->>Engine: classify(sentences, raw_scores)
    Engine->>Backend: _schedule(PredictSingle items)
    Backend->>Backend: "encode_core() pipeline(function_to_apply=none)"
    Note over Backend: Returns raw logits (no activation)
    Backend-->>Engine: classifications (raw logits)
    alt "raw_scores == False"
        Engine->>Engine: stable softmax over logits per prediction
        Note over Engine: softmax is wrong for multi-label models (should be sigmoid)
    end
    Engine-->>Server: classifications, usage
    Server-->>Client: JSON response

Reviews (1): Last reviewed commit: "fix(classify): honor raw_scores by retur..."

Comment on lines +247 to +254
        if not raw_scores:
            # the model returns raw logits; convert them to probabilities
            for prediction in classifications:
                logits = np.array([label["score"] for label in prediction])
                exp = np.exp(logits - logits.max())
                probs = exp / exp.sum()
                for label, prob in zip(prediction, probs):
                    label["score"] = float(prob)

P1 Softmax incorrectly applied to multi-label classifiers

The HF text-classification pipeline applies sigmoid (not softmax) for models with problem_type == "multi_label_classification" or num_labels == 1. By calling function_to_apply="none" in both classifier backends and then unconditionally applying softmax here, a multi-label classifier will produce wrong probabilities — labels are forced to be mutually exclusive (sum to 1) rather than independently scored per-class. The correct fix would check the model's config and apply sigmoid or softmax accordingly, mirroring the pipeline's own default logic.
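One way the config-based selection could look is sketched below, assuming the rule quoted in this comment. `pick_activation` and `apply_activation` are hypothetical helpers, not functions in transformers or infinity_emb, and the real pipeline may handle further cases (for example regression configs) that this sketch ignores.

```python
import math
from typing import List, Optional


def pick_activation(problem_type: Optional[str], num_labels: int) -> str:
    """Hypothetical helper: choose the activation from model-config fields."""
    if problem_type == "multi_label_classification" or num_labels == 1:
        return "sigmoid"  # independent per-label scores
    return "softmax"  # mutually exclusive classes


def apply_activation(logits: List[float], activation: str) -> List[float]:
    """Turn raw logits into probabilities with the chosen activation."""
    if activation == "sigmoid":
        out = []
        for x in logits:
            if x >= 0:
                out.append(1.0 / (1.0 + math.exp(-x)))
            else:
                e = math.exp(x)  # avoids overflow for very negative logits
                out.append(e / (1.0 + e))
        return out
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Same hypothetical logits, interpreted under two different model configs.
raw = [2.0, 2.0, -5.0]
multi = apply_activation(raw, pick_activation("multi_label_classification", 3))
single = apply_activation(raw, pick_activation("single_label_classification", 3))
```

The handler would need access to the model config for this to work, which is the extra coupling the comment implies.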

             return_length=False,
         ).encodings
-        return [len(t.tokens) for t in tks]
+        return [len(t.tokens) for t in tks]
\ No newline at end of file

P2 Missing newline at end of file. Both changed classifier files (torch.py and optimum.py) lost their trailing newline, which will cause diff noise in future patches and may break some POSIX-compliant tooling.

Suggested change
return [len(t.tokens) for t in tks]
return [len(t.tokens) for t in tks]

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Anai-Guo commented Jun 7, 2026

Good catch on the multi-label case. You're right: HF's text-classification pipeline picks function_to_apply per model — softmax for single-label (problem_type == "single_label_classification" / num_labels > 1) but sigmoid for multi-label (problem_type == "multi_label_classification" / num_labels == 1). My current patch forces function_to_apply="none" in both backends and then unconditionally re-applies softmax in BatchHandler.classify, so a multi-label model would now return softmax instead of sigmoid when raw_scores=False — a regression for that model class. The single-label path (the common /classify use case) is unaffected.

The clean fix is to stop second-guessing the pipeline: thread raw_scores down to the backend and set function_to_apply="none" only when raw logits are requested, otherwise leave it unset so the pipeline applies its own (correct) default. That lets me drop the manual softmax in classify() entirely and handles both model types correctly.
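A rough sketch of that direction, assuming the backend builds its pipeline call from keyword arguments. `build_pipeline_kwargs` is a hypothetical helper name and not existing infinity_emb code; only `top_k=None` and `function_to_apply="none"` are taken from the PR description.

```python
def build_pipeline_kwargs(raw_scores: bool) -> dict:
    """Hypothetical helper: keyword arguments for the text-classification pipeline call."""
    kwargs = {"top_k": None}  # return a score for every label, as the existing call does
    if raw_scores:
        # Override the activation only when the caller explicitly asks for logits.
        kwargs["function_to_apply"] = "none"
    return kwargs


# Assumed usage inside a backend's encode_core:
#     self._pipe(sentences, **build_pipeline_kwargs(raw_scores))
logit_kwargs = build_pipeline_kwargs(raw_scores=True)
default_kwargs = build_pipeline_kwargs(raw_scores=False)
```

Leaving `function_to_apply` out of the default call lets the pipeline keep choosing sigmoid or softmax from the model config, so the handler no longer needs its own activation code.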

@michaelfeil happy to push that version if you'd prefer it over the current approach — just want to confirm the threading-through-the-scheduler direction is acceptable before reworking it.
