BertScorer early-stopping default (scoring_f1@0.5) collapses on sparse multilabel

### Summary

`BertScorer`'s default early stopping uses `scoring_f1` (F1 thresholded at 0.5) as `metric_for_best_model`. On **sparse multilabel** data, calibrated sigmoid probabilities sit well below 0.5, so this metric reads ≈0 from the very first eval and does not improve. The `EarlyStoppingCallback` (patience 3) then fires almost immediately, and `load_best_model_at_end=True` restores a near-random early checkpoint. The fitted model outputs near-constant low probabilities, so after thresholding the pipeline predicts essentially all-positive (a degenerate result).
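To illustrate the mechanism, here is a minimal pure-Python sketch with made-up toy probabilities (not GoEmotions data). `f1_at_threshold` is a hypothetical helper that computes micro-averaged F1; it is not AutoIntent's `scoring_f1` implementation, only an approximation of the same idea. The ranking of labels is perfect in this toy data, yet F1 at 0.5 is zero because no probability crosses the threshold.

```python
def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1 after binarizing probabilities at `threshold`.

    Hypothetical helper for illustration only; not AutoIntent's scoring_f1.
    """
    tp = fp = fn = 0
    for row_true, row_prob in zip(y_true, y_prob):
        for label, prob in zip(row_true, row_prob):
            predicted = prob >= threshold
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif not predicted and label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy sparse multilabel batch: one positive per row, all probabilities < 0.5,
# but the positive label always has the highest probability in its row.
y_true = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
y_prob = [
    [0.30, 0.05, 0.04, 0.03],
    [0.02, 0.06, 0.28, 0.04],
    [0.05, 0.35, 0.03, 0.02],
]

f1_default = f1_at_threshold(y_true, y_prob, threshold=0.5)
f1_lowered = f1_at_threshold(y_true, y_prob, threshold=0.2)
print(f1_default, f1_lowered)
```

In this toy case the fixed 0.5 threshold reports no signal, while a lower threshold on the same predictions separates positives from negatives cleanly.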

### Where

`autointent/modules/scoring/_bert.py`:
- Default `EarlyStoppingConfig` → `{'val_fraction': 0.2, 'patience': 3, 'threshold': 0.0, 'metric': 'scoring_f1'}`.
- `_get_compute_metrics()` computes `scoring_f1` on the raw eval predictions (0.5-thresholded for multilabel).
- `_train()` sets `metric_for_best_model=self.early_stopping_config.metric` and `load_best_model_at_end=(metric is not None)`, plus an `EarlyStoppingCallback`.
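The interaction between a flat metric, patience, and best-checkpoint restoration can be sketched with a simplified model of patience-based early stopping. This is an assumption-laden simulation, not the `transformers` `EarlyStoppingCallback` code: `simulate_early_stopping` is a hypothetical function, and it treats "improvement" as exceeding the best value by more than `threshold`.

```python
def simulate_early_stopping(metric_history, patience=3, threshold=0.0):
    """Return (eval at which training stops, eval whose checkpoint is 'best').

    Simplified model: an eval counts as an improvement only if it beats the
    best value so far by more than `threshold`; training stops after
    `patience` consecutive evals without improvement.
    """
    best_value = None
    best_eval = None
    stale = 0
    for eval_idx, value in enumerate(metric_history, start=1):
        if best_value is None or value - best_value > threshold:
            best_value, best_eval, stale = value, eval_idx, 0
        else:
            stale += 1
            if stale >= patience:
                return eval_idx, best_eval
    return len(metric_history), best_eval


# A metric stuck at ~0.034 for every eval, as reported in this issue.
flat_history = [0.034] * 10
stopped_at, restored_eval = simulate_early_stopping(flat_history)
print(stopped_at, restored_eval)
```

Under these assumptions, a metric that never moves stops training at the fourth eval and marks the first eval as best, which matches the "near-random early checkpoint" behaviour described above.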

### Evidence

Fine-tuning `bert-base-uncased` on GoEmotions (28 classes, ~4% positive rate, ~2.5k balanced rows):
- With the default early stopping, `eval_scoring_f1` is stuck at ~0.034 every epoch and the pipeline scores macro `decision_f1` ≈ **0.107** (degenerate all-positive; recall→1.0, precision→base rate).
- With early stopping **disabled** (train full epochs) at the same LR, the same model learns and reaches macro `decision_f1` ≈ **0.17–0.41** depending on data size.

So the model is trainable; the threshold-0.5 early-stopping metric is what breaks it.

### Suggested fix

Use a threshold-free signal for `metric_for_best_model` on multilabel:
- `eval_loss`;
- a ranking metric (e.g. `neg_coverage` / MAP); or
- F1 computed at a tuned/optimal threshold rather than a fixed 0.5.

Alternatively, disable early stopping by default for multilabel tasks.

Any of these would prevent the immediate-stop collapse while keeping early stopping useful for multiclass.
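As one possible shape for a ranking-based signal, the sketch below computes a per-row mean average precision in pure Python. `mean_average_precision` is a hypothetical helper, not an existing AutoIntent metric, and skipping rows without positive labels is an assumption made here for simplicity. The point is that the score depends only on how labels are ordered within a row, so uniformly low probabilities do not zero it out.

```python
def mean_average_precision(y_true, y_prob):
    """Mean over rows of the average precision of each row's label ranking.

    Hypothetical helper for illustration. Rows with no positive labels are
    skipped (an assumption; other conventions exist).
    """
    ap_values = []
    for row_true, row_prob in zip(y_true, y_prob):
        n_pos = sum(row_true)
        if n_pos == 0:
            continue
        # Rank label indices from highest to lowest predicted probability.
        order = sorted(range(len(row_prob)), key=lambda i: row_prob[i], reverse=True)
        hits = 0
        precision_sum = 0.0
        for rank, idx in enumerate(order, start=1):
            if row_true[idx]:
                hits += 1
                precision_sum += hits / rank
        ap_values.append(precision_sum / n_pos)
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# Same kind of toy data as a sparse multilabel eval: all probabilities < 0.5.
y_true = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
well_ranked = [
    [0.30, 0.05, 0.04, 0.03],
    [0.02, 0.06, 0.28, 0.04],
    [0.05, 0.35, 0.03, 0.02],
]
badly_ranked = [[0.01, 0.30, 0.20, 0.10]]  # the positive label is ranked last

map_good = mean_average_precision(y_true, well_ranked)
map_bad = mean_average_precision([[1, 0, 0, 0]], badly_ranked)
print(map_good, map_bad)
```

A metric like this would still distinguish a model that ranks labels well from one that does not, even when every probability is below 0.5, which is what early stopping needs in order to pick a meaningful best checkpoint.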

### Environment

AutoIntent 0.3.1, MPS, `transformers-no-hpo` preset, GoEmotions multilabel.

