Summary
BertScorer's default early stopping uses scoring_f1 (F1 thresholded at 0.5) as metric_for_best_model. On sparse multilabel data, calibrated sigmoid probabilities sit well below 0.5, so this metric reads ≈0 from the very first eval. The EarlyStoppingCallback (patience 3) then fires almost immediately, and load_best_model_at_end=True restores a near-random early checkpoint. The fitted model outputs near-constant low probabilities, so after thresholding the pipeline predicts essentially all-positive (degenerate).
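To illustrate why a fixed 0.5 cut-off hides learning on sparse labels, here is a minimal synthetic sketch. The `micro_f1` helper and the probability values (0.30 for positives, 0.03 for negatives) are assumptions chosen for illustration. They are not AutoIntent's scoring_f1 implementation or real model outputs.

```python
import numpy as np


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a binary indicator matrix (illustrative stand-in)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


rng = np.random.default_rng(0)
n_samples, n_classes = 500, 28

# Sparse multilabel targets with a ~4% positive rate.
y_true = (rng.random((n_samples, n_classes)) < 0.04).astype(int)

# Hypothetical scores: positives are ranked above negatives,
# but every probability stays below 0.5.
probs = np.where(y_true == 1, 0.30, 0.03)

f1_at_half = micro_f1(y_true, (probs >= 0.5).astype(int))
f1_at_low = micro_f1(y_true, (probs >= 0.2).astype(int))

print(f"F1 at threshold 0.5: {f1_at_half:.3f}")
print(f"F1 at threshold 0.2: {f1_at_low:.3f}")
```

In this toy setup the ranking is perfect, yet the 0.5-thresholded metric cannot see it because no probability crosses the cut-off.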
Where
autointent/modules/scoring/_bert.py:
- Default EarlyStoppingConfig → {'val_fraction': 0.2, 'patience': 3, 'threshold': 0.0, 'metric': 'scoring_f1'}.
- _get_compute_metrics() computes scoring_f1 on the raw eval predictions (0.5-thresholded for multilabel).
- _train() sets metric_for_best_model=self.early_stopping_config.metric and load_best_model_at_end=(metric is not None), plus an EarlyStoppingCallback.
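The interaction between a flat metric, patience 3 and best-checkpoint restoration can be sketched with a simplified simulation. This is not the transformers EarlyStoppingCallback code. `simulate_early_stopping` is a hypothetical helper that only mimics the general patience/threshold idea for a higher-is-better metric.

```python
def simulate_early_stopping(metric_history, patience=3, threshold=0.0):
    """Return (stop_eval_index, best_eval_index) for a higher-is-better metric.

    Simplified model: an eval counts as an improvement only if it beats the
    best value so far by more than `threshold`; training stops once `patience`
    consecutive evals fail to improve.
    """
    best_value = None
    best_index = None
    bad_evals = 0
    for i, value in enumerate(metric_history):
        if best_value is None or value - best_value > threshold:
            best_value, best_index = value, i
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index
    return len(metric_history) - 1, best_index


# A metric stuck at ~0.034 (as in the report) never "improves".
flat_stop, flat_best = simulate_early_stopping([0.034] * 10)
print(flat_stop, flat_best)

# A metric that keeps improving is never stopped early.
good_stop, good_best = simulate_early_stopping([0.1, 0.2, 0.3, 0.4])
print(good_stop, good_best)
```

Under these assumptions a flat metric stops training after the fourth evaluation and picks the very first checkpoint as "best", which matches the near-random restored model described above.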
Evidence
Fine-tuning bert-base-uncased on GoEmotions (28 classes, ~4% positive rate, ~2.5k balanced rows):
- With the default early stopping, eval_scoring_f1 is stuck at ~0.034 every epoch and the pipeline scores macro decision_f1 ≈ 0.107 (degenerate all-positive; recall→1.0, precision→base rate).
- With early stopping disabled (train full epochs) at the same LR, the same model learns and reaches macro decision_f1 ≈ 0.17–0.41 depending on data size.
So the model is trainable; the threshold-0.5 early-stopping metric is what breaks it.
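As a sanity check on the "recall→1.0, precision→base rate" reading, the F1 of an all-positive predictor for a single class follows directly from its positive rate. `all_positive_f1` below is a hypothetical helper for this arithmetic only. The macro value actually observed depends on the per-class positive rates in the evaluation split, which this sketch does not model.

```python
def all_positive_f1(positive_rate):
    """F1 for one class when every sample is predicted positive."""
    precision = positive_rate  # only the true positives are correct
    recall = 1.0               # no positive is ever missed
    return 2 * precision * recall / (precision + recall)


f1_at_4_percent = all_positive_f1(0.04)
print(f"{f1_at_4_percent:.4f}")
```

At a 4% positive rate this gives roughly 0.077, the same order of magnitude as the degenerate score reported above.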
Suggested fix
- Use a signal for metric_for_best_model on multilabel that does not depend on a fixed 0.5 threshold: eval_loss, a ranking metric (e.g. neg_coverage / MAP), or F1 computed at a tuned/optimal threshold; or
- disable early stopping by default for multilabel tasks.
Any of these would prevent the immediate-stop collapse while keeping early stopping useful for multiclass.
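One of these options, F1 at a tuned threshold, could look roughly like the sketch below. The names `best_threshold_f1`, `f1_best_threshold` and the threshold grid are assumptions for illustration and do not exist in AutoIntent. The synthetic logits are made up as well.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a binary indicator matrix."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold_f1(logits, labels, thresholds=None):
    """Scan a grid of thresholds and report the best micro F1 found."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    probs = sigmoid(logits)
    scores = [micro_f1(labels, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return {
        "f1_best_threshold": scores[best],
        "best_threshold": float(thresholds[best]),
    }


rng = np.random.default_rng(0)
labels = (rng.random((500, 28)) < 0.04).astype(int)
# Hypothetical logits: positives score higher, but sigmoid stays below 0.5.
logits = np.where(labels == 1, -1.0, -4.0)

result = best_threshold_f1(logits, labels)
print(result)
```

Tuning the threshold on the same split that drives early stopping makes the number optimistic, but it is only used to rank checkpoints here. The simpler alternative is eval_loss: transformers' TrainingArguments accepts metric_for_best_model="eval_loss" together with greater_is_better=False, though whether BertScorer currently exposes that choice is something I have not checked.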
Environment
AutoIntent 0.3.1, MPS, transformers-no-hpo preset, GoEmotions multilabel.