BertScorer early-stopping default (scoring_f1@0.5) collapses on sparse multilabel #323

Description

@voorhs

Summary

BertScorer's default early stopping uses scoring_f1 (F1 thresholded at 0.5) as metric_for_best_model. On sparse multilabel data, calibrated sigmoid probabilities sit well below 0.5, so this metric stays near 0 from the very first eval, the EarlyStoppingCallback (patience 3) fires almost immediately, and load_best_model_at_end=True restores a near-random early checkpoint. The fitted model then outputs near-constant low probabilities, and after the pipeline's own thresholding it predicts essentially all-positive labels (a degenerate solution).
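To make the failure mode concrete, here is a minimal sketch on toy data (not AutoIntent code). It assumes a micro-averaged F1; the averaging actually used by scoring_f1 is not stated in this issue, and `f1_at_threshold` is a hypothetical helper. The point is only that probabilities which rank the true label first can still score zero once a fixed 0.5 cutoff is applied.

```python
def f1_at_threshold(probs, labels, threshold=0.5):
    """Micro-averaged F1 over all (sample, class) cells at a fixed cutoff.

    Hypothetical helper for illustration; not the AutoIntent implementation.
    """
    tp = fp = fn = 0
    for row_p, row_y in zip(probs, labels):
        for p, y in zip(row_p, row_y):
            predicted = p >= threshold
            if predicted and y:
                tp += 1
            elif predicted and not y:
                fp += 1
            elif not predicted and y:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy sparse multilabel batch: 4 samples, 5 classes, one positive per sample.
labels = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
# The true class always gets the highest probability, but never reaches 0.5.
probs = [
    [0.30, 0.05, 0.04, 0.03, 0.02],
    [0.04, 0.25, 0.05, 0.03, 0.02],
    [0.03, 0.04, 0.35, 0.05, 0.02],
    [0.02, 0.03, 0.04, 0.05, 0.28],
]

f1_fixed = f1_at_threshold(probs, labels, threshold=0.5)  # nothing is predicted
f1_low = f1_at_threshold(probs, labels, threshold=0.2)    # positives recovered
print(f1_fixed, f1_low)
```

With the 0.5 cutoff no cell is predicted positive, so the metric carries no information about how well the model ranks labels.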

Where

autointent/modules/scoring/_bert.py:

  • Default EarlyStoppingConfig: {'val_fraction': 0.2, 'patience': 3, 'threshold': 0.0, 'metric': 'scoring_f1'}.
  • _get_compute_metrics() computes scoring_f1 on the raw eval predictions (0.5-thresholded for multilabel).
  • _train() sets metric_for_best_model=self.early_stopping_config.metric and load_best_model_at_end=(metric is not None), plus an EarlyStoppingCallback.
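The interaction between a flat metric, patience 3, and best-checkpoint restore can be sketched with a simplified simulation. This is an assumption-laden model of patience-based early stopping in general, not the exact logic of transformers' EarlyStoppingCallback or of _train(); `simulate_early_stopping` and the metric values are hypothetical.

```python
def simulate_early_stopping(metric_history, patience=3, threshold=0.0):
    """Return (stop_eval, best_eval), both 1-indexed, for a higher-is-better metric.

    Simplified model: an eval counts as an improvement only if it beats the
    best value so far by more than `threshold`; training stops after
    `patience` consecutive evals without improvement.
    """
    best_value = None
    best_eval = None
    stale = 0
    for step, value in enumerate(metric_history, start=1):
        if best_value is None or value - best_value > threshold:
            best_value, best_eval, stale = value, step, 0
        else:
            stale += 1
        if stale >= patience:
            return step, best_eval
    return len(metric_history), best_eval


# Metric stuck at the same value every epoch, as reported for eval_scoring_f1.
flat = [0.034] * 10
stop_flat, best_flat = simulate_early_stopping(flat)

# A hypothetical signal that keeps improving (for example a negated loss).
improving = [-0.60, -0.45, -0.38, -0.33, -0.30, -0.29]
stop_impr, best_impr = simulate_early_stopping(improving)

print(stop_flat, best_flat)  # stops at eval 4, "best" checkpoint is eval 1
print(stop_impr, best_impr)  # runs all 6 evals, best checkpoint is the last
```

Under this model a flat metric stops training after the fourth eval and restores the checkpoint from the first one, which matches the "near-random early checkpoint" described above.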

Evidence

Fine-tuning bert-base-uncased on GoEmotions (28 classes, ~4% positive rate, ~2.5k balanced rows):

  • With the default early stopping, eval_scoring_f1 is stuck at ~0.034 every epoch and the pipeline scores macro decision_f1 of 0.107 (degenerate all-positive; recall→1.0, precision→base rate).
  • With early stopping disabled (train full epochs) at the same LR, the same model learns and reaches macro decision_f1 of 0.17–0.41 depending on data size.

So the model is trainable; the threshold-0.5 early-stopping metric is what breaks it.

Suggested fix

Use a threshold-free signal for metric_for_best_model on multilabel:

  • eval_loss, or a ranking metric (e.g. neg_coverage / MAP), or compute F1 at a tuned/optimal threshold rather than a fixed 0.5; or
  • disable early stopping by default for multilabel tasks.

Any of these would prevent the immediate-stop collapse while keeping early stopping useful for multiclass.
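As one possible shape for the ranking-metric option, the sketch below computes a label ranking average precision in plain Python and wraps it in a function that takes a (logits, labels) pair, which is an assumed calling convention. The names `lrap`, `compute_ranking_metrics`, and the "lrap" key are hypothetical and not part of AutoIntent. Ranking is done directly on logits, since the sigmoid is monotonic and does not change label order.

```python
def lrap(scores, labels):
    """Label ranking average precision, threshold-free and higher-is-better.

    For each true label, take the fraction of labels ranked at or above it
    that are also true, then average per sample and across samples.
    Samples without any positive label are skipped (a simplifying choice).
    """
    per_sample = []
    for row_s, row_y in zip(scores, labels):
        positives = [j for j, y in enumerate(row_y) if y]
        if not positives:
            continue
        total = 0.0
        for j in positives:
            at_or_above = [k for k, s in enumerate(row_s) if s >= row_s[j]]
            true_at_or_above = sum(1 for k in at_or_above if row_y[k])
            total += true_at_or_above / len(at_or_above)
        per_sample.append(total / len(positives))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


def compute_ranking_metrics(eval_pred):
    """Assumed signature: eval_pred is a (logits, labels) pair of nested lists."""
    logits, labels = eval_pred
    return {"lrap": lrap(logits, labels)}


labels = [[1, 0, 0, 0], [0, 1, 0, 1]]
# All logits are negative, so every sigmoid probability is below 0.5 and a
# fixed 0.5 cutoff would predict nothing in either case.
good_logits = [[-1.0, -3.0, -4.0, -5.0], [-4.0, -1.5, -3.0, -2.0]]
bad_logits = [[-5.0, -1.0, -2.0, -3.0], [-1.0, -4.0, -2.0, -5.0]]

good = compute_ranking_metrics((good_logits, labels))["lrap"]
bad = compute_ranking_metrics((bad_logits, labels))["lrap"]
print(good, bad)
```

Unlike F1 at 0.5, this separates a model that ranks true labels first from one that does not, even when no probability crosses 0.5. If eval_loss is used instead, the best-model selection has to treat lower values as better.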

Environment

AutoIntent 0.3.1, MPS, transformers-no-hpo preset, GoEmotions multilabel.
