Summary
Fine-tuning the bert scorer on sparse multilabel data collapses to predicting the base rate (a degenerate, near-constant output) unless the learning rate is tuned narrowly. Contributing factors: plain BCEWithLogitsLoss with no pos_weight/focal option (so the trivial "predict the ~4% base rate everywhere" solution is an easy low-loss plateau), no LR warmup in the TrainingArguments, and a brittle preset default LR.
Where
- autointent/modules/scoring/_bert.py → BertScorer._train():
  - TrainingArguments(...) has no warmup_ratio / warmup_steps.
  - Loss is the HF default for problem_type="multi_label_classification" (BCEWithLogitsLoss, no pos_weight).
- _presets/transformers-no-hpo.yaml ships learning_rate: [7.0e-5]; transformers-light.yaml searches 1e-5…1e-4.
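To make concrete what the missing warmup would change, here is a minimal pure-Python sketch of a linear-warmup, linear-decay schedule. It assumes the Hugging Face default `lr_scheduler_type="linear"` and mirrors my understanding of how `warmup_ratio` is turned into a step count; `total_steps = 200` is a hypothetical value, not taken from the runs in this issue.

```python
import math


def linear_warmup_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """LR at `step` for linear warmup to `peak_lr`, then linear decay to zero."""
    # Warmup length is derived from the ratio, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 200  # hypothetical run length
peak_lr = 2e-5

print("step  no_warmup  warmup_0.1")
for step in (0, 10, 20, 100, 200):
    no_warmup = linear_warmup_lr(step, total_steps, peak_lr, 0.0)
    with_warmup = linear_warmup_lr(step, total_steps, peak_lr, 0.1)
    print(f"{step:>4}  {no_warmup:.2e}   {with_warmup:.2e}")
```

Without warmup the very first update already uses the peak LR, which is where a freshly initialized classification head is most likely to be pushed onto the base-rate plateau.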
Evidence (bert-base-uncased, GoEmotions 28-class, ~2.5k balanced rows, MPS)
Best macro-F1 at the optimal threshold, full epochs:
| LR | result |
|---|---|
| 1e-5 | collapse (≈0.03; BCE plateaus at the base-rate floor ~0.17, near-constant outputs) |
| 3e-5 | collapse |
| 2e-5 | learns (≈0.22) |
| 2e-5 + warmup 0.1 | ≈0.18 (warmup alone didn't help here) |
So the stable LR band is narrow and the shipped defaults (3e-5 / 7e-5) land in the collapse region on this task.
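As a sanity check on the ~0.17 plateau, the BCE of a model that outputs the base rate for every class can be computed directly. This sketch assumes a single uniform positive rate across all 28 classes (mean labels per example divided by class count). With uneven per-class rates the true floor would be somewhat lower, so treat the number as approximate.

```python
import math


def bce_at_constant_rate(p: float) -> float:
    """Mean binary cross-entropy when every output probability equals the positive rate p."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


num_classes = 28
mean_labels_per_example = 1.18  # figure quoted under Environment

# Fraction of (example, class) cells that are positive.
base_rate = mean_labels_per_example / num_classes
floor = bce_at_constant_rate(base_rate)

print(f"base rate ≈ {base_rate:.3f}, constant-output BCE ≈ {floor:.3f}")
```

This evaluates to a base rate of about 0.042 and a BCE of about 0.175, consistent with the plateau observed in the collapsed runs.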
Suggested fixes (in rough priority)
1. Expose a class-imbalance loss option (pos_weight or focal loss) via a Trainer subclass with a custom compute_loss. This is the principled fix for sparse multilabel and should widen the stable region substantially.
2. Add warmup_ratio/warmup_steps to TrainingArguments and expose it as a hyperparameter.
3. Make the preset defaults for multilabel more robust (e.g. lr ≈ 2e-5 + warmup) so out-of-the-box runs don't silently collapse.
LR sensitivity is partly inherent, but (1)–(2) make it far less knife-edged.
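A rough sketch of the pos_weight variant of the first fix, to show the shape of the change. `WeightedBCETrainer`, `pos_weight_from_labels` and the `max_weight` clamp are hypothetical names and choices, not existing AutoIntent or transformers API. The sketch assumes batches carry a `labels` key and that the model output exposes `.logits`, as HF sequence-classification models do. How `BertScorer._train()` would wire this in is left open.

```python
from typing import Optional

import torch
from transformers import Trainer


def pos_weight_from_labels(labels: torch.Tensor, max_weight: float = 50.0) -> torch.Tensor:
    """Per-class negatives/positives ratio from a (num_examples, num_classes) 0/1 matrix."""
    positives = labels.float().sum(dim=0)
    negatives = labels.shape[0] - positives
    # Clamp so rare or absent classes do not receive an unbounded weight.
    return (negatives / positives.clamp(min=1.0)).clamp(max=max_weight)


class WeightedBCETrainer(Trainer):
    """Hypothetical Trainer subclass that applies a per-class pos_weight to the BCE loss."""

    def __init__(self, *args, pos_weight: Optional[torch.Tensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.pos_weight = pos_weight

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass in.
        inputs = dict(inputs)  # copy so the caller's batch is left intact
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        pos_weight = None if self.pos_weight is None else self.pos_weight.to(logits.device)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            logits, labels.float(), pos_weight=pos_weight
        )
        return (loss, outputs) if return_outputs else loss
```

The weights would be computed once from the training label matrix and passed as `WeightedBCETrainer(..., pos_weight=weights)`. A focal-loss option could reuse the same `compute_loss` hook.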
Environment
AutoIntent 0.3.1, MPS, transformers/transformers-no-hpo presets, GoEmotions multilabel (28 classes, mean ~1.18 labels/example).