BertScorer multilabel fine-tuning collapses on sparse data (no pos_weight/focal, no warmup, brittle default LR) #324

### Summary

Fine-tuning the `bert` scorer on sparse multilabel data collapses to the base rate (near-constant outputs) unless the learning rate falls inside a narrow band. Contributing factors: plain `BCEWithLogitsLoss` with **no `pos_weight`/focal option** (so the trivial "predict the ~4% base rate everywhere" solution is a low-loss plateau that training easily settles into), **no LR warmup** in the `TrainingArguments`, and a brittle preset default LR.

### Where

`autointent/modules/scoring/_bert.py` → `BertScorer._train()`:
- `TrainingArguments(...)` has no `warmup_ratio` / `warmup_steps`.
- Loss is the HF default for `problem_type="multi_label_classification"` (`BCEWithLogitsLoss`, no `pos_weight`).
- `_presets/transformers-no-hpo.yaml` ships `learning_rate: [7.0e-5]`; `transformers-light.yaml` searches `1e-5…1e-4`.

### Evidence (bert-base-uncased, GoEmotions 28-class, ~2.5k balanced rows, MPS)

Best macro-F1 at the optimal threshold, full epochs:

| LR | best macro-F1 |
|----|---------------|
| 1e-5 | collapse (≈0.03; BCE plateaus at the base-rate floor ~0.17, near-constant outputs) |
| 2e-5 | learns (≈0.22) |
| 2e-5 + warmup 0.1 | ≈0.18 (warmup alone didn't help here) |
| 3e-5 | collapse |

So the stable LR band is narrow: 2e-5 learns, while 1e-5 and 3e-5 both collapse. The shipped defaults (3e-5 / 7e-5) land in the collapse region on this task.
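As a sanity check on the "base-rate floor ~0.17" figure, here is a minimal sketch that computes the BCE of a constant predictor. It assumes every label shares the same positive rate (mean labels per example divided by the number of classes), which is a simplification: real per-label rates differ, and a per-label constant predictor would sit slightly lower. The helper name `base_rate_bce_floor` is hypothetical.

```python
import math


def base_rate_bce_floor(p: float) -> float:
    """BCE in nats of a predictor that outputs probability p for every label,
    when each label is positive with rate p (binary entropy of p)."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


num_classes = 28
mean_labels_per_example = 1.18  # figure quoted in the Environment section
base_rate = mean_labels_per_example / num_classes  # assumed uniform across labels

print(f"base rate ~ {base_rate:.3f}")                       # ~0.042
print(f"BCE floor ~ {base_rate_bce_floor(base_rate):.3f}")  # ~0.175
```

A training loss that stalls near this value is a strong hint that the model is emitting the base rate for every label rather than learning anything per-example.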

### Suggested fixes (in rough priority)

1. **Expose a class-imbalance loss option** (`pos_weight` or focal loss) via a `Trainer` subclass with a custom `compute_loss` — the principled fix for sparse multilabel; should widen the stable region substantially.
2. **Add `warmup_ratio`/`warmup_steps`** to `TrainingArguments` and expose it as a hyperparameter.
3. **More robust preset defaults** for multilabel (e.g. lr ≈ 2e-5 + warmup) so out-of-the-box runs don't silently collapse.
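Fix (1) could look roughly like the sketch below. It is not AutoIntent's code: `pos_weight_from_labels`, `WeightedBCELossMixin`, and the `max_weight` clamp of 50 are hypothetical names and values chosen for illustration. The mixin only overrides `compute_loss` with the same calling convention `transformers.Trainer` uses, so it can be tried without touching the rest of the training loop.

```python
import torch
from torch import nn


def pos_weight_from_labels(labels: torch.Tensor, max_weight: float = 50.0) -> torch.Tensor:
    """Per-class negatives/positives ratio from a (num_examples, num_classes)
    multi-hot matrix, clamped so very rare classes do not dominate the loss."""
    positives = labels.float().sum(dim=0)
    negatives = labels.shape[0] - positives
    # clamp(min=1.0) avoids division by zero for classes with no positives
    return (negatives / positives.clamp(min=1.0)).clamp(max=max_weight)


class WeightedBCELossMixin:
    """Hypothetical mixin: replaces the default multilabel loss with
    BCEWithLogitsLoss(pos_weight=...). Set `pos_weight` before training."""

    pos_weight: torch.Tensor

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that some Trainer versions pass
        inputs = dict(inputs)
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.BCEWithLogitsLoss(pos_weight=self.pos_weight.to(logits.device))
        loss = loss_fct(logits, labels.float())
        return (loss, outputs) if return_outputs else loss
```

To use it, one would declare something like `class WeightedTrainer(WeightedBCELossMixin, Trainer)` with `Trainer` imported from `transformers`, assign `trainer.pos_weight = pos_weight_from_labels(train_label_matrix)`, and train as before. Up-weighting positives makes the "predict the base rate everywhere" solution more expensive, which is the mechanism expected to widen the stable LR region; a focal loss would slot into the same override.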

LR sensitivity is partly inherent, but (1)–(2) make it far less knife-edged.
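For fix (2), it may help to see what a warmup ratio actually does to the learning rate. The sketch below reproduces, in plain Python, the linear warmup followed by linear decay that (to my understanding) `transformers` applies with its default `lr_scheduler_type="linear"` when `warmup_ratio` is set. The function name `linear_warmup_decay_lr` and the 1000-step run length are assumptions for illustration only.

```python
import math


def linear_warmup_decay_lr(step: int, total_steps: int, base_lr: float,
                           warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step`: ramps linearly from 0 to base_lr over the
    warmup steps, then decays linearly back to 0 at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        scale = step / max(1, warmup_steps)
    else:
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * scale


total_steps = 1000  # hypothetical run length
for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay_lr(step, total_steps, base_lr=2e-5)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With warmup, the first optimizer steps use a small fraction of the nominal LR, which protects the pretrained weights from large early updates. In the table above, warmup alone did not improve the 2e-5 run, so it is better seen as a complement to the loss change than as a replacement for it.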

### Environment

AutoIntent 0.3.1, MPS, `transformers`/`transformers-no-hpo` presets, GoEmotions multilabel (28 classes, mean ~1.18 labels/example).

