fix: Muon momentum, harden streaming by pszemraj · Pull Request #25 · pszemraj/NeoBERT

pszemraj · 2026-04-09T14:43:17Z

The distributed-support work broke Muon's optimizer semantics without it being obvious - the FSDP2 owner-compute plumbing was fine, but momentum behavior drifted. This PR fixes that, makes streaming pretraining less fragile against transient HF Hub failures, and does some overdue housekeeping.

Changes

Muon optimizer

Restored standard Nesterov momentum behavior.
Split fused qkv.weight Q/K/V before orthogonalization, then repack into the fused layout.
Cleaned up defaults/naming: param_policy: hidden_2d, norm_factor: neobert / norm_factor: muon_reference.
Kept the FSDP2 owner-compute path (it was never the problem) and added tests for it.
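
The two Muon fixes above can be pictured with a small stdlib-only sketch. It is not the code from this PR: the SGD-style Nesterov form, the row-stacked `[3*h, h]` Q/K/V layout, and the classic cubic Newton-Schulz iteration (a stand-in for Muon's tuned polynomial) are all assumptions, and every function name is hypothetical. The real fused layout in the model may differ.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a tiny demo."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def nesterov_step(grad: Matrix, buf: Matrix, momentum: float = 0.95) -> Tuple[Matrix, Matrix]:
    """Assumed SGD-style Nesterov: buf <- mu*buf + g, then update <- g + mu*buf."""
    new_buf = [[momentum * b + g for b, g in zip(br, gr)] for br, gr in zip(buf, grad)]
    update = [[g + momentum * b for g, b in zip(gr, br)] for gr, br in zip(grad, new_buf)]
    return update, new_buf


def orthogonalize(x: Matrix, steps: int = 12) -> Matrix:
    """Cubic Newton-Schulz iteration toward the nearest semi-orthogonal matrix."""
    # Normalize by the Frobenius norm so all singular values start in (0, 1].
    norm = sum(v * v for row in x for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


def orthogonalize_fused_qkv(
    update: Matrix, orth: Callable[[Matrix], Matrix] = orthogonalize
) -> Matrix:
    """Split a row-stacked [3*h, h] update into Q/K/V, orthogonalize each, repack."""
    if len(update) % 3 != 0:
        raise ValueError("fused qkv update must have a row count divisible by 3")
    h = len(update) // 3
    blocks = [update[i * h:(i + 1) * h] for i in range(3)]
    # Repack in the same Q, K, V order so the fused layout is unchanged.
    return [row for block in blocks for row in orth(block)]
```

Orthogonalizing the fused matrix as a single block would couple the three projections; splitting first treats each projection as its own hidden weight, which is the behavior the list above describes.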

Streaming robustness

Added retry with backoff for transient HF Hub read failures.
Added support for resuming from the last yielded example when the HF iterable dataset exposes state restore.
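
A minimal sketch of the retry-and-resume idea, under stated assumptions: `RetryingStream` and `FlakySource` are hypothetical names, the exception types treated as transient are a guess, and the wrapper only relies on the source offering `state_dict()` / `load_state_dict()` (the state-restore interface that Hugging Face iterable datasets expose). When no state restore is available this sketch simply re-raises; the PR does not say what the real wrapper does in that case.

```python
import time
from typing import Any, Callable, Iterator, List, Optional, Tuple, Type


class RetryingStream:
    """Wrap a resumable iterable and retry transient read failures with backoff."""

    def __init__(
        self,
        source: Any,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 30.0,
        transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.source = source
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.transient = transient
        self.sleep = sleep

    def _snapshot(self) -> Optional[dict]:
        getter = getattr(self.source, "state_dict", None)
        return getter() if callable(getter) else None

    def __iter__(self) -> Iterator[Any]:
        failures = 0
        state = self._snapshot()  # cursor to rewind to if the next read fails
        iterator = iter(self.source)
        while True:
            try:
                item = next(iterator)
            except StopIteration:
                return
            except self.transient:
                failures += 1
                if state is None or failures > self.max_retries:
                    raise  # cannot resume safely, or retry budget exhausted
                self.sleep(min(self.max_delay, self.base_delay * 2 ** (failures - 1)))
                self.source.load_state_dict(state)
                iterator = iter(self.source)
                continue
            failures = 0  # a successful read restores the full retry budget
            state = self._snapshot()
            yield item


class FlakySource:
    """Hypothetical stand-in for a streaming dataset that fails once at given positions."""

    def __init__(self, items: List[Any], fail_at: List[int]) -> None:
        self.items, self.pos, self.fail_at = items, 0, set(fail_at)

    def state_dict(self) -> dict:
        return {"pos": self.pos}

    def load_state_dict(self, state: dict) -> None:
        self.pos = state["pos"]

    def __iter__(self) -> Iterator[Any]:
        while self.pos < len(self.items):
            if self.pos in self.fail_at:
                self.fail_at.discard(self.pos)
                raise ConnectionError("simulated hub read failure")
            item = self.items[self.pos]
            self.pos += 1
            yield item
```

Resetting `failures` after each successful read means isolated failures at different positions each get the full budget, so even `max_retries=1` survives two separate hiccups.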

Repo cleanup

Moved GLUE validation logic into src/neobert/glue/ and the HF classifier adapter into src/neobert/huggingface/.
Split classifier/wrapper code out of model/model.py.
Consolidated duplicated checkpoint-loading logic and tokenizer test helpers.
Reorganized docs into docs/guides/ + docs/reference/, added a training optimization guide.
Test runs now fail on warnings by default.
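
For readers unfamiliar with the fail-on-warnings pattern, here is a small illustration using only the standard `warnings` module. How this repo wires it up is not shown in the PR text (pytest's `filterwarnings = error` setting is one common route, but that is an assumption); `run_strict` and `noisy_divide` are hypothetical helpers.

```python
import warnings
from typing import Any, Callable


def run_strict(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn with every warning escalated to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # any warning raised inside now throws
        return fn(*args, **kwargs)


def noisy_divide(a: float, b: float) -> float:
    """Hypothetical helper that warns instead of failing when b == 0."""
    if b == 0:
        warnings.warn("division by zero, returning inf", RuntimeWarning)
        return float("inf")
    return a / b
```

Under `run_strict`, `noisy_divide(1.0, 0.0)` raises `RuntimeWarning` instead of quietly returning `inf`, which is the kind of silent degradation a strict test run is meant to surface.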

Still to test

Still need real 2-rank FSDP2 manual tests on multi-GPU - the single-node mocks pass but that's not a substitute.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb0ccf6f9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Remove redundant param_policy normalization/validation in _build_param_groups that duplicated MuonClipConfig.__post_init__
- Reset retry counter after each successful yield so isolated transient failures at different positions each get the full retry budget
- Remove misleading __len__ on RetryingStreamingDataset that created false "has length" signal for DataLoader
- Add test verifying retry budget resets between successful yields

The eval path used _move_batch_to_device with non_blocking=True but never re-pinned batches, making transfers effectively blocking when disable_dispatch was true. Rather than patching the eval path, enable loader-side pinning unconditionally so every consumer (train loop, eval loop) gets page-locked batches. The training path's explicit _pin_cpu_tensors call remains necessary for gradient-accumulated batches (torch.cat produces unpinned tensors) and is a no-op otherwise.

pszemraj · 2026-04-09T21:12:12Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1a251a4fe

pszemraj · 2026-04-10T00:35:15Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d458f47cf4

pszemraj · 2026-04-10T05:08:51Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f66e1065d

pszemraj · 2026-04-10T17:33:55Z

@codex review

chatgpt-codex-connector · 2026-04-10T17:45:20Z

Codex Review: Didn't find any major issues. Breezy!

User added 24 commits March 27, 2026 03:57

fix(muon): restore baseline-compatible defaults

13609eb

fix(muon): default to transformer-only routing

7ec513b

docs(muon): clarify default routing rationale

097d64c

refactor(checkpointing): consolidate step checkpoint loading

24f06b6

refactor(glue): localize glue validation

5b087c2

refactor(model): split sequence classifiers into module

81ed554

refactor(eval): reuse shared hf rotary helper

57d2431

refactor(model): trim classifier module boilerplate

7f326da

docs: deduplicate guides and remove stale dev notes

ad0b3a2

refactor(tests): deduplicate tokenizer builders

7695ee7

refactor(huggingface): move classifier adapter into hf package

117c1b0

refactor(model): split wrapper heads from backbone

52d97d8

fix(warnings): fail on runtime warnings

d465c46

fix(streaming): retry transient hub read failures

5735226

refactor(muon): align naming with hidden-weight semantics

b229093

fix(muon): restore standard nesterov momentum

a979727

fix(muon): split fused qkv updates by projection

08e668a

fix(muon): restore true original scaling semantics

f6898f7

refactor(muon): clarify default scaling terminology

05af62c

docs: reorganize guides and add training optimization guide

4363a42

docs: remove hard-wrapped markdown prose

9fc22d0

docs: remove stale fork wording from readme

07b50d1

fix: preserve streaming retry setup and cuda pinning

cb677d9

fix: restore cuda dataloader pinning

bb0ccf6

pszemraj added the "bug" label (Something isn't working) Apr 9, 2026

pszemraj self-assigned this Apr 9, 2026

chatgpt-codex-connector Bot reviewed Apr 9, 2026

Comment thread src/neobert/pretraining/trainer.py Outdated

User added 3 commits April 9, 2026 14:48

fix: preserve streaming resume state and MTEB pinning

393bbab

fix: preserve streaming peeks and direct-step latest loads

d1a251a

chatgpt-codex-connector Bot reviewed Apr 9, 2026

Comment thread src/neobert/streaming.py Outdated

User added 2 commits April 9, 2026 17:22

fix: snapshot streaming cursor before retries

c25789a

fix: validate contrastive pretrained backbone init

d458f47

chatgpt-codex-connector Bot reviewed Apr 10, 2026

Comment thread src/neobert/checkpointing.py Outdated

User added 4 commits April 9, 2026 21:46

fix: restrict checkpoint tag path matching

9e3dcf8

fix: detect hf streaming datasets

af22fd0

fix: harden streaming dataset handling

fd74d59

fix: honor cpu accelerator requests

4f66e10

chatgpt-codex-connector Bot reviewed Apr 10, 2026

Comment thread src/neobert/contrastive/trainer.py

Comment thread src/neobert/checkpointing.py Outdated

User added 2 commits April 10, 2026 02:50

fix: harden resume and contrastive training semantics

f10a373

fix: sync checkpoint tokenizer length

fdb79f0

pszemraj requested a review from amazingvince April 13, 2026 15:59

pszemraj wants to merge 36 commits into main from fix/muon-correctness

1 participant
