fix: Muon momentum, harden streaming #25
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb0ccf6f9c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Remove redundant param_policy normalization/validation in _build_param_groups that duplicated MuonClipConfig.__post_init__
- Reset retry counter after each successful yield so isolated transient failures at different positions each get the full retry budget
- Remove misleading __len__ on RetryingStreamingDataset that created false "has length" signal for DataLoader
- Add test verifying retry budget resets between successful yields
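The retry-budget change can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: `RetryingStream`, `make_iter`, `max_retries`, and `backoff_s` are hypothetical names, and `ConnectionError` stands in for whatever transient HF Hub error the real dataset catches.

```python
import time
from typing import Callable, Iterator


class RetryingStream:
    """Hypothetical stand-in for a retrying streaming dataset.

    Deliberately defines no __len__: the stream has no known length, and
    exposing one would send a false "has length" signal to a DataLoader.
    """

    def __init__(self, make_iter: Callable[[int], Iterator],
                 max_retries: int = 3, backoff_s: float = 0.0):
        self.make_iter = make_iter      # make_iter(n) resumes after n items
        self.max_retries = max_retries
        self.backoff_s = backoff_s

    def __iter__(self):
        position = 0   # items successfully yielded so far
        failures = 0   # consecutive failures since the last successful yield
        while True:
            try:
                for item in self.make_iter(position):
                    yield item
                    position += 1
                    failures = 0        # success restores the full budget
                return
            except ConnectionError:
                failures += 1
                if failures > self.max_retries:
                    raise               # budget exhausted at this position
                time.sleep(self.backoff_s)


# Usage: a source that fails once at item 1 and once at item 3.
pending_failures = {1, 3}


def flaky_source(start: int):
    for i in range(start, 5):
        if i in pending_failures:
            pending_failures.discard(i)
            raise ConnectionError("transient failure")
        yield i


items = list(RetryingStream(flaky_source, max_retries=1))
```

With `max_retries=1`, the two isolated failures are both survivable only because the counter resets after each successful yield; a counter that accumulated across the whole stream would give up at the second failure.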
The eval path used _move_batch_to_device with non_blocking=True but never re-pinned batches, making transfers effectively blocking when disable_dispatch was true. Rather than patching the eval path, enable loader-side pinning unconditionally so every consumer (train loop, eval loop) gets page-locked batches. The training path's explicit _pin_cpu_tensors call remains necessary for gradient-accumulated batches (torch.cat produces unpinned tensors) and is a no-op otherwise.
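The pinning behavior described in this commit message can be illustrated with a small sketch. It assumes PyTorch; `pin_cpu_tensors` and `move_batch_to_device` below are hypothetical stand-ins written for illustration, not the repository's `_pin_cpu_tensors` and `_move_batch_to_device`, and the toy dataset is invented. The commit enables loader-side pinning unconditionally; the sketch gates it on CUDA availability only so that it also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

CUDA = torch.cuda.is_available()


def pin_cpu_tensors(batch):
    """Page-lock CPU tensors; already-pinned tensors pass through untouched."""
    if not CUDA:  # pinning needs an accelerator runtime; skip on CPU-only hosts
        return batch
    return {
        k: v.pin_memory()
        if torch.is_tensor(v) and not v.is_cuda and not v.is_pinned()
        else v
        for k, v in batch.items()
    }


def move_batch_to_device(batch, device):
    # non_blocking=True can only overlap the copy with compute when the
    # source tensor lives in pinned (page-locked) host memory.
    return {k: v.to(device, non_blocking=True) for k, v in batch.items()}


# Loader-side pinning: train and eval consumers both get page-locked batches.
dataset = TensorDataset(torch.arange(32).reshape(8, 4))
loader = DataLoader(dataset, batch_size=2, pin_memory=CUDA)
micro_batches = [{"input_ids": x} for (x,) in loader]

# torch.cat allocates a fresh, unpinned tensor, so a gradient-accumulated
# batch has to be pinned again before an asynchronous transfer.
merged = {"input_ids": torch.cat([b["input_ids"] for b in micro_batches[:2]])}
merged = pin_cpu_tensors(merged)

device = torch.device("cuda" if CUDA else "cpu")
on_device = move_batch_to_device(merged, device)
```

For batches that come straight from the loader, the explicit pinning call finds tensors that are already pinned and returns them as-is, which matches the "no-op otherwise" remark above.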
@codex review
💡 Codex Review
Reviewed commit: d1a251a4fe
@codex review
💡 Codex Review
Reviewed commit: d458f47cf4
@codex review
💡 Codex Review
Reviewed commit: 4f66e1065d
@codex review
Codex Review: Didn't find any major issues. Breezy!
The distributed-support work silently broke Muon's optimizer semantics: the FSDP2 owner-compute plumbing was fine, but momentum behavior drifted. This PR fixes that, makes streaming pretraining more robust to transient HF Hub failures, and does some overdue housekeeping.
Changes
Muon optimizer
- qkv.weight: split into Q/K/V before orthogonalization, then repack into the fused layout.
- param_policy: hidden_2d, norm_factor: neobert / norm_factor: muon_reference.
Streaming robustness
Repo cleanup
- src/neobert/glue/, HF classifier adapter src/neobert/huggingface/.
- model/model.py.
- docs/guides/ + docs/reference/, added a training optimization guide.
still to test