* Refresh shared training refactor on top of ART main
* Rename Megatron merge helper
* Deduplicate local and shared training logic
* Fix Megatron rope theta compatibility
* Remove Megatron rope theta workaround
* Align Unsloth SFT weight decay defaults
* remove apex from no-build-isolation-package
* update install script
* Fix Megatron job finalization ordering
* Share Megatron worker loop
* Default Megatron grad accumulation by DP size
* Collapse Megatron shared API into train module
* Remove Megatron shared shim
* Collapse Unsloth shared API into train module
* Lighten Megatron orchestration imports
* fix: normalize SFT loss by token count before backward pass
The loss was not being divided by global_trainable_tokens before
calling backward(), causing gradients to scale with batch size
and grad_norm to explode to infinity during Megatron SFT training.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
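The normalization described above can be sketched as follows. This is a minimal illustration of the idea (later reverted in this PR), not the actual ART/Megatron code; the `global_trainable_tokens` argument, the per-token loss shape, and the function name are all assumptions.

```python
import torch

def backward_normalized(token_losses: torch.Tensor,
                        loss_mask: torch.Tensor,
                        global_trainable_tokens: int) -> torch.Tensor:
    """Divide the summed token loss by the global trainable-token
    count before backward(), so gradient magnitude does not scale
    with batch size. Names and signature are illustrative."""
    # Sum the loss over trainable tokens only.
    loss = (token_losses * loss_mask).sum()
    # Normalize by the global token count so each token contributes
    # equally regardless of (micro-)batch size.
    loss = loss / max(global_trainable_tokens, 1)
    loss.backward()
    return loss.detach()
```

Without the division, doubling the batch roughly doubles the summed loss and hence the gradients, which is the grad_norm blow-up the commit message describes.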
* Revert "fix: normalize SFT loss by token count before backward pass"
This reverts commit d08f2ad.
* Support Megatron SFT in local backend
* refactor: extract create_identity_lora as standalone function
Extract the identity LoRA creation logic from MegatronService._create_identity_lora
into a module-level create_identity_lora() function so it can be reused by the
serverless training backend. The class method now delegates to this function.
This avoids duplicating the MoE-aware identity LoRA creation logic (fused expert
targets + convert_checkpoint_if_needed A/B swap) across repos.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
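The delegation pattern described in this commit can be sketched as below. All names, signatures, and the placeholder body are hypothetical; the real ART function performs the MoE-aware target selection and checkpoint conversion mentioned above.

```python
def create_identity_lora(base_model: str, output_dir: str) -> str:
    """Module-level helper: build an identity (no-op) LoRA adapter
    for `base_model`, reusable by any training backend.
    (Illustrative stub; the real logic lives in ART.)"""
    adapter_path = f"{output_dir}/identity_lora"
    # ... MoE-aware fused-expert target selection and the
    #     convert_checkpoint_if_needed A/B swap would go here ...
    return adapter_path

class MegatronService:
    def __init__(self, base_model: str, output_dir: str):
        self.base_model = base_model
        self.output_dir = output_dir

    def _create_identity_lora(self) -> str:
        # The class method now just delegates to the shared
        # module-level function, avoiding duplicated logic.
        return create_identity_lora(self.base_model, self.output_dir)
```

Keeping the logic at module level lets the serverless backend call `create_identity_lora()` directly without constructing a `MegatronService`.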
* Fix SFT main_grad fallback in Megatron
* Fix ART lint and type issues
* Simplify ty-safe optimizer access
* test: drop megatron sft batch unit test
* refactor: revert direct safetensors import in moe conversion
* style: format megatron oracle harness
* refactor: use direct safetensors import in routing replay
* fix: isolate megatron optimizer states and step counts
* Add SFT oracle coverage and shared grad scheduling
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: FurtherAI <FurtherAI@users.noreply.github.com>