Conversation
force-pushed from c1abe2e to a1b8efc
FurtherAI left a comment:
I will propose a merge with main after #619. We'll need to fold in the changes I made, mostly to train.py, and settle on a unified API.
The rest of it looks good! I'll still test the actual Megatron training and make sure the merge preserves correctness and performance. You should extend the correctness tests to SFT, though.
force-pushed from b19e94c to c2039fc
- The PR now signals "all done" before offloading Megatron state. The sentinel is written in `shared.py`, while offload happens later in `train.py`. On main, offload happened first in `train.py`, and only then did rank 0 emit completion in `train.py`. That can reintroduce same-node wake/OOM races.
- Missing adapter checkpoints no longer fail hard. The new shared loader in `shared.py` silently resets LoRA params instead of raising; main hard-fails on a missing adapter.
- The `log_path` API cutover is still partial. The shared job schema exposes `log_path` in `jobs.py`, and shared execution uses it in `shared.py`, but the local controller still tails and deletes the default path in `service.py`.
All good points found by Codex. I'll add to point 1 that some objects, such as the packed tensors, need to be deleted before the dirs can be removed, since they are memory-mapped and thus hold the files open. So all object deletion should happen before dir removal.
Correctness tests were passing, which is good.
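The deletion-ordering constraint above can be sketched in isolation. This is a minimal illustration, not the real packed-tensor code; all file and variable names here are invented:

```python
import mmap
import os
import shutil
import tempfile

# Create a temp directory holding a file we memory-map, mimicking the
# packed-tensor files mentioned above.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "packed_tensors.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

fh = open(path, "r+b")
mapped = mmap.mmap(fh.fileno(), 0)  # the mapping keeps the file open

# Release every object referencing the file BEFORE removing the directory;
# on Windows rmtree fails outright while handles are open, and on POSIX the
# open handles keep the disk space pinned even after the unlink.
mapped.close()
fh.close()
shutil.rmtree(workdir)
assert not os.path.exists(workdir)
```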
force-pushed from dbc9f8a to 3a679cb
Good catch. I think everything has been updated now, and I also added some additional training loop refactoring. My Codex also flagged a few other issues, but they do not appear to be regressions in this branch. I am not sure whether they are real issues, though:
The previous issues seem to be solved. Looks good. Did you add SFT to the correctness tests? Or is that why
Changes look good; I ran the correctness tests and they are passing.
force-pushed from 8096ed4 to f6cd445
…nto feat/shared-training-code
The loss was not being divided by global_trainable_tokens before calling backward(), causing gradients to scale with batch size and grad_norm to explode to infinity during Megatron SFT training. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
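The fix described in that commit message can be sketched as follows. Only `global_trainable_tokens` comes from the commit text; the toy model and shapes are illustrative:

```python
import torch

# Tiny stand-in model; the real case is a Megatron SFT step.
model = torch.nn.Linear(4, 1)
global_trainable_tokens = 1024  # stand-in for the run-wide trainable-token count

inputs = torch.randn(global_trainable_tokens, 4)
targets = torch.randn(global_trainable_tokens, 1)

# With reduction="sum", the raw loss (and hence the gradients) scales linearly
# with batch size; dividing by global_trainable_tokens BEFORE backward() keeps
# gradient magnitudes batch-size invariant, so grad_norm stays finite.
loss = torch.nn.functional.mse_loss(model(inputs), targets, reduction="sum")
(loss / global_trainable_tokens).backward()

grad_norm = model.weight.grad.norm().item()
assert grad_norm < float("inf")
```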
This reverts commit d08f2ad.
Extract the identity LoRA creation logic from MegatronService._create_identity_lora into a module-level create_identity_lora() function so it can be reused by the serverless training backend. The class method now delegates to this function. This avoids duplicating the MoE-aware identity LoRA creation logic (fused expert targets + convert_checkpoint_if_needed A/B swap) across repos. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
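The extraction pattern in that commit can be sketched like this. The function and method names come from the commit message; the body, attributes, and return value are invented simplifications:

```python
def create_identity_lora(model_dir: str, output_dir: str, rank: int = 1) -> str:
    """Create an identity (no-op) LoRA adapter for the base model in model_dir.

    Sketch only: the real function also handles MoE fused-expert targets and
    the convert_checkpoint_if_needed A/B swap mentioned in the commit.
    """
    adapter_path = f"{output_dir}/identity_lora_r{rank}"
    # ... write zero-initialized B matrices so the adapter starts as a no-op ...
    return adapter_path


class MegatronService:
    """Illustrative slice of the service; these attributes are invented."""

    model_dir = "/models/base"
    checkpoint_dir = "/checkpoints"

    def _create_identity_lora(self, rank: int = 1) -> str:
        # The class method now simply delegates to the shared module-level
        # function, so the serverless backend can reuse the same logic.
        return create_identity_lora(self.model_dir, self.checkpoint_dir, rank)
```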
I modified the API a bit to be able to run the correctness tests. Big changes are:
Small fix:
The correctness/sensitivity tests are passing, so this looks pretty good. Could you check that it runs end-to-end? (Trainability will be broken; the bug is in offload, but I'll push a fix for that in my next PR.)
We have a lot of similar training code implemented differently across ART and the serverless backend. That makes RL/SFT changes hard to reason about and prone to drift.
This PR moves the reusable training/runtime logic into ART and makes the serverless backend import it instead of reimplementing it.
Before
We had duplicated training logic in multiple places:
So changing training behavior often meant touching multiple implementations.
How It Works Now
Shared backend glue
Shared RL config construction and metric aggregation now live in:
- `src/art/_backend_training.py`

Both of these now use that shared helper:

- `src/art/local/backend.py`
- `src/art/serverless/backend.py`

Shared Megatron job protocol and execution
Megatron’s shared cross-repo pieces now live in ART:
- `src/art/megatron/train.py`
- `src/art/megatron/client.py`
- `src/art/megatron/jobs.py`
- `src/art/megatron/sft_batches.py`
- `src/art/megatron/merge.py`
- `src/art/megatron/runtime_env.py`

The public worker API is now in `src/art/megatron/train.py`, not a separate `shared.py`. That module owns the actual Megatron worker execution:
The orchestration layers are now thin wrappers around that API.
Shared Unsloth execution
Shared Unsloth execution now lives in:
- `src/art/unsloth/train.py`

That module exposes the reusable serverless-facing API:

- `create_unsloth_train_context`
- `run_unsloth_rl_training`
- `run_unsloth_sft_training`

Serverless now imports those directly instead of maintaining a separate implementation.
Design Choice
Megatron orchestration is still separate from Megatron execution.
The shared ART execution layer knows how to run Megatron jobs, but process lifecycle / handoff / sleep-wake behavior stays in the
callers:
- `src/art/megatron/service.py`
- `trainers/megatron_trainer.py` and `megatron/train.py`

This keeps the shared code focused on training logic rather than backend-specific worker management.
Current End-to-End Flows
ART local RL with Megatron
- `LocalBackend.train()` / `_train_model()` build config and metrics via `src/art/_backend_training.py`
- `MegatronService.train()` pauses vLLM, ensures the Megatron process is running, and writes a job file via `src/art/megatron/client.py`
- `src/art/megatron/train.py` picks up that job and runs `run_megatron_rl_job()`
- `MegatronService` publishes the resulting checkpoint and re-registers the LoRA for inference

ART local SFT with Megatron
- `LocalBackend._train_sft()` tokenizes trajectories into `SFTBatch`
- `MegatronService.train_sft()` materializes those batches to disk and writes a `MegatronSFTTrainingJob`
- `src/art/megatron/train.py` runs `run_megatron_sft_job()`

Serverless RL
- `trainers/__init__.py`
- `MegatronTrainer.train()` or `UnslothTrainer.train()` delegates into ART shared execution
- `megatron/train.py`

Serverless SFT
- `SFTBatch` objects
- `MegatronTrainer.train_sft()` uses ART's shared `materialize_sft_batches()` and `MegatronSFTTrainingJob` / `run_megatron_sft_job()`
- `UnslothTrainer.train_sft()` calls `run_unsloth_sft_training()` from ART directly

Operational Note
For Megatron in serverless, local `/tmp/megatron_training_jobs` files are only a per-worker handoff between the trainer and the Megatron subprocess. They are not the customer-facing queue.
Customer backlog still lives in Temporal.
This PR also clears stale local Megatron job JSONs before enqueueing a new job in serverless so an interrupted worker does not
accidentally replay an old local job file.
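The stale-job cleanup described above can be sketched as follows. The directory name comes from the PR text; the job-file schema and function name are assumptions:

```python
import json
import pathlib

# Per-worker handoff directory (path from the PR text; schema is invented).
JOBS_DIR = pathlib.Path("/tmp/megatron_training_jobs")

def enqueue_job(job: dict, jobs_dir: pathlib.Path = JOBS_DIR) -> pathlib.Path:
    """Write a new local job file, clearing stale ones first.

    Clearing before enqueue means an interrupted worker's leftover JSON can
    never be replayed when the next job starts.
    """
    jobs_dir.mkdir(parents=True, exist_ok=True)
    for stale in jobs_dir.glob("*.json"):
        stale.unlink()  # drop leftovers from a previous, interrupted run
    path = jobs_dir / "job.json"
    path.write_text(json.dumps(job))
    return path
```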
Result
After this change: