
Share training code between backends#626

Merged
Kovbo merged 30 commits into main from feat/shared-training-code
Apr 7, 2026

Conversation

Collaborator

@Kovbo Kovbo commented Mar 21, 2026

We have a lot of similar training code implemented differently across ART and the serverless backend. That makes RL/SFT changes hard to reason about and lets the implementations drift apart.

This PR moves the reusable training/runtime logic into ART and makes the serverless backend import it instead of reimplementing it.

Before

We had duplicated training logic in multiple places:

  • ART local Megatron had its own runtime setup and RL worker loop
  • Serverless Megatron had its own runtime setup plus separate RL and SFT loops
  • Serverless Unsloth had its own RL/SFT execution logic
  • ART local/backend and serverless/backend each built RL config objects and aggregated training metrics separately

So changing training behavior often meant touching multiple implementations.

How It Works Now

Shared backend glue

Shared RL config construction and metric aggregation now live in:

  • src/art/_backend_training.py

Both of these now use that shared helper:

  • src/art/local/backend.py
  • src/art/serverless/backend.py

Shared Megatron job protocol and execution

Megatron’s shared cross-repo pieces now live in ART:

  • src/art/megatron/train.py
  • src/art/megatron/client.py
  • src/art/megatron/jobs.py
  • src/art/megatron/sft_batches.py
  • src/art/megatron/merge.py
  • src/art/megatron/runtime_env.py

The public worker API is now in src/art/megatron/train.py, not a separate shared.py.

That module owns the actual Megatron worker execution:

  • runtime/model/optimizer setup
  • RL loop
  • SFT loop
  • LoRA + optimizer load/save
  • metrics logging
  • job finalization and cleanup

The orchestration layers are now thin wrappers around that API.
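The worker-loop shape described above can be sketched roughly as follows. The job schema, file-based handoff, and function names are assumptions for illustration; the real loop lives in src/art/megatron/train.py and dispatches to run_megatron_rl_job() / run_megatron_sft_job().

```python
# Illustrative file-based job loop; schema and names are assumptions,
# not the actual src/art/megatron/train.py implementation.
import json
import time
from pathlib import Path


def run_job(job: dict) -> str:
    # Dispatch on job type, standing in for run_megatron_rl_job() /
    # run_megatron_sft_job() in the real worker.
    if job["type"] == "rl":
        return f"ran RL job {job['id']}"
    if job["type"] == "sft":
        return f"ran SFT job {job['id']}"
    raise ValueError(f"unknown job type: {job['type']}")


def worker_loop(jobs_dir: Path, max_iterations: int = 1, poll_seconds: float = 0.0) -> list[str]:
    """Pick up job files as they appear and execute them one at a time."""
    handled: list[str] = []
    for _ in range(max_iterations):
        for job_file in sorted(jobs_dir.glob("*.json")):
            job = json.loads(job_file.read_text())
            handled.append(run_job(job))
            job_file.unlink()  # consume the job so it is never replayed
        time.sleep(poll_seconds)
    return handled
```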

Shared Unsloth execution

Shared Unsloth execution now lives in:

  • src/art/unsloth/train.py

That module exposes the reusable serverless-facing API:

  • create_unsloth_train_context
  • run_unsloth_rl_training
  • run_unsloth_sft_training

Serverless now imports those directly instead of maintaining a separate implementation.
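The intended call shape is: build one training context, then run RL or SFT against it. The stubs below only mirror that pattern so it can be shown end to end; the real signatures in src/art/unsloth/train.py may differ.

```python
# Stand-in stubs mirroring the shared Unsloth API's call shape; these are
# illustrative, not the real src/art/unsloth/train.py implementations.
from dataclasses import dataclass, field


@dataclass
class UnslothTrainContext:
    model_name: str
    history: list = field(default_factory=list)


def create_unsloth_train_context(model_name: str) -> UnslothTrainContext:
    # Real version presumably loads the model/tokenizer and training state.
    return UnslothTrainContext(model_name=model_name)


def run_unsloth_rl_training(ctx: UnslothTrainContext, trajectories: list) -> dict:
    ctx.history.append(("rl", len(trajectories)))
    return {"objective": "rl", "num_trajectories": len(trajectories)}


def run_unsloth_sft_training(ctx: UnslothTrainContext, batches: list) -> dict:
    ctx.history.append(("sft", len(batches)))
    return {"objective": "sft", "num_batches": len(batches)}


# One context, reused across both objectives:
ctx = create_unsloth_train_context("some-base-model")
rl_result = run_unsloth_rl_training(ctx, trajectories=[{}, {}])
sft_result = run_unsloth_sft_training(ctx, batches=[{}])
```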

Design Choice

Megatron orchestration is still separate from Megatron execution.

The shared ART execution layer knows how to run Megatron jobs, but process lifecycle / handoff / sleep-wake behavior stays in the callers:

  • ART local orchestration lives in src/art/megatron/service.py
  • Serverless Megatron orchestration lives in trainers/megatron_trainer.py and megatron/train.py

This keeps the shared code focused on training logic rather than backend-specific worker management.

Current End-to-End Flows

ART local RL with Megatron

  1. LocalBackend.train() / _train_model() build config and metrics via src/art/_backend_training.py
  2. MegatronService.train() pauses vLLM, ensures the Megatron process is running, and writes a job file via src/art/megatron/client.py
  3. The Megatron worker loop in src/art/megatron/train.py picks up that job and runs run_megatron_rl_job()
  4. MegatronService publishes the resulting checkpoint and re-registers the LoRA for inference
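Step 2's file handoff can be sketched as an atomic job-file write, so the worker loop never observes a partially written job. The helper name and schema below are illustrative, not the actual src/art/megatron/client.py implementation.

```python
# Illustrative client-side job handoff; names and schema are assumptions.
import json
import os
import uuid
from pathlib import Path


def write_job_file(jobs_dir: Path, job: dict) -> Path:
    """Write the job atomically so the worker never reads a partial file."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    final_path = jobs_dir / f"{uuid.uuid4().hex}.json"
    tmp_path = final_path.with_suffix(".json.tmp")
    tmp_path.write_text(json.dumps(job))
    os.replace(tmp_path, final_path)  # atomic rename on POSIX
    return final_path
```

Writing to a `.tmp` name first means a worker globbing for `*.json` only ever sees complete job files.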

ART local SFT with Megatron

  1. LocalBackend._train_sft() tokenizes trajectories into SFTBatch
  2. MegatronService.train_sft() materializes those batches to disk and writes a MegatronSFTTrainingJob
  3. The same worker loop in src/art/megatron/train.py runs run_megatron_sft_job()
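A minimal sketch of what materializing batches to disk might look like; the real materialize_sft_batches() in src/art/megatron/sft_batches.py almost certainly uses a different on-disk format (memory-mappable tensors rather than JSON, per the review discussion below about memory-mapped files).

```python
# Illustrative batch materialization; format and naming are assumptions,
# not the real src/art/megatron/sft_batches.py implementation.
import json
from pathlib import Path


def materialize_sft_batches(batches: list[dict], out_dir: Path) -> list[Path]:
    """Write each tokenized batch to its own file for the worker to load."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    for i, batch in enumerate(batches):
        path = out_dir / f"batch_{i:05d}.json"
        path.write_text(json.dumps(batch))
        paths.append(path)
    return paths
```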

Serverless RL

  1. The workflow restores model state/artifacts and chooses a trainer via trainers/__init__.py
  2. MegatronTrainer.train() or UnslothTrainer.train() delegates into ART shared execution
  3. Megatron uses ART’s shared job schema/client and ART’s worker loop via megatron/train.py
  4. Unsloth uses ART’s shared train context and RL helper directly

Serverless SFT

  1. The workflow tokenizes uploaded training data into SFTBatch objects
  2. MegatronTrainer.train_sft() uses ART’s shared materialize_sft_batches() and MegatronSFTTrainingJob
  3. The Megatron worker loop runs run_megatron_sft_job()
  4. UnslothTrainer.train_sft() calls run_unsloth_sft_training() from ART directly

Operational Note

For Megatron in serverless, local /tmp/megatron_training_jobs files are only a per-worker handoff between the trainer and the Megatron subprocess. They are not the customer-facing queue.

Customer backlog still lives in Temporal.

This PR also clears stale local Megatron job JSONs before enqueueing a new job in serverless, so an interrupted worker does not accidentally replay an old local job file.
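The cleanup described here amounts to deleting any leftover job JSONs before enqueueing. A minimal sketch, assuming the handoff directory holds one JSON file per job (helper name is hypothetical):

```python
# Illustrative stale-job cleanup; the helper name is hypothetical.
from pathlib import Path


def clear_stale_jobs(jobs_dir: Path) -> int:
    """Delete leftover job files from an interrupted worker; return count."""
    removed = 0
    for stale in jobs_dir.glob("*.json"):
        stale.unlink()
        removed += 1
    return removed
```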

Result

After this change:

  • training execution logic lives in ART
  • serverless imports ART’s Megatron and Unsloth training code instead of reimplementing it
  • RL backend config/metric aggregation is shared
  • Megatron RL and SFT use one shared worker implementation
  • changes to training behavior mostly happen in one place

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from c1abe2e to a1b8efc on March 21, 2026 at 02:03
@Kovbo Kovbo requested a review from bradhilton March 24, 2026 20:32
@Kovbo Kovbo marked this pull request as ready for review March 24, 2026 20:59
@Kovbo Kovbo requested a review from FurtherAI March 25, 2026 18:20
Collaborator

@FurtherAI FurtherAI left a comment


I will propose a merge with main after #619. We'll need to collect the changes I made, mostly to train.py, and settle on a unified API.

The rest of it looks good! I'll test the actual Megatron training, though, and ensure the merge preserves correctness and performance. You should also extend the correctness tests to SFT.

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from b19e94c to c2039fc on March 28, 2026 at 01:03
@Kovbo Kovbo requested a review from FurtherAI March 30, 2026 20:17
Collaborator

@FurtherAI FurtherAI left a comment


  1. The PR now signals "all done" before offloading Megatron state. The sentinel is written in shared.py, while offload happens later in train.py. On main, offload happened first in train.py and only then did rank 0 emit completion, also in train.py. That can reintroduce same-node wake/OOM races.
  2. Missing adapter checkpoints no longer fail hard. The new shared loader in shared.py silently resets LoRA params instead of raising; main hard-fails on a missing adapter.
  3. The log_path API cutover is still partial. The shared job schema exposes log_path in jobs.py, and shared execution uses it in shared.py, but the local controller still tails and deletes the default path in service.py.

All good points found by Codex. I'll add to 1 that some objects, such as the packed tensors, need to be deleted before the dirs can be removed, since they are memory-mapped and thus hold the files open. So all object deletion should happen before dir removal.

Correctness tests were passing, which is good.

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from dbc9f8a to 3a679cb on March 30, 2026 at 22:52
@Kovbo Kovbo requested a review from FurtherAI March 31, 2026 01:02
@Kovbo
Collaborator Author

Kovbo commented Mar 31, 2026

Good catch.

I think everything has been updated now, and I also added some additional training loop refactoring.

My Codex also flagged a few other issues, but they do not appear to be regressions in this branch. I am not sure whether they are real issues, though:

  1. /art/megatron/train.py:355 still requires grad_accumulation_sequences % dp_world_size == 0, and the public config builders still leave it at the default 1 in /art/_backend_training.py:15. So DP>1 Megatron can break through a backend issue, just not one caused by our sharing refactor.
  2. In /art/megatron/train.py:495-540, reduced_loss is computed as sum of per-rank averages, not a global token-weighted average. That makes DP>1 logging wrong.

@FurtherAI
Collaborator

  1. /art/megatron/train.py:355 still requires grad_accumulation_sequences % dp_world_size == 0, and the public config builders still leave it at the default 1 in /art/_backend_training.py:15. So DP>1 Megatron can break through a backend issue, just not one caused by our sharing refactor.
  2. In /art/megatron/train.py:495-540, reduced_loss is computed as sum of per-rank averages, not a global token-weighted average. That makes DP>1 logging wrong.
  1. Yeah, let's set the default grad_accumulation_sequences to None and, if None, update it to dp_world_size there. The assert is intentional though, so it stays.
  2. As for this, I believe finalize_model_grads in finalize_model_grads_extended correctly reduces it across ranks. It would be good to drop a comment there though ("# num_tokens is reduced in place across ranks").
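For reference, the arithmetic gap discussed in point 2, with made-up numbers: summing per-rank mean losses diverges from a global token-weighted mean whenever ranks see different token counts. If finalize_model_grads already reduces num_tokens in place across ranks, the logged value matches the weighted form.

```python
# Made-up per-rank values illustrating the reduction concern; not PR code.
rank_losses = [2.0, 4.0]   # mean loss computed independently on each DP rank
rank_tokens = [100, 300]   # trainable tokens contributing on each rank

# Plain sum of per-rank averages (what a naive all-reduce-sum would log):
naive = sum(rank_losses)  # 6.0

# Global token-weighted average (what DP>1 logging should report):
weighted = sum(l * t for l, t in zip(rank_losses, rank_tokens)) / sum(rank_tokens)
# (2.0 * 100 + 4.0 * 300) / 400 = 3.5
```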

The previous issues seem to be solved. Looks good

Did you add SFT to the correctness tests? Or is that why supports_sft=False?

@FurtherAI
Collaborator

Changes look good, I ran the correctness tests and they are passing.
Last thing is the plan for SFT correctness tests?

@Kovbo Kovbo force-pushed the feat/shared-training-code branch 2 times, most recently from 8096ed4 to f6cd445 on April 2, 2026 at 02:18
Kovbo and others added 15 commits April 2, 2026 12:07
The loss was not being divided by global_trainable_tokens before
calling backward(), causing gradients to scale with batch size
and grad_norm to explode to infinity during Megatron SFT training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract the identity LoRA creation logic from MegatronService._create_identity_lora
into a module-level create_identity_lora() function so it can be reused by the
serverless training backend. The class method now delegates to this function.

This avoids duplicating the MoE-aware identity LoRA creation logic (fused expert
targets + convert_checkpoint_if_needed A/B swap) across repos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@FurtherAI
Collaborator

I modified the API a bit to be able to run the correctness tests. Big changes are:

  • has a separated-out training step, which matches RL pretty closely
  • matches RL gradient accumulation so data parallelism works correctly

Small fix:

  • optimizer states are keyed by objective (SFT vs RL) so back-to-back jobs don't overwrite each other's optimizer.

The correctness/sensitivity tests are passing, so it looks pretty good. Could you check that it runs end-to-end? (Trainability will be broken; the bug is in offload, but I'll push a fix for that in my next PR.)

@Kovbo Kovbo merged commit 8ad8e50 into main Apr 7, 2026
5 checks passed