
[WIP] Feat/tensor colocated weight sync #1164

Closed
HT-Yuan wants to merge 8 commits into inclusionAI:main from HT-Yuan:feat/tensor-colocated-weight-sync

Conversation

HT-Yuan (Contributor) commented Apr 11, 2026

Description

Add backend-aware dispatching for colocated tensor weight synchronization, enabling vLLM's native IPCWeightTransferEngine as an alternative to the existing SGLang FlattenedTensorBucket + MultiprocessingSerializer path.

Previously, the tensor weight update path in FSDPEngine was hardcoded to SGLang's serialization format. This PR introduces a tensor_target_backend parameter that flows from rl_trainer.py → train_controller.py → fsdp_engine.py, allowing the engine to dispatch to the correct transport mechanism based on the rollout backend.

Key changes

  1. vllm_remote.py — VLLMBackend gains send_tensor_weight_update(), which delegates to vLLM's IPCWeightTransferEngine.trainer_send_weights(); RemotevLLMEngine gains update_weights_from_tensor().
  2. fsdp_engine.py — _flush_colocated_tensor_bucket() refactored to dispatch based on supports_direct_tensor_weight_update; SGLang logic extracted into _flush_sglang_tensor_bucket(); added a _make_tensor_backend() factory (see the sketch after this list).
  3. remote_inf_engine.py — RemoteInfBackendProtocol gains a build_tensor_weight_update_requests() method declaration.
  4. engine_api.py / train_controller.py / megatron_engine.py / archon_engine.py — connect_engine() signature extended with tensor_target_backend: str | None for interface alignment.
  5. rl_trainer.py — passes self.rollout_alloc.backend as tensor_target_backend.
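
The dispatch in change 2 can be pictured as below. This is a minimal sketch: the method and attribute names are those listed above, but the body is illustrative rather than the committed implementation.

```python
# Sketch only: method/attribute names come from the key changes above;
# the body is assumed, not copied from this PR.
def _flush_colocated_tensor_bucket(self, bucket, meta):
    backend = self._make_tensor_backend()  # resolved from tensor_target_backend
    if backend.supports_direct_tensor_weight_update:
        # vLLM path: hand the tensors to IPCWeightTransferEngine via the backend.
        backend.send_tensor_weight_update(bucket, meta)
    else:
        # SGLang path: FlattenedTensorBucket + MultiprocessingSerializer,
        # extracted into its own helper by this PR.
        self._flush_sglang_tensor_bucket(bucket, meta)
```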

Related Issue

Fixes #(issue)

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A — The tensor_target_backend parameter is optional with a default of None (falls back to "sglang"), so existing callers are unaffected.
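
A minimal sketch of that fallback; only the tensor_target_backend parameter is taken from this PR, and the rest of the signature is elided for illustration:

```python
def connect_engine(self, *args, tensor_target_backend: str | None = None, **kwargs):
    # None preserves the pre-PR behavior: the SGLang serialization path.
    backend_name = tensor_target_backend or "sglang"
    ...
```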

Additional Context

Architecture

HT-Yuan marked this pull request as draft April 11, 2026 11:04
gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new 'tensor' weight update mode for colocated training and inference, utilizing CUDA IPC for efficient transfers. It implements a two-phase update process in the FSDP engine—staging parameters to CPU pinned memory before transferring them to the inference engine—and adds backend support for both SGLang and vLLM. Review feedback highlights opportunities to reduce code duplication in parameter selection and request building, improve network efficiency by reusing HTTP sessions across buckets, and enhance performance by removing expensive and unnecessary GPU cache clearing calls during the update loop.

Comment on lines +1403 to +1410
```python
if self.config.use_lora:
    param_iterator = (
        (name, param)
        for name, param in self._get_model_name_parameters(meta)
        if param.requires_grad
    )
else:
    param_iterator = self._get_model_name_parameters(meta)
```

Severity: medium

The logic for selecting parameters to update (filtering by requires_grad when use_lora is enabled) is duplicated from _update_weights_from_distributed. Consider refactoring this into a helper method (e.g., _get_trainable_parameters) to ensure consistency and reduce code duplication.
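
One possible shape for that helper. The name _get_trainable_parameters is the reviewer's example; the body below is a sketch, not the repository's code.

```python
def _get_trainable_parameters(self, meta):
    """Yield the (name, param) pairs that should be synced to inference."""
    params = self._get_model_name_parameters(meta)
    if not self.config.use_lora:
        yield from params
        return
    # With LoRA enabled, only adapter weights require grad and need syncing.
    for name, param in params:
        if param.requires_grad:
            yield name, param
```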

Comment on lines +1451 to +1475
```python
with tms_context:
    if current_platform.device_type == "cuda" and torch.cuda.is_available():
        current_platform.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    bucket: list[tuple[str, torch.Tensor]] = []
    bucket_bytes = 0

    for name, cpu_tensor in staged:
        tensor_bytes = cpu_tensor.numel() * cpu_tensor.element_size()

        if bucket_bytes + tensor_bytes > weight_chunked_mem_size and bucket:
            self._flush_colocated_tensor_bucket(bucket, meta)
            bucket = []
            bucket_bytes = 0

        gpu_tensor = cpu_tensor.to(
            current_platform.current_device(), non_blocking=False
        )
        bucket.append((name, gpu_tensor))
        bucket_bytes += tensor_bytes

    if bucket:
        self._flush_colocated_tensor_bucket(bucket, meta)
finally:  # closes a try: block that begins above this excerpt
    staged.clear()
```

Severity: medium

In _apply_colocated_tensor_weights, consider creating a single aiohttp.ClientSession and passing it down to the flush methods. Currently, a new session (and connection pool) is created for every bucket in _send_tensor_to_servers, which is inefficient when processing many buckets during a weight update.
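
A sketch of that suggestion, assuming the flush path grows an optional session parameter; _iter_buckets and the async signature are hypothetical.

```python
import aiohttp

from areal.infra.utils.http import get_default_connector


async def _apply_colocated_tensor_weights(self, staged, meta):
    # One session (and one connection pool) shared across every bucket flush,
    # instead of a new session per bucket inside _send_tensor_to_servers.
    async with aiohttp.ClientSession(
        timeout=aiohttp.ClientTimeout(total=600),
        connector=get_default_connector(),
    ) as session:
        for bucket in self._iter_buckets(staged):  # hypothetical bucketing helper
            await self._flush_colocated_tensor_bucket(bucket, meta, session=session)
```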

Comment on lines +1511 to +1513
```python
if current_platform.device_type == "cuda" and torch.cuda.is_available():
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
```

Severity: medium

torch.cuda.empty_cache() is an expensive operation that synchronizes the GPU and can significantly degrade performance, especially when called repeatedly in a loop (as it is here via _apply_colocated_tensor_weights). Since torch.cuda.ipc_collect() is already called to release IPC handles, consider removing empty_cache() or moving it outside the loop to avoid unnecessary overhead.

Suggested change:

```diff
 if current_platform.device_type == "cuda" and torch.cuda.is_available():
     torch.cuda.ipc_collect()
-    torch.cuda.empty_cache()
```

Comment thread on areal/engine/fsdp_engine.py (outdated)
Comment on lines +1569 to +1611
```python
def _send_tensor_to_servers(
    self,
    serialized_named_tensors: list[str],
    addresses: list[str],
    weight_version: str | None = None,
) -> None:
    """Send serialized tensor data to SGLang servers via HTTP."""
    import asyncio

    import aiohttp
    import uvloop

    from areal.infra.utils.http import arequest_with_retry, get_default_connector

    payload: dict[str, Any] = {
        "serialized_named_tensors": serialized_named_tensors,
        "load_format": "flattened_bucket",
        "flush_cache": False,
    }
    if weight_version is not None:
        payload["weight_version"] = weight_version

    async def _fn():
        async with aiohttp.ClientSession(
            timeout=aiohttp.ClientTimeout(total=600),
            read_bufsize=1024 * 1024 * 10,
            connector=get_default_connector(),
        ) as session:
            jobs = [
                arequest_with_retry(
                    session=session,
                    addr=addr,
                    endpoint="/update_weights_from_tensor",
                    payload=payload,
                    method="POST",
                    max_retries=1,
                    timeout=600,
                )
                for addr in addresses
            ]
            await asyncio.gather(*jobs)

    uvloop.run(_fn())
```

Severity: medium

The _send_tensor_to_servers method and the payload building logic in _flush_sglang_tensor_bucket (lines 1545-1550) duplicate logic already present in areal.infra.remote_inf_engine._update_weights_from_tensor and the backend's build_tensor_weight_update_requests. Consider refactoring to use the shared helper from remote_inf_engine to improve maintainability and ensure consistency across different weight update paths.
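
A rough sketch of that consolidation; build_tensor_weight_update_requests comes from this PR, while _serialize_bucket, self.backend, and self.remote_engine are assumed glue.

```python
def _flush_sglang_tensor_bucket(self, bucket, meta):
    # Serialize once, then reuse the shared request-building and HTTP logic
    # from remote_inf_engine instead of duplicating it here.
    serialized = self._serialize_bucket(bucket)  # hypothetical helper
    requests = self.backend.build_tensor_weight_update_requests(serialized, meta)
    self.remote_engine.update_weights_from_tensor(requests)
```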

HT-Yuan (Contributor, Author) commented Apr 23, 2026

@garrett4wade
As mentioned earlier, regarding #1214 and #1157, this PR should probably be closed. At present, CUDA IPC communication cannot be performed without enabling TMS. In my opinion, it makes little sense to temporarily store training weights on the CPU as Slime does. What do you think?

garrett4wade (Collaborator)
@HT-Yuan I think the high-level decision is to adopt the CUDA IPC and P2P primitives in awex, together with the frontend developed in #1214. We can add SGLang extensions within AReaL to allow customized IPC communication endpoints. The CPU serialization approach should be abandoned.

github-actions (Bot) commented May 8, 2026

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

github-actions (Bot) added the stale label May 8, 2026
HT-Yuan closed this May 8, 2026
