fix ray mem leak #4487

Merged

lvhan028 merged 4 commits into InternLM:main from grimoire:fix-memleak

Apr 10, 2026

Collaborator

grimoire commented Apr 2, 2026 (edited)

A non-compiled Ray GraphNode does not expose a way to free its ObjectRefs from Python.
~~RayEngineWorker might keep the stream output if the stream is cancelled or broken before it finishes.~~
The owner of routed_experts is PytorchEngine, so freeing the ObjectRef on the client side cannot dereference the object held in the engine. A named actor is created to own the data instead.

grimoire added 4 commits

April 2, 2026 18:18


          fix mem leak

e78106c


          fix clear

a746211


          Merge branch 'main' into fix-memleak

7e309a2


          revert change

5f86206

grimoire marked this pull request as ready for review

April 10, 2026 04:25

Copilot AI review requested due to automatic review settings

April 10, 2026 04:25

Copilot started reviewing on behalf of grimoire

April 10, 2026 04:26

Copilot AI reviewed

Contributor

Copilot AI left a comment

Pull request overview

This PR targets Ray-related memory leaks in the PyTorch engine’s distributed (Ray) execution path by changing how input ObjectRefs are handled and introducing a Ray actor–backed store for transferring large routed_experts outputs across process boundaries.

Changes:

- Replace ray.dag execution with direct per-worker forward_async.remote(...) calls to avoid DAG-retained input ObjectRefs.
- Add a detached named Ray actor (SharedStore) to own/stage routed_experts and return an opaque key instead of embedding an ObjectRef in the response.
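
The key-based hand-off in the second change can be illustrated with a minimal plain-Python sketch. It is not the PR's implementation: the class name `SharedStoreSketch`, the UUID key format, the remove-on-read behaviour of `get`, and the `size` helper are all assumptions made for illustration, and the Ray actor wrapping shown in the diff is deliberately left out so the example runs without Ray.

```python
import uuid
from typing import Any, Dict


class SharedStoreSketch:
    """Hypothetical stand-in for the store; the PR wraps its class as a named Ray actor."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, value: Any) -> str:
        # Hand back an opaque string key rather than an object reference.
        key = uuid.uuid4().hex  # assumption: any unique string would do
        self._data[key] = value
        return key

    def get(self, key: str) -> Any:
        # Assumption: an entry is dropped on first read so the store cannot grow unbounded.
        return self._data.pop(key, None)

    def size(self) -> int:
        # Illustrative helper for checking that entries were released.
        return len(self._data)


store = SharedStoreSketch()
key = store.put([[0, 3], [1, 2]])  # stand-in for a routed_experts array
value = store.get(key)             # the consumer resolves the key exactly once
```

Because the response carries only a string, the engine no longer has to keep an object reference alive on behalf of a client that may never release it; the store's single owner decides when the data goes away.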

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `lmdeploy/pytorch/engine/executor/ray_executor.py` | Switches async forward dispatch from `dag.execute` to direct remote calls, aiming to prevent input ObjectRef retention. |
| `lmdeploy/pytorch/engine/engine_instance.py` | Adds a Ray actor-based shared store and changes `routed_experts` extra output to return a store key when Ray transfer is enabled. |

lmdeploy/pytorch/engine/engine_instance.py

+                      return key
+                  def get(self, key):
+                      import ray

lmdeploy/pytorch/engine/engine_instance.py

Comment on lines 141 to +145

                       if routed_experts is not None and resp.type in [ResponseType.FINISH, ResponseType.CANCEL]:
                           if self._enable_transfer_obj_ref:
-                              import pybase64
                               import ray
-                              ref = ray.put(routed_experts)
-                              data = ray.cloudpickle.dumps(ref)
-                              outputs['routed_experts'] = pybase64.b64encode(data).decode('utf-8')
+                              key = ray.get(_SHARED_STORE.put.remote(routed_experts))
+                              outputs['routed_experts'] = key

lmdeploy/pytorch/engine/engine_instance.py

Comment on lines +94 to +95

+                      if len(all_data) > 0:
+                          ray.internal.free(all_data, local_only=False)

lmdeploy/pytorch/engine/engine_instance.py

+                          _SHARED_STORE = ray.remote(num_cpus=0,)(SharedStore).options(
+                              name=name,
+                              namespace='lmdeploy',
+                              lifetime='detached',

lmdeploy/pytorch/engine/engine_instance.py

Comment on lines +109 to +113

+                          _SHARED_STORE = ray.remote(num_cpus=0,)(SharedStore).options(
+                              name=name,
+                              namespace='lmdeploy',
+                              lifetime='detached',
+                          ).remote()

lmdeploy/pytorch/engine/engine_instance.py

Comment on lines 143 to +145

                               import ray
-                              ref = ray.put(routed_experts)
-                              data = ray.cloudpickle.dumps(ref)
-                              outputs['routed_experts'] = pybase64.b64encode(data).decode('utf-8')
+                              key = ray.get(_SHARED_STORE.put.remote(routed_experts))
+                              outputs['routed_experts'] = key

lmdeploy/pytorch/engine/executor/ray_executor.py

Comment on lines 508 to +512

                       self._prev_inputs = ray.put(inputs)
-                      # make sure in order
-                      self._prev_out = self.dag.execute(self._prev_inputs)
+                      # non-compiled dag would add input object ref, and the ref can not be released in python
+                      self._prev_out = [
+                          worker.forward_async.remote(self._prev_inputs) for worker in self.workers
+                      ]
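
The retention problem described in the new code comment can be pictured with a plain-Python analogy. This is not Ray's internals: `Payload`, `RetainingGraph`, and `direct_call` are hypothetical stand-ins, and the sketch only shows how a hidden strong reference keeps an input alive after the caller has dropped its own reference, whereas a direct call does not.

```python
import gc
import weakref


class Payload:
    """Stand-in for the data behind an input object reference."""


class RetainingGraph:
    """Hypothetical graph object that records every input it is given."""

    def __init__(self):
        self._seen_inputs = []

    def execute(self, payload):
        # Hidden strong reference that the caller has no handle to release.
        self._seen_inputs.append(payload)
        return "done"


def direct_call(payload):
    """Stand-in for dispatching to each worker directly; nothing is retained."""
    return "done"


graph = RetainingGraph()

p1 = Payload()
probe1 = weakref.ref(p1)  # lets us observe whether p1 is still alive
graph.execute(p1)
del p1
gc.collect()
retained_by_graph = probe1() is not None

p2 = Payload()
probe2 = weakref.ref(p2)
direct_call(p2)
del p2
gc.collect()
retained_by_direct_call = probe2() is not None
```

In this analogy the input passed through `RetainingGraph` outlives the caller's reference, while the one passed to `direct_call` is reclaimed as soon as the caller lets go, which is the behaviour the switch to per-worker calls is aiming for.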

lvhan028 requested a review from RunningLeon

April 10, 2026 04:45

lvhan028 added the Bug:P1 label

lvhan028 approved these changes

RunningLeon approved these changes

Collaborator

RunningLeon left a comment

LGTM

lvhan028 merged commit 7dce446 into InternLM:main

9 checks passed
