
fix ray mem leak #4487

Merged
lvhan028 merged 4 commits into InternLM:main from grimoire:fix-memleak
Apr 10, 2026

Conversation

@grimoire
Collaborator

@grimoire grimoire commented Apr 2, 2026

  • A non-compiled ray GraphNode does not expose ObjectRef freeing to Python.
  • RayEngineWorker might keep the stream output alive if the stream is cancelled or broken out of before it finishes.
  • The owner of routed_experts is PytorchEngine, so freeing the ObjectRef on the client side cannot release the object inside the engine. A named actor is created to own it instead.
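The third bullet describes a key-based handoff: a long-lived store owns the data, the producer hands out an opaque key, and the consumer fetches (and releases) the value by key. A minimal in-process sketch of that pattern, without Ray (in the PR the store is a detached named Ray actor; the class and method names here mirror the diff but the implementation is illustrative):

```python
import uuid


class SharedStore:
    """In-process sketch of the key-based handoff pattern.

    The store, not the producer, owns the staged value, so the producer can
    drop its reference immediately and the consumer fetches by string key.
    """

    def __init__(self):
        self._data = {}

    def put(self, value):
        # An opaque key replaces an ObjectRef embedded in the response.
        key = uuid.uuid4().hex
        self._data[key] = value
        return key

    def get(self, key):
        # pop() releases the store's own reference once the value is read,
        # which is what keeps the store from leaking memory over time.
        return self._data.pop(key)


store = SharedStore()
key = store.put([0.1, 0.2, 0.3])
value = store.get(key)
print(value)  # [0.1, 0.2, 0.3]
```

With Ray, the same class would be wrapped via `ray.remote(...)` with `lifetime='detached'` and a name, so any process in the namespace can look it up and the values survive the caller.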

@grimoire grimoire marked this pull request as ready for review April 10, 2026 04:25
Copilot AI review requested due to automatic review settings April 10, 2026 04:25
Contributor

Copilot AI left a comment


Pull request overview

This PR targets Ray-related memory leaks in the PyTorch engine’s distributed (Ray) execution path by changing how input ObjectRefs are handled and introducing a Ray actor–backed store for transferring large routed_experts outputs across process boundaries.

Changes:

  • Replace ray.dag execution with direct per-worker forward_async.remote(...) calls to avoid DAG-retained input ObjectRefs.
  • Add a detached named Ray actor (SharedStore) to own/stage routed_experts and return an opaque key instead of embedding an ObjectRef in the response.
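The first change fans work out with one direct async call per worker, so each result is held by its own future and nothing is pinned by a DAG wrapper. A runnable stand-in using `concurrent.futures` (thread-pool futures play the role of Ray ObjectRefs; `Worker` and `forward` are illustrative names, not the PR's actual classes):

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Stand-in for a Ray worker; forward() plays the role of forward_async."""

    def __init__(self, rank):
        self.rank = rank

    def forward(self, inputs):
        return sum(inputs) + self.rank


workers = [Worker(rank) for rank in range(4)]
pool = ThreadPoolExecutor(max_workers=len(workers))

# Direct per-worker dispatch: each submit() returns an independent future,
# and dropping the future releases its result. A DAG wrapper, by contrast,
# can retain the input reference internally with no Python-level way to
# free it -- the leak the PR fixes.
inputs = [1, 2, 3]
futures = [pool.submit(worker.forward, inputs) for worker in workers]
results = [f.result() for f in futures]
print(results)  # [6, 7, 8, 9]
```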

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

  • lmdeploy/pytorch/engine/executor/ray_executor.py — Switches async forward dispatch from dag.execute to direct remote calls, aiming to prevent input ObjectRef retention.
  • lmdeploy/pytorch/engine/engine_instance.py — Adds a Ray actor-based shared store and changes the routed_experts extra output to return a store key when Ray transfer is enabled.


```python
        return key

    def get(self, key):
        import ray
```
Comment on lines 141 to +145

```diff
 if routed_experts is not None and resp.type in [ResponseType.FINISH, ResponseType.CANCEL]:
     if self._enable_transfer_obj_ref:
-        import pybase64
         import ray

-        ref = ray.put(routed_experts)
-        data = ray.cloudpickle.dumps(ref)
-        outputs['routed_experts'] = pybase64.b64encode(data).decode('utf-8')
+        key = ray.get(_SHARED_STORE.put.remote(routed_experts))
+        outputs['routed_experts'] = key
```
Comment on lines +94 to +95

```python
if len(all_data) > 0:
    ray.internal.free(all_data, local_only=False)
```
Comment on lines +109 to +113

```python
_SHARED_STORE = ray.remote(num_cpus=0)(SharedStore).options(
    name=name,
    namespace='lmdeploy',
    lifetime='detached',
).remote()
```
Comment on lines 143 to +145

```diff
 import ray

-ref = ray.put(routed_experts)
-data = ray.cloudpickle.dumps(ref)
-outputs['routed_experts'] = pybase64.b64encode(data).decode('utf-8')
+key = ray.get(_SHARED_STORE.put.remote(routed_experts))
+outputs['routed_experts'] = key
```
Comment on lines 508 to +512

```diff
 self._prev_inputs = ray.put(inputs)
 # make sure in order
-self._prev_out = self.dag.execute(self._prev_inputs)
+# non-compiled dag would add input object ref, and the ref can not be released in python
+self._prev_out = [worker.forward_async.remote(self._prev_inputs) for worker in self.workers]
```
@lvhan028 lvhan028 requested a review from RunningLeon April 10, 2026 04:45
Collaborator

@RunningLeon RunningLeon left a comment


LGTM

@lvhan028 lvhan028 merged commit 7dce446 into InternLM:main Apr 10, 2026
9 checks passed
