Bugfix/fix ray re-init fail #902 #903

Open
FangwenDave wants to merge 2 commits into alibaba:master from FangwenDave:bugfix/fix-ray-re-init-fail
@FangwenDave (Collaborator)

Fixes issue #902.

Add 'one change at a time' principle to the Development Workflow section:
- Don't refactor when adding features
- Don't add features when refactoring

This prevents bugfix PRs from sneaking in unrelated structural changes
(method extraction, renaming, call-chain adjustments).

Root cause: when ray.init() raised an exception other than
InternalServerRockError during a scheduled reconnect, the exception went
uncaught and Ray was left in the shutdown state. Subsequent .remote()
calls triggered Ray's auto-init hook, which started a full local Ray
cluster on the admin machine. Concurrent requests each spawned their own
local cluster, exhausting memory and causing an OOM kill.

Changes:
- main.py: set RAY_ENABLE_AUTO_CONNECT=0 before any ray import to
  disable Ray auto-init entirely
- ray_service.py: wrap ray.init() in an asyncio.wait_for timeout to
  prevent indefinite blocking when the head-side specific-server hangs
- ray_service.py: add an 'except Exception' branch in _reconnect_ray to
  catch ray.init() failures and trigger _retry_ray_init recovery
- ray_service.py: verify the connection via ray.cluster_resources() after
  init (a successful init alone does not guarantee a usable connection)
- config.py: add the ray_init_timeout_seconds config option (default 60s)

Tests: add 4 cases covering exception capture, retry success/failure,
and init-hang timeout scenarios.
@FangwenDave requested a review from zhongwen666 on April 27, 2026 09:56
@FangwenDave changed the title from "Bugfix/fix ray re init fail #902" to "Bugfix/fix ray re-init fail #902" on Apr 27, 2026