Skip to content

WIP: feat: Add router in rollout controller for simpler proxy server usage#1013

Draft
zhanghaotong wants to merge 2 commits into
areal-project:mainfrom
zhanghaotong:add-router-in-rollout-controller
Draft

WIP: feat: Add router in rollout controller for simpler proxy server usage#1013
zhanghaotong wants to merge 2 commits into
areal-project:mainfrom
zhanghaotong:add-router-in-rollout-controller

Conversation

@zhanghaotong
Copy link
Copy Markdown
Contributor

Description

This PR introduces a ProxyRouter in RolloutController to centralize worker selection and proxy address resolution, making proxy server usage simpler and more extensible.

Key changes:

  • Adds a Strategy-pattern router (round-robin, random) for selecting rollout workers, easy to extend with new strategies.
  • Adds worker_routing_strategy to InferenceEngineConfig to configure routing behavior.
  • Updates RolloutController to use the router for both worker selection and proxy address lookup (with backward-compatible wrappers).
  • Adds/updates CLI reference docs to document the new config field.
  • Adds unit tests covering routing strategies and proxy address handling.

Related Issue

#907

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):
N/A

Additional Context

  • New config: InferenceEngineConfig.worker_routing_strategy (choices: round-robin, random; default: round-robin).
  • Tests added in tests/test_proxy_router.py.
  • Docs updated: docs/en/cli_reference.md, docs/zh/cli_reference.md.

Need help? Check the Contributing Guide or ask in
GitHub Discussions!

Signed-off-by: 皓聪 <zhanghaotong.zht@antgroup.com>
Signed-off-by: 皓聪 <zhanghaotong.zht@antgroup.com>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the worker selection and proxy address management within the RolloutController by introducing a ProxyRouter. This change centralizes routing logic, making it more modular and easier to extend with new strategies like random selection, while also simplifying how proxy servers are utilized. The update includes a new configuration option for users to specify their preferred routing strategy and ensures backward compatibility for existing functionalities.

Highlights

  • Centralized Worker Routing: Introduced a new ProxyRouter in the RolloutController to centralize worker selection and proxy address resolution, simplifying proxy server usage and enhancing extensibility.
  • Pluggable Routing Strategies: Implemented a Strategy pattern for worker selection, allowing for different routing behaviors such as 'round-robin' (default) and 'random', with easy extensibility for future strategies.
  • New Configuration Option: Added a worker_routing_strategy field to InferenceEngineConfig, enabling users to configure the desired routing behavior via CLI.
  • Backward Compatibility: Existing methods like get_proxy_addr and _choose_worker in RolloutController were updated to use the new ProxyRouter internally while maintaining their original interfaces for backward compatibility.
  • Enhanced Test Coverage: Added comprehensive unit tests for the new ProxyRouter and its associated worker selection strategies, ensuring robustness and correctness.
  • Documentation Updates: Updated CLI reference documentation in both English and Chinese to reflect the new worker_routing_strategy configuration option.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • areal/api/cli_args.py
    • Added worker_routing_strategy field to InferenceEngineConfig with 'round-robin' (default) and 'random' choices for worker routing.
  • areal/infra/controller/rollout_controller.py
    • Introduced RoutingStrategy enum for worker routing options.
    • Defined WorkerSelector abstract base class and concrete implementations RoundRobinSelector and RandomSelector following the Strategy pattern.
    • Implemented ProxyRouter class to encapsulate worker selection and proxy address management.
    • Replaced the _current_worker_idx field with _proxy_router in RolloutController.
    • Initialized _proxy_router in _async_initialize using the configured worker_routing_strategy.
    • Updated _async_start_proxy to pass proxy addresses to the _proxy_router.
    • Modified get_proxy_addr to delegate to _proxy_router.get_proxy_addr for backward compatibility.
    • Added route_worker_with_proxy as a new public method to directly use the ProxyRouter for combined worker and proxy address retrieval.
    • Updated _choose_worker to delegate worker selection to _proxy_router.route for backward compatibility, including a fallback to legacy round-robin if the router is not initialized.
  • docs/en/cli_reference.md
    • Documented the new worker_routing_strategy field in the English CLI reference, detailing its purpose and available choices.
  • docs/zh/cli_reference.md
    • Documented the new worker_routing_strategy field in the Chinese CLI reference, detailing its purpose and available choices.
  • tests/test_proxy_router.py
    • Added a new test file test_proxy_router.py.
    • Implemented unit tests for ProxyRouter covering scenarios without proxies, with proxies, get_proxy_addr, update_proxy_addrs, empty worker lists, random strategy, string strategy conversion, invalid strategy handling, and reset functionality.
    • Included a test to verify the proper implementation of the Strategy pattern.
Activity
  • A new feature was implemented to add a router in the rollout controller.
  • New unit tests were added in tests/test_proxy_router.py to cover the new functionality.
  • Documentation for the CLI reference was updated in both English (docs/en/cli_reference.md) and Chinese (docs/zh/cli_reference.md).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a ProxyRouter using the Strategy pattern to handle worker selection in the RolloutController, which is a great improvement for extensibility. The implementation is solid, and the addition of unit tests and documentation is appreciated.

I've found a critical issue in the fallback logic of _choose_worker that would break the round-robin behavior. I've also pointed out a significant maintainability issue in the new test file tests/test_proxy_router.py due to code duplication. Addressing these points will make this feature more robust and maintainable.

Comment on lines +896 to +903
if self._proxy_router is not None:
worker, rank, _ = self._proxy_router.route()
return worker, rank

# Fallback to legacy round-robin if router not initialized
fallback_selector = RoundRobinSelector()
rank = fallback_selector.select(self.workers)
return self.workers[rank], rank
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The fallback logic for _choose_worker is incorrect. It creates a new RoundRobinSelector on every call, which will always select the first worker (rank 0) because its internal index is reset to 0 upon instantiation. This breaks the expected round-robin behavior and will cause all fallback requests to be routed to a single worker, potentially overloading it.

Since _proxy_router should be initialized before this method is called, it's safer to raise an error if it's not available, similar to how route_worker_with_proxy handles it.

Suggested change
if self._proxy_router is not None:
worker, rank, _ = self._proxy_router.route()
return worker, rank
# Fallback to legacy round-robin if router not initialized
fallback_selector = RoundRobinSelector()
rank = fallback_selector.select(self.workers)
return self.workers[rank], rank
if self._proxy_router is None:
raise RuntimeError("ProxyRouter not initialized. Call initialize() first.")
worker, rank, _ = self._proxy_router.route()
return worker, rank

Comment on lines +9 to +143
@dataclass
class MockWorker:
"""Mock Worker for testing."""

id: str
ip: str = "127.0.0.1"


class RoutingStrategy(str, Enum):
"""Enumeration of available worker routing strategies."""

ROUND_ROBIN = "round-robin"
RANDOM = "random"


class WorkerSelector(ABC):
"""Abstract base class for worker selection strategies."""

@abstractmethod
def select(self, workers: list[MockWorker]) -> int:
"""Select a worker and return its rank."""
pass

@abstractmethod
def reset(self) -> None:
"""Reset the selector state."""
pass


class RoundRobinSelector(WorkerSelector):
"""Round-robin worker selection strategy."""

def __init__(self):
self._current_idx = 0

def select(self, workers: list[MockWorker]) -> int:
"""Select the next worker in round-robin order."""
if not workers:
raise RuntimeError("No workers available to choose from.")

rank = self._current_idx
self._current_idx = (self._current_idx + 1) % len(workers)
return rank

def reset(self) -> None:
"""Reset the round-robin index to 0."""
self._current_idx = 0


class RandomSelector(WorkerSelector):
"""Random worker selection strategy."""

def select(self, workers: list[MockWorker]) -> int:
"""Randomly select a worker."""
if not workers:
raise RuntimeError("No workers available to choose from.")

return random.randint(0, len(workers) - 1)

def reset(self) -> None:
"""No-op for stateless random strategy."""
pass


class ProxyRouter:
"""Router for choosing workers and managing proxy addresses using Strategy Pattern."""

# Strategy factory: maps RoutingStrategy enum to WorkerSelector classes
_SELECTOR_FACTORY: dict[RoutingStrategy, type[WorkerSelector]] = {
RoutingStrategy.ROUND_ROBIN: RoundRobinSelector,
RoutingStrategy.RANDOM: RandomSelector,
}

def __init__(
self,
workers: list[MockWorker],
proxy_addrs: list[str] | None = None,
routing_strategy: RoutingStrategy | str = RoutingStrategy.ROUND_ROBIN,
):
"""Initialize the ProxyRouter."""
self.workers = workers
self.proxy_addrs = proxy_addrs or []
self._proxy_enabled = bool(proxy_addrs)

# Convert string to enum if necessary
if isinstance(routing_strategy, str):
try:
self.strategy = RoutingStrategy(routing_strategy)
except ValueError:
valid_strategies = ", ".join([s.value for s in RoutingStrategy])
raise ValueError(
f"Invalid routing_strategy: {routing_strategy}. "
f"Must be one of: {valid_strategies}"
)
else:
self.strategy = routing_strategy

# Create the appropriate selector using the factory
selector_class = self._SELECTOR_FACTORY.get(self.strategy)
if selector_class is None:
raise ValueError(
f"No selector implementation found for strategy: {self.strategy}"
)
self.selector = selector_class()

def route(self) -> tuple[MockWorker, int, str | None]:
"""Choose a worker and get its proxy address."""
# Delegate selection to the strategy
rank = self.selector.select(self.workers)
worker = self.workers[rank]

# Get proxy address if available
proxy_addr = (
self.proxy_addrs[rank]
if self._proxy_enabled and rank < len(self.proxy_addrs)
else None
)

return worker, rank, proxy_addr

def get_proxy_addr(self, rank: int) -> str | None:
"""Get the proxy server address for a specific worker rank."""
if not self._proxy_enabled or rank >= len(self.proxy_addrs):
return None
return self.proxy_addrs[rank]

def update_proxy_addrs(self, proxy_addrs: list[str]) -> None:
"""Update the proxy addresses."""
self.proxy_addrs = proxy_addrs
self._proxy_enabled = bool(proxy_addrs)

def reset(self) -> None:
"""Reset the routing strategy's internal state."""
self.selector.reset()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The test file duplicates several classes (RoutingStrategy, WorkerSelector, RoundRobinSelector, RandomSelector, ProxyRouter) from areal.infra.controller.rollout_controller, and also defines a MockWorker. This is a significant maintainability issue, as any changes to the original classes will require manual updates in the test file, and they can easily get out of sync.

Please remove the duplicated class definitions (lines 17-143) and the MockWorker class (lines 9-14). Instead, import the necessary components from the application code.

# At the top of tests/test_proxy_router.py

# Add these imports
from areal.api import Worker
from areal.infra.controller.rollout_controller import (
    ProxyRouter,
    RandomSelector,
    RoundRobinSelector,
    RoutingStrategy,
)

# In your test functions, replace MockWorker with Worker, for example:
# workers = [Worker(id=f"worker-{i}", ip="127.0.0.1") for i in range(3)]

This makes the tests more robust and easier to maintain.

Copy link
Copy Markdown
Collaborator

@garrett4wade garrett4wade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. While this PR creates a new abstraction for the router, it does not introduce useful features like workload-aware scheduling and effectively benchmark the performance. I think we should defer the change of API before seeing clear inference throughput improvement under a smarter scheduling strategy.

@zhanghaotong
Copy link
Copy Markdown
Contributor Author

zhanghaotong commented Mar 10, 2026

Thanks for the contribution. While this PR creates a new abstraction for the router, it does not introduce useful features like workload-aware scheduling and effectively benchmark the performance. I think we should defer the change of API before seeing clear inference throughput improvement under a smarter scheduling strategy.

Sure, I'll write some more useful scheduling strategies. Let me first mark this PR as WIP

@zhanghaotong zhanghaotong changed the title feat: Add router in rollout controller for simpler proxy server usage WIP: feat: Add router in rollout controller for simpler proxy server usage Mar 10, 2026
@garrett4wade garrett4wade marked this pull request as draft March 16, 2026 01:58
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 8, 2026

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions Bot added the stale label Apr 8, 2026
@github-actions github-actions Bot removed the stale label Apr 24, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions Bot added the stale label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants