Skip to content

[Data] Decouple sub-progress metrics from progress bar implementations#62262

Closed
400Ping wants to merge 6 commits into
ray-project:masterfrom
400Ping:data/decouple-SubProgressMixin
Closed

[Data] Decouple sub-progress metrics from progress bar implementations#62262
400Ping wants to merge 6 commits into
ray-project:masterfrom
400Ping:data/decouple-SubProgressMixin

Conversation

@400Ping
Copy link
Copy Markdown
Contributor

@400Ping 400Ping commented Apr 1, 2026

Summary

This PR refactors Ray Data sub-progress reporting so operators expose sub-progress as metrics instead of receiving concrete progress bar objects from progress managers.

Before this change, the flow looked like this:

  1. Operator exposed sub-progress bar names.
  2. Progress manager created sub-progress bar objects.
  3. Progress manager injected those bar objects back into the operator.
  4. Operator and scheduler logic updated those objects directly.

After this change, the flow is:

  1. Operator exposes Dict[str, ProgressMetrics].
  2. Driver-side execution updates sub-progress through internal updater helpers.
  3. Progress managers read the metrics and attach display-specific state as needed.
  4. Progress managers own rendering, while operators own execution state.

This removes the previous bi-directional coupling between operators and progress bar implementations.

What Changed

API changes

  • Replace the sub-progress name-only interface with a metrics-based interface:
    • get_sub_progress_metrics() -> Dict[str, ProgressMetrics]
  • ProgressMetrics is now a frozen dataclass snapshot with:
    • name
    • total
    • completed

Internal execution changes

  • Introduce internal SubProgressUpdater helpers for driver-side mutation paths that still need:
    • update(...)
    • fetch_until_complete(...)
    • block_until_complete(...)
  • TaskContext now carries:
    • sub_progress_metrics
    • sub_progress_updaters

Progress manager changes

  • tqdm, rich, logging, and noop progress managers no longer inject sub-progress bar objects back into operators.
  • Managers now attach display-specific state to internal updater helpers and render from metrics.
  • NoopSubProgressBar is no longer needed.

Operator and scheduler updates

Updated the following sub-progress users to the new model:

  • AllToAllOperator
  • CPU hash shuffle
  • GPU shuffle
  • sort / aggregate sampling
  • pull-based shuffle scheduler
  • push-based shuffle scheduler
  • split repartition scheduler

Why This Design

This PR keeps the external sub-progress interface simple and metrics-based, while preserving existing driver-side execution behavior for shuffle/sample paths that still need incremental wait-and-update helpers.

Conceptually:

Before:
Operator <-> ProgressManager <-> ProgressBar object

After:
Operator -> ProgressMetrics
Execution -> SubProgressUpdater -> ProgressMetrics
ProgressManager -> render ProgressMetrics

This is a middle ground that removes bar injection and keeps the metrics interface clean, without rewriting all driver-side progress update paths at once.

Related Issues

@400Ping 400Ping requested a review from a team as a code owner April 1, 2026 15:01
Comment thread python/ray/data/_internal/execution/interfaces/task_context.py Outdated
Comment thread python/ray/data/_internal/progress/tqdm_progress.py
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the sub-progress tracking system in Ray Data, transitioning from a direct progress bar dictionary to a more robust metrics-based approach using ProgressMetrics and SubProgressUpdater. The changes involve updating various physical operators (AllToAll, HashShuffle, GPUShuffle) and progress managers (Logging, Rich, Tqdm) to utilize the new SubProgressMixin. Key feedback includes addressing potential type mismatches and state inconsistencies in legacy setters, ensuring thread safety in the SubProgressUpdater.update method to prevent race conditions, and improving maintainability by using constants for hardcoded strings and name lengths.

Comment thread python/ray/data/_internal/execution/interfaces/task_context.py Outdated
Comment thread python/ray/data/_internal/progress/base_progress.py Outdated
Comment thread python/ray/data/_internal/gpu_shuffle/hash_shuffle.py Outdated
Comment thread python/ray/data/_internal/execution/operators/base_physical_operator.py Outdated
@400Ping 400Ping marked this pull request as draft April 1, 2026 16:50
@400Ping 400Ping marked this pull request as ready for review April 1, 2026 17:01
Comment thread python/ray/data/_internal/execution/operators/base_physical_operator.py Outdated
Comment thread python/ray/data/_internal/execution/operators/sub_progress.py Outdated
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Apr 1, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 16, 2026
@400Ping
Copy link
Copy Markdown
Contributor Author

400Ping commented Apr 16, 2026

Unstale

@github-actions github-actions Bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Apr 17, 2026
Comment thread python/ray/data/_internal/gpu_shuffle/hash_shuffle.py Outdated
@kyuds kyuds self-requested a review May 1, 2026 16:50
Copy link
Copy Markdown
Member

@kyuds kyuds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay @400Ping, but left some initial comments. If you could also rebase to latest master and resolve the merge conflicts, I'll do more testing on my side!

Comment thread python/ray/data/_internal/execution/operators/hash_shuffle.py Outdated
Comment thread python/ray/data/_internal/execution/operators/sub_progress.py Outdated
Comment thread python/ray/data/_internal/gpu_shuffle/hash_shuffle.py Outdated

class NoopSubProgressBar(BaseProgressBar):
"""Sub-Progress Bar for Noop (Disabled) Progress Manager"""
@dataclass(frozen=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the purpose of keeping this class frozen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is frozen because ProgressMetrics is meant to be an immutable snapshot, not a mutable progress bar object. SubProgressUpdater updates progress by replacing the snapshot, while progress managers only read/sync from it. This keeps rendering code from mutating operator-owned execution state directly.

Comment thread python/ray/data/_internal/progress/tqdm_progress.py Outdated
Comment thread python/ray/data/_internal/progress/rich_progress.py
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit f164692. Configure here.

Comment thread python/ray/data/_internal/progress/logging_progress.py
@400Ping 400Ping closed this May 7, 2026
@400Ping 400Ping reopened this May 7, 2026
400Ping added 6 commits May 7, 2026 13:06
Signed-off-by: 400Ping <jiekaichang@apache.org>
Signed-off-by: 400Ping <jiekaichang@apache.org>
Signed-off-by: 400Ping <jiekaichang@apache.org>
Signed-off-by: 400Ping <jiekaichang@apache.org>
Signed-off-by: 400Ping <jiekaichang@apache.org>
Signed-off-by: 400Ping <jiekaichang@apache.org>
@400Ping 400Ping force-pushed the data/decouple-SubProgressMixin branch from 8f24ad6 to 827b23a Compare May 7, 2026 05:08
@400Ping 400Ping closed this May 7, 2026
@400Ping
Copy link
Copy Markdown
Contributor Author

400Ping commented May 7, 2026

Opening a new on at #63195 , there seems to be a PR git ref error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Decouple SubProgressMixin from progress bar

2 participants