FEAT-#6990: Implement lazy execution for the Ray virtual partitions. by AndreyPavlenko · Pull Request #6991 · modin-project/modin

AndreyPavlenko · 2024-03-01T20:32:34Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #6990 (Implement lazy execution for the Ray virtual partitions.)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date


anmyachev

Judging by the annotations, we need to write a lot more tests to cover most of the changes.

anmyachev · 2024-05-06T10:31:42Z

    "RayWrapper",
    "MaterializationHook",
    "SignalActor",
+    "RayObjectRefTypes",


This item has been deleted

anmyachev · 2024-05-06T10:33:24Z

        """
        if not isinstance(obj_ids, Sequence):
-            obj_ids = list(obj_ids)
+            obj_ids = list(obj_ids) if isinstance(obj_ids, Iterable) else [obj_ids]


Can be deleted.

anmyachev · 2024-05-06T10:34:15Z


    varname = "MODIN_LAZY_EXECUTION"
-    choices = ("Auto", "On", "Off")
+    choices = ("Auto", "On", "Off", "Axis")


Why introduce a new mode?

anmyachev · 2024-05-06T10:51:12Z

+            try:
+                ref = ray.get(ref, timeout=0)
+            except ray.exceptions.GetTimeoutError:
+                return False


If an object has been calculated and placed in distributed storage, will materialization occur here?

If this approach can be effective, then it is worth considering the possibility of using it in other places.
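The distinction behind this question (checking readiness versus fetching the value) can be illustrated with the standard library's `concurrent.futures`. This is only an analogy and makes no claim about how Ray's object store behaves. `Future.done()` reports readiness without returning the value, while `Future.result(timeout=0)` hands back the value when it is ready and raises `TimeoutError` otherwise. The helper names are hypothetical.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor, TimeoutError


def is_ready_by_fetch(future: Future) -> bool:
    """Probe readiness with a zero-timeout fetch (same try/except shape as the diff)."""
    try:
        # If the result is available, it is returned here, i.e. it IS fetched.
        future.result(timeout=0)
    except TimeoutError:
        return False
    return True


def is_ready_no_fetch(future: Future) -> bool:
    """Probe readiness without touching the value."""
    return future.done()


gate = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(gate.wait)  # the task blocks until the gate is opened
    before = (is_ready_by_fetch(pending), is_ready_no_fetch(pending))
    gate.set()                        # let the task finish
    pending.result()                  # wait for completion
    after = (is_ready_by_fetch(pending), is_ready_no_fetch(pending))
```

Both probes agree on readiness; they differ only in whether the value is pulled to the caller as a side effect, which is the cost the question asks about.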

anmyachev · 2024-05-06T10:55:02Z



-class SlicerHook(MaterializationHook):
+class SlicerHook(MaterializationHook, DeferredExecution):


What is the idea behind this change?

anmyachev · 2024-05-06T10:58:50Z

+from .partition import PandasOnRayDataframePartition
+
+
+class PandasOnRayDataframeVirtualPartition(BaseDataframeAxisPartition):


Why not such inheritance?

Suggested change

-class PandasOnRayDataframeVirtualPartition(BaseDataframeAxisPartition):
+class PandasOnRayDataframeVirtualPartition(PandasDataframeAxisPartition):

anmyachev · 2024-05-06T11:07:45Z

    _execution_wrapper = RayWrapper
    materialize_futures = RayWrapper.materialize

+    if LazyExecution.get() in ("On", "Axis"):


Whether this function is used is decided once, during the first import, with no possibility of changing it afterwards. As far as I remember, in all other places the choice is made on each call.
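A minimal sketch of the concern, using a hypothetical `LazyExecutionSetting` class as a stand-in for a config value that can change at runtime (it is not Modin's config API). A branch in a class body runs once, when the class is created, whereas a check inside the method runs on every call.

```python
class LazyExecutionSetting:
    """Hypothetical stand-in for a runtime-changeable config value."""

    _value = "Off"

    @classmethod
    def get(cls) -> str:
        return cls._value

    @classmethod
    def put(cls, value: str) -> None:
        cls._value = value


class ImportTimeManager:
    # This branch is evaluated once, when the class body executes.
    if LazyExecutionSetting.get() in ("On", "Axis"):

        @classmethod
        def get_indices(cls) -> str:
            return "lazy"

    else:

        @classmethod
        def get_indices(cls) -> str:
            return "eager"


class CallTimeManager:
    @classmethod
    def get_indices(cls) -> str:
        # This check is evaluated on every call, so later changes take effect.
        if LazyExecutionSetting.get() in ("On", "Axis"):
            return "lazy"
        return "eager"


LazyExecutionSetting.put("On")
import_time_result = ImportTimeManager.get_indices()  # variant fixed at class creation
call_time_result = CallTimeManager.get_indices()      # sees the updated setting
```

The call-time variant pays for a config lookup on each call, while the import-time variant silently ignores later changes to the setting.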

anmyachev · 2024-05-06T11:09:38Z

+
+        @classmethod
+        @_inherit_docstrings(GenericRayDataframePartitionManager.get_indices)
+        def get_indices(cls, axis, partitions, index_func=None):


Have you tried making lazy changes to the already existing get_indices? (without overriding)

When we call this get_indices, do we trigger the entire lazy execution tree? If so, do we keep the result the consumers depend on?

E.g., if we had a lazy apply and computed indices, would we keep the result of the apply?

What this function is trying to do is avoid concatenating the partitions. That is possible when all the partitions are the result of a deferred split operation. Look at the description of the find_non_split_block() function; it contains an example of such an execution tree. If we can find the non-split partition in the tree, we can take the index directly from it and thus avoid the concatenation.
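The idea in this reply can be sketched roughly as follows. This is not the PR's `find_non_split_block()` implementation: `Block`, `SplitChunk`, `find_common_parent`, and `get_index` are hypothetical names, and plain lists stand in for pandas indices. If the given partitions are exactly all the chunks of one split block, in order, the index is read from that block; otherwise the pieces are gathered one by one, which stands in for the concatenation path.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union


@dataclass
class Block:
    """Hypothetical non-split block whose index is already known."""

    index: List[str]


@dataclass
class SplitChunk:
    """Hypothetical deferred chunk: piece `position` of `num_splits` cut from `parent`."""

    parent: Block
    position: int
    num_splits: int


def chunk_index(chunk: SplitChunk) -> List[str]:
    """Compute the slice of the parent's index that this chunk covers."""
    size = -(-len(chunk.parent.index) // chunk.num_splits)  # ceiling division
    return chunk.parent.index[chunk.position * size : (chunk.position + 1) * size]


def find_common_parent(parts: Sequence[Union[Block, SplitChunk]]) -> Optional[Block]:
    """Return the parent block if `parts` are all of its chunks, in order."""
    if not parts or not all(isinstance(p, SplitChunk) for p in parts):
        return None
    first = parts[0]
    if len(parts) != first.num_splits:
        return None
    for pos, part in enumerate(parts):
        if part.parent is not first.parent or part.position != pos:
            return None
    return first.parent


def get_index(parts: Sequence[Union[Block, SplitChunk]]) -> List[str]:
    parent = find_common_parent(parts)
    if parent is not None:
        return parent.index  # shortcut: nothing to concatenate
    gathered: List[str] = []  # fallback: stands in for the concatenation path
    for part in parts:
        gathered.extend(part.index if isinstance(part, Block) else chunk_index(part))
    return gathered


block = Block(list("abcd"))
chunks = [SplitChunk(block, i, 2) for i in range(2)]
full = get_index(chunks)         # every chunk present: parent's index is reused
partial = get_index(chunks[:1])  # only one chunk: falls back to gathering
```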

anmyachev · 2024-05-06T11:11:16Z

+        PandasOnRayDataframeColumnPartition,
+        PandasOnRayDataframeRowPartition,
+        PandasOnRayDataframeVirtualPartition,


Have you tried making changes to existing classes?

anmyachev · 2024-05-06T11:14:43Z

+    axis = 0
+
+    @remote_function
+    def _remote_concat(dfs):  # pragma: no cover  # noqa: GL08


Are you sure that the concat works as intended, given the message about the naming of the function arguments?

anmyachev · 2024-05-06T11:20:35Z


 import pandas
 import ray
+import ray.exceptions


It seems that many of the changes in this file are not directly related to this pull request, so it would be great to move them into a separate pull request.

YarShev · 2024-05-23T18:34:33Z

 # governing permissions and limitations under the License.

 """Module houses class that implements ``GenericRayDataframePartitionManager`` using Ray."""
+import math


Suggested change

import math

import math

YarShev · 2024-05-23T18:35:51Z

-    PandasOnRayDataframeRowPartition,
-)
+
+if LazyExecution.get() in ("On", "Axis"):


This logic should probably be placed in modin/core/execution/ray/implementations/pandas_on_ray/partitioning/__init__.py.

YarShev · 2024-05-23T18:47:07Z

+# governing permissions and limitations under the License.
+
+"""Module houses classes responsible for storing a virtual partition and applying a function to it."""
+import math


Suggested change

import math

import math

YarShev · 2024-05-23T18:48:16Z

+    """
+
+    partition_type = PandasOnRayDataframePartition
+    instance_type = ray.ObjectRef


Suggested change

-    instance_type = ray.ObjectRef

@anmyachev, can this be removed?

YarShev · 2024-05-23T18:53:23Z

+        list of lengths or None
+            Estimated chunk lengths, that could be different form the real ones.
+        bool
+            Whether the specified partitions represent the full block or just the


Can you elaborate a little on this?

YarShev · 2024-05-23T19:54:33Z

+        manual_partition=False,
+        **kwargs,
+    ) -> Union[List[PandasOnRayDataframePartition], PandasOnRayDataframePartition]:
+        if not manual_partition:


Why does this parameter take effect only when it is False? Should we copy the related logic from the base class?

YarShev · 2024-05-24T15:38:46Z

+        lengths: Union[List[Union[ObjectRefType, int]], None],
+    ):
+        self.num_splits = num_splits
+        self.skip_chunks = set()


Let's add a comment explaining what this is for.

YarShev · 2024-05-24T15:49:25Z

+                        PandasOnRayDataframeColumnPartition
+                        if self.axis
+                        else PandasOnRayDataframeRowPartition


Suggested change

-                        PandasOnRayDataframeColumnPartition
-                        if self.axis
-                        else PandasOnRayDataframeRowPartition
+                        PandasOnRayDataframeRowPartition
+                        if self.axis
+                        else PandasOnRayDataframeColumnPartition

Should this be so?

arunjose696 · 2024-06-05T09:00:47Z

            if isinstance(obj, DeferredExecution):
-                if out_pos := getattr(obj, "out_pos", None):
+                if obj.has_result:
+                    obj = obj.data


Suggested change

-                    obj = obj.data
+                    out_append(obj.data)

I think it would be better to append obj.data in this if branch and remove the continue statements in all the else branches.

If obj.data is a list, we need to deconstruct it as well. That is why we assign it to obj and fall through to the if isinstance(obj, ListOrTuple) check.
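A simplified sketch of why the fall-through matters. It is not the PR's deconstruction code (which emits tags and positions into an output list); `Node` and `flatten` are hypothetical. Reassigning `obj` lets a node's result pass through the same list-or-tuple check as any other argument, so a result that is itself a list gets unpacked instead of being appended whole.

```python
from typing import Any, List


class Node:
    """Hypothetical deferred node; `data` holds the result once `has_result` is True."""

    def __init__(self, data: Any = None, has_result: bool = False):
        self.data = data
        self.has_result = has_result


def flatten(args: list) -> List[Any]:
    out: List[Any] = []
    for obj in args:
        if isinstance(obj, Node):
            if obj.has_result:
                # Fall through: the result may itself be a list or a tuple.
                obj = obj.data
            else:
                out.append("<deferred>")  # placeholder for a not-yet-computed node
                continue
        if isinstance(obj, (list, tuple)):
            out.extend(flatten(list(obj)))  # nested containers are unpacked too
        else:
            out.append(obj)
    return out


flat = flatten([1, Node([2, Node(3, True)], True), Node(), (4,)])
```

Appending `obj.data` directly in the first branch would have left `[2, Node(3, True)]` in the output as a single nested item.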

arunjose696 · 2024-06-05T09:01:15Z

+                    if obj.subscribers == 0:
+                        output[out_pos + 1] = 0
+                        result_consumers.remove(obj)
+                    continue


Suggested change

-                    continue

arunjose696 · 2024-06-05T09:01:35Z

                    yield cls._deconstruct_chain(obj, output, stack, result_consumers)
                    out_append(_Tag.END)
-            elif isinstance(obj, ListOrTuple):
+                    continue


Suggested change

-                    continue

arunjose696 · 2024-06-05T09:02:15Z

-            elif isinstance(obj, ListOrTuple):
+                    continue
+
+            if isinstance(obj, ListOrTuple):


Suggested change

-            if isinstance(obj, ListOrTuple):
+            elif isinstance(obj, ListOrTuple):

arunjose696 · 2024-06-05T09:06:54Z

+                    out_append(_Tag.REF)
+                    out_append(out_pos)
+                    output[out_pos] = out_pos
+                    if obj.subscribers == 0:
+                        output[out_pos + 1] = 0
+                        result_consumers.remove(obj)


As this code is duplicated

modin/modin/core/execution/ray/common/deferred_execution.py

Lines 326 to 333 in 92fe2f7

    if de.subscribers == 0:
        # We may have subscribed to the same node multiple times.
        # It could happen, for example, if it's passed to the args
        # multiple times, or it's one of the parent nodes and also
        # passed to the args. In this case, there are no multiple
        # subscribers, and we don't need to return the result.
        output[out_pos + 1] = 0
        result_consumers.remove(de)

and we have reason for this deconstruct_chain, could it be reused?

I don't think it makes sense to create a separate function just to reuse 3 lines of trivial code. Besides, it would cost a function call. A comment should probably be added here instead.

Yeah, a comment should be sufficient.

AndreyPavlenko force-pushed the issue-6990 branch 2 times, most recently from bf2943d to ea540cc Compare March 1, 2024 20:38

github-advanced-security AI found potential problems Mar 1, 2024


Comment thread modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py Fixed

YarShev mentioned this pull request Mar 13, 2024

FEAT-#7004: use generators when returning from _deploy_ray_func remote function. #7005

Merged

7 tasks

AndreyPavlenko force-pushed the issue-6990 branch 12 times, most recently from 2e3390b to d98324d Compare March 16, 2024 16:54

github-advanced-security AI found potential problems Mar 16, 2024


Comment thread modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py Fixed

AndreyPavlenko force-pushed the issue-6990 branch 13 times, most recently from 128509d to 8ce0b34 Compare March 20, 2024 20:17

AndreyPavlenko force-pushed the issue-6990 branch 10 times, most recently from 8cc6583 to b09a944 Compare April 12, 2024 19:35

AndreyPavlenko marked this pull request as ready for review April 12, 2024 20:59

AndreyPavlenko requested review from a team, RehanSD, YarShev, anmyachev, dchigarev, devin-petersohn, mvashishtha and vnlitvinov as code owners April 12, 2024 20:59

FEAT-modin-project#6990: Implement lazy execution for the Ray virtual…

92fe2f7

… partitions.

AndreyPavlenko force-pushed the issue-6990 branch from b09a944 to 92fe2f7 Compare April 16, 2024 13:21

anmyachev reviewed May 6, 2024


YarShev reviewed May 23, 2024


YarShev reviewed May 24, 2024


arunjose696 reviewed Jun 5, 2024




