ray-project
diff --git a/doc/source/serve/model-multiplexing.md b/doc/source/serve/model-multiplexing.md
Lines changed: 14 additions & 0 deletions
diff --git a/doc/source/serve/monitoring.md b/doc/source/serve/monitoring.md
Lines changed: 18 additions & 0 deletions
diff --git a/doc/source/serve/production-guide/fault-tolerance.md b/doc/source/serve/production-guide/fault-tolerance.md
Lines changed: 29 additions & 0 deletions
diff --git a/python/ray/data/_internal/arrow_ops/transform_pyarrow.py b/python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Lines changed: 47 additions & 9 deletions
diff --git a/python/ray/data/_internal/execution/operators/zip_operator.py b/python/ray/data/_internal/execution/operators/zip_operator.py
Lines changed: 8 additions & 0 deletions
diff --git a/python/ray/data/_internal/split.py b/python/ray/data/_internal/split.py
Lines changed: 1 addition & 1 deletion
diff --git a/python/ray/data/tests/test_split.py b/python/ray/data/tests/test_split.py
Lines changed: 8 additions & 8 deletions
diff --git a/python/ray/data/tests/test_zip.py b/python/ray/data/tests/test_zip.py
Lines changed: 35 additions & 0 deletions
diff --git a/python/ray/data/tests/unit/test_transform_pyarrow.py b/python/ray/data/tests/unit/test_transform_pyarrow.py
Lines changed: 74 additions & 0 deletions
diff --git a/python/ray/includes/gcs_client.pxi b/python/ray/includes/gcs_client.pxi
Lines changed: 19 additions & 2 deletions
@@ -85,6 +85,20 @@ When using model composition, you can send requests from an upstream deployment
 :end-before: __serve_model_composition_example_end__
 ```
 
+## Configuring model ID matching timeout
+
+When a request arrives with a `serve_multiplexed_model_id`, the Serve router attempts to match it to a replica that already has the model loaded. If no matching replica becomes available within the timeout, the request falls back to the default routing strategy and is sent to any available replica, which then loads the model on demand.
+
+You can configure this timeout using the `RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S` environment variable:
+
+```bash
+export RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S=2.0
+```
+
+**Default**: `1.0` second. To avoid thundering herd problems when many requests for the same unloaded model arrive concurrently, the actual timeout is randomized between this value and `value * 2` (for example, 1.0–2.0 seconds by default).
+
+Increase this timeout if your models take a long time to load and you prefer to wait for a replica that already has the model loaded. Decrease it if you prefer faster fallback to any available replica.
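
The randomized timeout described above can be illustrated with a short sketch. This is not Serve's router code: the helper name `sample_matching_timeout_s` and the `environ` parameter are hypothetical, and the only details taken from the docs are the environment variable name, the `1.0` default, and the `[value, value * 2]` range.

```python
import os
import random
from typing import Mapping, Optional

# Variable name and default as documented above.
ENV_VAR = "RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S"


def sample_matching_timeout_s(
    rng: random.Random, environ: Optional[Mapping[str, str]] = None
) -> float:
    """Hypothetical helper: pick a jittered timeout in [base, 2 * base]."""
    env = os.environ if environ is None else environ
    base = float(env.get(ENV_VAR, "1.0"))
    # Jitter spreads concurrent requests for the same unloaded model so
    # they do not all fall back to default routing at the same instant.
    return rng.uniform(base, 2 * base)


rng = random.Random(0)
default_samples = [sample_matching_timeout_s(rng, environ={}) for _ in range(5)]
tuned_samples = [
    sample_matching_timeout_s(rng, environ={ENV_VAR: "2.0"}) for _ in range(5)
]
```

With the variable unset, every sample falls between 1.0 and 2.0 seconds; with it set to `2.0`, between 2.0 and 4.0 seconds.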
+
 ## Using model multiplexing with batching
 
 You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.
 
@@ -571,6 +571,24 @@ Prometheus histograms aggregate data into predefined buckets, which can affect t
 For accurate percentile calculations, configure bucket boundaries that closely match your expected latency distribution. For example, if most requests complete in 10-100ms, use finer-grained buckets in that range.
 :::
 
+### Metrics export interval
+
+By default, Ray Serve batches its in-process metric updates (counters, gauges, histograms recorded by the router and replica) to reduce per-request overhead. You can configure how often Serve flushes these batched updates to the Ray metrics API using the `RAY_SERVE_METRICS_EXPORT_INTERVAL_MS` environment variable:
+
+```bash
+export RAY_SERVE_METRICS_EXPORT_INTERVAL_MS=500
+```
+
+**Default**: `100` milliseconds. Set to `0` to disable batching entirely and record every metric update eagerly. This interval applies to both the router and replica metric pipelines.
+
+Increasing this value reduces the overhead of recording metrics, at the cost of less frequent updates. Decreasing it gives fresher values, but Serve flushes more often and the recording overhead rises accordingly.
+
+:::{note}
+`RAY_SERVE_METRICS_EXPORT_INTERVAL_MS` only controls Serve-side batching; it does **not** change how often Ray exports metrics to the Prometheus scrape endpoint. That is controlled separately by Ray Core's `metrics_report_interval_ms` system config (default `10000` ms), which determines how often each Ray process pushes its metrics to the metrics agent that Prometheus scrapes.
+
+The two settings compose: a Serve metric update is first buffered in the router/replica for up to `RAY_SERVE_METRICS_EXPORT_INTERVAL_MS`, then made available at the Prometheus endpoint on the next Ray Core export tick (`metrics_report_interval_ms`). Lowering only `RAY_SERVE_METRICS_EXPORT_INTERVAL_MS` without also lowering `metrics_report_interval_ms` does not make metrics appear in Prometheus any sooner. To change the Ray Core interval, pass it via system config when starting Ray, e.g. `ray start --head --system-config='{"metrics_report_interval_ms": 1000}'`.
+:::
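
How the two intervals compose can be checked with simple arithmetic. The sketch below is a simplified model, not Ray code: the function name is hypothetical, it treats the delay as the sum of the two buffering stages, and it ignores Prometheus's own scrape interval.

```python
def worst_case_export_delay_ms(
    serve_interval_ms: int, core_interval_ms: int = 10_000
) -> int:
    """Upper bound on the time before an update reaches the scrape endpoint.

    Assumes an update can wait a full Serve flush interval and then a
    full Ray Core export interval (simplified model).
    """
    return serve_interval_ms + core_interval_ms


# Documented defaults: 100 ms Serve batching, 10000 ms Ray Core export.
defaults = worst_case_export_delay_ms(100)
# Disabling Serve batching alone barely changes the bound.
serve_only = worst_case_export_delay_ms(0)
# Lowering the Ray Core interval is what shortens it substantially.
both_lowered = worst_case_export_delay_ms(100, core_interval_ms=1_000)
```

Under this model the default bound is 10100 ms, dropping Serve batching to `0` only brings it to 10000 ms, and lowering `metrics_report_interval_ms` to 1000 ms brings it to 1100 ms.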
+
 ### Request lifecycle and metrics
 
 The following diagram shows where metrics are captured along the request path:
 
@@ -686,5 +686,34 @@ Table:
 
 Note that the PID for the first ProxyActor has changed, indicating that it restarted.
 
+## Environment variables
+
+These environment variables control behavior related to fault tolerance. Set them before starting Ray.
+
+### `RAY_SERVE_KV_TIMEOUT_S`
+
+**Default**: None (no timeout)
+
+Ray Serve persists deployment configurations and state in the Global Control Store (GCS) using its internal KV interface. Each read and write to the GCS KV store uses this timeout. By default, no timeout is set and these operations block until the GCS responds. If the GCS becomes unavailable (for example, during a head node restart), Serve operations that depend on the KV store — such as fetching or updating deployment configs — hang until the GCS recovers.
+
+Setting this value causes those operations to fail fast with a timeout error instead of blocking indefinitely, allowing Serve to detect GCS failures and trigger recovery sooner.
+
+```bash
+export RAY_SERVE_KV_TIMEOUT_S=5
+```
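
The difference between blocking and failing fast can be sketched with the standard library alone. Nothing below is Serve or GCS code: `slow_kv_get` is a hypothetical stand-in for a KV read against an unresponsive store, and `timeout_s=None` plays the role of the default (no timeout).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Optional


def slow_kv_get(delay_s: float) -> str:
    """Hypothetical KV read that takes delay_s seconds to respond."""
    time.sleep(delay_s)
    return "deployment-config"


def kv_get_with_timeout(delay_s: float, timeout_s: Optional[float]) -> str:
    """Wait for the read; timeout_s=None blocks until it returns."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(slow_kv_get, delay_s).result(timeout=timeout_s)
    finally:
        # Do not wait for a stuck call; let the caller move on.
        pool.shutdown(wait=False)


# No timeout: the caller waits however long the store takes.
value = kv_get_with_timeout(0.01, None)

# With a timeout: the caller gets an error once the deadline passes.
try:
    kv_get_with_timeout(0.3, 0.05)
    timed_out = False
except FutureTimeout:
    timed_out = True
```

The second call surfaces a timeout error after roughly 50 ms instead of waiting for the slow read, which is the behavior the variable opts into.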
+
+### `LISTEN_FOR_CHANGE_REQUEST_TIMEOUT_S_LOWER_BOUND` / `LISTEN_FOR_CHANGE_REQUEST_TIMEOUT_S_UPPER_BOUND`
+
+**Defaults**: `30` / `60` seconds
+
+Ray Serve uses a long-polling mechanism for replicas and proxies to receive configuration updates from the controller. Each long-poll request uses a random timeout between the lower and upper bounds to avoid thundering herd problems when many clients poll simultaneously.
+
+```bash
+export LISTEN_FOR_CHANGE_REQUEST_TIMEOUT_S_LOWER_BOUND=10
+export LISTEN_FOR_CHANGE_REQUEST_TIMEOUT_S_UPPER_BOUND=30
+```
+
+Decreasing these values makes replicas and proxies detect controller changes faster, which can speed up recovery after controller restarts. Increasing them reduces the frequency of long-poll requests to the controller.
+
 [KubeRay]: kuberay-index
 [external storage namespace]: kuberay-external-storage-namespace
@@ -75,28 +75,66 @@ def _create_empty_table(schema: "pyarrow.Schema"):
     return pa.table(arrays, schema=schema)
 
 
+def _has_unhashable_pandas_types(schema: "pyarrow.Schema") -> bool:
+    """Check if any column type becomes unhashable after to_pandas() conversion.
+
+    Nested PyArrow types (struct/list/large_list/fixed_size_list/map/union and
+    their view variants) convert to Python dicts/lists, and Ray's tensor and
+    Python-object extension types convert to numpy arrays / Python objects.
+    None of these are hashable by pandas' hash_pandas_object. We check the
+    schema upfront so the hash algorithm choice is deterministic per schema,
+    not per block data.
+    """
+    from ray.data._internal.object_extensions.arrow import ArrowPythonObjectType
+
+    tensor_types = get_arrow_extension_tensor_types()
+    for field in schema:
+        # `is_nested` covers struct/list/large_list/map/union and (on pyarrow
+        # 16+) list_view/large_list_view. It does NOT include fixed_size_list
+        # on older pyarrow (<10-ish), so check that explicitly.
+        if pyarrow.types.is_nested(field.type) or pyarrow.types.is_fixed_size_list(
+            field.type
+        ):
+            return True
+        if isinstance(field.type, tensor_types):
+            return True
+        if isinstance(field.type, ArrowPythonObjectType):
+            return True
+    return False
+
+
 def _hash_partition(
     table: "pyarrow.Table",
     num_partitions: int,
 ) -> np.ndarray:
-
-    # NOTE: We special casing-scenario of single column with integer type
+    # NOTE: We special-case single column with integer type,
     #       short-circuiting the need for hashing the column and instead
-    #       using values as is for partitioning
+    #       using values as-is for partitioning.
     if len(table.columns) == 1 and pyarrow.types.is_integer(table.column(0).type):
         target_column = table.column(0)
         partitions = (target_column.to_numpy() % num_partitions).astype(np.int64)
-    else:
-        # Otherwise fallback to invoking __hash__ on Pyarrow scalars filling out
-        # target table
+    elif _has_unhashable_pandas_types(table.schema):
+        # Nested, tensor, and Python-object columns convert to unhashable
+        # values in pandas. Use row-by-row hashing on PyArrow scalars instead.
         partitions = np.zeros((table.num_rows,), dtype=np.int64)
-
         for i in range(table.num_rows):
             _tuple = tuple(c[i] for c in table.columns)
             partitions[i] = hash(_tuple) % num_partitions
+    else:
+        # Use pandas' vectorized hashing (hash_pandas_object) instead of a
+        # Python row-by-row loop.
+        import pandas as pd
+
+        # Use types_mapper=pd.ArrowDtype to keep Arrow-backed extension arrays
+        # in pandas. This avoids int64 -> float64 promotion for nullable integer
+        # columns, which would cause the same value to hash differently across
+        # blocks depending on whether the block contains nulls.
+        hashes = pd.util.hash_pandas_object(
+            table.to_pandas(types_mapper=pd.ArrowDtype), index=False
+        ).values
+        np.mod(hashes, num_partitions, out=hashes)
+        partitions = hashes
 
-    # Convert to ndarray to compute hash partition indices
-    # more efficiently
     return partitions
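
The int64 -> float64 promotion concern in the comment above can be made concrete without pandas. A hash computed over a column's raw 64-bit values sees different bits for integer `7` and float `7.0`, so the same key can land in different partitions depending on the block's dtype. The sketch below uses `struct` and `hashlib.blake2b` purely as stand-ins; the hash function and the helper name `bucket_of` are assumptions for illustration, not what pandas or Ray Data use.

```python
import hashlib
import struct


def bucket_of(raw: bytes, num_partitions: int) -> int:
    """Hypothetical helper: map a raw 8-byte value to a partition index."""
    digest = hashlib.blake2b(raw, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_partitions


# The same logical key, encoded as int64 and as float64.
as_int64 = struct.pack("<q", 7)
as_float64 = struct.pack("<d", 7.0)

# The byte patterns differ, so a bit-level hash treats them as
# unrelated keys even though 7 == 7.0 in Python.
same_bits = as_int64 == as_float64

int_bucket = bucket_of(as_int64, 8)
float_bucket = bucket_of(as_float64, 8)
```

Keeping the column's integer type in every block (as `types_mapper=pd.ArrowDtype` is meant to do in the hunk above) keeps the bit pattern, and therefore the partition, stable for a given key.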
 
 
 
@@ -222,9 +222,17 @@ def _zip(
         # cumulative number of rows as that left block.
         # NOTE: _split_at_indices has a no-op fastpath if the blocks are already
         # aligned.
+        # Determine the ownership of the blocks being split, accounting for the
+        # potential swap above. We must not free blocks that are shared with
+        # other operators (e.g., when the input RefBundle has owns_blocks=False
+        # because it comes from a materialized dataset).
+        split_side_owned = all(
+            b.owns_blocks for b in (left_input if input_side_inverted else right_input)
+        )
         aligned_right_blocks_with_metadata = _split_at_indices(
             right_blocks_with_metadata,
             indices,
+            owned_by_consumer=split_side_owned,
             block_rows=right_block_rows,
         )
         del right_blocks_with_metadata
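
The ownership rule in this hunk reduces to one predicate: blocks may be freed after splitting only if every input bundle on the split side owns its blocks. A minimal sketch, where `Bundle` is a hypothetical stand-in for the real bundle type and models only the `owns_blocks` flag:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bundle:
    """Hypothetical stand-in; only the ownership flag is modeled."""

    owns_blocks: bool


def may_free_after_split(bundles: List[Bundle]) -> bool:
    """True only when no bundle shares its blocks with another consumer."""
    return all(b.owns_blocks for b in bundles)


# Freshly produced blocks: safe to free once split.
exclusive = may_free_after_split([Bundle(True), Bundle(True)])
# One bundle comes from a materialized dataset: must not free.
shared = may_free_after_split([Bundle(True), Bundle(False)])
```

A single shared bundle is enough to disable freeing for the whole side, which matches the conservative `all(...)` check in the diff.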
 
@@ -247,7 +247,7 @@ def _generate_global_split_results(
 def _split_at_indices(
     blocks_with_metadata: List[Tuple[ObjectRef[Block], BlockMetadata]],
     indices: List[int],
-    owned_by_consumer: bool = True,
+    owned_by_consumer: bool,
     block_rows: List[int] = None,
 ) -> Tuple[List[List[ObjectRef[Block]]], List[List[BlockMetadata]]]:
     """Split blocks at the provided indices.
 
@@ -624,35 +624,35 @@ def test_generate_global_split_results(ray_start_regular_shared_2_cpus):
 
 def test_private_split_at_indices(ray_start_regular_shared_2_cpus):
     inputs = _create_blocks_with_metadata([])
-    splits = list(zip(*_split_at_indices(inputs, [0])))
+    splits = list(zip(*_split_at_indices(inputs, [0], True)))
     verify_splits(splits, [[], []])
 
-    splits = list(zip(*_split_at_indices(inputs, [])))
+    splits = list(zip(*_split_at_indices(inputs, [], True)))
     verify_splits(splits, [[]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
 
-    splits = list(zip(*_split_at_indices(inputs, [1])))
+    splits = list(zip(*_split_at_indices(inputs, [1], True)))
     verify_splits(splits, [[[1]], [[2, 3], [4]]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
-    splits = list(zip(*_split_at_indices(inputs, [2])))
+    splits = list(zip(*_split_at_indices(inputs, [2], True)))
     verify_splits(splits, [[[1], [2]], [[3], [4]]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
-    splits = list(zip(*_split_at_indices(inputs, [1])))
+    splits = list(zip(*_split_at_indices(inputs, [1], True)))
     verify_splits(splits, [[[1]], [[2, 3], [4]]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
-    splits = list(zip(*_split_at_indices(inputs, [2, 2])))
+    splits = list(zip(*_split_at_indices(inputs, [2, 2], True)))
     verify_splits(splits, [[[1], [2]], [], [[3], [4]]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
-    splits = list(zip(*_split_at_indices(inputs, [])))
+    splits = list(zip(*_split_at_indices(inputs, [], True)))
     verify_splits(splits, [[[1], [2, 3], [4]]])
 
     inputs = _create_blocks_with_metadata([[1], [2, 3], [4]])
-    splits = list(zip(*_split_at_indices(inputs, [0, 4])))
+    splits = list(zip(*_split_at_indices(inputs, [0, 4], True)))
     verify_splits(splits, [[], [[1], [2, 3], [4]], []])
 
 
 
@@ -153,6 +153,41 @@ def foo(x):
     ), result
 
 
+def test_zip_does_not_free_shared_materialized_blocks(ray_start_regular_shared):
+    """Regression test: ZipOperator should not free blocks from a materialized
+    dataset that is shared with another consumer.
+
+    Previously, ZipOperator._zip() called _split_at_indices() without specifying
+    owned_by_consumer, which defaulted to True. This caused ray.internal.free()
+    to be called on blocks that were shared with other operators in the DAG,
+    leading to ObjectFreedError.
+    """
+    # Create a dataset with 3 blocks (rows [7, 7, 6]) and materialize it.
+    # The materialized blocks have owns_blocks=False.
+    ds = ray.data.range(20, override_num_blocks=3).materialize()
+    assert not ds._plan.execute().owns_blocks
+
+    # Consumer 1: a map_batches that uses the same materialized dataset.
+    mapped_ds = ds.map_batches(lambda batch: batch, batch_format="pandas")
+
+    # Consumer 2: zip the same materialized dataset with another dataset.
+    # This triggers _split_at_indices inside ZipOperator._zip().
+    # Use 2 blocks (rows [10, 10]) so that block boundaries are NOT aligned
+    # with ds's blocks (rows [7, 7, 6]). This forces actual block splitting
+    # (e.g., the first 10-row block gets split at row 7), which exercises
+    # the owned_by_consumer code path in _split_all_blocks.
+    other_ds = ray.data.range(20, override_num_blocks=2)
+    zipped = other_ds.zip(ds)
+
+    # Consuming the zipped result should not raise ObjectFreedError.
+    result = zipped.take_all()
+    assert len(result) == 20
+
+    # The mapped_ds should also work fine (blocks not freed by the zip).
+    result2 = mapped_ds.take_all()
+    assert len(result2) == 20
+
+
 if __name__ == "__main__":
     import sys
 
 
@@ -6,10 +6,12 @@
 import pandas as pd
 import pyarrow as pa
 import pytest
+from packaging.version import parse as parse_version
 
 from ray.data._internal.arrow_ops.transform_pyarrow import (
     MIN_PYARROW_VERSION_TYPE_PROMOTION,
     _align_struct_fields,
+    _has_unhashable_pandas_types,
     concat,
     hash_partition,
     shuffle,
@@ -144,6 +146,78 @@ def _concat_and_sort_partitions(parts: Iterable[pa.Table]) -> pa.Table:
     assert t == _concat_and_sort_partitions(_structs_partition_dict.values())
 
 
+@pytest.mark.parametrize(
+    "pa_type,expected",
+    [
+        # Nested types -> unhashable in pandas (convert to dict/list)
+        (pa.struct([("a", pa.int32())]), True),
+        (pa.list_(pa.int32()), True),
+        (pa.large_list(pa.int32()), True),
+        (pa.list_(pa.int32(), 3), True),  # fixed_size_list
+        (pa.map_(pa.string(), pa.int32()), True),
+        (pa.dense_union([pa.field("x", pa.int32())]), True),
+        # Ray extension types -> numpy arrays / arbitrary objects in pandas
+        (ArrowTensorTypeV2((2, 2), pa.int64()), True),
+        (ArrowPythonObjectType(), True),
+        # Hashable primitives -> must stay False so we keep the fast path
+        (pa.int32(), False),
+        (pa.float64(), False),
+        (pa.bool_(), False),
+        (pa.string(), False),
+        (pa.large_string(), False),
+        (pa.binary(), False),
+        (pa.decimal128(10, 2), False),
+        (pa.date32(), False),
+        (pa.timestamp("ns"), False),
+        (pa.dictionary(pa.int32(), pa.string()), False),
+    ],
+)
+def test_has_unhashable_pandas_types(pa_type, expected):
+    schema = pa.schema([("c", pa_type)])
+    assert _has_unhashable_pandas_types(schema) is expected
+
+
+@pytest.mark.skipif(
+    get_pyarrow_version() < parse_version("16.0.0"),
+    reason="list_view / large_list_view require pyarrow 16+",
+)
+def test_has_unhashable_pandas_types_list_views():
+    # Regression: list_view/large_list_view also convert to Python lists in
+    # pandas, so they must be flagged as unhashable like list/large_list.
+    for view_type in (pa.list_view(pa.int32()), pa.large_list_view(pa.int32())):
+        schema = pa.schema([("c", view_type)])
+        assert _has_unhashable_pandas_types(schema) is True
+
+
+def test_hash_partition_null_struct_consistent_across_blocks():
+    struct_t = pa.struct([("v", pa.int32())])
+    num_partitions = 8
+
+    all_null = pa.Table.from_pydict(
+        {"k": pa.array([None, None, None], type=struct_t), "idx": [0, 1, 2]}
+    )
+    mixed = pa.Table.from_pydict(
+        {
+            "k": pa.array([None, {"v": 1}, None], type=struct_t),
+            "idx": [10, 11, 12],
+        }
+    )
+
+    p1 = hash_partition(all_null, hash_cols=["k"], num_partitions=num_partitions)
+    p2 = hash_partition(mixed, hash_cols=["k"], num_partitions=num_partitions)
+
+    def null_partition_id(parts):
+        # Return the partition id holding null-key rows (there should be
+        # exactly one — identical null keys must co-locate).
+        null_pids = {
+            pid for pid, tbl in parts.items() if any(tbl["k"].is_null().to_pylist())
+        }
+        assert len(null_pids) == 1, null_pids
+        return next(iter(null_pids))
+
+    assert null_partition_id(p1) == null_partition_id(p2)
+
+
 def test_shuffle():
     t = pa.Table.from_pydict(
         {
 
@@ -639,9 +639,26 @@ cdef class InnerGcsClient:
             reason: int32_t,
             reason_message: c_string,
             deadline_timestamp_ms: int64_t):
-        """Send the DrainNode request to GCS.
+        """Send a DrainNode request to GCS to gracefully terminate a node.
 
-        This is only for testing.
+        Used by the `ray drain-node` CLI command and by autoscaler v2's
+        ray_stopper for idle and preemption-based node termination.
+
+        Args:
+            node_id: Binary node ID of the target node.
+            reason: A `DrainNodeReason` enum value. `IDLE_TERMINATION`
+                requests are rejectable by the raylet; `PREEMPTION`
+                requests are non-rejectable.
+            reason_message: Human-readable explanation, used for
+                observability.
+            deadline_timestamp_ms: Timestamp (ms) when the node will be
+                force-killed. Used as a hint so workloads can drain
+                before the deadline.
+
+        Returns:
+            Tuple of (is_accepted, rejection_reason_message). When
+            `is_accepted` is False, `rejection_reason_message` describes
+            why the raylet rejected the request.
         """
         cdef:
             int64_t timeout_ms = -1