Skip to content

fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode)#4524

Open
schenksj wants to merge 2 commits into
apache:mainfrom
schenksj:fix/get-string-lossy-utf8
Open

fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode)#4524
schenksj wants to merge 2 commits into
apache:mainfrom
schenksj:fix/get-string-lossy-utf8

Conversation

@schenksj
Copy link
Copy Markdown

Which issue does this PR close?

Closes #4521.

Rationale for this change

Spark's UnsafeRow.getUTF8String performs no UTF-8 validation, and cast(BinaryType -> StringType) is a zero-copy reinterpret — so a StringType column can legitimately hold arbitrary, non-UTF-8 bytes. Comet's native shuffle get_string decoded with from_utf8(..).unwrap(), which panics on such rows even though Spark itself treats them as opaque.

What changes are included in this PR?

get_string now decodes with from_utf8_lossy and returns Cow<str>: a zero-cost borrow for valid UTF-8, and a String with U+FFFD replacements for invalid bytes — defined behavior, no UB. This deliberately avoids from_utf8_unchecked, which constructs a &str from arbitrary bytes (UB per the Rust reference) and would propagate into downstream Arrow ops that internally call str::from_utf8_unchecked on the buffer. Callers pass the result through append_value (impl AsRef<str>), which Cow<str> satisfies.

How are these changes tested?

Added a standalone unit test get_string_tolerates_non_utf8_bytes that builds a Spark UnsafeRow with a StringType field holding 0xFF 0xFE 'A'. It panics without the fix (Result::unwrap() on a Utf8Error) and passes with it (lossy-decoded to U+FFFD U+FFFD A).

schenksj and others added 2 commits May 29, 2026 20:14
Spark's UnsafeRow.getUTF8String performs no UTF-8 validation, and
cast(BinaryType -> StringType) is a zero-copy reinterpret, so a StringType
column can legitimately hold arbitrary non-UTF-8 bytes. get_string decoded with
from_utf8(..).unwrap(), which panics on such rows even though Spark treats them
as opaque.

Use from_utf8_lossy (returning Cow<str>): a zero-cost borrow for valid UTF-8 and
a String with U+FFFD replacements otherwise -- defined behavior, no UB. Avoids
from_utf8_unchecked, which would construct a &str from arbitrary bytes (UB) and
propagate into downstream Arrow ops. Adds a standalone unit test that panics
without the fix and passes with it.

Closes apache#4521

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

native shuffle: get_string should not panic on non-UTF-8 bytes (use lossy decode)

1 participant