Skip to content

GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid#50356

Open
Arawoof06 wants to merge 2 commits into
apache:mainfrom
Arawoof06:utf8-ignore-invalid-overread
Open

GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid#50356
Arawoof06 wants to merge 2 commits into
apache:mainfrom
Arawoof06:utf8-ignore-invalid-overread

Conversation

@Arawoof06

@Arawoof06 Arawoof06 commented Jul 3, 2026

Copy link
Copy Markdown

Rationale for this change

utf8_length_ignore_invalid extends char_len while scanning the bytes after a lead byte and never rechecks the buffer end, so an input ending in a truncated multi-byte utf8 sequence (a 0xF0 lead byte followed by non-continuation bytes) reads past data_len. It is reached from untrusted string data through lpad/rpad, which count the input glyphs before padding. Reproduced against a verbatim copy of the function under AddressSanitizer with the 4-byte input {0xF0, 'a', 'a', 'a'} in an exactly-sized heap buffer, giving heap-buffer-overflow READ ... 0 bytes after 4-byte region.

What changes are included in this PR?

Stop the inner scan with a break when a byte after the lead byte is not a continuation byte, instead of incrementing char_len. Growing char_len on each stray byte kept extending the loop past data_len; breaking leaves char_len bounded so the outer i + char_len <= data_len check keeps every read in range. Valid input counts the same, because a well-formed glyph has only continuation bytes after its lead byte and never hits the break.

Are these changes tested?

Yes. Added TestStringOps.TestPadMalformedUtf8NoOverread, which runs lpad/rpad on the truncated multi-byte input placed in an exactly-sized heap buffer so the over-read trips ASAN, and asserts the full padded output. The existing pad tests still pass.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix". It fixes an out-of-bounds read in the Gandiva utf8 length helper reachable from lpad/rpad on crafted string data.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

⚠️ GitHub issue #50355 has been automatically assigned in GitHub to PR creator.

@kou kou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems that CODEOWNERS configuration by GH-50144 doesn't work...

@dmitry-chirkov-dremio @lriggs @akravchukdremio @xxlaykxx Could you take a look at this?

Comment on lines +1627 to +1628
EXPECT_EQ(std::string(out_str + out_len - text_len, text_len),
std::string(text.begin(), text.end()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check out_str data entirely instead of checking only part of out_str?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, now checks the whole out_str including the padding (" " + text for lpad).


out_str = rpad_utf8_int32_utf8(ctx_ptr, text.data(), text_len, 6, " ", 1, &out_len);
EXPECT_EQ(out_len, 9);
EXPECT_EQ(std::string(out_str, text_len), std::string(text.begin(), text.end()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, rpad now asserts the full out_str (text + trailing spaces).

for (int j = 1; j < char_len; ++j) {
for (int j = 1; j < char_len && i + j < data_len; ++j) {
if ((data[i + j] & 0xC0) != 0x80) { // bytes following head-byte of glyph
char_len += 1;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should we break here instead?

Suggested change
char_len += 1;
break;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, switched to break. Once the scan stops growing char_len it never runs past the outer i + char_len <= data_len check, so the extra i + j < data_len guard isn't needed anymore. Valid input counts the same and the malformed case is clean under ASAN.

…rify full pad output

Signed-off-by: abdul rawoof <abdulr@bugqore.com>
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 5, 2026
@kou

kou commented Jul 5, 2026

Copy link
Copy Markdown
Member

Could you update the PR description?

I'll wait for a review from Gandiva developers before I merge this.

@Arawoof06

Copy link
Copy Markdown
Author

Updated the description to match the current fix (the break instead of the earlier bound check). Thanks for waiting on the Gandiva folks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants