GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid by Arawoof06 · Pull Request #50356 · apache/arrow

Arawoof06 · 2026-07-03T09:13:04Z

Rationale for this change

utf8_length_ignore_invalid increments char_len each time it finds a non-continuation byte after a lead byte, and never rechecks the buffer end. An input that ends in a truncated multi-byte UTF-8 sequence (for example, a 0xF0 lead byte followed by non-continuation bytes) therefore makes it read past data_len. The function is reachable from untrusted string data through lpad/rpad, which count the input glyphs before padding. I reproduced the over-read by running a verbatim copy of the function under AddressSanitizer on the 4-byte input {0xF0, 'a', 'a', 'a'} in an exactly-sized heap buffer; ASAN reports heap-buffer-overflow READ ... 0 bytes after 4-byte region.
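To make the failure concrete, here is a minimal sketch that replays the pre-fix scan and reports the first offset it would read outside the buffer, stopping before it actually does so. LeadByteLength and FirstOutOfRangeRead are hypothetical names used only for illustration, and the outer loop (including the fallback to char_len = 1) is reconstructed from this description, not copied from the Arrow source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the lead-byte classifier: the encoded length implied
// by a lead byte, or 0 if the byte cannot start a glyph.
inline int LeadByteLength(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Replays the pre-fix scan. Returns the first offset >= data.size() that the
// inner loop would read, or -1 if every read stays in range. It returns before
// touching that offset, so the sketch itself never over-reads.
inline int FirstOutOfRangeRead(const std::vector<unsigned char>& data) {
  const int data_len = static_cast<int>(data.size());
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = LeadByteLength(data[i]);
    if (char_len == 0 || i + char_len > data_len) char_len = 1;  // assumed fallback
    assert(char_len >= 1);  // the outer loop always makes progress
    for (int j = 1; j < char_len; ++j) {
      if (i + j >= data_len) return i + j;  // the real code reads data[i + j] here
      if ((data[i + j] & 0xC0) != 0x80) char_len += 1;  // pre-fix behavior
    }
  }
  return -1;
}
```

Under these assumptions, the input {0xF0, 'a', 'a', 'a'} yields offset 4, the byte immediately after the 4-byte buffer, while well-formed input yields -1.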

What changes are included in this PR?

Stop the inner scan with a break when a byte after the lead byte is not a continuation byte, instead of incrementing char_len. Growing char_len on each stray byte kept extending the loop past data_len; breaking leaves char_len bounded so the outer i + char_len <= data_len check keeps every read in range. Valid input counts the same, because a well-formed glyph has only continuation bytes after its lead byte and never hits the break.
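The change described above can be sketched as a standalone function. This is an approximation for discussion, not the Gandiva code: GlyphLengthFromLead and CountGlyphsIgnoreInvalid are hypothetical names, and the fallback to char_len = 1 for an invalid or incomplete lead byte is an assumption about the surrounding loop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the lead-byte classifier: the encoded length implied
// by a lead byte, or 0 if the byte cannot start a glyph.
inline int GlyphLengthFromLead(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Counts glyphs, tolerating malformed input. Because char_len never grows after
// the outer bounds check, every inner read satisfies i + j < i + char_len <= data_len.
inline int32_t CountGlyphsIgnoreInvalid(const char* data, int32_t data_len) {
  int char_len = 0;
  int32_t count = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = GlyphLengthFromLead(static_cast<unsigned char>(data[i]));
    if (char_len == 0 || i + char_len > data_len) char_len = 1;  // assumed fallback
    assert(char_len >= 1);  // the outer loop always makes progress
    for (int j = 1; j < char_len; ++j) {
      if ((static_cast<unsigned char>(data[i + j]) & 0xC0) != 0x80) {
        break;  // stop scanning instead of extending char_len
      }
    }
    ++count;
  }
  return count;
}
```

A well-formed glyph never reaches the break, so in this sketch valid strings are counted exactly as before.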

Are these changes tested?

Yes. Added TestStringOps.TestPadMalformedUtf8NoOverread, which runs lpad/rpad on the truncated multi-byte input placed in an exactly-sized heap buffer so the over-read trips ASAN, and asserts the full padded output. The existing pad tests still pass.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix". It fixes an out-of-bounds read in the Gandiva utf8 length helper reachable from lpad/rpad on crafted string data.

GitHub Issue: [C++][Gandiva] Out-of-bounds read in utf8_length_ignore_invalid on truncated multi-byte input #50355

github-actions · 2026-07-03T09:13:31Z

⚠️ GitHub issue #50355 has been automatically assigned in GitHub to PR creator.

kou

Hmm, it seems that CODEOWNERS configuration by GH-50144 doesn't work...

@dmitry-chirkov-dremio @lriggs @akravchukdremio @xxlaykxx Could you take a look at this?

kou · 2026-07-03T21:44:57Z

+  EXPECT_EQ(std::string(out_str + out_len - text_len, text_len),
+            std::string(text.begin(), text.end()));


Could you check out_str data entirely instead of checking only part of out_str?

Done, now checks the whole out_str including the padding (" " + text for lpad).

kou · 2026-07-03T21:45:04Z

+
+  out_str = rpad_utf8_int32_utf8(ctx_ptr, text.data(), text_len, 6, " ", 1, &out_len);
+  EXPECT_EQ(out_len, 9);
+  EXPECT_EQ(std::string(out_str, text_len), std::string(text.begin(), text.end()));


Same here, rpad now asserts the full out_str (text + trailing spaces).

kou · 2026-07-03T22:01:48Z

-    for (int j = 1; j < char_len; ++j) {
+    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) {  // bytes following head-byte of glyph
        char_len += 1;


Hmm, should we break here instead?

Suggested change

-        char_len += 1;
+        break;

Good call, switched to break. Once the scan stops growing char_len it never runs past the outer i + char_len <= data_len check, so the extra i + j < data_len guard isn't needed anymore. Valid input counts the same and the malformed case is clean under ASAN.
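One way to gain confidence that the extra guard really is redundant is a small exhaustive check. The sketch below replays the break-based scan over every byte string up to a given length drawn from a small alphabet and confirms that no inner read reaches data_len. All names here (SketchLeadLength, ScanStaysInRange, AllInputsStayInRange) are hypothetical, and the outer loop is an assumed reconstruction rather than the Arrow source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the lead-byte classifier.
inline int SketchLeadLength(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Replays the break-based scan; returns false if any inner read would land at
// or beyond data.size().
inline bool ScanStaysInRange(const std::vector<unsigned char>& data) {
  const int data_len = static_cast<int>(data.size());
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = SketchLeadLength(data[i]);
    if (char_len == 0 || i + char_len > data_len) char_len = 1;  // assumed fallback
    assert(char_len >= 1);  // the outer loop always makes progress
    for (int j = 1; j < char_len; ++j) {
      if (i + j >= data_len) return false;  // would be an over-read
      if ((data[i + j] & 0xC0) != 0x80) break;
    }
  }
  return true;
}

// Enumerates every string of length 0..max_len over `alphabet` (odometer style)
// and checks each one with ScanStaysInRange.
inline bool AllInputsStayInRange(const std::vector<unsigned char>& alphabet,
                                 int max_len) {
  if (alphabet.empty()) return ScanStaysInRange({});
  std::vector<unsigned char> buf;
  for (int len = 0; len <= max_len; ++len) {
    std::vector<std::size_t> idx(static_cast<std::size_t>(len), 0);
    while (true) {
      buf.clear();
      for (int k = 0; k < len; ++k) buf.push_back(alphabet[idx[k]]);
      if (!ScanStaysInRange(buf)) return false;
      int pos = len - 1;  // advance the odometer, carrying leftwards
      while (pos >= 0 && ++idx[pos] == alphabet.size()) {
        idx[pos] = 0;
        --pos;
      }
      if (pos < 0) break;  // wrapped around: all strings of this length done
    }
  }
  return true;
}
```

An alphabet such as {0x61, 0x80, 0xC3, 0xE2, 0xF0} covers an ASCII byte, a continuation byte, and 2-, 3-, and 4-byte lead bytes, which is enough to exercise each branch of the scan.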

kou · 2026-07-05T03:05:55Z

Could you update the PR description?

I'll wait for a review from Gandiva developers before I merge this.

Arawoof06 · 2026-07-05T12:53:25Z

Updated the description to match the current fix (the break instead of the earlier bound check). Thanks for waiting on the Gandiva folks.

Commit e8dc201: GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid

github-actions bot added the Component: C++, Component: Gandiva, and awaiting review labels on Jul 3, 2026

kou reviewed Jul 3, 2026

Commit e0f930d: GH-50355: [C++][Gandiva] break on invalid continuation byte, verify full pad output. Signed-off-by: abdul rawoof <abdulr@bugqore.com>

github-actions bot added the awaiting committer review label and removed the awaiting review label on Jul 5, 2026

Arawoof06 wants to merge 2 commits into apache:main from Arawoof06:utf8-ignore-invalid-overread

2 participants
