GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid#50356
GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid#50356Arawoof06 wants to merge 2 commits into
Conversation
|
|
kou
left a comment
There was a problem hiding this comment.
Hmm, it seems that CODEOWNERS configuration by GH-50144 doesn't work...
@dmitry-chirkov-dremio @lriggs @akravchukdremio @xxlaykxx Could you take a look at this?
| EXPECT_EQ(std::string(out_str + out_len - text_len, text_len), | ||
| std::string(text.begin(), text.end())); |
There was a problem hiding this comment.
Could you check out_str data entirely instead of checking only part of out_str?
There was a problem hiding this comment.
Done, now checks the whole out_str including the padding (" " + text for lpad).
|
|
||
| out_str = rpad_utf8_int32_utf8(ctx_ptr, text.data(), text_len, 6, " ", 1, &out_len); | ||
| EXPECT_EQ(out_len, 9); | ||
| EXPECT_EQ(std::string(out_str, text_len), std::string(text.begin(), text.end())); |
There was a problem hiding this comment.
Same here, rpad now asserts the full out_str (text + trailing spaces).
| for (int j = 1; j < char_len; ++j) { | ||
| for (int j = 1; j < char_len && i + j < data_len; ++j) { | ||
| if ((data[i + j] & 0xC0) != 0x80) { // bytes following head-byte of glyph | ||
| char_len += 1; |
There was a problem hiding this comment.
Hmm, should we break here instead?
| char_len += 1; | |
| break; |
There was a problem hiding this comment.
Good call, switched to break. Once the scan stops growing char_len it never runs past the outer i + char_len <= data_len check, so the extra i + j < data_len guard isn't needed anymore. Valid input counts the same and the malformed case is clean under ASAN.
…rify full pad output Signed-off-by: abdul rawoof <abdulr@bugqore.com>
|
Could you update the PR description? I'll wait for a review from Gandiva developers before I merge this. |
|
Updated the description to match the current fix (the |
Rationale for this change
utf8_length_ignore_invalidextendschar_lenwhile scanning the bytes after a lead byte and never rechecks the buffer end, so an input ending in a truncated multi-byte utf8 sequence (a0xF0lead byte followed by non-continuation bytes) reads pastdata_len. It is reached from untrusted string data throughlpad/rpad, which count the input glyphs before padding. Reproduced against a verbatim copy of the function under AddressSanitizer with the 4-byte input{0xF0, 'a', 'a', 'a'}in an exactly-sized heap buffer, givingheap-buffer-overflow READ ... 0 bytes after 4-byte region.What changes are included in this PR?
Stop the inner scan with a
breakwhen a byte after the lead byte is not a continuation byte, instead of incrementingchar_len. Growingchar_lenon each stray byte kept extending the loop pastdata_len; breaking leaveschar_lenbounded so the outeri + char_len <= data_lencheck keeps every read in range. Valid input counts the same, because a well-formed glyph has only continuation bytes after its lead byte and never hits thebreak.Are these changes tested?
Yes. Added
TestStringOps.TestPadMalformedUtf8NoOverread, which runslpad/rpadon the truncated multi-byte input placed in an exactly-sized heap buffer so the over-read trips ASAN, and asserts the full padded output. The existing pad tests still pass.Are there any user-facing changes?
No.
This PR contains a "Critical Fix". It fixes an out-of-bounds read in the Gandiva utf8 length helper reachable from
lpad/rpadon crafted string data.