
Commit 0db489f

hsbt and claude committed
Clarify the C1 comment and strengthen the multibyte test
Reword the comment to explain that the UTF-8 guard avoids splitting multibyte sequences, and assert preservation with U+0400, whose continuation byte falls in the C1 byte range.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent cc62ee8 commit 0db489f

2 files changed

Lines changed: 7 additions & 4 deletions


lib/rubygems/text.rb

Lines changed: 3 additions & 2 deletions
@@ -10,8 +10,9 @@ module Gem::Text
   def clean_text(text)
     text = text.gsub(/[\000-\b\v-\f\016-\037\177]/, ".")
 
-    # C1 control characters (U+0080-U+009F) only occur in UTF-8 text and must
-    # be matched as codepoints so that multibyte characters are preserved.
+    # Match C1 control characters (U+0080-U+009F) as codepoints. This requires
+    # a valid UTF-8 string so the regexp does not split a multibyte sequence;
+    # strings in other encodings are left unchanged.
     if text.encoding == Encoding::UTF_8 && text.valid_encoding?
       text = text.gsub(/[\u0080-\u009f]/, ".")
     end
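
To illustrate why the guard and the codepoint match matter, here is a minimal sketch. The helper names `strip_c1` and `strip_c1_bytewise` are hypothetical and do not appear in the commit: the first mirrors the guarded `gsub` from the hunk above, the second is a deliberately wrong variant that treats 0x80-0x9F as raw bytes.

```ruby
# Hypothetical helper mirroring the guarded C1 step from the hunk above.
def strip_c1(text)
  if text.encoding == Encoding::UTF_8 && text.valid_encoding?
    text = text.gsub(/[\u0080-\u009f]/, ".")
  end
  text
end

# Hypothetical, deliberately naive variant: replaces every byte in
# 0x80-0x9F, regardless of whether it is part of a multibyte sequence.
def strip_c1_bytewise(text)
  bytes = text.bytes.map { |b| (0x80..0x9f).cover?(b) ? 0x2e : b }
  bytes.pack("C*").force_encoding(text.encoding)
end

# U+0400 followed by NEL (U+0085); the UTF-8 bytes are D0 80 C2 85.
sample = [0x400, 0x85].pack("U*")

codepoint_result = strip_c1(sample)
bytewise_result  = strip_c1_bytewise(sample)

puts codepoint_result.valid_encoding? # codepoint match leaves U+0400 intact
puts bytewise_result.valid_encoding?  # bytewise match corrupts both sequences

# A string in another encoding skips the C1 step entirely.
latin1 = [0x63, 0x61, 0x66, 0xe9].pack("C*").force_encoding(Encoding::ISO_8859_1)
latin1_result = strip_c1(latin1)
```

The bytewise variant overwrites the 0x80 continuation byte of U+0400 as well as the 0x85 of NEL, so its result is no longer valid UTF-8. The codepoint match only ever replaces whole characters.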

test/rubygems/test_gem_text.rb

Lines changed: 4 additions & 2 deletions
@@ -107,8 +107,10 @@ def test_clean_text_strips_c1_control_characters
   end
 
   def test_clean_text_preserves_multibyte_characters
-    text = [0xe9, 0x85].pack("U*") # U+00E9 kept, NEL (U+0085) stripped
-    assert_equal [0xe9, 0x2e].pack("U*"), clean_text(text)
+    # U+0400 encodes to bytes D0 80, whose 0x80 continuation byte must not be
+    # mistaken for a C1 control byte. NEL (U+0085) is stripped.
+    text = [0x400, 0x85].pack("U*")
+    assert_equal [0x400, 0x2e].pack("U*"), clean_text(text)
   end
 
   def test_clean_text_passes_through_non_unicode_encodings
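
The switch from U+00E9 to U+0400 can be checked with a short sketch. The helper `c1_range_bytes` is a hypothetical name introduced here for illustration: it lists the UTF-8 bytes of a codepoint that fall in 0x80-0x9F, the byte values that overlap the C1 range.

```ruby
# Hypothetical helper: returns the UTF-8 bytes of a codepoint that lie in
# 0x80-0x9F, i.e. bytes a bytewise C1 match would wrongly replace.
def c1_range_bytes(codepoint)
  [codepoint].pack("U").bytes.select { |b| (0x80..0x9f).cover?(b) }
end

old_probe = c1_range_bytes(0xe9)  # U+00E9 encodes to C3 A9
new_probe = c1_range_bytes(0x400) # U+0400 encodes to D0 80

puts format("U+00E9: %d byte(s) in 0x80-0x9F", old_probe.size)
puts format("U+0400: %d byte(s) in 0x80-0x9F", new_probe.size)
```

U+00E9 has no byte in that range, so the old test string could not show whether the multibyte character itself was at risk. U+0400 carries the continuation byte 0x80, which a bytewise match would overwrite.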
