Clarify the C1 comment and strengthen the multibyte test

hsbt · claude · hsbt · commit 0db489ff2e0b · 2026-06-04T18:39:44.000+09:00
Reword the comment to explain that the UTF-8 guard avoids splitting
multibyte sequences, and assert preservation with U+0400, whose
continuation byte falls in the C1 byte range.
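
As a rough illustration of why U+0400 is a useful test input, the sketch below shows its byte layout and the effect of the guarded codepoint match. `strip_c1` is a hypothetical standalone helper that mirrors only the C1 step of `clean_text`; it is not part of `Gem::Text`.

```ruby
# Hypothetical standalone copy of the C1 step, for illustration only.
def strip_c1(text)
  # Only valid UTF-8 is matched by codepoint; other strings pass through.
  return text unless text.encoding == Encoding::UTF_8 && text.valid_encoding?

  text.gsub(/[\u0080-\u009f]/, ".")
end

text = [0x400, 0x85].pack("U*")          # U+0400 followed by NEL (U+0085)
p text.bytes.map { |b| format("%02X", b) }
# prints ["D0", "80", "C2", "85"]

cleaned = strip_c1(text)
p cleaned.bytes.map { |b| format("%02X", b) }
# prints ["D0", "80", "2E"]: the 0x80 continuation byte survives, NEL becomes "."

binary = "\xD0\x80".b                    # same bytes, tagged ASCII-8BIT
p strip_c1(binary).equal?(binary)
# prints true: non-UTF-8 input is returned untouched
```

The earlier test input, U+00E9, encodes to C3 A9, and 0xA9 lies outside 0x80-0x9F, so it could not catch a byte-wise match that mistook a continuation byte for a C1 control.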

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/lib/rubygems/text.rb b/lib/rubygems/text.rb
@@ -10,8 +10,9 @@ module Gem::Text
   def clean_text(text)
     text = text.gsub(/[\000-\b\v-\f\016-\037\177]/, ".")
 
-    # C1 control characters (U+0080-U+009F) only occur in UTF-8 text and must
-    # be matched as codepoints so that multibyte characters are preserved.
+    # Match C1 control characters (U+0080-U+009F) as codepoints. This requires
+    # a valid UTF-8 string so the regexp does not split a multibyte sequence;
+    # strings in other encodings are left unchanged.
     if text.encoding == Encoding::UTF_8 && text.valid_encoding?
       text = text.gsub(/[\u0080-\u009f]/, ".")
     end
diff --git a/test/rubygems/test_gem_text.rb b/test/rubygems/test_gem_text.rb
@@ -107,8 +107,10 @@ def test_clean_text_strips_c1_control_characters
   end
 
   def test_clean_text_preserves_multibyte_characters
-    text = [0xe9, 0x85].pack("U*") # U+00E9 kept, NEL (U+0085) stripped
-    assert_equal [0xe9, 0x2e].pack("U*"), clean_text(text)
+    # U+0400 encodes to bytes D0 80, whose 0x80 continuation byte must not be
+    # mistaken for a C1 control byte. NEL (U+0085) is stripped.
+    text = [0x400, 0x85].pack("U*")
+    assert_equal [0x400, 0x2e].pack("U*"), clean_text(text)
   end
 
   def test_clean_text_passes_through_non_unicode_encodings