Unicode whitespace stripping in string literal line continuation#6860
Unicode whitespace stripping in string literal line continuation#6860ytmimi merged 5 commits intorust-lang:mainfrom
Conversation
| // Strip line breaks. | ||
| // With this regex applied, all remaining whitespaces are significant | ||
| let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][[:space:]]*").unwrap(); | ||
| let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][ \t\n\r\x0B\x0C]*").unwrap(); |
There was a problem hiding this comment.
I don't think this change is necessary: the docs say [[:space:]] maps exactly to this set of characters:
[[:space:]] whitespace ([\t\n\v\f\r ])
There was a problem hiding this comment.
Yes, this is already the ASCII part of Pattern_White_Space:
https://doc.rust-lang.org/reference/whitespace.html
Is it worth adding the Unicode characters in Pattern_White_Space as well?
Or are they already replaced with ASCII spaces somewhere else?
I've noticed on related PRs that rustfmt is correctly removing most Unicode whitespace, but I'm not sure if it's catching it all. If there isn't a test for it already, I can get an Outreachy applicant to add one?
| grapheme | ||
| .chars() | ||
| .all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C')) |
There was a problem hiding this comment.
I think char::is_ascii_whitespace does the same thing
| grapheme | |
| .chars() | |
| .all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C')) | |
| grapheme.chars().all(char::is_ascii_whitespace) |
EDIT: no, it doesn't include \x0B
Rust uses the WhatWG Infra Standard’s definition of ASCII whitespace. There are several other definitions in wide use. For instance, the POSIX locale includes U+000B VERTICAL TAB as well as all the above characters, but—from the very same specification—the default rule for “field splitting” in the Bourne shell considers only SPACE, HORIZONTAL TAB, and LINE FEED as whitespace.
There was a problem hiding this comment.
Might be worth leaving a comment in the code explaining why we're not just using char::is_whitespace here and why we're explicitly matching on these characters.
would it be better to write this as:
grapheme
.chars()
.all(|c| c.is_whitespace() || matches!(c, '\x0B' | '\x0C'))There was a problem hiding this comment.
Hi @ytmimi ,I checked the docs for char::is_whitespace and it uses Unicode White_Space which is broader than what we want here. For example it would also match \u{A0} (non-breaking space) which the Rust language does not consider whitespace. Also \x0B and \x0C are already included in char::is_whitespace so the || matches!(c, '\x0B' | '\x0C') part would be redundant. I think keeping the explicit match is more correct here since we want to match exactly the Rust language's whitespace definition. I will add the comment you suggested though to explain why we're not using char::is_whitespace.
There was a problem hiding this comment.
Thanks for clearly explaining that. Makes sense to me. I appreciate you leaving the comment in the code too!
|
|
||
| fn main() { | ||
| let str = "who is olaf\ | ||
| \u{00A0}This is a rust code example to show a bug in Unicode Pattern WhiteSpace"; |
There was a problem hiding this comment.
Because this string uses an Unicode escape sequence, rustfmt will never see the actual Unicode whitespace character.
Please create a test file with some whitespace characters that are Unicode Whitespace, but not Pattern_White_Space:
- https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
- https://doc.rust-lang.org/reference/whitespace.html
You might need to use a character picker or Rust code to write the Unicode whitespace to the file.
|
@teor2345 @matthewhughes934 , I Reverted the regex change and added an actual unicode character to the test |
|
The |
|
@afurm the test already starts with a unicode character ,the non-breaking space Unicode character appears as the first character after the line continuation in tests/source/string_lit_unicode_ws.rs, and since source and target files are identical, rustfmt correctly preserves it without stripping it. |
Fixed a bug in src/string.rs where [[:space:]] in the line splitting regex and char::is_whitespace in the is_whitespace function were using Unicode whitespace definitions instead of ASCII whitespace , also added a test