Skip to content

Unicode whitespace stripping in string literal line continuation#6860

Merged
ytmimi merged 5 commits intorust-lang:mainfrom
JojoFlex1:fix-unicode-whitespace
Apr 21, 2026
Merged

Unicode whitespace stripping in string literal line continuation#6860
ytmimi merged 5 commits intorust-lang:mainfrom
JojoFlex1:fix-unicode-whitespace

Conversation

@JojoFlex1
Copy link
Copy Markdown
Contributor

Fixed a bug in src/string.rs where [[:space:]] in the line splitting regex and char::is_whitespace in the is_whitespace function were using Unicode whitespace definitions instead of ASCII whitespace , also added a test

@rustbot rustbot added the S-waiting-on-review Status: awaiting review from the assignee but also interested parties. label Apr 10, 2026
Comment thread src/string.rs Outdated
// Strip line breaks.
// With this regex applied, all remaining whitespaces are significant
let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][[:space:]]*").unwrap();
let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][ \t\n\r\x0B\x0C]*").unwrap();
Copy link
Copy Markdown
Contributor

@matthewhughes934 matthewhughes934 Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View changes since the review

I don't think this change is necessary: the docs say [[:space:]] maps exactly to this set of characters:

[[:space:]] whitespace ([\t\n\v\f\r ])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is already the ASCII part of Pattern_White_Space:
https://doc.rust-lang.org/reference/whitespace.html

Is it worth adding the Unicode characters in Pattern_White_Space as well?
Or are they already replaced with ASCII spaces somewhere else?

I've noticed on related PRs that rustfmt is correctly removing most Unicode whitespace, but I'm not sure if it's catching it all. If there isn't a test for it already, I can get an Outreachy applicant to add one?

Comment thread src/string.rs
Comment on lines +363 to +365
grapheme
.chars()
.all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C'))
Copy link
Copy Markdown
Contributor

@matthewhughes934 matthewhughes934 Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View changes since the review

I think char::is_ascii_whitespace does the same thing

Suggested change
grapheme
.chars()
.all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C'))
grapheme.chars().all(char::is_ascii_whitespace)

EDIT: no, it doesn't include \x0B

Rust uses the WhatWG Infra Standard’s definition of ASCII whitespace. There are several other definitions in wide use. For instance, the POSIX locale includes U+000B VERTICAL TAB as well as all the above characters, but—from the very same specification—the default rule for “field splitting” in the Bourne shell considers only SPACE, HORIZONTAL TAB, and LINE FEED as whitespace.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth leaving a comment in the code explaining why we're not just using char::is_whitespace here and why we're explicitly matching on these characters.

would it be better to write this as:

grapheme
    .chars()
    .all(|c| c.is_whitespace() || matches!(c, '\x0B' | '\x0C'))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ytmimi ,I checked the docs for char::is_whitespace and it uses Unicode White_Space which is broader than what we want here. For example it would also match \u{A0} (non-breaking space) which the Rust language does not consider whitespace. Also \x0B and \x0C are already included in char::is_whitespace so the || matches!(c, '\x0B' | '\x0C') part would be redundant. I think keeping the explicit match is more correct here since we want to match exactly the Rust language's whitespace definition. I will add the comment you suggested though to explain why we're not using char::is_whitespace.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clearly explaining that. Makes sense to me. I appreciate you leaving the comment in the code too!

Comment thread tests/source/string_lit_unicode_ws.rs Outdated

fn main() {
let str = "who is olaf\
\u{00A0}This is a rust code example to show a bug in Unicode Pattern WhiteSpace";
Copy link
Copy Markdown

@teor2345 teor2345 Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View changes since the review

Because this string uses an Unicode escape sequence, rustfmt will never see the actual Unicode whitespace character.

Please create a test file with some whitespace characters that are Unicode Whitespace, but not Pattern_White_Space:

You might need to use a character picker or Rust code to write the Unicode whitespace to the file.

@JojoFlex1
Copy link
Copy Markdown
Contributor Author

@teor2345 @matthewhughes934 , I Reverted the regex change and added an actual unicode character to the test

@afurm
Copy link
Copy Markdown

afurm commented Apr 13, 2026

The is_whitespace check now only handles ASCII whitespace, but the line continuation logic in string.rs that calls this — does it also correctly handle the case where a non-ASCII whitespace character appears as the first grapheme after the backslash, rather than as trailing whitespace to strip?

@JojoFlex1
Copy link
Copy Markdown
Contributor Author

@afurm the test already starts with a unicode character ,the non-breaking space Unicode character appears as the first character after the line continuation in tests/source/string_lit_unicode_ws.rs, and since source and target files are identical, rustfmt correctly preserves it without stripping it.

@ytmimi ytmimi merged commit b47f06f into rust-lang:main Apr 21, 2026
26 checks passed
@rustbot rustbot added release-notes Needs an associated changelog entry and removed S-waiting-on-review Status: awaiting review from the assignee but also interested parties. labels Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes Needs an associated changelog entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants