Unicode whitespace stripping in string literal line continuation by JojoFlex1 · Pull Request #6860 · rust-lang/rustfmt

JojoFlex1 · 2026-04-10T03:13:03Z

Fixed a bug in src/string.rs where [[:space:]] in the line-splitting regex and char::is_whitespace in the is_whitespace function were using Unicode whitespace definitions instead of ASCII whitespace. Also added a test.

matthewhughes934 · 2026-04-10T17:05:05Z

    // Strip line breaks.
    // With this regex applied, all remaining whitespaces are significant
-    let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][[:space:]]*").unwrap();
+    let strip_line_breaks_re = Regex::new(r"([^\\](\\\\)*)\\[\n\r][ \t\n\r\x0B\x0C]*").unwrap();



I don't think this change is necessary: the docs say [[:space:]] maps exactly to this set of characters:

[[:space:]] whitespace ([\t\n\v\f\r ])

Yes, this is already the ASCII part of Pattern_White_Space:
https://doc.rust-lang.org/reference/whitespace.html

Is it worth adding the Unicode characters in Pattern_White_Space as well?
Or are they already replaced with ASCII spaces somewhere else?

I've noticed on related PRs that rustfmt is correctly removing most Unicode whitespace, but I'm not sure if it's catching it all. If there isn't a test for it already, I can get an Outreachy applicant to add one?
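To make the distinction in this thread concrete, here is a minimal sketch of the two character sets being compared. The helper `is_pattern_white_space` is hypothetical (it is not part of rustfmt or the standard library) and hard-codes the eleven code points that the Rust Reference lists for Pattern_White_Space; `char::is_whitespace` is the standard-library check for the Unicode White_Space property.

```rust
/// Hypothetical helper (not a rustfmt or std API): true if `c` is one of the
/// code points the Rust Reference lists as Pattern_White_Space.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\t' | '\n' | '\x0B' | '\x0C' | '\r' | ' '
            | '\u{0085}' // NEXT LINE
            | '\u{200E}' // LEFT-TO-RIGHT MARK
            | '\u{200F}' // RIGHT-TO-LEFT MARK
            | '\u{2028}' // LINE SEPARATOR
            | '\u{2029}' // PARAGRAPH SEPARATOR
    )
}

fn main() {
    // NO-BREAK SPACE has White_Space but not Pattern_White_Space.
    assert!('\u{00A0}'.is_whitespace());
    assert!(!is_pattern_white_space('\u{00A0}'));

    // LEFT-TO-RIGHT MARK is the reverse case.
    assert!(is_pattern_white_space('\u{200E}'));
    assert!(!'\u{200E}'.is_whitespace());

    // NEXT LINE is in both sets, but it is not ASCII.
    assert!(is_pattern_white_space('\u{0085}'));
    assert!('\u{0085}'.is_whitespace());
    assert!(!'\u{0085}'.is_ascii());
}
```

The two properties overlap but neither contains the other, which is why a test file needs characters chosen from one set and not the other.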

matthewhughes934 · 2026-04-10T17:06:29Z

+    grapheme
+        .chars()
+        .all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C'))



I think char::is_ascii_whitespace does the same thing

Suggested change

    -    grapheme
    -        .chars()
    -        .all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C'))
    +    grapheme.chars().all(char::is_ascii_whitespace)

EDIT: no, it doesn't include \x0B

Rust uses the WhatWG Infra Standard’s definition of ASCII whitespace. There are several other definitions in wide use. For instance, the POSIX locale includes U+000B VERTICAL TAB as well as all the above characters, but—from the very same specification—the default rule for “field splitting” in the Bourne shell considers only SPACE, HORIZONTAL TAB, and LINE FEED as whitespace.
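The difference noted in the EDIT can be checked with a few assertions. This is only a sketch: `in_explicit_set` is a hypothetical stand-in for the `matches!` expression from the diff, not a function that exists in rustfmt.

```rust
/// Hypothetical stand-in for the explicit `matches!` from the diff.
fn in_explicit_set(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C')
}

fn main() {
    // The five characters both definitions agree on.
    for c in [' ', '\t', '\n', '\r', '\x0C'] {
        assert!(c.is_ascii_whitespace());
        assert!(in_explicit_set(c));
    }

    // U+000B VERTICAL TAB is the one character where they differ.
    assert!(!'\x0B'.is_ascii_whitespace());
    assert!(in_explicit_set('\x0B'));
}
```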

Might be worth leaving a comment in the code explaining why we're not just using char::is_whitespace here and why we're explicitly matching on these characters.

would it be better to write this as:

    grapheme
        .chars()
        .all(|c| c.is_whitespace() || matches!(c, '\x0B' | '\x0C'))

Hi @ytmimi, I checked the docs for char::is_whitespace: it uses the Unicode White_Space property, which is broader than what we want here. For example, it would also match \u{A0} (non-breaking space), which the Rust language does not consider whitespace. Also, \x0B and \x0C are already included in char::is_whitespace, so the || matches!(c, '\x0B' | '\x0C') part would be redundant. I think keeping the explicit match is more correct here, since we only want to match the ASCII whitespace that the Rust language recognizes. I will add the comment you suggested, though, to explain why we're not using char::is_whitespace.

Thanks for clearly explaining that. Makes sense to me. I appreciate you leaving the comment in the code too!
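The points made in this exchange can be verified with a short sketch. The function name `is_ascii_ws_grapheme` and its `&str` parameter are assumptions made for illustration; the body mirrors the explicit match from the diff, but this is not presented as rustfmt's actual code.

```rust
/// Hypothetical grapheme-level check modeled on the explicit match in the diff.
/// Returns true when every `char` in the grapheme is in the six-character set.
fn is_ascii_ws_grapheme(grapheme: &str) -> bool {
    grapheme
        .chars()
        .all(|c| matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C'))
}

fn main() {
    // `char::is_whitespace` follows Unicode White_Space, so it accepts U+00A0.
    assert!('\u{00A0}'.is_whitespace());
    // The explicit match does not.
    assert!(!is_ascii_ws_grapheme("\u{00A0}"));

    // `char::is_whitespace` already covers \x0B and \x0C, so OR-ing them in
    // would not change its result.
    assert!('\x0B'.is_whitespace());
    assert!('\x0C'.is_whitespace());

    // Ordinary ASCII whitespace is accepted by the explicit match.
    assert!(is_ascii_ws_grapheme(" "));
    assert!(is_ascii_ws_grapheme("\r\n"));
}
```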

teor2345 · 2026-04-10T23:30:24Z

+
+fn main() {
+    let str = "who is olaf\
+\u{00A0}This is a rust code example to show a bug in Unicode Pattern WhiteSpace";



Because this string uses a Unicode escape sequence, rustfmt will never see the actual Unicode whitespace character.

Please create a test file with some whitespace characters that are Unicode Whitespace, but not Pattern_White_Space:

https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

https://doc.rust-lang.org/reference/whitespace.html

You might need to use a character picker or Rust code to write the Unicode whitespace to the file.
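One way to follow the "Rust code" suggestion is to generate the fixture text programmatically, so the file contains the raw U+00A0 character and not the six-character escape. This is a rough sketch: the function names, the fixture contents, and the output file name are all made up for illustration and are not the files added in this PR.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::Path;

/// Build source text for a hypothetical fixture: a string literal with a line
/// continuation followed immediately by a literal NO-BREAK SPACE.
fn fixture_source() -> String {
    let nbsp = '\u{00A0}'; // the escape is resolved here, in the generator
    format!(
        "fn main() {{\n    let s = \"first line\\\n{}second line\";\n}}\n",
        nbsp
    )
}

/// Write the fixture text to `path`.
fn write_fixture(path: &Path) -> io::Result<()> {
    fs::write(path, fixture_source())
}

fn main() -> io::Result<()> {
    // Hypothetical output location; a real test would live under tests/source.
    let path = env::temp_dir().join("unicode_ws_fixture_sketch.rs");
    write_fixture(&path)?;

    let written = fs::read_to_string(&path)?;
    // The file holds the raw character right after the backslash-newline...
    assert!(written.contains("\\\n\u{00A0}"));
    // ...and no textual `\u{...}` escape.
    assert!(!written.contains("\\u{"));
    Ok(())
}
```

Because the generated file holds the character itself, a formatter reading that file sees real Unicode whitespace, which is the property the review asks the test to have.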

JojoFlex1 · 2026-04-11T04:52:19Z

@teor2345 @matthewhughes934, I reverted the regex change and added an actual Unicode character to the test.

afurm · 2026-04-13T16:04:10Z

The is_whitespace check now only handles ASCII whitespace, but the line continuation logic in string.rs that calls this — does it also correctly handle the case where a non-ASCII whitespace character appears as the first grapheme after the backslash, rather than as trailing whitespace to strip?

JojoFlex1 · 2026-04-14T04:20:23Z

@afurm the test already covers this: the non-breaking space Unicode character appears as the first character after the line continuation in tests/source/string_lit_unicode_ws.rs, and since the source and target files are identical, rustfmt correctly preserves it without stripping it.
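The behavior described here can be illustrated with a simplified model. This is not rustfmt's code: `strip_after_continuation` is a hypothetical function that only mimics the idea of dropping ASCII whitespace after a line continuation while leaving a leading non-ASCII whitespace character in place.

```rust
/// Hypothetical, simplified model: drop leading ASCII whitespace from the text
/// that follows a backslash-newline, and stop at anything else.
fn strip_after_continuation(rest: &str) -> &str {
    rest.trim_start_matches(|c: char| {
        matches!(c, ' ' | '\t' | '\n' | '\r' | '\x0B' | '\x0C')
    })
}

fn main() {
    // Leading ASCII indentation is removed.
    assert_eq!(strip_after_continuation("    \tnext"), "next");

    // A NO-BREAK SPACE as the first character is preserved.
    assert_eq!(strip_after_continuation("\u{00A0}next"), "\u{00A0}next");

    // Stripping also stops at a NO-BREAK SPACE that follows ASCII whitespace.
    assert_eq!(strip_after_continuation("  \u{00A0}next"), "\u{00A0}next");
}
```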

Unicode whitespace stripping in string literal line continuation

dd9e6b6

rustbot added the S-waiting-on-review Status: awaiting review from the assignee but also interested parties. label Apr 10, 2026

Fix formatting of is_whitespace function

ed1a424

JojoFlex1 mentioned this pull request Apr 10, 2026

Outreachy tracking issue for Rust whitespace check bugs rustfoundation/interop-initiative#53

Closed

13 tasks

matthewhughes934 reviewed Apr 10, 2026


teor2345 reviewed Apr 10, 2026


Revert regex change and update test with Unicode character

27f4182

JojoFlex1 added 2 commits April 21, 2026 10:48

Added explanation why char::is_whitespace is not used

051ca3b

comment

c66396a

ytmimi approved these changes Apr 21, 2026


ytmimi merged commit b47f06f into rust-lang:main Apr 21, 2026
26 checks passed

rustbot added release-notes Needs an associated changelog entry and removed S-waiting-on-review Status: awaiting review from the assignee but also interested parties. labels Apr 21, 2026

Unicode whitespace stripping in string literal line continuation #6860
ytmimi merged 5 commits into rust-lang:main from JojoFlex1:fix-unicode-whitespace
