Commit 7d59295

Rollup merge of rust-lang#155028 - Brace1000:whitespace-tests, r=chenyukang
tests: add whitespace tests for vertical tab behavior

This PR adds two small tests to highlight how vertical tab (\x0B) is handled differently across Rust's whitespace definitions. The Rust lexer treats vertical tab as whitespace (Unicode Pattern_White_Space), while `split_ascii_whitespace` follows the WhatWG Infra Standard and does not include vertical tab. These tests make that difference visible and easier to understand.

See: rustfoundation/interop-initiative#53
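
The difference described in the commit message can be seen directly on the character predicates in the standard library. The following is a minimal sketch, not part of the commit; the helper name `classify` is made up for illustration.

```rust
/// Hypothetical helper (not from the commit): reports whether `c` counts as
/// whitespace under (ASCII / WhatWG rules, Unicode White_Space).
fn classify(c: char) -> (bool, bool) {
    (c.is_ascii_whitespace(), c.is_whitespace())
}

fn main() {
    // Vertical tab: excluded by the ASCII check, included in White_Space.
    assert_eq!(classify('\x0B'), (false, true));
    // Form feed and plain space are whitespace under both definitions.
    assert_eq!(classify('\x0C'), (true, true));
    assert_eq!(classify(' '), (true, true));
    println!("vertical tab: {:?}", classify('\x0B'));
}
```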
2 parents 93637f3 + 43f045c commit 7d59295

5 files changed

Lines changed: 118 additions & 0 deletions

tests/ui/README.md

Lines changed: 15 additions & 0 deletions
@@ -1588,6 +1588,21 @@ Tests on various well-formedness checks, e.g. [Type-checking normal functions](h

 Tests on `where` clauses. See [Where clauses | Reference](https://doc.rust-lang.org/reference/items/generics.html#where-clauses).

+## `tests/ui/whitespace/`
+
+Tests for whitespace handling in the Rust lexer. The Rust language
+defines whitespace as Unicode Pattern_White_Space, which is not the
+same as what the standard library gives you:
+
+- `is_ascii_whitespace` follows the WhatWG Infra Standard and skips
+  vertical tab (`\x0B`)
+- `is_whitespace` matches Unicode White_Space, which is a broader set
+
+These tests make that gap visible and check that the lexer accepts
+all 11 Pattern_White_Space characters correctly.
+
+See: https://github.com/rustfoundation/interop-initiative/issues/53
+
 ## `tests/ui/windows-subsystem/`: `#![windows_subsystem = ""]`

 See [the `windows_subsystem` attribute](https://doc.rust-lang.org/reference/runtime.html#the-windows_subsystem-attribute).
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
//@ run-pass
// This test checks that split_ascii_whitespace does NOT split on
// vertical tab (\x0B), because the standard library uses the WhatWG
// Infra Standard definition of ASCII whitespace, which excludes
// vertical tab.
//
// See: https://github.com/rust-lang/rust-project-goals/issues/53

fn main() {
    let s = "a\x0Bb";

    let parts: Vec<&str> = s.split_ascii_whitespace().collect();

    assert_eq!(parts.len(), 1,
        "vertical tab should not be treated as ASCII whitespace");

    let s2 = "a b";
    let parts2: Vec<&str> = s2.split_ascii_whitespace().collect();
    assert_eq!(parts2.len(), 2,
        "regular space should split correctly");

}
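
For contrast with the test above, `str::split_whitespace` splits on Unicode White_Space, which does include vertical tab. The sketch below is an illustration only and is not part of the commit; `piece_counts` is a made-up helper name.

```rust
/// Hypothetical helper (not from the commit): number of pieces produced by
/// (split_ascii_whitespace, split_whitespace) for the same input.
fn piece_counts(s: &str) -> (usize, usize) {
    (
        s.split_ascii_whitespace().count(),
        s.split_whitespace().count(),
    )
}

fn main() {
    // Vertical tab separates words only for the Unicode-aware splitter.
    assert_eq!(piece_counts("a\x0Bb"), (1, 2));
    // A plain space separates words for both.
    assert_eq!(piece_counts("a b"), (2, 2));
}
```

Code that tokenizes on ASCII whitespace and code that tokenizes on Unicode whitespace can therefore disagree on the same input, which is the gap the test documents.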
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
// This test ensures that the Rust lexer rejects invalid whitespace
// characters such as ZERO WIDTH SPACE.

//@ check-fail

fn main() {
    let x = 5;
    let y = 10;

    let a=​x + y;
    //~^ ERROR unknown start of token
    //~| HELP invisible characters like
}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
error: unknown start of token: \u{200b}
  --> $DIR/invalid_whitespace.rs:10:11
   |
LL |     let a=​x + y;
   |           ^
   |
   = help: invisible characters like '\u{200b}' are not usually visible in text editors

error: aborting due to 1 previous error

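
ZERO WIDTH SPACE (`\u{200B}`) is rejected by the lexer because it is not Pattern_White_Space, and the standard library does not treat it as whitespace either. A small sketch, assuming only the stable `char` and `str` methods; `std_sees_whitespace` is a made-up helper and is not part of the commit.

```rust
/// Hypothetical helper (not from the commit): true if either standard
/// library predicate classifies `c` as whitespace.
fn std_sees_whitespace(c: char) -> bool {
    c.is_ascii_whitespace() || c.is_whitespace()
}

fn main() {
    // ZERO WIDTH SPACE is invisible but is not whitespace for std.
    assert!(!std_sees_whitespace('\u{200B}'));
    // As a consequence, trim() leaves it in place.
    assert_eq!("\u{200B}x".trim(), "\u{200B}x");
}
```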
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
//@ run-pass
// ignore-tidy-tab
//
// Tests that the Rust lexer accepts Unicode Pattern_White_Space characters.
//
// Worth noting: the Rust reference defines whitespace as Pattern_White_Space,
// which is not the same as what is_ascii_whitespace or is_whitespace give you.
//
// is_ascii_whitespace follows WhatWG and skips vertical tab (\x0B).
// is_whitespace uses Unicode White_Space, which is a broader set.
//
// The 11 characters that actually count as whitespace in Rust source:
// \x09 \x0A \x0B \x0C \x0D \x20 \u{85} \u{200E} \u{200F} \u{2028} \u{2029}
//
// Ref: https://github.com/rustfoundation/interop-initiative/issues/53

#[rustfmt::skip]
fn main() {
// tab (\x09) between let and the name
let _ws1 = 1_i32;

// vertical tab (\x0B) between let and the name
// this is the one is_ascii_whitespace does not treat as whitespace
let _ws2 = 2_i32;

// form feed (\x0C) between let and the name
let _ws3 = 3_i32;

// plain space (\x20), here just so every character is represented
let _ws4 = 4_i32;

// NEL (\u{85}) between let and the name
let…_ws5 = 5_i32;

// left-to-right mark (\u{200E}) between let and the name
let‎_ws6 = 6_i32;

// right-to-left mark (\u{200F}) between let and the name
let‏_ws7 = 7_i32;

// \x0A, \x0D, \u{2028}, \u{2029} are also Pattern_White_Space but they
// act as line endings, so you can't stick them in the middle of a statement.
// The lexer still handles them correctly at line boundaries.

// These are Unicode White_Space but NOT Pattern_White_Space:
// \u{A0}   no-break space        \u{1680} ogham space mark
// \u{2000} en quad               \u{2001} em quad
// \u{2002} en space              \u{2003} em space
// \u{2004} three-per-em space    \u{2005} four-per-em space
// \u{2006} six-per-em space      \u{2007} figure space
// \u{2008} punctuation space     \u{2009} thin space
// \u{200A} hair space            \u{202F} narrow no-break space
// \u{205F} medium math space     \u{3000} ideographic space

// add them up so the compiler doesn't complain about unused variables
let _sum = _ws1 + _ws2 + _ws3 + _ws4 + _ws5 + _ws6 + _ws7;
println!("{}", _sum);
}
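
The two sets named in the comments above can be compared mechanically. The sketch below is not part of the commit: `is_pattern_white_space` is a hypothetical helper that hard-codes the 11 code points listed in the test header, and the comparison relies on `char::is_whitespace` following Unicode White_Space as its documentation states.

```rust
/// Hypothetical helper (not from the commit): the 11 Pattern_White_Space
/// code points listed in the test's header comment.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{09}'..='\u{0D}'
            | '\u{20}'
            | '\u{85}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

fn main() {
    let all = '\0'..=char::MAX;

    // Accepted by the lexer but not whitespace for is_whitespace.
    let pattern_only: Vec<char> = all
        .clone()
        .filter(|&c| is_pattern_white_space(c) && !c.is_whitespace())
        .collect();
    assert_eq!(pattern_only, ['\u{200E}', '\u{200F}']);

    // Whitespace for is_whitespace but not accepted by the lexer:
    // the 16 characters enumerated in the comment block above.
    let white_only = all
        .filter(|&c| c.is_whitespace() && !is_pattern_white_space(c))
        .count();
    assert_eq!(white_only, 16);
}
```

Neither set contains the other: the two directional marks are lexer whitespace only, while the no-break and typographic spaces are `is_whitespace` only.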
