Commit 441ce50

Correct VS15 handling for emojis (#9)
1 parent 1a215eb commit 441ce50

8 files changed

Lines changed: 461 additions & 381 deletions

README.md

Lines changed: 20 additions & 14 deletions
@@ -5,6 +5,7 @@ A high-performance Go package for measuring the monospace display width of strin
 [![Documentation](https://pkg.go.dev/badge/github.com/clipperhouse/displaywidth.svg)](https://pkg.go.dev/github.com/clipperhouse/displaywidth)
 [![Test](https://github.com/clipperhouse/displaywidth/actions/workflows/gotest.yml/badge.svg)](https://github.com/clipperhouse/displaywidth/actions/workflows/gotest.yml)
 [![Fuzz](https://github.com/clipperhouse/displaywidth/actions/workflows/gofuzz.yml/badge.svg)](https://github.com/clipperhouse/displaywidth/actions/workflows/gofuzz.yml)
+
 ## Install
 ```bash
 go get github.com/clipperhouse/displaywidth
@@ -34,26 +35,40 @@ func main() {

 For most purposes, you should use the `String` or `Bytes` methods.

+
 ### Options

-You can specify East Asian Width settings. If unspecified, the default is `EastAsianWidth: false`.
+You can specify East Asian Width settings. When false (default),
+[East Asian Ambiguous characters](https://www.unicode.org/reports/tr11/#Ambiguous)
+are treated as width 1. When true, East Asian Ambiguous characters are treated
+as width 2.

 ```go
-options := displaywidth.Options{
+myOptions := displaywidth.Options{
 	EastAsianWidth: true,
 }

-width := options.String("Hello, 世界!")
+width := myOptions.String("Hello, 世界!")
 fmt.Println(width)
 ```

-## Details
+## Technical details

 This package implements the Unicode East Asian Width standard
 ([UAX #11](https://www.unicode.org/reports/tr11/)), and handles
 [variation selectors](https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)),
 and [regional indicator pairs](https://en.wikipedia.org/wiki/Regional_indicator_symbol)
-(flags). We cover much of [Unicode TR51](https://unicode.org/reports/tr51/).
+(flags). We implement [Unicode TR51](https://unicode.org/reports/tr51/).
+
+`clipperhouse/displaywidth`, `mattn/go-runewidth`, and `rivo/uniseg` will
+give the same outputs for most real-world text. See extensive details in the
+[compatibility analysis](comparison/COMPATIBILITY_ANALYSIS.md).
+
+If you wish to investigate the core logic, see the `lookupProperties` and `width`
+functions in [width.go](width.go#L135). The essential trie generation logic is in
+`buildPropertyBitmap` in [unicode.go](internal/gen/unicode.go#L317).
+
+I (@clipperhouse) am keeping an eye on [emerging standards and test suites](https://www.jeffquast.com/post/state-of-terminal-emulation-2025/).

 ## Prior Art

@@ -106,12 +121,3 @@ BenchmarkRune_ASCII/mattn/go-runewidth-8 266.4 ns/op 480.
 BenchmarkRune_Emoji/clipperhouse/displaywidth-8 1384 ns/op 523.02 MB/s 0 B/op 0 allocs/op
 BenchmarkRune_Emoji/mattn/go-runewidth-8 2273 ns/op 318.45 MB/s 0 B/op 0 allocs/op
 ```
-
-## Compatibility
-
-`clipperhouse/displaywidth`, `mattn/go-runewidth`, and `rivo/uniseg` should give the
-same outputs for real-world text. See [comparison/README.md](comparison/README.md).
-
-If you wish to investigate the core logic, see the `lookupProperties` and `width`
-functions in [width.go](width.go#L112). The core of the trie generation logic is in
-`BuildPropertyBitmap` in [unicode.go](internal/gen/unicode.go#L309).
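
The `EastAsianWidth` option documented in the README change above can be illustrated with a minimal, self-contained sketch. It does not call the package: the `ambiguous` set and the `widthOf` helper are hypothetical stand-ins for the real lookup, and only three East Asian Ambiguous characters are listed for illustration.

```go
package main

import "fmt"

// ambiguous is a tiny hand-picked sample of East Asian Ambiguous characters.
// A real implementation would consult generated Unicode tables instead.
var ambiguous = map[rune]bool{'★': true, '°': true, '±': true}

// widthOf is a hypothetical helper: ambiguous characters count as 2 columns
// when eastAsianWidth is true, and as 1 column otherwise.
func widthOf(s string, eastAsianWidth bool) int {
	total := 0
	for _, r := range s {
		if ambiguous[r] && eastAsianWidth {
			total += 2
			continue
		}
		total++ // every other character is assumed narrow in this sketch
	}
	return total
}

func main() {
	fmt.Println(widthOf("★°±", false)) // 3
	fmt.Println(widthOf("★°±", true))  // 6
}
```

The sketch deliberately ignores wide, zero-width, and emoji characters, so it only demonstrates how a single boolean flips the width of the ambiguous class.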
Lines changed: 53 additions & 46 deletions
@@ -1,71 +1,97 @@
 # Compatibility Analysis: displaywidth, go-runewidth, and uniseg

-> **Generated by:** Cursor IDE using Claude Sonnet 4.5, with a few edits by me (@clipperhouse)
-
-## Summary
+> Generated by Cursor IDE using Claude Sonnet 4.5, and edited by @clipperhouse

 This document summarizes the compatibility findings between three Go libraries for Unicode string width calculation:

 - [clipperhouse/displaywidth](https://github.com/clipperhouse/displaywidth) (this package)
 - [mattn/go-runewidth](https://github.com/mattn/go-runewidth)
 - [rivo/uniseg](https://github.com/rivo/uniseg)

-### 1. Regional Indicator Pairs (Flags)
+## Basic Unicode Categories
+
+Most Unicode categories show good compatibility.
+
+| Category | displaywidth | go-runewidth | uniseg |
+|----------|--------------|--------------|---------|
+| ASCII | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+| Latin Extended | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+| CJK (Chinese/Japanese/Korean) | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+| Arabic | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+| Combining Marks | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+| Zero-Width Characters | ✅ Compatible | ✅ Compatible | ✅ Compatible |
+
+## Emojis
+
+Regular emojis (😀, 🚀, 🎉, etc.) behave identically:
+
+| Library | Regular Emoji Width |
+|---------|---------------------|
+| **displaywidth** | Always 2 |
+| **go-runewidth** | Always 2 |
+| **uniseg** | Always 2 |
+
+### Regional Indicator Pairs (Flags)

 Regional indicator pairs (flags like 🇺🇸) are composed of two Regional Indicator symbols.

 | Library | Behavior |
 |--------|----------|
-| **displaywidth** | Width 2 per flag (always, per Unicode TR51) |
-| **go-runewidth** | Width 1 per flag (always) |
-| **uniseg** | Width 2 per flag (always) |
+| **displaywidth** | Width 2 per flag |
+| **go-runewidth** | Width 1 per flag |
+| **uniseg** | Width 2 per flag |

 **Example:** `🇺🇸🇯🇵🇬🇧` (3 flags)
 - displaywidth: 6 columns (2+2+2)
 - go-runewidth: 3 columns (1+1+1)
 - uniseg: 6 columns (2+2+2)

-### 2. Variation Selectors
+## Variation Selectors

-**Variation selectors (VS15, VS16) affect emoji presentation:**
+VS15 and VS16 from [Unicode TR51](https://unicode.org/reports/tr51/#Emoji_Variation_Sequences)

-| Library | Behavior |
-|---------|----------|
-| **displaywidth** | Treats variation selectors as part of emoji (width 2) |
-| **go-runewidth** | Treats variation selectors as separate characters (width 1 each) |
-| **uniseg** | Treats variation selectors as part of emoji (width 2) |
+| Library | VS16 (U+FE0F) | VS15 (U+FE0E) |
+|---------|---------------|---------------|
+| **displaywidth** | Forces emoji presentation (width 2) | No effect, preserves base width |
+| **go-runewidth** | Treated as separate character (width 1) | Treated as separate character (width 1) |
+| **uniseg** | Treated as part of emoji (width 2) | Forces width 1 |

 **Example:** `☺️⌛︎❤️` (3 emoji with variation selectors)
-- displaywidth: 5 columns
+- displaywidth: 6 columns
 - go-runewidth: 4 columns
 - uniseg: 5 columns

-### 3. Keycap Sequences
+It would appear to me (@clipperhouse) that the handling of VS15 is not widely
+agreed upon. Some libraries and standards (such as wcwidth) interpret it as
+"always narrow to width 1". Others (such as this library) interpret it as
+"no effect on width, use the base character width".
+
+Here is [a conversation on GitHub](https://github.com/contour-terminal/contour/discussions/1178#discussioncomment-6778716) and an [explanation from Grok](https://grok.com/share/bGVnYWN5LWNvcHk%3D_274f540c-c9a6-47c7-9d4f-47697ed20032).

-**Keycap sequences (digit/symbol + VS16 + combining enclosing keycap):**
+## Keycap Sequences

 Keycap sequences like 1️⃣ are formed by: base character + variation selector (U+FE0F) + combining enclosing keycap (U+20E3).

 | Library | Behavior | Width per Keycap |
 |---------|----------|------------------|
-| **displaywidth** | Treats as single wide emoji | 2 columns |
-| **go-runewidth** | Treats base character only | 1 column |
-| **uniseg** | Treats as single narrow character | 1 column |
+| **displaywidth** | Treats as emoji | 2 columns |
+| **go-runewidth** | Treats base character | 1 column |
+| **uniseg** | Treats as base character | 1 column |

 **Example:** `1️⃣#️⃣` (2 keycap sequences)
 - displaywidth: 4 columns (2 per keycap)
 - go-runewidth: 2 columns (1 per keycap)
 - uniseg: 2 columns (1 per keycap)

-### 4. East Asian Ambiguous Width
+## East Asian Ambiguous Width

-**East Asian Ambiguous characters** (★, °, ±, etc.) can be rendered as either narrow (1 column) or wide (2 columns) depending on configuration.
+[East Asian Ambiguous characters](https://www.unicode.org/reports/tr11/#Ambiguous) (★, °, ±, etc.) can be rendered as either narrow (1 column) or wide (2 columns) depending on configuration.

 | Library | Default | With EastAsianWidth=true |
 |---------|---------|--------------------------|
 | **displaywidth** | Width 1 | Width 2 |
 | **go-runewidth** | Width 1 | Width 2 |
-| **uniseg** | Width 1 | Width 2 (usually) |
+| **uniseg** | Width 1 | Width 2 |

 **Example:** `★°±` (3 ambiguous characters)
 - displaywidth default: 3 columns
@@ -75,29 +101,6 @@ Keycap sequences like 1️⃣ are formed by: base character + variation selector
 - uniseg default: 3 columns
 - uniseg with EastAsianAmbiguousWidth=2: 5 columns (usually)

-### 5. Regular Emojis
-
-**Regular emojis (😀, 🚀, 🎉, etc.) behave identically:**
-
-| Library | Regular Emoji Width |
-|---------|---------------------|
-| **displaywidth** | Always 2 |
-| **go-runewidth** | Always 2 |
-| **uniseg** | Always 2 |
-
-### 6. Basic Unicode Categories
-
-Most Unicode categories show good compatibility.
-
-| Category | displaywidth | go-runewidth | uniseg |
-|----------|--------------|--------------|---------|
-| ASCII | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-| Latin Extended | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-| CJK (Chinese/Japanese/Korean) | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-| Arabic | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-| Combining Marks | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-| Zero-Width Characters | ✅ Compatible | ✅ Compatible | ✅ Compatible |
-
 ## Detailed Test Results

 > Run `go test -v` in the `comparison/` directory to see comprehensive behavior comparisons between libraries.
@@ -122,3 +125,7 @@ Most Unicode categories show good compatibility.
 - displaywidth: 6 columns (2 per emoji)
 - go-runewidth: 6 columns (2 per emoji)
 - uniseg: 6 columns (2 per emoji)
+
+## Further Reading
+
+- [State of Terminal Emulation 2025](https://www.jeffquast.com/post/state-of-terminal-emulation-2025/)
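
The flag rows in the compatibility analysis above differ only in how many columns a regional indicator pair receives. A minimal sketch of that pairing rule follows, assuming a hypothetical `flagWidth` helper that is not part of any of the three libraries; the `perFlag` parameter exists purely so both reported behaviors can be reproduced.

```go
package main

import "fmt"

// isRegionalIndicator reports whether r lies in the Regional Indicator
// Symbol range U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// flagWidth is a hypothetical helper: each pair of consecutive regional
// indicators counts as perFlag columns, anything else as 1 column.
func flagWidth(s string, perFlag int) int {
	runes := []rune(s)
	total := 0
	for i := 0; i < len(runes); i++ {
		if isRegionalIndicator(runes[i]) && i+1 < len(runes) && isRegionalIndicator(runes[i+1]) {
			total += perFlag
			i++ // consume the second half of the pair
			continue
		}
		total++
	}
	return total
}

func main() {
	// US, JP, GB flags written as escaped regional indicator pairs.
	flags := "\U0001F1FA\U0001F1F8\U0001F1EF\U0001F1F5\U0001F1EC\U0001F1E7"
	fmt.Println(flagWidth(flags, 2)) // 6
	fmt.Println(flagWidth(flags, 1)) // 3
}
```

With `perFlag` set to 2 the sketch matches the 6 columns listed for displaywidth and uniseg, and with 1 it matches the 3 columns listed for go-runewidth.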

comparison/behavior_test.go

Lines changed: 5 additions & 4 deletions
@@ -73,7 +73,7 @@ func TestLibraryBehaviorComparison(t *testing.T) {
 				"displaywidth_default": 14, // 2 per emoji (properly handles Unicode 16.0)
 				"displaywidth_options{}": 14,
 				"go-runewidth_default": 7, // go-runewidth may not fully support Unicode 16.0 yet (treats as width 1)
-				"uniseg_default": 7, // uniseg may not fully support Unicode 16.0 yet (treats as width 1)
+				"uniseg_default": 7, // uniseg may not fully support Unicode 16.0 yet (treats as width 1)
 			},
 		},

@@ -92,14 +92,15 @@ func TestLibraryBehaviorComparison(t *testing.T) {
 		},

 		// Variation selectors
+		// ☺️ (U+263A + VS16) = width 2, ⌛︎ (U+231B + VS15) = width 2 (VS15 is no-op), ❤️ (U+2764 + VS16) = width 2
 		{
 			name: "Variation selectors",
 			input: "☺️⌛︎❤️",
 			expected: map[string]int{
-				"displaywidth_default": 5,
-				"displaywidth_options{}": 5,
+				"displaywidth_default": 6, // 2 + 2 + 2 (VS15 is no-op per Unicode TR51)
+				"displaywidth_options{}": 6,
 				"go-runewidth_default": 4,
-				"uniseg_default": 5,
+				"uniseg_default": 5, // uniseg still treats VS15 as width 1
 			},
 		},
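
The changed expectations in this test (6 for displaywidth, 5 for uniseg) follow from the two readings of VS15. The sketch below reproduces both totals under stated assumptions: `baseWidth` is a hypothetical stand-in for a real Unicode lookup that only knows U+231B is wide, and `sequenceWidth` is an illustrative helper, not the package's `width` function.

```go
package main

import "fmt"

const (
	vs15 = '\uFE0E' // VARIATION SELECTOR-15, text presentation
	vs16 = '\uFE0F' // VARIATION SELECTOR-16, emoji presentation
)

// baseWidth is a hypothetical lookup: U+231B (hourglass) is East Asian Wide,
// everything else is assumed narrow for the purposes of this sketch.
func baseWidth(r rune) int {
	if r == 0x231B {
		return 2
	}
	return 1
}

// sequenceWidth sums widths, letting VS16 widen the preceding character to 2.
// When vs15Narrows is true, VS15 narrows the preceding character to 1;
// when false, VS15 is a no-op and the base width is preserved.
func sequenceWidth(s string, vs15Narrows bool) int {
	total := 0
	last := 0 // width currently attributed to the most recent base character
	for _, r := range s {
		switch r {
		case vs16:
			total += 2 - last
			last = 2
		case vs15:
			if vs15Narrows {
				total += 1 - last
				last = 1
			}
		default:
			last = baseWidth(r)
			total += last
		}
	}
	return total
}

func main() {
	// U+263A VS16, U+231B VS15, U+2764 VS16: the input used in the test above.
	s := "\u263A\uFE0F\u231B\uFE0E\u2764\uFE0F"
	fmt.Println(sequenceWidth(s, false)) // 6: VS15 as a no-op
	fmt.Println(sequenceWidth(s, true))  // 5: VS15 narrows to width 1
}
```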

internal/gen/trie.go

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ func GenerateTrie(data *UnicodeData) (*triegen.Trie, error) {
 			continue
 		}

-		props := BuildPropertyBitmap(r, data)
+		props := buildPropertyBitmap(r, data)

 		// Only insert characters with non-default properties
 		if props != 0 {

internal/gen/unicode.go

Lines changed: 19 additions & 18 deletions
@@ -36,23 +36,22 @@ type PropertyDefinition struct {
 // The order matters - it defines the bit positions (via iota).
 var PropertyDefinitions = []PropertyDefinition{
 	{"Zero_Width", "Always 0 width, includes combining marks, control characters, non-printable, etc"},
-	{"Always_Wide", "Always 2 wide"},
+	{"East_Asian_Wide", "Always 2 wide (East Asian Wide F/W)"},
 	{"East_Asian_Ambiguous", "Width depends on EastAsianWidth option"},
-	{"Always_Narrow", "VARIATION SELECTOR-15 (U+FE0E) requests text presentation (width 1); not in the trie, see [width]"},
+	{"Emoji", "Extended_Pictographic + Emoji_Presentation"},
 }

 // these constants are used to build the property bitmap, internally.
 // the external properties are above. Keep them in the same order!
 const (
 	// ZWSP, ZWJ, ZWNJ, etc.
 	zero_Width property = iota + 1
-	// F, W
-	always_Wide
-	// A
+	// F, W (East Asian Wide)
+	east_Asian_Wide
+	// A (East Asian Ambiguous)
 	east_Asian_Ambiguous
-	// VS15 requests text presentation (width 1)
-	// not used in the trie, but noted here for reference
-	// always_Narrow
+	// Extended_Pictographic + Emoji_Presentation but not East Asian Wide
+	emoji
 )

 // ParseUnicodeData downloads and parses all required Unicode data files
@@ -314,8 +313,7 @@ func extractRunesFromRangeTable(table *unicode.RangeTable, target map[rune]bool)
 	}
 }

-// BuildPropertyBitmap creates a properties bitmap for a given rune
-func BuildPropertyBitmap(r rune, data *UnicodeData) property {
+func buildPropertyBitmap(r rune, data *UnicodeData) property {
 	if data.CombiningMarks[r] {
 		return zero_Width
 	}
@@ -326,23 +324,26 @@ func BuildPropertyBitmap(r rune, data *UnicodeData) property {
 		return zero_Width
 	}

-	// East Asian Width
-	// Only store properties that affect width calculation
+	// As a practical matter, we probably don't need separate properties for
+	// Emoji and East Asian Wide, as I believe they lead to the same
+	// result. I made this distinction for VS15 handling. However,
+	// eventually I came to the conclusion that VS15 is a no-op for width
+	// calculation. Keeping the distinction for now.
+
+	if data.ExtendedPictographic[r] && data.EmojiPresentation[r] {
+		return emoji
+	}
+
 	if eaw, exists := data.EastAsianWidth[r]; exists {
 		switch eaw {
 		case "F", "W":
-			return always_Wide
+			return east_Asian_Wide
 		case "A":
 			return east_Asian_Ambiguous
 			// H (Halfwidth), Na (Narrow), and N (Neutral) are not stored
 			// as they all result in width 1 (default behavior)
 		}
 	}

-	// Emoji properties
-	if data.ExtendedPictographic[r] && data.EmojiPresentation[r] {
-		return always_Wide
-	}
-
 	return 0
 }
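
The effect of moving the emoji check ahead of the East Asian Width switch can be seen in a self-contained sketch. Everything here is an assumption made for illustration: `unicodeData` is a trimmed-down stand-in for the generator's `UnicodeData`, `classify` mirrors only the branches visible in this diff, and `sampleData` holds four hand-picked code points rather than parsed Unicode files.

```go
package main

import "fmt"

type property uint8

// Same ordering as the constants in the diff, with Go-style names.
const (
	zeroWidth property = iota + 1
	eastAsianWide
	eastAsianAmbiguous
	emoji
)

// unicodeData is a hypothetical, reduced version of the generator's data.
type unicodeData struct {
	CombiningMarks       map[rune]bool
	ExtendedPictographic map[rune]bool
	EmojiPresentation    map[rune]bool
	EastAsianWidth       map[rune]string
}

// classify checks emoji before East Asian Width, so a rune that is both
// Emoji_Presentation and East Asian Wide is reported as emoji.
func classify(r rune, d *unicodeData) property {
	if d.CombiningMarks[r] {
		return zeroWidth
	}
	if d.ExtendedPictographic[r] && d.EmojiPresentation[r] {
		return emoji
	}
	switch d.EastAsianWidth[r] {
	case "F", "W":
		return eastAsianWide
	case "A":
		return eastAsianAmbiguous
	}
	return 0 // narrow and neutral runes carry no stored property
}

// sampleData returns illustrative entries only.
func sampleData() *unicodeData {
	return &unicodeData{
		CombiningMarks:       map[rune]bool{0x0301: true},
		ExtendedPictographic: map[rune]bool{0x231B: true},
		EmojiPresentation:    map[rune]bool{0x231B: true},
		EastAsianWidth:       map[rune]string{0x231B: "W", 0x4E16: "W", 0x2605: "A"},
	}
}

func main() {
	d := sampleData()
	for _, r := range []rune{0x0301, 0x231B, 0x4E16, 0x2605, 'a'} {
		fmt.Printf("U+%04X -> %d\n", r, classify(r, d))
	}
}
```

In this sketch U+231B is classified as `emoji` even though its East Asian Width is "W", which is the distinction the new code comment describes; under the previous ordering both cases collapsed into a single wide property.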
