Commit 8a1f5c3

fcoury and claude committed
feat: add Unicode utilities and documentation
- Added unicode_utils.rs module with display width calculations and coordinate conversions
- Added comprehensive Unicode handling documentation
- Added unicode_demo.js plugin example showing Unicode text manipulation
- Added Unicode-specific tests for cursor movement, text editing, and display

These additions provide the foundation for proper multi-byte character support throughout the editor and plugin system.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 03393ca commit 8a1f5c3

7 files changed

Lines changed: 1486 additions & 0 deletions

docs/unicode-handling.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
# Unicode Handling in Red Editor

This guide explains how the Red editor handles Unicode and multi-byte characters, including implementation details for core developers and plugin authors.

## Overview

The Red editor's Unicode support includes:
- Proper rendering of wide characters (CJK, emoji)
- Grapheme cluster-aware cursor movement
- Correct display column calculations
- Plugin APIs for Unicode-aware text manipulation

## Three Coordinate Systems

Red uses three different coordinate systems for text positioning:

### 1. Byte Offsets
- Position in the UTF-8 encoded byte stream
- Used internally by the rope data structure
- Rarely exposed to plugins or users
- Example: "你好" is 6 bytes (3 bytes per character)

### 2. Character Indices
- Position counted in Unicode scalar values (Rust's `char`)
- Used by buffer operations and plugin APIs
- This is what `x` represents in cursor positions
- Example: "你好" is 2 characters

### 3. Display Columns
- Visual position in the terminal
- Accounts for character display width
- Used for rendering and visual alignment
- Example: "你好" takes 4 display columns (2 per character)

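
To make the three measures concrete, here is a minimal sketch that reports all of them for one string. It uses only the standard library, so `approx_char_width` is a hypothetical stand-in with a few hand-picked ranges; it is not the editor's `display_width` and not a full East Asian Width table. `measure` is likewise an illustrative name.

```rust
/// Rough terminal width of one `char`: 0, 1, or 2 columns.
/// ASSUMPTION: hand-picked ranges for illustration only, not the full
/// East Asian Width data that a real width library consults.
fn approx_char_width(c: char) -> usize {
    match c as u32 {
        // Combining diacritics, zero-width space/joiners, variation selectors
        0x0300..=0x036F | 0x200B..=0x200D | 0xFE00..=0xFE0F => 0,
        // Hangul Jamo, CJK blocks, Hangul syllables, compatibility
        // ideographs, fullwidth forms, and some common emoji blocks
        0x1100..=0x115F
        | 0x2E80..=0xA4CF
        | 0xAC00..=0xD7A3
        | 0xF900..=0xFAFF
        | 0xFF00..=0xFF60
        | 0xFFE0..=0xFFE6
        | 0x1F300..=0x1F64F
        | 0x1F900..=0x1F9FF
        | 0x20000..=0x3FFFD => 2,
        _ => 1,
    }
}

/// Returns (UTF-8 bytes, `char` count, approximate display columns).
pub fn measure(s: &str) -> (usize, usize, usize) {
    let bytes = s.len();
    let chars = s.chars().count();
    let columns: usize = s.chars().map(approx_char_width).sum();
    (bytes, chars, columns)
}
```

Under these assumptions, `measure("你好")` yields `(6, 2, 4)`, matching the three examples above, while an ASCII string such as `"Hello"` yields the same number in all three positions.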
## Core Implementation

### Unicode Utilities Module (`src/unicode_utils.rs`)

The core Unicode handling is implemented in the `unicode_utils` module:

```rust
// Calculate display width of a string
pub fn display_width(s: &str) -> usize

// Convert between coordinate systems
pub fn char_to_column(line: &str, char_pos: usize) -> usize
pub fn column_to_char(line: &str, column: usize) -> usize
pub fn byte_to_char(s: &str, byte_pos: usize) -> usize
pub fn char_to_byte(s: &str, char_pos: usize) -> usize

// Grapheme cluster operations
pub fn grapheme_count(s: &str) -> usize
pub fn next_grapheme_boundary(s: &str, char_pos: usize) -> Option<usize>
pub fn prev_grapheme_boundary(s: &str, char_pos: usize) -> Option<usize>
```

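
The listing above gives signatures only. As a rough illustration of the byte/character pair, the following sketch implements `byte_to_char` and `char_to_byte` with the standard library. The signatures match the listing, but the bodies, the clamping of out-of-range positions, and the round-down behavior for offsets inside a character are assumptions, not necessarily what `unicode_utils.rs` does.

```rust
/// Byte offset at which the character with index `char_pos` starts.
/// ASSUMPTION: positions past the end clamp to `s.len()`.
pub fn char_to_byte(s: &str, char_pos: usize) -> usize {
    s.char_indices()
        .nth(char_pos)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len())
}

/// Number of whole characters that end at or before `byte_pos`.
/// ASSUMPTION: an offset in the middle of a character rounds down to
/// the index of that character; offsets past the end give the char count.
pub fn byte_to_char(s: &str, byte_pos: usize) -> usize {
    s.char_indices()
        .take_while(|(byte_idx, c)| byte_idx + c.len_utf8() <= byte_pos)
        .count()
}
```

Both functions scan from the start of the string, so each call is linear in the string length. With "你好", character index 1 maps to byte offset 3 and back again.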
### Cursor Movement

Cursor movement respects grapheme boundaries:

```rust
// In editor.rs
Action::MoveLeft => {
    if self.cx > 0 {
        let line = self.current_line_contents();
        if let Some(prev) = prev_grapheme_boundary(&line, self.cx) {
            self.cx = prev;
        }
    }
}
```

This ensures that multi-codepoint sequences (like 👨‍👩‍👧‍👦) move as single units.

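
To show why stepping one `char` at a time is not enough, the sketch below walks left over a whole cluster. It is a deliberately simplified stand-in: it only treats combining diacritics, variation selectors, and zero-width joiners as "glue", whereas full grapheme segmentation follows the rules in Unicode Text Segmentation (UAX #29). The names `is_extending` and `prev_boundary_simplified` are hypothetical and are not part of `unicode_utils`.

```rust
/// ASSUMPTION: a tiny subset of the code points that extend a cluster.
fn is_extending(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'   // combining diacritical marks
        | '\u{FE00}'..='\u{FE0F}' // variation selectors
        | '\u{200D}'              // zero-width joiner (ZWJ)
    )
}

/// Character index of the cluster boundary to the left of `char_pos`,
/// or `None` if `char_pos` is 0 or out of range.
pub fn prev_boundary_simplified(line: &str, char_pos: usize) -> Option<usize> {
    let chars: Vec<char> = line.chars().collect();
    if char_pos == 0 || char_pos > chars.len() {
        return None;
    }
    let mut i = char_pos - 1;
    // Keep walking left while we are on an extending code point, or on a
    // code point that is glued to its left neighbour by a ZWJ.
    while i > 0 && (is_extending(chars[i]) || chars[i - 1] == '\u{200D}') {
        i -= 1;
    }
    Some(i)
}
```

The family emoji is seven `char`s (four emoji joined by three ZWJs), so a cursor at character index 7 has to jump back to 0 in one step; this sketch does that, and it also steps over `e` plus U+0301 as one unit.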
### Rendering

The rendering system accounts for character display width:

```rust
// In editor/rendering.rs
for grapheme in line.graphemes(true) {
    let width = display_width(grapheme);
    if width == 0 {
        // Zero-width character (e.g., combining marks)
        continue;
    }
    // Render with proper spacing for wide characters
}
```

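
As a rough picture of what "proper spacing" means, the sketch below assigns a starting column to each drawable unit of a line. It is an illustration under assumptions: it works per `char` rather than per grapheme, it takes the width function as a parameter (`width_of`) so the example stays self-contained, and it attaches zero-width code points to the preceding cell instead of skipping them. `DrawCell` and `layout_line` are hypothetical names.

```rust
/// One unit the renderer would draw, and the column where it starts.
#[derive(Debug, PartialEq)]
pub struct DrawCell {
    pub text: String,
    pub column: usize,
}

/// Lay out `line` left to right, advancing by each character's width.
pub fn layout_line(line: &str, width_of: impl Fn(char) -> usize) -> Vec<DrawCell> {
    let mut cells: Vec<DrawCell> = Vec::new();
    let mut column = 0;
    for c in line.chars() {
        let w = width_of(c);
        if w == 0 {
            // Zero-width code point: keep it with the previous cell so the
            // text is preserved, but do not advance the column.
            if let Some(last) = cells.last_mut() {
                last.text.push(c);
            }
            continue;
        }
        cells.push(DrawCell { text: c.to_string(), column });
        column += w;
    }
    cells
}
```

With a width function that returns 2 for CJK characters, the character after a wide one starts two columns later, which is the spacing the render loop has to reproduce.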
## Plugin API

### Text Manipulation

Plugin text operations use character indices:

```javascript
// Insert at character position 5
red.insertText(5, 0, "Hello");

// Delete 3 characters starting at position 10
red.deleteText(10, 0, 3);

// Replace 2 characters with new text
red.replaceText(8, 0, 2, "世界");
```

### Cursor Positioning

Plugins can work with both character positions and display columns:

```javascript
// Character-based positioning
red.setCursorPosition(7, 0);
const pos = await red.getCursorPosition(); // {x: 7, y: 0}

// Display column-based positioning
red.setCursorDisplayColumn(10, 0);
const col = await red.getCursorDisplayColumn(); // 10
```

122+
### Unicode Helper Methods
123+
124+
New helper methods for Unicode handling:
125+
126+
```javascript
127+
// Get display width of text
128+
const width = await red.getTextDisplayWidth("你好"); // Returns 4
129+
130+
// Convert between character index and display column
131+
const displayCol = await red.charIndexToDisplayColumn(5, 0);
132+
const charIndex = await red.displayColumnToCharIndex(10, 0);
133+
```
134+
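
The two conversion helpers correspond to a character-to-column mapping and its inverse. The sketch below shows one plausible way to compute both on the Rust side. It is an assumption-laden illustration: the extra `width_of` parameter is added here only to keep the example self-contained, and the tie-breaking rules (a column inside a wide character maps to that character, columns past the end clamp) are choices made for this sketch rather than documented behavior.

```rust
/// Display column at which the character with index `char_pos` starts.
pub fn char_to_column(line: &str, char_pos: usize, width_of: impl Fn(char) -> usize) -> usize {
    line.chars().take(char_pos).map(width_of).sum()
}

/// Index of the character that occupies display column `column`.
/// ASSUMPTION: a column in the second half of a wide character maps to
/// that character; a column past the end maps to the character count.
pub fn column_to_char(line: &str, column: usize, width_of: impl Fn(char) -> usize) -> usize {
    let mut col = 0;
    for (idx, c) in line.chars().enumerate() {
        let w = width_of(c);
        // Zero-width characters occupy no column and are never a target.
        if w > 0 && column < col + w {
            return idx;
        }
        col += w;
    }
    line.chars().count()
}
```

For the line `a你b` with a width of 2 for `你`, character index 2 starts at column 3, and columns 1 and 2 both map back to character index 1.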
## Common Scenarios

### Working with Mixed-Width Text

When aligning text in columns, use display width calculations:

```javascript
async function alignText(red, text, targetWidth) {
    const width = await red.getTextDisplayWidth(text);
    const padding = targetWidth - width;
    return text + ' '.repeat(Math.max(0, padding));
}
```

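
The same padding logic can be sketched on the Rust side. This is a hypothetical counterpart to `alignText`, not code from the editor: `pad_to_width` is an illustrative name, and the width function is passed in as `display_width` so that the example does not depend on any particular width implementation.

```rust
/// Pad `text` on the right with spaces until it spans `target_width` columns.
/// Text that is already wider than the target is returned unchanged.
pub fn pad_to_width(
    text: &str,
    target_width: usize,
    display_width: impl Fn(&str) -> usize,
) -> String {
    // saturating_sub plays the role of Math.max(0, padding) in the JS version.
    let padding = target_width.saturating_sub(display_width(text));
    format!("{}{}", text, " ".repeat(padding))
}
```

Padding "你好" to 6 columns adds two spaces, not four, because the text already occupies 4 columns even though it is only 2 characters long.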
### Finding Character Boundaries

When moving through text, respect grapheme boundaries:

```javascript
// Move cursor right by one visual character
const pos = await red.getCursorPosition();
red.execute('MoveRight'); // Handles grapheme boundaries
```

### Handling User Input

When processing user input with Unicode:

```javascript
red.on('buffer:changed', async (event) => {
    const line = await red.getBufferText(event.cursor.y, event.cursor.y + 1);
    const displayWidth = await red.getTextDisplayWidth(line);
    red.log(`Line ${event.cursor.y} is ${displayWidth} columns wide`);
});
```

## Best Practices

### For Core Development

1. **Always use grapheme boundaries** for cursor movement
2. **Test with complex Unicode** including:
   - ZWJ sequences: 👨‍👩‍👧‍👦
   - Combining marks: é (e + U+0301 combining acute accent)
   - Wide characters: 你好
   - RTL text: مرحبا

3. **Preserve text integrity**: never split grapheme clusters
4. **Use unicode_utils functions** instead of implementing your own

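
One way to keep such test strings honest is to pin down their byte and `char` counts in a small fixture table. The sketch below is hypothetical (`UNICODE_FIXTURES` and `mismatched_fixtures` are not names from the repository) and writes the tricky strings with escapes so that the exact code points are unambiguous.

```rust
/// (label, text, expected UTF-8 bytes, expected `char` count)
pub const UNICODE_FIXTURES: &[(&str, &str, usize, usize)] = &[
    ("ascii", "Hello", 5, 5),
    ("cjk", "你好", 6, 2),
    // Waving hand + globe: two 4-byte emoji
    ("emoji", "\u{1F44B}\u{1F30D}", 8, 2),
    // "cafe" followed by U+0301, i.e. a decomposed e-acute
    ("combining", "cafe\u{301}", 6, 5),
    // Family emoji: four 4-byte emoji joined by three 3-byte ZWJs
    ("zwj", "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}", 25, 7),
    // Arabic "marhaba": five 2-byte letters
    ("rtl", "\u{645}\u{631}\u{62D}\u{628}\u{627}", 10, 5),
];

/// Labels of fixtures whose actual counts differ from the expected ones.
pub fn mismatched_fixtures() -> Vec<&'static str> {
    UNICODE_FIXTURES
        .iter()
        .filter(|(_, text, bytes, chars)| {
            text.len() != *bytes || text.chars().count() != *chars
        })
        .map(|(label, _, _, _)| *label)
        .collect()
}
```

Grapheme counts and display widths are left out on purpose, since they depend on the segmentation and width libraries in use; a real test suite would add those columns using the `unicode_utils` functions.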
### For Plugin Development

1. **Understand the coordinate systems**:
   - Use character indices for text manipulation
   - Use display columns for visual alignment

2. **Test with Unicode content**:
   ```javascript
   const testCases = [
       "Hello",        // ASCII
       "你好",         // CJK
       "👋🌍",         // Emoji
       "cafe\u0301",   // Combining chars (e + U+0301), renders as "café"
       "👨‍👩‍👧‍👦"            // ZWJ sequence
   ];
   ```

3. **Handle edge cases**:
   - Empty strings
   - Lines with only wide characters
   - Mixed-width content

4. **Use the helper methods**:
   ```javascript
   // Don't manually calculate display width
   const width = await red.getTextDisplayWidth(text);

   // Don't assume 1 char = 1 column
   const col = await red.charIndexToDisplayColumn(x, y);
   ```

## Testing

### Unit Tests

Test files for Unicode handling:
- `tests/unicode.rs` - Basic Unicode operations
- `tests/visual_unicode.rs` - Visual mode with Unicode
- `tests/plugin_unicode.rs` - Plugin API with Unicode

### Manual Testing

1. Create a file with diverse Unicode content
2. Test cursor movement through all characters
3. Test selection across grapheme boundaries
4. Test plugin operations on Unicode text

## Troubleshooting

### Common Issues

1. **Cursor jumps unexpectedly**
   - Check if you're mixing character indices and display columns
   - Ensure grapheme boundaries are respected

2. **Text alignment breaks**
   - Use `getTextDisplayWidth()` instead of string length
   - Account for zero-width characters

3. **Plugin operations fail on Unicode**
   - Verify you're using character indices, not byte offsets
   - Test with the `unicode:test-helpers` command

### Debug Commands
249+
250+
Use the Unicode demo plugin to debug issues:
251+
```
252+
:unicode:cursor-info # Show current position details
253+
:unicode:test-helpers # Test coordinate conversions
254+
```
255+
256+
## Performance Considerations
257+
258+
- Display width calculation is O(n) - cache results when possible
259+
- Grapheme segmentation allocates memory - reuse iterators
260+
- Coordinate conversions scan the string - minimize conversions
261+
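
As one possible shape for the caching advice, the sketch below memoizes a width per line index and recomputes only after the line is invalidated. Everything here is hypothetical: `LineWidthCache` is not a type from the editor, and the width computation is passed in as a closure so the example does not assume a particular `display_width` implementation.

```rust
use std::collections::HashMap;

/// Caches one computed width per line index.
pub struct LineWidthCache {
    widths: HashMap<usize, usize>,
}

impl LineWidthCache {
    pub fn new() -> Self {
        Self { widths: HashMap::new() }
    }

    /// Returns the cached width for `line_idx`, calling `compute` only
    /// when no entry exists yet.
    pub fn width(
        &mut self,
        line_idx: usize,
        line: &str,
        compute: impl FnOnce(&str) -> usize,
    ) -> usize {
        *self.widths.entry(line_idx).or_insert_with(|| compute(line))
    }

    /// Must be called whenever the line at `line_idx` is edited;
    /// otherwise `width` keeps returning the stale value.
    pub fn invalidate(&mut self, line_idx: usize) {
        self.widths.remove(&line_idx);
    }
}
```

Keying by line index keeps lookups cheap, at the cost of requiring an explicit `invalidate` on every edit. Inserting or deleting whole lines shifts the indices of later lines, so a real implementation would also need to clear or re-key those entries.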
## Future Improvements

Potential enhancements for Unicode support:
- Bidirectional text (RTL) support
- Vertical text layout for CJK
- Unicode normalization options
- Configurable emoji presentation
- Performance optimizations for long lines

271+
## References
272+
273+
- [Unicode Standard](https://unicode.org/)
274+
- [Unicode Text Segmentation](https://unicode.org/reports/tr29/)
275+
- [East Asian Width](https://unicode.org/reports/tr11/)
276+
- [Rust unicode-width crate](https://docs.rs/unicode-width/)
277+
- [Rust unicode-segmentation crate](https://docs.rs/unicode-segmentation/)
