Skip to content

Commit 8bc1877

Browse files
Fix byte/character offset confusion in formatter for multi-byte UTF-8 (#105)
Formatting proto files with multi-byte UTF-8 characters (Cyrillic, etc.) was non-idempotent, adding empty `// ` comment lines on each format operation. ## Root Cause `offset_to_position` in `src/formatter/clang.rs` converted clang-format's byte offsets to LSP positions using byte arithmetic: ```rust let character = offset - last_newline; // treats byte offset as character offset ``` This fails for multi-byte UTF-8. Example: byte offset 134 in Cyrillic text → calculated position 119 → should be 77 UTF-16 code units. ## Changes - **Fixed offset calculation**: Count UTF-16 code units from last newline instead of byte arithmetic ```rust let text_after_newline = &up_to_offset[last_newline..]; let character = text_after_newline.encode_utf16().count(); ``` - **Added tests**: `test_offset_to_position_cyrillic` (unit) and `test_textedit_from_clang_output_cyrillic` (integration) with multi-byte UTF-8 input <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> > > ---- > > *This section details on the original issue you should resolve* > > <issue_title>Formatting inserts a new empty // comment line every time (non-idempotent formatting)</issue_title> > <issue_description>Hi! I just so happen to stumble upon this tricky bug in formatting. > ## What happened > I have this example: > > > > ```proto > message Test { > // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true. > int32 x = 1; > } > ``` > > When applying a formatting it seem to try to split the comment in two and spread among two lines. But in the end it just adds a new line with empty comment: > ```proto > message Test { > // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true. > // > int32 x = 1; > } > ``` > > Further formatting just add empty line comments. > ```proto > message Test { > // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true. > // > // > // > int32 x = 1; > } > ``` > > ### Environment > > - OS: Linux fedora 42 > - Neovim: NVIM v0.11.1 > - protols: 0.13.2 > - clang-format: clang-format version 20.1.8 (Fedora 20.1.8-4.fc42) > - Formatting trigger: Neovim LSP (`vim.lsp.buf.format()`), also happens on `:w` (format-on-save enabled) > > ### Video Example > https://github.com/user-attachments/assets/f33be03e-78e6-45db-8c83-89ae15a31d0b > </issue_description> > > ## Comments on the Issue (you are @copilot in this section) > > <comments> > </comments> > </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes #104 <!-- START COPILOT CODING AGENT TIPS --> --- ✨ Let Copilot coding agent [set things up for you](https://github.com/coder3101/protols/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot) — coding agent works faster and does higher quality work when set up for your repo. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: asharkhan3101 <140482588+asharkhan3101@users.noreply.github.com>
1 parent f1ce7e5 commit 8bc1877

5 files changed

Lines changed: 73 additions & 1 deletion

src/formatter/clang.rs

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,7 +46,11 @@ impl Replacement<'_> {
4646
let up_to_offset = &content[..offset];
4747
let line = up_to_offset.matches('\n').count();
4848
let last_newline = up_to_offset.rfind('\n').map_or(0, |pos| pos + 1);
49-
let character = offset - last_newline;
49+
50+
// LSP uses UTF-16 code units for character positions
51+
// Count UTF-16 code units from last newline to offset
52+
let text_after_newline = &up_to_offset[last_newline..];
53+
let character = text_after_newline.encode_utf16().count();
5054

5155
Some(Position {
5256
line: line as u32,
@@ -178,4 +182,44 @@ mod test {
178182
})
179183
}
180184
}
185+
186+
#[test]
187+
fn test_offset_to_position_cyrillic() {
188+
// Test with Cyrillic characters (multi-byte UTF-8)
189+
let c = include_str!("input/test_cyrillic.proto");
190+
// Byte offset 134 corresponds to UTF-16 code unit 77 from the start of line 1
191+
// (the comment line contains multi-byte UTF-8 characters)
192+
let pos = vec![0, 15, 134];
193+
for i in pos {
194+
with_settings!({description => c, info => &i}, {
195+
assert_yaml_snapshot!(Replacement::offset_to_position(i, c));
196+
})
197+
}
198+
}
199+
200+
#[test]
201+
fn test_textedit_from_clang_output_cyrillic() {
202+
// Test that the complete flow works with Cyrillic characters
203+
// This simulates what clang-format would output for the Cyrillic comment
204+
let content = include_str!("input/test_cyrillic.proto");
205+
let xml_output = r#"<?xml version='1.0'?>
206+
<replacements xml:space='preserve' incomplete_format='false'>
207+
<replacement offset='134' length='1'>
208+
// </replacement>
209+
</replacements>"#;
210+
211+
let r = Replacements::from_str(xml_output).unwrap();
212+
assert_eq!(r.replacements.len(), 1);
213+
214+
let replacement = &r.replacements[0];
215+
assert_eq!(replacement.offset, 134);
216+
assert_eq!(replacement.length, 1);
217+
218+
let text_edit = replacement.as_text_edit(content).unwrap();
219+
// The edit should be at line 1, character 77 (not 119)
220+
assert_eq!(text_edit.range.start.line, 1);
221+
assert_eq!(text_edit.range.start.character, 77);
222+
assert_eq!(text_edit.range.end.line, 1);
223+
assert_eq!(text_edit.range.end.character, 78);
224+
}
181225
}
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
message Test {
2+
// Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.
3+
int32 x = 1;
4+
}
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
---
2+
source: src/formatter/clang.rs
3+
description: "message Test {\n // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.\n int32 x = 1;\n}\n"
4+
expression: "Replacement::offset_to_position(i, c)"
5+
info: 15
6+
---
7+
line: 1
8+
character: 0
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
---
2+
source: src/formatter/clang.rs
3+
description: "message Test {\n // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.\n int32 x = 1;\n}\n"
4+
expression: "Replacement::offset_to_position(i, c)"
5+
info: 134
6+
---
7+
line: 1
8+
character: 77
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
---
2+
source: src/formatter/clang.rs
3+
description: "message Test {\n // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.\n int32 x = 1;\n}\n"
4+
expression: "Replacement::offset_to_position(i, c)"
5+
info: 0
6+
---
7+
line: 0
8+
character: 0

0 commit comments

Comments
 (0)