fix(edit): fix the loose_escape fallback path writing incorrect content when the LLM makes escaping errors #136
Open
fym998 wants to merge 6 commits into
When the LLM miscounts backslash escapes (common in LaTeX, JSON nesting), the exact match fails and the loose_escape regex recovers the matched position. However, the fallback path was writing the LLM's original newString verbatim, which carried the same escaping errors as the old_string. This silently corrupted files by introducing doubled or missing backslashes on every replacement that went through the loose_escape path without a successful llm_escape_correction round-trip.

Add fixNewStringEscaping() which tokenizes strings into backslash runs and text segments, aligns old_string with the regex-matched text to compute per-run backslash count ratios, and applies the same ratios to newString. When newString has more backslash runs than old_string, the last ratio is reused (the escaping error is typically uniform).

Also update replacementOldString to use the matched text so both sides are consistent, matching the pattern already established by tab_correction.
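The ratio idea in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's fixNewStringEscaping(): the names backslashRuns, repeatBackslash and fixEscapingSketch are hypothetical, and it assumes old_string and the matched text have the same number of backslash runs (it does not align text segments or handle runs that collapse to zero).

```typescript
// Illustrative sketch only; names and details are assumptions,
// not the PR's actual fixNewStringEscaping() implementation.

// Lengths of each maximal run of backslashes in s, left to right.
function backslashRuns(s: string): number[] {
  const runs: number[] = [];
  let i = 0;
  while (i < s.length) {
    if (s.charAt(i) !== "\\") { i++; continue; }
    let j = i;
    while (j < s.length && s.charAt(j) === "\\") j++;
    runs.push(j - i);
    i = j;
  }
  return runs;
}

// Build a string of n backslashes.
function repeatBackslash(n: number): string {
  let s = "";
  for (let k = 0; k < n; k++) s += "\\";
  return s;
}

// Rescale every backslash run in newString by the ratio
// (matched run length / old_string run length) of the run at the same index.
// Returns null when the sketch cannot infer a whole-number result.
function fixEscapingSketch(
  oldString: string,
  matchedText: string,
  newString: string
): string | null {
  const oldRuns = backslashRuns(oldString);
  const matchedRuns = backslashRuns(matchedText);
  // Simplifying assumption: runs must align one-to-one.
  if (oldRuns.length === 0 || oldRuns.length !== matchedRuns.length) return null;

  let out = "";
  let runIndex = 0;
  let i = 0;
  while (i < newString.length) {
    if (newString.charAt(i) !== "\\") { out += newString.charAt(i); i++; continue; }
    let j = i;
    while (j < newString.length && newString.charAt(j) === "\\") j++;
    // Extra runs in newString reuse the last known ratio.
    const k = Math.min(runIndex, oldRuns.length - 1);
    const scaled = (j - i) * matchedRuns[k];
    if (scaled % oldRuns[k] !== 0) return null; // not a whole number of backslashes
    out += repeatBackslash(scaled / oldRuns[k]);
    runIndex++;
    i = j;
  }
  return out;
}

// The LLM doubled every backslash; the file itself uses single backslashes.
console.log(
  fixEscapingSketch("\\\\alpha + \\\\beta", "\\alpha + \\beta", "\\\\gamma + \\\\delta + \\\\epsilon")
); // prints: \gamma + \delta + \epsilon
```

The third run in newString has no counterpart in old_string, so it reuses the ratio of the last aligned run, which is the behavior the commit message describes for a uniform escaping error.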
…back
Add two tests that exercise the loose_escape fallback path without an
LLM client (so correctEscapedStringsWithLLM is skipped), confirming
that the escaping correction is applied to newString:
- Over-escaped LaTeX commands: \\alpha → \alpha
- Over-escaped LaTeX accent: H\\\"{o}tel → H\"{o}tel
(both backslash and quote doubled by LLM)
Both cases verify that the resulting file content uses correctly
escaped single-backslash LaTeX, not the LLM's original
multi-backslash new_string.
Allow loose_escape newString correction to handle cases where old_string escapes a character but the matched file text has no backslash, such as \" matching a literal quote. Keep the correction aligned with the loose_escape regex semantics and preserve reuse of the last ratio for extra new_string backslash runs.

Add regression coverage for quote escapes collapsing to zero backslashes and for extra new_string backslash runs reusing the last correction ratio.
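The loose_escape regex itself is not shown in this conversation. One way to picture the zero-backslash case is a pattern that replaces every backslash run in old_string with a capture group matching zero or more backslashes, so the matched text reports a per-run count directly, including zero. The sketch below is built on that assumption; looseEscapePattern and matchedRunCounts are hypothetical names and the real regex may differ.

```typescript
// Hypothetical sketch; the PR's real loose_escape regex may be built differently.

// Escape a literal character for use inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn each maximal backslash run in oldString into a capture group that
// matches zero or more backslashes; every other character must match literally.
function looseEscapePattern(oldString: string): RegExp {
  let source = "";
  let i = 0;
  while (i < oldString.length) {
    if (oldString.charAt(i) === "\\") {
      while (i < oldString.length && oldString.charAt(i) === "\\") i++;
      source += "(\\\\*)"; // regex source: (\\*)
    } else {
      source += escapeRegExp(oldString.charAt(i));
      i++;
    }
  }
  return new RegExp(source);
}

// Backslash count found in fileText for each run of oldString,
// or null when the loose pattern does not match at all.
function matchedRunCounts(oldString: string, fileText: string): number[] | null {
  const m = looseEscapePattern(oldString).exec(fileText);
  if (m === null) return null;
  const counts: number[] = [];
  for (let g = 1; g < m.length; g++) counts.push(m[g].length);
  return counts;
}

// old_string contains \" twice, the file contains plain quotes.
console.log(matchedRunCounts("say \\\"hi\\\"", "print: say \"hi\"")); // prints: [ 0, 0 ]
```

A count of zero for a run means the correction has to drop that run from newString entirely, which is the case the regression tests in this commit are said to cover.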
Contributor
Author
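The trigger conditions for the bug may not be described accurately (they were written by AI). In my own use it sometimes triggers and sometimes does not, so it is hard to reproduce. I will check again.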
Contributor
Author
I just found an optimization: the local algorithm can replace LLM correction in many cases, which reduces cost and waiting time.
Run local newString escaping correction before invoking LLM correction for unique loose_escape matches. Only fall back to LLM correction when deterministic ratio inference is ambiguous, such as mixed escaping ratios with extra new_string slash runs. Update edit handler tests to assert deterministic cases avoid LLM calls, and add LLM fallback coverage for mixed escaping in LaTeX, JS unicode escapes, and JSON strings.
…ring when escaping is ambiguous and LLM unavailable

When deterministic escape correction fails (mixed/inconsistent ratios) and no LLM is available to disambiguate, return a clear error instead of silently writing the uncorrected (potentially over-escaped) newString to the file. This is safer because the LLM can then re-read the file with exact escaping or fall back to the Bash tool.
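The ordering described in the last two commits (deterministic correction first, LLM correction only when that is ambiguous, explicit error when neither works) might look like the sketch below. All names here (EditOutcome, Corrector, resolveNewString) and the error text are hypothetical, and the correctors are passed in as plain synchronous functions to keep the example self-contained; a real LLM call would presumably be asynchronous.

```typescript
// Hypothetical sketch of the fallback order; not the PR's actual handler code.

type EditOutcome =
  | { ok: true; newString: string; via: "deterministic" | "llm" }
  | { ok: false; error: string };

// A corrector returns the fixed newString, or null when it cannot decide.
type Corrector = (oldString: string, matchedText: string, newString: string) => string | null;

function resolveNewString(
  oldString: string,
  matchedText: string,
  newString: string,
  deterministic: Corrector,
  llm?: Corrector // undefined when no LLM client is configured
): EditOutcome {
  // 1. Try the cheap local correction first.
  const local = deterministic(oldString, matchedText, newString);
  if (local !== null) return { ok: true, newString: local, via: "deterministic" };

  // 2. Only ambiguous cases reach the LLM, and only if one is available.
  if (llm) {
    const remote = llm(oldString, matchedText, newString);
    if (remote !== null) return { ok: true, newString: remote, via: "llm" };
  }

  // 3. Never write the uncorrected newString; report the problem instead.
  return {
    ok: false,
    error: "Escaping in new_string is ambiguous; re-read the file and retry with exact escaping.",
  };
}
```

Returning an explicit failure rather than the original newString is the design point of the final commit: the caller sees the error and can re-read the file instead of the edit silently corrupting it.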
Contributor
Author
Deterministic correction algorithm
Contributor
Author
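I still have not figured out why it triggered before: it happened rarely and there are too many logs, so I cannot find the record. Perhaps overly complex LaTeX confused DS and LLM correction could not fix it either; or perhaps the conversation was interrupted unexpectedly at the time (unverified). Either way, this PR covers these cases.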
Contributor
Author
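I see now: LLM correction was added after I ran into this problem and before I fixed it. Still, I think this PR is valuable, with improvements in at least three respects: fixing residual errors, reducing LLM call overhead, and improving determinism and explainability.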
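Problem
When handling LaTeX, JSON, Unicode escapes and similar content, the LLM often over-counts or under-counts backslashes (for example writing \alpha as \\alpha), so the exact match fails and the edit falls into the loose_escape fallback path. That path does locate the correct match position, but it writes the LLM's original newString (which carries the same escaping errors) to the file verbatim. The result is silent data corruption: every such replacement introduces the wrong number of backslashes into the file.
Solution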
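- Add the deterministic correction fixNewStringEscaping(): by comparing the backslash count of each run in oldString with the corresponding run in the matched text, it derives a ratio per run and infers the correctly escaped form of newString. It covers uniform ratios (for example LaTeX commands) and zero ratios (an escaped quote matching an unescaped one), among other cases.
- Deterministic correction first, LLM as fallback, to save cost: the deterministic algorithm runs first. The LLM is called only when the escaping ratios are inconsistent (mixed ratios, where backslash runs at different positions in the same string are off by different factors). This avoids expensive and slow LLM calls.
- Report an explicit error instead of writing silently when correction is impossible: when deterministic correction fails and the LLM is unavailable, return a clear error message that guides the LLM to re-read the file or switch to the Bash tool.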
Test coverage (all 36 tests pass)
loose_escape, loose_escape, loose_escape, loose_escape, loose_escape, ok: false