fix(edit): Fix the loose_escape fallback path writing incorrect content when the LLM makes escaping errors #136

Open
fym998 wants to merge 6 commits into lessweb:main from fym998:fix/loose-escape-newstring-correction

fym998 commented May 30, 2026

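Problem

When handling LaTeX, JSON, Unicode escapes, and similar content, the LLM often writes too many or too few backslashes (for example, \alpha becomes \\alpha), so the exact match fails and the edit falls into the loose_escape fallback path. That path locates the correct match position, but it then writes the LLM's original newString, which carries the same escaping error, to the file. The result is silent data corruption: every such replacement introduces the wrong number of backslashes into the file.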
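To make the failure concrete, here is a minimal sketch of the fallback as described above. `buildLooseEscapeRegex` is a hypothetical stand-in for the project's loose_escape matcher (the real regex is not shown in this PR description); it assumes that every backslash run in oldString may match zero or more backslashes in the file.

```typescript
// Hypothetical stand-in for the loose_escape matcher: every backslash run in
// oldString may match any number of backslashes (including none) in the file.
function buildLooseEscapeRegex(oldString: string): RegExp {
  const source = oldString
    .split(/\\+/) // text segments between backslash runs
    .map((segment) => segment.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join("\\\\*"); // regex source \\* means "zero or more literal backslashes"
  return new RegExp(source);
}

const fileText = String.raw`Let \alpha be the angle.`;
const oldString = String.raw`\\alpha`; // the LLM doubled the backslash
const newString = String.raw`\\beta`; // same mistake in the replacement

const exactHit = fileText.includes(oldString); // exact match fails
const looseMatch = buildLooseEscapeRegex(oldString).exec(fileText);
const matchedText = looseMatch ? looseMatch[0] : ""; // the file's real text

// Writing newString verbatim reproduces the corruption described above:
// the file ends up with two backslashes before "beta".
const corrupted = fileText.replace(matchedText, () => newString);
console.log(exactHit, matchedText, corrupted);
```

The replacement is passed as a function so that `String.prototype.replace` does not interpret any `$` sequences in newString.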

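Solution

  1. Add fixNewStringEscaping() for deterministic correction: it compares the size of each backslash run in oldString with the corresponding run in the matched text, and uses the resulting ratios to derive the correctly escaped form of newString. It covers uniform ratios (for example, LaTeX commands) and zero ratios (an escaped quote matching an unescaped one), among other cases.

  2. Deterministic correction first, LLM as the fallback, to save cost: the deterministic algorithm runs first. The LLM is called only when the escape ratios are inconsistent (mixed ratios, where backslash runs at different positions in the same string are off by different factors). This avoids LLM calls, which are expensive and slow.

  3. Report an explicit error instead of writing silently when correction is impossible: when deterministic correction fails and no LLM is available, return a clear error message that steers the LLM to re-read the file or use the Bash tool instead.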

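Test coverage (all 36 tests pass)

| Scenario | Path |
| --- | --- |
| LaTeX command over-escaping (uniform ratio) | Deterministic correction → loose_escape |
| Unicode escape over-escaping (uniform ratio) | Deterministic correction → loose_escape |
| JSON string escaping | Deterministic correction → loose_escape |
| Escaped quote → unescaped quote (zero ratio) | Deterministic correction → loose_escape |
| Extra backslash runs in the new string (reuse last ratio) | Deterministic correction → loose_escape |
| LaTeX mixed escaping (mixed ratios) | Deterministic failure → LLM correction |
| JS Unicode mixed escaping | Deterministic failure → LLM correction |
| JSON mixed escaping | Deterministic failure → LLM correction |
| Mixed escaping without an LLM | Deterministic failure → returns an error (ok: false) |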

fym998 added 3 commits May 30, 2026 12:57
When the LLM miscounts backslash escapes (common in LaTeX, JSON
nesting), the exact match fails and the loose_escape regex recovers
the matched position.  However the fallback path was writing the
LLM's original newString verbatim, which carried the same escaping
errors as the old_string.  This silently corrupted files by
introducing doubled or missing backslashes on every replacement that
went through the loose_escape path without a successful
llm_escape_correction round-trip.

Add fixNewStringEscaping() which tokenizes strings into backslash runs
and text segments, aligns old_string with the regex-matched text to
compute per-run backslash count ratios, and applies the same ratios to
newString.  When newString has more backslash runs than old_string,
the last ratio is reused (the escaping error is typically uniform).

Also update replacementOldString to use the matched text so both sides
are consistent, matching the pattern already established by
tab_correction.
…back

Add two tests that exercise the loose_escape fallback path without an
LLM client (so correctEscapedStringsWithLLM is skipped), confirming
that the escaping correction is applied to newString:

- Over-escaped LaTeX commands: \\alpha → \alpha
- Over-escaped LaTeX accent: H\\\"{o}tel → H\"{o}tel
  (both backslash and quote doubled by LLM)

Both cases verify that the resulting file content uses correctly
escaped single-backslash LaTeX, not the LLM's original
multi-backslash new_string.

Allow loose_escape newString correction to handle cases where old_string escapes a character but the matched file text has no backslash, such as \" matching a literal quote. Keep the correction aligned with the loose_escape regex semantics and preserve reuse of the last ratio for extra new_string backslash runs.

Add regression coverage for quote escapes collapsing to zero backslashes and for extra new_string backslash runs reusing the last correction ratio.

fym998 commented May 30, 2026

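The bug's trigger conditions may be described inaccurately (the description was written by AI). In my own use it sometimes triggers and sometimes does not, so it is hard to reproduce. I will check again.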

fym998 marked this pull request as draft May 30, 2026 06:40

fym998 commented May 30, 2026

I just found an optimization: in many cases the local algorithm can replace LLM correction, which reduces cost and waiting time.

fym998 added 2 commits May 30, 2026 15:08
Run local newString escaping correction before invoking LLM correction for unique loose_escape matches. Only fall back to LLM correction when deterministic ratio inference is ambiguous, such as mixed escaping ratios with extra new_string slash runs.

Update edit handler tests to assert deterministic cases avoid LLM calls, and add LLM fallback coverage for mixed escaping in LaTeX, JS unicode escapes, and JSON strings.
…ring when escaping is ambiguous and LLM unavailable

When deterministic escape correction fails (mixed/inconsistent ratios) and no
LLM is available to disambiguate, return a clear error instead of silently
writing the uncorrected (potentially over-escaped) newString to the file.

This is safer because the LLM can then re-read the file with exact escaping
or fall back to the Bash tool.

fym998 commented May 30, 2026

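Deterministic correction algorithm: fixNewStringEscaping

It has three steps:

Step 1: collect escape ratios with collectLooseEscapeRatios(oldString, matchedText)

Walk oldString and matchedText in lockstep, using non-backslash characters as anchors:

oldString:   \\alpha + \"x"
matchedText:  \alpha + "x"

backslash run #1:  old=2, matched=1 → ratio = 1/2 = 0.5
backslash run #2:  old=1, matched=0 → ratio = 0/1 = 0
                   → ratios = [0.5, 0]

  • Non-backslash characters (the anchors) must match character for character; otherwise return null (structural mismatch).
  • A trailing backslash run must have the same length on both sides; otherwise return null.
  • Any leftover characters on either side also return null.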
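The rules above can be sketched as follows. This is a minimal illustration written from the description in this comment, not the PR's actual code: the helper `slashRunLength`, the handling of a zero-length old run, and the choice not to record a ratio for an equal trailing run are assumptions.

```typescript
// Sketch only: the real collectLooseEscapeRatios may differ in detail.

// Length of the backslash run starting at index `start` (0 if there is none).
function slashRunLength(s: string, start: number): number {
  let n = 0;
  while (s.charAt(start + n) === "\\") n++;
  return n;
}

function collectLooseEscapeRatios(oldString: string, matchedText: string): number[] | null {
  const ratios: number[] = [];
  let i = 0; // cursor in oldString
  let j = 0; // cursor in matchedText

  while (i < oldString.length || j < matchedText.length) {
    const oldRun = slashRunLength(oldString, i);
    const matchedRun = slashRunLength(matchedText, j);
    i += oldRun;
    j += matchedRun;

    const oldDone = i >= oldString.length;
    const matchedDone = j >= matchedText.length;
    if (oldDone || matchedDone) {
      if (oldDone !== matchedDone) return null; // leftover characters on one side
      if (oldRun !== matchedRun) return null; // trailing runs must be equal
      break; // assumption: an equal trailing run records no ratio
    }

    if (oldString.charAt(i) !== matchedText.charAt(j)) return null; // anchor mismatch
    if (oldRun > 0) {
      ratios.push(matchedRun / oldRun);
    } else if (matchedRun > 0) {
      return null; // assumption: no ratio can be formed from a zero-length old run
    }
    i++;
    j++;
  }
  return ratios;
}

console.log(collectLooseEscapeRatios(String.raw`\\alpha + \\beta`, String.raw`\alpha + \beta`));
console.log(collectLooseEscapeRatios(String.raw`\\alpha`, String.raw`\beta`));
```

The first call yields a uniform ratio list of 0.5 for both runs; the second returns null because the anchors `a` and `b` differ.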

Step 2: tokenize with tokenizeLooseEscaping(newString)

Split newString into alternating slash tokens (runs of consecutive backslashes) and text tokens (everything else):

newString:  \\beta + \"y" + \\gamma

→ [slash(2), text('beta + '), slash(1), text('"y" + '), slash(2), text('gamma')]
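A tokenizer of this shape can be sketched in a few lines. The `LooseToken` type and its field names (`kind`, `count`, `value`) are assumptions made for illustration; the PR only names the two token categories.

```typescript
// Assumed token shape; the PR only says tokens are either "slash" or "text".
type LooseToken =
  | { kind: "slash"; count: number } // a run of consecutive backslashes
  | { kind: "text"; value: string }; // everything between two runs

function tokenizeLooseEscaping(input: string): LooseToken[] {
  // Each match is either a whole backslash run or a whole run of other characters.
  const pieces: string[] = input.match(/\\+|[^\\]+/g) || [];
  return pieces.map((piece): LooseToken =>
    piece.charAt(0) === "\\"
      ? { kind: "slash", count: piece.length }
      : { kind: "text", value: piece }
  );
}

const tokens = tokenizeLooseEscaping(String.raw`\\delta + \\epsilon`);
console.log(tokens);
```

Because the two regex alternatives are complementary, concatenating the tokens back together always reproduces the input, which keeps the later correction step from touching anything except backslash counts.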

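Step 3: apply the ratios token by token

Walk the token sequence and scale each slash token by the ratio at the same index:

slash(2) × ratios[0]=0.5 → round(1) → \         (1 backslash)
slash(1) × ratios[1]=0   → round(0) → (empty)   (the quote escape drops to zero)
slash(2) × ratios[?]     → past the end of the ratio table...

At this point, check whether the ratios are uniform:
  [0.5, 0] → not uniform → canReuseLastRatio = false
  → return ok: false (hand off to the LLM)

If the ratios are uniform (for example, every LaTeX command was written with \\ where the file has \, so every ratio is 0.5), then canReuseLastRatio = true and any extra backslash runs reuse the last ratio:

oldString:  \\alpha + \\beta       ratios = [0.5, 0.5]  ← uniform
newString:  \\delta + \\epsilon + \\gamma

slash(2) × 0.5 → 1 backslash
slash(2) × 0.5 → 1 backslash
slash(2) × 0.5 → 1 backslash (reuses lastRatio)
→ \delta + \epsilon + \gamma  ✓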
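The ratio-application step can be sketched as below. `applyEscapeRatios` and `FixResult` are hypothetical names, and the function takes the ratio list as an argument instead of computing it, so the block stands on its own. Treating an empty ratio list as a failure is also an assumption.

```typescript
type FixResult = { ok: true; newString: string } | { ok: false };

// Hypothetical helper for step 3; `ratios` is the list produced by step 1.
function applyEscapeRatios(newString: string, ratios: number[]): FixResult {
  if (ratios.length === 0) return { ok: false }; // assumption: nothing to infer from
  const lastRatio = ratios[ratios.length - 1];
  const canReuseLastRatio = ratios.every((r) => r === lastRatio);

  let runIndex = 0;
  let ambiguous = false;
  // Rewrite every backslash run in place; text between runs is left untouched.
  const fixed = newString.replace(/\\+/g, (run) => {
    const known = runIndex < ratios.length;
    if (!known && !canReuseLastRatio) {
      ambiguous = true; // extra run with mixed ratios: cannot decide locally
      return run;
    }
    const ratio = known ? ratios[runIndex] : lastRatio;
    runIndex++;
    return "\\".repeat(Math.round(run.length * ratio));
  });
  return ambiguous ? { ok: false } : { ok: true, newString: fixed };
}

// Uniform ratios: the third run reuses the last ratio.
console.log(applyEscapeRatios(String.raw`\\delta + \\epsilon + \\gamma`, [0.5, 0.5]));
// Mixed ratios plus an extra run: reported as not ok.
console.log(applyEscapeRatios(String.raw`\\beta + \"y" + \\gamma`, [0.5, 0]));
```

Returning a tagged result instead of a possibly unchanged string lets the caller distinguish "corrected" from "could not decide", which is what allows the fallback to the LLM.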

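Decision chain

findLooseEscapeMatches finds a unique match (score=1)
        │
        ▼
fixNewStringEscaping() deterministic correction
   ├─ success (ok:true)  → write the corrected newString directly, matched_via="loose_escape"
   └─ failure (ok:false: mixed ratios or structural mismatch)
        │
        ▼
   correctEscapedStringsWithLLM() LLM-assisted correction
      ├─ success → matched_via="llm_escape_correction"
      └─ failure/unavailable → return an error (no silent write)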
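The chain above can be expressed as a small function. This is a sketch under stated assumptions: `resolveLooseEscapeEdit`, the two fixer signatures, and the error text are hypothetical, and the LLM step is kept synchronous to keep the example short (a real LLM call would be asynchronous).

```typescript
type Fix = { ok: true; newString: string } | { ok: false };
type EditOutcome =
  | { ok: true; newString: string; matched_via: "loose_escape" | "llm_escape_correction" }
  | { ok: false; error: string };

// Hypothetical signatures; the real functions take project-specific arguments.
type DeterministicFixer = (oldString: string, matchedText: string, newString: string) => Fix;
type LlmFixer = (oldString: string, matchedText: string, newString: string) => string | null;

function resolveLooseEscapeEdit(
  oldString: string,
  matchedText: string,
  newString: string,
  fixNewStringEscaping: DeterministicFixer,
  correctWithLlm?: LlmFixer // undefined when no LLM client is configured
): EditOutcome {
  // 1. Cheap, deterministic attempt first.
  const local = fixNewStringEscaping(oldString, matchedText, newString);
  if (local.ok) {
    return { ok: true, newString: local.newString, matched_via: "loose_escape" };
  }
  // 2. Only ambiguous cases reach the LLM.
  const corrected = correctWithLlm ? correctWithLlm(oldString, matchedText, newString) : null;
  if (corrected !== null) {
    return { ok: true, newString: corrected, matched_via: "llm_escape_correction" };
  }
  // 3. Never write an uncorrected newString silently.
  return {
    ok: false,
    error: "Ambiguous escaping: re-read the file for the exact text, or use the Bash tool.",
  };
}

const alwaysFails: DeterministicFixer = () => ({ ok: false });
const outcome = resolveLooseEscapeEdit("a", "a", "b", alwaysFails);
console.log(outcome.ok);
```

Passing the two fixers as parameters keeps the ordering logic testable without a real LLM client.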

fym998 changed the title from "fix(edit): Fix newString escaping not being corrected in the loose_escape fallback path" to "fix(edit): Fix the loose_escape fallback path writing incorrect content when the LLM makes escaping errors" May 30, 2026
fym998 marked this pull request as ready for review May 30, 2026 07:24

fym998 commented May 30, 2026

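> The bug's trigger conditions may be described inaccurately (the description was written by AI). In my own use it sometimes triggers and sometimes does not, so it is hard to reproduce. I will check again.

I still have not figured out why it triggered before: it happened rarely, and there are too many logs to find the occurrences. It may be that overly complex LaTeX was more than DS could handle and LLM correction could not fix it either, or that the conversation was interrupted unexpectedly at the time (unverified). Either way, this PR covers these cases.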


fym998 commented May 30, 2026

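I see now: LLM correction was added after I ran into this problem and before this fix. Still, I think this PR is valuable, with improvements on at least three fronts: fixing residual errors, reducing the cost of LLM calls, and improving determinism and explainability.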
