
Add consistency report and project glossary #82

Merged: jserv merged 1 commit into main from refine on May 6, 2026



@jserv jserv commented May 6, 2026

A real-world deployment study [1] reported three problems: mainland-Chinese terms slipping past the linter in published zh-TW articles; blockquote citation contexts producing ~50 false positives across a 72-article corpus; and ASCII quotes inside YAML frontmatter being auto-converted to 「」, which broke downstream parsers.

User-facing additions:

  • '--consistency' reports mixed regional usage of one concept (both 線程 and 執行緒 in the same document). Groups by the rule's "english" anchor; skips TM-suppressed terms.
  • '--exempt-blockquotes' (CLI + '[markdown]' config) excludes pulldown-cmark 'Tag::BlockQuote' ranges from scanning. Off by default: adopted blockquote prose is real content.
  • YAML frontmatter preserves ASCII '"' / ''' scalar delimiters. Body prose still converts to 「」.
  • '[glossary]' section in '.zhtw-mcp.toml': banned / preferred / proper_nouns lists. Banned terms inject synthetic Errors that TM cannot downgrade; proper_nouns suppress matching issues; both honor exclusion zones.
  • Per-rule 'editorial_confidence' (low / medium / high) flows through issue inflation into MCP explain output. Low forces auto_fix_safe = false and needs_review = true. 優化, 算法, 場景 tagged low because both regional forms are valid zh-TW.
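
The '--consistency' grouping in the first bullet can be pictured with a minimal sketch. `TermHit` and `mixed_usage` are hypothetical names for illustration and are not taken from the zhtw-mcp source. The sketch assumes every occurrence of a term, in either regional form, has already been tagged with its rule's "english" anchor.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical record of one term occurrence (not the project's real type).
#[derive(Debug, Clone)]
pub struct TermHit {
    pub term: String,        // surface form found in the document, e.g. 線程
    pub english: String,     // the rule's "english" anchor, e.g. "thread"
    pub tm_suppressed: bool, // true when TM has suppressed this term
}

/// Group hits by english anchor and keep only anchors that appear
/// under more than one distinct surface form in the same document.
pub fn mixed_usage(hits: &[TermHit]) -> BTreeMap<String, BTreeSet<String>> {
    let mut groups: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for hit in hits.iter().filter(|h| !h.tm_suppressed) {
        groups
            .entry(hit.english.clone())
            .or_default()
            .insert(hit.term.clone());
    }
    // A single form per concept is consistent; drop those groups.
    groups.retain(|_, terms| terms.len() > 1);
    groups
}
```

With hits for both 線程 and 執行緒 under the anchor "thread", the result would contain one group holding both forms. A document that uses only one of them would produce an empty report.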

Calque-audit refinements:

  • 消息 gains positional_clues; 好消息 / 壞消息 / 消息來源 no longer fire.
  • Symmetric 元資料 rule mirrors 元數據 — both use to: [] plus english: "metadata", surfacing the English original as the preferred form. 詮釋資料 and 後設資料 (NAER terminology bank) remain unflagged as acceptable zh-TW alternatives.
  • Real-world regression fixture pins the 14 documented blind-spot terms.
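
The effect of the 消息 refinement can be illustrated with a small sketch. This is an assumption about the observable behavior only: the real positional_clues schema is not shown in this PR, and `flag_bare_term` with its two blocklists is a hypothetical stand-in.

```rust
/// Return the byte offsets of `term` occurrences that are NOT part of an
/// exempt collocation. `blocked_before` lists text that exempts the match
/// when it directly precedes the term; `blocked_after` does the same for
/// text directly following it. (Hypothetical helper, for illustration.)
pub fn flag_bare_term(
    text: &str,
    term: &str,
    blocked_before: &[&str],
    blocked_after: &[&str],
) -> Vec<usize> {
    text.match_indices(term)
        .filter(|(start, matched)| {
            // match_indices yields offsets on char boundaries, so slicing is safe.
            let before = &text[..*start];
            let after = &text[*start + matched.len()..];
            !blocked_before.iter().any(|p| before.ends_with(*p))
                && !blocked_after.iter().any(|s| after.starts_with(*s))
        })
        .map(|(start, _)| start)
        .collect()
}
```

Calling it with `blocked_before = ["好", "壞"]` and `blocked_after = ["來源"]` would leave 好消息, 壞消息, and 消息來源 unflagged while still reporting other uses of 消息.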

[1] https://ai-muninn.com/zh-TW/blog/zhtw-mcp-calque-blindspot-sweep


Summary by cubic

Adds a document-wide terminology consistency report and a project glossary to enforce preferred terms. Reduces false positives in blockquote citations and preserves ASCII quotes in YAML frontmatter.

  • New Features

    • --consistency: groups by a rule’s english anchor to catch mixed regional terms in one doc; ignores TM‑suppressed issues.
    • Project glossary in .zhtw-mcp.toml ([glossary]): banned (always error), preferred (guides suggestions), proper_nouns (suppress); all honor exclusion zones.
    • Blockquote exemption: --exempt-blockquotes and [markdown].exempt_blockquotes = true exclude Markdown blockquotes from scanning (off by default).
    • Per‑rule editorial_confidence (low/medium/high): included in diagnostics; low forces auto_fix_safe = false and needs_review = true.
  • Bug Fixes

    • YAML frontmatter now preserves ASCII " and ' scalar delimiters; body prose still converts to 「」.
    • Calque rules: added positional clues for 消息 (no more false positives like 好消息/壞消息/消息來源); added a symmetric 元資料 rule mirroring 元數據 with english: "metadata" while keeping 詮釋資料/後設資料 acceptable.
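
The editorial_confidence behavior listed under New Features can be sketched as below. The field names auto_fix_safe and needs_review come from the PR text; the `Issue` struct, the enum, and `apply_confidence` are hypothetical and only illustrate the stated rule that low confidence forces a manual review.

```rust
/// Hypothetical mirror of the per-rule confidence levels (low / medium / high).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EditorialConfidence {
    Low,
    Medium,
    High,
}

/// Hypothetical, simplified issue record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Issue {
    pub found: String,
    pub auto_fix_safe: bool,
    pub needs_review: bool,
}

/// Low confidence overrides whatever the rule said about auto-fixing;
/// medium and high leave the issue unchanged.
pub fn apply_confidence(mut issue: Issue, confidence: EditorialConfidence) -> Issue {
    if confidence == EditorialConfidence::Low {
        issue.auto_fix_safe = false;
        issue.needs_review = true;
    }
    issue
}
```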

Written for commit 269c8dd. Summary will update on new commits.


@cubic-dev-ai cubic-dev-ai Bot left a comment


7 issues found across 29 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/rules/glossary.rs">

<violation number="1" location="src/rules/glossary.rs:86">
P1: Skipping synthetic banned issues when a regular issue already exists can let TM downgrade the only report, breaking the intended `banned > TM` precedence.</violation>
</file>

<file name="src/engine/consistency.rs">

<violation number="1" location="src/engine/consistency.rs:148">
P1: Do not bypass group matching when only one term group exists; it can select unrelated glossary terms and create false consistency diagnostics.</violation>
</file>

<file name="src/mcp/tools.rs">

<violation number="1" location="src/mcp/tools.rs:853">
P1: Fix-mode ordering lets TM downgrade glossary-banned synthetic errors, breaking the intended `banned > TM` precedence.</violation>

<violation number="2" location="src/mcp/tools.rs:1083">
P2: `tools/list` schema was not updated for the new `exempt_blockquotes`/`glossary`/`consistency` arguments, causing API contract drift.</violation>
</file>

<file name="src/main.rs">

<violation number="1" location="src/main.rs:869">
P2: The new `--exempt-blockquotes` mode is not represented in scan-cache keys, so cached results can be incorrect when toggling the flag.</violation>

<violation number="2" location="src/main.rs:1194">
P1: Cache-hit paths can drop source text needed by the new glossary/consistency features, causing missed findings.</violation>
</file>

<file name="tests/realworld_calques.rs">

<violation number="1" location="tests/realworld_calques.rs:62">
P2: Match on containment here so the collocation regression is caught even when the scanner reports the full phrase instead of the bare term.</violation>
</file>
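
One way to honor the `banned > TM` precedence named in the violations above is to mark glossary-banned issues and have the TM pass skip them. With the mark in place, the order of the two passes no longer matters. This is a sketch under assumptions, not the fix that was merged: `Issue`, `Severity`, `from_glossary_ban`, and `apply_tm` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Error,
}

/// Hypothetical, simplified issue record.
#[derive(Debug, Clone)]
pub struct Issue {
    pub term: String,
    pub severity: Severity,
    /// Set when the issue was injected (or promoted) by a glossary ban.
    pub from_glossary_ban: bool,
}

/// TM pass: downgrade issues for TM-approved terms, but never touch
/// an issue that carries the glossary-ban mark.
pub fn apply_tm(issues: &mut [Issue], tm_approved: &[&str]) {
    for issue in issues.iter_mut() {
        if issue.from_glossary_ban {
            continue; // banned > TM
        }
        if tm_approved.iter().any(|t| *t == issue.term.as_str()) {
            issue.severity = Severity::Info;
        }
    }
}
```

Under this design, when a regular issue already exists for a banned term, the glossary step would set the mark on that issue instead of skipping it. That would keep the only report from being downgraded.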


Comment thread src/rules/glossary.rs Outdated
Comment thread src/engine/consistency.rs Outdated
Comment thread src/mcp/tools.rs
Comment thread src/main.rs
Comment thread src/mcp/tools.rs
Comment thread src/main.rs
    if params.relaxed {
        cfg = cfg.with_relaxed();
    }
    if params.exempt_blockquotes {

P2: The new --exempt-blockquotes mode is not represented in scan-cache keys, so cached results can be incorrect when toggling the flag.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/main.rs, line 869:

<comment>The new `--exempt-blockquotes` mode is not represented in scan-cache keys, so cached results can be incorrect when toggling the flag.</comment>

<file context>
@@ -825,6 +866,9 @@ fn run_lint_batch(params: &LintBatchParams<'_>) -> Result<()> {
     if params.relaxed {
         cfg = cfg.with_relaxed();
     }
+    if params.exempt_blockquotes {
+        cfg = cfg.with_exempt_blockquotes(true);
+    }
</file context>
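
The cache-key concern raised in this thread can be illustrated with a minimal sketch. It assumes the scan cache is keyed on a hash of the file content plus the scan options; the project's actual key layout is not shown here, and `ScanOptions` and `cache_key` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical subset of the options that change scan results.
/// Any flag that alters output needs to be a field here, otherwise
/// toggling it can return stale cached results.
#[derive(Debug, Clone, Hash)]
pub struct ScanOptions {
    pub relaxed: bool,
    pub exempt_blockquotes: bool,
}

/// Derive a cache key from the content and every result-affecting option.
pub fn cache_key(content: &str, opts: &ScanOptions) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    opts.hash(&mut hasher);
    hasher.finish()
}
```

Deriving `Hash` on the options struct means a newly added field is hashed automatically, so a new flag cannot be left out of the key by accident.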

Comment thread tests/realworld_calques.rs Outdated
@jserv jserv merged commit cbddee7 into main May 6, 2026
4 checks passed
@jserv jserv deleted the refine branch May 6, 2026 09:22