Fix translation pipeline corrupting HTML tags and bold headings in non-English pages #43890
ff1451 wants to merge 1 commit into github:main
- Unescape entity-encoded HTML tags (`&lt;code&gt;` → `<code>`) in translated content when the same tag appears as raw HTML in the English source
- Remove bare code-fence wrapping from bold heading lines (`**...**`) that the translation pipeline incorrectly wraps in fenced code blocks
👋 Hey there spelunker. It looks like you've modified some files that we can't accept as contributions:
You'll need to revert all of the files you changed that match that list using GitHub Desktop or The complete list of files we can't accept are:
We also can't accept contributions to files in the content directory with frontmatter
Pull request overview
Fixes Crowdin-related corruption in non-English markdown by restoring intended inline HTML rendering and preventing bold “heading” lines from being mis-rendered as fenced code blocks.
Changes:
- Add a correction step to selectively unescape entity-encoded HTML tags in translations based on tag names present in the English source.
- Add a correction step to remove bare (no-language) fenced code blocks that wrap a single bold line.
- Add unit tests covering both behaviors and key non-regression cases.
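The second correction can be pictured as a line scan that tracks fence state, so that a closing fence, a bold line, and the next opening fence are not mistaken for a wrapper. The following is only a minimal sketch: `unwrapBoldLines` and its matching rules are hypothetical, and the PR's actual implementation is not shown in this conversation.

```typescript
// Hypothetical helper; the name and exact matching rules are assumptions.
const ANY_FENCE = /^\s*`{3}/ // opening or closing fence, with or without a language
const BARE_FENCE = /^\s*`{3}\s*$/ // fence with no language tag
const BOLD_LINE = /^\s*\*\*.+\*\*\s*$/ // a line that is entirely **bold**

function unwrapBoldLines(content: string): string {
  const lines = content.split('\n')
  const out: string[] = []
  let inFence = false
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i]
    // A wrapper is: bare fence, one bold line, bare fence, all outside any open block.
    const wrapsBoldLine =
      !inFence &&
      BARE_FENCE.test(line) &&
      i + 2 < lines.length &&
      BOLD_LINE.test(lines[i + 1]) &&
      BARE_FENCE.test(lines[i + 2])
    if (wrapsBoldLine) {
      out.push(lines[i + 1]) // keep the bold line, drop both fences
      i += 2
      continue
    }
    if (ANY_FENCE.test(line)) inFence = !inFence
    out.push(line)
  }
  return out.join('\n')
}
```

One trade-off of this sketch: a genuine bare code block whose only content is a bold line would also be unwrapped, so a real correction step would probably want to compare against the English source first.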
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/languages/lib/correct-translation-content.ts | Adds the new unescape + “bare fence around bold line” corrections in the translation-fix pipeline. |
| src/languages/tests/correct-translation-content.ts | Adds unit tests validating the new corrections and non-regression behavior. |
    // Only run when an English source exists and the translation contains encoded brackets.
    if (englishContent && content.includes('&lt;')) {
      // Tag names that appear as raw HTML in the English source.
      const englishTagNames = new Set(
        [...englishContent.matchAll(/<([a-z][a-z0-9]*)/gi)].map((m) => m[1].toLowerCase()),
      )
      if (englishTagNames.size > 0) {
        content = content.replace(
          /&lt;(\/?[a-z][a-z0-9]*)(\s[^<>]*?)?&gt;/gi,
          (match, tag: string, attrs = '') => {
            const baseName = tag.replace(/^\//, '').toLowerCase()
            return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
          },
        )
This unescapes and re-enables arbitrary attributes from translated content for any tag name found in the English source (including event-handler attributes like onload/onclick). Since the markdown pipeline allows dangerous HTML (rehype-raw) and the UI renders HTML via dangerouslySetInnerHTML, it would be safer to restrict which tags/attributes can be unescaped (for example, allowlist tags and allow only safe attributes such as href on <a>).
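One way to read this suggestion is as a per-tag allowlist applied while unescaping. The sketch below is illustrative only: `ALLOWED_TAGS`, `isSafeAttribute`, and `unescapeAllowlistedTags` are hypothetical names, the allowlist contents and the URL policy are assumptions, and it assumes attribute quotes arrive as literal `"` characters, as they do in the PR's test strings.

```typescript
// Hypothetical allowlist: tag name -> attribute names that may be restored.
const ALLOWED_TAGS = new Map<string, ReadonlySet<string>>([
  ['a', new Set(['href'])],
  ['code', new Set<string>()],
  ['td', new Set<string>()],
])

// An entity-encoded tag whose attributes all have the form name="value".
const ENCODED_TAG = /&lt;(\/?)([a-z][a-z0-9]*)((?:\s+[a-z][a-z0-9-]*="[^"]*")*)\s*&gt;/gi
const ATTRIBUTE = /([a-z][a-z0-9-]*)="([^"]*)"/gi

function isSafeAttribute(name: string, value: string, allowed: ReadonlySet<string>): boolean {
  const lower = name.toLowerCase()
  if (!allowed.has(lower)) return false // drops onclick, onload, style, ...
  // Assumed policy: restore only http(s), root-relative, or fragment links.
  return lower !== 'href' || /^(https?:\/\/|\/|#)/i.test(value)
}

function unescapeAllowlistedTags(content: string, englishTagNames: Set<string>): string {
  return content.replace(
    ENCODED_TAG,
    (match: string, slash: string, tag: string, attrs: string) => {
      const name = tag.toLowerCase()
      const allowed = ALLOWED_TAGS.get(name)
      // Leave the entity-encoded text alone unless the tag is both allowlisted
      // and present as raw HTML in the English source.
      if (!allowed || !englishTagNames.has(name)) return match
      const kept = [...attrs.matchAll(ATTRIBUTE)]
        .filter((m) => isSafeAttribute(m[1], m[2], allowed))
        .map((m) => ` ${m[0]}`)
        .join('')
      return `<${slash}${tag}${kept}>`
    },
  )
}
```

With this shape, a translated `&lt;a href="https://example.com" onclick="alert(1)"&gt;` would be restored as `<a href="https://example.com">`, while a tag such as `script` stays encoded even if the English source contains it. Dropping unsafe attributes rather than rejecting the whole tag keeps opening and closing tags balanced; whether that is the right policy is a design choice for the maintainers.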
    test('unescapes entity-encoded HTML tags when English source has matching raw HTML', () => {
      const english =
        '<td><code><a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md">ubuntu-latest</a></code></td>'

      expect(fix('&lt;code&gt;ubuntu-latest&lt;/code&gt;', 'ko', english)).toBe(
        '<code>ubuntu-latest</code>',
      )
      expect(
        fix('&lt;a href="https://example.com"&gt;ubuntu-latest&lt;/a&gt;', 'ko', english),
      ).toBe('<a href="https://example.com">ubuntu-latest</a>')
      expect(
        fix(
          '&lt;code&gt;&lt;a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md"&gt;ubuntu-latest&lt;/a&gt;&lt;/code&gt;',
          'ko',
          english,
        ),
      ).toBe(
        '<code><a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md">ubuntu-latest</a></code>',
      )
    })
The new HTML-unescape behavior has good basic coverage, but it doesn't exercise attributes that contain encoded `&lt;`/`&gt;` sequences (which can break naive tag-matching regexes). Adding a test like `&lt;a title="a &gt; b"&gt;...&lt;/a&gt;` would help prevent regressions and validate the intended parsing behavior.
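The case described here can be reproduced in isolation. The sketch below assumes the PR's pattern uses literal `&lt;` and `&gt;` entities as delimiters; `naiveUnescape` is a hypothetical stand-in for the correction step, not the project's `fix` helper.

```typescript
// Assumed form of the PR's pattern, with literal &lt; / &gt; delimiters.
const NAIVE_ENCODED_TAG = /&lt;(\/?[a-z][a-z0-9]*)(\s[^<>]*?)?&gt;/gi

function naiveUnescape(content: string, englishTagNames: Set<string>): string {
  return content.replace(NAIVE_ENCODED_TAG, (match: string, tag: string, attrs = '') => {
    const baseName = tag.replace(/^\//, '').toLowerCase()
    return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
  })
}

// The lazy attrs group ends at the first &gt;, which sits inside the title value.
const input = '&lt;a title="a &gt; b"&gt;x&lt;/a&gt;'
const output = naiveUnescape(input, new Set(['a']))
console.log(output) // <a title="a > b"&gt;x</a>
```

The opening tag's real terminator stays encoded, so the result is no longer a well-formed `<a>` element. A regression test built on this input would fail against the naive pattern and pass once attribute parsing respects quotes.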
    content = content.replace(
      /&lt;(\/?[a-z][a-z0-9]*)(\s[^<>]*?)?&gt;/gi,
      (match, tag: string, attrs = '') => {
        const baseName = tag.replace(/^\//, '').toLowerCase()
        return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
      },
The attrs capture in the entity-tag regex (`(\s[^<>]*?)?`) will stop at the first `&gt;` it sees. If an attribute value itself contains an encoded `&gt;`/`&lt;` (e.g. `title="a &gt; b"`), the match can terminate early and the replacement will corrupt the tag/content. Consider switching to an attribute pattern that respects quoted strings, or using an HTML parser to unescape tags safely.
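A quote-aware attribute pattern along the lines suggested might look like the following. This is a sketch under assumptions, not a vetted replacement: `QUOTE_AWARE_TAG` and `unescapeQuoteAware` are hypothetical names, and the pattern accepts only simple attribute names made of letters, digits, and hyphens.

```typescript
// Each attribute is a name, optionally followed by = and a value that is
// double-quoted, single-quoted, or unquoted. Quoted values may contain anything
// except their own quote character, so an encoded &gt; inside them no longer
// ends the tag. The last group keeps an optional self-closing slash.
const QUOTE_AWARE_TAG =
  /&lt;(\/?[a-z][a-z0-9]*)((?:\s+[a-z][a-z0-9-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'&<>]+))?)*)(\s*\/?)&gt;/gi

function unescapeQuoteAware(content: string, englishTagNames: Set<string>): string {
  return content.replace(
    QUOTE_AWARE_TAG,
    (match: string, tag: string, attrs: string, tail: string) => {
      const baseName = tag.replace(/^\//, '').toLowerCase()
      // Only the tag delimiters are unescaped; attribute values are copied verbatim.
      return englishTagNames.has(baseName) ? `<${tag}${attrs}${tail}>` : match
    },
  )
}
```

For `&lt;a title="a &gt; b"&gt;x&lt;/a&gt;` with `a` in the English tag set, this yields `<a title="a &gt; b">x</a>`: the entity inside the attribute value is left encoded, which is still valid HTML. An HTML parser remains the more robust option if the input can contain malformed or nested markup.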
@ff1451 Thanks for trying to fix this. Unfortunately, this isn't a file that open source contributors can change, and the docs team isn't actually responsible for translations. That's handled by an outside team, and we can't make changes to their files. I notice that you opened an issue, though, so I'm going to go look at that and see if it's something I can submit as a bug report.
Why:
Closes: #43888
What's being changed:
Two rendering bugs, caused by the Crowdin translation pipeline corrupting content, affect all non-English translations. This PR addresses them in `src/languages/lib/correct-translation-content.ts`.
Bug 1: Raw HTML tags visible in tables
Inline HTML elements such as `<code><a href="...">ubuntu-latest</a></code>` inside table `<td>` cells are displayed as literal text instead of being rendered as styled code elements. Crowdin entity-encodes angle brackets, turning `<code>` into `&lt;code&gt;`, which the correction pipeline was not unescaping.
Bug 2: Bold headings rendered inside code blocks
Bold heading lines (e.g. `**Needed for downloading actions:**`) between code blocks are wrapped in bare code fences by the translation pipeline, causing them to render as code blocks instead of formatted headings.
Fix:
- Unescape entity-encoded HTML tags (`&lt;code&gt;` → `<code>`) when the same tag name appears as raw HTML in the English source, to avoid incorrectly expanding intentional `&lt;` entities.
- Remove bare code-fence wrapping from bold heading lines (`**...**`).