Fix translation pipeline corrupting HTML tags and bold headings in non-English pages #43890

Closed
ff1451 wants to merge 1 commit into github:main from ff1451:fix/translation-html-code-block-rendering

ff1451 commented Apr 20, 2026

Why:

Closes: #43888

What's being changed:

Two rendering bugs affect all non-English translations. Both stem from content corrupted by the Crowdin translation pipeline that the correction step in src/languages/lib/correct-translation-content.ts does not repair.

Bug 1: Raw HTML tags visible in tables

Inline HTML elements such as <code><a href="...">ubuntu-latest</a></code> inside table <td> cells are displayed as literal text instead of being rendered as styled code elements. Crowdin entity-encodes angle brackets, turning <code> into &lt;code&gt;, which the correction pipeline was not unescaping.

Bug 2: Bold headings rendered inside code blocks

Bold heading lines (e.g. **Needed for downloading actions:**) between code blocks are wrapped in bare code fences by the translation pipeline, causing them to render as code blocks instead of formatted headings.

Fix:

  1. Unescape entity-encoded HTML tags (&lt;code&gt; → <code>) when the same tag name appears as raw HTML in the English source, to avoid incorrectly expanding intentional &lt; entities.
  2. Strip bare code-fence wrapping from lines that contain only a bold heading (**...**).
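
A minimal sketch of how fix 2 could work, assuming a line-by-line scan that tracks whether the current line sits inside a fenced block. The function name `stripBareFenceAroundBoldLine` and its regexes are hypothetical and are not taken from the PR diff.

```typescript
// Hypothetical helper for illustration; not the code from this PR.
const BARE_FENCE = /^\s*`{3}\s*$/ // a fence line with no language tag
const ANY_FENCE = /^\s*`{3}/ // any fence line, with or without a language tag
const BOLD_ONLY = /^\s*\*\*[^*]+\*\*\s*$/ // a line holding only **...**

function stripBareFenceAroundBoldLine(content: string): string {
  const lines = content.split('\n')
  const out: string[] = []
  let inFence = false
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? ''
    const next = lines[i + 1] ?? ''
    const afterNext = lines[i + 2] ?? ''
    // Outside any code block: bare fence, bold-only line, bare fence.
    if (!inFence && BARE_FENCE.test(line) && BOLD_ONLY.test(next) && BARE_FENCE.test(afterNext)) {
      out.push(next) // keep the bold line, drop both fence lines
      i += 2
      continue
    }
    if (ANY_FENCE.test(line)) inFence = !inFence
    out.push(line)
  }
  return out.join('\n')
}
```

Tracking the fence state keeps the scan from mistaking the closing fence of a real code block for the start of a wrapper around the bold line that follows it.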

Check off the following:

  • A subject matter expert (SME) has reviewed the technical accuracy of the content in this PR. In most cases, the author can be the SME. Open source contributions may require an SME review from GitHub staff.
  • The changes in this PR meet the docs fundamentals that are required for all content.
  • All CI checks are passing and the changes look good in the review environment.

- Unescape entity-encoded HTML tags (&lt;code&gt; → <code>) in translated
  content when the same tag appears as raw HTML in the English source
- Remove bare code-fence wrapping from bold heading lines (**...**) that
  the translation pipeline incorrectly wraps in fenced code blocks
Copilot AI review requested due to automatic review settings April 20, 2026 17:27
@github-actions
Contributor

👋 Hey there spelunker. It looks like you've modified some files that we can't accept as contributions:

  • src/languages/lib/correct-translation-content.ts
  • src/languages/tests/correct-translation-content.ts

You'll need to revert all of the files you changed that match that list using GitHub Desktop or git checkout origin/main <file name>. Once you get those files reverted, we can continue with the review process. :octocat:

The complete list of files we can't accept are:

  • .devcontainer/**
  • .github/**
  • data/reusables/rai/**
  • Dockerfile*
  • src/**
  • package*.json
  • content/actions/how-tos/security-for-github-actions/security-hardening-your-deployments/**

We also can't accept contributions to files in the content directory with frontmatter contentType: rai.

@github-actions
Contributor

How to review these changes 👓

Thank you for your contribution. To review these changes, choose one of the following options:

A Hubber will need to deploy your changes internally to review.

Table of review links

Note: Please update the URL for your staging server or codespace.

This pull request contains code changes, so we will not generate a table of review links.

🤖 This comment is automatically generated.

github-actions bot added the triage label (Do not begin working on this issue until triaged by the team) Apr 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes Crowdin-related corruption in non-English markdown by restoring intended inline HTML rendering and preventing bold “heading” lines from being mis-rendered as fenced code blocks.

Changes:

  • Add a correction step to selectively unescape entity-encoded HTML tags in translations based on tag names present in the English source.
  • Add a correction step to remove bare (no-language) fenced code blocks that wrap a single bold line.
  • Add unit tests covering both behaviors and key non-regression cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/languages/lib/correct-translation-content.ts | Adds the new unescape + “bare fence around bold line” corrections in the translation-fix pipeline. |
| src/languages/tests/correct-translation-content.ts | Adds unit tests validating the new corrections and non-regression behavior. |

Comment on lines +894 to +905
if (englishContent && content.includes('&lt;')) {
  const englishTagNames = new Set(
    [...englishContent.matchAll(/<([a-z][a-z0-9]*)/gi)].map((m) => m[1].toLowerCase()),
  )
  if (englishTagNames.size > 0) {
    content = content.replace(
      /&lt;(\/?[a-z][a-z0-9]*)(\s[^<>]*?)?&gt;/gi,
      (match, tag: string, attrs = '') => {
        const baseName = tag.replace(/^\//, '').toLowerCase()
        return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
      },
    )

Copilot AI Apr 20, 2026

This unescapes and re-enables arbitrary attributes from translated content for any tag name found in the English source (including event-handler attributes like onload/onclick). Since the markdown pipeline allows dangerous HTML (rehype-raw) and the UI renders HTML via dangerouslySetInnerHTML, it would be safer to restrict which tags/attributes can be unescaped (for example, allowlist tags and allow only safe attributes such as href on <a>).
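
One way the suggested restriction could look, as a rough sketch only. The allowlist contents, the `isSafeToUnescape` name, and the `href` scheme check are all assumptions made for illustration; the PR defines none of them.

```typescript
// Illustrative allowlist; the tags and attributes listed here are assumptions,
// not a policy defined in this PR.
const SAFE_ATTRS = new Map<string, ReadonlySet<string>>([
  ['code', new Set<string>()],
  ['a', new Set<string>(['href'])],
])

// tagName is the bare tag name (no leading slash); attrs is the raw attribute text.
function isSafeToUnescape(tagName: string, attrs: string): boolean {
  const allowed = SAFE_ATTRS.get(tagName.toLowerCase())
  if (allowed === undefined) return false
  // Only double-quoted name="value" pairs are recognized; anything else is rejected.
  const attrPattern = /\s+([a-z][a-z0-9-]*)\s*=\s*"([^"]*)"/gi
  let consumed = 0
  let m: RegExpExecArray | null
  while ((m = attrPattern.exec(attrs)) !== null) {
    if (m.index !== consumed) return false // unparsed text between attributes
    consumed = m.index + m[0].length
    const name = (m[1] ?? '').toLowerCase()
    const value = m[2] ?? ''
    if (!allowed.has(name)) return false // e.g. onclick, onload
    if (name === 'href' && !/^(https?:\/\/|\/|#)/i.test(value)) return false
  }
  // Reject if anything is left over that did not parse as an allowed attribute.
  return attrs.slice(consumed).trim() === ''
}
```

In the replacement callback, the tag would then be unescaped only when `englishTagNames.has(baseName)` and `isSafeToUnescape(baseName, attrs)` both hold, and left entity-encoded otherwise.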

Comment on lines +1359 to +1378
test('unescapes entity-encoded HTML tags when English source has matching raw HTML', () => {
  const english =
    '<td><code><a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md">ubuntu-latest</a></code></td>'

  expect(fix('&lt;code&gt;ubuntu-latest&lt;/code&gt;', 'ko', english)).toBe(
    '<code>ubuntu-latest</code>',
  )
  expect(
    fix('&lt;a href="https://example.com"&gt;ubuntu-latest&lt;/a&gt;', 'ko', english),
  ).toBe('<a href="https://example.com">ubuntu-latest</a>')
  expect(
    fix(
      '&lt;code&gt;&lt;a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md"&gt;ubuntu-latest&lt;/a&gt;&lt;/code&gt;',
      'ko',
      english,
    ),
  ).toBe(
    '<code><a href="https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2404-Readme.md">ubuntu-latest</a></code>',
  )
})

Copilot AI Apr 20, 2026

The new HTML-unescape behavior has good basic coverage, but it doesn't exercise attributes that contain encoded &lt;/&gt; sequences (which can break naive tag-matching regexes). Adding a test like &lt;a title="a &gt; b"&gt;...&lt;/a&gt; would help prevent regressions and validate the intended parsing behavior.

Comment on lines +899 to +904
content = content.replace(
  /&lt;(\/?[a-z][a-z0-9]*)(\s[^<>]*?)?&gt;/gi,
  (match, tag: string, attrs = '') => {
    const baseName = tag.replace(/^\//, '').toLowerCase()
    return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
  },

Copilot AI Apr 20, 2026

The attrs capture in the entity-tag regex ((\s[^<>]*?)?) will stop at the first &gt; it sees. If an attribute value itself contains an encoded &gt;/&lt; (e.g. title="a &gt; b"), the match can terminate early and the replacement will corrupt the tag/content. Consider switching to an attribute pattern that respects quoted strings, or using an HTML parser to unescape tags safely.
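
A sketch of the quote-aware option, assuming attributes are either bare names or `name="value"` / `name='value'` pairs. The names `QUOTE_AWARE_TAG` and `unescapeKnownTags` are hypothetical; this is one possible pattern, not the one the PR uses.

```typescript
// Hypothetical pattern: each attribute value is consumed as a whole quoted string,
// so an encoded &gt; inside a value cannot end the tag match early.
const QUOTE_AWARE_TAG =
  /&lt;(\/?[a-z][a-z0-9]*)((?:\s+[a-z][a-z0-9-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*)\s*&gt;/gi

function unescapeKnownTags(content: string, englishTagNames: ReadonlySet<string>): string {
  return content.replace(QUOTE_AWARE_TAG, (match: string, tag: string, attrs: string) => {
    const baseName = tag.replace(/^\//, '').toLowerCase()
    // Only restore tags whose name also appears as raw HTML in the English source.
    return englishTagNames.has(baseName) ? `<${tag}${attrs}>` : match
  })
}
```

Entities inside a quoted value are left encoded, so only the delimiters of the tag itself are unescaped. An HTML parser would still be the more robust choice for input that does not fit this attribute shape.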

@Sharra-writes
Contributor

@ff1451 Thanks for trying to fix this. Unfortunately, this isn't a file that open source contributors can change, and the docs team isn't actually responsible for translations. That's handled by an outside team, and we can't make changes to their files. I notice that you opened an issue, though, so I'm going to go look at that and see if it's something I can submit as a bug report.

Labels

triage Do not begin working on this issue until triaged by the team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Non-English translations render escaped HTML as text and wrap bold headings in code blocks

3 participants