Detect modern HTML5 <meta charset> in attached Javadoc by chagong · Pull Request #5087 · eclipse-jdt/eclipse.jdt.core

chagong · 2026-05-29T06:33:48Z

What this fixes

Reported downstream in microsoft/vscode-java-pack#1429: when a project's
default file encoding does not match the encoding of an attached Javadoc
JAR, non-ASCII Javadoc text is shown as mojibake in hovers and the Javadoc
view.

The root cause is in JavaElement#getURLContents (the helper used by
BinaryType#getJavadocContents to fetch attached Javadoc HTML over
jar: / http(s): URLs). It only recognized the legacy XHTML form

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

via a hand-rolled byte-level scan (META_START, META_END, CHARSET,
CHARSET_HTML5, getIndexOf, isSameCharacter). Modern Javadoc
generators emit the HTML5 form

<meta charset="UTF-8">

which was never matched, so the bytes were decoded with the project's
default platform encoding. For an ISO-8859-1 project consuming UTF-8
Javadoc this produced garbled output.
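The failure mode can be reproduced outside JDT in a few lines. This is only an illustrative sketch: the class and method names are hypothetical and are not part of the PR. It shows what happens when UTF-8 bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes the bytes as ISO-8859-1 (the mismatch described above). */
    static String decodeWrongly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the wrong decoding; ISO-8859-1 maps every byte to one char, so no bytes are lost. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, two bytes in UTF-8
        String garbled = decodeWrongly(original);
        System.out.println(garbled.length());                  // 2: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

Each byte of a multi-byte UTF-8 sequence becomes its own character, which is why the hover text looks garbled rather than merely missing accents.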

Change

Replace the byte-array scan in JavaElement with a small regex that:

- Decodes the first 4 KiB of the response as UTF-8 (ASCII-compatible
  for the header bytes we care about, and Javadoc <head> is always
  well within that budget).
- Iterates <meta ...> tags via META_TAG_PATTERN.
- Extracts the first charset=... value via META_CHARSET_PATTERN,
  which uses a negative lookbehind (?<![A-Za-z0-9_-]) so it does
  not match unrelated attributes like data-charset or
  x-charset.

This handles both the legacy http-equiv form and the modern HTML5
<meta charset="…"> form with a single code path, and is more robust
against attribute order, single vs. double quotes, and whitespace.
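The steps listed under Change can be sketched roughly as follows. This is a standalone approximation, not the code from the PR: the two patterns, the class name, and the `detectCharset` method are assumptions made for illustration, and the actual META_TAG_PATTERN and META_CHARSET_PATTERN in JavaElement may be written differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSketch {

    // Hypothetical stand-ins for META_TAG_PATTERN and META_CHARSET_PATTERN.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*+>", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?<![A-Za-z0-9_-])charset\\s*+=\\s*+[\"']?+([A-Za-z0-9._:-]++)",
            Pattern.CASE_INSENSITIVE);

    private static final int PREFIX_LIMIT = 4096; // only the first 4 KiB is inspected

    /** Returns the first charset name declared in a meta tag, or null if none is found. */
    static String detectCharset(byte[] html) {
        int length = Math.min(html.length, PREFIX_LIMIT);
        // Tag and attribute names are ASCII, so a UTF-8 decode is sufficient to find them.
        String head = new String(html, 0, length, StandardCharsets.UTF_8);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher charset = META_CHARSET.matcher(tag.group());
            if (charset.find()) {
                return charset.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<head><meta name=\"x\" data-charset=\"ISO-8859-1\">"
                + "<meta charset=\"UTF-8\"></head>";
        System.out.println(detectCharset(html.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In this sketch the lookbehind rejects `data-charset` because the character before `charset` is a hyphen, so the loop moves on to the next meta tag. A caller would still need to validate the returned name (for example with `Charset.isSupported`) before decoding with it.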

Test

Added AttachedJavadocTests#testAttachedJavadocWithModernMetaCharset,
which:

- Forces the project's default encoding to ISO-8859-1 (and restores it).
- Attaches a new UTF8doc3 Javadoc directory whose HTML uses
  <meta charset="UTF-8"> plus an adversarial
  <meta name="x" data-charset="ISO-8859-1"> to verify the
  lookbehind.
- Calls IType#getAttachedJavadoc(...) and asserts the returned text
  equals the UTF-8 reference file (containing non-ASCII characters,
  e.g. こんにちは世界).

Local run of the full AttachedJavadocTests suite under Tycho against
the modified core: Tests run: 46, Failures: 0, Errors: 0.

Notes

- Signed off per the Eclipse Contributor Agreement.
- No public API changes; only the package-private JavaElement helper
  and its private constants are touched.

Copilot

Pull request overview

Fixes incorrect decoding of attached Javadoc HTML when the project default encoding differs from the Javadoc content by improving charset detection from <meta ... charset=...> in JavaElement#getURLContents.

Changes:

Replace the legacy byte-array scan for charset detection with a regex-based <meta> parsing approach using a UTF-8 decoded prefix.
Add a new attached-Javadoc test that forces the project encoding to ISO-8859-1 and validates UTF-8 Javadoc is decoded correctly.
Add new UTF-8 Javadoc fixture files under UTF8doc3.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
org.eclipse.jdt.core/model/org/eclipse/jdt/internal/core/JavaElement.java	Implements regex-based meta charset extraction from the first 4 KiB of HTML to choose the correct decoding.
org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java	Updates an existing test to restore project encoding and adds a new regression test for charset detection.
org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.html	Adds a new Javadoc HTML fixture intended to exercise charset detection.
org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.txt	Adds the expected extracted Javadoc text (includes non-ASCII content).


chagong · 2026-06-08T01:31:05Z

@iloveeclipse can you take a look?

iloveeclipse · 2026-06-09T04:42:22Z

can you take a look?

Looking only at the commit list, I see two merge commits. Please don't use merge; it is not allowed in Eclipse repositories. Use rebase instead.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

The attached-Javadoc reader in JavaElement#getURLContents only recognized the legacy XHTML form <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> via a hand-rolled byte-level scan. Modern Javadoc (HTML5) emits <meta charset="UTF-8"> which was never matched, so the contents were decoded using the project's default platform encoding. When the project encoding differed from the Javadoc encoding (e.g. ISO-8859-1 project consuming a UTF-8 Javadoc JAR), non-ASCII characters were rendered as mojibake in hovers and the Javadoc view.

Replace the byte-array scan with a small regex that decodes the first 4 KiB of the HTML head as UTF-8 (which is ASCII-compatible for the header bytes we care about) and extracts the first 'charset=' value from any <meta> tag. A negative lookbehind on [A-Za-z0-9_-] guards against unrelated attributes such as data-charset.

Add testAttachedJavadocWithModernMetaCharset, which forces the project default encoding to ISO-8859-1 and verifies that a UTF-8 Javadoc carrying both <meta charset="UTF-8"> and an adversarial data-charset attribute is decoded as UTF-8.

Reported in microsoft/vscode-java-pack#1429.

Signed-off-by: Changyong Gong <chagon@microsoft.com>

…and typo

- Use possessive quantifiers in META_TAG_PATTERN and META_CHARSET_PATTERN to eliminate polynomial backtracking flagged by CodeQL
- Wrap charset reset in try/finally so classpath restoration always executes even if setDefaultCharset throws
- Fix remaining 'Shouldhave' typo in testBug426058

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
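For readers unfamiliar with possessive quantifiers, the sketch below shows how they differ from greedy ones in `java.util.regex`. The patterns and names here are illustrative assumptions and are not copied from the PR. A possessive quantifier such as `*+` never gives back characters it has consumed, so the engine has no shorter alternatives to retry when the rest of the pattern fails.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {

    /** Greedy: a* first takes every 'a', then backtracks one step so the final 'a' can match. */
    static boolean greedyMatches(String input) {
        return Pattern.matches("a*a", input);
    }

    /** Possessive: a*+ keeps every 'a' it consumed, so the trailing 'a' can never match. */
    static boolean possessiveMatches(String input) {
        return Pattern.matches("a*+a", input);
    }

    /** A meta-tag pattern in the same spirit; [^>] cannot consume '>', so no match is lost. */
    static boolean tagMatches(String input) {
        return Pattern.matches("<meta\\b[^>]*+>", input);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));                     // true
        System.out.println(possessiveMatches("aaa"));                 // false
        System.out.println(tagMatches("<meta charset=\"UTF-8\">"));   // true
    }
}
```

Because `[^>]` can never consume the closing `>`, making that quantifier possessive does not change which tags match; it only removes backtracking attempts that could not succeed.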

iloveeclipse requested a review from Copilot June 2, 2026 05:11

Copilot started reviewing on behalf of iloveeclipse June 2, 2026 05:11

Copilot AI reviewed Jun 2, 2026

iloveeclipse requested a review from Copilot June 9, 2026 04:40

Copilot started reviewing on behalf of iloveeclipse June 9, 2026 04:40

Copilot AI reviewed Jun 9, 2026

Comment thread org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java

Comment thread org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java

github-advanced-security AI found potential problems Jun 9, 2026


Comment thread org.eclipse.jdt.core/model/org/eclipse/jdt/internal/core/JavaElement.java Fixed

Comment thread org.eclipse.jdt.core/model/org/eclipse/jdt/internal/core/JavaElement.java

chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch 2 times, most recently from 033c543 to f66390b, June 11, 2026 06:21

iloveeclipse requested a review from Copilot June 11, 2026 06:45

Copilot started reviewing on behalf of iloveeclipse June 11, 2026 06:45

Copilot AI reviewed Jun 11, 2026

Comment thread org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java

chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch 2 times, most recently from e44a340 to 85ee630, June 17, 2026 05:54

chagong and others added 3 commits June 17, 2026 14:37

test: cover HTML5 meta charset and address review nits

f2bd340

chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch from 85ee630 to 84c65d4, June 17, 2026 06:38

chagong wants to merge 3 commits into eclipse-jdt:master from chagong:fix/attached-javadoc-modern-meta-charset
