Detect modern HTML5 <meta charset> in attached Javadoc#5087

Open
chagong wants to merge 3 commits into eclipse-jdt:master from chagong:fix/attached-javadoc-modern-meta-charset

chagong commented May 29, 2026

What this fixes

Reported downstream in microsoft/vscode-java-pack#1429: when a project's
default file encoding does not match the encoding of an attached Javadoc
JAR, non-ASCII Javadoc text is shown as mojibake in hovers and the Javadoc
view.

The root cause is in JavaElement#getURLContents (the helper used by
BinaryType#getJavadocContents to fetch attached Javadoc HTML over
jar: / http(s): URLs). It only recognized the legacy XHTML form

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

via a hand-rolled byte-level scan (META_START, META_END, CHARSET,
CHARSET_HTML5, getIndexOf, isSameCharacter). Modern Javadoc
generators emit the HTML5 form

<meta charset="UTF-8">

which was never matched, so the bytes were decoded with the project's
default platform encoding. For an ISO-8859-1 project consuming UTF-8
Javadoc this produced garbled output.
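
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not code from the PR: the class name `MojibakeSketch` and the helper `roundTrip` are hypothetical, and the sample word is chosen only because its UTF-8 encoding contains a two-byte sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    /** Encodes text as UTF-8, then decodes those bytes with the given charset. */
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"

        // Decoding with the matching charset returns the original text.
        String correct = roundTrip(original, StandardCharsets.UTF_8);

        // Decoding with ISO-8859-1 maps each UTF-8 byte to one character,
        // so the two bytes of "é" (0xC3 0xA9) become "Ã©".
        String garbled = roundTrip(original, StandardCharsets.ISO_8859_1);

        System.out.println(correct.equals(original));           // true
        System.out.println(garbled.equals("caf\u00c3\u00a9"));  // true
    }
}
```

Every non-ASCII character in a UTF-8 page is affected this way, which is why ASCII-only Javadoc would not show the problem.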

Change

Replace the byte-array scan in JavaElement with a small regex-based scan that:

  1. Decodes the first 4 KiB of the response as UTF-8 (ASCII-compatible
    for the header bytes we care about, and Javadoc <head> is always
    well within that budget).
  2. Iterates <meta ...> tags via META_TAG_PATTERN.
  3. Extracts the first charset=... value via META_CHARSET_PATTERN,
    which uses a negative lookbehind (?<![A-Za-z0-9_-]) so it does
    not match unrelated attributes like data-charset or
    x-charset.

This handles both the legacy http-equiv form and the modern HTML5
<meta charset="…"> form with a single code path, and is more robust
against attribute order, single vs. double quotes, and whitespace.
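
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: only the names META_TAG_PATTERN and META_CHARSET_PATTERN and the lookbehind `(?<![A-Za-z0-9_-])` come from the description, while the rest of each pattern body, the class name `MetaCharsetSketch`, the constant `PREFIX_LIMIT`, and the method `detect` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Assumed pattern bodies; the PR's actual regexes are not shown in this thread.
    static final Pattern META_TAG_PATTERN =
            Pattern.compile("<meta\\b[^>]*+>", Pattern.CASE_INSENSITIVE);
    static final Pattern META_CHARSET_PATTERN = Pattern.compile(
            "(?<![A-Za-z0-9_-])charset\\s*+=\\s*+[\"']?+([A-Za-z0-9._:-]++)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical name for the 4 KiB budget mentioned in the description.
    static final int PREFIX_LIMIT = 4096;

    /** Returns the charset declared in the first matching meta tag, or the fallback. */
    static Charset detect(byte[] contents, Charset fallback) {
        // Step 1: decode only a bounded prefix as UTF-8.
        int length = Math.min(contents.length, PREFIX_LIMIT);
        String head = new String(contents, 0, length, StandardCharsets.UTF_8);

        // Step 2: iterate over the meta tags in that prefix.
        Matcher tag = META_TAG_PATTERN.matcher(head);
        while (tag.find()) {
            // Step 3: take the first charset=... value; data-charset is skipped
            // because '-' directly before "charset" fails the lookbehind.
            Matcher charset = META_CHARSET_PATTERN.matcher(tag.group());
            if (charset.find()) {
                try {
                    return Charset.forName(charset.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

Because the charset attribute is searched inside each tag's text, the same code path covers `content="text/html; charset=UTF-8"` and `charset="UTF-8"`. How the real implementation treats an unknown charset name is not stated in the description; falling back is an assumption of this sketch.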

Test

Added AttachedJavadocTests#testAttachedJavadocWithModernMetaCharset,
which:

  • Forces the project's default encoding to ISO-8859-1 (and restores it).
  • Attaches a new UTF8doc3 Javadoc directory whose HTML uses
    <meta charset="UTF-8"> plus an adversarial
    <meta name="x" data-charset="ISO-8859-1"> to verify the
    lookbehind.
  • Calls IType#getAttachedJavadoc(...) and asserts the returned text
    equals the UTF-8 reference file (containing non-ASCII characters,
    e.g. こんにちは世界).

Local run of the full AttachedJavadocTests suite under Tycho against
the modified core: Tests run: 46, Failures: 0, Errors: 0.
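
The "forces the encoding and restores it" step follows a common save, set, and restore shape. The sketch below shows that shape only: `FakeProject` is a hypothetical stand-in and not an Eclipse type, and `withCharset` is an illustrative helper, so none of this reflects the actual test code.

```java
import java.util.function.Supplier;

final class EncodingRestoreSketch {
    /** Hypothetical stand-in for a project with a default charset setting. */
    static final class FakeProject {
        private String defaultCharset = "UTF-8";

        String getDefaultCharset() {
            return defaultCharset;
        }

        void setDefaultCharset(String charset) {
            defaultCharset = charset;
        }
    }

    /** Runs body with a temporary default charset and always restores the old one. */
    static String withCharset(FakeProject project, String temporary, Supplier<String> body) {
        String previous = project.getDefaultCharset();
        project.setDefaultCharset(temporary);
        try {
            return body.get();
        } finally {
            // Runs on normal return and on exception, so a failing
            // assertion cannot leak the temporary encoding to later tests.
            project.setDefaultCharset(previous);
        }
    }
}
```

Restoring in `finally` matters in a shared test workspace, where a leaked ISO-8859-1 setting could change the outcome of unrelated tests.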

Notes

  • Signed off per the Eclipse Contributor Agreement.
  • No public API changes; only the package-private JavaElement helper
    and its private constants are touched.

Copilot AI left a comment

Pull request overview

Fixes incorrect decoding of attached Javadoc HTML when the project default encoding differs from the Javadoc content by improving charset detection from <meta ... charset=...> in JavaElement#getURLContents.

Changes:

  • Replace the legacy byte-array scan for charset detection with a regex-based <meta> parsing approach using a UTF-8 decoded prefix.
  • Add a new attached-Javadoc test that forces the project encoding to ISO-8859-1 and validates UTF-8 Javadoc is decoded correctly.
  • Add new UTF-8 Javadoc fixture files under UTF8doc3.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| org.eclipse.jdt.core/model/org/eclipse/jdt/internal/core/JavaElement.java | Implements regex-based meta charset extraction from the first 4 KiB of HTML to choose the correct decoding. |
| org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java | Updates an existing test to restore project encoding and adds a new regression test for charset detection. |
| org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.html | Adds a new Javadoc HTML fixture intended to exercise charset detection. |
| org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.txt | Adds the expected extracted Javadoc text (includes non-ASCII content). |

chagong commented Jun 8, 2026

@iloveeclipse can you take a look?

@iloveeclipse

Member

> can you take a look?

Looking only at the commit list, I see two merge commits. Please don't use merge; it is not allowed in Eclipse repositories. Use rebase instead.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch 2 times, most recently from 033c543 to f66390b (June 11, 2026 06:21)
iloveeclipse requested a review from Copilot (June 11, 2026 06:45)

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch 2 times, most recently from e44a340 to 85ee630 (June 17, 2026 05:54)
chagong and others added 3 commits June 17, 2026 14:37
The attached-Javadoc reader in JavaElement#getURLContents only recognized
the legacy XHTML form
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
via a hand-rolled byte-level scan. Modern Javadoc (HTML5) emits
    <meta charset="UTF-8">
which was never matched, so the contents were decoded using the project's
default platform encoding. When the project encoding differed from the
Javadoc encoding (e.g. ISO-8859-1 project consuming a UTF-8 Javadoc JAR),
non-ASCII characters were rendered as mojibake in hovers and the Javadoc
view.

Replace the byte-array scan with a small regex that decodes the first
4 KiB of the HTML head as UTF-8 (which is ASCII-compatible for the
header bytes we care about) and extracts the first 'charset=' value from
any <meta> tag. A negative lookbehind on [A-Za-z0-9_-] guards against
unrelated attributes such as data-charset.

Add testAttachedJavadocWithModernMetaCharset, which forces the project
default encoding to ISO-8859-1 and verifies that a UTF-8 Javadoc carrying
both <meta charset="UTF-8"> and an adversarial data-charset attribute is
decoded as UTF-8.

Reported in microsoft/vscode-java-pack#1429.

Signed-off-by: Changyong Gong <chagon@microsoft.com>
…and typo

- Use possessive quantifiers in META_TAG_PATTERN and META_CHARSET_PATTERN
  to eliminate polynomial backtracking flagged by CodeQL
- Wrap charset reset in try/finally so classpath restoration always
  executes even if setDefaultCharset throws
- Fix remaining 'Shouldhave' typo in testBug426058

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
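
To illustrate the possessive-quantifier change named in the commit message above, here is a small sketch. The two tag patterns are assumed stand-ins, not the PR's real constants, and the class name `PossessiveQuantifierSketch` is hypothetical; whether this form covers the exact cases CodeQL flagged depends on the real patterns, which are not shown in this thread.

```java
import java.util.regex.Pattern;

final class PossessiveQuantifierSketch {
    // Assumed pattern bodies for illustration only.
    static final Pattern GREEDY_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    static final Pattern POSSESSIVE_TAG =
            Pattern.compile("<meta\\b[^>]*+>", Pattern.CASE_INSENSITIVE);

    static boolean containsTag(Pattern pattern, String html) {
        return pattern.matcher(html).find();
    }

    public static void main(String[] args) {
        String closed = "<meta charset=\"UTF-8\">";
        String unterminated = "<meta " + "x".repeat(10_000);

        // Both forms accept and reject the same inputs here. The possessive
        // form simply never retries shorter matches of [^>]* once '>' is missing.
        System.out.println(containsTag(GREEDY_TAG, closed));           // true
        System.out.println(containsTag(POSSESSIVE_TAG, closed));       // true
        System.out.println(containsTag(GREEDY_TAG, unterminated));     // false
        System.out.println(containsTag(POSSESSIVE_TAG, unterminated)); // false

        // The semantic difference in its smallest form: a possessive
        // quantifier never gives characters back to the rest of the pattern.
        System.out.println(Pattern.matches("a*a", "aaa"));   // true
        System.out.println(Pattern.matches("a*+a", "aaa"));  // false
    }
}
```

A possessive quantifier is only a safe replacement when the token after it can never match a character the quantifier consumes, as with `[^>]*+` followed by `>`.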
chagong force-pushed the fix/attached-javadoc-modern-meta-charset branch from 85ee630 to 84c65d4 (June 17, 2026 06:38)