Detect modern HTML5 <meta charset> in attached Javadoc #5087
Open
chagong wants to merge 3 commits into
Pull request overview
Fixes incorrect decoding of attached Javadoc HTML when the project default encoding differs from the Javadoc content by improving charset detection from <meta ... charset=...> in JavaElement#getURLContents.
Changes:
- Replace the legacy byte-array scan for charset detection with a regex-based <meta> parsing approach using a UTF-8 decoded prefix.
- Add a new attached-Javadoc test that forces the project encoding to ISO-8859-1 and validates UTF-8 Javadoc is decoded correctly.
- Add new UTF-8 Javadoc fixture files under UTF8doc3.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| org.eclipse.jdt.core/model/org/eclipse/jdt/internal/core/JavaElement.java | Implements regex-based meta charset extraction from the first 4 KiB of HTML to choose the correct decoding. |
| org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/AttachedJavadocTests.java | Updates an existing test to restore project encoding and adds a new regression test for charset detection. |
| org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.html | Adds a new Javadoc HTML fixture intended to exercise charset detection. |
| org.eclipse.jdt.core.tests.model/workspace/AttachedJavadocProject/UTF8doc3/p/TestBug394382.txt | Adds the expected extracted Javadoc text (includes non-ASCII content). |
Author
@iloveeclipse can you take a look?
Member
Looking only at the commit list, I see two merge commits. Please don't use merge; it is not allowed in Eclipse repositories. Use rebase.
The attached-Javadoc reader in JavaElement#getURLContents only recognized
the legacy XHTML form
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
via a hand-rolled byte-level scan. Modern Javadoc (HTML5) emits
<meta charset="UTF-8">
which was never matched, so the contents were decoded using the project's
default platform encoding. When the project encoding differed from the
Javadoc encoding (e.g. ISO-8859-1 project consuming a UTF-8 Javadoc JAR),
non-ASCII characters were rendered as mojibake in hovers and the Javadoc
view.
Replace the byte-array scan with a small regex that decodes the first
4 KiB of the HTML head as UTF-8 (which is ASCII-compatible for the
header bytes we care about) and extracts the first 'charset=' value from
any <meta> tag. A negative lookbehind on [A-Za-z0-9_-] guards against
unrelated attributes such as data-charset.
Add testAttachedJavadocWithModernMetaCharset, which forces the project
default encoding to ISO-8859-1 and verifies that a UTF-8 Javadoc carrying
both <meta charset="UTF-8"> and an adversarial data-charset attribute is
decoded as UTF-8.
Reported in microsoft/vscode-java-pack#1429.
Signed-off-by: Changyong Gong <chagon@microsoft.com>
…and typo
- Use possessive quantifiers in META_TAG_PATTERN and META_CHARSET_PATTERN
  to eliminate polynomial backtracking flagged by CodeQL
- Wrap charset reset in try/finally so classpath restoration always
  executes even if setDefaultCharset throws
- Fix remaining 'Shouldhave' typo in testBug426058
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
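To illustrate what a possessive quantifier changes, here is a minimal sketch. The class and method names are hypothetical and the patterns are simplified stand-ins, not the actual META_TAG_PATTERN or META_CHARSET_PATTERN from the commit. A possessive quantifier (`*+`) never gives back characters it has consumed, so the regex engine cannot retry shorter matches.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backtracks one so the final 'a' can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives one back, so the final 'a' cannot match.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    // Simplified, hypothetical meta-tag pattern: [^>] can never consume '>',
    // so making it possessive removes backtracking without changing what matches.
    static boolean metaTagMatches(String s) {
        return Pattern.matches("<meta[^>]*+>", s);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));
        System.out.println(possessiveMatches("aaa"));
        System.out.println(metaTagMatches("<meta charset=\"UTF-8\">"));
    }
}
```

Because the character class already excludes the closing `>`, the possessive form accepts the same tags as the greedy one; it only removes the retry work on input that never closes the tag.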
What this fixes
Reported downstream in microsoft/vscode-java-pack#1429: when a project's
default file encoding does not match the encoding of an attached Javadoc
JAR, non-ASCII Javadoc text is shown as mojibake in hovers and the Javadoc
view.
The root cause is in JavaElement#getURLContents (the helper used by
BinaryType#getJavadocContents to fetch attached Javadoc HTML over
jar:/http(s): URLs). It only recognized the legacy XHTML form
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
via a hand-rolled byte-level scan (META_START, META_END, CHARSET,
CHARSET_HTML5, getIndexOf, isSameCharacter). Modern Javadoc generators
emit the HTML5 form
<meta charset="UTF-8">
which was never matched, so the bytes were decoded with the project's
default platform encoding. For an ISO-8859-1 project consuming UTF-8
Javadoc this produced garbled output.
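The failure mode can be reproduced outside Eclipse. The following is a small sketch with hypothetical class and method names, not code from this PR: it encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1, which is what happens when the project encoding is applied to UTF-8 Javadoc.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decodes UTF-8 bytes as ISO-8859-1, mimicking a mismatched project encoding.
    static String decodeWrongly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes with the charset they were written in.
    static String decodeCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"; the last letter is two bytes in UTF-8
        System.out.println(decodeWrongly(original));   // each UTF-8 byte becomes its own character
        System.out.println(decodeCorrectly(original)); // round-trips unchanged
    }
}
```

ASCII characters survive the wrong decoding because both charsets agree on them, which is why only non-ASCII Javadoc text is affected.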
Change
Replace the byte-array scan in JavaElement with a small regex that:
- Decodes the first 4 KiB of the HTML as UTF-8 (which is ASCII-compatible
  for the header bytes we care about, and Javadoc <head> is always
  well within that budget).
- Finds <meta ...> tags via META_TAG_PATTERN.
- Extracts the first charset=... value via META_CHARSET_PATTERN,
  which uses a negative lookbehind (?<![A-Za-z0-9_-]) so it does
  not match unrelated attributes like data-charset or x-charset.
This handles both the legacy http-equiv form and the modern HTML5
<meta charset="…"> form with a single code path, and is more robust
against attribute order, single vs. double quotes, and whitespace.
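The approach described above might look roughly like the following standalone sketch. It is not the code from this PR: the class name, `PREFIX_LIMIT`, `META_TAG`, `META_CHARSET`, and `detectCharset` are hypothetical, and the patterns are an assumption about what META_TAG_PATTERN and META_CHARSET_PATTERN could look like given the description.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSketch {
    // Only the first 4 KiB are inspected (assumed budget, taken from the description).
    private static final int PREFIX_LIMIT = 4096;

    // Hypothetical pattern: one whole <meta ...> tag, with a possessive quantifier.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*+>", Pattern.CASE_INSENSITIVE);

    // Hypothetical pattern: charset=VALUE with optional quotes. The lookbehind
    // rejects attribute names that merely end in "charset", e.g. data-charset.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?<![A-Za-z0-9_-])charset\\s*+=\\s*+[\"']?+([A-Za-z0-9_.:-]++)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first usable charset declared in a meta tag, or the fallback. */
    static Charset detectCharset(byte[] html, Charset fallback) {
        int length = Math.min(html.length, PREFIX_LIMIT);
        // The markup of interest is ASCII, so decoding the prefix as UTF-8 is safe.
        String prefix = new String(html, 0, length, StandardCharsets.UTF_8);
        Matcher tag = META_TAG.matcher(prefix);
        while (tag.find()) {
            Matcher charset = META_CHARSET.matcher(tag.group());
            if (charset.find()) {
                try {
                    return Charset.forName(charset.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: keep scanning later tags.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"x\" data-charset=\"ISO-8859-1\">"
                + "<meta charset=\"UTF-8\"></head></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(detectCharset(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch the legacy `http-equiv` form is matched by the same pattern, because `charset=` inside the `content` attribute value is preceded by a space rather than by an identifier character.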
Test
Added AttachedJavadocTests#testAttachedJavadocWithModernMetaCharset,
which:
- Forces the project default encoding to ISO-8859-1 (and restores it).
- Uses the UTF8doc3 Javadoc directory whose HTML uses
  <meta charset="UTF-8"> plus an adversarial
  <meta name="x" data-charset="ISO-8859-1"> to verify the lookbehind.
- Calls IType#getAttachedJavadoc(...) and asserts the returned text
  equals the UTF-8 reference file (containing non-ASCII characters,
  e.g. こんにちは世界).
Local run of the full AttachedJavadocTests suite under Tycho against
the modified core: Tests run: 46, Failures: 0, Errors: 0.
Notes
- Only the JavaElement helper and its private constants are touched.