-
Notifications
You must be signed in to change notification settings - Fork 333
Fix charset decoding for server.request.body.files_content in commons-fileupload #11212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
gh-worker-dd-mergequeue-cf854d
merged 16 commits into
master
from
alejandro.gonzalez/APPSEC-61875-files-content-encoding
Apr 28, 2026
Merged
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
045ea63
fix(appsec): use per-part charset for files_content in commons-fileup…
jandro996 733393c
fix(appsec): use CharsetDecoder with REPORT for charset fallback in F…
jandro996 4738eed
refactor(appsec): extract MultipartContentDecoder to internal-api for…
jandro996 21b3820
test(appsec): add missing corner cases to MultipartContentDecoderTest
jandro996 b7a1a33
test(appsec): trim redundant decoder tests from FileItemContentReader…
jandro996 e74778b
fix(appsec): use machine default charset as fallback in MultipartCont…
jandro996 55044ff
test(appsec): migrate MultipartContentDecoderTest from Groovy/Spock t…
jandro996 6f0a3ec
fix(appsec): strip surrounding quotes from charset parameter in Multi…
jandro996 beea424
fix(appsec): replace String#split with char loop in MultipartContentD…
jandro996 228e1c4
Merge branch 'master' into alejandro.gonzalez/APPSEC-61875-files-cont…
jandro996 aa1b0b6
refactor(appsec): avoid toLowerCase allocation in MultipartContentDec…
jandro996 fd0ed70
chore: add CODEOWNERS entry for internal-api/datadog/trace/api/http
jandro996 22536b1
Use REPLACE for malformed bytes so truncation preserves declared charset
jandro996 bbaf10e
Remove dead catch block from MultipartContentDecoder.decodeBytes
jandro996 5d471a6
Fix charset parameter name matching to require exact boundary
jandro996 33f9eb1
Restore required catch for checked CharacterCodingException
jandro996 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
74 changes: 74 additions & 0 deletions
74
internal-api/src/main/java/datadog/trace/api/http/MultipartContentDecoder.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| package datadog.trace.api.http; | ||
|
|
||
| import java.nio.ByteBuffer; | ||
| import java.nio.charset.CharacterCodingException; | ||
| import java.nio.charset.Charset; | ||
| import java.nio.charset.CodingErrorAction; | ||
|
|
||
| /** Decodes multipart file content bytes to String using the per-part Content-Type charset. */ | ||
| public final class MultipartContentDecoder { | ||
|
|
||
| public static String decodeBytes(byte[] buf, int length, String contentType) { | ||
| Charset charset = extractCharset(contentType); | ||
| if (charset == null) charset = Charset.defaultCharset(); | ||
| try { | ||
| return charset | ||
| .newDecoder() | ||
| .onMalformedInput(CodingErrorAction.REPLACE) | ||
| .onUnmappableCharacter(CodingErrorAction.REPLACE) | ||
| .decode(ByteBuffer.wrap(buf, 0, length)) | ||
| .toString(); | ||
| } catch (CharacterCodingException e) { | ||
| // unreachable: CodingErrorAction.REPLACE never throws CharacterCodingException | ||
| throw new IllegalStateException(e); | ||
| } | ||
| } | ||
|
|
||
| public static Charset extractCharset(String contentType) { | ||
| if (contentType == null) return null; | ||
| int searchFrom = 0; | ||
| while (true) { | ||
| int idx = indexOfIgnoreAsciiCase(contentType, "charset=", searchFrom); | ||
| if (idx < 0) return null; | ||
| // Require a parameter boundary before "charset=" so "xcharset=..." is not matched | ||
| if (idx == 0 || contentType.charAt(idx - 1) == ';' || contentType.charAt(idx - 1) == ' ') { | ||
| int nameStart = idx + 8; | ||
| int end = contentType.length(); | ||
| for (int i = nameStart; i < end; i++) { | ||
| char c = contentType.charAt(i); | ||
| if (c == ';' || c == ',' || c == ' ') { | ||
| end = i; | ||
| break; | ||
| } | ||
| } | ||
| String name = contentType.substring(nameStart, end).trim(); | ||
| if (name.length() > 1 && name.charAt(0) == '"' && name.charAt(name.length() - 1) == '"') { | ||
| name = name.substring(1, name.length() - 1); | ||
| } | ||
| try { | ||
| return Charset.forName(name); | ||
| } catch (IllegalArgumentException e) { | ||
| return null; | ||
| } | ||
| } | ||
| searchFrom = idx + 1; | ||
| } | ||
| } | ||
|
|
||
| private static int indexOfIgnoreAsciiCase(String s, String needle, int fromIndex) { | ||
| int sLen = s.length(); | ||
| int nLen = needle.length(); | ||
| outer: | ||
| for (int i = fromIndex, max = sLen - nLen; i <= max; i++) { | ||
| for (int j = 0; j < nLen; j++) { | ||
| if (Character.toLowerCase(s.charAt(i + j)) != needle.charAt(j)) { | ||
| continue outer; | ||
| } | ||
| } | ||
| return i; | ||
| } | ||
| return -1; | ||
| } | ||
|
|
||
| private MultipartContentDecoder() {} | ||
| } | ||
151 changes: 151 additions & 0 deletions
151
internal-api/src/test/java/datadog/trace/api/http/MultipartContentDecoderTest.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| package datadog.trace.api.http; | ||
|
|
||
| import static org.junit.jupiter.api.Assertions.assertEquals; | ||
| import static org.junit.jupiter.api.Assertions.assertNull; | ||
|
|
||
| import java.nio.charset.StandardCharsets; | ||
| import org.junit.jupiter.api.Test; | ||
| import org.junit.jupiter.params.ParameterizedTest; | ||
| import org.junit.jupiter.params.provider.CsvSource; | ||
| import org.junit.jupiter.params.provider.NullAndEmptySource; | ||
| import org.junit.jupiter.params.provider.ValueSource; | ||
|
|
||
| public class MultipartContentDecoderTest { | ||
|
|
||
| @Test | ||
| void decodeBytesUsesDeclaredUtf8Charset() { | ||
| String text = "héllo wörld"; | ||
| byte[] bytes = text.getBytes(StandardCharsets.UTF_8); | ||
| assertEquals( | ||
| text, | ||
| MultipartContentDecoder.decodeBytes(bytes, bytes.length, "text/plain; charset=UTF-8")); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesUsesDeclaredIso88591Charset() { | ||
| String text = "café"; | ||
| byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1); | ||
| assertEquals( | ||
| text, | ||
| MultipartContentDecoder.decodeBytes(bytes, bytes.length, "text/plain; charset=ISO-8859-1")); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesDefaultsToMachineDefaultWhenNoCharset() { | ||
| String text = "hello world"; | ||
| byte[] bytes = text.getBytes(StandardCharsets.UTF_8); | ||
| assertEquals(text, MultipartContentDecoder.decodeBytes(bytes, bytes.length, "text/plain")); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesDefaultsToMachineDefaultWhenNullContentType() { | ||
| String text = "hello world"; | ||
| byte[] bytes = text.getBytes(StandardCharsets.UTF_8); | ||
| assertEquals(text, MultipartContentDecoder.decodeBytes(bytes, bytes.length, null)); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesRespectsLengthParameter() { | ||
| byte[] bytes = "hello world".getBytes(StandardCharsets.UTF_8); | ||
| assertEquals("hello", MultipartContentDecoder.decodeBytes(bytes, 5, null)); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesReturnsEmptyStringForZeroLength() { | ||
| assertEquals("", MultipartContentDecoder.decodeBytes(new byte[16], 0, null)); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesReplacesMalformedBytesWithReplacementCharacterUsingDeclaredCharset() { | ||
| // 0xE9 (ISO-8859-1 'é') is not valid UTF-8; REPLACE substitutes U+FFFD | ||
| byte[] bytes = "café".getBytes(StandardCharsets.ISO_8859_1); | ||
| assertEquals( | ||
| "caf�", | ||
| MultipartContentDecoder.decodeBytes(bytes, bytes.length, "text/plain; charset=UTF-8")); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesHandlesTruncationAtMultibyteCharacterBoundary() { | ||
| // "€" encodes as 3 bytes in UTF-8: E2 82 AC | ||
| byte[] complete = "hello€".getBytes(StandardCharsets.UTF_8); // 8 bytes | ||
| // Pass only 6 bytes: "hello" + first byte of "€" (incomplete sequence) | ||
| String result = MultipartContentDecoder.decodeBytes(complete, 6, "text/plain; charset=UTF-8"); | ||
| // Incomplete sequence → U+FFFD with declared charset, not fallback to JVM default | ||
| assertEquals("hello�", result); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @NullAndEmptySource | ||
| void extractCharsetReturnsNullForNullOrEmptyContentType(String contentType) { | ||
| assertNull(MultipartContentDecoder.extractCharset(contentType)); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @ValueSource(strings = {"text/plain", "image/jpeg", "application/octet-stream"}) | ||
| void extractCharsetReturnsNullForContentTypeWithoutCharset(String contentType) { | ||
| assertNull(MultipartContentDecoder.extractCharset(contentType)); | ||
| } | ||
|
|
||
| @Test | ||
| void extractCharsetReturnsNullForInvalidCharsetName() { | ||
| assertNull(MultipartContentDecoder.extractCharset("text/plain; charset=NOTACHARSET")); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @ValueSource( | ||
| strings = { | ||
| "text/plain; CHARSET=UTF-8", | ||
| "text/plain; Charset=UTF-8", | ||
| "text/plain; charset=utf-8" | ||
| }) | ||
| void extractCharsetIsCaseInsensitive(String contentType) { | ||
| assertEquals("UTF-8", MultipartContentDecoder.extractCharset(contentType).name()); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @CsvSource({"text/plain; charset=UTF-8, UTF-8", "text/xml; charset=ISO-8859-1, ISO-8859-1"}) | ||
| void extractCharsetFromStandardContentType(String contentType, String expectedCharset) { | ||
| assertEquals(expectedCharset, MultipartContentDecoder.extractCharset(contentType).name()); | ||
| } | ||
|
|
||
| @Test | ||
| void extractCharsetIgnoresSubstringMatchInParameterName() { | ||
| // "xcharset=UTF-16" must not match; the real "charset=UTF-8" that follows must be used | ||
| assertEquals( | ||
| "UTF-8", | ||
| MultipartContentDecoder.extractCharset("text/plain; xcharset=UTF-16; charset=UTF-8") | ||
| .name()); | ||
| } | ||
|
|
||
| @Test | ||
| void extractCharsetReturnsNullWhenOnlySubstringMatchExists() { | ||
| assertNull(MultipartContentDecoder.extractCharset("text/plain; xcharset=UTF-8")); | ||
| } | ||
|
|
||
| @Test | ||
| void extractCharsetHandlesAdditionalParameters() { | ||
| assertEquals( | ||
| "UTF-8", | ||
| MultipartContentDecoder.extractCharset("text/plain; charset=UTF-8; boundary=something") | ||
| .name()); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @CsvSource({ | ||
| "text/plain; charset=\"UTF-8\", UTF-8", | ||
| "text/xml; charset=\"ISO-8859-1\", ISO-8859-1" | ||
| }) | ||
| void extractCharsetHandlesQuotedCharsetValue(String contentType, String expectedCharset) { | ||
| assertEquals(expectedCharset, MultipartContentDecoder.extractCharset(contentType).name()); | ||
| } | ||
|
|
||
| @Test | ||
| void decodeBytesUsesQuotedDeclaredCharset() { | ||
| String text = "café"; | ||
| byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1); | ||
| assertEquals( | ||
| text, | ||
| MultipartContentDecoder.decodeBytes( | ||
| bytes, bytes.length, "text/plain; charset=\"ISO-8859-1\"")); | ||
| } | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this risk a NullPointerException elsewhere?
Should we fallback to a default Charset instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now it's only used in decodeBytes method, where that is done
if (charset == null) charset = Charset.defaultCharset();Probably It makes more sense to return it where you mention, not sure if in the future it will be necessary to know exactly if we are not able to get the charset
I could change it in my next PR related with this topic if you want :)