
Commit b2c588e

feat: charset-aware, binary-safe body previews in request/response logging
Body previews in the default sync and async instrumentation steps decoded every captured payload as UTF-8. That produced mojibake for a text body declared in another charset (a latin-1 response logged `café` as `caf?`) and a run of U+FFFD replacement characters for a binary body (gzip, an image), which is both unreadable and capable of corrupting log viewers.

Add a shared `BodyPreview` renderer that the two steps delegate to:

- Text bodies are decoded with the charset declared on the body's `MediaType`, falling back to UTF-8 when the media type omits a charset or names one the JVM cannot resolve.
- A short text/binary sniff (NUL byte, or a high ratio of non-whitespace control bytes in a leading sample) classifies binary bodies, which are rendered as a size-only `[binary N bytes]` summary instead of being decoded as text.

Decoding still substitutes the replacement character for malformed input rather than throwing, so a snapshot truncated mid-multibyte-sequence by the bounded capture cannot crash the log line. The bounded/streaming capture behaviour of the loggable bodies is unchanged.
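The failure mode described above can be reproduced with the JDK alone. The following is a minimal sketch, not code from this commit; the helper names `decode` and `replacementCount` and the sample values are hypothetical, chosen only for illustration.

```kotlin
import java.nio.charset.Charset

/** Decodes [bytes] with [charset]; malformed input becomes U+FFFD rather than throwing. */
fun decode(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

/** Counts U+FFFD replacement characters in [text]. */
fun replacementCount(text: String): Int = text.count { it == '\uFFFD' }

// "café" encoded as ISO-8859-1: the final byte is 0xE9, which is not valid UTF-8 on its own.
val latin1Cafe: ByteArray = "café".toByteArray(Charsets.ISO_8859_1)

// The first bytes of a gzip stream: 0x8b is a stray continuation byte when read as UTF-8.
val gzipHeader: ByteArray = byteArrayOf(0x1f, 0x8b.toByte(), 0x08, 0x00)
```

Decoding `latin1Cafe` as UTF-8 yields `caf` followed by one replacement character, while decoding it as ISO-8859-1 recovers `café`. Decoding `gzipHeader` as UTF-8 produces a replacement character where the 0x8b byte was.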
1 parent 0f09ae9 commit b2c588e

5 files changed

Lines changed: 352 additions & 23 deletions

Lines changed: 127 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,127 @@
+/*
+ * Copyright (c) 2026 dexpace and Omar Aljarrah
+ *
+ * Licensed under the MIT License. See LICENSE in the project root.
+ * SPDX-License-Identifier: MIT
+ */
+
+package org.dexpace.sdk.core.http.pipeline.steps
+
+import org.dexpace.sdk.core.http.common.MediaType
+import java.nio.charset.Charset
+
+/**
+ * Renders captured request/response body bytes into a short, log-safe preview string.
+ *
+ * The default instrumentation steps previously decoded every preview as UTF-8, which produced
+ * mojibake for a text body declared in another charset (e.g. ISO-8859-1) and a stream of
+ * `U+FFFD` replacement characters for a binary body (a gzip payload, an image, …). This helper
+ * makes previews:
+ *
+ * - **Charset-aware** — a text body is decoded with the charset declared on its [MediaType]
+ *   (`Content-Type: text/plain;charset=ISO-8859-1`). When the media type omits a charset, or
+ *   names one the JVM does not recognise, the preview falls back to [DEFAULT_CHARSET] (UTF-8),
+ *   matching the HTTP default for most text types and the previous behaviour.
+ * - **Binary-safe** — a body that does not look like text is never decoded. Instead it is
+ *   summarised as `[binary N bytes]`, so an operator sees a stable, greppable marker rather
+ *   than a line of replacement characters that can corrupt log viewers.
+ *
+ * Decoding itself never throws: [Charset]-based decoding substitutes the replacement character
+ * for malformed input, so a snapshot that ends mid-multibyte-sequence (a real possibility on a
+ * bounded capture) still yields a usable preview rather than crashing the log line.
+ *
+ * This is an `internal` seam shared by [DefaultInstrumentationStep] and
+ * [DefaultAsyncInstrumentationStep]; it has no public API surface.
+ */
+internal object BodyPreview {
+    /** Charset used when a text body declares no charset, or names an unknown one. */
+    internal val DEFAULT_CHARSET: Charset = Charsets.UTF_8
+
+    /**
+     * Number of leading bytes inspected by [isProbablyText]. A small fixed sample keeps the
+     * heuristic O(1) regardless of body size while still catching the common binary signatures
+     * (NUL bytes, control-byte runs) that appear near the start of a payload.
+     */
+    private const val SNIFF_SAMPLE_BYTES = 1024
+
+    /**
+     * Fraction of sampled bytes that may be non-text control bytes before the body is judged
+     * binary. Real text can carry the odd control byte (an ESC sequence, a stray form feed), so
+     * the cutoff sits well above zero rather than rejecting on the first control byte.
+     */
+    private const val MAX_CONTROL_BYTE_RATIO = 0.30
+
+    // Byte values used by the text/binary heuristic, named so the rule logic reads clearly and
+    // so the values are not flagged as inline magic numbers.
+    private const val UNSIGNED_BYTE_MASK = 0xFF
+    private const val NUL = 0x00
+    private const val FIRST_PRINTABLE = 0x20 // space; bytes below this are C0 control bytes
+    private const val TAB = 0x09
+    private const val LINE_FEED = 0x0A
+    private const val CARRIAGE_RETURN = 0x0D
+    private const val DEL = 0x7F
+
+    /**
+     * Renders [bytes] as a preview string.
+     *
+     * Empty input yields the empty string. A body that does not pass [isProbablyText] is rendered
+     * as a size-only `[binary N bytes]` summary. Otherwise the bytes are decoded with the charset
+     * from [mediaType] (or [DEFAULT_CHARSET] when absent/unknown).
+     */
+    internal fun render(
+        bytes: ByteArray,
+        mediaType: MediaType?,
+    ): String {
+        if (bytes.isEmpty()) return ""
+        if (!isProbablyText(bytes)) return binarySummary(bytes.size)
+        return previewText(bytes, mediaType)
+    }
+
+    /**
+     * Decodes [bytes] using the charset declared on [mediaType], falling back to
+     * [DEFAULT_CHARSET] when the media type is null, declares no charset, or names a charset the
+     * JVM cannot resolve ([MediaType.charset] returns null in the latter two cases). Invalid byte
+     * sequences are replaced rather than throwing.
+     */
+    internal fun previewText(
+        bytes: ByteArray,
+        mediaType: MediaType?,
+    ): String {
+        if (bytes.isEmpty()) return ""
+        val charset = mediaType?.charset ?: DEFAULT_CHARSET
+        return String(bytes, charset)
+    }
+
+    /**
+     * Heuristically decides whether [bytes] is text. Samples the first [SNIFF_SAMPLE_BYTES]:
+     * a single NUL byte is treated as a strong binary signal, and a control-byte ratio above
+     * [MAX_CONTROL_BYTE_RATIO] (excluding the common text whitespace `\t`, `\n`, `\r`) also marks
+     * the body binary. Bytes >= 0x80 are not counted against the body — they are legitimate in
+     * UTF-8 multibyte sequences and in single-byte charsets such as ISO-8859-1.
+     */
+    @Suppress("ReturnCount")
+    internal fun isProbablyText(bytes: ByteArray): Boolean {
+        if (bytes.isEmpty()) return true
+        val sample = minOf(bytes.size, SNIFF_SAMPLE_BYTES)
+        var controlBytes = 0
+        for (i in 0 until sample) {
+            val b = bytes[i].toInt() and UNSIGNED_BYTE_MASK
+            // A NUL byte essentially never appears in real text and is the canonical marker of a
+            // binary payload (gzip, images, protobuf, …); reject immediately.
+            if (b == NUL) return false
+            if (isControlByte(b)) controlBytes++
+        }
+        return controlBytes.toDouble() / sample <= MAX_CONTROL_BYTE_RATIO
+    }
+
+    /**
+     * True for an ASCII C0 control byte that is not one of the whitespace characters routinely
+     * found in text ([TAB], [LINE_FEED], [CARRIAGE_RETURN]), or for the [DEL] byte. High bytes
+     * (>= 0x80) are deliberately excluded — see [isProbablyText].
+     */
+    private fun isControlByte(b: Int): Boolean =
+        (b < FIRST_PRINTABLE && b != TAB && b != LINE_FEED && b != CARRIAGE_RETURN) || b == DEL
+
+    /** The size-only summary emitted for a binary body. */
+    private fun binarySummary(size: Int): String = "[binary $size bytes]"
+}
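Because `BodyPreview` is `internal` and depends on the SDK's `MediaType`, the sniff cannot be called from outside the module. To experiment with the ratio cutoff, the sketch below restates the heuristic as a standalone function with the same sample size and threshold as in the file above. The names `looksLikeText` and `sampleWithControlBytes` are hypothetical and are not part of the commit.

```kotlin
// Standalone re-statement of the text/binary sniff, for experimentation only.
const val SAMPLE_BYTES = 1024
const val MAX_RATIO = 0.30

fun looksLikeText(bytes: ByteArray): Boolean {
    if (bytes.isEmpty()) return true
    val sample = minOf(bytes.size, SAMPLE_BYTES)
    var control = 0
    for (i in 0 until sample) {
        val b = bytes[i].toInt() and 0xFF
        if (b == 0x00) return false // a NUL byte rejects immediately
        val isTextWhitespace = b == 0x09 || b == 0x0A || b == 0x0D
        // C0 control bytes other than tab/LF/CR, plus DEL, count against the body.
        if ((b < 0x20 && !isTextWhitespace) || b == 0x7F) control++
    }
    return control.toDouble() / sample <= MAX_RATIO
}

/** Builds [total] bytes: [control] copies of ESC (0x1B) followed by 'a' padding. */
fun sampleWithControlBytes(control: Int, total: Int): ByteArray =
    ByteArray(total) { i -> if (i < control) 0x1B.toByte() else 'a'.code.toByte() }
```

With these definitions, 3 control bytes in a 10-byte body sit exactly on the cutoff (ratio 0.30) and the body is still treated as text, because the comparison is `<=`. With 4 control bytes in 10 (ratio 0.40) the body is classified as binary. Bytes at or above 0x80 never count, so an ISO-8859-1 `café` passes.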

sdk-core/src/main/kotlin/org/dexpace/sdk/core/http/pipeline/steps/DefaultAsyncInstrumentationStep.kt

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -308,13 +308,13 @@ public class DefaultAsyncInstrumentationStep
             if (requestBody != null) {
                 val preview = requestBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("request.body.size", preview.size.toLong())
-                    .field("request.body.preview", utf8Preview(preview))
+                    .field("request.body.preview", BodyPreview.render(preview, requestBody.mediaType()))
             }
             val responseBody = response.body
             if (responseBody is LoggableResponseBody) {
                 val preview = responseBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("response.body.size", preview.size.toLong())
-                    .field("response.body.preview", utf8Preview(preview))
+                    .field("response.body.preview", BodyPreview.render(preview, responseBody.mediaType()))
                 responseBody.captureException?.let {
                     ev.field("response.body.drain_error", it.javaClass.simpleName ?: "Throwable")
                 }
@@ -346,7 +346,7 @@ public class DefaultAsyncInstrumentationStep
             if (shouldCaptureBody() && requestBody != null) {
                 val preview = requestBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("request.body.size", preview.size.toLong())
-                    .field("request.body.preview", utf8Preview(preview))
+                    .field("request.body.preview", BodyPreview.render(preview, requestBody.mediaType()))
             }
             ev.log()
         } catch (t: Throwable) {
@@ -416,13 +416,6 @@ public class DefaultAsyncInstrumentationStep

     private fun elapsedMillis(startNanos: Long): Double = (clock.monotonic() - startNanos) / NANOS_PER_MILLI_DOUBLE

-    private fun utf8Preview(bytes: ByteArray): String {
-        if (bytes.isEmpty()) return ""
-        // Defensive: a snapshot that ends mid-UTF-8 codepoint shouldn't crash the log line.
-        // String(bytes, charset) replaces invalid sequences rather than throwing.
-        return String(bytes, Charsets.UTF_8)
-    }
-
     private fun emitInstrumentationError(
         event: String,
         phase: String,

sdk-core/src/main/kotlin/org/dexpace/sdk/core/http/pipeline/steps/DefaultInstrumentationStep.kt

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -212,13 +212,13 @@ public class DefaultInstrumentationStep
             if (requestBody != null) {
                 val preview = requestBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("request.body.size", preview.size.toLong())
-                    .field("request.body.preview", utf8Preview(preview))
+                    .field("request.body.preview", BodyPreview.render(preview, requestBody.mediaType()))
             }
             val responseBody = response.body
             if (responseBody is LoggableResponseBody) {
                 val preview = responseBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("response.body.size", preview.size.toLong())
-                    .field("response.body.preview", utf8Preview(preview))
+                    .field("response.body.preview", BodyPreview.render(preview, responseBody.mediaType()))
                 responseBody.captureException?.let {
                     ev.field("response.body.drain_error", it.javaClass.simpleName ?: "Throwable")
                 }
@@ -250,7 +250,7 @@ public class DefaultInstrumentationStep
             if (shouldCaptureBody() && requestBody != null) {
                 val preview = requestBody.snapshot(options.bodyPreviewMaxBytes)
                 ev.field("request.body.size", preview.size.toLong())
-                    .field("request.body.preview", utf8Preview(preview))
+                    .field("request.body.preview", BodyPreview.render(preview, requestBody.mediaType()))
             }
             ev.log()
         } catch (t: Throwable) {
@@ -320,13 +320,6 @@ public class DefaultInstrumentationStep

     private fun elapsedMillis(startNanos: Long): Double = (clock.monotonic() - startNanos) / NANOS_PER_MILLI_DOUBLE

-    private fun utf8Preview(bytes: ByteArray): String {
-        if (bytes.isEmpty()) return ""
-        // Defensive: a snapshot that ends mid-UTF-8 codepoint shouldn't crash the log line.
-        // String(bytes, charset) replaces invalid sequences rather than throwing.
-        return String(bytes, Charsets.UTF_8)
-    }
-
     private fun emitInstrumentationError(
         event: String,
         phase: String,
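Both steps now pass the body's media type to the renderer, and the UTF-8 fallback depends on `MediaType.charset` returning null for a missing or unresolvable charset name. How that property is implemented is not shown in this commit. As a rough sketch of equivalent behaviour on the JDK alone, assuming only a raw charset name is available, a hypothetical helper `resolveCharsetOrDefault` could look like this:

```kotlin
import java.nio.charset.Charset

/**
 * Hypothetical helper (not part of the SDK): resolves [name] to a [Charset],
 * returning [default] when the name is absent, blank, illegal, or unsupported.
 */
fun resolveCharsetOrDefault(name: String?, default: Charset = Charsets.UTF_8): Charset {
    if (name.isNullOrBlank()) return default
    return try {
        Charset.forName(name.trim())
    } catch (e: IllegalArgumentException) {
        // Covers IllegalCharsetNameException and UnsupportedCharsetException,
        // both of which extend IllegalArgumentException.
        default
    }
}
```

Catching the common supertype keeps the fallback to a single branch, so a malformed charset parameter in a `Content-Type` header degrades to the default rather than failing the log line.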
Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
+/*
+ * Copyright (c) 2026 dexpace and Omar Aljarrah
+ *
+ * Licensed under the MIT License. See LICENSE in the project root.
+ * SPDX-License-Identifier: MIT
+ */
+
+package org.dexpace.sdk.core.http.pipeline.steps
+
+import org.dexpace.sdk.core.http.common.MediaType
+import java.nio.charset.StandardCharsets
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertFalse
+import kotlin.test.assertTrue
+
+class BodyPreviewTest {
+    @Test
+    fun `empty bytes render to empty string`() {
+        assertEquals("", BodyPreview.render(ByteArray(0), MediaType.parse("text/plain")))
+    }
+
+    @Test
+    fun `text body declared as ISO-8859-1 is decoded with that charset`() {
+        // 0xE9 is 'é' in ISO-8859-1. Decoded as UTF-8 it would be the U+FFFD replacement char.
+        val bytes = "café".toByteArray(StandardCharsets.ISO_8859_1)
+        val preview = BodyPreview.render(bytes, MediaType.parse("text/plain;charset=ISO-8859-1"))
+        assertEquals("café", preview)
+        assertFalse(preview.contains('\uFFFD'), "latin-1 body must not be decoded as mojibake")
+    }
+
+    @Test
+    fun `text body declared as UTF-8 is decoded with UTF-8`() {
+        val bytes = "naïve — text".toByteArray(StandardCharsets.UTF_8)
+        assertEquals("naïve — text", BodyPreview.render(bytes, MediaType.parse("text/plain;charset=utf-8")))
+    }
+
+    @Test
+    fun `text body with no declared charset falls back to UTF-8`() {
+        val bytes = "résumé".toByteArray(StandardCharsets.UTF_8)
+        // No charset parameter on the media type → documented default is UTF-8.
+        assertEquals("résumé", BodyPreview.render(bytes, MediaType.parse("text/plain")))
+        // A null media type also falls back to UTF-8.
+        assertEquals("résumé", BodyPreview.render(bytes, null))
+    }
+
+    @Test
+    fun `unknown declared charset falls back to UTF-8`() {
+        val bytes = "hello".toByteArray(StandardCharsets.UTF_8)
+        // MediaType.charset returns null for an unrecognised charset name; preview falls back.
+        assertEquals("hello", BodyPreview.render(bytes, MediaType.parse("text/plain;charset=not-a-real-charset")))
+    }
+
+    @Test
+    fun `binary body with NUL byte is summarised as size only and not decoded`() {
+        // gzip magic header (0x1f 0x8b 0x08) followed by a NUL byte — a representative binary body.
+        val bytes = byteArrayOf(0x1f, 0x8b.toByte(), 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03)
+        val preview = BodyPreview.render(bytes, MediaType.parse("application/gzip"))
+        assertEquals("[binary ${bytes.size} bytes]", preview)
+        assertFalse(preview.contains('\uFFFD'), "binary body must not be rendered as replacement chars")
+    }
+
+    @Test
+    fun `binary body declared with a text charset is still detected as binary`() {
+        // Even if a server mislabels a gzip stream as text, the NUL byte keeps it from being decoded.
+        val bytes = byteArrayOf(0x1f, 0x8b.toByte(), 0x00, 0x42, 0x00, 0x13)
+        val preview = BodyPreview.render(bytes, MediaType.parse("text/plain;charset=utf-8"))
+        assertEquals("[binary ${bytes.size} bytes]", preview)
+    }
+
+    @Test
+    fun `isProbablyText accepts plain ASCII and latin-1 high bytes`() {
+        assertTrue(BodyPreview.isProbablyText("plain ascii text".toByteArray(StandardCharsets.US_ASCII)))
+        assertTrue(BodyPreview.isProbablyText("café".toByteArray(StandardCharsets.ISO_8859_1)))
+        // Common text whitespace control bytes do not flip the verdict.
+        assertTrue(BodyPreview.isProbablyText("line1\r\n\tline2".toByteArray(StandardCharsets.UTF_8)))
+    }
+
+    @Test
+    fun `isProbablyText rejects a NUL byte and a control-byte-heavy body`() {
+        assertFalse(BodyPreview.isProbablyText(byteArrayOf(0x41, 0x00, 0x42)))
+        // A run of non-whitespace C0 control bytes exceeds the ratio cutoff.
+        assertFalse(BodyPreview.isProbablyText(byteArrayOf(0x01, 0x02, 0x03, 0x04, 0x05)))
+    }
+}
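The tests above do not exercise the case the commit message calls out, a snapshot cut off in the middle of a multibyte sequence. The sketch below shows that behaviour using only the JDK decoder. The helper `previewOfTruncated` is hypothetical and merely stands in for a bounded capture followed by decoding; it is not the SDK's `snapshot` API.

```kotlin
/**
 * Hypothetical stand-in for a bounded capture: encodes [text] as UTF-8, keeps at most
 * [maxBytes] bytes, and decodes what is left. String(bytes, charset) replaces a dangling
 * lead byte with U+FFFD instead of throwing.
 */
fun previewOfTruncated(text: String, maxBytes: Int): String {
    val full = text.toByteArray(Charsets.UTF_8)
    val snapshot = full.copyOf(minOf(maxBytes, full.size))
    return String(snapshot, Charsets.UTF_8)
}
```

In UTF-8, `café` is five bytes because `é` encodes as 0xC3 0xA9. Capping the capture at four bytes leaves a dangling 0xC3, which decodes to `caf` plus a single replacement character. A five-byte cap round-trips the full string.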
