[draft] workloadmeta/docker: recover panics from moby/client JSON decoder #49897
scottopell wants to merge 2 commits into main
Conversation
An SMP regression run on the gpu_check_hopper_fake_nvml experiment
exposed a real agent-level fragility: a panic thrown by moby/client's
JSON decoder during (*Client).ContainerInspect escapes all the way up
through DockerUtil.InspectNoCache, through the docker workloadmeta
collector's handleContainerEvent / stream() goroutine (neither wraps
with defer recover), and ends up killing the whole agent process with
exit code 2. The observed stack was:
panic: reflect: Field index out of range
encoding/json.(*decodeState).object decode.go:735
encoding/json.(*decodeState).value decode.go:380
encoding/json.(*decodeState).object decode.go:767
encoding/json.(*decodeState).unmarshal decode.go:183
encoding/json.(*Decoder).Decode stream.go:75
moby/client.decodeWithRaw[...] utils.go:125
moby/client.(*Client).ContainerInspect container_inspect.go:45
pkg/util/docker.(*DockerUtil).InspectNoCache docker_util.go:341
comp/core/workloadmeta/collectors/internal/docker.
(*collector).{buildCollectorEvent,handleContainerEvent,stream}
The root-cause panic is inside the json/reflect interaction with
moby/client v0.4.0's InspectResponse; we were unable to reproduce the
exact upstream panic from a natural Docker response body after trying
targeted mutations, a full random fuzz against the field set, and 64k
concurrent inspects of a real docker 28.5.1 response. Whatever the
SMP runner's Docker daemon emits seems specific to that environment.
What this commit DOES reproduce is the agent-visible half of the bug:
any panic that happens inside the moby client's response path — JSON
decoder, body reader, or transport — escapes InspectNoCache unchanged.
The test installs a mock http.RoundTripper whose Body.Read panics,
exercises the real InspectNoCache code path, and asserts the panic is
NOT recovered. If a future change teaches InspectNoCache (or a layer
above) to recover, the assertion should be flipped to require.Nil and
a companion test added at the workloadmeta collector level.
This lets us verify in CI that the observed fragility still exists and
gives a concrete, reproducible signal for any future fix.
The adjacent TestInspectNoCache_NormalResponse is a positive-path
sanity check for the same mock plumbing so the failure test stays
honest (if the plumbing were broken and always panicked, the negative
test would pass trivially).
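For illustration, the mock plumbing could be wired up roughly like this (a minimal sketch, not the actual test code: panicBody, panicTransport, and the assertion shown in comments are hypothetical stand-ins for whatever the test uses):

```go
package docker

import "net/http"

// panicBody is a hypothetical io.ReadCloser whose Read always panics,
// standing in for whatever byte stream triggers the decoder panic in SMP.
type panicBody struct{}

func (panicBody) Read([]byte) (int, error) { panic("injected: response body read panic") }
func (panicBody) Close() error             { return nil }

// panicTransport is a hypothetical http.RoundTripper that returns a 200
// response whose body panics the moment the JSON decoder starts reading it.
type panicTransport struct{}

func (panicTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     http.Header{"Content-Type": []string{"application/json"}},
		Body:       panicBody{},
		Request:    req,
	}, nil
}

// The test then builds a Docker client on panicTransport, exercises the real
// InspectNoCache path, and asserts the panic is NOT recovered, e.g.:
//
//	recovered := func() (r any) {
//		defer func() { r = recover() }()
//		_, _ = du.InspectNoCache(ctx, "some-container") // du/ctx set up by the test
//		return nil
//	}()
//	require.NotNil(t, recovered) // flip to require.Nil once a layer above recovers
```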
…fix)
The docker workloadmeta collector's stream() goroutine runs forever and
handles two event streams (container + image) by calling
handleContainerEvent / handleImageEvent inline. Neither path had any
panic recovery, so a single panic inside buildCollectorEvent — for
example, one raised by moby/client's JSON decoder while processing a
malformed ContainerInspect response — would bubble up, terminate the
goroutine and crash the whole agent process.
This happened in production during an SMP regression-detector run on
the gpu_check_hopper_fake_nvml experiment (every replicate crashed the
agent after a few hundred seconds, exhausting the 1h10m SMP timeout);
see taskmds/05001 for the full captured stack and context.
Wrap both event handlers in a runWithRecovery helper (sketched just after this list) that:
* logs the panic + debug.Stack() at ERROR with enough context to
identify the offending event (container ID + action for containers,
action for images),
* drops the offending event,
* lets the stream loop continue processing subsequent events.
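A minimal sketch of what that helper can look like (illustrative, not necessarily the exact code in the PR; the signature and the logger import are assumptions):

```go
package docker

import (
	"runtime/debug"

	// assumption: the collector's existing logger; any ERROR-level logger works here
	log "github.com/DataDog/datadog-agent/pkg/util/log"
)

// runWithRecovery runs f and converts any panic into an ERROR log plus a
// stack trace, so a single bad event cannot take down the stream() goroutine.
// eventDesc identifies the offending event (container ID + action, or the
// image action), matching the context described above.
func runWithRecovery(eventDesc string, f func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Errorf("recovered from panic while handling %s: %v\n%s",
				eventDesc, r, debug.Stack())
		}
	}()
	f()
}
```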
This is a quickfix, not a root-cause fix — the underlying panic is
still inside moby/client@v0.4.0's JSON decoder and needs to be fixed
there (or upstream in Go's encoding/json + reflect interaction with
container.InspectResponse). taskmds/05001 tracks that work. The
companion regression test for the agent-visible half of the bug is at
pkg/util/docker/inspect_panic_test.go (commit bc6a6f4).
Three unit tests cover the new helper:
* TestRunWithRecovery_SwallowsPanic — a panicking function returns
to its caller normally.
* TestRunWithRecovery_NoPanicNoOp — the happy path is unchanged.
* TestRunWithRecovery_SubsequentCallsAfterPanic — the reusability
property the stream loop actually relies on: a panic in one call
does not affect the next (sketched below).
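For illustration, the reusability check described in the last bullet could look roughly like this (a hypothetical test body, assuming the runWithRecovery signature sketched above):

```go
package docker

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestRunWithRecovery_SubsequentCallsAfterPanic(t *testing.T) {
	calls := 0

	// First call panics; runWithRecovery must swallow it and return normally.
	runWithRecovery("container event 1", func() {
		calls++
		panic("boom")
	})

	// A later call must still run, exactly as the stream loop relies on
	// when it moves to the next event after dropping a bad one.
	runWithRecovery("container event 2", func() {
		calls++
	})

	require.Equal(t, 2, calls)
}
```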
🎯 Code Coverage: Commit SHA 688de84
Files inventory check summary: file checks against ancestor 34c766ed for datadog-agent_7.80.0~devel.git.214.688de84.pipeline.109687042-1_amd64.deb: no change detected.
Regression Detector Results
Baseline: 34c766e
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.96 | [-4.02, +2.10] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_metrics_logs | memory utilization | +0.97 | [+0.71, +1.23] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | +0.78 | [+0.63, +0.93] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.71 | [+0.67, +0.74] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.34 | [+0.17, +0.51] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.34 | [+0.10, +0.58] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.28 | [+0.12, +0.44] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.21 | [+0.11, +0.31] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.13 | [+0.03, +0.23] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.08 | [-0.44, +0.60] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.04 | [-0.40, +0.48] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.03 | [-0.37, +0.43] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.19, +0.21] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.00 | [-0.04, +0.05] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.12, +0.11] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.01 | [-0.19, +0.18] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.21, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.05 | [-0.19, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.06 | [-0.11, -0.01] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.11 | [-0.16, -0.06] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.41 | [-0.48, -0.34] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.53 | [-0.74, -0.33] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.96 | [-4.02, +2.10] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -3.29 | [-4.93, -1.65] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 490 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 242.19MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 666 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 138.67MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 464.92MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 174.25MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 356.93 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 383.43MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Static quality checks
✅ Please find below the results from static quality gates.
Successful checks: 12 successful checks with minimal change (< 2 KiB) in on-wire sizes (compressed).
What
Two-commit defense-in-depth against panics escaping `moby/client@v0.4.0`'s JSON decoder during container-event streaming in the docker workloadmeta collector:
* `pkg/util/docker`: regression test that pins the current unrecovered-panic behaviour of `InspectNoCache`. The decoder panic propagates all the way up to the caller's goroutine. The test asserts `require.NotNil(recovered)` so we get a clean signal the day someone adopts the upstream fix — at which point the assertion needs to flip to `require.Nil`. The test message spells this out.
* `workloadmeta/docker`: panic-recovery wrapper in the collector's `stream()` goroutine. Both `handleContainerEvent` and `handleImageEvent` now run inside a `runWithRecovery` helper that logs the panic + `debug.Stack()` at ERROR with enough context to identify the offending event, drops it, and lets the loop continue. Three unit tests cover the helper (panic swallowed, no-op happy path, and the reusability property the loop relies on).
Why
This is a quickfix to stop a single bad container event from killing the whole agent process. We hit it during an SMP regression run on the new `gpu_check_hopper_fake_nvml` experiment (PR #49685): every replicate's agent panicked after a few hundred seconds, eventually exhausting SMP's 1h10m job timeout. The trigger was a `panic: reflect: Field index out of range` raised inside `moby/client`'s JSON decoder while processing what looked like a perfectly well-formed `ContainerInspect` response. Neither `DockerUtil.InspectNoCache` nor the workloadmeta `stream()` goroutine recovered; the panic terminated the process with exit code 2.
The root cause is upstream — somewhere in `moby/client@v0.4.0` or in Go's `encoding/json` + `reflect` interaction with `container.InspectResponse` — and we couldn't reproduce it from a unit test despite a wide range of mutations (targeted decoder fuzzing, a 200-iteration random fuzz, 64k concurrent inspects against real Docker 28.5.1). It seems specific to the Docker API version SMP's runner exposes.
While that root-cause work continues, the agent should not crash because of it.
Scope
This PR intentionally does not try to fix the moby decoder. It only:
* pins the current unrecovered-panic behaviour of `InspectNoCache` with a regression test, and
* adds the `runWithRecovery` wrapper so the workloadmeta collector's `stream()` goroutine survives a panicking event.
Validation
* `dda inv test --targets=./pkg/util/docker --test-run-name=InspectNoCache` → 2 passed
* `dda inv test --targets=./comp/core/workloadmeta/collectors/internal/docker --test-run-name=RunWithRecovery` → 3 passed