The CI Visibility wrapper deadlocks in waitParallel when a parent test and its subtests both call t.Parallel(). This is the same shape that PR #4554 ("fix orchestrion deadlock with parallel subtests") fixed in v2.7.1. We are on v2.8.1, which includes that fix, but still observe the deadlock in CI.
Stack trace (Linux runner, go test -p 4 -short ./... over a large package). Every blocked goroutine has the same shape: a subtest's call to t.Parallel() parks in testing.(*testState).waitParallel and never returns. The wrapper above it on the stack is gotesting.instrumentTestingTFunc.func1.1:
The real failing test (parent + subtest both call t.Parallel()):
func TestJobsHealthReadinessCheck(t *testing.T) {
	t.Parallel()
	t.Run("returns ready when health checker succeeds", func(t *testing.T) {
		t.Parallel()
		// ... in-process httptest.NewRequest against a chi.Router ...
	})
	t.Run("returns not_ready when health checker fails", func(t *testing.T) {
		t.Parallel() // ← deadlocks here
		// ...
	})
}
Observed conditions
The CI deadlock is deterministic on Linux ubuntu-latest for a full package (~50 tests, multiple parents calling t.Parallel()). The test binary hits the 10-minute go test timeout. It reproduces both with and without -race.
Reproduction Code
I tried to build a minimal reproducer. It does NOT deadlock on macOS / darwin/arm64 (go1.26.0, orchestrion v1.10.0, dd-trace-go v2.8.1) — passes consistently across 5+ runs. The deadlock seems to require either the Linux scheduler or a heavier per-package test count to saturate testState.maxParallel. Including the attempt here in case it helps narrow it down:
Run with:
DD_CIVISIBILITY_ENABLED=true \
DD_API_KEY=fake DD_SITE=datadoghq.com \
DD_CIVISIBILITY_AGENTLESS_ENABLED=true \
go test -toolexec='orchestrion toolexec' \
  -v -timeout=60s -parallel=2 ./...
Question
Two possibilities I see:
1. The fix from PR #4554 (fix(internal/civisibility): fix orchestrion deadlock with parallel subtests) covers the trivial-case shape (the clone path with meta.originalTest != nil), but instrumentation_orchestrion.go:217-220 creates fresh metadata with originalTest = nil when no additional-feature wrapper is above the test. In that case instrumentTestingParallel(t) returns false and the call falls through to the stdlib Parallel(). On a saturated maxParallel semaphore, the parent's wrapped state never releases the slot.
2. A regression between v2.7.1 and v2.8.1 reintroduced the deadlock under load.
Happy to provide additional CI artifacts (full stack dump with all ~100+ goroutines, pprof profile, etc.) if useful for triage. The current mitigation is to disable the action on the affected workflows.
Tracer Version(s)
dd-trace-go v2.8.1 (via datadog/test-visibility-github-action@v2.9.0), orchestrion v1.10.0
Go Version(s)
go1.25.10 linux/amd64 (GitHub-hosted ubuntu-latest runner)