[BUG]: orchestrion + civisibility deadlock on parent-t.Parallel + subtest-t.Parallel persists in v2.8.1 (despite #4554 in v2.7.1) #4765

@intel352

Tracer Version(s)

dd-trace-go v2.8.1 (via datadog/test-visibility-github-action@v2.9.0), orchestrion v1.10.0

Go Version(s)

go1.25.10 linux/amd64 (GitHub-hosted ubuntu-latest runner)

Bug Report

The CI Visibility wrapper deadlocks in waitParallel against the parent-t.Parallel + subtest-t.Parallel pattern, which is the same shape that PR #4554 ("fix orchestrion deadlock with parallel subtests") shipped a fix for in v2.7.1. We are on v2.8.1 — after the fix — but still observe the deadlock in CI.

Stack trace (Linux runner, go test -p 4 -short ./... over a large package). Every blocked goroutine has the same shape: a subtest's call to t.Parallel() parks in testing.(*testState).waitParallel and never returns. The frame that invokes the subtest body is the CI Visibility wrapper gotesting.instrumentTestingTFunc.func1.1:

=== RUN   TestJobsHealthReadinessCheck
=== PAUSE TestJobsHealthReadinessCheck
=== RUN   TestJobsHealthLivenessCheck
=== PAUSE TestJobsHealthLivenessCheck
...
TestJobsHealthLivenessCheck (9m59s)
TestJobsHealthReadinessCheck (9m59s)
FAIL    github.com/.../internal/modules    600.439s

goroutine 96 [chan receive, 9 minutes]:
testing.(*testState).waitParallel(0xc000eddd60)
        /opt/hostedtoolcache/go/1.25.10/x64/src/testing/testing.go:2116 +0xaa
testing.(*T).Parallel(0xc0013601c0)
        /opt/hostedtoolcache/go/1.25.10/x64/src/testing/testing.go:1709 +0x259
TestJobsHealthReadinessCheck.func2(0xc0013601c0)
        .../jobs_health_test.go:329 +0x27
github.com/DataDog/dd-trace-go/v2/internal/civisibility/integrations/gotesting.instrumentTestingTFunc.func1.1(0xc0013601c0)
        /home/runner/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.8.1/internal/civisibility/integrations/gotesting/instrumentation_orchestrion.go:335 +0x495
github.com/DataDog/dd-trace-go/v2/internal/civisibility/integrations/gotesting.instrumentTestingTFunc.func1(0xc0013601c0)
        /home/runner/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.8.1/internal/civisibility/integrations/gotesting/instrumentation_orchestrion.go:339 +0x64f
testing.tRunner(0xc0013601c0, 0xc00158a8d0)
        /opt/hostedtoolcache/go/1.25.10/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 206
        /opt/hostedtoolcache/go/1.25.10/x64/src/testing/testing.go:1997 +0x47d

The real failing test (parent + subtest both call t.Parallel()):

func TestJobsHealthReadinessCheck(t *testing.T) {
    t.Parallel()

    t.Run("returns ready when health checker succeeds", func(t *testing.T) {
        t.Parallel()
        // ... in-process httptest.NewRequest against a chi.Router ...
    })

    t.Run("returns not_ready when health checker fails", func(t *testing.T) {
        t.Parallel()  // ← deadlocks here
        // ...
    })
}

Observed conditions

The CI deadlock is deterministic on Linux ubuntu-latest for a full package (~50 tests, with multiple parents that call t.Parallel() and run subtests that also call t.Parallel()). The test binary hits the 10-minute go test timeout. It reproduces both with and without -race.

Reproduction Code

I tried to build a minimal reproducer. It does NOT deadlock on macOS / darwin/arm64 (go1.26.0, orchestrion v1.10.0, dd-trace-go v2.8.1) — passes consistently across 5+ runs. The deadlock seems to require either the Linux scheduler or a heavier per-package test count to saturate testState.maxParallel. Including the attempt here in case it helps narrow it down:

go.mod:

module example.com/orchestrion-parallel-deadlock-repro

go 1.26

require (
    github.com/DataDog/dd-trace-go/orchestrion/all/v2 v2.8.1
    github.com/DataDog/orchestrion v1.10.0
    github.com/stretchr/testify v1.10.0
)

parallel_test.go:

package repro

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/stretchr/testify/assert"
)

func runOneSubtest(t *testing.T) {
    t.Parallel()
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }))
    t.Cleanup(srv.Close)
    resp, err := http.Get(srv.URL)
    if err != nil {
        t.Fatalf("get: %v", err)
    }
    resp.Body.Close()
    assert.Equal(t, http.StatusOK, resp.StatusCode)
}

func TestParentA(t *testing.T) {
    t.Parallel()
    for _, name := range []string{"a1", "a2", "a3", "a4"} {
        t.Run(name, runOneSubtest)
    }
}

func TestParentB(t *testing.T) {
    t.Parallel()
    for _, name := range []string{"b1", "b2", "b3", "b4"} {
        t.Run(name, runOneSubtest)
    }
}

func TestParentC(t *testing.T) {
    t.Parallel()
    for _, name := range []string{"c1", "c2", "c3", "c4"} {
        t.Run(name, runOneSubtest)
    }
}

func TestParentD(t *testing.T) {
    t.Parallel()
    for _, name := range []string{"d1", "d2", "d3", "d4"} {
        t.Run(name, runOneSubtest)
    }
}

Run with:

DD_CIVISIBILITY_ENABLED=true \
DD_API_KEY=fake DD_SITE=datadoghq.com \
DD_CIVISIBILITY_AGENTLESS_ENABLED=true \
  go test -toolexec='orchestrion toolexec' \
  -v -timeout=60s -parallel=2 ./...

Question

Two possibilities I see:

  1. The fix from PR #4554 ("fix(internal/civisibility): fix orchestrion deadlock with parallel subtests") covers the trivial-case shape (clone path with meta.originalTest != nil), but instrumentation_orchestrion.go:217-220 creates fresh metadata with originalTest = nil when no additional-feature wrapper is above the test. In that case instrumentTestingParallel(t) returns false and the call falls through to stdlib Parallel(). On a saturated maxParallel semaphore, the parent's wrapped state never releases the slot.
  2. A regression between v2.7.1 and v2.8.1 reintroduced the deadlock under load.
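
To make possibility 1 concrete, here is a minimal standalone model of the slot accounting. It assumes the accounting works roughly like testState.waitParallel and release in the Go testing package; it is reconstructed from memory and simplified, not copied from the stdlib, and it is not the dd-trace-go wrapper. parallelState and waiterProceeds are hypothetical names used only for illustration. With maxParallel = 1, a second caller gets through only if the first holder gives its slot back; otherwise it parks on a channel receive, which is the state the blocked goroutines in the trace are in.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// parallelState is a simplified, hypothetical model of the parallel-slot
// bookkeeping. It is an assumption for illustration, not stdlib code.
type parallelState struct {
	mu            sync.Mutex
	maxParallel   int
	running       int
	numWaiting    int
	startParallel chan bool
}

// waitParallel takes a slot if one is free; otherwise it registers as a
// waiter and blocks on a channel receive until some holder releases.
func (s *parallelState) waitParallel() {
	s.mu.Lock()
	if s.running < s.maxParallel {
		s.running++
		s.mu.Unlock()
		return
	}
	s.numWaiting++
	s.mu.Unlock()
	<-s.startParallel
}

// release hands the slot to a waiter if there is one, else frees it.
func (s *parallelState) release() {
	s.mu.Lock()
	if s.numWaiting == 0 {
		s.running--
		s.mu.Unlock()
		return
	}
	s.numWaiting--
	s.mu.Unlock()
	s.startParallel <- true
}

// waiterProceeds reports whether a second caller gets past waitParallel
// within a short timeout, given whether the first holder releases.
func waiterProceeds(holderReleases bool) bool {
	s := &parallelState{maxParallel: 1, startParallel: make(chan bool)}
	s.waitParallel() // the holder takes the only slot

	done := make(chan struct{})
	go func() {
		s.waitParallel() // stands in for the subtest's t.Parallel()
		close(done)
	}()

	if holderReleases {
		s.release()
	}

	select {
	case <-done:
		return true
	case <-time.After(500 * time.Millisecond):
		// The waiter goroutine is intentionally leaked here; it stays
		// parked in "chan receive", like the goroutines in the dump.
		return false
	}
}

func main() {
	fmt.Println("holder releases slot:      waiter proceeds =", waiterProceeds(true))
	fmt.Println("holder never releases slot: waiter proceeds =", waiterProceeds(false))
}
```

In this model the first call reports true and the second reports false. It only illustrates why a slot that is never returned leaves every later t.Parallel() caller parked; whether the wrapper actually holds a slot in the originalTest = nil path is the open question above.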

Happy to provide additional CI artifacts (full stack dump with all ~100+ goroutines, pprof profile, etc.) if useful for triage. The current mitigation is to disable the action on the affected workflows.
