Skip to content

fix: eliminate data race in SleeperTask context lifecycle#1984

Closed
Fletch153 wants to merge 2 commits intomainfrom
fix/sleeper-task-context-race
Closed

fix: eliminate data race in SleeperTask context lifecycle#1984
Fletch153 wants to merge 2 commits intomainfrom
fix/sleeper-task-context-race

Conversation

@Fletch153
Copy link
Copy Markdown
Contributor

@Fletch153 Fletch153 commented Apr 16, 2026

Summary

Defense-in-depth against testify/mock's unsynchronized reflect reads of context arguments. Replaces SleeperTask's context implementation with a read-only stopChanCtx that has no mutable internal state.

Background

testify's mock library has a fundamentally broken code path: when a mock method is called without a matching expectation, callString() formats all arguments via fmt.Sprintf("%#v", args...), which walks them with reflect. This is completely unsynchronized — any concurrent mutation of an argument (like context.cancel() writing to cancelCtx internals) causes a data race. An upstream fix was attempted but rejected.

What This Changes

SleeperTask.workerLoop() previously created its context via StopChan.NewCtx(), which internally calls context.WithCancel() + spawns a detached goroutine that calls cancel() when the stop channel closes. That cancel() writes to cancelCtx internal fields (sync/atomic.AddInt32, mutex state, etc.) — racing with any concurrent fmt.Sprintf("%#v", ctx) from testify's error path.

The new stopChanCtx is a read-only context.Context implementation backed directly by the stop channel:

  • Done() returns the stop channel — closing it signals cancellation
  • Err() checks if the channel is closed
  • No cancel() function, no detached goroutine, no mutable internal state
  • Child contexts (context.WithCancel, context.WithTimeout) propagate correctly via Go's standard Done() mechanism

Since there's nothing to mutate, testify can fmt.Sprintf("%#v") this context all day without racing.

Test Evidence

TestSleeperTask_NoConcurrentContextRace reproduces the exact scenario: Work() goroutines read the context via fmt.Sprintf while Stop() concurrently closes the stop channel.

Old code (NewCtx/CtxCancel): WARNING: DATA RACEreflect.typedmemmove vs sync/atomic.AddInt32
New code (stopChanCtx): 10 runs × all tests, zero races

Related

Replace StopChan.NewCtx() (which spawns a detached goroutine that calls
cancel() when the stop channel closes) with a simple context backed
directly by the stop channel. This eliminates a data race between the
async cancel() writing to context internals and concurrent reflect-based
reads (e.g. testify mock argument formatting via fmt.Sprintf).

The race manifests in the EVM balance monitor's Worker, where SleeperTask
calls Work(ctx) which spawns goroutines that call BalanceAt(ctx, ...) on
a mock client. When Stop() closes chStop, the CtxCancel goroutine fires
cancel() concurrently with testify's callString() reflecting into the
context via fmt.Sprintf("%#v", ctx).

The new stopChanCtx is a read-only context implementation with no mutable
state — closing the channel is the cancellation signal, and no cancel()
function ever modifies internal fields. Child contexts (WithCancel,
WithTimeout) work correctly as the Done() channel propagates.
@Fletch153 Fletch153 requested a review from a team as a code owner April 16, 2026 13:54
@github-actions
Copy link
Copy Markdown

👋 Fletch153, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 16, 2026

📊 API Diff Results

No changes detected for module github.com/smartcontractkit/chainlink-common

View full report

Adds TestSleeperTask_NoConcurrentContextRace which reproduces the exact
race scenario: Work() spawns goroutines that read the context via
fmt.Sprintf (simulating testify's mock.callString), while Stop()
concurrently triggers context cancellation.

With the old NewCtx()/CtxCancel pattern, this test reliably triggers
WARNING: DATA RACE. With the stopChanCtx fix, it passes cleanly.
Comment thread pkg/utils/sleeper_task.go
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A custom context.Context make me nervous. I do see some in the wild now, though most contain an embedded context.Context that they are extending and sometimes don't even override any methods 🤔

Let's wait and see how the other fixes help before trying this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants