Skip to content

fix(services): HealthChecker.Register immediately populates cache when already running#1976

Merged
jmank88 merged 2 commits intomainfrom
fix/health-checker-register-immediate-update
Apr 13, 2026
Merged

fix(services): HealthChecker.Register immediately populates cache when already running#1976
jmank88 merged 2 commits intomainfrom
fix/health-checker-register-immediate-update

Conversation

@Fletch153
Copy link
Copy Markdown
Contributor

Summary

  • Root cause: When Register() is called on an already-started HealthChecker (e.g. by the job spawner after starting a new job service), the new service's health state is absent from IsHealthy() / IsReady() for up to 15 seconds — the full polling interval — even though Ready() already returns nil.
  • Fix: Register() now calls IfStarted after adding the service, immediately populating c.ready and c.healthy for that service.
  • The servicesMu write lock is scoped to an inner closure and released before calling IfStarted to avoid a deadlock with Start(), which holds the StateMachine lock while calling update() (which acquires servicesMu.RLock).

Changes

  • pkg/services/health.goRegister() triggers an immediate single-service health check when the checker is already running
  • pkg/services/health_test.go — two new tests covering healthy and unhealthy services registered on a running checker

Test plan

  • go test ./pkg/services/... -run TestHealthChecker_Register — both new tests pass
  • go test ./pkg/services/... — no regressions

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 10, 2026

⚠️ API Diff Results - github.com/smartcontractkit/chainlink-common

⚠️ Breaking Changes (7)

package github (1)
  • com/smartcontractkit/chainlink-common/script — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialrelay.SecretsRequestParams (1)
  • OrgID — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialworkflow.(*WorkflowExecution) (1)
  • GetOrgId — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialworkflow.WorkflowExecution (1)
  • OrgId — 🗑️ Removed
pkg/workflows (3)
  • GenerateExecutionIDWithTriggerIndex — 🗑️ Removed

  • GetTriggerIndexFromReferenceID — 🗑️ Removed

  • GetTriggerReferenceID — 🗑️ Removed

✅ Compatible Changes (3)

pkg/workflows/dontime/pb (1)
  • Observations — ➕ Added
pkg/workflows/dontime/pb.(*Observation) (1)
  • GetPruneExecutions — ➕ Added
pkg/workflows/dontime/pb.Observation (1)
  • PruneExecutions — ➕ Added

📄 View full apidiff report

Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from f6b9ba9 to e76df7d Compare April 10, 2026 21:43
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from e76df7d to d2ac04e Compare April 10, 2026 21:52
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from d2ac04e to 07a72b4 Compare April 10, 2026 22:26
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
mchain0
mchain0 previously approved these changes Apr 13, 2026
Replace the chRefresh channel mechanism with direct map population in
Register(). Instead of signaling the run() goroutine to call update()
(which re-checks ALL services), Register() now directly populates
c.ready and c.healthy for just the newly registered service. This is
simpler and avoids unnecessary re-polling of all services.
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from 111cd0b to 575541a Compare April 13, 2026 13:17
Fix missing servicesMu.Unlock() in Unregister that would cause a
deadlock on every call. Revert unrelated whitespace alignment changes
to struct definition and New() constructor.
@jmank88 jmank88 added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit a931c3f Apr 13, 2026
30 of 32 checks passed
@jmank88 jmank88 deleted the fix/health-checker-register-immediate-update branch April 13, 2026 22:04
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 13, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 13, 2026
Update chainlink-common to v0.11.2-0.20260413215716-a931c3f2f162
which is the merge commit of smartcontractkit/chainlink-common#1976
on main, replacing the pre-merge branch reference.
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 14, 2026
Bump chainlink-common to v0.11.2-0.20260413215716-a931c3f2f162
(merge commit of smartcontractkit/chainlink-common#1976) to resolve
stale cache in HealthChecker.Register() (CORE-2386).
github-merge-queue Bot pushed a commit to smartcontractkit/chainlink that referenced this pull request Apr 17, 2026
…running (#22056)

* fix: bump chainlink-common to include HealthChecker Register fix

Bump chainlink-common to v0.11.2-0.20260413215716-a931c3f2f162
(merge commit of smartcontractkit/chainlink-common#1976) to resolve
stale cache in HealthChecker.Register() (CORE-2386).

* chore: retrigger CI

* ci: retrigger (runner shutdown on previous run)

* fix: bump chainlink-deployments-framework to v0.95.0 in load/go.sum
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants