Skip to content

Skip reactivation signals for current/ramping/draining versions#9778

Merged
Shivs11 merged 26 commits intomainfrom
fixing-reactivation-signals
Apr 23, 2026
Merged

Skip reactivation signals for current/ramping/draining versions#9778
Shivs11 merged 26 commits intomainfrom
fixing-reactivation-signals

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented Apr 2, 2026

Summary

  • Extends CheckTaskQueueVersionMembership response with two new fields: is_version_active_or_draining (bool) and revision_number (int64). Matching populates both from its deployment data.
  • Reactivation signals are skipped when matching reports the target version as CURRENT/RAMPING/DRAINING; otherwise they are sent with a deterministic UUID v5 RequestId derived from revision_number.
  • Replaces the old TTL-based ReactivationSignalCache with a per-pod revision-based dedup LRU on the worker-deployment client: each entry records the highest revision this pod has successfully signaled for a given version, so older or equal signals are skipped.

What changed on the wire (matching → history)

CheckTaskQueueVersionMembershipResponse now has two new flat fields (no wrapper message):

bool is_version_active_or_draining = 2;  // true when status is CURRENT/RAMPING/DRAINING
int64 revision_number = 3;               // from WorkerDeploymentVersionData.revision_number; 0 if unknown / legacy

Matching's CheckTaskQueueVersionMembership fills both via the helper worker_versioning.IsVersionActiveOrDraining(deploymentData, dep, build) (bool, int64).

Naming choice — we picked is_version_active_or_draining (negative polarity) rather than something like supports_reactivation so the proto zero value (false) maps to the safe default ("send the signal"). Old matching binaries and runtime "version not found" both produce the zero value, and history correctly falls through.

Where revision_number flows

  • Matching: populates the response field from the version's tracked revision.
  • History-side helper/caches: ValidateVersioningOverrideAndGetReactivationEligibility returns (isVersionActiveOrDraining bool, revisionNumber int64, err). VersionMembershipAndReactivationStatusCache stores both.
  • History signaler plumbing: VersionReactivationSignalerFn, ReactivateVersionWorkflowIfPinned, and all five call sites (startworkflow, signalwithstartworkflow, updateworkflowoptions, resetworkflow, multioperation) carry revisionNumber int64. resetworkflow.validatePostResetOperationInputs returns parallel slices ([]bool, []int64, error) for per-operation inputs.
  • Signal RequestId: ClientImpl.SignalVersionReactivation composes requestID = uuid.NewSHA1(uuid.NameSpaceOID, []byte("reactivation-signal:" + revisionNumber)).String() — a deterministic UUID v5 derived from the revision alone. Cassandra's signal_requested set<uuid> column requires UUID-formatted RequestIds.

Why revision-based dedup

History is sharded on (namespaceID, workflowID). N concurrent StartWorkflow calls pinned to the same drained version fan out across potentially every history pod in the fleet. Before this PR each pod independently fired a reactivation signal at the version workflow, producing up to N WorkflowExecutionSignaled events — directly at odds with the version workflow's design (it intentionally keeps history minimal and CaNs aggressively, see version_workflow.go:68-74).

Per-pod caches alone can't fix this because they don't coordinate. What we need is a cluster-wide-deterministic dedup key so all pods converge on the same value for the same reactivation cycle. The version's revision_number — incremented in syncTaskQueuesAsync on every status change — is exactly that signal. Every pod reads the same revision from matching, every pod composes the same UUID RequestId, and Temporal's built-in mutableState.pendingSignalRequestedIDs dedup (see service/history/api/signalworkflow/api.go:40) collapses concurrent signals into exactly one event on the version workflow.

The per-pod map is a local optimization on top of that: it prevents a single pod from re-sending the same-or-older-revision signal once it has successfully sent one, cutting RPC volume.

How the new caches look

1. VersionMembershipAndReactivationStatusCache (read-side, per-pod)

Caches matching's CheckTaskQueueVersionMembership response so repeated pinned-override validations on the same task queue don't re-hit matching.

  • Key: (namespaceID, taskQueue, taskQueueType, deploymentName, buildID)
  • Value: (isMember bool, isVersionActiveOrDraining bool, revisionNumber int64)
  • Eviction: VersionMembershipCacheTTL (1s default; 5s in functional tests).

2. highestRevSignaledToVersionWf (write-side dedup, per-pod)

A field on ClientImpl in service/worker/workerdeployment/client.go. For each target version workflow, stores the highest revision this pod has successfully signaled. Subsequent calls at the same-or-lower revision skip the RPC.

  • Key: reactivationVersionKey{namespaceID, deploymentName, buildID}
  • Value: int64 (highest revision successfully signaled)
  • Eviction: LRU, bounded by VersionReactivationSignalCacheMaxSize.

The previous TTL-based ReactivationSignalCache module (in common/worker_versioning/) has been deleted along with its provider and VersionReactivationSignalCacheTTL config.

Backwards/forwards compatibility

  • Old matching → new history: old binaries don't set is_version_active_or_draining or revision_number; both default to proto zero values. false on the active bool → history falls through → signal fires (safe default). revisionNumber = 0 flows through as-is.
  • New matching → old history: new fields on the response are ignored by old history → identical to pre-PR behavior.
  • New matching → new history: signal fires only when the version is not active/draining; cross-pod fires converge on one UUID RequestId and fold into one WorkflowExecutionSignaled event.

Test plan

  • Unit tests for IsVersionActiveOrDraining covering all status cases (CURRENT, RAMPING, DRAINING, DRAINED, INACTIVE, UNSPECIFIED), new vs. old format, deleted and not-found versions.
  • Unit tests for ValidateVersioningOverrideAndGetReactivationEligibility (cache hit/miss, RPC with/without eligibility, Unimplemented fallback).
  • Unit tests for the per-pod dedup on ClientImpl.SignalVersionReactivation: same-rev dedups, newer-rev fires, older-rev skipped, different version isolated, signal-failure allows retry.
  • Unit test for RequestId format (UUID v5, deterministic across calls with the same revision).
  • Functional tests (all pass on SQLite and cass-es):
    • TestStartWorkflowExecution_ReactivateVersionOnPinned
    • TestStartWorkflowExecution_ReactivateVersionOnPinned_WithConflictPolicy
    • TestSignalWithStartWorkflowExecution_ReactivateVersionOnPinned
    • TestUpdateWorkflowExecutionOptions_ReactivateVersionOnPinned
    • TestResetWorkflowExecution_ReactivateVersionOnPinned

(The four TestReactivationSignalCache_Deduplication_* functional tests from an earlier iteration were deleted — their coverage moved to unit tests.)


Note

Medium Risk
Changes the reactivation-signal behavior and dedup strategy across history/matching/worker-deployment; incorrect eligibility or revision handling could lead to missed or excess signals and delayed reactivation.

Overview
Reactivation signals are now eligibility-aware and revision-deduplicated. CheckTaskQueueVersionMembershipResponse is extended to return should_skip_reactivation (true for CURRENT/RAMPING/DRAINING) and revision_number, and matching populates these from per-task-queue deployment data.

History now threads these values through pinned-version validation (ValidateVersioningOverrideAndGetReactivationEligibility) and skips sending reactivation signals when matching reports the version doesn’t need it; when signaling, the worker-deployment client sends a deterministic UUIDv5 RequestId derived from revision_number and adds a per-pod LRU to avoid re-sending same-or-older revisions. The old TTL-based ReactivationSignalCache and its dynamic config/metrics wiring are removed, and tests are updated/added to cover the new skip + dedup behavior.

Reviewed by Cursor Bugbot for commit 09d63f5. Bugbot is set up for automated code reviews on this repo. Configure here.

Reactivation signals were being sent to version workflows for all pinned
versions, including those that are CURRENT, RAMPING, or DRAINING. While
the version workflow handler already ignored these signals, they still
created unnecessary RPCs and added events to the workflow history.

This change piggybacks on the existing CheckTaskQueueVersionMembership
matching RPC to also return whether the version is drained or inactive.
The matching service already loads the full deployment data but previously
discarded the version status. The new VersionReactivationEligibility
message in the response uses a proto message wrapper to provide nil
semantics for backwards compatibility during rolling deploys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Shivs11 Shivs11 changed the title Skip reactivation signals for non-drained/inactive versions Skip reactivation signals for current/ramping/draining versions Apr 2, 2026
Shivs11 and others added 2 commits April 1, 2026 21:17
…ld format check

- Rename checkTaskQueueVersionMembership → checkVersionMembershipAndReactivationEligibility
- Rename validatePinnedVersionInTaskQueue → validateVersionAndGetReactivationEligibility
- Rename ValidateVersioningOverride → ValidateVersioningOverrideAndGetReactivationEligibility
- Fix cache comment: "whether the task queue exists in the version"
- Add old format (deprecated versions list) check to IsVersionDrainedOrInactive
- Improve proto comment explaining why only drained/inactive versions get signals
- Improve validatePostResetOperationInputs comment about per-operation eligibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Shivs11 Shivs11 marked this pull request as ready for review April 2, 2026 01:21
@Shivs11 Shivs11 requested review from a team as code owners April 2, 2026 01:21
Shivs11 and others added 2 commits April 2, 2026 21:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip deleted versions (return nil) in IsVersionDrainedOrInactive,
  consistent with how HasDeploymentVersion treats deleted versions as absent.
  This prevents sending reactivation signals to deleted version workflows.
- Add dedicated TestIsVersionDrainedOrInactive covering: nil deployments,
  not found, DRAINED, INACTIVE, CURRENT, RAMPING, deleted (new format),
  and old format (DRAINED, CURRENT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread common/worker_versioning/version_membership_cache.go Outdated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread service/worker/workerdeployment/client.go Outdated
Comment thread common/worker_versioning/version_reactivation_signal_cache_test.go Outdated
Comment thread tests/worker_deployment_version_test.go Outdated
Comment thread tests/worker_deployment_version_test.go Outdated
Comment thread tests/worker_deployment_version_test.go Outdated
Comment on lines +64 to +65
isDrainedOrInactive *bool
revisionNumber int64
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that should come up in a follow up PR, by me: right now, in the starter struct, we seem to be accepting the following fields:

versionCache               worker_versioning.VersionMembershipAndReactivationStatusCache
reactivationSignalCache    worker_versioning.ReactivationSignalCache
reactivationSignaler       api.VersionReactivationSignalerFn
isDrainedOrInactive        *bool
revisionNumber             int64

Now, these are all related and can be collapsed into one. I have no problem doing this right now, in this PR too, but I want to keep the refactorings diff from the implementation of the feature while making reviewing easy, so have chosen to postpone this work on the next one

…dy tested with unit-tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
}

// The following tests verify that the version_reactivation_signal_cache works as intended.
func (s *DeploymentVersionSuite) TestReactivationSignalCache_Deduplication_StartWorkflow() {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have tested this with unit-tests (as one can see above) and also, we have other tests in this file that are testing the cache path (check out TestUpdateWorkflowExecutionOptions_SetPinned_CacheMissAndHits and more)

Shivs11 and others added 2 commits April 20, 2026 17:54
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread common/worker_versioning/version_membership_cache.go Outdated
Comment thread common/worker_versioning/worker_versioning.go Outdated
Comment thread service/worker/workerdeployment/client.go Outdated
Comment thread common/worker_versioning/version_reactivation_signal_cache.go Outdated
Shivs11 and others added 2 commits April 21, 2026 11:12
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread common/worker_versioning/worker_versioning_test.go Outdated
Comment thread service/worker/workerdeployment/fx.go Outdated
Shivs11 and others added 3 commits April 21, 2026 13:49
…ignals

# Conflicts:
#	tests/worker_deployment_version_test.go
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 2934433. Configure here.

Comment thread service/worker/workerdeployment/client.go
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Maximum number of entries in the version membership cache.`,
)

VersionReactivationSignalCacheTTL = NewGlobalDurationSetting(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: i think it's fine to remove this from the dynamic config given that the flag controlling the sending of the signal behaviour has been turned to false globally (history.enableVersionReactivationSignals)

)

type noopVersionMembershipCache struct{}
type noopVersionCache struct{}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just naming changes over here, nothing to see heh

Comment thread service/worker/workerdeployment/client.go
…rics

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread service/worker/workerdeployment/client.go
Comment thread proto/internal/temporal/server/api/matchingservice/v1/request_response.proto Outdated
Shivs11 and others added 4 commits April 22, 2026 17:55
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Shivs11 Shivs11 enabled auto-merge (squash) April 22, 2026 22:14
@Shivs11 Shivs11 merged commit 562d26f into main Apr 23, 2026
48 checks passed
@Shivs11 Shivs11 deleted the fixing-reactivation-signals branch April 23, 2026 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants