Skip reactivation signals for current/ramping/draining versions #9778
Conversation
Reactivation signals were being sent to version workflows for all pinned versions, including those that are CURRENT, RAMPING, or DRAINING. While the version workflow handler already ignored these signals, they still created unnecessary RPCs and added events to the workflow history. This change piggybacks on the existing CheckTaskQueueVersionMembership matching RPC to also return whether the version is drained or inactive. The matching service already loads the full deployment data but previously discarded the version status. The new VersionReactivationEligibility message in the response uses a proto message wrapper to provide nil semantics for backwards compatibility during rolling deploys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ld format check

- Rename checkTaskQueueVersionMembership → checkVersionMembershipAndReactivationEligibility
- Rename validatePinnedVersionInTaskQueue → validateVersionAndGetReactivationEligibility
- Rename ValidateVersioningOverride → ValidateVersioningOverrideAndGetReactivationEligibility
- Fix cache comment: "whether the task queue exists in the version"
- Add old format (deprecated versions list) check to IsVersionDrainedOrInactive
- Improve proto comment explaining why only drained/inactive versions get signals
- Improve validatePostResetOperationInputs comment about per-operation eligibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip deleted versions (return nil) in IsVersionDrainedOrInactive, consistent with how HasDeploymentVersion treats deleted versions as absent. This prevents sending reactivation signals to deleted version workflows.
- Add dedicated TestIsVersionDrainedOrInactive covering: nil deployments, not found, DRAINED, INACTIVE, CURRENT, RAMPING, deleted (new format), and old format (DRAINED, CURRENT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```go
isDrainedOrInactive *bool
revisionNumber      int64
```
one thing that should come up in a follow-up PR, by me: right now, in the starter struct, we seem to be accepting the following fields:

```go
versionCache            worker_versioning.VersionMembershipAndReactivationStatusCache
reactivationSignalCache worker_versioning.ReactivationSignalCache
reactivationSignaler    api.VersionReactivationSignalerFn
isDrainedOrInactive     *bool
revisionNumber          int64
```

Now, these are all related and can be collapsed into one. I have no problem doing this right now, in this PR, but I want to keep the refactoring diff separate from the implementation of the feature while keeping reviewing easy, so I have chosen to postpone this work to the next one.
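The collapse the reviewer proposes might look something like the sketch below. This is a hypothetical grouping, not the real Temporal code: `reactivationDeps` and `starter` are illustrative names, and the first three field types are stand-ins (`any`) for the real interfaces quoted above.

```go
package main

import "fmt"

// reactivationDeps is a hypothetical grouping of the five related starter
// fields; the interface-typed fields are stand-ins for the real types.
type reactivationDeps struct {
	versionCache            any   // worker_versioning.VersionMembershipAndReactivationStatusCache
	reactivationSignalCache any   // worker_versioning.ReactivationSignalCache
	reactivationSignaler    any   // api.VersionReactivationSignalerFn
	isDrainedOrInactive     *bool // nil until matching reports eligibility
	revisionNumber          int64
}

// starter then carries one cohesive dependency instead of five loose fields.
type starter struct {
	reactivation reactivationDeps
}

func main() {
	drained := true
	s := starter{reactivation: reactivationDeps{
		isDrainedOrInactive: &drained,
		revisionNumber:      3,
	}}
	fmt.Println(*s.reactivation.isDrainedOrInactive, s.reactivation.revisionNumber)
}
```

Grouping them also makes the nil-vs-set semantics of `isDrainedOrInactive` local to one type rather than spread across the starter.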
…dy tested with unit-tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```go
// The following tests verify that the version_reactivation_signal_cache works as intended.
func (s *DeploymentVersionSuite) TestReactivationSignalCache_Deduplication_StartWorkflow() {
```
I have tested this with unit tests (as one can see above), and we also have other tests in this file exercising the cache path (check out TestUpdateWorkflowExecutionOptions_SetPinned_CacheMissAndHits and more)
…ignals

# Conflicts:
#	tests/worker_deployment_version_test.go
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 2934433.
```go
	`Maximum number of entries in the version membership cache.`,
)

VersionReactivationSignalCacheTTL = NewGlobalDurationSetting(
```
note: I think it's fine to remove this from the dynamic config, given that the flag controlling the signal-sending behaviour (history.enableVersionReactivationSignals) has been turned off globally
```go
)

type noopVersionMembershipCache struct{}
type noopVersionCache struct{}
```
just naming changes over here, nothing to see heh
…rics Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emporal into fixing-reactivation-signals

Summary
- Extend the `CheckTaskQueueVersionMembership` response with two new fields: `is_version_active_or_draining` (bool) and `revision_number` (int64). Matching populates both from its deployment data.
- Derive the reactivation signal's RequestId deterministically from `revision_number`.
- Replace `ReactivationSignalCache` with a per-pod revision-based dedup LRU on the worker-deployment client: each entry records the highest revision this pod has successfully signaled for a given version, so older or equal signals are skipped.

What changed on the wire (matching → history)
`CheckTaskQueueVersionMembershipResponse` now has two new flat fields (no wrapper message): `is_version_active_or_draining` and `revision_number`. Matching's `CheckTaskQueueVersionMembership` fills both via the helper `worker_versioning.IsVersionActiveOrDraining(deploymentData, dep, build) (bool, int64)`.

Naming choice: we picked `is_version_active_or_draining` (negative polarity) rather than something like `supports_reactivation` so the proto zero value (false) maps to the safe default ("send the signal"). Old matching binaries and runtime "version not found" both produce the zero value, and history correctly falls through.

Where `revision_number` flows

- `ValidateVersioningOverrideAndGetReactivationEligibility` returns `(isVersionActiveOrDraining bool, revisionNumber int64, err)`.
- `VersionMembershipAndReactivationStatusCache` stores both.
- `VersionReactivationSignalerFn`, `ReactivateVersionWorkflowIfPinned`, and all five call sites (startworkflow, signalwithstartworkflow, updateworkflowoptions, resetworkflow, multioperation) carry `revisionNumber int64`.
- `resetworkflow.validatePostResetOperationInputs` returns parallel slices `([]bool, []int64, error)` for per-operation inputs.
- `ClientImpl.SignalVersionReactivation` composes `requestID = uuid.NewSHA1(uuid.NameSpaceOID, []byte("reactivation-signal:" + revisionNumber)).String()`, a deterministic UUID v5 derived from the revision alone. Cassandra's `signal_requested set<uuid>` column requires UUID-formatted RequestIds.

Why revision-based dedup

History is sharded on `(namespaceID, workflowID)`. N concurrent `StartWorkflow` calls pinned to the same drained version fan out across potentially every history pod in the fleet. Before this PR, each pod independently fired a reactivation signal at the version workflow, producing up to N `WorkflowExecutionSignaled` events, directly at odds with the version workflow's design (it intentionally keeps history minimal and CaNs aggressively, see `version_workflow.go:68-74`).

Per-pod caches alone can't fix this because they don't coordinate. What we need is a cluster-wide-deterministic dedup key so all pods converge on the same value for the same reactivation cycle. The version's `revision_number`, incremented in `syncTaskQueuesAsync` on every status change, is exactly that signal. Every pod reads the same revision from matching, every pod composes the same UUID RequestId, and Temporal's built-in `mutableState.pendingSignalRequestedIDs` dedup (see `service/history/api/signalworkflow/api.go:40`) collapses concurrent signals into exactly one event on the version workflow.

The per-pod map is a local optimization on top of that: it prevents a single pod from re-sending a same-or-older-revision signal once it has successfully sent one, cutting RPC volume.
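The scheme can be sketched with the standard library alone. This is an illustrative reconstruction, not the real Temporal code: `uuidV5` mirrors what `uuid.NewSHA1(uuid.NameSpaceOID, …)` from github.com/google/uuid computes (a version-5, SHA-1, name-based UUID), and `dedupCache` is a simplified stand-in for the per-pod highest-revision map (the real one is a size-capped LRU, and the names `versionKey`, `shouldSignal`, `markSignaled` are hypothetical).

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sync"
)

// RFC 4122 OID namespace, the same constant as uuid.NameSpaceOID
// (6ba7b812-9dad-11d1-80b4-00c04fd430c8).
var namespaceOID = [16]byte{
	0x6b, 0xa7, 0xb8, 0x12, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 computes a name-based UUID (version 5, SHA-1), equivalent to
// uuid.NewSHA1(ns, name).String(): hash namespace+name, then stamp the
// version and variant bits.
func uuidV5(ns [16]byte, name []byte) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write(name)
	sum := h.Sum(nil)
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// reactivationRequestID derives the signal RequestId from the revision alone,
// so every history pod converges on the same UUID for the same cycle.
func reactivationRequestID(revision int64) string {
	return uuidV5(namespaceOID, []byte(fmt.Sprintf("reactivation-signal:%d", revision)))
}

// versionKey stands in for reactivationVersionKey.
type versionKey struct{ namespaceID, deploymentName, buildID string }

// dedupCache is a simplified per-pod highest-revision map.
type dedupCache struct {
	mu         sync.Mutex
	highestRev map[versionKey]int64
}

// shouldSignal reports whether this pod still needs to signal for rev.
func (c *dedupCache) shouldSignal(k versionKey, rev int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	prev, ok := c.highestRev[k]
	return !ok || rev > prev // same-or-older revision already signaled
}

// markSignaled records rev only after a successful RPC, so a failed
// signal leaves the cache untouched and a retry can fire again.
func (c *dedupCache) markSignaled(k versionKey, rev int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if rev > c.highestRev[k] {
		c.highestRev[k] = rev
	}
}

func main() {
	c := &dedupCache{highestRev: map[versionKey]int64{}}
	k := versionKey{"ns", "my-deployment", "build-1"}
	fmt.Println(c.shouldSignal(k, 3)) // first signal at rev 3 fires
	c.markSignaled(k, 3)
	fmt.Println(c.shouldSignal(k, 3)) // same revision deduped
	fmt.Println(c.shouldSignal(k, 2)) // older revision skipped
	fmt.Println(c.shouldSignal(k, 4)) // newer revision fires
	fmt.Println(reactivationRequestID(4) == reactivationRequestID(4))
}
```

Splitting the check (`shouldSignal`) from the record (`markSignaled`) matches the "signal-failure allows retry" behavior in the test plan: only a successful send advances the recorded revision.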
How the new caches look

1. `VersionMembershipAndReactivationStatusCache` (read-side, per-pod)

Caches matching's `CheckTaskQueueVersionMembership` response so repeated pinned-override validations on the same task queue don't re-hit matching.
- Key: `(namespaceID, taskQueue, taskQueueType, deploymentName, buildID)`
- Value: `(isMember bool, isVersionActiveOrDraining bool, revisionNumber int64)`
- TTL: `VersionMembershipCacheTTL` (1s default; 5s in functional tests)

2. `highestRevSignaledToVersionWf` (write-side dedup, per-pod)

A field on `ClientImpl` in `service/worker/workerdeployment/client.go`. For each target version workflow, it stores the highest revision this pod has successfully signaled. Subsequent calls at the same or a lower revision skip the RPC.
- Key: `reactivationVersionKey{namespaceID, deploymentName, buildID}`
- Value: `int64` (highest revision successfully signaled)
- Size cap: `VersionReactivationSignalCacheMaxSize`

The previous TTL-based `ReactivationSignalCache` module (in `common/worker_versioning/`) has been deleted along with its provider and the `VersionReactivationSignalCacheTTL` config.

Backwards/forwards compatibility
- Old matching binaries don't set `is_version_active_or_draining` or `revision_number`; both default to proto zero values.
- `false` on the active bool → history falls through → signal fires (safe default).
- `revisionNumber = 0` flows through as-is.
- Concurrent zero-revision signals still collapse into a single `WorkflowExecutionSignaled` event via the server-side RequestId dedup.
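The fallback logic above reduces to a zero-value check. A minimal sketch, assuming a hypothetical `membershipResp` struct in place of the generated proto type (the real field accessors differ):

```go
package main

import "fmt"

// membershipResp models the two new response fields; a pre-upgrade matching
// binary leaves both at their proto zero values.
type membershipResp struct {
	IsVersionActiveOrDraining bool
	RevisionNumber            int64
}

// shouldSendReactivationSignal: false means either "version is drained or
// inactive" or "old server that never set the field", and both cases fall
// through to the safe default of sending the signal.
func shouldSendReactivationSignal(r membershipResp) bool {
	return !r.IsVersionActiveOrDraining
}

func main() {
	oldServer := membershipResp{} // zero values from a pre-upgrade matching pod
	fmt.Println(shouldSendReactivationSignal(oldServer), oldServer.RevisionNumber)
}
```

This is why the negative-polarity field name matters: the zero value and the "needs a signal" case coincide, so no version-gating of the RPC is required.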
- Unit tests for `IsVersionActiveOrDraining` covering all status cases (CURRENT, RAMPING, DRAINING, DRAINED, INACTIVE, UNSPECIFIED), new vs. old format, deleted and not-found versions.
- Unit tests for `ValidateVersioningOverrideAndGetReactivationEligibility` (cache hit/miss, RPC with/without eligibility, Unimplemented fallback).
- Unit tests for `ClientImpl.SignalVersionReactivation`: same-rev dedups, newer-rev fires, older-rev skipped, different version isolated, signal failure allows retry.
- Functional tests:
  - `TestStartWorkflowExecution_ReactivateVersionOnPinned`
  - `TestStartWorkflowExecution_ReactivateVersionOnPinned_WithConflictPolicy`
  - `TestSignalWithStartWorkflowExecution_ReactivateVersionOnPinned`
  - `TestUpdateWorkflowExecutionOptions_ReactivateVersionOnPinned`
  - `TestResetWorkflowExecution_ReactivateVersionOnPinned`
TestReactivationSignalCache_Deduplication_*functional tests from an earlier iteration were deleted — their coverage moved to unit tests.)Note
Medium Risk
Changes the reactivation-signal behavior and dedup strategy across history/matching/worker-deployment; incorrect eligibility or revision handling could lead to missed or excess signals and delayed reactivation.
Overview
Reactivation signals are now eligibility-aware and revision-deduplicated.
`CheckTaskQueueVersionMembershipResponse` is extended to return `should_skip_reactivation` (true for CURRENT/RAMPING/DRAINING) and `revision_number`, and matching populates these from per-task-queue deployment data.
ValidateVersioningOverrideAndGetReactivationEligibility) and skips sending reactivation signals when matching reports the version doesn’t need it; when signaling, the worker-deployment client sends a deterministic UUIDv5RequestIdderived fromrevision_numberand adds a per-pod LRU to avoid re-sending same-or-older revisions. The old TTL-basedReactivationSignalCacheand its dynamic config/metrics wiring are removed, and tests are updated/added to cover the new skip + dedup behavior.Reviewed by Cursor Bugbot for commit 09d63f5. Bugbot is set up for automated code reviews on this repo. Configure here.