Commit 562d26f
Skip reactivation signals for current/ramping/draining versions (#9778)
## Summary
- Extends `CheckTaskQueueVersionMembership` response with two new
fields: `is_version_active_or_draining` (bool) and `revision_number`
(int64). Matching populates both from its deployment data.
- Reactivation signals are **skipped** when matching reports the target
version as CURRENT/RAMPING/DRAINING; otherwise they are sent with a
**deterministic UUID v5 RequestId** derived from `revision_number`.
- Replaces the old TTL-based `ReactivationSignalCache` with a per-pod
**revision-based dedup LRU** on the worker-deployment client: each entry
records the highest revision this pod has successfully signaled for a
given version, so older or equal signals are skipped.
## What changed on the wire (matching → history)
`CheckTaskQueueVersionMembershipResponse` now has two new flat fields
(no wrapper message):
```proto
bool is_version_active_or_draining = 2; // true when status is CURRENT/RAMPING/DRAINING
int64 revision_number = 3; // from WorkerDeploymentVersionData.revision_number; 0 if unknown / legacy
```
Matching's `CheckTaskQueueVersionMembership` fills both via the helper
`worker_versioning.IsVersionActiveOrDraining(deploymentData, dep, build)
(bool, int64)`.
Naming choice — we picked `is_version_active_or_draining` (negative
polarity) rather than something like `supports_reactivation` so the
proto zero value (`false`) maps to the safe default ("send the signal").
Old matching binaries and runtime "version not found" both produce the
zero value, and history correctly falls through.
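The status check described above can be sketched as follows. This is a minimal illustration, not the helper's actual code: `VersionStatus`, `versionData`, and the single-argument signature are hypothetical simplifications of the real `(deploymentData, dep, build)` inputs.

```go
package main

// VersionStatus is a hypothetical stand-in for the version status enum.
type VersionStatus int

const (
	StatusUnspecified VersionStatus = iota
	StatusInactive
	StatusCurrent
	StatusRamping
	StatusDraining
	StatusDrained
)

// versionData is a hypothetical, simplified view of the per-version
// deployment data that matching tracks.
type versionData struct {
	Status         VersionStatus
	RevisionNumber int64
}

// isVersionActiveOrDraining reports whether the version is CURRENT, RAMPING,
// or DRAINING, together with its revision number. A nil input (version not
// found) yields the zero values (false, 0), which the caller treats as
// "send the signal".
func isVersionActiveOrDraining(v *versionData) (bool, int64) {
	if v == nil {
		return false, 0
	}
	switch v.Status {
	case StatusCurrent, StatusRamping, StatusDraining:
		return true, v.RevisionNumber
	default:
		return false, v.RevisionNumber
	}
}
```

Because the "not found" path returns the same values an old matching binary would produce on the wire, history needs no special case for either situation.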
## Where `revision_number` flows
- **Matching**: populates the response field from the version's tracked
revision.
- **History-side helper/caches**:
`ValidateVersioningOverrideAndGetReactivationEligibility` returns
`(isVersionActiveOrDraining bool, revisionNumber int64, err)`.
`VersionMembershipAndReactivationStatusCache` stores both.
- **History signaler plumbing**: `VersionReactivationSignalerFn`,
`ReactivateVersionWorkflowIfPinned`, and all five call sites
(`startworkflow`, `signalwithstartworkflow`, `updateworkflowoptions`,
`resetworkflow`, `multioperation`) carry `revisionNumber int64`.
`resetworkflow.validatePostResetOperationInputs` returns parallel slices
`([]bool, []int64, error)` for per-operation inputs.
- **Signal RequestId**: `ClientImpl.SignalVersionReactivation` composes
`requestID = uuid.NewSHA1(uuid.NameSpaceOID,
[]byte("reactivation-signal:" + revisionNumber)).String()` — a
deterministic UUID v5 derived from the revision alone. Cassandra's
`signal_requested set<uuid>` column requires UUID-formatted RequestIds.
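To make the RequestId derivation concrete, here is a standard-library sketch of a name-based (SHA-1, version 5) UUID, which is what `uuid.NewSHA1` computes. Two things are assumptions for illustration: the function names are hypothetical, and the revision is rendered in decimal (the description above does not say how the integer is formatted).

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"strconv"
)

// namespaceOID is the RFC 4122 OID namespace
// (6ba7b812-9dad-11d1-80b4-00c04fd430c8), the value of uuid.NameSpaceOID.
var namespaceOID = [16]byte{
	0x6b, 0xa7, 0xb8, 0x12, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 computes a name-based SHA-1 UUID and formats it canonically.
func uuidV5(namespace [16]byte, name []byte) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write(name)
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant

	var buf [36]byte
	hex.Encode(buf[0:8], u[0:4])
	buf[8] = '-'
	hex.Encode(buf[9:13], u[4:6])
	buf[13] = '-'
	hex.Encode(buf[14:18], u[6:8])
	buf[18] = '-'
	hex.Encode(buf[19:23], u[8:10])
	buf[23] = '-'
	hex.Encode(buf[24:36], u[10:16])
	return string(buf[:])
}

// reactivationRequestID derives the RequestId from the revision alone.
// Assumption: the revision is formatted as a base-10 string.
func reactivationRequestID(revisionNumber int64) string {
	name := "reactivation-signal:" + strconv.FormatInt(revisionNumber, 10)
	return uuidV5(namespaceOID, []byte(name))
}
```

Any two callers that pass the same revision get the same 36-character UUID string, with no shared state between them.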
## Why revision-based dedup
History is sharded on `(namespaceID, workflowID)`. N concurrent
`StartWorkflow` calls pinned to the same drained version fan out across
potentially every history pod in the fleet. Before this PR each pod
independently fired a reactivation signal at the version workflow,
producing up to N `WorkflowExecutionSignaled` events, which is directly at
odds with the version workflow's design (it intentionally keeps history
minimal and continues-as-new aggressively, see `version_workflow.go:68-74`).
Per-pod caches alone can't fix this because they don't coordinate. What
we need is a **cluster-wide-deterministic dedup key** so all pods
converge on the same value for the same reactivation cycle. The
version's `revision_number` — incremented in `syncTaskQueuesAsync` on
every status change — is exactly that signal. Every pod reads the same
revision from matching, every pod composes the same UUID RequestId, and
Temporal's built-in `mutableState.pendingSignalRequestedIDs` dedup (see
`service/history/api/signalworkflow/api.go:40`) collapses concurrent
signals into exactly one event on the version workflow.
The per-pod map is a local optimization on top of that: it prevents a
single pod from re-sending the same-or-older-revision signal once it has
successfully sent one, cutting RPC volume.
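The convergence argument can be illustrated with a toy model of receiver-side dedup. This is a sketch only: `versionWorkflow` is a hypothetical stand-in for the request-ID set kept in mutable state, and the request IDs are placeholder strings rather than real UUIDs.

```go
package main

import "sync"

// versionWorkflow is a hypothetical model of the receiver: it records one
// event per previously unseen request ID.
type versionWorkflow struct {
	mu         sync.Mutex
	seen       map[string]struct{}
	eventCount int
}

func newVersionWorkflow() *versionWorkflow {
	return &versionWorkflow{seen: make(map[string]struct{})}
}

// signal applies a signal unless its request ID was already seen.
// It returns true when a new event was recorded.
func (w *versionWorkflow) signal(requestID string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, dup := w.seen[requestID]; dup {
		return false
	}
	w.seen[requestID] = struct{}{}
	w.eventCount++
	return true
}

// fanOut simulates `pods` history pods concurrently sending a signal with
// the same request ID and returns the total number of recorded events.
func fanOut(w *versionWorkflow, pods int, requestID string) int {
	var wg sync.WaitGroup
	for i := 0; i < pods; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.signal(requestID)
		}()
	}
	wg.Wait()
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.eventCount
}
```

However many pods fan out, one request ID produces one event; a later revision has a different request ID and so produces one more.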
## How the new caches look
### 1. `VersionMembershipAndReactivationStatusCache` (read-side,
per-pod)
Caches matching's `CheckTaskQueueVersionMembership` response so repeated
pinned-override validations on the same task queue don't re-hit
matching.
- **Key**: `(namespaceID, taskQueue, taskQueueType, deploymentName,
buildID)`
- **Value**: `(isMember bool, isVersionActiveOrDraining bool,
revisionNumber int64)`
- **Eviction**: `VersionMembershipCacheTTL` (1s default; 5s in
functional tests).
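A TTL cache with this key and value shape might look like the sketch below. The type and method names are hypothetical, the task queue type is reduced to an `int32`, and the injectable clock (`manualClock`) exists only so the example can be driven deterministically. The real cache implementation may differ.

```go
package main

import (
	"sync"
	"time"
)

type membershipKey struct {
	NamespaceID    string
	TaskQueue      string
	TaskQueueType  int32
	DeploymentName string
	BuildID        string
}

type membershipValue struct {
	IsMember                  bool
	IsVersionActiveOrDraining bool
	RevisionNumber            int64
}

type membershipEntry struct {
	value     membershipValue
	expiresAt time.Time
}

// membershipCache is a per-process cache whose entries expire after ttl.
type membershipCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	entries map[membershipKey]membershipEntry
}

func newMembershipCache(ttl time.Duration, now func() time.Time) *membershipCache {
	return &membershipCache{ttl: ttl, now: now, entries: make(map[membershipKey]membershipEntry)}
}

// Get returns the cached value, treating expired entries as misses.
func (c *membershipCache) Get(k membershipKey) (membershipValue, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[k]
	if !ok {
		return membershipValue{}, false
	}
	if !c.now().Before(e.expiresAt) {
		delete(c.entries, k)
		return membershipValue{}, false
	}
	return e.value, true
}

// Put stores the value with a fresh expiry.
func (c *membershipCache) Put(k membershipKey, v membershipValue) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = membershipEntry{value: v, expiresAt: c.now().Add(c.ttl)}
}

// manualClock is a clock that only moves when Advance is called.
type manualClock struct{ t time.Time }

func (m *manualClock) Now() time.Time          { return m.t }
func (m *manualClock) Advance(d time.Duration) { m.t = m.t.Add(d) }
```

With a short TTL such as the 1s default, a cached status or revision can be stale for at most that long, which bounds how long a pod keeps skipping or sending signals based on old data.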
### 2. `highestRevSignaledToVersionWf` (write-side dedup, per-pod)
A field on `ClientImpl` in `service/worker/workerdeployment/client.go`.
For each target version workflow, stores the highest revision this pod
has successfully signaled. Subsequent calls at the same-or-lower
revision skip the RPC.
- **Key**: `reactivationVersionKey{namespaceID, deploymentName,
buildID}`
- **Value**: `int64` (highest revision successfully signaled)
- **Eviction**: LRU, bounded by `VersionReactivationSignalCacheMaxSize`.
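The write-side dedup can be sketched as a small LRU keyed by version, storing the highest revision signaled. This is an illustrative reconstruction: `revisionDedup`, `ShouldSignal`, and `RecordSuccess` are hypothetical names, and the real field on `ClientImpl` may be built on a different cache type.

```go
package main

import (
	"container/list"
	"sync"
)

type reactivationVersionKey struct {
	NamespaceID    string
	DeploymentName string
	BuildID        string
}

type lruItem struct {
	key reactivationVersionKey
	rev int64
}

// revisionDedup tracks, per version, the highest revision this process has
// successfully signaled. It holds at most maxSize entries (LRU eviction).
type revisionDedup struct {
	mu      sync.Mutex
	maxSize int
	order   *list.List // front = most recently used
	items   map[reactivationVersionKey]*list.Element
}

func newRevisionDedup(maxSize int) *revisionDedup {
	return &revisionDedup{
		maxSize: maxSize,
		order:   list.New(),
		items:   make(map[reactivationVersionKey]*list.Element),
	}
}

// ShouldSignal reports whether rev is newer than anything already signaled
// for key. Unknown keys always signal.
func (d *revisionDedup) ShouldSignal(key reactivationVersionKey, rev int64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	el, ok := d.items[key]
	if !ok {
		return true
	}
	d.order.MoveToFront(el)
	return rev > el.Value.(*lruItem).rev
}

// RecordSuccess is called only after the signal RPC succeeds, so a failed
// signal leaves no entry behind and can be retried.
func (d *revisionDedup) RecordSuccess(key reactivationVersionKey, rev int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if el, ok := d.items[key]; ok {
		if item := el.Value.(*lruItem); rev > item.rev {
			item.rev = rev
		}
		d.order.MoveToFront(el)
		return
	}
	d.items[key] = d.order.PushFront(&lruItem{key: key, rev: rev})
	if d.order.Len() > d.maxSize {
		oldest := d.order.Back()
		d.order.Remove(oldest)
		delete(d.items, oldest.Value.(*lruItem).key)
	}
}
```

Eviction is safe in this design: a pod that has forgotten a version simply sends the signal again, and the deterministic RequestId lets the receiver discard it.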
The previous TTL-based `ReactivationSignalCache` module (in
`common/worker_versioning/`) has been deleted along with its provider
and `VersionReactivationSignalCacheTTL` config.
## Backwards/forwards compatibility
- **Old matching → new history**: old binaries don't set
`is_version_active_or_draining` or `revision_number`; both default to
proto zero values. `false` on the active bool → history falls through →
signal fires (safe default). `revisionNumber = 0` flows through as-is.
- **New matching → old history**: new fields on the response are ignored
by old history → identical to pre-PR behavior.
- **New matching → new history**: signal fires only when the version is
not active/draining; cross-pod fires converge on one UUID RequestId and
fold into one `WorkflowExecutionSignaled` event.
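The history-side decision behind these three cases reduces to one check on the response. The sketch below uses a hypothetical `membershipResponse` struct in place of the generated proto type, to show why the zero value from an old matching binary lands on the safe path.

```go
package main

// membershipResponse is a hypothetical mirror of the two new response fields.
type membershipResponse struct {
	IsVersionActiveOrDraining bool
	RevisionNumber            int64
}

// reactivationDecision returns whether history should send a reactivation
// signal, and the revision to derive the RequestId from.
func reactivationDecision(resp membershipResponse) (send bool, revision int64) {
	return !resp.IsVersionActiveOrDraining, resp.RevisionNumber
}
```

A response from old matching decodes as `membershipResponse{}`, so the function returns `(true, 0)`: the signal fires with revision 0, as the first bullet above describes.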
## Test plan
- [x] Unit tests for `IsVersionActiveOrDraining` covering all status
cases (CURRENT, RAMPING, DRAINING, DRAINED, INACTIVE, UNSPECIFIED), new
vs. old format, deleted and not-found versions.
- [x] Unit tests for
`ValidateVersioningOverrideAndGetReactivationEligibility` (cache
hit/miss, RPC with/without eligibility, Unimplemented fallback).
- [x] Unit tests for the per-pod dedup on
`ClientImpl.SignalVersionReactivation`: same-rev dedups, newer-rev
fires, older-rev skipped, different version isolated, signal-failure
allows retry.
- [x] Unit test for RequestId format (UUID v5, deterministic across
calls with the same revision).
- [x] Functional tests (all pass on SQLite and cass-es):
  - `TestStartWorkflowExecution_ReactivateVersionOnPinned`
  - `TestStartWorkflowExecution_ReactivateVersionOnPinned_WithConflictPolicy`
  - `TestSignalWithStartWorkflowExecution_ReactivateVersionOnPinned`
  - `TestUpdateWorkflowExecutionOptions_ReactivateVersionOnPinned`
  - `TestResetWorkflowExecution_ReactivateVersionOnPinned`
(The four `TestReactivationSignalCache_Deduplication_*` functional tests
from an earlier iteration were deleted — their coverage moved to unit
tests.)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes matching↔history API and reactivation signaling semantics by
skipping signals for active/draining versions and introducing
revision-based dedup via deterministic RequestIds; issues could affect
version workflow state transitions or signal fan-out during upgrades.
>
> **Overview**
> Matching’s `CheckTaskQueueVersionMembershipResponse` is extended with
`is_version_active_or_draining` and `revision_number`, and matching now
populates both from per-task-queue deployment data.
>
> History-side versioning validation is refactored to return and cache
reactivation eligibility + revision, and reactivation signaling paths
(`StartWorkflow`, `SignalWithStart`, `UpdateWorkflowExecutionOptions`,
`ResetWorkflow`, multi-op) now **skip signals** when matching reports
the version as *CURRENT/RAMPING/DRAINING*.
>
> The old TTL-based `ReactivationSignalCache` is removed
(configs/metrics/providers updated), and the worker-deployment client
now performs **revision-based per-pod dedup** plus receiver-side dedup
by sending signals with a deterministic UUIDv5-like `RequestId` derived
from `revision_number`. Tests are updated/added to cover status
evaluation, new plumbing, and dedup behavior.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a1ec5e9. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 28fd8c4 commit 562d26f
30 files changed
Lines changed: 850 additions & 1014 deletions
File tree
- api/matchingservice/v1
- common
  - dynamicconfig
  - metrics
  - worker_versioning
- proto/internal/temporal/server/api/matchingservice/v1
- service
  - history
    - api
      - multioperation
      - resetworkflow
      - respondworkflowtaskcompleted
      - signalwithstartworkflow
      - startworkflow
      - updateworkflowoptions
    - configs
  - matching
  - worker/workerdeployment
- tests