[Nexus] Migrate cancellation executors to chasm#9762
Closed
gow wants to merge 22 commits into nexus/hsm-to-chasm-migration
var timeoutType enumspb.TimeoutType
if args.startToCloseTimeout > 0 {
	callTimeout = min(callTimeout, args.startToCloseTimeout-elapsed)
	timeoutType = enumspb.TIMEOUT_TYPE_START_TO_CLOSE
Start-to-close timeout computed from wrong time base
Medium Severity
The elapsed time for the startToCloseTimeout calculation is computed from scheduledTime, but the old code correctly uses startedTime. Since startedTime > scheduledTime, using scheduledTime overestimates the elapsed duration, making callTimeout shorter than intended. This can cause premature timeouts for cancellation requests. The cancelArgs struct and loadCancelArgs don't load or expose startedTime at all, and the new OperationState proto lacks a started_time field, so there's no way to compute this correctly with the current data model.
35b4a1f to 312e79a
0c35467 to aec1985
- Added workflow command handler registry to CHASM's workflow library.
- Integrated CHASM's workflow library into workflow completion handler.

Migrating Nexus from HSM to CHASM.

Tests will be ported over once actual command handler implementations are added.
- Added `OperationState` proto fields
- Migrated nexus operation state transitions.

Migrating nexus from HSM to CHASM

- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

N/A. This code path is currently unreachable.

---

> [!NOTE]
> **Medium Risk**
> Touches persisted proto schemas and core state-transition logic for retries/timeouts; while executors are still unimplemented, any activation of this path could impact task scheduling and retry semantics.
>
> **Overview**
> Migrates Nexus operation lifecycle handling to CHASM by implementing the operation state machine transitions to emit invocation, backoff-retry, and timeout tasks and to record attempt metadata (last failure/completion time, next retry time, operation token).
>
> Expands the `OperationState` and task protos to persist endpoint/operation identifiers, scheduling timestamps, retry/attempt fields, and separate timeout task types (`schedule-to-start`, `start-to-close`, `schedule-to-close`), and wires new timeout task executors through Fx and the library task registry. Adds unit tests covering the new transition behavior and task scheduling.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4ab5fd0.</sup>
Ported command handler for Nexus "schedule" command from HSM to CHASM.

CHASM migration.

- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Chetan Gowda <chetan.gowda@temporal.io>
Co-authored-by: Chetan Gowda <gow@users.noreply.github.com>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
Ported command handler for Nexus "cancel" command from HSM to CHASM.

CHASM migration.

- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Roey Berman <roey.berman@gmail.com>
## What changed?
Replaces workflow-specific fields (i.e. `scheduled_event_token` and `requested_event_id`) in the CHASM Nexus operation state proto with a generic field.

## Why?
Ensures the CHASM Nexus operation state has no workflow specifics.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…flowregistry (#9474)

## What changed?
Just a package rename. No other code change.

## Why?
Migrating Nexus from HSM to CHASM

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---

> [!NOTE]
> **Low Risk**
> Primarily a package/API rename and dependency wiring update; low behavioral risk, but broad mechanical changes could cause compile-time breakage if any call sites were missed.
>
> **Overview**
> Renames the CHASM workflow command registry package from `chasm/lib/workflow/command` to `chasm/lib/workflow/workflowregistry` and updates its public API (`RegisterCommandHandler`, `CommandHandler`, `CommandHandlerOptions`, `ErrCommandNotSupported`).
>
> Propagates the rename through Nexus operation command handlers/tests, History `RespondWorkflowTaskCompleted` CHASM fallback path, engine/fx wiring, and related Nexus components so CHASM command handling continues to resolve and invoke handlers via the new registry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e4740af.</sup>
## What changed?
Migrating the Nexus history event Registry and Definition. I've also moved all the event implementations, with their bodies commented out. I will replace the implementations of `Apply()` and `CherryPick()` in follow-up PRs.

Depends on #9474

## Why?
Migrating Nexus from HSM to CHASM.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---

> [!NOTE]
> **Medium Risk**
> Introduces new history event registration and lookup paths for Nexus operations; since the `Apply` implementations are currently stubs, there’s some risk of silently skipping state transitions during replication/reset until follow-up PRs complete the logic.
>
> **Overview**
> Adds first-class **history event definitions** to `workflowregistry.Registry` via a new `EventDefinition` interface (with `Apply` and `CherryPick`) and `RegisterEventDefinition`/`EventDefinition` APIs.
>
> Wires Nexus operation event registration into the `fx` module and introduces `events.go` with definitions for Nexus lifecycle events (scheduled/cancel/start/complete/fail/cancel/timeout), including basic workflow-task-trigger flags and `CherryPick` exclusion handling (notably `RESET_REAPPLY_EXCLUDE_TYPE_NEXUS`), while leaving `Apply` bodies as TODO stubs.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e4bd20b.</sup>
## What changed?
This PR adds methods in workflow component to handle nexus events.

## Why?
Migrating Nexus from HSM to CHASM

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---

> [!NOTE]
> **Medium Risk**
> Touches the workflow-task completion path and CHASM workflow wiring (registry injection and history-event application), so misregistration or missing context could cause runtime command failures despite largely being additive/refactor changes.
>
> **Overview**
> Adds CHASM `Workflow` helpers to *emit and apply* Nexus operation lifecycle events (started/completed/failed/canceled/timed-out), including consistent failure wrapping via `NexusOperationExecutionFailure`.
>
> Refactors CHASM workflow command/event registration by moving `workflowregistry` into `chasm/lib/workflow` as `Registry`, injecting it into CHASM context, and updating Nexus workflow command handlers to use `AddAndApplyHistoryEvent` so command-emitted history events immediately run their registered event definitions. Updates wiring/tests/callers across services to construct `NewLibrary(NewRegistry())` and use the new types/errors.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 70936c6.</sup>
## What changed?
Migrating all the event definition's `Apply()` method from HSM to CHASM. Also migrated unit tests.

## Why?
HSM to CHASM migration

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---

> [!NOTE]
> **Medium Risk**
> Changes Nexus operation lifecycle handling by moving scheduling/cancellation and terminal event application into CHASM event definitions and adjusting state transition semantics; regressions could affect operation task emission, cancellation timing, and cleanup during replay/reset.
>
> **Overview**
> Implements CHASM-based Nexus operation event `Apply()` handlers: scheduled/cancel-requested/started now create or update the in-memory operation component (including spawning/scheduling a cancellation child once an operation token exists), and terminal events (completed/failed/canceled/timed-out) transition the operation then remove it from the workflow.
>
> Refactors workflow Nexus operation storage to be keyed by `ScheduledEventId` (`int64`), simplifies command handlers to only emit history events (letting event definitions create/update components), and expands cancellation/operation state machines with concrete task emission and retry/backoff metadata. Updates `chasm.Transition.Apply` to run transition logic before mutating state (enabling source-state inspection) and adds new CHASM-focused unit tests for the migrated event definitions and updated transition behavior.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c0f4e55.</sup>
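The ordering change mentioned for `chasm.Transition.Apply` (run the transition logic first, then mutate state) can be sketched as follows. This is a simplified stand-in under stated assumptions: `State`, `Operation`, the `Transition` fields, and `Logic` are hypothetical names, not the real `chasm` types.

```go
package main

import "fmt"

// State and Operation are simplified stand-ins (assumptions).
type State int

const (
	StateScheduled State = iota
	StateBackingOff
	StateSucceeded
)

type Operation struct {
	State   State
	Attempt int
}

// Transition is a hypothetical shape: allowed source states, a destination,
// and a logic callback that runs as part of the transition.
type Transition struct {
	Sources     []State
	Destination State
	Logic       func(op *Operation) error
}

// Apply validates the source state, runs Logic while op.State still holds
// the source state, and only then moves op to the destination state.
func (t Transition) Apply(op *Operation) error {
	allowed := false
	for _, s := range t.Sources {
		if s == op.State {
			allowed = true
			break
		}
	}
	if !allowed {
		return fmt.Errorf("invalid transition from state %d to %d", op.State, t.Destination)
	}
	if err := t.Logic(op); err != nil {
		return err // state is left unchanged when the logic fails
	}
	op.State = t.Destination
	return nil
}
```

Because the state is assigned last, the callback can branch on where the component is coming from, which is the "source-state inspection" the summary refers to.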
312e79a to 4b13977
aec1985 to d17c2e9
d17c2e9 to a64b0a0
Base automatically changed from cg/nexus/task_executors_2 to nexus/hsm-to-chasm-migration on April 6, 2026 17:09
6421ece to 634fa80
Member
I'll open a new PR instead of this one.
What changed?
Describe what has changed in this PR.
Why?
Tell your future self why have you made these changes.
How did you test it?
Potential risks
Any change is risky. Identify all risks you are aware of. If none, remove this section.
Note
Medium Risk
Cancellation execution is now implemented end-to-end (including outbound Nexus/HistoryService calls, timeout calculations, and retry state transitions), which can affect cancellation reliability and queue behavior if misconfigured. Changes also introduce new dependency wiring and metrics tagging that could impact observability and error handling paths.
Overview
Implements CHASM-based Nexus operation cancellation execution:
- `CancellationTaskHandler` now validates scheduled attempts, loads cancellation/operation args from CHASM state, performs the cancel call (either via external Nexus client or internally via HistoryService for the system endpoint), and records outbound request metrics and failure-source tagging.
- Adds CHASM state-machine integration for cancel outcomes by applying results back onto the `Cancellation` component (retryable vs non-retryable errors drive backoff rescheduling or terminal failure/success), and updates the backoff task handler to reschedule attempts.
- Introduces a new dynamic config `nexusoperation.recordCancelRequestCompletionEvents` (currently TODO-gated for history event emission) and adds `cancelCallOutcomeTag` to classify cancel outcomes for metrics.

Written by Cursor Bugbot for commit 0c35467.