
[Nexus] Migrate cancellation executors to chasm#9762

Closed
gow wants to merge 22 commits into nexus/hsm-to-chasm-migration from cg/nexus/task_executors_3

Conversation

@gow
Contributor

@gow gow commented Apr 1, 2026

What changed?

Describe what has changed in this PR.

Why?

Tell your future self why you have made these changes.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Any change is risky. Identify all risks you are aware of. If none, remove this section.


Note

Medium Risk
Cancellation execution is now implemented end-to-end (including outbound Nexus/HistoryService calls, timeout calculations, and retry state transitions), which can affect cancellation reliability and queue behavior if misconfigured. Changes also introduce new dependency wiring and metrics tagging that could impact observability and error handling paths.

Overview
Implements CHASM-based Nexus operation cancellation execution: CancellationTaskHandler now validates scheduled attempts, loads cancellation/operation args from CHASM state, performs the cancel call (either via external Nexus client or internally via HistoryService for the system endpoint), and records outbound request metrics and failure-source tagging.

Adds CHASM state-machine integration for cancel outcomes by applying results back onto the Cancellation component (retryable vs non-retryable errors drive backoff rescheduling or terminal failure/success), and updates the backoff task handler to reschedule attempts.
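The outcome handling described above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `cancellation`, `retryableError`, `applyOutcome`, and the doubling backoff policy are hypothetical names and values chosen for illustration, and the real component presumably tracks more state.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// status is a hypothetical stand-in for the Cancellation component's status.
type status int

const (
	statusScheduled status = iota
	statusBackingOff
	statusSucceeded
	statusFailed
)

// retryableError marks a failure that is worth another attempt (hypothetical type).
type retryableError struct{ cause error }

func (e *retryableError) Error() string { return "retryable: " + e.cause.Error() }
func (e *retryableError) Unwrap() error { return e.cause }

// errNotFound is an example of a failure that retrying cannot fix.
var errNotFound = errors.New("operation not found")

// cancellation holds only the fields this sketch needs.
type cancellation struct {
	Status        status
	Attempt       int
	NextAttemptAt time.Time
}

// backoff doubles from one second per attempt and is capped at one minute.
// The policy is an assumption; the PR summary does not state the retry policy.
func backoff(attempt int) time.Duration {
	d := time.Second
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= time.Minute {
			return time.Minute
		}
	}
	return d
}

// applyOutcome records the result of one cancel call on the component.
func applyOutcome(c *cancellation, now time.Time, err error) {
	c.Attempt++
	var re *retryableError
	switch {
	case err == nil:
		c.Status = statusSucceeded // terminal success
	case errors.As(err, &re):
		c.Status = statusBackingOff // wait, then reschedule the attempt
		c.NextAttemptAt = now.Add(backoff(c.Attempt))
	default:
		c.Status = statusFailed // terminal failure
	}
}

func main() {
	now := time.Date(2026, 4, 1, 6, 0, 0, 0, time.UTC)
	c := &cancellation{Status: statusScheduled}

	applyOutcome(c, now, &retryableError{cause: errors.New("upstream timeout")})
	fmt.Println(c.Status == statusBackingOff, c.NextAttemptAt.Sub(now))

	applyOutcome(c, now, errNotFound)
	fmt.Println(c.Status == statusFailed, c.Attempt)
}
```

In this sketch the backoff task handler would only need to read `NextAttemptAt` and move the component back to a scheduled state when the timer fires.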

Introduces a new dynamic config nexusoperation.recordCancelRequestCompletionEvents (currently TODO-gated for history event emission) and adds cancelCallOutcomeTag to classify cancel outcomes for metrics.

Written by Cursor Bugbot for commit 0c35467. This will update automatically on new commits. Configure here.

@gow gow requested review from a team as code owners April 1, 2026 06:12
@gow gow marked this pull request as draft April 1, 2026 06:13

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


var timeoutType enumspb.TimeoutType
if args.startToCloseTimeout > 0 {
	callTimeout = min(callTimeout, args.startToCloseTimeout-elapsed)
	timeoutType = enumspb.TIMEOUT_TYPE_START_TO_CLOSE

Start-to-close timeout computed from wrong time base

Medium Severity

The elapsed time for the startToCloseTimeout calculation is computed from scheduledTime, but the old code correctly uses startedTime. Since startedTime > scheduledTime, using scheduledTime overestimates the elapsed duration, making callTimeout shorter than intended. This can cause premature timeouts for cancellation requests. The cancelArgs struct and loadCancelArgs don't load or expose startedTime at all, and the new OperationState proto lacks a started_time field, so there's no way to compute this correctly with the current data model.
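To make the effect of the time base concrete, here is a small sketch of the arithmetic. It assumes the review comment is right that the window should be measured from the started time. `cappedCallTimeout` and `demo` are hypothetical helpers, and the durations are made-up example values, not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// cappedCallTimeout limits callTimeout to whatever is left of the
// start-to-close window, where the window is measured from base.
// A non-positive startToClose means no start-to-close timeout is set.
func cappedCallTimeout(callTimeout, startToClose time.Duration, base, now time.Time) time.Duration {
	if startToClose <= 0 {
		return callTimeout
	}
	remaining := startToClose - now.Sub(base)
	if remaining < callTimeout {
		return remaining
	}
	return callTimeout
}

// demo returns the capped timeout computed from each candidate time base.
func demo() (fromScheduled, fromStarted time.Duration) {
	scheduledTime := time.Date(2026, 4, 1, 6, 0, 0, 0, time.UTC)
	startedTime := scheduledTime.Add(5 * time.Second) // operation started 5s after scheduling
	now := scheduledTime.Add(7 * time.Second)

	const callTimeout = 30 * time.Second
	const startToClose = 10 * time.Second

	fromScheduled = cappedCallTimeout(callTimeout, startToClose, scheduledTime, now)
	fromStarted = cappedCallTimeout(callTimeout, startToClose, startedTime, now)
	return fromScheduled, fromStarted
}

func main() {
	fromScheduled, fromStarted := demo()
	fmt.Println("base = scheduledTime:", fromScheduled) // 3s
	fmt.Println("base = startedTime:  ", fromStarted)   // 8s
}
```

With these example values, measuring from `scheduledTime` leaves 3s for the cancel call, while measuring from `startedTime` leaves 8s. The 5s difference is exactly the gap between scheduling and start.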


@gow gow force-pushed the cg/nexus/task_executors_2 branch from 35b4a1f to 312e79a Compare April 1, 2026 06:30
@gow gow force-pushed the cg/nexus/task_executors_3 branch from 0c35467 to aec1985 Compare April 1, 2026 06:31
@gow gow requested review from bergundy and stephanos April 2, 2026 05:20
gow and others added 21 commits April 3, 2026 09:58
- Added workflow command handler registry to CHASM's workflow library.
- Integrated CHASM's workflow library into workflow completion handler.

Migrating Nexus from HSM to CHASM.

Tests will be ported over once actual command handler implementations
are added.
 - Added `OperationState` proto fields
 - Migrated nexus operation state transitions.

Migrating nexus from HSM to CHASM

- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

N/A. This code path is currently unreachable.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches persisted proto schemas and core state-transition logic for
retries/timeouts; while executors are still unimplemented, any
activation of this path could impact task scheduling and retry
semantics.
>
> **Overview**
> Migrates Nexus operation lifecycle handling to CHASM by implementing
the operation state machine transitions to emit invocation,
backoff-retry, and timeout tasks and to record attempt metadata (last
failure/completion time, next retry time, operation token).
>
> Expands the `OperationState` and task protos to persist
endpoint/operation identifiers, scheduling timestamps, retry/attempt
fields, and separate timeout task types (`schedule-to-start`,
`start-to-close`, `schedule-to-close`), and wires new timeout task
executors through Fx and the library task registry. Adds unit tests
covering the new transition behavior and task scheduling.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
4ab5fd0. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Ported command handler for Nexus "schedule" command from HSM to CHASM.

CHASM migration.

- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Chetan Gowda <chetan.gowda@temporal.io>
Co-authored-by: Chetan Gowda <gow@users.noreply.github.com>
Co-authored-by: Shivam <57200924+Shivs11@users.noreply.github.com>
Ported command handler for Nexus "cancel" command from HSM to CHASM.

CHASM migration.

- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Roey Berman <roey.berman@gmail.com>
## What changed?

Replaces workflow-specific fields (i.e., `scheduled_event_token` and
`requested_event_id`) in the CHASM Nexus operation state proto with a
generic field.

## Why?

Ensure CHASM Nexus operation state has no workflow-specifics.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…flowregistry (#9474)

## What changed?
Just a package rename. No other code change.

## Why?
Migrating Nexus from HSM to CHASM

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)



<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Primarily a package/API rename and dependency wiring update; low
behavioral risk, but broad mechanical changes could cause compile-time
breakage if any call sites were missed.
> 
> **Overview**
> Renames the CHASM workflow command registry package from
`chasm/lib/workflow/command` to `chasm/lib/workflow/workflowregistry`
and updates its public API (`RegisterCommandHandler`, `CommandHandler`,
`CommandHandlerOptions`, `ErrCommandNotSupported`).
> 
> Propagates the rename through Nexus operation command handlers/tests,
History `RespondWorkflowTaskCompleted` CHASM fallback path, engine/fx
wiring, and related Nexus components so CHASM command handling continues
to resolve and invoke handlers via the new registry.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e4740af. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## What changed?
Migrates the Nexus history event Registry and Definition. I've also moved
all the event implementations, with their bodies commented out. I will
replace the implementations of `Apply()` and `CherryPick()` in follow-up
PRs.
Depends on #9474

## Why?
Migrating Nexus from HSM to CHASM.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)



<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Introduces new history event registration and lookup paths for Nexus
operations; since the `Apply` implementations are currently stubs,
there’s some risk of silently skipping state transitions during
replication/reset until follow-up PRs complete the logic.
> 
> **Overview**
> Adds first-class **history event definitions** to
`workflowregistry.Registry` via a new `EventDefinition` interface (with
`Apply` and `CherryPick`) and
`RegisterEventDefinition`/`EventDefinition` APIs.
> 
> Wires Nexus operation event registration into the `fx` module and
introduces `events.go` with definitions for Nexus lifecycle events
(scheduled/cancel/start/complete/fail/cancel/timeout), including basic
workflow-task-trigger flags and `CherryPick` exclusion handling (notably
`RESET_REAPPLY_EXCLUDE_TYPE_NEXUS`), while leaving `Apply` bodies as
TODO stubs.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e4bd20b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
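
As a rough illustration of the registry shape this summary describes, the sketch below keeps the names `EventDefinition`, `Apply`, `CherryPick`, `Registry`, and `RegisterEventDefinition` from the text. Everything else is an assumption: the method signatures, the `event` and `operations` types, and the `"nexus"` exclusion key (a stand-in for `RESET_REAPPLY_EXCLUDE_TYPE_NEXUS`) are simplified for the example and are not the actual `workflowregistry` API.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a pared-down history event (hypothetical shape).
type event struct {
	Type string
	ID   int64
}

// operations maps a scheduled event ID to a status string.
type operations map[int64]string

// EventDefinition borrows its method names from the summary; the
// parameter lists are simplified assumptions.
type EventDefinition interface {
	Type() string
	Apply(ops operations, e event) error
	CherryPick(ops operations, e event, exclude map[string]bool) error
}

var (
	errDuplicate = errors.New("event definition already registered")
	errExcluded  = errors.New("event type excluded from reapply")
)

// Registry looks up event definitions by event type.
type Registry struct{ defs map[string]EventDefinition }

func NewRegistry() *Registry { return &Registry{defs: map[string]EventDefinition{}} }

// RegisterEventDefinition rejects a second definition for the same type.
func (r *Registry) RegisterEventDefinition(d EventDefinition) error {
	if _, ok := r.defs[d.Type()]; ok {
		return errDuplicate
	}
	r.defs[d.Type()] = d
	return nil
}

// EventDefinition returns the definition registered for eventType, if any.
func (r *Registry) EventDefinition(eventType string) (EventDefinition, bool) {
	d, ok := r.defs[eventType]
	return d, ok
}

// scheduledDefinition handles a hypothetical "scheduled" event.
type scheduledDefinition struct{}

func (scheduledDefinition) Type() string { return "NexusOperationScheduled" }

func (scheduledDefinition) Apply(ops operations, e event) error {
	ops[e.ID] = "scheduled"
	return nil
}

// CherryPick applies the event unless the caller excluded Nexus events.
func (d scheduledDefinition) CherryPick(ops operations, e event, exclude map[string]bool) error {
	if exclude["nexus"] {
		return errExcluded
	}
	return d.Apply(ops, e)
}

func main() {
	reg := NewRegistry()
	if err := reg.RegisterEventDefinition(scheduledDefinition{}); err != nil {
		panic(err)
	}
	ops := operations{}
	e := event{Type: "NexusOperationScheduled", ID: 5}
	if def, ok := reg.EventDefinition(e.Type); ok {
		if err := def.Apply(ops, e); err != nil {
			panic(err)
		}
	}
	fmt.Println(ops[e.ID])
}
```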
## What changed?
This PR adds methods to the workflow component to handle Nexus events.

## Why?
Migrating Nexus from HSM to CHASM

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)



<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches the workflow-task completion path and CHASM workflow wiring
(registry injection and history-event application), so misregistration
or missing context could cause runtime command failures despite largely
being additive/refactor changes.
> 
> **Overview**
> Adds CHASM `Workflow` helpers to *emit and apply* Nexus operation
lifecycle events (started/completed/failed/canceled/timed-out),
including consistent failure wrapping via
`NexusOperationExecutionFailure`.
> 
> Refactors CHASM workflow command/event registration by moving
`workflowregistry` into `chasm/lib/workflow` as `Registry`, injecting it
into CHASM context, and updating Nexus workflow command handlers to use
`AddAndApplyHistoryEvent` so command-emitted history events immediately
run their registered event definitions. Updates wiring/tests/callers
across services to construct `NewLibrary(NewRegistry())` and use the new
types/errors.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
70936c6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## What changed?
Migrated the `Apply()` methods of all event definitions from HSM to CHASM.
Also migrated the unit tests.

## Why?
HSM to CHASM migration

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes Nexus operation lifecycle handling by moving
scheduling/cancellation and terminal event application into CHASM event
definitions and adjusting state transition semantics; regressions could
affect operation task emission, cancellation timing, and cleanup during
replay/reset.
> 
> **Overview**
> Implements CHASM-based Nexus operation event `Apply()` handlers:
scheduled/cancel-requested/started now create or update the in-memory
operation component (including spawning/scheduling a cancellation child
once an operation token exists), and terminal events
(completed/failed/canceled/timed-out) transition the operation then
remove it from the workflow.
> 
> Refactors workflow Nexus operation storage to be keyed by
`ScheduledEventId` (`int64`), simplifies command handlers to only emit
history events (letting event definitions create/update components), and
expands cancellation/operation state machines with concrete task
emission and retry/backoff metadata. Updates `chasm.Transition.Apply` to
run transition logic before mutating state (enabling source-state
inspection) and adds new CHASM-focused unit tests for the migrated event
definitions and updated transition behavior.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
c0f4e55. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
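
The ordering change to `Apply` mentioned in this summary (run the transition logic first, mutate state afterwards) can be sketched as below. This is an illustrative, non-generic toy and not the `chasm.Transition` type: `transition`, `operation`, and the status constants are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type status int

const (
	statusScheduled status = iota
	statusBackingOff
	statusStarted
)

// operation holds only the status field this sketch needs.
type operation struct {
	Status status
}

var errInvalidTransition = errors.New("invalid transition")

// transition is a hypothetical, simplified state-machine transition.
type transition struct {
	Sources     []status
	Destination status
	Fn          func(op *operation) error
}

// Possible reports whether op is currently in one of the source states.
func (t transition) Possible(op *operation) bool {
	for _, s := range t.Sources {
		if s == op.Status {
			return true
		}
	}
	return false
}

// Apply runs Fn while op.Status still holds the source state, and only
// then moves the operation to Destination.
func (t transition) Apply(op *operation) error {
	if !t.Possible(op) {
		return errInvalidTransition
	}
	if err := t.Fn(op); err != nil {
		return err // Status is not advanced when the logic fails
	}
	op.Status = t.Destination
	return nil
}

func main() {
	var seen status
	start := transition{
		Sources:     []status{statusScheduled, statusBackingOff},
		Destination: statusStarted,
		Fn: func(op *operation) error {
			seen = op.Status // still the source state at this point
			return nil
		},
	}
	op := &operation{Status: statusBackingOff}
	if err := start.Apply(op); err != nil {
		panic(err)
	}
	fmt.Println(seen == statusBackingOff, op.Status == statusStarted)
}
```

Because `Fn` runs before the assignment, it can branch on where the operation came from (for example, a first attempt versus a retry after backoff) without the caller having to pass the previous state separately.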
@gow gow force-pushed the cg/nexus/task_executors_2 branch from 312e79a to 4b13977 Compare April 3, 2026 16:59
@gow gow force-pushed the cg/nexus/task_executors_3 branch from aec1985 to d17c2e9 Compare April 3, 2026 17:15
@gow gow force-pushed the cg/nexus/task_executors_3 branch from d17c2e9 to a64b0a0 Compare April 3, 2026 17:41
Base automatically changed from cg/nexus/task_executors_2 to nexus/hsm-to-chasm-migration April 6, 2026 17:09
@bergundy bergundy force-pushed the nexus/hsm-to-chasm-migration branch 3 times, most recently from 6421ece to 634fa80 Compare April 8, 2026 04:21
@bergundy
Member

bergundy commented Apr 8, 2026

I'll open a new PR instead of this one.

@bergundy bergundy closed this Apr 8, 2026
3 participants