feat(provision): prompt to cancel Azure deployment on Ctrl+C (Bicep)#7795
feat(provision): prompt to cancel Azure deployment on Ctrl+C (Bicep)#7795
Conversation
When a user presses Ctrl+C during 'azd provision' or 'azd up' while a Bicep deployment is in flight on Azure, azd now pauses and asks whether to leave the Azure deployment running (default) or to cancel it via the ARM Cancel API and wait for a terminal state. - pkg/input: register-able interrupt handler stack with re-entrant Ctrl+C suppression while a handler is running. - pkg/azapi + pkg/infra: Cancel methods on DeploymentService / Deployment for both subscription- and resource-group-scoped deployments. Deployment Stacks return 'not supported' (no Cancel API surface today). - pkg/infra/provisioning: typed sentinel errors for the 4 outcomes (leave running / canceled / cancel timed out / cancel too late) plus telemetry attribute provision.cancellation. - pkg/infra/provisioning/bicep: interactive prompt + cancel-and-poll flow with 30s cancel-request timeout and 2-min terminal-state wait. - cmd/middleware + internal/cmd: bypass agent troubleshooting and map sentinels to telemetry codes. - docs/provision-cancellation.md: user-facing behavior, outcomes, provider scope, telemetry, and non-interactive fallback. Terraform and Deployment Stacks are out of scope and unchanged. Closes #2810 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds an interactive Ctrl+C flow for in-flight Bicep (ARM) deployments so users can explicitly choose to leave the deployment running or request Azure-side cancellation, with typed outcomes and telemetry.
Changes:
- Introduces a stack-based SIGINT handler mechanism with re-entrant suppression for interactive prompts.
- Adds ARM deployment cancel support for subscription- and resource-group-scoped deployments (with stack deployments explicitly unsupported).
- Wires Bicep provision/deploy to prompt on Ctrl+C, emit typed sentinel outcomes, and record
provision.cancellationtelemetry; documents the behavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cli/azd/pkg/input/interrupt.go | New interrupt handler stack + re-entrant suppression primitives. |
| cli/azd/pkg/input/interrupt_test.go | Unit tests for handler stack and re-entrancy behavior. |
| cli/azd/pkg/input/console.go | Updates SIGINT watcher to consult registered handlers and ignore re-entrant interrupts. |
| cli/azd/pkg/infra/scope.go | Adds Cancel(ctx) to infra.Deployment and implements for RG/sub scopes. |
| cli/azd/pkg/infra/scope_test.go | Tests cancel endpoint calls for both scopes and error propagation. |
| cli/azd/pkg/infra/provisioning/cancel.go | Adds provisioning cancellation sentinel errors surfaced to middleware. |
| cli/azd/pkg/infra/provisioning/bicep/interrupt.go | Implements the interactive prompt + cancel/poll flow and outcome mapping. |
| cli/azd/pkg/infra/provisioning/bicep/interrupt_test.go | Tests terminal-state detection and interrupt outcome application. |
| cli/azd/pkg/infra/provisioning/bicep/bicep_provider.go | Wires interrupt handler around the in-flight ARM deploy call and sets telemetry. |
| cli/azd/pkg/azapi/deployments.go | Extends DeploymentService with cancel methods. |
| cli/azd/pkg/azapi/standard_deployments.go | Implements ARM cancel for subscription + resource group deployments. |
| cli/azd/pkg/azapi/stack_deployments.go | Implements cancel methods as unsupported for deployment stacks. |
| cli/azd/internal/tracing/fields/fields.go | Adds provision.cancellation tracing attribute key. |
| cli/azd/internal/cmd/errors.go | Maps new provisioning sentinel errors to classifications. |
| cli/azd/cmd/middleware/error.go | Ensures new sentinels bypass agent troubleshooting. |
| cli/azd/docs/provision-cancellation.md | Documents the Ctrl+C cancellation UX and outcomes. |
- pkg/input: LIFO test now invokes handlers and asserts distinct call counts to prove ordering. - pkg/infra/provisioning: add ErrDeploymentCancelFailed sentinel so the cancel-request-failure path no longer misclassifies as a timeout; wire it through error middleware skip-list and telemetry mapping. - pkg/infra: switch new TestScopeCancel subtests to t.Context(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/azapi: add typed ErrCancelNotSupported sentinel; stack CancelSubscriptionDeployment / CancelResourceGroupDeployment now return it instead of an opaque string. - pkg/infra/provisioning/bicep: interrupt handler treats ErrCancelNotSupported as the safer 'leave running' outcome (matches documented stacks behavior + telemetry). Cancel-request error path routes through terminalToOutcome when the deployment is already in a terminal state, so the portal URL and consistent messaging are surfaced. Canceled terminal branch now prints the portal URL too. - pkg/infra/provisioning: ErrDeploymentCancelFailed doc comment now references errors.Is/errors.As (matches the multi-%w joined-error wrapping pattern used here). - pkg/infra/provisioning/bicep/bicep_provider: tear down the interrupt handler immediately after deployModule returns (sync.OnceFunc) to avoid a small window where a late Ctrl+C could surface the prompt over post-processing output. - internal/cmd/errors: map ErrCancelNotSupported in classifySentinel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If Ctrl+C arrives but the ARM deployment happens to finish naturally before the user picks an option in the prompt, the previous design could take the success path and silently drop the interrupt. - installDeploymentInterruptHandler now exposes a 'started' channel that is closed the instant Ctrl+C is received, before the prompt is shown. deployCtx is also cancelled immediately so PollUntilDone unblocks ASAP. - BicepProvider.Deploy block-receives the outcome whenever 'started' is closed (instead of a non-blocking drain), so the user's choice is always honored regardless of who wins the race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/input/console: watchTerminalInterrupt now reserves the running slot before consulting the handler stack so re-entrant Ctrl+C is suppressed even if the stack is briefly empty (e.g. handler popped but still executing the prompt). - pkg/infra/provisioning/bicep/bicep_provider: defer cleanup until after the interrupt outcome is received so a second Ctrl+C during the prompt is still suppressed; the no-interrupt path tears down immediately as before. - pkg/infra/provisioning/cancel: doc reads 'sentinel errors' instead of 'typed errors' to match the implementation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/input/interrupt: enforce strict LIFO when popping handlers (only pop when this handler is still top-of-stack), so out-of-order pops never accidentally remove unrelated newer handlers. - pkg/infra/provisioning/bicep/interrupt: defensive default in terminalToOutcome now stops the spinner and emits a warning with the observed state and portal URL, leaving the UI clean if an unexpected terminal state is ever observed. - pkg/infra/provisioning/bicep/interrupt: treat DeploymentProvisioningStateDeleted as terminal in the cancel poll so we don't keep polling until the deadline if the deployment is deleted out from under us. Test updated accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/infra/provisioning/bicep/interrupt: wrap the interrupt handler closure with sync.OnceValue so close(started), cancelDeploy() and the outcome channel send all run at most once. Combined with the in-flight guard from tryStartInterruptHandler and the strict LIFO pop, additional Ctrl+C signals after the prompt completes can no longer panic or block on the buffered channel. - pkg/infra/provisioning/bicep/interrupt: print the portal URL on the prompt-failure leave-running path so the user always has a link to follow up when the URL is available. - docs/provision-cancellation: clarify that the portal URL is printed when available (not 'in every case'). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/input/interrupt: nil out the popped slot before truncating the interrupt stack so the GC can reclaim the popped handler and any state it captured, even before the underlying array is reallocated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pkg/input/console: run the registered interrupt handler inline on the signal goroutine instead of in a nested goroutine. This removes the scheduling window where SIGINT was received but the handler had not yet run, which could let a deploy goroutine complete naturally and silently drop the Ctrl+C. Re-entrant signals remain suppressed via tryStartInterruptHandler. - pkg/infra/provisioning/bicep/interrupt: switch the cancel poll loop to a time.Ticker and move the wait before each Get, so a slow Get cannot produce back-to-back ARM polls (preventing throttling). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| // If the deployment is already in a terminal state, route through | ||
| // the same terminal-outcome reporter so the user sees consistent | ||
| // messaging (including the portal URL). | ||
| if state, getErr := deployment.Get(context.WithoutCancel(ctx)); getErr == nil && |
There was a problem hiding this comment.
In the cancel-request error path, the follow-up deployment.Get(context.WithoutCancel(ctx)) call is unbounded (no timeout). If ARM is slow/unreachable, the interrupt handler can hang indefinitely after the user selected “Cancel”, preventing azd from exiting or surfacing the portal URL/outcome. Consider wrapping this Get in a short timeout context (similar to cancelRequestTimeout) so the interrupt flow always makes progress even when ARM is degraded.
| if state, getErr := deployment.Get(context.WithoutCancel(ctx)); getErr == nil && | |
| getCtx, getDone := context.WithTimeout( | |
| context.WithoutCancel(ctx), cancelRequestTimeout) | |
| defer getDone() | |
| if state, getErr := deployment.Get(getCtx); getErr == nil && |
| firstCalls++ | ||
| return true | ||
| } | ||
| pop1 := PushInterruptHandler(first) |
There was a problem hiding this comment.
This test pushes a global interrupt handler but doesn’t register cleanup until the end of the test. If any require.* before pop1() fails, the handler can remain on the global stack and pollute later pkg/input tests. Consider calling t.Cleanup(pop1) (or defer pop1()) immediately after this PushInterruptHandler call.
| secondCalls++ | ||
| return true | ||
| } | ||
| pop2 := PushInterruptHandler(second) |
There was a problem hiding this comment.
Same as above: if an assertion fails before pop2() runs, the global interrupt stack can be left in a dirty state. Register pop2 with t.Cleanup(pop2) (or defer pop2()) immediately after pushing it.
| switch state { | ||
| case azapi.DeploymentProvisioningStateCanceled, | ||
| azapi.DeploymentProvisioningStateFailed, | ||
| azapi.DeploymentProvisioningStateSucceeded, | ||
| azapi.DeploymentProvisioningStateDeleted: |
There was a problem hiding this comment.
isTerminalProvisioningState includes DeploymentProvisioningStateDeleted as terminal, but terminalToOutcome does not handle Deleted explicitly and will treat it as an “unexpected terminal state” (and map it to cancel_too_late). Either handle Deleted explicitly in terminalToOutcome (with appropriate messaging/telemetry) or remove it from the terminal set if it shouldn’t be surfaced here.
wbreza
left a comment
There was a problem hiding this comment.
Reviewing as wbreza. Blocking on H1–H4 — genuine correctness/industry-standard Ctrl+C defects not covered by the existing 52 comments. Rest are architectural consistency + test coverage gaps to consider.
🔴 High — must-fix before merge (4)
H1 — Race between default branch and signal-goroutine handler dispatch
pkg/infra/provisioning/bicep/bicep_provider.go:790–802 + pkg/input/console.go:1000–1018
In watchTerminalInterrupt there are three sequential steps after SIGINT: ryStartInterruptHandler() → currentInterruptHandler() → handler() (where close(started) lives). Between step B and C the Go scheduler can run the main goroutine, which sees <-interruptStarted not yet closed, takes the default branch, and calls cleanupOnce() to pop + cancelDeploy(). Two failure modes:
- The captured handler runs the prompt over success output; the interruptCh send has no reader.
- If the pop happened before B, currentInterruptHandler() returns
il and the goroutine hits os.Exit(1) after success.
Fix options: expose TryClaimInterrupt/ReleaseInterrupt so the default branch acquires the running slot before popping; or add an OnDispatch pre-handler callback invoked by the signal goroutine before handler() so interruptStarted is closed before handler() begins.
H2 — Handler panic leaves process unkillable from Ctrl+C
pkg/input/console.go:1013
go handled := handler() finishInterruptHandler() // never reached on panic
No
ecover(), not deferred. If a handler panics, interruptRunning=true forever AND the signal goroutine dies, so or range signalChan stops draining — every future Ctrl+C is silently dropped AND Go's default SIGINT handler is not restored. User must SIGKILL.
Fix:
go defer func() { recover(); signal.Stop(signalChan) }() for range signalChan { if !tryStartInterruptHandler() { continue } handler := currentInterruptHandler() func() { defer finishInterruptHandler() defer func() { if r := recover(); r != nil { log.Printf("interrupt handler panic: %v", r) } }() handled = handler() }() if !handled { os.Exit(1) } }
H3 — Second Ctrl+C is silently dropped; no industry-standard force-exit escape
pkg/infra/provisioning/bicep/interrupt.go (prompt at ~line 75, poll at 185–281)
Two windows where a second Ctrl+C has no effect:
- While the console.Select prompt is showing — ryStartInterruptHandler() gate + sync.OnceValue cached return both swallow the press.
- During the 2-minute cancel-await poll — interruptRunning stays true, all further SIGINTs are dropped silently with zero user feedback.
This violates the near-universal POSIX CLI convention — kubectl, erraform apply, docker compose down, git, docker run,
pm, cargo, and all azd peers treat a 2nd Ctrl+C as force-exit. Users who hit Ctrl+C by accident, or who see the cancel wait hanging, have no escape short of SIGKILL.
Fix: Track a press counter separately from interruptRunning. On 2nd press while a handler is running → os.Exit(130) (standard 128+SIGINT). Or nest a PushInterruptHandler inside the poll that aborts the wait and returns ErrDeploymentInterruptedLeaveRunning on re-press.
H4 — Process-global interrupt stack breaks IoC, parallel tests, multi-console hosts
pkg/input/interrupt.go:22–25
interruptStack, interruptMu, interruptRunning are package vars. pkg/input is shared UX infrastructure consumed via IoC — every other piece of Console state is scoped to the instance. Impacts:
- .Parallel() tests using input collide silently. The current interrupt_test.go has to avoid parallelism with no documentation.
- Embeddings that host multiple consoles (MCP server, dev-center-mode, potential agent harness hosting multiple azd sessions) share one handler stack — handlers leak across sessions.
- Tests can't override the broker for unit isolation.
Fix: Define InterruptBroker interface with Push(InterruptHandler) (pop func()), register as a singleton in cmd/container.go, and inject into AskerConsole + BicepProvider. Keep the package-global only as a default-adapter fallback.
🟡 Medium (13) — architectural consistency + test coverage
Architecture / API design
- M1 — DeploymentService.CancelSubscriptionDeployment + CancelResourceGroupDeployment duplicate the N² scope surface. Repo guidance says "extract shared helpers for multiple scopes". Either (a) CancelDeployment(ctx, scope) taking infra.Deployment, or (b) drop from the interface — each scope impl has a single caller (SubscriptionDeployment.Cancel / ResourceGroupDeployment.Cancel) that can call �rmDeployments.CancelAtSubscriptionScope directly.
- M2 — ErrCancelNotSupported lives in pkg/azapi but is consumed via infra.Deployment in �icep/interrupt.go. Terraform/Pulumi will need the same sentinel and will either have to import �zapi or redefine. Move to pkg/infra/provisioning/cancel.go.
- M3 — 5 new sentinels duplicated across cmd/middleware/error.go:90–96 (shouldSkipAgentHandling) AND internal/cmd/errors.go:287–299 (classifySentinel). #7797 is solving the same pattern with AuthInteractionError. Introduce UserDecisionError interface with UserDecisionCode() string. One �rrors.AsType check collapses both lists and absorbs ErrAbortedByUser, ErrSamplingDenied, ErrPreviewNotSupported as a follow-up cleanup.
- M4 — BicepProvider.installDeploymentInterruptHandler calls console.Select, manages spinners, formats portal URLs. Provider owns interactive UX copy. Introduce a small InterruptDecider interface in pkg/infra/provisioning (returns a InterruptChoice enum). Default impl lives in pkg/input / pkg/ux. Terraform won't copy-paste this when it adds support.
Correctness / robustness
- M5 — provision.cancellation attribute is only emitted by the Bicep path (3 call sites). Terraform and Stacks runs never emit it. count by (provision.cancellation) Kusto aggregations under-count. Either emit
one in a shared wrapper or rename the key to provision.bicep.cancellation. - M6 — Non-TTY contract relies on console.Select returning an error. With --no-prompt or a pipe, Select may return a zero-value choice (not error) and the prompt can hang. Upfront check: if !console.IsUnformatted() || console.IsNoPrompt() { return leaveRunningOutcome } before calling Select.
- M7 — Pre-PR Ctrl+C was instant in TTY provisions. Post-PR, Azure DevOps hosted agents, GitHub Actions ty: true, buildkite --tty users get a modal invisible prompt. Wrapping scripts will hang. Add AZD_PROVISION_INTERRUPT=abort|prompt escape hatch and document in docs/provision-cancellation.md.
- M8 — watchTerminalInterrupt registers signal.Notify but never calls signal.Stop(signalChan). Goroutine + signal subscription leak per AskerConsole. Peer code already pairs them (pkg/ux/internal/input.go:83, �xtensions/azure.ai.agents/internal/cmd/run.go:179).
- M9 — PushInterruptHandler's closure only truncates when len == idx+1. Out-of-order pop silently becomes a no-op — orphaned handler intercepts Ctrl+C forever. Panic in dev builds or trace-warn.
Test coverage (user's explicit ask on industry-standard Ctrl+C)
- M10 — pkg/input/interrupt_test.go has zero race/signal-delivery/panic/re-entry tests. This is the primary test surface for the new subsystem. Missing: concurrent Push/Pop under -race, handler that recursively pushes, handler that panics, multiple SIGINTs delivered while a handler runs.
- M11 — pkg/infra/provisioning/bicep/interrupt_test.go only covers the 2 pure helpers (TestIsTerminalProvisioningState, TestApplyInterruptOutcome). Missing: prompt accept → Cancel happy path, prompt accept → poll-budget-exhaust → ErrDeploymentCancelTimeout, cancel-submit timeout (30s), deployment already terminal on first Get → ErrDeploymentCancelTooLate, prompt reject → ErrDeploymentInterruptedLeaveRunning, non-TTY fallback, ErrCancelNotSupported → leave-running + elemetryValue=leave_running.
- M12 — No test asserts provision.cancellation emits the 6 defined outcomes (
one, leave_running, canceled, cancel_too_late, cancel_timed_out, cancel_failed). Silent telemetry regressions will sneak in.
Standards
- M13 — erminalToOutcome(state, portalUrl, p, ctx) has ctx as last parameter and takes *BicepProvider as a parameter. Per AGENTS.md ctx must be first. Make it a method on BicepProvider like its neighbors (
unInterruptPrompt, cancelAndAwaitTerminal, installDeploymentInterruptHandler).
🟢 Low (9)
- L1 — Stacks cancel maps to internal.cancel_not_supported in classifySentinel but surfaces as ErrDeploymentInterruptedLeaveRunning (legitimate user outcome). Misclassifies legitimate UX as an internal error. Emit a secondary attribute like provision.cancellation.reason=stacks_unsupported.
- L2 — �pplyInterruptOutcome uses mt.Errorf("%w: %w", outcome.err, deployErr). If the deploy failed with a non-cancel error racing the prompt, user sees "deployment was canceled by user request: template validation failed" even though cancel never happened. Use �rrors.Join or branch on �rrors.Is(deployErr, context.Canceled).
- L3 — context.WithoutCancel decoupling in cancel/poll is untested. A future refactor could silently pass deployCtx and kill cancel requests.
- L4 — scope_test.TestScopeCancel has no case asserting �rrors.Is(target.Cancel(ctx), azapi.ErrCancelNotSupported) for the stacks scope.
- L5 — lastState hoisted outside the loop but only read within one inner block — inline it.
- L6 — erminalToOutcome has no doc comment; peers do.
- L7 — Mixed American/British spelling across one feature: "Cancelling", "Cancellation", vs ARM's Canceled / Canceling and existing sentinel ErrDeploymentCanceledByUser. Pick American to match ARM + codebase.
- L8 — Stray blank line mid-|| chain in cmd/middleware/error.go:89.
- L9 — Follow-up: no gRPC hook for extensions to register interrupt handlers. Not blocking this PR, but worth a tracking issue — today extensions running long ops can't opt into graceful cancel.
Regression audit
Verified clean:
- pkg/ux/internal/input.go prompt SIGINT — independent signal.Notify channel, coexists correctly
- �xtensions/azure.ai.agents — separate process group
- Hooks, gRPC extensions, monitor, deploy, package — no PushInterruptHandler callers; pre-PR behavior preserved
Flying blind (no regression tests):
- Hooks SIGINT forwarding on Unix — signal.Notify(os.Interrupt) now consumes the terminal-generated SIGINT before the kernel sends it to the process group. Child hook processes may no longer receive SIGINT when the user presses Ctrl+C during hook execution.
- �zd up UX consistency — provision phase gets the new prompt; deploy phase still hard-exits. Inconsistent across one command.
- SIGTERM — new handler stack covers SIGINT only. kill bypasses all this. Worth documenting explicitly.
Excellent feature overall. H1–H4 are the blockers. M-findings are a mix of architectural consistency (aligns with the direction #7797 is taking) and the test coverage that your "industry-standard" framing deserves. Happy to discuss any of these.
wbreza
left a comment
There was a problem hiding this comment.
Reviewing as wbreza. Blocking on H1–H4 — genuine correctness/industry-standard Ctrl+C defects not covered by the existing 52 comments. Rest are architectural consistency + test coverage gaps to consider.
🔴 High — must-fix before merge (4)
H1 — Race between default branch and signal-goroutine handler dispatch
pkg/infra/provisioning/bicep/bicep_provider.go:790–802 + pkg/input/console.go:1000–1018
In watchTerminalInterrupt there are three sequential steps after SIGINT: ryStartInterruptHandler() → currentInterruptHandler() → handler() (where close(started) lives). Between step B and C the Go scheduler can run the main goroutine, which sees <-interruptStarted not yet closed, takes the default branch, and calls cleanupOnce() to pop + cancelDeploy(). Two failure modes:
- The captured handler runs the prompt over success output; the interruptCh send has no reader.
- If the pop happened before B, currentInterruptHandler() returns
il and the goroutine hits os.Exit(1) after success.
Fix options: expose TryClaimInterrupt/ReleaseInterrupt so the default branch acquires the running slot before popping; or add an OnDispatch pre-handler callback invoked by the signal goroutine before handler() so interruptStarted is closed before handler() begins.
H2 — Handler panic leaves process unkillable from Ctrl+C
pkg/input/console.go:1013
go handled := handler() finishInterruptHandler() // never reached on panic
No
ecover(), not deferred. If a handler panics, interruptRunning=true forever AND the signal goroutine dies, so or range signalChan stops draining — every future Ctrl+C is silently dropped AND Go's default SIGINT handler is not restored. User must SIGKILL.
Fix:
go defer func() { recover(); signal.Stop(signalChan) }() for range signalChan { if !tryStartInterruptHandler() { continue } handler := currentInterruptHandler() func() { defer finishInterruptHandler() defer func() { if r := recover(); r != nil { log.Printf("interrupt handler panic: %v", r) } }() handled = handler() }() if !handled { os.Exit(1) } }
H3 — Second Ctrl+C is silently dropped; no industry-standard force-exit escape
pkg/infra/provisioning/bicep/interrupt.go (prompt at ~line 75, poll at 185–281)
Two windows where a second Ctrl+C has no effect:
- While the console.Select prompt is showing — ryStartInterruptHandler() gate + sync.OnceValue cached return both swallow the press.
- During the 2-minute cancel-await poll — interruptRunning stays true, all further SIGINTs are dropped silently with zero user feedback.
This violates the near-universal POSIX CLI convention — kubectl, erraform apply, docker compose down, git, docker run,
pm, cargo, and all azd peers treat a 2nd Ctrl+C as force-exit. Users who hit Ctrl+C by accident, or who see the cancel wait hanging, have no escape short of SIGKILL.
Fix: Track a press counter separately from interruptRunning. On 2nd press while a handler is running → os.Exit(130) (standard 128+SIGINT). Or nest a PushInterruptHandler inside the poll that aborts the wait and returns ErrDeploymentInterruptedLeaveRunning on re-press.
H4 — Process-global interrupt stack breaks IoC, parallel tests, multi-console hosts
pkg/input/interrupt.go:22–25
interruptStack, interruptMu, interruptRunning are package vars. pkg/input is shared UX infrastructure consumed via IoC — every other piece of Console state is scoped to the instance. Impacts:
- .Parallel() tests using input collide silently. The current interrupt_test.go has to avoid parallelism with no documentation.
- Embeddings that host multiple consoles (MCP server, dev-center-mode, potential agent harness hosting multiple azd sessions) share one handler stack — handlers leak across sessions.
- Tests can't override the broker for unit isolation.
Fix: Define InterruptBroker interface with Push(InterruptHandler) (pop func()), register as a singleton in cmd/container.go, and inject into AskerConsole + BicepProvider. Keep the package-global only as a default-adapter fallback.
🟡 Medium (13) — architectural consistency + test coverage
Architecture / API design
- M1 — DeploymentService.CancelSubscriptionDeployment + CancelResourceGroupDeployment duplicate the N² scope surface. Repo guidance says "extract shared helpers for multiple scopes". Either (a) CancelDeployment(ctx, scope) taking infra.Deployment, or (b) drop from the interface — each scope impl has a single caller (SubscriptionDeployment.Cancel / ResourceGroupDeployment.Cancel) that can call �rmDeployments.CancelAtSubscriptionScope directly.
- M2 — ErrCancelNotSupported lives in pkg/azapi but is consumed via infra.Deployment in �icep/interrupt.go. Terraform/Pulumi will need the same sentinel and will either have to import �zapi or redefine. Move to pkg/infra/provisioning/cancel.go.
- M3 — 5 new sentinels duplicated across cmd/middleware/error.go:90–96 (shouldSkipAgentHandling) AND internal/cmd/errors.go:287–299 (classifySentinel). #7797 is solving the same pattern with AuthInteractionError. Introduce UserDecisionError interface with UserDecisionCode() string. One �rrors.AsType check collapses both lists and absorbs ErrAbortedByUser, ErrSamplingDenied, ErrPreviewNotSupported as a follow-up cleanup.
- M4 — BicepProvider.installDeploymentInterruptHandler calls console.Select, manages spinners, formats portal URLs. Provider owns interactive UX copy. Introduce a small InterruptDecider interface in pkg/infra/provisioning (returns a InterruptChoice enum). Default impl lives in pkg/input / pkg/ux. Terraform won't copy-paste this when it adds support.
Correctness / robustness
- M5 — provision.cancellation attribute is only emitted by the Bicep path (3 call sites). Terraform and Stacks runs never emit it. count by (provision.cancellation) Kusto aggregations under-count. Either emit
one in a shared wrapper or rename the key to provision.bicep.cancellation. - M6 — Non-TTY contract relies on console.Select returning an error. With --no-prompt or a pipe, Select may return a zero-value choice (not error) and the prompt can hang. Upfront check: if !console.IsUnformatted() || console.IsNoPrompt() { return leaveRunningOutcome } before calling Select.
- M7 — Pre-PR Ctrl+C was instant in TTY provisions. Post-PR, Azure DevOps hosted agents, GitHub Actions ty: true, buildkite --tty users get a modal invisible prompt. Wrapping scripts will hang. Add AZD_PROVISION_INTERRUPT=abort|prompt escape hatch and document in docs/provision-cancellation.md.
- M8 — watchTerminalInterrupt registers signal.Notify but never calls signal.Stop(signalChan). Goroutine + signal subscription leak per AskerConsole. Peer code already pairs them (pkg/ux/internal/input.go:83, �xtensions/azure.ai.agents/internal/cmd/run.go:179).
- M9 — PushInterruptHandler's closure only truncates when len == idx+1. Out-of-order pop silently becomes a no-op — orphaned handler intercepts Ctrl+C forever. Panic in dev builds or trace-warn.
Test coverage (user's explicit ask on industry-standard Ctrl+C)
- M10 — pkg/input/interrupt_test.go has zero race/signal-delivery/panic/re-entry tests. This is the primary test surface for the new subsystem. Missing: concurrent Push/Pop under -race, handler that recursively pushes, handler that panics, multiple SIGINTs delivered while a handler runs.
- M11 — pkg/infra/provisioning/bicep/interrupt_test.go only covers the 2 pure helpers (TestIsTerminalProvisioningState, TestApplyInterruptOutcome). Missing: prompt accept → Cancel happy path, prompt accept → poll-budget-exhaust → ErrDeploymentCancelTimeout, cancel-submit timeout (30s), deployment already terminal on first Get → ErrDeploymentCancelTooLate, prompt reject → ErrDeploymentInterruptedLeaveRunning, non-TTY fallback, ErrCancelNotSupported → leave-running + elemetryValue=leave_running.
- M12 — No test asserts provision.cancellation emits the 6 defined outcomes (
one, leave_running, canceled, cancel_too_late, cancel_timed_out, cancel_failed). Silent telemetry regressions will sneak in.
Standards
- M13 — erminalToOutcome(state, portalUrl, p, ctx) has ctx as last parameter and takes *BicepProvider as a parameter. Per AGENTS.md ctx must be first. Make it a method on BicepProvider like its neighbors (
unInterruptPrompt, cancelAndAwaitTerminal, installDeploymentInterruptHandler).
🟢 Low (9)
- L1 — Stacks cancel maps to internal.cancel_not_supported in classifySentinel but surfaces as ErrDeploymentInterruptedLeaveRunning (legitimate user outcome). Misclassifies legitimate UX as an internal error. Emit a secondary attribute like provision.cancellation.reason=stacks_unsupported.
- L2 — �pplyInterruptOutcome uses mt.Errorf("%w: %w", outcome.err, deployErr). If the deploy failed with a non-cancel error racing the prompt, user sees "deployment was canceled by user request: template validation failed" even though cancel never happened. Use �rrors.Join or branch on �rrors.Is(deployErr, context.Canceled).
- L3 — context.WithoutCancel decoupling in cancel/poll is untested. A future refactor could silently pass deployCtx and kill cancel requests.
- L4 — scope_test.TestScopeCancel has no case asserting �rrors.Is(target.Cancel(ctx), azapi.ErrCancelNotSupported) for the stacks scope.
- L5 — lastState hoisted outside the loop but only read within one inner block — inline it.
- L6 — erminalToOutcome has no doc comment; peers do.
- L7 — Mixed American/British spelling across one feature: "Cancelling", "Cancellation", vs ARM's Canceled / Canceling and existing sentinel ErrDeploymentCanceledByUser. Pick American to match ARM + codebase.
- L8 — Stray blank line mid-|| chain in cmd/middleware/error.go:89.
- L9 — Follow-up: no gRPC hook for extensions to register interrupt handlers. Not blocking this PR, but worth a tracking issue — today extensions running long ops can't opt into graceful cancel.
Regression audit
Verified clean:
- pkg/ux/internal/input.go prompt SIGINT — independent signal.Notify channel, coexists correctly
- �xtensions/azure.ai.agents — separate process group
- Hooks, gRPC extensions, monitor, deploy, package — no PushInterruptHandler callers; pre-PR behavior preserved
Flying blind (no regression tests):
- Hooks SIGINT forwarding on Unix — signal.Notify(os.Interrupt) now consumes the terminal-generated SIGINT before the kernel sends it to the process group. Child hook processes may no longer receive SIGINT when the user presses Ctrl+C during hook execution.
- �zd up UX consistency — provision phase gets the new prompt; deploy phase still hard-exits. Inconsistent across one command.
- SIGTERM — new handler stack covers SIGINT only. kill bypasses all this. Worth documenting explicitly.
Excellent feature overall. H1–H4 are the blockers. M-findings are a mix of architectural consistency (aligns with the direction #7797 is taking) and the test coverage that your "industry-standard" framing deserves. Happy to discuss any of these.
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
jongio
left a comment
There was a problem hiding this comment.
Three items I didn't see covered elsewhere in the existing thread.
N1 [high] - race between deploy-success and the interrupt handler goroutine
In watchTerminalInterrupt the handler now runs on a spawned goroutine (pkg/input/console.go:1003), but BicepProvider.Deploy only does a non-blocking read of interruptCh after cleanupOnce():
cleanupOnce() // pops handler, cancels deployCtx. Doesn't wait for an in-flight handler.
select {
case outcome := <-interruptCh:
...
default: // handler hasn't written yet? we take this path.
}If SIGINT arrives in the narrow window where deployModule has just returned successfully, the signal goroutine can capture the handler via currentInterruptHandler() before cleanupOnce pops, then spawn the handler while the main thread races past, takes the default branch, and returns success. The handler goroutine now shows the interrupt prompt over success output, and if the user picks "Cancel" it calls ARM Cancel on a deployment that already succeeded.
This is separate from the "second Ctrl+C is dropped" concern in H3, it's a single Ctrl+C arriving at the wrong moment. cleanupOnce needs to either wait for any in-flight handler to drain, or the handler closure needs a "deploy already completed" guard before calling runInterruptPrompt.
N2 [info] - quick check on H2's premise
On this branch finishInterruptHandler() is deferred at pkg/input/console.go:1004, and the handler runs on a spawned goroutine, so for range signalChan keeps draining even if the handler panics. Unhandled goroutine panics crash the whole process by Go's default behavior, so "process unkillable" doesn't match the runtime semantics here. Adding recover() inside the handler goroutine is still reasonable defense-in-depth, but worth verifying the failure modes before refactoring on H2's described premise.
N3 [low] - test isolation in interrupt_test.go
TestPushInterruptHandler_LIFO and TestTryStartInterruptHandler_PreventsConcurrent mutate interruptStack and interruptRunning (package globals) but don't reset them via t.Cleanup. If a future test panics before calling the returned pop() or the deferred finishInterruptHandler, residual state leaks into subsequent tests. Complements M10's test-coverage gap, and the broker abstraction from H4 would eliminate this class of fragility outright.
Fixes from wbreza's review: - H1/N1: Race between deploy-success and interrupt handler — replaced select-based check with atomic CAS state machine (deployStateRunning → deployStateInterrupting or deployStateCompleted) so the handler and Deploy goroutine never conflict. - H2: Panic in interrupt handler — added recover() with stack trace logging in watchTerminalInterrupt so a handler panic doesn't leave the process unkillable. - H3: Second Ctrl+C force-exit — added forceExitPending counter in interrupt.go; second suppressed Ctrl+C while a handler is running triggers os.Exit(130) matching POSIX convention (kubectl, terraform). - M13: terminalToOutcome now a BicepProvider method with ctx as first parameter per AGENTS.md convention. - L7: Spelling consistency — 'Cancelling' → 'Canceling' to match ARM API and codebase convention. - L8: Removed stray blank line in cmd/middleware/error.go. Fixes from jongio's review: - N1: Same race fix as H1 above (CAS state machine). - N2: Panic recovery (defense-in-depth per Jon's suggestion). - N3: Test cleanup — added t.Cleanup() for PushInterruptHandler pops and finishInterruptHandler to prevent global state leaks on assertion failure. Fixes from Copilot bot review: - Unbounded Get call in cancel-request error path — added context.WithTimeout wrapper (30s). - DeploymentUrl fetch in prompt — added timeout to prevent indefinite blocking on slow/unreachable ARM. - Deleted state mismatch — added explicit case in terminalToOutcome for DeploymentProvisioningStateDeleted. - Test cleanup in interrupt_test.go (same as N3 above). New tests: - TestForceExitCounter: validates force-exit on 2nd suppressed Ctrl+C. - TestForceExitCounter_ResetsOnNewHandler: ensures counter resets between handler lifecycles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review feedback addressed@wbreza @jongio — pushed a commit addressing the blocking and actionable findings from both reviews. Here is the breakdown: Addressed in this push
New tests
Deferred to follow-up (architectural)
All tests pass ( |
| if portalUrl != "" { | ||
| p.console.Message(ctx, | ||
| output.WithHighLightFormat("The Azure deployment will continue running. Track it here:\n %s", | ||
| portalUrl)) | ||
| } | ||
| return interruptOutcome{ | ||
| err: provisioning.ErrDeploymentInterruptedLeaveRunning, | ||
| telemetryValue: "leave_running", | ||
| } | ||
| } | ||
|
|
||
| switch choice { | ||
| case 0: // leave running | ||
| if portalUrl != "" { | ||
| p.console.Message(ctx, | ||
| output.WithHighLightFormat("The Azure deployment will continue running. Track it here:\n %s", | ||
| portalUrl)) | ||
| } |
| p.console.Message(ctx, | ||
| output.WithHighLightFormat( | ||
| "Canceled deployment is recorded in the portal:\n %s", portalUrl)) | ||
| } |
There was a problem hiding this comment.
Same here (URL formatting)
There was a problem hiding this comment.
Does this cancel nested deployments? I tried Ctrl+C to cancel the deployment, which completed. But when ran azd provision it failed with:
(x) Failed: Validating deployment
ERROR: deployment failed: error deploying infrastructure: validating deployment to subscription:
Validation Error Details:
DeploymentActive: The deployment with resource id '/subscriptions/<id>/resourceGroups/microsoft-rg/providers/Microsoft.Resources/deployments/ai-project' cannot be saved, because this would overwrite an existing deployment which is still active. If this is an intentional re-deployment, please wait for the original deployment to complete, or cancel it before re-attempting the operation. The previous deployment was started at '4/23/2026 12:19:18 AM' with correlationId '<id>', and will expire at '4/30/2026 12:19:18 AM' if it does not complete before then. Please see https://aka.ms/arm-deploy-resources for usage details.
It turned out there was a nested deployment still in progress and I had to manually cancel it.
There was a problem hiding this comment.
It should cancel the top level deployment which should have fall back to cancel all nested children. It should just take some time to propagate, but I will double check this expectation. Thanks
There was a problem hiding this comment.
I noticed when I Ctrl+C while it's trying to cancel the deployment, it ignores/swallows them, which is a bit inconsistent with the changes made earlier to treat the second Ctrl+C as force-exit during the prompt.
Any thoughts on if we should allow users to force-exit during this window as well? I guess the deployment could be in an unknown state (could be cancelled or not).
There was a problem hiding this comment.
In comments above folks were in favor of not swallowing the second cancellation. Initially I wanted to swallow it b/c I usually press ctr+c like a maniac to stop a process 😆, so I would probably miss the change to ask azd to cancel.
But I updated that. I don't have a strong opinion,
jongio
left a comment
There was a problem hiding this comment.
Addresses my prior feedback. Verified the new commit (bea4593...199a887).
The deployState atomic CAS approach correctly closes the race from N1. Both the handler and Deploy atomically compete for state ownership from deployStateRunning - clean lock-free coordination with no window for both sides to proceed. Panic recovery (N2) and t.Cleanup registrations (N3) also look correct.
Force-exit on third Ctrl+C (first=handled, second=suppressed, third=exit) is a good addition. More forgiving than strict POSIX convention (kubectl/terraform exit on 2nd total press), which prevents accidental double-press from killing the process during the cancel flow.
+1 on @JeffreyCA's output.WithLinkFormat suggestion for portal URLs. provisioning_progress_display.go:107 already uses it for deployment links, and the style guide lists WithLinkFormat as the convention for URLs.
| if !tryStartInterruptHandler() { | ||
| // A handler is already running. A second Ctrl+C while a | ||
| // handler is active is treated as a force-exit (standard | ||
| // POSIX convention: kubectl, terraform, docker, etc.). |
There was a problem hiding this comment.
Nit: this comment says "A second Ctrl+C" but the actual behavior is the third total press triggers force-exit (first=handler starts, second=suppressed/pending, third=exit). The comment in interrupt.go correctly says "second suppressed press". Consider aligning the wording here so readers don't expect 2nd-press exit behavior.
wbreza
left a comment
There was a problem hiding this comment.
Re-reviewed after addressing feedback (commit 199a887). Verified the original high-priority findings:
Resolved ✅
H1 (Race condition) — Atomic CAS state machine (\deployStateRunning\ → \Interrupting/\Completed) properly eliminates the race between signal handler and deploy completion. Both sides compete via \CompareAndSwap; the loser waits on channels. Clean fix.
H2 (Panic recovery) — \defer recover()\ with
untime.Stack()\ logging wraps handler execution. \inishInterruptHandler()\ always runs to reset state, ensuring the process remains killable.
H3 (Force-exit on 2nd Ctrl+C) — \orceExitPending\ counter with \os.Exit(130)\ on second suppressed press. Counter resets on new handler start. Follows POSIX convention correctly.
Remaining ⚠️
H4 (Process-global interrupt state) — \interruptStack, \interruptMu, \interruptRunning, \orceExitPending\ remain package-global variables. No \InterruptBroker\ interface or IoC injection. This means parallel tests collide on shared state and multi-console hosts would leak handlers across sessions. Acceptable for the single-process CLI today, but should be tracked as a follow-up to avoid issues when adding parallel test coverage or multi-session support.
New observations
- Test coverage gap: Core interrupt flow (
unInterruptPrompt, \cancelAndAwaitTerminal, \installDeploymentInterruptHandler) has no unit tests — only the utility functions are covered. Targeted tests with faked \console.Select()\ and \deployment.Cancel()/\Get()\ would significantly reduce risk. - Minor: In \cancelAndAwaitTerminal, when the cancel API fails and the fallback \Get()\ also fails, \getErr\ is silently discarded. A \log.Printf\ for that error would improve production diagnosability.

Closes #2810
Summary
When a user presses Ctrl+C during `azd provision` or `azd up` while a Bicep deployment is in flight on Azure, azd now pauses and asks whether to:
Previously, Ctrl+C exited azd immediately while the deployment kept running on Azure with no easy follow-up.
Behavior
Provider scope
Implementation
Tests
Validation
Docs