Commit dfb1ed8

LEANDERANTONY and claude committed
fix(workspace): unfreeze re-run pipeline visual + add cooperative run cancel
Two coupled analysis-run-lifecycle changes (AnalysisRunner.tsx carries both, so they ship together).

1. fix: the re-run pipeline visual was frozen at all-done.

On a re-run the parent keeps the previous completed `analysisState` mounted (it is only swapped when the NEW result lands). AnalysisRunner tested `analysisState` BEFORE `analysisLoading` in the stage computation, status pip, sub-text and summary, so the stale completed result short-circuited every `analysisState ? : analysisLoading ?` ternary, and every stage stuck at done/100% for the whole re-run. The Run button was the lone loading-first branch, which is exactly why it flipped to "Running..." while the agent cards stayed frozen. Flipped the precedence so `analysisLoading` wins everywhere a re-run needs live state; result-derived stale/outage notices are suppressed while a re-run is in flight.

2. feat: stop a run mid-flight (Cancel/Abort).

There was no off-switch: a misfired premium (gpt-5.5) run burned tokens with no way to stop it. Added cooperative cancellation at the existing stage boundary:

- WorkspaceRunJob.cancel_requested + cancel_workspace_analysis_job() + POST /workspace/analyze-jobs/{job_id}/cancel. WorkspaceRunJobCancelled is a plain Exception (not AppError) so it travels unchanged through the orchestrator's per-agent / AgentExecutionError handlers.
- _update_job_progress (runs at every begin_stage) raises it once the flag is set; _run_job catches it BEFORE the failure handlers and ends the job in a distinct terminal "cancelled" state (no error banner, INFO log).
- Credit auto-refunds: the cancel exception flows through run_workspace_analysis' existing `except BaseException` refund path, so a stopped run costs zero application/premium credits.
- Frontend: cancelWorkspaceAnalysisJob() client; useAnalysisJob treats `cancelled` as terminal (info notice + quota refetch) + cancelAnalysis() + sticky analysisCancelling; "Stop run" button (replaces Clear role while running, "Stopping..." while it unwinds). Idempotent / double-click-guarded; unknown id -> actionable 404.

Honest caveat by design: a Python thread mid-OpenAI-call cannot be force-killed, so cancel lands at the NEXT agent boundary (a few seconds to 30s, <=120s worst case). The UI says so rather than implying instant.

Tests: +10 unit (tests/test_workspace_run_jobs_cancel.py: cancel mechanics, idempotency, the cooperative seam via a faithful fake, regression guard that real failures still report `failed`) + 2 route integration. 67 workspace-suite green (no regression); error-message allowlist green; frontend tsc + eslint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
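The cooperative-cancel flow described in the commit message can be sketched end to end in a few lines. This is a minimal illustration, not the project's code: `RunCancelled`, `Job`, `begin_stage`, `run_job`, `request_cancel` and `demo` are hypothetical stand-ins, and the blocking LLM call is simulated with a `threading.Event`. It shows that the cancel request returns while the job is still `running`, and that the worker only stops when it reaches the next stage boundary.

```python
import threading

TERMINAL = frozenset({"completed", "failed", "cancelled"})


class RunCancelled(Exception):
    """Hypothetical stand-in for WorkspaceRunJobCancelled."""


class Job:
    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.cancel_requested = False
        self.status = "queued"
        self.stages_started: list[str] = []


def begin_stage(job: Job, title: str) -> None:
    """Stage-boundary checkpoint: record progress, or raise if a cancel is pending."""
    with job.lock:
        cancelled = job.cancel_requested
        if not cancelled:
            job.status = "running"
            job.stages_started.append(title)
    if cancelled:
        # Raised outside the lock so the unwinding stack never holds it.
        raise RunCancelled(title)


def run_job(job: Job, stages: list[str], started: threading.Event,
            release: threading.Event) -> None:
    try:
        for title in stages:
            begin_stage(job, title)
            started.set()
            release.wait(timeout=5)  # stand-in for a blocking LLM call
        with job.lock:
            job.status = "completed"
    except RunCancelled:
        with job.lock:
            job.status = "cancelled"


def request_cancel(job: Job) -> None:
    """Only sets the flag; a no-op for jobs that already reached a terminal state."""
    with job.lock:
        if job.status not in TERMINAL:
            job.cancel_requested = True


def demo() -> tuple[Job, str]:
    job = Job()
    started, release = threading.Event(), threading.Event()
    worker = threading.Thread(
        target=run_job, args=(job, ["fit", "tailoring", "review"], started, release)
    )
    worker.start()
    started.wait(timeout=5)      # the first stage is now in flight
    request_cancel(job)          # returns immediately
    status_after_cancel = job.status
    release.set()                # the simulated LLM call returns
    worker.join(timeout=5)
    return job, status_after_cancel
```

In this sketch the first stage always runs to the end of its simulated call; the second stage never starts because its boundary check raises first.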
1 parent 59b1288 commit dfb1ed8

8 files changed

Lines changed: 514 additions & 37 deletions

backend/routers/workspace.py

Lines changed: 26 additions & 0 deletions
@@ -58,6 +58,7 @@
 from backend.services.workspace_run_jobs import (
     JOB_RETRY_AFTER_SECONDS,
     WorkspaceRunJobCapacityError,
+    cancel_workspace_analysis_job,
     get_workspace_analysis_job,
     start_workspace_analysis_job,
 )
@@ -774,6 +775,31 @@ def get_workspace_analysis_job_route(job_id: str):
     return payload


+@router.post(
+    "/analyze-jobs/{job_id}/cancel",
+    response_model=WorkspaceAnalyzeJobStatusResponseModel,
+)
+def cancel_workspace_analysis_job_route(job_id: str):
+    # Cooperative cancel: sets the flag and returns immediately. The
+    # job typically comes back still "running" (the worker observes
+    # the flag at its next stage boundary); the frontend keeps polling
+    # GET /analyze-jobs/{job_id} until it sees the terminal
+    # "cancelled". Idempotent for already-terminal jobs. Same
+    # job_id-scoped access model as the status route (the id is an
+    # unguessable uuid4 hex; no extra auth surface added).
+    payload = cancel_workspace_analysis_job(job_id)
+    if payload is None:
+        raise HTTPException(
+            status_code=404,
+            detail=(
+                "This workflow run is no longer available — it may have "
+                "already finished, or the server restarted. There's "
+                "nothing to stop; run the workflow again if needed."
+            ),
+        )
+    return payload
+
+
 @router.post("/assistant/answer")
 @limiter.limit(LIMIT_LLM)
 def answer_assistant_question(
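The route's contract (idempotent for known jobs, 404 for unknown ids) can be illustrated without a web framework. The following is a sketch under stated assumptions: `NotFound` is a hypothetical stand-in for `HTTPException(status_code=404)`, `JOBS` is a plain dict instead of the locked registry, and the payload shape is invented for illustration rather than taken from `_serialize_job`.

```python
from dataclasses import dataclass
from typing import Optional

TERMINAL = frozenset({"completed", "failed", "cancelled"})


@dataclass
class Job:
    job_id: str
    status: str = "running"
    cancel_requested: bool = False


class NotFound(Exception):
    """Hypothetical stand-in for HTTPException(status_code=404)."""

    status_code = 404


# Assumption: a single-threaded in-memory registry; the real one is lock-guarded.
JOBS: dict[str, Job] = {}


def cancel_job(job_id: str) -> Optional[dict]:
    """Set the cancel flag on a live job; return None when the id is unknown."""
    job = JOBS.get(job_id)
    if job is None:
        return None
    if job.status not in TERMINAL:
        job.cancel_requested = True
    # Assumed payload shape, for illustration only.
    return {
        "job_id": job.job_id,
        "status": job.status,
        "cancel_requested": job.cancel_requested,
    }


def cancel_route(job_id: str) -> dict:
    """Map the service-level None to a 404-style error, as the route does."""
    payload = cancel_job(job_id)
    if payload is None:
        raise NotFound(job_id)
    return payload
```

Calling `cancel_route` twice on the same running job returns the same payload both times, which is the double-click guarantee the commit message describes.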

backend/services/workspace_run_jobs.py

Lines changed: 107 additions & 5 deletions
@@ -27,6 +27,31 @@ class WorkspaceRunJobCapacityError(RuntimeError):
     """Raised when `_RUN_SEMAPHORE` is exhausted at request time."""


+class WorkspaceRunJobCancelled(Exception):
+    """Cooperative-cancellation signal for an in-flight analysis job.
+
+    Deliberately a plain `Exception` (NOT an `AppError` / not an
+    `AgentExecutionError`): it must travel UNCHANGED through every
+    handler between the stage-boundary progress callback and
+    `_run_job`'s terminal handler —
+      * the orchestrator's per-agent `except AgentExecutionError` /
+        `except OpenAIUnavailableError` (no match → not swallowed),
+      * `ApplicationOrchestrator.run`'s `except AgentExecutionError`
+        (no match → not turned into a deterministic fallback),
+      * `run_workspace_analysis`'s `except BaseException` (matches →
+        refunds the consumed quota credit, then re-raises — so a
+        cancelled run never costs the user an application credit).
+    `_run_job` catches it explicitly and marks the job `cancelled`
+    (a normal user action, not a failure).
+    """
+
+
+# Terminal statuses: a job here is done moving and a cancel request is
+# a no-op (idempotent — a double-click or a cancel that races
+# completion must not error).
+_TERMINAL_JOB_STATUSES = frozenset({"completed", "failed", "cancelled"})
+
+
 @dataclass
 class WorkspaceRunJob:
     job_id: str
@@ -36,6 +61,13 @@ class WorkspaceRunJob:
     progress_percent: int = 3
     result: dict[str, Any] | None = None
     error_message: str | None = None
+    # Set by `cancel_workspace_analysis_job`; observed by the worker at
+    # the next stage boundary (begin_stage → progress callback →
+    # `_update_job_progress`). Cooperative because a Python thread
+    # blocked inside an OpenAI call cannot be force-killed safely, so
+    # cancellation takes effect at the next agent boundary (≤ one
+    # agent / ≤ the per-call timeout), never mid-LLM-call.
+    cancel_requested: bool = False
     created_at: float = field(default_factory=time.time)
     updated_at: float = field(default_factory=time.time)

@@ -65,15 +97,30 @@ def _serialize_job(job: WorkspaceRunJob) -> dict[str, Any]:


 def _update_job_progress(job_id: str, title: str, detail: str, value: int) -> None:
+    # This runs on every pipeline stage boundary (the orchestrator's
+    # `begin_stage` → `_emit_progress` → this callback), which makes it
+    # the natural cooperative-cancellation checkpoint: if a cancel was
+    # requested while the previous agent was working, we abandon the
+    # progress write and raise so the run unwinds at the boundary
+    # instead of advancing into the next (possibly premium) LLM call.
+    cancelled = False
     with _LOCK:
         job = _JOBS.get(job_id)
         if job is None:
             return
-        job.status = "running"
-        job.stage_title = title
-        job.stage_detail = detail
-        job.progress_percent = max(0, min(100, int(value)))
-        job.updated_at = time.time()
+        if job.cancel_requested:
+            cancelled = True
+        else:
+            job.status = "running"
+            job.stage_title = title
+            job.stage_detail = detail
+            job.progress_percent = max(0, min(100, int(value)))
+            job.updated_at = time.time()
+    if cancelled:
+        # Raise OUTSIDE the lock — the unwinding stack (orchestrator →
+        # run_workspace_analysis' refund → _run_job) must never contend
+        # on _LOCK while this propagates.
+        raise WorkspaceRunJobCancelled(job_id)


 def _run_job(
@@ -117,6 +164,31 @@ def _run_job(
             job.stage_title = "Workflow crew"
             job.stage_detail = "All agents are done. Your tailored documents are ready to review."
             job.updated_at = time.time()
+    except WorkspaceRunJobCancelled:
+        # A normal user action, not a failure — log at INFO and end
+        # the job in a distinct terminal state (NOT "failed", so the
+        # UI doesn't show an error banner). The quota credit was
+        # already refunded by run_workspace_analysis' BaseException
+        # handler on the way up, so the copy can promise that.
+        log_event(
+            LOGGER,
+            20,
+            "workspace_run_job_cancelled",
+            "The background workspace analysis job was cancelled by the user before completion.",
+            job_id=job_id,
+        )
+        with _LOCK:
+            job = _JOBS.get(job_id)
+            if job is None:
+                return
+            job.status = "cancelled"
+            job.stage_title = "Run stopped"
+            job.stage_detail = (
+                "You stopped this run before it finished. No credit "
+                "was used — start a new run whenever you're ready."
+            )
+            job.error_message = None
+            job.updated_at = time.time()
     except AppError as error:
         message = error.user_message
         log_event(
@@ -211,3 +283,33 @@ def get_workspace_analysis_job(job_id: str) -> dict[str, Any] | None:
         if job is None:
             return None
         return _serialize_job(job)
+
+
+def cancel_workspace_analysis_job(job_id: str) -> dict[str, Any] | None:
+    """Request cooperative cancellation of an in-flight analysis job.
+
+    Returns the serialized job, or ``None`` when ``job_id`` is unknown
+    (pruned past TTL, wrong id, or the single-worker process restarted
+    and lost the in-memory registry — the caller maps this to a 404).
+
+    Idempotent by design: cancelling an already-terminal job
+    (completed / failed / cancelled) just returns its current state. A
+    double-click, or a Stop that races the run finishing, must never
+    error.
+
+    This only *sets the flag*. The worker thread is blocked inside the
+    synchronous pipeline (often mid-OpenAI-call) and a Python thread
+    can't be force-killed safely, so the request returns immediately
+    with the job still ``running``; the worker observes the flag at its
+    next stage boundary and flips the job to ``cancelled`` within
+    ≤ one agent. The frontend keeps polling until that terminal state.
+    """
+    with _LOCK:
+        _prune_jobs()
+        job = _JOBS.get(job_id)
+        if job is None:
+            return None
+        if job.status not in _TERMINAL_JOB_STATUSES:
+            job.cancel_requested = True
+            job.updated_at = time.time()
+        return _serialize_job(job)
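The reason the cancel signal is a plain `Exception` is easier to see with the handler chain written out. The sketch below is an assumption-laden miniature, not the real orchestrator: `AgentExecutionError` and `RunCancelled` are local stand-ins, and `orchestrate`, `run_with_credit` and `run_job` are hypothetical names for the three layers the docstring lists. It shows the signal skipping the agent-error fallback, triggering the `except BaseException` refund, and landing in the dedicated handler that must come before the generic failure handler.

```python
from typing import Callable


class AgentExecutionError(Exception):
    """Local stand-in for the orchestrator's agent failure type."""


class RunCancelled(Exception):
    """Local stand-in for WorkspaceRunJobCancelled (a plain Exception)."""


def orchestrate(progress: Callable[[], None], log: list[str]) -> str:
    try:
        progress()  # stage boundary: may raise RunCancelled
        return "ai-result"
    except AgentExecutionError:
        # Only agent failures become a baseline result; cancel does not match.
        log.append("fallback")
        return "baseline"


def run_with_credit(progress: Callable[[], None], ledger: dict[str, int],
                    log: list[str]) -> str:
    ledger["credits"] -= 1  # consume a credit up front
    try:
        return orchestrate(progress, log)
    except BaseException:
        ledger["credits"] += 1  # refund, then let the exception continue
        log.append("refund")
        raise


def run_job(progress: Callable[[], None], ledger: dict[str, int],
            log: list[str]) -> str:
    try:
        run_with_credit(progress, ledger, log)
        return "completed"
    except RunCancelled:
        # Must precede the generic handler, since RunCancelled is an Exception.
        return "cancelled"
    except Exception:
        return "failed"


def cancel_at_boundary() -> None:
    raise RunCancelled("job-1")


def agent_breaks() -> None:
    raise AgentExecutionError("agent failed")
```

If `RunCancelled` subclassed `AgentExecutionError` in this sketch, `orchestrate` would swallow it and return a baseline result, and the run would continue instead of stopping.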

frontend/src/components/workspace/AnalysisRunner.tsx

Lines changed: 68 additions & 31 deletions
@@ -37,6 +37,14 @@ export type AnalysisRunnerProps = {
   analysisIsStale: boolean;
   currentWorkflowStage: WorkflowStage | null;
   onRunAnalysis: () => void;
+  /** Request cooperative cancellation of the in-flight run. Only
+   * meaningful while `analysisLoading` is true. */
+  onCancelAnalysis: () => void;
+  /** True from Stop-pressed until the run actually ends. Drives the
+   * Stop button's "Stopping…" + disabled state — cancel is
+   * cooperative (effective at the next agent boundary), not instant,
+   * and the UI must not pretend otherwise. */
+  analysisCancelling: boolean;
   onClearRole: () => void;
   /** True when both a resume + JD are present. */
   ready: boolean;
@@ -120,6 +128,8 @@ export function AnalysisRunner({
   analysisIsStale,
   currentWorkflowStage,
   onRunAnalysis,
+  onCancelAnalysis,
+  analysisCancelling,
   onClearRole,
   ready,
   quota,
@@ -169,6 +179,13 @@ export function AnalysisRunner({
   // and a value. We mark stages BEFORE the live one as done, the live
   // one as active w/ the live percent, and stages AFTER as next.
   // After analysis completes, every stage ticks to done.
+  //
+  // Precedence: `analysisLoading` is checked BEFORE `analysisState`.
+  // On a re-run the parent keeps the previous completed
+  // `analysisState` mounted (it's only swapped when the NEW result
+  // lands), so checking it first would freeze every stage at
+  // done/100% for the whole re-run — the live pipeline would never
+  // re-animate. Loading must win so the cards replay the run.
   const liveIndex = liveStageTitle
     ? PIPELINE_STAGES.findIndex((stage) => stage.key === liveStageTitle)
     : -1;
@@ -178,10 +195,7 @@ export function AnalysisRunner({
     let value = 0;
     let detail = "";

-    if (analysisState) {
-      state = "done";
-      value = 100;
-    } else if (analysisLoading) {
+    if (analysisLoading) {
       if (liveIndex >= 0) {
         if (index < liveIndex) {
           state = "done";
@@ -196,6 +210,9 @@ export function AnalysisRunner({
         value = livePercent ?? 25;
         detail = "Coordinating agents";
       }
+    } else if (analysisState) {
+      state = "done";
+      value = 100;
     }
     return { ...stage, state, value, detail };
   });
@@ -206,14 +223,14 @@ export function AnalysisRunner({
         <div>
           <div className="b-region-title">Workflow run</div>
           <div className="b-region-sub">
-            {analysisState
-              ? `${analysisState.workflow.mode} · ${
-                  analysisState.workflow.review_approved
-                    ? "review approved"
-                    : "review pending"
-                }`
-              : analysisLoading
-                ? "Generating tailored documents…"
+            {analysisLoading
+              ? "Generating tailored documents…"
+              : analysisState
+                ? `${analysisState.workflow.mode} · ${
+                    analysisState.workflow.review_approved
+                      ? "review approved"
+                      : "review pending"
+                  }`
                 : ready
                   ? "Ready to run — both inputs are loaded."
                   : "Need a parsed resume + JD to run."}
@@ -226,17 +243,17 @@ export function AnalysisRunner({
         <div className="b-run-bar-info">
           <span
             className={
-              analysisState
-                ? "rd-pip rd-pip-live"
-                : analysisLoading
-                  ? "rd-pip rd-pip-ready"
+              analysisLoading
+                ? "rd-pip rd-pip-ready"
+                : analysisState
+                  ? "rd-pip rd-pip-live"
                   : "rd-pip"
             }
           >
-            {analysisState
-              ? "Outputs ready"
-              : analysisLoading
-                ? "Running…"
+            {analysisLoading
+              ? "Running…"
+              : analysisState
+                ? "Outputs ready"
                 : ready
                   ? "Idle"
                   : "Inputs needed"}
@@ -290,25 +307,45 @@ export function AnalysisRunner({
           >
             <PlayIcon /> {analysisLoading ? "Running…" : analysisState ? "Re-run" : "Run analysis"}
           </button>
-          <button
-            className="rd-btn rd-btn-danger rd-btn-sm"
-            disabled={analysisLoading}
-            onClick={onClearRole}
-            type="button"
-          >
-            Clear role
-          </button>
+          {analysisLoading ? (
+            // Stop is only meaningful mid-run. Disabled until the
+            // backend job has an id (the queued placeholder carries
+            // job_id "") and while a stop is already unwinding. Cancel
+            // is cooperative — it lands at the next agent boundary, so
+            // the label says "Stopping…" rather than implying instant.
+            <button
+              className="rd-btn rd-btn-danger rd-btn-sm"
+              disabled={analysisCancelling || !analysisJobState?.job_id}
+              onClick={onCancelAnalysis}
+              title="Stop this run. It wraps up after the current step; no application credit is used."
+              type="button"
+            >
+              {analysisCancelling ? "Stopping…" : "Stop run"}
+            </button>
+          ) : (
+            <button
+              className="rd-btn rd-btn-danger rd-btn-sm"
+              disabled={analysisLoading}
+              onClick={onClearRole}
+              type="button"
+            >
+              Clear role
+            </button>
+          )}
         </div>
       </div>

-      {analysisIsStale ? (
+      {analysisIsStale && !analysisLoading ? (
         <div className="b-notice b-notice-warning">
           The inputs changed after the last run. Re-run the workflow to refresh
           your documents.
         </div>
       ) : null}

-      {analysisState?.workflow?.service_unavailable ? (
+      {/* Result-derived notices reflect the PREVIOUS run. Suppress
+          while a re-run is in flight so a stale "OpenAI had a moment"
+          banner doesn't contradict the live "Running…" pipeline. */}
+      {analysisState?.workflow?.service_unavailable && !analysisLoading ? (
         <div className="b-notice b-notice-warning">
           {analysisState.workflow.fallback_reason ||
             "Our AI provider (OpenAI) is having a moment, so we built a baseline version of your application. Re-run in a few minutes for the full AI-tailored result."}
@@ -361,7 +398,7 @@ export function AnalysisRunner({
           line. Hidden on desktop via CSS. The pipeline cards
           themselves are also hidden on mobile in the idle / all-done
           states (see globals.css mobile pass). */}
-      {analysisState ? (
+      {analysisState && !analysisLoading ? (
         <div className="b-pipeline-summary" role="status">
           <span aria-hidden="true" className="b-pipeline-summary-pip">
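The precedence bug fixed in AnalysisRunner.tsx is independent of React, so it can be reproduced as a small pure function. This is a language-neutral sketch written in Python, not a port of the component: `live_state` and `stage_states` are hypothetical names, the `loading_first` switch exists only to compare the old and new ordering, and the sketch assumes the live stage index is known (the component's fallback for an unknown live stage is omitted).

```python
def live_state(index: int, live_index: int) -> str:
    """State of one stage while a run is in flight."""
    if index < live_index:
        return "done"
    if index == live_index:
        return "active"
    return "next"


def stage_states(stage_count: int, live_index: int, loading: bool,
                 has_result: bool, loading_first: bool = True) -> list[str]:
    """Compute per-stage states; `loading_first=False` mimics the old ordering."""
    states = []
    for index in range(stage_count):
        if loading_first:
            if loading:
                state = live_state(index, live_index)
            elif has_result:
                state = "done"
            else:
                state = "next"
        else:
            # Old ordering: a stale completed result wins over the live run.
            if has_result:
                state = "done"
            elif loading:
                state = live_state(index, live_index)
            else:
                state = "next"
        states.append(state)
    return states
```

On a re-run both `loading` and `has_result` are true. With the old ordering every stage reports `done`; with loading checked first the stages follow the live index.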

frontend/src/components/workspace/WorkspaceShell.tsx

Lines changed: 4 additions & 0 deletions
@@ -497,6 +497,8 @@ export function WorkspaceShell() {
     setAnalysisJobState: _setAnalysisJobState,
     currentWorkflowStage,
     runAnalysis: handleRunAnalysis,
+    cancelAnalysis: handleCancelAnalysis,
+    analysisCancelling,
     resetAnalysis,
   } = useAnalysisJob({
     resumeText,
@@ -2337,7 +2339,9 @@ export function WorkspaceShell() {
           analysisJobState={analysisJobState}
           analysisLoading={analysisLoading}
           analysisState={analysisState}
+          analysisCancelling={analysisCancelling}
           currentWorkflowStage={currentWorkflowStage}
+          onCancelAnalysis={() => void handleCancelAnalysis()}
           onClearRole={clearWorkspaceRole}
           onPremiumChange={setPremium}
           onPremiumLockedUpgrade={() =>
