Skip to content

Build resume lifecycle + SKIPPED status UI polish#141

Merged
andhus merged 3 commits into
mainfrom
feature/build-resume-status
May 8, 2026
Merged

Build resume lifecycle + SKIPPED status UI polish#141
andhus merged 3 commits into
mainfrom
feature/build-resume-status

Conversation

@andhus
Copy link
Copy Markdown
Collaborator

@andhus andhus commented May 7, 2026

Summary

Fixes a UX bug where sd.build(resume_build_id=...) silently reused the build_id without notifying the registry — so a previously-failed build kept showing as failed in the UI even while the SDK was actively running tasks under it. Resumed builds now flip back to RUNNING with a "running (resumed)" badge and jump to the top of the Home list.

What's in the PR

  • SDK: emits a BUILD_RESUMED event when resume_build_id is set (in build, build_aio, build_sequential, build_sequential_aio).
  • API: new POST /builds/{id}/resume endpoint, new EventType.BUILD_RESUMED, status replay treats it like BUILD_STARTED (flips status to RUNNING, clears completed_at, sets is_resumed: true until the next terminal event).
  • DB: new Build.last_active_at column, touched only on build-level lifecycle events (BUILD_RESUMED / BUILD_COMPLETED / BUILD_FAILED / BUILD_CANCELLED / BUILD_EXIT_EARLY). Build list sorts by (last_active_at desc, id desc) — UUID7 tiebreaker keeps pagination stable across timestamp ties. Migration backfills existing rows from created_at.
  • UI: BuildStatusBadge shows running (resumed) when applicable.

last_active_at write semantics

Touching is intentionally confined to lifecycle events, not every task event. With the high task concurrency stardag supports (Modal/Prefect distributed runners can fan out hundreds of concurrent task events against one build), an UPDATE per task event would serialise workers on a row-level exclusive lock against the build row and bloat builds with MVCC versions. The trade-off is that a long-running mid-execution build won't bump its position from intermediate task events — but its status=running badge already signals activity, and "most recent lifecycle change" is the cleaner sort key. The bug-fix goal (resumed build pops to the top) is preserved because BUILD_RESUMED is itself a lifecycle event.

A read-time JOIN against MAX(events.created_at) was also considered but rejected: it would add an O(builds_in_env) cost to every list query (≈100ms for a 1k-build environment, scaling poorly with workspace size), versus the column-based design's sub-millisecond list latency at any scale.

Backwards compatibility

  • New SDK + old API: POST /resume 404s; the registry client swallows the missing-route 404 with a warning (same _is_route_not_found pattern as task_skip / dependencies). Build still runs to completion; only the registry-side status flip is missing until the API is upgraded.
  • Old SDK + new API: no-op — old SDK doesn't call resume. last_active_at is set on insert by DEFAULT and gets bumped by the new build-level handlers, so list ordering is correct without SDK cooperation.

Drive-by UI polish (same PR by request)

  • TaskStatus type now includes "skipped" (was missing → StatusBadge rendered black on default for skipped, near-invisible on dark rows). skipped now uses amber (warmer than pending's yellow, distinct from cancelled's grey) consistently across StatusBadge, TaskNode (DAG), and TaskExplorerTable.
  • TaskFilters dropdown: added missing Skipped and Cancelled options.
  • /tasks/search/values?key=status autocomplete: was hardcoded to {pending, running, completed, failed}; now returns the full filterable set (still excludes the internal unregistered phantom marker).

Test plan

  • API unit tests pass (pytest app/stardag-api/tests/) — 257 passed including 4 new resume tests + 1 status-values pin
  • SDK unit tests pass (pytest lib/stardag/tests/ --ignore=tests/test_integration) — 525 passed including 2 new build_resume tests + 4 new 404-fallback tests
  • UI vitest passes
  • Pre-commit clean: prettier, ruff, ruff-format, pyright (lib + api), eslint, tsc
  • Alembic migration round-trips (upgrade → downgrade → upgrade)
  • Manual: trigger sd.build(resume_build_id=<failed-build-id>) and verify the UI badge flips to "running (resumed)" and the build moves to the top of Home

🤖 Generated with Claude Code

Fix the silent reuse of resume_build_id in sd.build*: the SDK now emits
a BUILD_RESUMED event when an existing build is picked up, so a build
that previously terminated (FAILED/COMPLETED/etc.) flips back to RUNNING
and surfaces "running (resumed)" in the UI. Builds list is now ordered
by Build.last_active_at so a resumed build jumps to the top instead of
staying buried at its original created_at position.

Backwards compat:
- New SDK + old API: POST /builds/{id}/resume 404s, the registry client
  swallows the missing-route 404 with a warning (same pattern as
  task_skip / dependencies). Build runs to completion locally.
- Old SDK + new API: no-op, behaviour unchanged. Existing rows get
  last_active_at backfilled from created_at by the migration.

Also includes UI polish on TaskStatus rendering (drive-by, asked for in
the same PR):
- Add "skipped" to the TaskStatus type and render it in amber across
  StatusBadge / TaskNode / TaskExplorerTable so it's visible against
  dark-blue table rows (was effectively black-on-dark-blue before).
- Add the missing statuses (suspended, completed-set) to the BuildView
  status filter dropdown and to the search-bar status autocomplete on
  the API side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class “build resume” semantics end-to-end so sd.build(resume_build_id=...) updates registry/UI state (instead of silently reusing an ID that remains FAILED in the UI), and polishes UI support for the skipped task status.

Changes:

  • SDK: emit BUILD_RESUMED via new build_resume/build_resume_aio registry calls (with backwards-compatible 404 swallow for older APIs).
  • API/DB: add POST /builds/{id}/resume, derive is_resumed in build status replay, and add Build.last_active_at to sort builds by recent activity.
  • UI: render “running (resumed)” for resumed builds; add skipped status to types, filters, and consistent amber styling.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
lib/stardag/tests/test_registry/test_api_registry.py Adds tests verifying missing-route 404 handling for build resume calls.
lib/stardag/tests/test_build/test_sequential.py Ensures sequential build path calls build_resume when resume_build_id is provided.
lib/stardag/tests/test_build/test_concurrent.py Ensures concurrent async build path calls build_resume_aio when resume_build_id is provided.
lib/stardag/tests/test_build/conftest.py Extends the recording registry to track build_resume/build_resume_aio invocations.
lib/stardag/src/stardag/registry/_base.py Introduces no-op build_resume/build_resume_aio APIs for registry backends.
lib/stardag/src/stardag/registry/_api_registry.py Implements POST /builds/{id}/resume calls with 404 backwards-compat swallowing + logging.
lib/stardag/src/stardag/build/_sequential.py Emits resume event to registry when resuming builds (sync + async sequential).
lib/stardag/src/stardag/build/_concurrent.py Emits resume event to registry when resuming builds (async concurrent).
app/stardag-ui/src/types/task.ts Adds skipped task status and optional build.is_resumed field.
app/stardag-ui/src/components/TaskNode.tsx Adds amber styling for skipped tasks in DAG nodes.
app/stardag-ui/src/components/TaskFilters.tsx Adds Skipped/Cancelled options to the status filter dropdown.
app/stardag-ui/src/components/TaskExplorerTable.tsx Adds colors for suspended/skipped/cancelled and ensures pending fallback styling.
app/stardag-ui/src/components/StatusBadge.tsx Adds amber styling for skipped status (normal + muted).
app/stardag-ui/src/components/BuildView.tsx Passes is_resumed into BuildStatusBadge in breadcrumbs.
app/stardag-ui/src/components/BuildStatusBadge.tsx Renders “running (resumed)” label + tooltip when isResumed is true.
app/stardag-ui/src/components/BuildsList.tsx Reuses BuildStatusBadge and surfaces resumed state in builds list.
app/stardag-api/tests/test_search.py Pins /tasks/search/values?key=status to return full filterable status set.
app/stardag-api/tests/test_builds.py Adds coverage for resume behavior, is_resumed semantics, and list ordering by activity.
app/stardag-api/src/stardag_api/services/status.py Adds BUILD_RESUMED replay semantics + is_resumed return value in derived build status.
app/stardag-api/src/stardag_api/schemas.py Extends BuildResponse with is_resumed.
app/stardag-api/src/stardag_api/routes/search.py Returns full filterable status value suggestions (excluding unregistered).
app/stardag-api/src/stardag_api/routes/builds.py Adds resume endpoint, touches last_active_at on event creation, and sorts builds by last_active_at.
app/stardag-api/src/stardag_api/models/enums.py Adds EventType.BUILD_RESUMED.
app/stardag-api/src/stardag_api/models/build.py Adds Build.last_active_at column + index for environment/activity sorting.
app/stardag-api/migrations/versions/20260507_121615_add_build_resumed_event_type_and_last_.py Migrates last_active_at with backfill + index.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread app/stardag-api/src/stardag_api/services/status.py
andhus and others added 2 commits May 8, 2026 08:59
Address review feedback: the previous design touched ``Build.last_active_at``
on every task event, which would serialise concurrent task workers on a
row-level exclusive lock against the same build row and bloat ``builds``
with MVCC versions under the kind of high task concurrency stardag
explicitly supports (Modal/Prefect distributed runners).

Switch to lifecycle-only touching: the column is bumped only by the five
build-level event handlers (BUILD_RESUMED / BUILD_COMPLETED /
BUILD_FAILED / BUILD_CANCELLED / BUILD_EXIT_EARLY) and the INSERT
default. The hot path (task event endpoints, single + bulk register) no
longer issues an UPDATE against ``builds``. The list-builds sort key is
now "most recent lifecycle change", which is the cleaner semantic — a
long-running mid-execution build won't bump position, but its
``status=running`` badge already signals activity, and a resumed build
still pops to the top because BUILD_RESUMED is itself a lifecycle event.

Also pick up two smaller review nits:
- ``list_builds`` now orders by ``(last_active_at desc, id desc)`` —
  ``Build.id`` is a UUID7 so it's a stable, time-correlated tiebreaker
  for builds that share a timestamp. Without it, paginating across a
  tie can yield duplicates or skips.
- ``test_list_builds_orders_resumed_first`` adds short ``asyncio.sleep``
  gaps between creates so it's robust on coarse-resolution CI clocks,
  and ``test_resume_build_not_found`` now pins the resource-level
  response body (``{"detail": "Build not found"}``) so the SDK's
  missing-route-vs-resource-404 distinction stays explicit on the API
  side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot review on PR #141: ``get_build_status`` documented
``is_resumed`` as "latest build-level event is BUILD_RESUMED", but the
BUILD_STARTED branch wasn't resetting the flag. The SDK never emits
BUILD_STARTED after BUILD_RESUMED in normal flow (it's one-or-the-other
at build init), so this is a theoretical bug, not an observable one.
But the replay shouldn't rely on event ordering — admin/manual event
inserts could produce any sequence — so make BUILD_STARTED clear the
flag like the terminal branches already do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andhus andhus merged commit 7d8f0f9 into main May 8, 2026
9 checks passed
andhus added a commit that referenced this pull request May 8, 2026
Patch release covering the build-resume lifecycle fix + SKIPPED status
UI polish merged in #141. No breaking changes — pip install -U stardag
is sufficient.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants