Skip to content

feat: expose signal wait marker on instances#268

Open
nanookclaw wants to merge 1 commit into
microsoft:mainfrom
nanookclaw:feat/signal-wait-observability
Open

feat: expose signal wait marker on instances#268
nanookclaw wants to merge 1 commit into
microsoft:mainfrom
nanookclaw:feat/signal-wait-observability

Conversation

@nanookclaw

Copy link
Copy Markdown
Contributor

Summary

Closes #239.

This adds a nullable df.instances.blocked_on_signal column so an instance exposes the signal name while it is parked on a SIGNAL node. The existing update-node-status activity now keeps that marker in sync when SIGNAL nodes enter or leave running, recomputing from currently running SIGNAL nodes so overlapping waits do not clear the marker too early.

Terminal instance transitions clear the marker, and df.cancel() clears it atomically with the cancellation update. Fresh-install DDL, the 0.2.3 to 0.2.4 upgrade script, grant/revoke helpers, and security/upgrade notes are updated so runtime-status privileges include the new column. Signal E2E SQL now asserts that the marker is set while waiting and cleared after signal receipt or timeout.

Verification

  • cargo fmt -p pg_durable -- --check
  • git diff --check

I could not run cargo check locally because PGRX_HOME is not configured on this host. I also could not run scripts/pgspot-gate.sh because the environment could not fetch pgspot==0.9.2 from PyPI.

Copy link
Copy Markdown
Contributor

Thanks for this — the implementation itself is clean (backward-compat probing, terminal-state guards, grant replay, and docs are all solid, and it builds/clippy/fmt cleanly locally).

That said, I'd like to hold off on merging until #239 is better specified, because I think the underlying data model is still an open question rather than a code-quality one.

The key issue: an instance can be blocked on multiple signals concurrently. Parallel JOIN (&) and RACE (?>) each run their branches as concurrent sub-orchestrations, so df.wait_for_signal('a') & df.wait_for_signal('b') leaves two SIGNAL nodes running under one instance. A scalar blocked_on_signal can only represent one of them (here, ORDER BY updated_at DESC, id LIMIT 1), which can mislead an operator into thinking sending that one signal will unblock the instance.

There are also some design tradeoffs worth weighing against the benefit:

  • Write amplification / contention: every node-status transition (all node types) now does a catalog probe + node lookup, and each SIGNAL transition opens a second txn taking FOR UPDATE on the hot instance row.
  • Coupling: a failure in sync_instance_signal_wait returns Err and fails the node-status activity, so an observability write can stall the workflow it's observing.
  • Duplicated source of truth: df.nodes already holds the authoritative wait state; this denormalizes it (with migration, per-column grants, and drift surface) to populate a SELECT *-friendly field.

I've left a longer write-up in #239 proposing we first decide scalar-single vs. set-of-many, and whether a read-side view/function over df.nodes (always correct, multi-signal-native, zero write cost) better satisfies the actual requirement than a denormalized column. Once the issue lands on the intended semantics, we can re-shape this PR accordingly.

Not a criticism of the execution here — just want the spec nailed down before we commit to a schema/write-path change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

No observable "waiting on a signal" state at the instance level

2 participants