feat: expose signal wait marker on instances by nanookclaw · Pull Request #268 · microsoft/pg_durable

nanookclaw · 2026-06-25T17:48:06Z

Summary

Closes #239.

This adds a nullable df.instances.blocked_on_signal column so an instance exposes the signal name while it is parked on a SIGNAL node. The existing update-node-status activity now keeps that marker in sync when SIGNAL nodes enter or leave running, recomputing from currently running SIGNAL nodes so overlapping waits do not clear the marker too early.

Terminal instance transitions clear the marker, and df.cancel() clears it atomically with the cancellation update. Fresh-install DDL, the 0.2.3 to 0.2.4 upgrade script, grant/revoke helpers, and security/upgrade notes are updated so runtime-status privileges include the new column. Signal E2E SQL now asserts that the marker is set while waiting and cleared after signal receipt or timeout.

Verification

cargo fmt -p pg_durable -- --check
git diff --check

I could not run cargo check locally because PGRX_HOME is not configured on this host. I also could not run scripts/pgspot-gate.sh because the environment could not fetch pgspot==0.9.2 from PyPI.

pinodeca · 2026-06-25T20:48:03Z

Thanks for this — the implementation itself is clean (backward-compat probing, terminal-state guards, grant replay, and docs are all solid, and it builds/clippy/fmt cleanly locally).

That said, I'd like to hold off on merging until #239 is better specified, because I think the underlying data model is still an open question rather than a code-quality one.

The key issue: an instance can be blocked on multiple signals concurrently. Parallel JOIN (&) and RACE (?>) each run their branches as concurrent sub-orchestrations, so df.wait_for_signal('a') & df.wait_for_signal('b') leaves two SIGNAL nodes running under one instance. A scalar blocked_on_signal can only represent one of them (here, ORDER BY updated_at DESC, id LIMIT 1), which can mislead an operator into thinking sending that one signal will unblock the instance.

There are also some design tradeoffs worth weighing against the benefit:

Write amplification / contention: every node-status transition (all node types) now does a catalog probe + node lookup, and each SIGNAL transition opens a second txn taking FOR UPDATE on the hot instance row.
Coupling: a failure in sync_instance_signal_wait returns Err and fails the node-status activity, so an observability write can stall the workflow it's observing.
Duplicated source of truth: df.nodes already holds the authoritative wait state; this denormalizes it (with migration, per-column grants, and drift surface) to populate a SELECT *-friendly field.

I've left a longer write-up in #239 proposing we first decide scalar-single vs. set-of-many, and whether a read-side view/function over df.nodes (always correct, multi-signal-native, zero write cost) better satisfies the actual requirement than a denormalized column. Once the issue lands on the intended semantics, we can re-shape this PR accordingly.

Not a criticism of the execution here — just want the spec nailed down before we commit to a schema/write-path change.

feat: expose signal wait marker on instances

0360f1c

pinodeca mentioned this pull request Jun 25, 2026

No observable "waiting on a signal" state at the instance level #239

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: expose signal wait marker on instances#268

feat: expose signal wait marker on instances#268
nanookclaw wants to merge 1 commit into
microsoft:mainfrom
nanookclaw:feat/signal-wait-observability

nanookclaw commented Jun 25, 2026

Uh oh!

pinodeca commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nanookclaw commented Jun 25, 2026

Summary

Verification

Uh oh!

pinodeca commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants