
feat(jobs/mcp): thread_id continuation with race-safe FIFO queue #1243

Merged · srtab merged 12 commits into main from worktree-submit-job-thread-continuation · May 20, 2026

srtab (Owner) commented May 20, 2026

Summary

Adds thread_id continuation support to both the Jobs API and the MCP submit_job tool, with a race-safe FIFO queue for concurrent submissions on the same thread.

  • API/MCP: submit_job accepts an optional thread_id; subsequent submissions on the same thread resume the same chat thread. Job IDs now match Activity.id (the durable identifier that survives DBTaskResult pruning).
  • FIFO queue: when a thread already has an active (READY/RUNNING) Activity, the new submission lands in a new QUEUED status. Queued siblings are dispatched in created_at order when the active sibling reaches a terminal state.
  • Race safety: a partial unique constraint activity_one_active_per_thread blocks double-active rows at the DB layer; the dispatcher uses an atomic CAS and an in-call loop (replacing signal recursion). A 3-failure cap prevents broker outages from mass-failing the QUEUED backlog.
  • Recovery: new release_orphan_queued_threads management command recovers rare TOCTOU losses.
  • UX: dedicated "Waiting in queue" hero on the Activity detail page and an amber QUEUED status badge in the stream.
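
The queueing rule in the FIFO and race-safety bullets can be illustrated with a minimal sketch. It is not the PR's Django code: it uses the standard-library `sqlite3` module, a simplified `activity` table, and a hypothetical `submit` helper that tries to insert a READY row and falls back to QUEUED when the partial unique index rejects it. The real constraint also filters on trigger type, which is omitted here.

```python
import sqlite3

# Simplified stand-in for the Activity table. The index name matches the PR;
# the column set and the SQL are assumptions made for illustration.
SCHEMA = """
CREATE TABLE activity (
    id INTEGER PRIMARY KEY,
    thread_id TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE UNIQUE INDEX activity_one_active_per_thread
    ON activity (thread_id)
    WHERE status IN ('READY', 'RUNNING');
"""


def submit(conn, thread_id):
    """Hypothetical helper: insert as READY, or as QUEUED if the thread is busy."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO activity (thread_id, status) VALUES (?, 'READY')",
                (thread_id,),
            )
        status = "READY"
    except sqlite3.IntegrityError:
        # The partial unique index refused a second active row on this thread.
        with conn:
            cur = conn.execute(
                "INSERT INTO activity (thread_id, status) VALUES (?, 'QUEUED')",
                (thread_id,),
            )
        status = "QUEUED"
    return cur.lastrowid, status


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first = submit(conn, "thread-a")   # no active sibling, so READY
second = submit(conn, "thread-a")  # thread-a is busy, so QUEUED
other = submit(conn, "thread-b")   # a different thread is unaffected
```

Because QUEUED sits outside the index predicate, any number of queued siblings can coexist on one thread while at most one row is READY or RUNNING.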

Test plan

  • make test passes (unit + the new constraint, dispatcher race, services, and MCP/Jobs API tests)
  • Submit two jobs on the same thread_id from the API/MCP: the second is QUEUED and dispatches when the first reaches a terminal state
  • Verify the partial unique constraint rejects two READY rows on the same thread (DB-level)
  • Run release_orphan_queued_threads against a synthetic orphan and confirm it releases
  • Inspect the QUEUED hero in the UI for a thread with a queued sibling

srtab added 12 commits May 20, 2026 14:51
…b_id

Fix fake Activity attributes in environment tests (task_result_id -> id/thread_id/status) to match the updated API surface from the thread continuation feature.
… outages

Closes a TOCTOU race in the thread-continuation FIFO and bounds the blast
radius of broker outages so a single bad submission cannot strand QUEUED
siblings or double-promote them.

- DB: partial unique constraint `activity_one_active_per_thread` enforces
  "at most one active (READY/RUNNING) API/MCP Activity per thread". QUEUED
  is intentionally outside the constraint; webhook trigger types are also
  excluded (they share deterministic thread_ids across events).
- Dispatcher: atomic CAS (`UPDATE filter(status=QUEUED) -> READY`) replaces
  the racy read-then-save; an in-call loop replaces signal recursion; a
  cap of 3 consecutive enqueue failures bails the loop so a broker outage
  leaves the rest QUEUED for the new `release_orphan_queued_threads`
  management command to recover.
- Services: post-create error paths (enqueue or task_result_id link)
  transition rows to FAILED with `finished_at` set and re-emit
  `activity_finished` so queued siblings advance.
- API/MCP: schema-level UUID validation for `thread_id` (proper 422 on
  malformed input); MCP `submit_job` now rejects unauthenticated calls
  early; MCP batch-poll timeout reports each job's real status
  (QUEUED/READY/RUNNING) instead of a `PENDING` placeholder.
- UI: dedicated "Waiting in queue" hero and amber QUEUED status badge.
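
A rough sketch of the dispatcher behaviour described above, under stated assumptions: standard-library `sqlite3` instead of the Django ORM, `id` order standing in for `created_at`, a hypothetical `promote_next` function, and a caller-supplied `enqueue` callable in place of the real broker. Marking a row FAILED when its enqueue raises is an assumption drawn from the services bullet, not confirmed dispatcher code.

```python
import sqlite3

MAX_ENQUEUE_FAILURES = 3  # the cap named in the commit message


def make_db():
    """Build an in-memory table with the partial unique index (simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE activity (
            id INTEGER PRIMARY KEY,
            thread_id TEXT NOT NULL,
            status TEXT NOT NULL
        );
        CREATE UNIQUE INDEX activity_one_active_per_thread
            ON activity (thread_id) WHERE status IN ('READY', 'RUNNING');
    """)
    return conn


def promote_next(conn, thread_id, enqueue):
    """Promote the oldest QUEUED row on a thread; return its id or None."""
    failures = 0
    while failures < MAX_ENQUEUE_FAILURES:  # in-call loop, no recursion
        row = conn.execute(
            "SELECT id FROM activity WHERE thread_id = ? AND status = 'QUEUED' "
            "ORDER BY id LIMIT 1",
            (thread_id,),
        ).fetchone()
        if row is None:
            return None  # nothing left to dispatch
        activity_id = row[0]
        try:
            with conn:
                # Compare-and-swap: only succeeds if the row is still QUEUED.
                cur = conn.execute(
                    "UPDATE activity SET status = 'READY' "
                    "WHERE id = ? AND status = 'QUEUED'",
                    (activity_id,),
                )
        except sqlite3.IntegrityError:
            return None  # another row on this thread is already active
        if cur.rowcount == 0:
            continue  # lost the race for this row; try the next one
        try:
            enqueue(activity_id)
            return activity_id
        except Exception:
            failures += 1
            with conn:
                conn.execute(
                    "UPDATE activity SET status = 'FAILED' WHERE id = ?",
                    (activity_id,),
                )
    return None  # cap reached; remaining rows stay QUEUED


def broken_broker(activity_id):
    raise ConnectionError("broker unavailable")


conn = make_db()
with conn:
    conn.executemany(
        "INSERT INTO activity (thread_id, status) VALUES ('t1', ?)",
        [("QUEUED",)] * 5,
    )

outage_result = promote_next(conn, "t1", broken_broker)
statuses_after_outage = [
    r[0] for r in conn.execute("SELECT status FROM activity ORDER BY id")
]
recovered_id = promote_next(conn, "t1", lambda activity_id: None)
```

In this sketch a simulated outage fails only the first three rows and leaves the last two QUEUED, so a later call (or a recovery command such as `release_orphan_queued_threads`) can pick them up once the broker is back.
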
Replace the hand-rolled regex pattern on the REST schema and the manual
``uuid_mod.UUID(thread_id)`` parse inside the MCP ``submit_job`` tool
with a Pydantic ``UUID`` type. Pydantic validates UUIDs natively, so
both entrypoints now share one boundary check expressed in the type
system instead of two divergent ad-hoc styles.

The REST path is fully covered: ninja always runs Pydantic, and the
existing 422-at-schema tests still pin the contract. The MCP path
keeps a defensive ``str(uuid_mod.UUID(str(thread_id)))`` normalisation
because FastMCP only validates at the protocol layer — direct callers
(tests, in-process use) still pass raw strings. Also log on the
malformed-thread_id branch so operators can spot misuse, and add a
test covering the ``TypeError`` arm.
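
The defensive normalisation on the MCP path might look roughly like the sketch below. It uses only the standard-library `uuid` and `logging` modules; the function name `normalise_thread_id` and the choice to raise `ValueError` on malformed input are assumptions for illustration, not the tool's actual error handling.

```python
import logging
import uuid

logger = logging.getLogger(__name__)


def normalise_thread_id(thread_id):
    """Hypothetical helper: return the canonical UUID string, or None if absent."""
    if thread_id is None:
        return None  # thread_id is optional
    try:
        # Accepts UUID objects and raw strings; str() yields the
        # lowercase, hyphenated canonical form.
        return str(uuid.UUID(str(thread_id)))
    except (ValueError, TypeError) as exc:
        # Log so operators can spot misuse by direct callers.
        logger.warning("Malformed thread_id: %r", thread_id)
        raise ValueError("thread_id must be a valid UUID") from exc


canonical = normalise_thread_id("12345678123456781234567812345678")
# canonical == "12345678-1234-5678-1234-567812345678"
```

Normalising to the canonical string means a direct caller that passes an un-hyphenated or upper-case UUID still resolves to the same thread as a caller that went through schema validation.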
@srtab srtab merged commit 306159f into main May 20, 2026
6 checks passed
@srtab srtab deleted the worktree-submit-job-thread-continuation branch May 20, 2026 22:29