20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- Reserved the `sub::` marker for runtime-generated sub-orchestration instance ids.
`Client::start_orchestration` and `Client::start_orchestration_versioned` now
return `ClientError::InvalidInput` for root instance ids that start with `sub::`
or contain `::sub::`; other uses of `::` remain supported. Applications that used
the reserved marker in root instance ids must rename those ids before upgrading.
See [docs/migration-guide.md](docs/migration-guide.md) for guidance.
- **`ctx.new_guid()` now returns a standard UUID v4.** The previous
implementation derived the value from `SystemTime::now()` nanoseconds plus a
thread-local counter, which produced low-entropy, structured values (the
@@ -19,6 +25,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
`nanos + process id`, removing a predictable-token pattern in work-item
ownership checks.

### Fixed

- **Parent hang on sub-orchestration instance-id collision** — When an auto-generated
child instance id already named a terminal instance, the scheduling parent could await
a completion that never arrived. The runtime now notifies the parent with a
sub-orchestration failure so it fails fast. The failure (and all sub-orchestration
completion/failure notifications) is routed to the parent's current execution using
durable provider state instead of process-local memory, so routing stays correct across
runtime restarts and multiple dispatcher nodes.
- **Sub-orchestration id reuse across continue-as-new** — Child instance ids generated
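after a parent `continue_as_new` now include the parent execution id
(`{parent}::sub::{execution_id}_{event_id}`). Event ids reset on continue-as-new, so a
parent that schedules a sub-orchestration at the same position on each iteration
previously regenerated the same child id and collided with the terminal child of the
previous iteration.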

## [0.1.29] - 2026-05-08

**Release:** <https://crates.io/crates/duroxide/0.1.29>
30 changes: 30 additions & 0 deletions docs/migration-guide.md
@@ -2,6 +2,36 @@

This guide helps you migrate between Duroxide versions and handle orchestration versioning.

## Reserved `sub::` instance-id marker (Unreleased)

The `sub::` marker is now reserved for runtime-generated sub-orchestration instance ids.
`Client::start_orchestration` and `Client::start_orchestration_versioned` reject root
instance ids that:

- start with `sub::`, or
- contain the `::sub::` infix.

Such ids return `ClientError::InvalidInput`. Ordinary uses of `::` in instance ids remain
valid (e.g. `tenant-7::order-42`); only the `sub::` marker is reserved.

This prevents a root instance id from pre-occupying an auto-generated child id. Child
sub-orchestration ids take the form `{parent}::sub::{event_id}` on the first parent
execution and `{parent}::sub::{execution_id}_{event_id}` after `continue_as_new`.
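
As a rough illustration of the two child-id shapes, the sketch below rebuilds them with standalone helpers. The initial execution id of `1` and the `child_instance_id` helper are assumptions made for this example; the real formatter is crate-internal and may differ in detail.

```rust
// Stand-ins for crate-internal constants. The prefix follows the documented
// `sub::` marker; the initial execution id of 1 is an assumption.
const SUB_ORCH_AUTO_PREFIX: &str = "sub::";
const INITIAL_EXECUTION_ID: u64 = 1;

/// Suffix for an auto-generated child id: the first execution keeps the short
/// form, later executions (after continue-as-new) add the execution id.
fn auto_sub_orch_suffix(execution_id: u64, event_id: u64) -> String {
    if execution_id == INITIAL_EXECUTION_ID {
        format!("{SUB_ORCH_AUTO_PREFIX}{event_id}")
    } else {
        format!("{SUB_ORCH_AUTO_PREFIX}{execution_id}_{event_id}")
    }
}

/// Hypothetical helper: join the parent id and the suffix with `::`.
fn child_instance_id(parent: &str, execution_id: u64, event_id: u64) -> String {
    format!("{parent}::{}", auto_sub_orch_suffix(execution_id, event_id))
}
```

Under these assumptions, `child_instance_id("order-42", 1, 5)` yields `order-42::sub::5` and `child_instance_id("order-42", 2, 5)` yields `order-42::sub::2_5`, so a child scheduled at the same event position after `continue_as_new` no longer shares an id with its predecessor.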

Before upgrading client code, audit your root instance-id scheme for the reserved marker:

```text
# Reject — start with `sub::` or contain `::sub::`
sub::job-1
tenant-7::sub::order-42

# Accept — ordinary `::` is fine
tenant-7::order-42
order-2026-06-09
```

Rename any root instance ids that use the reserved marker before upgrading.
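
One way to run that audit is a small check that applies the rejection rule to your existing ids. This is a minimal sketch, not part of the Duroxide API: `uses_reserved_marker` and `ids_to_rename` are hypothetical names, and the predicate restates the rule as described in this guide.

```rust
/// Returns true when a root instance id would be rejected under the rule above:
/// it starts with `sub::` or contains the `::sub::` infix.
fn uses_reserved_marker(instance: &str) -> bool {
    instance.starts_with("sub::") || instance.contains("::sub::")
}

/// Hypothetical audit helper: collect the ids that need renaming before upgrading.
fn ids_to_rename<'a>(ids: &[&'a str]) -> Vec<&'a str> {
    ids.iter()
        .copied()
        .filter(|id| uses_reserved_marker(id))
        .collect()
}
```

Running `ids_to_rename` over the four ids listed above would flag `sub::job-1` and `tenant-7::sub::order-42` and leave the other two untouched.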

## Orchestration Versioning

Duroxide supports versioning to handle code evolution while maintaining compatibility with running instances.
34 changes: 32 additions & 2 deletions src/client/mod.rs
@@ -168,6 +168,25 @@ pub struct Client {
store: Arc<dyn Provider>,
}

/// Reject instance ids that collide with the reserved sub-orchestration markers.
///
/// Child sub-orchestration instance ids reserve the `sub::` marker (see
/// [`crate::auto_sub_orch_suffix`], the canonical formatter). The first parent
/// execution uses `{parent}::sub::{event_id}`; executions after continue-as-new use
/// `{parent}::sub::{execution_id}_{event_id}`. A user-supplied id matching either form
/// could pre-occupy a future child id, so the `sub::` prefix and `::sub::` infix are
/// reserved. Other uses of `::` remain valid.
fn validate_instance_id(instance: &str) -> Result<(), ClientError> {
if instance.starts_with(crate::SUB_ORCH_AUTO_PREFIX) || instance.contains("::sub::") {
return Err(ClientError::InvalidInput {
message: format!(
"instance id '{instance}' uses the reserved sub-orchestration marker 'sub::'"
),
});
}
Ok(())
}

impl Client {
/// Create a client bound to a Provider instance.
///
@@ -211,6 +230,9 @@ impl Client {
/// - Must be unique across all orchestrations
/// - Can be any string (alphanumeric + hyphens recommended)
/// - Reusing an instance ID that already exists will fail
/// - Must not use the reserved sub-orchestration marker `sub::` (as a prefix
/// or in the `::sub::` form); these are reserved for auto-generated child
/// instance ids. Such ids are rejected with [`ClientError::InvalidInput`].
///
/// # Example
///
@@ -230,15 +252,19 @@ impl Client {
///
/// # Errors
///
/// Returns `ClientError::InvalidInput` if the instance id uses the reserved
/// `sub::` marker.
/// Returns `ClientError::Provider` if the provider fails to enqueue the orchestration.
pub async fn start_orchestration(
&self,
instance: impl Into<String>,
orchestration: impl Into<String>,
input: impl Into<String>,
) -> Result<(), ClientError> {
let instance = instance.into();
validate_instance_id(&instance)?;
let item = WorkItem::StartOrchestration {
- instance: instance.into(),
+ instance,
orchestration: orchestration.into(),
input: input.into(),
version: None,
@@ -256,6 +282,8 @@ impl Client {
///
/// # Errors
///
/// Returns `ClientError::InvalidInput` if the instance id uses the reserved
/// `sub::` marker.
/// Returns `ClientError::Provider` if the provider fails to enqueue the orchestration.
pub async fn start_orchestration_versioned(
&self,
@@ -264,8 +292,10 @@ impl Client {
version: impl Into<String>,
input: impl Into<String>,
) -> Result<(), ClientError> {
let instance = instance.into();
validate_instance_id(&instance)?;
let item = WorkItem::StartOrchestration {
- instance: instance.into(),
+ instance,
orchestration: orchestration.into(),
input: input.into(),
version: Some(version.into()),
39 changes: 38 additions & 1 deletion src/lib.rs
@@ -873,6 +873,24 @@ pub fn is_auto_generated_sub_orch_id(instance: &str) -> bool {
instance.starts_with(SUB_ORCH_AUTO_PREFIX)
}

/// Build the auto-generated sub-orchestration suffix for a given parent execution
/// and scheduling event id.
///
/// The first execution uses `sub::{event_id}` for backward compatibility. Later
/// executions (after `continue_as_new`) include the execution id as
/// `sub::{execution_id}_{event_id}`: event ids reset on continue-as-new, so a parent
/// that schedules a sub-orchestration at the same position on each iteration would
/// otherwise regenerate an identical child id and collide with the now-terminal
/// child from the previous iteration.
#[inline]
pub(crate) fn auto_sub_orch_suffix(execution_id: u64, event_id: u64) -> String {
if execution_id == INITIAL_EXECUTION_ID {
format!("{SUB_ORCH_AUTO_PREFIX}{event_id}")
} else {
format!("{SUB_ORCH_AUTO_PREFIX}{execution_id}_{event_id}")
}
}

/// Build the full child instance ID, adding parent prefix only for auto-generated IDs.
///
/// - Auto-generated IDs (starting with "sub::"): `{parent}::{child}` (e.g., `parent-1::sub::5`)
@@ -3732,7 +3750,22 @@ impl OrchestrationContext {
/// without any parent prefix. Use this when you need to control the exact
/// instance ID for the sub-orchestration.
///
- /// For auto-generated instance IDs, use [`schedule_sub_orchestration`] instead.
+ /// For auto-generated instance IDs, use [`schedule_sub_orchestration`](Self::schedule_sub_orchestration)
+ /// instead.
///
/// # Reserved marker (advanced escape hatch)
///
/// Unlike [`crate::Client::start_orchestration`], explicit child ids are **not**
/// validated against the reserved `sub::` marker — they are an advanced escape hatch
/// where the caller owns the full id space. Two consequences to be aware of:
///
/// - An explicit id of the runtime-generated shape (e.g. `parent::sub::2`) is allowed
/// and may therefore collide with an auto-generated child id. The runtime defends
/// against the resulting collision: if the id already names a terminal instance the
/// scheduling parent receives a sub-orchestration failure instead of hanging.
/// - An explicit id that *starts with* `sub::` is treated as auto-generated by
/// [`crate::build_child_instance_id`] and therefore gets the parent prefix added,
/// so it is **not** used verbatim. Avoid leading `sub::` in explicit ids.
pub fn schedule_sub_orchestration_with_id(
&self,
name: impl Into<String>,
@@ -3760,6 +3793,10 @@ impl OrchestrationContext {
/// The provided `instance` value is used exactly as the child instance ID,
/// without any parent prefix.
///
/// Like [`schedule_sub_orchestration_with_id`](Self::schedule_sub_orchestration_with_id),
/// explicit child ids are an advanced escape hatch and are **not** validated against the
/// reserved `sub::` marker; see that method for the collision and leading-`sub::` caveats.
///
/// Returns a [`DurableFuture`] that supports cancellation on drop. If the future
/// is dropped without completing, a `CancelInstance` work item will be enqueued
/// for the child orchestration.
69 changes: 67 additions & 2 deletions src/runtime/dispatchers/orchestration.rs
@@ -593,13 +593,18 @@ impl Runtime {
|| (temp_history_mgr.is_continued_as_new && !workitem_reader.is_continue_as_new)
{
warn!(instance = %instance, "Instance is terminal (completed/failed or CAN without start), acking batch without processing");
// This instance is terminal, so the batch is discarded. If the batch contains a
// StartOrchestration for a sub-orchestration whose scheduling parent differs from
// this instance's own recorded parent, a different parent has reused this child
// instance id. Notify that scheduling parent with SubOrchFailed so it fails fast
// instead of awaiting a completion that will never arrive.
let orchestrator_items = self.terminal_collision_notifications(&item).await;

Contributor


I don't get this, why would a StartOrchestration for a suborchestration be routed to a terminal parent? I'm probably missing a scenario

Contributor Author


Yeah, the comment was kind of garbled here. The scenario is not "route to a terminal parent" but rather "a terminal child instance receives a new StartOrchestration work item targeting the same child id, so notify the incoming scheduling parent instead of silently discarding." I've updated the comment, let me know if it passes the @affandar test now :)

let _ = self
.ack_orchestration_with_changes(
lock_token,
item.execution_id,
vec![],
vec![],
- vec![],
+ orchestrator_items,
ExecutionMetadata::default(),
vec![], // cancelled_activities - none for terminal instances
)
@@ -1373,7 +1378,7 @@ impl Runtime {
parent_id = %parent_id,
"Enqueue SubOrchFailed to parent (poison)"
);
- let parent_execution_id = self.get_execution_id_for_instance(&parent_instance, None).await;
+ let parent_execution_id = self.parent_execution_id_for_routing(&parent_instance).await;
vec![WorkItem::SubOrchFailed {
parent_instance,
parent_execution_id,
@@ -1399,4 +1404,64 @@ impl Runtime {
// Record metrics for poison detection
self.record_orchestration_poison();
}

/// Build `SubOrchFailed` notifications for a terminal instance that received a
/// `StartOrchestration` belonging to a different parent.
///
/// Sub-orchestration child instance ids reserve the `sub::` marker (see
/// [`crate::auto_sub_orch_suffix`]): the first parent execution uses
/// `{parent}::sub::{event_id}` and executions after continue-as-new use
/// `{parent}::sub::{execution_id}_{event_id}`. If such an id already names a terminal
/// instance, the incoming `StartOrchestration` is discarded by the terminal fast-ack
/// path. Without this notification the scheduling parent would await a completion
/// forever. We only notify when the incoming work item's parent differs from the
/// terminal instance's own recorded parent, so genuine redelivery of a completed
/// child's start (parent already notified) does not spuriously fail the parent again.
async fn terminal_collision_notifications(&self, item: &crate::providers::OrchestrationItem) -> Vec<WorkItem> {
// The terminal instance's own parent, as recorded in its history.
let own_parent = item.history.iter().find_map(|e| match &e.kind {
EventKind::OrchestrationStarted {
parent_instance: Some(pi),
parent_id: Some(pid),
..
} => Some((pi.clone(), *pid)),
_ => None,
});

let mut notifications = Vec::new();
for msg in &item.messages {
if let WorkItem::StartOrchestration {
parent_instance: Some(parent_instance),
parent_id: Some(parent_id),
..
} = msg
{
// Skip genuine redelivery: same parent that already owns this instance.
if own_parent.as_ref().is_some_and(|(pi, pid)| pi == parent_instance && pid == parent_id) {
continue;
}
warn!(
instance = %item.instance,
parent_instance = %parent_instance,
parent_id = %parent_id,
"Sub-orchestration target instance id already exists and is terminal; notifying parent of failure"
);
let parent_execution_id = self.parent_execution_id_for_routing(parent_instance).await;
notifications.push(WorkItem::SubOrchFailed {
parent_instance: parent_instance.clone(),
parent_execution_id,
parent_id: *parent_id,
details: crate::ErrorDetails::Application {
kind: crate::AppErrorKind::OrchestrationFailed,
message: format!(
"sub-orchestration instance id '{}' already exists and is terminal",
item.instance
),
retryable: false,
},
});
}
}
notifications
}
}
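
The notify-or-skip decision in `terminal_collision_notifications` can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the runtime's code: `ParentLink` and `parents_to_notify` are hypothetical stand-ins for the `(parent_instance, parent_id)` pair and the loop over `item.messages`.

```rust
/// Simplified stand-in for the (parent_instance, parent_id) pair carried by a
/// sub-orchestration start message. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ParentLink {
    instance: String,
    event_id: u64,
}

/// Given the terminal instance's own recorded parent and the parent links of the
/// incoming start messages, return the parents that should be told the start failed.
fn parents_to_notify(
    own_parent: Option<&ParentLink>,
    incoming: &[Option<ParentLink>],
) -> Vec<ParentLink> {
    incoming
        .iter()
        .flatten() // root starts carry no parent link, so there is nobody to notify
        .filter(|link| own_parent != Some(*link)) // skip redelivery from the owning parent
        .cloned()
        .collect()
}
```

In this model a redelivered start from the owning parent produces no notification, while a start from any other parent (or any parent at all, when the terminal instance is a root) produces exactly one.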
9 changes: 3 additions & 6 deletions src/runtime/execution.rs
@@ -151,7 +151,6 @@ impl Runtime {
session_id,
tag,
} => {
- let execution_id = self.get_execution_id_for_instance(instance, Some(execution_id)).await;
worker_items.push(WorkItem::ActivityExecute {
instance: instance.to_string(),
execution_id,
@@ -166,8 +165,6 @@ impl Runtime {
scheduling_event_id,
fire_at_ms,
} => {
- let execution_id = self.get_execution_id_for_instance(instance, Some(execution_id)).await;
-
// Enqueue TimerFired to orchestrator queue with delayed visibility
// Provider will use fire_at_ms for the visible_at timestamp
// Note: fire_at_ms is computed at scheduling time (wall-clock),
@@ -244,7 +241,7 @@ impl Runtime {
tracing::debug!(target = "duroxide::runtime::execution", instance=%instance, parent_instance=%parent_instance, parent_id=%parent_id, "Enqueue SubOrchCompleted to parent");
orchestrator_items.push(WorkItem::SubOrchCompleted {
parent_instance: parent_instance.clone(),
- parent_execution_id: self.get_execution_id_for_instance(&parent_instance, None).await,
+ parent_execution_id: self.parent_execution_id_for_routing(&parent_instance).await,
parent_id,
result: output.clone(),
});
@@ -277,7 +274,7 @@ impl Runtime {
tracing::debug!(target = "duroxide::runtime::execution", instance=%instance, parent_instance=%parent_instance, parent_id=%parent_id, "Enqueue SubOrchFailed to parent");
orchestrator_items.push(WorkItem::SubOrchFailed {
parent_instance: parent_instance.clone(),
- parent_execution_id: self.get_execution_id_for_instance(&parent_instance, None).await,
+ parent_execution_id: self.parent_execution_id_for_routing(&parent_instance).await,
parent_id,
details: details.clone(),
});
@@ -364,7 +361,7 @@ impl Runtime {
if let Some((parent_instance, parent_id)) = parent_link {
orchestrator_items.push(WorkItem::SubOrchFailed {
parent_instance: parent_instance.clone(),
- parent_execution_id: self.get_execution_id_for_instance(&parent_instance, None).await,
+ parent_execution_id: self.parent_execution_id_for_routing(&parent_instance).await,
parent_id,
details: details.clone(),
});