|
| 1 | +# Adding `transfer_to` to the Executor Concept |
| 2 | + |
| 3 | +## The Problem |
| 4 | + |
| 5 | +When a coroutine crosses an executor boundary via `run`, the strand on either side of the boundary can get trapped. The bug appears in two directions: |
| 6 | + |
| 7 | +**Case 1: caller on strand, target is a different executor.** |
| 8 | + |
| 9 | +1. A coroutine is running on a strand |
| 10 | +2. It does `co_await run(ex)(f())` where `ex` is not the strand |
| 11 | +3. During `f()`, the strand still thinks it is running - no other queued coroutine can make progress |
| 12 | + |
| 13 | +**Case 2: caller on io_context, target is a strand.** |
| 14 | + |
| 15 | +1. A coroutine is running on an io_context |
| 16 | +2. It does `co_await run(strand)(f())` |
| 17 | +3. `f()` runs inside the strand's invoker. When `f()` completes, the trampoline dispatches the parent back to the io_context. The io_context inlines (same thread), so the parent runs inside the strand's `safe_resume` call. The strand is held until the parent suspends. |
| 18 | + |
| 19 | +Both cases have the same root cause: `dispatch` can inline across an executor boundary, and the symmetric transfer chain runs the entire sequence without returning to the strand's invoker loop. |
| 20 | + |
| 21 | +## Root Cause |
| 22 | + |
| 23 | +The call site is `safe_resume(h)` at `strand_queue.hpp:269`, inside `dispatch_batch`. Each queued item is wrapped in a `strand_op` coroutine whose body calls `safe_resume(target)` at line 132. `safe_resume` calls `h.resume()`, which runs the coroutine until something in the chain returns `void` or `noop_coroutine()` from `await_suspend`. Symmetric transfer does not unwind - it tail-jumps through coroutine handles without returning to the caller. |
| 24 | + |
| 25 | +**Case 1 trace** (caller on strand, target is io_context): |
| 26 | + |
| 27 | +``` |
| 28 | +dispatch_batch: safe_resume(wrapper_h) <- line 269 |
| 29 | + -> wrapper: safe_resume(caller_coro) <- line 132 |
| 30 | + -> caller does co_await run(io_ctx)(f()) |
| 31 | + -> await_suspend: io_ctx.dispatch(task_cont) |
| 32 | + -> io_ctx INLINES, returns task_cont.h |
| 33 | + -> symmetric-transfer to f() |
| 34 | + -> f() runs to completion |
| 35 | + -> f()'s final_suspend symmetric-transfers to trampoline |
| 36 | + -> trampoline: caller_ex.dispatch(parent) |
| 37 | + -> caller_ex is strand, running_in_this_thread() is true, INLINES |
| 38 | + -> symmetric-transfer to parent |
| 39 | + -> parent runs <- still inside safe_resume |
| 40 | + -> parent suspends |
| 41 | + <- safe_resume returns |
| 42 | +``` |
| 43 | + |
| 44 | +**Case 2 trace** (caller on io_context, target is strand): |
| 45 | + |
| 46 | +``` |
| 47 | +dispatch_batch: safe_resume(wrapper_h) <- line 269 |
| 48 | + -> wrapper: safe_resume(inner_task) <- line 132 |
| 49 | + -> f() runs on the strand, completes |
| 50 | + -> f()'s final_suspend symmetric-transfers to trampoline |
| 51 | + -> trampoline: caller_ex.dispatch(parent) |
| 52 | + -> caller_ex is io_context, INLINES, returns parent.h |
| 53 | + -> symmetric-transfer to parent |
| 54 | + -> parent runs <- still inside safe_resume |
| 55 | + -> parent suspends |
| 56 | + <- safe_resume returns |
| 57 | +``` |
| 58 | + |
| 59 | +In both cases, the strand's invoker loop does not get control back until the parent suspends. The strand is held for the duration of the inner task, the trampoline, and the parent's resumed execution. |
| 60 | + |
| 61 | +## The Fix: `transfer_to` |
| 62 | + |
| 63 | +The problem is that `dispatch` does not give the source executor a chance to clean up before handing off to the target. We need a third verb on the Executor concept. |
| 64 | + |
| 65 | +The three executor verbs become: |
| 66 | + |
| 67 | +- `std::coroutine_handle<> dispatch(continuation& c)` - Run `c` on this executor. If already on the right thread, return `c.h` for symmetric transfer. Otherwise queue `c` and return `noop_coroutine()`. |
| 68 | + |
| 69 | +- `void post(continuation& c)` - Queue `c` on this executor. Never run inline. |
| 70 | + |
| 71 | +- `std::coroutine_handle<> transfer_to(executor_ref target, continuation& c)` - This executor is releasing `c`. Do whatever is needed to let go, then get `c` running on `target`. Non-serializing executors forward to `target.dispatch(c)`. Serializing executors (strands) post to `target` so the current dispatch batch can finish and the serialization frame can close normally. |
| 72 | + |
| 73 | +`transfer_to` is called on the source executor - the one being left - because the source is the one that knows whether it needs to break the symmetric transfer chain. The target is always a valid (non-null) `executor_ref`. |
| 74 | + |
| 75 | +## Concept Change |
| 76 | + |
| 77 | +`executor.hpp` requires clause gains: |
| 78 | + |
| 79 | +```cpp |
| 80 | +{ ce.transfer_to(executor_ref{}, c) } -> std::same_as<std::coroutine_handle<>>; |
| 81 | +``` |
| 82 | +
|
| 83 | +## Type Erasure |
| 84 | +
|
| 85 | +The vtable in `executor_ref.hpp` gains a slot: |
| 86 | +
|
| 87 | +```cpp |
| 88 | +std::coroutine_handle<> (*transfer_to)(void const*, executor_ref, continuation&); |
| 89 | +``` |
| 90 | + |
| 91 | +`executor_ref` gains a forwarding method: |
| 92 | + |
| 93 | +```cpp |
| 94 | +std::coroutine_handle<> transfer_to(executor_ref target, continuation& c) const |
| 95 | +{ |
| 96 | + return vt_->transfer_to(ex_, target, c); |
| 97 | +} |
| 98 | +``` |
| 99 | +
|
| 100 | +The `vtable_for<Ex>` template gains a corresponding lambda: |
| 101 | +
|
| 102 | +```cpp |
| 103 | +[](void const* p, executor_ref target, continuation& c) -> std::coroutine_handle<> { |
| 104 | + return static_cast<Ex const*>(p)->transfer_to(target, c); |
| 105 | +}, |
| 106 | +``` |
| 107 | + |
| 108 | +## Per-Executor Implementation |
| 109 | + |
| 110 | +### thread_pool::executor_type |
| 111 | + |
| 112 | +The thread pool has no serialization state. `transfer_to` just forwards to the target: |
| 113 | + |
| 114 | +```cpp |
| 115 | +std::coroutine_handle<> |
| 116 | +transfer_to(executor_ref target, continuation& c) const |
| 117 | +{ |
| 118 | + return target.dispatch(c); |
| 119 | +} |
| 120 | +``` |
| 121 | +
|
| 122 | +The thread pool was never affected by the strand escape bug. Its `dispatch` already calls `post` and returns `noop_coroutine()`, so the symmetric transfer chain always breaks at the thread pool boundary. |
| 123 | +
|
| 124 | +### strand |
| 125 | +
|
| 126 | +The strand is why `transfer_to` exists. Its implementation posts to the target instead of dispatching: |
| 127 | +
|
| 128 | +```cpp |
| 129 | +// strand.hpp |
| 130 | +std::coroutine_handle<> |
| 131 | +transfer_to(executor_ref target, continuation& c) const |
| 132 | +{ |
| 133 | + return detail::strand_service::transfer_to( |
| 134 | + *impl_, executor_ref(ex_), target, c); |
| 135 | +} |
| 136 | +``` |
| 137 | + |
| 138 | +```cpp |
| 139 | +// strand_service.cpp |
| 140 | +std::coroutine_handle<> |
| 141 | +strand_service::transfer_to( |
| 142 | + strand_impl& impl, executor_ref inner_ex, |
| 143 | + executor_ref target, continuation& c) |
| 144 | +{ |
| 145 | + target.post(c); |
| 146 | + return std::noop_coroutine(); |
| 147 | +} |
| 148 | +``` |
| 149 | +
|
| 150 | +This is deliberately minimal. The strand does not touch `dispatch_thread_` or `locked_`. It does not need to. Here is why: |
| 151 | +
|
| 152 | +When `transfer_to` is called, we are inside the invoker's `dispatch_pending` call, deep in a `.resume()` on a batch item. The invoker loop looks like this: |
| 153 | +
|
| 154 | +```cpp |
| 155 | +for(;;) |
| 156 | +{ |
| 157 | + set_dispatch_thread(*p); |
| 158 | + dispatch_pending(*p); // we are here |
| 159 | + if(try_unlock(*p)) |
| 160 | + { |
| 161 | + clear_dispatch_thread(*p); |
| 162 | + co_return; |
| 163 | + } |
| 164 | +} |
| 165 | +``` |
| 166 | + |
| 167 | +`target.post(c)` queues the inner task on the target executor. Returning `noop_coroutine()` causes the coroutine to suspend, so `.resume()` returns and `dispatch_batch` moves to the next item. The batch finishes. The invoker loop reaches `try_unlock`, which either unlocks the strand (if the queue is empty) or loops to drain more work. Either way, the strand releases through its normal path. |
| 168 | + |
| 169 | +The inner task `f()` runs concurrently on the target executor. When it finishes, the trampoline dispatches the parent back to the strand through the normal enqueue path. |
| 170 | + |
| 171 | +An earlier version of this document proposed calling `clear_dispatch_thread` and `try_unlock` from inside `transfer_to`. That is wrong. Calling `try_unlock` mid-batch can set `locked_ = false` while the invoker is still processing items. If new work arrives and triggers a second invoker, two invokers run concurrently and the strand's serialization invariant breaks. The invoker loop is the only safe place to manipulate `locked_` and `dispatch_thread_`. |
| 172 | + |
| 173 | +### Asio-style io_context bridges |
| 174 | + |
| 175 | +User-written executor adapters (like `asio_executor` in the examples) have no serialization state. They get the same trivial implementation as thread_pool: |
| 176 | + |
| 177 | +```cpp |
| 178 | +std::coroutine_handle<> |
| 179 | +transfer_to(executor_ref target, continuation& c) const |
| 180 | +{ |
| 181 | + return target.dispatch(c); |
| 182 | +} |
| 183 | +``` |
| 184 | +
|
| 185 | +Any user-defined executor that implements its own serialization (an actor, an ordered queue, a custom strand-like primitive) should follow the strand pattern: post to the target instead of dispatching, so the current serialization frame can close normally. |
| 186 | +
|
| 187 | +## Trampoline Change |
| 188 | +
|
| 189 | +Both the forward trip and the return trip need `transfer_to`. The trampoline must store both executors: |
| 190 | +
|
| 191 | +```cpp |
| 192 | +struct promise_type |
| 193 | +{ |
| 194 | + executor_ref caller_ex_; |
| 195 | + executor_ref target_ex_; // NEW |
| 196 | + continuation parent_; |
| 197 | +}; |
| 198 | +``` |
| 199 | + |
| 200 | +**Forward trip** (`run_awaitable_ex::await_suspend`): |
| 201 | + |
| 202 | +Currently: |
| 203 | + |
| 204 | +```cpp |
| 205 | +task_cont_.h = h; |
| 206 | +return ex_.dispatch(task_cont_); |
| 207 | +``` |
| 208 | + |
| 209 | +Becomes: |
| 210 | + |
| 211 | +```cpp |
| 212 | +task_cont_.h = h; |
| 213 | +return caller_env->executor.transfer_to(ex_, task_cont_); |
| 214 | +``` |
| 215 | + |
| 216 | +This fixes Case 1. If the caller is on a strand, the strand posts to the target and returns `noop_coroutine()`. The coroutine suspends, the strand's batch finishes, and the invoker loop releases the strand normally. |
| 217 | + |
| 218 | +**Return trip** (trampoline `final_suspend`): |
| 219 | + |
| 220 | +Currently: |
| 221 | + |
| 222 | +```cpp |
| 223 | +return detail::symmetric_transfer( |
| 224 | + p_->caller_ex_.dispatch(p_->parent_)); |
| 225 | +``` |
| 226 | + |
| 227 | +Becomes: |
| 228 | + |
| 229 | +```cpp |
| 230 | +return detail::symmetric_transfer( |
| 231 | + p_->target_ex_.transfer_to(p_->caller_ex_, p_->parent_)); |
| 232 | +``` |
| 233 | + |
| 234 | +This fixes Case 2. If the target is a strand, the strand posts the parent to the caller's executor and returns `noop_coroutine()`. The trampoline suspends, control returns to the strand's `safe_resume` call, and the invoker loop proceeds to drain and unlock. |
| 235 | + |
| 236 | +If neither the caller nor the target is a strand, both `transfer_to` calls forward to `target.dispatch(c)`, preserving the inline fast path. |
| 237 | + |
| 238 | +## Alternative: Guard Object Instead of `transfer_to` |
| 239 | + |
| 240 | +The `transfer_to` design requires every launch function author to call `transfer_to` on both trips. If someone writes a custom `run`-like function and forgets the return trip, they reintroduce the bug. An RAII guard in the coroutine frame could make the fix automatic. Three approaches: |
| 241 | + |
| 242 | +### Option A: Flag in io_env |
| 243 | + |
| 244 | +The strand sets a flag in the `io_env` when it dispatches a coroutine. The trampoline checks the flag on both trips and posts instead of dispatching when it is set. |
| 245 | + |
| 246 | +`io_env` gains a bool: |
| 247 | + |
| 248 | +```cpp |
| 249 | +struct io_env |
| 250 | +{ |
| 251 | + executor_ref executor; |
| 252 | + std::stop_token stop_token; |
| 253 | + std::pmr::memory_resource* frame_allocator = nullptr; |
| 254 | + bool serialized = false; // NEW |
| 255 | +}; |
| 256 | +``` |
| 257 | + |
| 258 | +The strand's `dispatch` sets `serialized = true` in the env before returning the handle. The trampoline reads it: |
| 259 | + |
| 260 | +```cpp |
| 261 | +// forward trip (await_suspend) |
| 262 | +if(caller_env->serialized) |
| 263 | +{ |
| 264 | + ex_.post(task_cont_); |
| 265 | + return std::noop_coroutine(); |
| 266 | +} |
| 267 | +return ex_.dispatch(task_cont_); |
| 268 | + |
| 269 | +// return trip (final_suspend) |
| 270 | +if(/* target env was serialized */) |
| 271 | +{ |
| 272 | + caller_ex_.post(parent_); |
| 273 | + return std::noop_coroutine(); |
| 274 | +} |
| 275 | +return caller_ex_.dispatch(parent_); |
| 276 | +``` |
| 277 | + |
| 278 | +**For:** Zero-cost for non-strand cases (one branch on a bool). No changes to the Executor concept. The strand marks it, the trampoline reads it, the user never touches it. |
| 279 | + |
| 280 | +**Against:** The env is `const` from the awaitable's perspective - the strand would need to set the flag before the env reaches the awaitable, which means the flag lives in the env owned by the `run` awaitable, not the caller's env. On the return trip, the trampoline needs to know whether the *target's* env was serialized, but the trampoline only stores the caller's executor, not the target's env. Plumbing the target's serialization flag to the trampoline adds complexity similar to storing `target_ex_`. The env also flows through the entire coroutine chain, so the flag would affect nested `run` calls - a coroutine on a strand that calls `run(strand2)(f())` inside `run(pool)(g())` would see `serialized = true` from the outer strand even though the inner context is a pool. |
| 281 | + |
| 282 | +### Option B: TLS set by the strand invoker |
| 283 | + |
| 284 | +The strand's invoker loop sets a TLS variable before calling `dispatch_pending` and clears it after. Any executor's `dispatch` checks this TLS variable and posts instead of inlining when it is set. |
| 285 | + |
| 286 | +```cpp |
| 287 | +// strand invoker loop |
| 288 | +inline thread_local strand_impl* current_strand = nullptr; |
| 289 | + |
| 290 | +static strand_invoker make_invoker(strand_impl& impl) |
| 291 | +{ |
| 292 | + strand_impl* p = &impl; |
| 293 | + for(;;) |
| 294 | + { |
| 295 | + set_dispatch_thread(*p); |
| 296 | + current_strand = p; |
| 297 | + dispatch_pending(*p); |
| 298 | + current_strand = nullptr; |
| 299 | + if(try_unlock(*p)) |
| 300 | + { |
| 301 | + clear_dispatch_thread(*p); |
| 302 | + co_return; |
| 303 | + } |
| 304 | + } |
| 305 | +} |
| 306 | +``` |
| 307 | +
|
| 308 | +Every executor's `dispatch` checks: |
| 309 | +
|
| 310 | +```cpp |
| 311 | +std::coroutine_handle<> dispatch(continuation& c) const |
| 312 | +{ |
| 313 | + if(detail::current_strand) |
| 314 | + { |
| 315 | + post(c); |
| 316 | + return std::noop_coroutine(); |
| 317 | + } |
| 318 | + // normal dispatch logic |
| 319 | +} |
| 320 | +``` |
| 321 | + |
| 322 | +**For:** Fully automatic. No changes to `run`, trampolines, or any launch function. Every executor boundary crossing inside a strand batch posts instead of inlining. Covers both trips, all launch functions, and user-defined launch functions that don't know about `transfer_to`. |
| 323 | + |
| 324 | +**Against:** Every executor's `dispatch` pays a TLS read on every call, even when no strand is involved. Invasive - every concrete executor and `executor_ref::dispatch` must add the check. Nested strands need save/restore (strand B's invoker would clear the TLS set by strand A's invoker). The TLS approach also prevents legitimate inlining within a strand - if a coroutine on a strand dispatches more work to the same strand, it should inline (that is what `running_in_this_thread()` enables), but the TLS check would force it to post. Distinguishing "dispatching to a foreign executor" from "dispatching to the same strand" requires comparing the current strand pointer, adding more logic to every dispatch call. |
| 325 | + |
| 326 | +### Option C: Continuation wrapper |
| 327 | + |
| 328 | +The strand wraps every dispatched continuation in a guard that intercepts the symmetric transfer chain. When the coroutine suspends and `await_suspend` returns a handle that would leave the strand, the guard detects this and posts instead. |
| 329 | + |
| 330 | +**For:** In theory, fully automatic and encapsulated in the strand. |
| 331 | + |
| 332 | +**Against:** Not implementable with the current coroutine model. Symmetric transfer happens inside the C++ runtime - `await_suspend` returns a `std::coroutine_handle<>` and the runtime tail-calls it. There is no interception point between the return from `await_suspend` and the resumption of the target handle. The strand cannot inspect or redirect the handle after `await_suspend` returns it. The `strand_op` wrapper already wraps the target in a coroutine, but `safe_resume(target)` follows the entire symmetric transfer chain before returning - the wrapper only gets control back after the chain ends, which is too late. |
| 333 | + |
| 334 | +### Recommendation |
| 335 | + |
| 336 | +Option B (TLS) is the most automatic but too invasive and has the wrong default for same-strand dispatch. Option C is not implementable. Option A (env flag) has the right shape but the plumbing to get the flag to both trips is roughly as complex as storing `target_ex_` in the trampoline. |
| 337 | + |
| 338 | +`transfer_to` on the Executor concept remains the cleanest design. The cost is that launch function authors must call it on both trips. Since launch functions are library machinery (not user code), and capy ships the primary one (`run`), this is an acceptable constraint. The alternative is to provide `transfer_to` on the concept AND use it automatically inside `run`'s trampoline, so users who write `co_await run(ex)(f())` never think about it. Custom launch functions that want the same correctness call `transfer_to`; those that don't care about strands can use plain `dispatch`. |
| 339 | + |
| 340 | +## Test Impact |
| 341 | + |
| 342 | +`testRunExStrandFirstInstruction` verifies that `running_in_this_thread()` is true inside an inner task passed to `run(strand)`. With `transfer_to` on the forward trip, the caller's executor (pool) calls `transfer_to` which forwards to `strand.dispatch(c)`. The strand is not running on this thread, so it enqueues and posts an invoker. The inner task runs inside the invoker where `running_in_this_thread()` is true. The test should still pass. |
| 343 | + |
| 344 | +Two new tests are needed: |
| 345 | + |
| 346 | +- **Case 1 regression test:** Two coroutines on the same strand. One does `co_await run(pool_ex)(slow_task())`. The second coroutine should make progress while `slow_task` runs on the pool. |
| 347 | + |
| 348 | +- **Case 2 regression test:** A coroutine on an io_context does `co_await run(strand)(f())`. After `f()` completes and the parent resumes, verify the strand is free (not held by the parent's execution). |
0 commit comments