# Strand: Why Per-Strand Implementation

A strand has two reasonable internal designs. The simpler one pools
serialization state across strands; the correct one allocates state
per-strand. Capy uses the per-strand design. This document explains why
the simpler design is wrong and what the per-strand design costs.

## The previous design

Capy's original strand service held a fixed array of `strand_impl`
objects, 211 slots, allocated inline in the service and never freed
individually. When a user constructed a new strand, the service
incremented a counter and returned a pointer to `impls_[counter % 211]`.

```cpp
strand_impl impls_[211]; // fixed pool: every strand maps into these slots
std::size_t salt_;       // monotonically increasing round-robin counter

strand_impl* get_implementation()
{
    std::lock_guard lock(mutex_);
    return &impls_[salt_++ % 211]; // the 212th strand wraps back to slot 0
}
```

This is pure round-robin: the 1st strand gets slot 0, the 212th strand
gets slot 0 again. Two strands that map to the same slot share the same
`strand_impl` object.

Each `strand_impl` holds:

- a mutex (`mutex_`)
- a pending operation queue (`pending_`)
- a locked flag (`locked_`)
- the executor identity used by whichever invoker is currently
  dispatching

Two strands that share a slot share all of this.

## What sharing actually shares

Sharing a mutex is not inherently a problem. Two strands that hold the
same mutex contend on push and pop operations, which are brief. They
still proceed independently afterward.

Sharing a queue and a locked flag is a different matter. Those are the
state machine that determines which work runs, in what order, and
through which executor. When two logically independent strands share
this state, the following become possible:

**Cross-strand blocking.** Strand A is mid-dispatch, so `locked_` is
true. Strand B posts a new operation. B's post sees `locked_` already
set and adds its work to the shared queue without posting a new
invoker. B's work now waits behind A's entire dispatch cycle, even
though A and B are supposed to be independent.

**Wrong executor dispatch.** The invoker that won the unlocked-to-locked
transition captures the executor of the strand that triggered it. Call
this strand A. If strand B later enqueues work into the shared state,
that work runs through A's executor, not B's. For strands that wrap
the same underlying thread pool, this is invisible. For strands that
wrap different executor layers (a metrics wrapper, a type-erased
`any_executor`, a test shim), operations execute through the wrong
executor, violating the invariants the user associated with B's
executor.

**False equality.** `operator==` on two distinct strands returns true
when they map to the same slot, because equality is defined as pointer
identity of the impl.

## Why per-strand is the right choice

The correctness argument is simple: strand isolation is part of the
contract. The word "strand" implies a serialization domain that is
independent of all other strands. A user who writes code against two
strands is justified in expecting that progress on one does not depend
on progress on the other, and that work posted to one runs through
that strand's executor, not a neighbor's.

The pooled design cannot provide this guarantee for more than 211
strands from the same context.

One possible response is randomization: instead of pure round-robin,
use a hash of the strand's address mixed with a salt counter. This
spreads collisions across time so that (0, 211), (1, 212) are no longer
the deterministic collision pairs. It does not remove collisions. With
1000 strands from one context, each slot is shared by roughly five
strands on average (1000 / 211 ≈ 4.7). The bug surface is narrower and
harder to trigger reproducibly, but the class of bug is identical.

Randomization fixed a performance symptom (deterministic starvation)
without fixing the correctness problem (shared state between independent
strands). Treating these as the same fix is a category error.

The per-strand design removes the impl pool entirely. Each strand
allocates its own `strand_impl` via `make_shared`. Two strands never
share a queue, a locked flag, or an invoker. Isolation is unconditional.

The mutex pool stays: 193 mutexes for any number of strands is a real
saving over allocating a mutex per strand. Unlike the impl pool, mutex
sharing has no semantic consequence: the critical sections guarded by
the mutex cover only push/pop and the locked flag check. Two strands
that briefly contend on a shared mutex wait for each other's push/pop,
then proceed independently. No state crosses the boundary.

The key insight is that isolation and contention are not the same
problem. The impl pool conflated them. Removing the impl pool eliminates
the isolation problem; keeping the mutex pool manages the contention
cost without reintroducing the isolation problem.

## What the per-strand design costs

**One allocation per strand.** `make_shared<strand_impl>` allocates
roughly 80-96 bytes on typical allocators with per-thread arenas
(glibc, jemalloc, tcmalloc). For any strand that posts at least one
operation, this is negligible against the work being dispatched.

**One pointer of additional size per strand handle.** The strand object
holds a `shared_ptr<strand_impl>` rather than a raw pointer. A
`shared_ptr` is two pointers wide; a raw pointer is one. Strand objects
grow by one pointer (typically 8 bytes).

**Two atomic refcount operations per invoker creation/destruction.** The
invoker coroutine frame holds a copy of the `shared_ptr`, so the
reference count increments when the invoker starts and decrements when
it finishes. These are not on the hot post path; they happen at the
unlocked-to-locked transition (once per dispatch batch), not on every
enqueue.

The mutex pool bounds memory growth at 193 mutexes regardless of how
many strands exist. A program that creates 10,000 strands does not get
10,000 mutexes; it gets at most 193.

## Tradeoffs we did not take

**Per-strand mutex.** Allocating a mutex per strand would eliminate the
mutex pool entirely and remove all cross-strand contention. The cost is
roughly 40 extra bytes per strand. The benefit is marginal: the
critical sections that use the pool mutex are brief, and contention
between unrelated strands is unlikely in practice. This option remains
open if benchmarks show real contention under specific workloads.

The chosen design (per-strand impl, shared mutex pool) matches the
strategy used by current executor-aware strand implementations in the
C++ library space, which provides confidence that the tradeoffs are
well understood.