
Commit d1e5b5c

Strand: per-strand implementation with shared mutex pool
Replace the shared-impl strand pool with a per-strand implementation
backed by a shared pool of mutexes. Removes the bucket-collision class
where independent strands sharing a slot serialized against each other
and compared equal.

- strand_impl is per-strand, allocated by the service
- Service holds a 193-mutex pool, hashed by impl address mixed with a
  per-service salt; collisions share a mutex, never pending work
- Coroutine invoker keeps the impl alive via its frame parameter;
  invoker frames recycle through a single-slot per-service cache,
  closed under a kCacheClosed sentinel during shutdown
- Service tracks live impls via intrusive_list for shutdown traversal
- Service back-pointer in strand_impl is atomic so the destructor's
  load pairs with shutdown's store

Adds doc/strand-spec.md (design contract) and doc/strand-rationale.md
(why the redesign was needed).

Tests: equality non-collision regression, cross-strand independence,
transient strand lifetime via weak_ptr expiry, many-strands stress,
deterministic mutex-pool collision isolation.
1 parent 2b3fe69 commit d1e5b5c

8 files changed

Lines changed: 984 additions & 179 deletions

File tree

doc/strand-rationale.md

Lines changed: 146 additions & 0 deletions
# Strand: Why Per-Strand Implementation

A strand has two reasonable internal designs. The simpler one pools
serialization state across strands; the correct one allocates state
per-strand. Capy uses the per-strand design. This document explains why
the simpler design is wrong and what the per-strand design costs.

## The previous design

Capy's original strand service held a fixed array of `strand_impl`
objects, 211 slots, allocated inline in the service and never freed
individually. When a user constructed a new strand, the service
incremented a counter and returned a pointer to `impls_[counter % 211]`.

```cpp
strand_impl impls_[211];
std::size_t salt_;

strand_impl* get_implementation()
{
    std::lock_guard lock(mutex_);
    return &impls_[salt_++ % 211];
}
```

This is pure round-robin: the 1st strand gets slot 0, the 212th strand
gets slot 0 again. Two strands that map to the same slot share the same
`strand_impl` object.

Each `strand_impl` holds:

- a mutex (`mutex_`)
- a pending operation queue (`pending_`)
- a locked flag (`locked_`)
- the executor identity used by whichever invoker is currently
  dispatching

Two strands that share a slot share all of this.

## What sharing actually shares

Sharing a mutex is not inherently a problem. Two strands that hold the
same mutex contend on push and pop operations, which are brief. They
still proceed independently afterward.

Sharing a queue and a locked flag is a different matter. Those are the
state machine that determines which work runs, in what order, and
through which executor. When two logically independent strands share
this state, the following become possible:

**Cross-strand blocking.** Strand A is mid-dispatch, so `locked_` is
true. Strand B posts a new operation. B's post sees `locked_` already
set and adds its work to the shared queue without posting a new
invoker. B's work now waits behind A's entire dispatch cycle, even
though A and B are supposed to be independent.

**Wrong executor dispatch.** The invoker that won the unlocked-to-locked
transition captures the executor of the strand that triggered it. Call
this strand A. If strand B later enqueues work into the shared state,
that work runs through A's executor, not B's. For strands that wrap
the same underlying thread pool, this is invisible. For strands that
wrap different executor layers (a metrics wrapper, a type-erased
`any_executor`, a test shim), operations execute through the wrong
executor, violating the invariants the user associated with B's
executor.

**False equality.** `operator==` on two distinct strands returns true
when they map to the same slot, because equality is defined as pointer
identity of the impl.

## Why per-strand is the right choice

The correctness argument is simple: strand isolation is part of the
contract. The word "strand" implies a serialization domain that is
independent of all other strands. A user who writes code against two
strands is justified in expecting that progress on one does not depend
on progress on the other, and that work posted to one runs through
that strand's executor, not a neighbor's.

The pooled design cannot provide this guarantee for more than 211
strands from the same context.

One possible response is randomization: instead of pure round-robin,
use a hash of the strand's address mixed with a salt counter. This
spreads collisions across time so that (0, 211), (1, 212) are no longer
the deterministic collision pairs. It does not remove collisions. With
1000 strands from one context, each of the 211 slots is shared by
roughly five strands on average. The bug surface is narrower and harder
to trigger reproducibly, but the class of bug is identical.

Randomization fixed a performance symptom (deterministic starvation)
without fixing the correctness problem (shared state between independent
strands). Treating these as the same fix is a category error.

The per-strand design removes the impl pool entirely. Each strand
allocates its own `strand_impl` via `make_shared`. Two strands never
share a queue, a locked flag, or an invoker. Isolation is unconditional.

The mutex pool stays. 193 mutexes for any number of strands is a real
saving over allocating a mutex per strand. Unlike the impl pool, mutex
sharing has no semantic consequence: the critical sections guarded by
the mutex cover only push/pop and the locked-flag check. Two strands
that briefly contend on a shared mutex wait for each other's push/pop,
then proceed independently. No state crosses the boundary.

The key insight is that isolation and contention are not the same
problem. The impl pool conflated them. Removing the impl pool eliminates
the isolation problem; keeping the mutex pool manages the contention
cost without reintroducing the isolation problem.

## What the per-strand design costs

**One allocation per strand.** `make_shared<strand_impl>` allocates
roughly 80-96 bytes on typical allocators with per-thread arenas
(glibc, jemalloc, tcmalloc). For any strand that posts at least one
operation, this is negligible against the work being dispatched.

**One pointer of additional size per strand handle.** The strand object
holds a `shared_ptr<strand_impl>` rather than a raw pointer. A
`shared_ptr` is two pointers wide; a raw pointer is one. Strand objects
grow by one pointer (typically 8 bytes).

**Two atomic refcount operations per invoker creation/destruction.** The
invoker coroutine frame holds a copy of the `shared_ptr`, so the
reference count increments when the invoker starts and decrements when
it finishes. These are not on the hot post path; they happen at the
unlocked-to-locked transition (once per dispatch batch), not on every
enqueue.

The mutex pool bounds memory growth at 193 mutexes regardless of how
many strands exist. A program that creates 10,000 strands does not get
10,000 mutexes; it gets at most 193.

## Tradeoffs we did not take

**Per-strand mutex.** Allocating a mutex per strand would eliminate the
mutex pool entirely and remove all cross-strand contention. The cost is
roughly 40 extra bytes per strand. The benefit is marginal: the
critical sections that use the pool mutex are brief, and contention
between unrelated strands is unlikely in practice. This option remains
open if benchmarks show real contention under specific workloads.

The chosen design (per-strand impl, shared mutex pool) matches the
strategy used by current executor-aware strand implementations in the
C++ library space, which provides confidence that the tradeoffs are
well understood.
