Commit 18bff4d
Performance plan.
1 parent 725193e commit 18bff4d

1 file changed

Lines changed: 209 additions & 0 deletions
plan.md
# Performance Plan: Lessons from `carbon_fiber`

Working notes after reviewing [`yaroslav/carbon_fiber`](https://github.com/yaroslav/carbon_fiber) (Zig + libxev fiber scheduler) and reproducing its benchmarks on `hana` (Intel Xeon E3-1240v6, 4c/8t, kernel 6.19, Ruby 4.0.2 +YJIT, io_uring, 3 measured runs).

## Reproduced benchmarks (hana)

`carbon_fiber-0.1.3` (prebuilt gem) vs `async-2.38.1` + `io-event-1.16.0`, run from carbon_fiber's own `benchmarks/` harness:

| Workload | Carbon Fiber | Async/io-event | Δ (hana) | Δ (their README) |
|---|---|---|---|---|
| `http_server` | 33.43k req/s | 29.61k req/s | **+12.9%** | +60% |
| `tcp_echo` | 35.14k ops/s | 30.16k ops/s | **+16.5%** | +64% |
| `http_client_api` | 10.43k req/s | 8.39k req/s | **+24.3%** | +17% |
| `http_client_download` | 4.47k dl/s | 4.22k dl/s | +5.9% | +19% |
| `connection_pool` | 4.97k co/s | 4.58k co/s | +8.5% | +8% |
| `fan_out_gather` | 2.06k cyc/s | 1.93k cyc/s | +6.7% | +5% |
| `db_query_mix` | 1.66k qry/s | 1.60k qry/s | +4.1% | +2% |
| `cascading_timeout` | 4.64k ops/s | 4.35k ops/s | +6.7% | +6% |

Carbon Fiber wins every workload, by +4% to +24%. Their headline +60%/+64% numbers don't reproduce on this hardware (probably because Zen 4 on c7a.2xlarge is more scheduler-bound at higher absolute throughput); the smaller claims (+2% to +8%) reproduce closely.
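
For reference, the Δ column is a plain relative throughput gain. A minimal sketch of that arithmetic, using three rows of rounded figures from the table (the helper name `delta_percent` and the constant `HANA_SAMPLE` are made up for this note):

```ruby
# Relative throughput gain of `candidate` over `baseline`, in percent.
def delta_percent(candidate, baseline)
  (candidate / baseline - 1.0) * 100.0
end

# Rounded figures copied from the table above (thousands of units per second).
HANA_SAMPLE = {
  "http_server" => [33.43, 29.61],
  "tcp_echo" => [35.14, 30.16],
  "http_client_api" => [10.43, 8.39],
}.freeze

HANA_SAMPLE.each do |workload, (carbon_fiber, async)|
  puts format("%-16s %+.1f%%", workload, delta_percent(carbon_fiber, async))
end
```

Because the table prints rounded throughput, recomputing a Δ from those figures can differ from the printed value in the last digit (`db_query_mix` is one such row).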

## Optimisation catalogue (ranked by expected impact)

### Tier 1: biggest wins, but need Ruby-side cooperation

#### 1.1 New scheduler hooks: `socket_recvmsg` / `socket_sendmsg`

**The plan**: add explicit socket-only hooks to the Ruby fiber scheduler protocol. With a guaranteed-socket contract on the C side we can implement the userspace fast path cleanly:

```c
// In io-event, when the new hooks land (parameters elided in these notes):
ssize_t IO_Event_Selector_URing_socket_recvmsg(...) {
    // Fast path: non-blocking attempt in userspace.
    ssize_t n = recvmsg(fd, &msg, MSG_DONTWAIT);
    if (n >= 0) return n;
    // Anything other than "would block" is a real error.
    if (errno != EAGAIN && errno != EWOULDBLOCK) return -errno;
    // Slow path: submit io_uring SQE for recvmsg and yield.
}
```

No `ENOTSOCK` worry, no `fcntl` dance, no per-fd type-caching. The contract is "this fd is a socket" by construction.

This is the cleanest path to the carbon_fiber-style `recvOnce`/`sendOnce` fast path. Carbon Fiber's biggest single win (~half of their `tcp_echo` / `http_server` lead) comes from skipping the io_uring round-trip when data is already buffered, and that's exactly what this enables.

**Effort**: depends on the Ruby side. Once the hooks exist, the io-event change is ~200 lines for `recvmsg` + `sendmsg`.

**Expected gain**: 5–10% on `tcp_echo` and `http_server` once Ruby's socket library routes through the new hooks.

---

### Tier 2: io-event-only changes, no Ruby-side dependency

#### 2.1 Fiber chaining in `IO_Event_Selector_loop_yield`

**Current behaviour**: every user fiber yield goes `user → loop_fiber`; then `select` calls `IO_Event_Selector_ready_flush` which does `loop_fiber → next_user_fiber`. Two `rb_fiber_transfer` calls per scheduling decision.

**Carbon Fiber's trick**: when a user fiber yields, peek the ready queue first. If non-empty, `rb_fiber_transfer` directly to the head of the queue, skipping the loop-fiber round-trip. One transfer per scheduling decision.

```
// Pseudocode for IO_Event_Selector_loop_yield:
if (backend->ready) {
    // Chain straight to the next ready fiber.
    pop head;
    return rb_fiber_transfer(head->fiber, 0, NULL);
}
// Nothing ready: fall back to the loop fiber.
return rb_fiber_transfer(backend->loop, 0, NULL);
```

**Risk**: correctness around `rb_fiber_raise`. carbon_fiber needs a `blocked_fibers` tracking set to avoid stranding fibers when a raise bypasses the normal park/resume cycle. We already have `IO_Event_Selector_resume`/`_raise` plumbing; we need to audit whether chaining interacts safely with our existing in-flight io_uring completions.

**Effort**: ~50 lines + careful tests around interrupt-during-yield.

**Expected gain**: 5–10% on hot scheduling workloads (`tcp_echo`, `db_query_mix`).
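
The transfer-count argument can be checked with a toy model in plain Ruby. This is only a sketch under stated assumptions: `ToyScheduler`, `yield_now` and `count_transfers` are hypothetical names, fibers re-queue themselves on yield (there is no real I/O), and nothing here models `Fiber#raise` or io_uring completions.

```ruby
# Toy model: counts Fiber#transfer calls with and without chaining.
class ToyScheduler
  attr_reader :transfers

  def initialize(chain:)
    @chain = chain      # true: yield straight to the next ready fiber
    @ready = []         # FIFO ready queue
    @transfers = 0
    @loop = nil
  end

  def spawn(&block)
    @ready << Fiber.new do
      block.call(self)
      switch_away       # finished: hand control on without re-queueing
    end
  end

  # Called by a user fiber that wants to let others run.
  def yield_now
    @ready << Fiber.current
    switch_away
  end

  # The calling fiber plays the role of the loop fiber.
  def run
    @loop = Fiber.current
    while (fiber = @ready.shift)
      transfer(fiber)
    end
  end

  private

  def switch_away
    target = (@chain && @ready.shift) || @loop
    return if target.equal?(Fiber.current) # we were the only ready fiber
    transfer(target)
  end

  def transfer(fiber)
    @transfers += 1
    fiber.transfer
  end
end

def count_transfers(chain:)
  scheduler = ToyScheduler.new(chain: chain)
  2.times { scheduler.spawn { |s| 3.times { s.yield_now } } }
  scheduler.run
  scheduler.transfers
end

puts "via loop fiber: #{count_transfers(chain: false)} transfers"
puts "chained:        #{count_transfers(chain: true)} transfers"
```

In this model every run segment costs two transfers without chaining and roughly one with it, which matches the "two versus one transfer per scheduling decision" description above.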

#### 2.2 Opportunistic `io_uring_get_events` CQ peek before yielding to loop_fiber

**Carbon Fiber** (`doTransferToLoop` lines 1122–1135): on io_uring only, when the ready queue is empty but `active_waiters > 0`, peek the completion queue once before transferring to loop_fiber. On loopback workloads (tcp_echo, http) the peer's response has often landed by the time we reach the yield point, so chaining saves a full loop_fiber round-trip.

**Composes with our recent work**: this pairs naturally with the `IORING_SETUP_TASKRUN_FLAG` work we just landed. We already gate `io_uring_get_events()` on `IORING_SQ_TASKRUN`; this is the same syscall used to absorb a free completion.

```
// In loop_yield, before transferring to loop:
if (selector->pending_io > 0) {
    io_uring_get_events(&selector->ring); // cheap when no work pending
    drain_cqes_into_ready_queue();
    if (backend->ready) goto chain_to_ready;
}
```

**Effort**: ~30 lines. No Ruby-side change.

**Expected gain**: 3–5% on workloads with localhost/short-RTT I/O.

#### 2.3 Push `fileno` extraction into C in `Async::Scheduler::Selector`

Carbon Fiber's `io_wait_object` / `io_read_object` / `io_write_object` accept the `IO` object directly and call `IO_Event_Selector_io_descriptor` (or equivalent) in C, skipping a `respond_to?(:fileno)` + `fileno` method-send pair per call.

io-event's native methods already take `VALUE io`. The waste is on the Async side:

```ruby
# async/lib/async/scheduler/selector/uring.rb (or similar)
def io_read(fiber, io, buffer, length, offset = 0)
  @selector.io_read(fiber, io.fileno, buffer, length, offset) # <- fileno in Ruby
end
```

**Effort**: trivial. Drop the `.fileno` and let the C side extract it.

**Expected gain**: 1–3% on read/write-heavy workloads, mostly from skipping the Ruby method call frame.

#### 2.4 Cache "is_socket" bit in the `Descriptor` struct

If we go ahead with a `recv(MSG_DONTWAIT)` fast path inside `io_read` (i.e. without waiting for the new `socket_recvmsg` hook, see 3.1), the cleanest way to handle `ENOTSOCK` is to cache the verdict per fd. The first call on a new non-socket fd costs one wasted `recv`; every subsequent call goes straight to the right syscall.

The URing selector already maintains a per-fd `Descriptor` struct (for close-watch state); add a `bool known_socket; bool known_not_socket;` pair (3 states: unknown, socket, not-socket).
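
A minimal Ruby model of the three-state cache, to make the "one wasted probe, then never again" behaviour concrete. `DescriptorKinds`, `socket?` and `forget` are hypothetical names; the block stands in for the first `recv(MSG_DONTWAIT)` attempt and returns `false` when that attempt would fail with `ENOTSOCK`. Invalidating on close is an assumption added here because fd numbers are reused.

```ruby
# Hypothetical tri-state cache keyed by fd: absent = unknown, else :socket / :not_socket.
class DescriptorKinds
  def initialize
    @kinds = {}
  end

  # Runs the probe block only while the verdict for this fd is still unknown.
  def socket?(fd)
    unless @kinds.key?(fd)
      is_socket = yield
      @kinds[fd] = is_socket ? :socket : :not_socket
    end
    @kinds[fd] == :socket
  end

  # Assumed to be called when the fd is closed, since fd numbers get reused.
  def forget(fd)
    @kinds.delete(fd)
  end
end

kinds = DescriptorKinds.new
probes = 0
3.times { kinds.socket?(7) { probes += 1; false } }
puts "probes issued for fd 7: #{probes}"
```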

**Effort**: ~30 lines.

**Expected gain**: makes 3.1 net-positive on mixed socket/pipe workloads. Standalone gain is zero.

---

### Tier 3: fast path inside existing `io_read` / `io_write` hooks

These are the changes we'd make *without* waiting for new Ruby hooks. They overlap with Tier 1, so revisit after `socket_recvmsg`/`socket_sendmsg` land.

#### 3.1 Userspace `recv(MSG_DONTWAIT)` probe before io_uring submission

**Carbon Fiber's `ioRead`**: try `recv(MSG_DONTWAIT)` first; if `> 0` return immediately, if `ENOTSOCK` fall back to `read(2)`, if `EAGAIN` go to io_uring slow path.

For io-event, the safe order is:
1. If descriptor known-not-socket: skip to `read(2)` probe.
2. Else `recv(MSG_DONTWAIT)`:
   - `> 0`: return (optionally `drainRecv` more, see 3.3).
   - `0`: EOF, return 0.
   - `ENOTSOCK`: mark known-not-socket, fall through to `read(2)`.
   - `EAGAIN`/`EWOULDBLOCK`: fall through to io_uring slow path.
   - Other errno: return as error.

**Risk**: changes observable behaviour (different syscalls show up in strace/audit/seccomp). For mainstream callers it's invisible.

**Effort**: ~80 lines in `uring.c`'s `IO_Event_Selector_URing_io_read` (and symmetric for `io_write`).

**Expected gain**: 8–15% on `tcp_echo`, `http_server`. This is the single biggest win identified.

**Supersedes**: the existing `length == 0` (readpartial) `fcntl` dance. That path becomes one wasted `recv` instead of two `fcntl` calls on a non-socket, and zero overhead on a socket.
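
The decision ladder above can be approximated from Ruby with the standard non-blocking calls. This is a sketch, not the planned C change: `probe_read` is a hypothetical helper, `recv_nonblock` stands in for `recv(MSG_DONTWAIT)`, `read_nonblock` stands in for the `read(2)` probe, and an `is_a?(BasicSocket)` check replaces the `ENOTSOCK` verdict because Ruby does not readily let us call `recv(2)` on a non-socket `IO`. Other errnos surface as exceptions.

```ruby
require "socket"

# Returns [:data, bytes], [:eof, nil] or [:would_block, nil].
# :would_block marks the point where the real selector would submit an
# io_uring SQE and yield to the loop fiber.
def probe_read(io, length)
  result =
    if io.is_a?(BasicSocket)
      io.recv_nonblock(length, exception: false)  # stand-in for recv(MSG_DONTWAIT)
    else
      io.read_nonblock(length, exception: false)  # stand-in for the read(2) probe
    end

  case result
  when :wait_readable then [:would_block, nil]
  when nil, "" then [:eof, nil]  # EOF is nil or "" depending on the Ruby version
  else [:data, result]
  end
end

left, right = UNIXSocket.pair
p probe_read(left, 16)   # nothing buffered yet: slow path
right.write("ping")
p probe_read(left, 16)   # data already buffered: fast path
right.close
p probe_read(left, 16)   # peer closed: EOF
```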

#### 3.2 Adaptive per-fd probe-skip

After 3.1 is in: track consecutive `EAGAIN` returns per-fd (indexed by `fd & 0xFF` in carbon_fiber, or stored in our `Descriptor` struct). After N (carbon_fiber uses 3) consecutive misses, skip the userspace probe and go straight to io_uring for that slot. Reset on a successful hit.

Without this, long-polling workloads (websocket idle, long-lived keep-alive) pay one wasted `recv` per `io_read`. With it, the probe self-tunes.

**Effort**: ~20 lines on top of 3.1.

**Expected gain**: prevents 3.1 from regressing on long-poll workloads. Standalone: zero.
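
A small model of the miss counter, keyed by fd rather than by `fd & 0xFF`. `ProbeGate`, `probe?`, `record_miss` and `record_hit` are hypothetical names. The notes do not say what counts as a "hit" once the probe is being skipped, so this sketch leaves that decision to the caller of `record_hit`.

```ruby
# Tracks consecutive would-block probes per fd and gates the userspace probe.
class ProbeGate
  MISS_LIMIT = 3 # carbon_fiber's N, per the notes above

  def initialize
    @misses = Hash.new(0)
  end

  # Should io_read try the userspace probe for this fd?
  def probe?(fd)
    @misses[fd] < MISS_LIMIT
  end

  # The probe returned EAGAIN/EWOULDBLOCK.
  def record_miss(fd)
    @misses[fd] += 1
  end

  # Data was available; start probing again.
  def record_hit(fd)
    @misses.delete(fd)
  end
end

gate = ProbeGate.new
3.times { gate.record_miss(5) }
puts "probe fd 5? #{gate.probe?(5)}"
gate.record_hit(5)
puts "probe fd 5 after a hit? #{gate.probe?(5)}"
```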

#### 3.3 `drainRecv` after a successful first recv

After `recvOnce` returns N bytes, opportunistically try one more `recv(MSG_DONTWAIT)` to drain any remaining kernel buffer in the same `io_read` call. Skipped for `length == 0` (readpartial) since the caller only wants what's immediately available.

**Effort**: ~10 lines.

**Expected gain**: 1–2% on streaming workloads where messages span multiple TCP segments.

---

### Tier 4: smaller, more situational wins

#### 4.1 Combined R|W `io_wait` → poll-writable-only

Carbon Fiber: when `io_wait(events = READABLE | WRITABLE)` is called (common for `connect()` completion), only register a writable poll. Writability implies the connect finished, so return both flags. Saves one SQE.

**Effort**: trivial. **Expected gain**: marginal except for connect-heavy workloads.
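
The mask handling described in 4.1 amounts to two small functions. A sketch using Ruby's `IO::READABLE` / `IO::WRITABLE` constants; `poll_mask_for` and `reported_mask` are hypothetical names, and the behaviour mirrors the description of Carbon Fiber above rather than its actual source.

```ruby
RW_MASK = IO::READABLE | IO::WRITABLE

# Mask to register with the kernel for a requested io_wait mask.
def poll_mask_for(requested)
  if (requested & RW_MASK) == RW_MASK
    requested & ~IO::READABLE   # combined R|W wait: poll for writable only
  else
    requested
  end
end

# Mask to report to the caller once the poll fired with `fired`.
def reported_mask(requested, fired)
  combined = (requested & RW_MASK) == RW_MASK
  if combined && (fired & IO::WRITABLE) != 0
    fired | IO::READABLE        # writable after connect: report both flags
  else
    fired
  end
end

puts poll_mask_for(RW_MASK) == IO::WRITABLE
puts reported_mask(RW_MASK, IO::WRITABLE) == RW_MASK
```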
173+
174+
#### 4.2 Skip `MSG_PEEK` readiness probe in `io_wait`
175+
176+
When Ruby calls `io_wait`, the calling code has already seen EAGAIN, so a `MSG_PEEK` "is it ready?" probe is guaranteed to fail. io-event doesn't do this peek today, but worth noting we shouldn't add it.
177+
178+
#### 4.3 Near-deadline busy-spin for sub-3ms timers
179+
180+
Carbon Fiber's `SPIN_THRESHOLD = 3 ms`: for timers that close to firing, busy-spin instead of releasing GVL + re-arming a kernel timer. Specific to GVL release cost.
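
The shape of that decision, sketched in Ruby. `SPIN_THRESHOLD_SECONDS`, `spin_clock` and `wait_until_deadline` are hypothetical names, and `sleep` is only a stand-in for "release the GVL and arm a kernel timer"; the real trade-off lives in native code.

```ruby
SPIN_THRESHOLD_SECONDS = 0.003 # mirrors the 3 ms threshold quoted above

def spin_clock
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Waits until `deadline` (monotonic seconds). Returns :spun or :slept.
def wait_until_deadline(deadline)
  remaining = deadline - spin_clock
  if remaining > SPIN_THRESHOLD_SECONDS
    sleep(remaining)                    # stand-in for the kernel-timer path
    :slept
  else
    nil while spin_clock < deadline     # busy-spin: burns CPU, avoids a wake-up
    :spun
  end
end

puts wait_until_deadline(spin_clock + 0.001)
puts wait_until_deadline(spin_clock + 0.010)
```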
181+
182+
**Effort**: ~30 lines. **Expected gain**: ~2% on `cascading_timeout`. Probably not worth the complexity for that workload alone.
183+
184+
---
185+
186+
### Tier 5 — DO NOT COPY
187+
188+
#### 5.1 Net::HTTP keep-alive `wait_readable(0)` short-circuit
189+
190+
Carbon Fiber's `Scheduler#poll_io_now` returns `false` unconditionally for `BasicSocket.wait_readable(0)`. This is a deliberate Net::HTTP keep-alive optimisation — skips the MSG_PEEK staleness probe Net::HTTP performs before every request. Comment in the source acknowledges it costs one extra request's latency on a genuinely stale connection.
191+
192+
This accounts for some chunk of their `http_client_api` win. It's *functionally* correct on the happy path but it's a benchmark-tuned shortcut that would be wrong for io-event to copy at the framework level — too aggressive for a general-purpose library.
193+
194+
---
195+
196+
## Suggested order
197+
198+
1. **Land [`socket_recvmsg`/`socket_sendmsg`](https://bugs.ruby-lang.org/) hooks in Ruby + io-event** (1.1). Cleanest foundation, no compromises.
199+
2. **Fiber chaining** (2.1) — biggest scheduler-side win that's independent of the hook work. Needs careful tests.
200+
3. **CQ peek before loop yield** (2.2) — composes with (1.1) and (2.1), small lines-of-code, fits naturally into the post-PR-#166 code path.
201+
4. **`fileno` extraction in C** in Async's adapter (2.3) — trivial change, may as well land alongside.
202+
5. **Userspace fast path inside `io_read`/`io_write`** (3.1 + 3.2 + 2.4) — only if (1.1) is delayed. Otherwise prefer the new hook path.
203+
6. Re-benchmark on hana + c7a.2xlarge after each step.
204+
205+
## Open questions
206+
207+
- Status of `socket_recvmsg`/`socket_sendmsg` hooks upstream — do they need to be designed, or are they already proposed?
208+
- Does Ruby's `BasicSocket` need to opt in, or do we get coverage automatically through `Socket#recvmsg` / `#sendmsg`?
209+
- For (2.1), is there an existing test that exercises `Fiber#raise` against a fiber suspended via `io_wait`? If not, we should write one before changing `loop_yield`.