|
| 1 | +# Performance Plan: Lessons from `carbon_fiber` |
| 2 | + |
| 3 | +Working notes after reviewing [`yaroslav/carbon_fiber`](https://github.com/yaroslav/carbon_fiber) (Zig + libxev fiber scheduler) and reproducing its benchmarks on `hana` (Intel Xeon E3-1240v6, 4c/8t, kernel 6.19, Ruby 4.0.2 +YJIT, io_uring, 3 measured runs). |
| 4 | + |
| 5 | +## Reproduced benchmarks (hana) |
| 6 | + |
| 7 | +`carbon_fiber-0.1.3` (prebuilt gem) vs `async-2.38.1` + `io-event-1.16.0`, run from carbon_fiber's own `benchmarks/` harness: |
| 8 | + |
| 9 | +| Workload | Carbon Fiber | Async/io-event | Δ (hana) | Δ (their README) | |
| 10 | +|---|---|---|---|---| |
| 11 | +| `http_server` | 33.43k req/s | 29.61k req/s | **+12.9%** | +60% | |
| 12 | +| `tcp_echo` | 35.14k ops/s | 30.16k ops/s | **+16.5%** | +64% | |
| 13 | +| `http_client_api` | 10.43k req/s | 8.39k req/s | **+24.3%** | +17% | |
| 14 | +| `http_client_download` | 4.47k dl/s | 4.22k dl/s | +5.9% | +19% | |
| 15 | +| `connection_pool` | 4.97k co/s | 4.58k co/s | +8.5% | +8% | |
| 16 | +| `fan_out_gather` | 2.06k cyc/s | 1.93k cyc/s | +6.7% | +5% | |
| 17 | +| `db_query_mix` | 1.66k qry/s | 1.60k qry/s | +4.1% | +2% | |
| 18 | +| `cascading_timeout` | 4.64k ops/s | 4.35k ops/s | +6.7% | +6% | |
| 19 | + |
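| 20 | +Carbon Fiber wins every workload, by +4% to +24%. Their headline +60%/+64% numbers don't reproduce on this hardware (probably because Zen 4 on c7a.2xlarge is more scheduler-bound at higher absolute throughput), and neither does the +19% `http_client_download` claim (+5.9% here); `http_client_api` comes out ahead of their figure (+24.3% vs +17%). The smaller +2–8% claims reproduce to within about two percentage points. |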
| 21 | + |
| 22 | +## Optimisation catalogue (ranked by expected impact) |
| 23 | + |
| 24 | +### Tier 1 — biggest wins, but need Ruby-side cooperation |
| 25 | + |
| 26 | +#### 1.1 New scheduler hooks: `socket_recvmsg` / `socket_sendmsg` |
| 27 | + |
| 28 | +**The plan**: add explicit socket-only hooks to the Ruby fiber scheduler protocol. With a guaranteed-socket contract on the C side we can implement the userspace fast path cleanly: |
| 29 | + |
| 30 | +```c |
| 31 | +// In io-event, when the new hooks land: |
| 32 | +ssize_t IO_Event_Selector_URing_socket_recvmsg(...) { |
| 33 | + ssize_t n = recvmsg(fd, &msg, MSG_DONTWAIT); |
| 34 | + if (n >= 0) return n; |
| 35 | + if (errno != EAGAIN && errno != EWOULDBLOCK) return -errno; |
| 36 | + // Slow path: submit io_uring SQE for recvmsg and yield. |
| 37 | +} |
| 38 | +``` |
| 39 | + |
| 40 | +No `ENOTSOCK` worry, no `fcntl` dance, no per-fd type-caching. The contract is "this fd is a socket" by construction. |
| 41 | + |
| 42 | +This is the cleanest path to the carbon_fiber-style `recvOnce`/`sendOnce` fast path. Carbon Fiber's biggest single win (~half of their `tcp_echo` / `http_server` lead) comes from skipping the io_uring round-trip when data is already buffered, and that's exactly what this enables. |
| 43 | + |
| 44 | +**Effort**: depends on the Ruby side. Once the hooks exist, the io-event change is ~200 lines for `recvmsg` + `sendmsg`. |
| 45 | + |
| 46 | +**Expected gain**: 5–10% on `tcp_echo`, `http_server` once Ruby's socket library routes through the new hooks. |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +### Tier 2 — io-event-only changes, no Ruby-side dependency |
| 51 | + |
| 52 | +#### 2.1 Fiber chaining in `IO_Event_Selector_loop_yield` |
| 53 | + |
| 54 | +**Current behaviour**: every user fiber yield goes `user → loop_fiber`; then `select` calls `IO_Event_Selector_ready_flush` which does `loop_fiber → next_user_fiber`. Two `rb_fiber_transfer` calls per scheduling decision. |
| 55 | + |
| 56 | +**Carbon Fiber's trick**: when a user fiber yields, peek at the ready queue first. If it is non-empty, `rb_fiber_transfer` directly to the head of the queue, skipping the loop-fiber round-trip. One transfer per scheduling decision. |
| 57 | + |
| 58 | +``` |
| 59 | +// Pseudocode for IO_Event_Selector_loop_yield: |
| 60 | +if (backend->ready) { |
| 61 | + pop head; |
| 62 | + return rb_fiber_transfer(head->fiber, 0, NULL); |
| 63 | +} |
| 64 | +return rb_fiber_transfer(backend->loop, 0, NULL); |
| 65 | +``` |
| 66 | + |
| 67 | +**Risk**: correctness around `rb_fiber_raise`. carbon_fiber needs a `blocked_fibers` tracking set to avoid stranding fibers when a raise bypasses the normal park/resume cycle. We already have `IO_Event_Selector_resume`/`_raise` plumbing; we need to audit whether chaining interacts safely with our existing in-flight io_uring completions. |
| 68 | + |
| 69 | +**Effort**: ~50 lines + careful tests around interrupt-during-yield. |
| 70 | + |
| 71 | +**Expected gain**: 5–10% on hot scheduling workloads (`tcp_echo`, `db_query_mix`). |
| 72 | + |
| 73 | +#### 2.2 Opportunistic `io_uring_get_events` CQ peek before yielding to loop_fiber |
| 74 | + |
| 75 | +**Carbon Fiber** (`doTransferToLoop` lines 1122–1135): on io_uring only, when the ready queue is empty but `active_waiters > 0`, peek the completion queue once before transferring to loop_fiber. On loopback workloads (tcp_echo, http) the peer's response has often landed by the time we reach the yield point, so chaining saves a full loop_fiber round-trip. |
| 76 | + |
| 77 | +**Composes with our recent work**: this pairs naturally with the `IORING_SETUP_TASKRUN_FLAG` work we just landed. We already gate `io_uring_get_events()` on `IORING_SQ_TASKRUN`; here the same call is used to absorb a free completion. |
| 78 | + |
| 79 | +``` |
| 80 | +// In loop_yield, before transferring to loop: |
| 81 | +if (selector->pending_io > 0) { |
| 82 | + io_uring_get_events(&selector->ring); // cheap when no work pending |
| 83 | + drain_cqes_into_ready_queue(); |
| 84 | + if (backend->ready) goto chain_to_ready; |
| 85 | +} |
| 86 | +``` |
| 87 | + |
| 88 | +**Effort**: ~30 lines. No Ruby-side change. |
| 89 | + |
| 90 | +**Expected gain**: 3–5% on workloads with localhost/short-RTT I/O. |
| 91 | + |
| 92 | +#### 2.3 Push `fileno` extraction into C in `Async::Scheduler::Selector` |
| 93 | + |
| 94 | +Carbon Fiber's `io_wait_object` / `io_read_object` / `io_write_object` accept the `IO` object directly and call `IO_Event_Selector_io_descriptor` (or equivalent) in C, skipping a `respond_to?(:fileno)` + `fileno` method-send pair per call. |
| 95 | + |
| 96 | +io-event's native methods already take `VALUE io`. The waste is on the Async side: |
| 97 | + |
| 98 | +```ruby |
| 99 | +# async/lib/async/scheduler/selector/uring.rb (or similar) |
| 100 | +def io_read(fiber, io, buffer, length, offset = 0) |
| 101 | + @selector.io_read(fiber, io.fileno, buffer, length, offset) # <- fileno in Ruby |
| 102 | +end |
| 103 | +``` |
| 104 | + |
| 105 | +**Effort**: trivial — drop the `.fileno` and let the C side extract it. |
| 106 | + |
| 107 | +**Expected gain**: 1–3% on read/write-heavy workloads, mostly from skipping the Ruby method call frame. |
| 108 | + |
| 109 | +#### 2.4 Cache "is_socket" bit in the `Descriptor` struct |
| 110 | + |
| 111 | +If we go ahead with a `recv(MSG_DONTWAIT)` fast path inside `io_read` (i.e. without waiting for the new `socket_recvmsg` hook — see 3.1), the cleanest way to handle ENOTSOCK is to cache the verdict per fd. First call to a new fd costs one wasted recv on a non-socket; every subsequent call goes straight to the right syscall. |
| 112 | + |
| 113 | +The URing selector already maintains a per-fd `Descriptor` struct (for close-watch state); add a `bool known_socket; bool known_not_socket;` pair (three valid states: unknown, socket, not-socket; both bits set never occurs). |
| 114 | + |
| 115 | +**Effort**: ~30 lines. |
| 116 | + |
| 117 | +**Expected gain**: makes 3.1 net-positive on mixed socket/pipe workloads. Standalone gain is zero. |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +### Tier 3 — fast path inside existing `io_read` / `io_write` hooks |
| 122 | + |
| 123 | +These are the changes we'd make *without* waiting for new Ruby hooks. They overlap with Tier 1, so revisit after `socket_recvmsg`/`socket_sendmsg` land. |
| 124 | + |
| 125 | +#### 3.1 Userspace `recv(MSG_DONTWAIT)` probe before io_uring submission |
| 126 | + |
| 127 | +**Carbon Fiber's `ioRead`**: try `recv(MSG_DONTWAIT)` first; if `> 0` return immediately, if `ENOTSOCK` fall back to `read(2)`, if `EAGAIN` go to io_uring slow path. |
| 128 | + |
| 129 | +For io-event, the safe order is: |
| 130 | +1. If descriptor known-not-socket: skip to `read(2)` probe. |
| 131 | +2. Else `recv(MSG_DONTWAIT)`: |
| 132 | + - `> 0`: return (optionally `drainRecv` more — see 3.3). |
| 133 | + - `0`: EOF, return 0. |
| 134 | + - `ENOTSOCK`: mark known-not-socket, fall through to `read(2)`. |
| 135 | + - `EAGAIN`/`EWOULDBLOCK`: fall through to io_uring slow path. |
| 136 | + - Other errno: return as error. |
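The probe order above can be sketched in plain POSIX C as follows. This is a minimal illustration, not io-event's code: `fd_kind`, `probe_result` and `probe_read` are hypothetical names, the tri-state stands in for the cached bit from 2.4, and the `read(2)` branch assumes the descriptor is already `O_NONBLOCK` (otherwise it would block).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for illustration; not io-event's actual types. */
enum fd_kind { FD_UNKNOWN = 0, FD_SOCKET, FD_NOT_SOCKET };

enum probe_result {
	PROBE_DONE,      /* *out holds the byte count (0 means EOF) */
	PROBE_SLOW_PATH, /* nothing buffered: submit to io_uring and yield */
	PROBE_ERROR      /* *out holds -errno */
};

static enum probe_result probe_read(int fd, enum fd_kind *kind,
                                    void *buf, size_t len, ssize_t *out)
{
	assert(kind != NULL && out != NULL);

	if (*kind != FD_NOT_SOCKET) {
		ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
		if (n >= 0) {
			*kind = FD_SOCKET;
			*out = n;
			return PROBE_DONE;
		}
		if (errno == EAGAIN || errno == EWOULDBLOCK) {
			*kind = FD_SOCKET; /* ENOTSOCK would have come back otherwise */
			return PROBE_SLOW_PATH;
		}
		if (errno != ENOTSOCK) {
			*out = -errno;
			return PROBE_ERROR;
		}
		*kind = FD_NOT_SOCKET; /* cache the verdict, fall through to read(2) */
	}

	/* Non-socket probe. Assumes the fd is O_NONBLOCK; a blocking fd would stall here. */
	ssize_t n = read(fd, buf, len);
	if (n >= 0) {
		*out = n;
		return PROBE_DONE;
	}
	if (errno == EAGAIN || errno == EWOULDBLOCK) return PROBE_SLOW_PATH;
	*out = -errno;
	return PROBE_ERROR;
}
```

Only the first call on a non-socket pays for the wasted `recv`; later calls skip straight to `read(2)` because the verdict is cached in `*kind`.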
| 137 | + |
| 138 | +**Risk**: changes observable behaviour (different syscalls show up in strace/audit/seccomp). For mainstream callers it's invisible. |
| 139 | + |
| 140 | +**Effort**: ~80 lines in `uring.c`'s `IO_Event_Selector_URing_io_read` (and symmetric for `io_write`). |
| 141 | + |
| 142 | +**Expected gain**: 8–15% on `tcp_echo`, `http_server`. This is the single biggest win identified. |
| 143 | + |
| 144 | +**Supersedes**: the existing `length == 0` (readpartial) `fcntl` dance — that path becomes one wasted recv instead of two fcntl calls on a non-socket, and zero overhead on a socket. |
| 145 | + |
| 146 | +#### 3.2 Adaptive per-fd probe-skip |
| 147 | + |
| 148 | +After 3.1 is in: track consecutive `EAGAIN` returns per-fd (indexed by `fd & 0xFF` in carbon_fiber, or stored in our `Descriptor` struct). After N (carbon_fiber uses 3) consecutive misses, skip the userspace probe and go straight to io_uring for that slot. Reset on a successful hit. |
| 149 | + |
| 150 | +Without this, long-polling workloads (websocket idle, long-lived keep-alive) pay one wasted `recv` per `io_read`. With it, the probe self-tunes. |
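A minimal sketch of the miss counter, assuming it lives in a small per-descriptor struct. `probe_state` and the three helpers are hypothetical names, and the limit of 3 is the carbon_fiber value quoted above. One thing the sketch leaves open: once the probe is being skipped, no probe can "hit", so the caller has to decide which event resets the counter (for example, an io_uring completion that returns data immediately).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PROBE_MISS_LIMIT 3 /* N from the notes above (carbon_fiber's value) */

/* Hypothetical per-descriptor state; would sit inside the Descriptor struct. */
struct probe_state {
	unsigned char consecutive_misses;
};

/* Should the next io_read try the userspace probe at all? */
static bool probe_should_try(const struct probe_state *s)
{
	assert(s != NULL);
	return s->consecutive_misses < PROBE_MISS_LIMIT;
}

/* Call when the probe returned EAGAIN. Saturates at the limit. */
static void probe_record_miss(struct probe_state *s)
{
	assert(s != NULL);
	if (s->consecutive_misses < PROBE_MISS_LIMIT) s->consecutive_misses++;
}

/* Call on whatever the caller counts as a hit; re-enables probing. */
static void probe_record_hit(struct probe_state *s)
{
	assert(s != NULL);
	s->consecutive_misses = 0;
}
```

Storing the counter per descriptor avoids the aliasing that a `fd & 0xFF` table has when two live fds share a slot, at the cost of one byte per `Descriptor`.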
| 151 | + |
| 152 | +**Effort**: ~20 lines on top of 3.1. |
| 153 | + |
| 154 | +**Expected gain**: prevents 3.1 from regressing on long-poll workloads. Standalone: zero. |
| 155 | + |
| 156 | +#### 3.3 `drainRecv` after a successful first recv |
| 157 | + |
| 158 | +After `recvOnce` returns N bytes, opportunistically try one more `recv(MSG_DONTWAIT)` to drain any remaining kernel buffer in the same `io_read` call. Skipped for `length == 0` (readpartial) since the caller only wants what's immediately available. |
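As a rough illustration of the drain step, here is a sketch in plain POSIX C. `recv_with_drain` is a hypothetical helper, not a carbon_fiber or io-event function, and the `length == 0` exclusion is not modelled; the caller would simply not use this helper in that case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: one non-blocking recv, then at most one more
 * to pick up bytes that arrived as a separate segment. */
static ssize_t recv_with_drain(int fd, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
	/* Error, EOF, or a full buffer: nothing to drain. */
	if (n <= 0 || (size_t)n == len) return n;

	ssize_t more = recv(fd, buf + n, len - (size_t)n, MSG_DONTWAIT);
	/* EAGAIN, EOF and errors on the second call are ignored here;
	 * the bytes already read are returned either way. */
	if (more > 0) n += more;
	return n;
}
```

Ignoring the second call's result is a simplification: a real implementation would need to decide how to surface an error (such as a connection reset) that the drain attempt consumed.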
| 159 | + |
| 160 | +**Effort**: ~10 lines. |
| 161 | + |
| 162 | +**Expected gain**: 1–2% on streaming workloads where messages span multiple TCP segments. |
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +### Tier 4 — smaller, more situational wins |
| 167 | + |
| 168 | +#### 4.1 Combined R|W `io_wait` → poll-writable-only |
| 169 | + |
| 170 | +Carbon Fiber: when `io_wait(events = READABLE | WRITABLE)` is called (common for `connect()` completion), only register a writable poll — writability implies the connect finished, return both flags. Saves one SQE. |
| 171 | + |
| 172 | +**Effort**: trivial. **Expected gain**: marginal except for connect-heavy workloads. |
| 173 | + |
| 174 | +#### 4.2 Skip `MSG_PEEK` readiness probe in `io_wait` |
| 175 | + |
| 176 | +When Ruby calls `io_wait`, the calling code has already seen EAGAIN, so a `MSG_PEEK` "is it ready?" probe is guaranteed to fail. io-event doesn't do this peek today, but it is worth noting that we shouldn't add it. |
| 177 | + |
| 178 | +#### 4.3 Near-deadline busy-spin for sub-3ms timers |
| 179 | + |
| 180 | +Carbon Fiber's `SPIN_THRESHOLD = 3 ms`: for timers that close to firing, busy-spin instead of releasing GVL + re-arming a kernel timer. Specific to GVL release cost. |
| 181 | + |
| 182 | +**Effort**: ~30 lines. **Expected gain**: ~2% on `cascading_timeout`. Probably not worth the complexity for that workload alone. |
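For reference, the near-deadline decision can be sketched as below. This is an assumption-laden illustration: `now_ns` and `spin_if_near` are hypothetical names, the 3 ms threshold is the carbon_fiber value quoted above, and the GVL handling that motivates the trick is not modelled at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define SPIN_THRESHOLD_NS (3 * 1000 * 1000LL) /* 3 ms, as in carbon_fiber */

/* Monotonic clock in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Returns true after busy-waiting until the deadline when it is within
 * the threshold; returns false immediately when the caller should arm
 * a kernel timer instead. */
static bool spin_if_near(int64_t deadline_ns)
{
	assert(deadline_ns >= 0); /* monotonic timestamps are non-negative */

	if (deadline_ns - now_ns() > SPIN_THRESHOLD_NS) return false;
	while (now_ns() < deadline_ns) {
		/* busy-wait: burns CPU for at most SPIN_THRESHOLD_NS */
	}
	return true;
}
```

The spin trades up to 3 ms of CPU per timer for avoiding a timer re-arm, which is part of why the gain is limited to timer-dense workloads such as `cascading_timeout`.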
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +### Tier 5 — DO NOT COPY |
| 187 | + |
| 188 | +#### 5.1 Net::HTTP keep-alive `wait_readable(0)` short-circuit |
| 189 | + |
| 190 | +Carbon Fiber's `Scheduler#poll_io_now` returns `false` unconditionally for `BasicSocket.wait_readable(0)`. This is a deliberate Net::HTTP keep-alive optimisation — skips the MSG_PEEK staleness probe Net::HTTP performs before every request. Comment in the source acknowledges it costs one extra request's latency on a genuinely stale connection. |
| 191 | + |
| 192 | +This accounts for some chunk of their `http_client_api` win. It's *functionally* correct on the happy path but it's a benchmark-tuned shortcut that would be wrong for io-event to copy at the framework level — too aggressive for a general-purpose library. |
| 193 | + |
| 194 | +--- |
| 195 | + |
| 196 | +## Suggested order |
| 197 | + |
| 198 | +1. **Land [`socket_recvmsg`/`socket_sendmsg`](https://bugs.ruby-lang.org/) hooks in Ruby + io-event** (1.1). Cleanest foundation, no compromises. |
| 199 | +2. **Fiber chaining** (2.1) — biggest scheduler-side win that's independent of the hook work. Needs careful tests. |
| 200 | +3. **CQ peek before loop yield** (2.2) — composes with (1.1) and (2.1), takes few lines of code, and fits naturally into the post-PR-#166 code path. |
| 201 | +4. **`fileno` extraction in C** in Async's adapter (2.3) — trivial change, may as well land alongside. |
| 202 | +5. **Userspace fast path inside `io_read`/`io_write`** (3.1 + 3.2 + 2.4) — only if (1.1) is delayed. Otherwise prefer the new hook path. |
| 203 | +6. Re-benchmark on hana + c7a.2xlarge after each step. |
| 204 | + |
| 205 | +## Open questions |
| 206 | + |
| 207 | +- Status of `socket_recvmsg`/`socket_sendmsg` hooks upstream — do they need to be designed, or are they already proposed? |
| 208 | +- Does Ruby's `BasicSocket` need to opt in, or do we get coverage automatically through `Socket#recvmsg` / `#sendmsg`? |
| 209 | +- For (2.1), is there an existing test that exercises `Fiber#raise` against a fiber suspended via `io_wait`? If not, we should write one before changing `loop_yield`. |