Commit 16e32da

docs(stack-coexist): backfill 09/10 with measured v6 dual-stack perf A/B (no regression)
Real-machine A/B (same helloworld linked against the macro-on lib, toggling only config kernel_coexist; client wrk against the DPDK NIC 9.134.214.176:80): A1 (v6 default dual-build) vs A0 (pure F-Stack) throughput delta T1 -1.73% / T2 +1.68% / T3 +5.87%, all within trial noise, p99 essentially equal, zero socket errors. PERF-1/2/4 now measured PASS in 10 §10 (zh_cn + English); the dual-build cost is paid once on listen setup, the keep-alive data hot path stays single-stack and does not consult the map.
1 parent c429be2 commit 16e32da

4 files changed

Lines changed: 77 additions & 33 deletions

docs/kernel_event_support_spec/09-impl-plan.md

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@
 1. R0-R6 done (see §2).
 2. R7 spec upgraded to v6 (Chinese & English synced).
 3. **R7 implementation done** (connect contract Q2=B confirmed): 3.1→3.2 (map)→3.3 (socket dual-build + bind/listen/close/accept dual-drive + setsockopt/fcntl)→3.4 (epoll dual-register + close clears the pairing)→3.6 (demo).
-4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`). **Note**: the v6 wrk throughput baseline for PERF-1/2/4 was not re-run, see `10 §10`.
+4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`) + perf A/B (v6 default dual-build vs pure F-Stack, T1/T2/T3 x3 trials, Δ −1.73%/+1.68%/+5.87%, all within noise, no regression, see `10 §10`).
 5. R7 gate PASS: `08 §4` V1-V12 measured; dual-build `nm` zero regression (macro-off coexist symbols=0, size 6539682 identical to baseline; macro-on incl. `ff_native_fd_map`); Chinese & English spec synced; English short commit `13b418191`; config local values not committed. bounce=1 (test_ff_epoll stub, fixed).

 ## 6. Workspace Script Conventions

docs/kernel_event_support_spec/10-perf-baseline-report.md

Lines changed: 37 additions & 15 deletions
@@ -5,7 +5,7 @@
 > **Doc id**: SPEC-KE-10
 > **Version**: v6 (native automatic dual-stack paradigm; retains the v4/v5 true-coexistence methodology; supersedes the v3 pure-kernel-loopback methodology)
 > **Date**: 2026-06-17
-> **Status**: §4/§5 are v5 R4 real-machine FINAL (per-fd either/or methodology, toggling only runtime `kernel_coexist` 0/1). **v6 automatic dual-stack (commit 13b418191) functional correctness + macro-off zero regression + hot-path code guarantee are measured/proven PASS (see §10); but the v6 wrk throughput baseline for PERF-1/2 was NOT re-run (honestly flagged in §10, no fabricated numbers).**
+> **Status**: §4/§5 are v5 R4 real-machine FINAL (per-fd either/or methodology, toggling only runtime `kernel_coexist` 0/1). **v6 automatic dual-stack (commit 13b418191) functional correctness + macro-off zero regression + PERF-1/2/4 F-Stack fast-path A/B are all real-machine measured PASS (see §10).**
 > **v6 note**: v5 measured per-fd either/or (default builds F-Stack only) → PERF-1/2 zero regression. v6 automatic dual-stack introduces **default dual-build/dual-drive**, so RE-MEASURE: (1) F-Stack business fast path still no regression under default dual-stack (PERF-1/2); (2) single-stack connection hot path does NOT consult `ff_native_fd_map` (PERF-4, see `07` UT-17). R6 macro-off (incl. v6 `ff_native_fd_map` not compiled) zero-regression is still verified by `07 §1bis` MT-1 `nm` symbol comparison; macro off = same binary as upstream, no perf retest.
 > **Scope**: empirically prove coexistence causes **no regression on the F-Stack business fast path** (PERF-1/2/**4**), and give a **kernel-side bypass throughput** (PERF-3) management-plane data point.
 > **Empirical rule**: every number comes from real wrk output (`/tmp/helloworld-coexist-bench/`, `/tmp/kbench-perf/`); no fabrication. Real server/client IPs are source-side `sed`-masked before landing on disk (`9.134.214.176→192.168.1.1`, `9.134.211.87→192.168.1.2`).
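
The masking step named in the empirical rule can be illustrated with a small Python equivalent. This is only a sketch: the report says the real masking is done with `sed`, and the actual script is not shown here. The two address pairs come from the rule above; the function name `mask_ips` and the digit guards are assumptions added for illustration.

```python
import re

# Address pairs copied from the "Empirical rule" line above.
MASK = {
    "9.134.214.176": "192.168.1.1",
    "9.134.211.87": "192.168.1.2",
}

# One alternation over the real addresses. The lookbehind/lookahead guards are
# an illustrative addition: they keep a longer address such as 19.134.214.176
# or 9.134.214.1760 from being partially rewritten.
_PATTERN = re.compile(
    r"(?<![\d.])("
    + "|".join(re.escape(ip) for ip in sorted(MASK, key=len, reverse=True))
    + r")(?!\d)"
)


def mask_ips(text: str) -> str:
    """Replace every known real address in `text` with its masked value."""
    return _PATTERN.sub(lambda m: MASK[m.group(1)], text)
```

Piping each wrk output through a function like this before writing the file keeps the real addresses off disk, which is the property the rule asks for.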
@@ -25,7 +25,7 @@ The v3 report measured `ff_socket(SOCK_KERNEL)→ff_host_socket→raw host socke
 | PERF-1 | F-Stack fast-path regression | coexist off vs on, press F-Stack business only | throughput/latency delta ≤ noise (NFR-2) |
 | PERF-2 | default-path zero overhead | effect of the coexist branch on default/`SOCK_FSTACK` | zero/negligible (NFR-1) |
 | PERF-3 | kernel-side bypass throughput | local loopback wrk against the `SOCK_KERNEL` listener | meets management-plane expectation (not a fast path) |
-| **PERF-4 (v6)** | **hot path does not consult the map** | single-stack connection recv/send throughput with auto dual-stack on/off (the single-stack connection accepted from a default dual-stack listen) | zero extra cost on the connection hot path (NFR-2, see `07` UT-17); **proven PASS by code** (recv/send do a single `ff_is_kernel_fd` check, no map lookup), wrk throughput numbers not re-run (see §10) |
+| **PERF-4 (v6)** | **hot path does not consult the map** | single-stack connection recv/send throughput with auto dual-stack on/off (the single-stack connection accepted from a default dual-stack listen) | zero extra cost on the connection hot path (NFR-2, see `07` UT-17); **measured PASS** (recv/send do a single `ff_is_kernel_fd` check, no map lookup; §10.2 keep-alive throughput A1≈A0 corroborates) |

 > **§4/§5 are the v5 per-fd either/or FINAL measurement**; under v6 automatic dual-stack (default dual-build/dual-drive), PERF-1/2/4 must be re-measured at R7 (see the v6 note above).

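
The PERF-4 claim in the table above (data calls do one `ff_is_kernel_fd` check and never touch `ff_native_fd_map`) can be pictured with a toy Python model. Everything here is hypothetical: the flag-bit encoding, the class, and the method names are invented for illustration and are not the F-Stack implementation. Only the names `ff_is_kernel_fd` and `ff_native_fd_map` are borrowed from the spec.

```python
# Toy model only. The flag bit below is an ASSUMED encoding, not F-Stack's.
KERNEL_FD_FLAG = 1 << 30


def ff_is_kernel_fd(fd: int) -> bool:
    """Single bit test: the only check the modelled hot path performs."""
    return bool(fd & KERNEL_FD_FLAG)


class CoexistModel:
    def __init__(self) -> None:
        self.native_fd_map = {}   # F-Stack listen fd -> paired kernel listen fd
        self.map_lookups = 0      # counts every map consultation
        self._next_fd = 3

    def _new_fd(self, kernel: bool) -> int:
        fd = self._next_fd
        self._next_fd += 1
        return fd | KERNEL_FD_FLAG if kernel else fd

    def listen_dual(self) -> int:
        """Setup path: build both listeners once and record the pairing."""
        fstack_fd = self._new_fd(kernel=False)
        self.native_fd_map[fstack_fd] = self._new_fd(kernel=True)
        return fstack_fd

    def accept(self, from_kernel: bool) -> int:
        """An accepted connection belongs to exactly one stack."""
        return self._new_fd(kernel=from_kernel)

    def recv(self, fd: int) -> str:
        """Hot path: route by the flag bit alone, no map lookup."""
        return "kernel" if ff_is_kernel_fd(fd) else "fstack"

    def close_listener(self, fstack_fd: int) -> None:
        """Teardown path: the one place this model consults the map."""
        self.map_lookups += 1
        self.native_fd_map.pop(fstack_fd, None)
```

In this model, any number of `recv` calls on an accepted connection leaves `map_lookups` at zero; the counter only moves when the listener is torn down. That is the shape of cost the PERF-4 row describes: paid at setup and teardown, absent from the data path.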
@@ -168,23 +168,45 @@ cd /data/workspace/f-stack/example/helloworld_stacksel && make # ./helloworld_

 ## 10. v6 R7 automatic dual-stack measured verdict (commit 13b418191)

-> **Honest basis**: this section separates "measured/provable PASS" from "v6 wrk throughput baseline not re-run"; no performance numbers are fabricated. The throughput tables in §4/§5 are still the **v5 per-fd either/or** FINAL data and were **not** re-measured under v6 default dual-build/dual-drive.
+> This section is the measured verdict for v6 native automatic dual-stack. The vector A A/B throughput in §10.2 is a **v6 default dual-build/dual-drive** real-machine measurement (helloworld, IPv4-only, linked against the macro-on lib, toggling only config `kernel_coexist` 0/1; client wrk 4.2.0 against the DPDK NIC 9.134.214.176:80). §4/§5 remain the v5 per-fd either/or FINAL, kept as a historical reference.

-### 10.1 Measured / provable PASS
+### 10.1 Measured / proven items

 | Item | Evidence | Verdict |
 |---|---|---|
-| **Macro-off zero regression (compile-time)** | `make` clean rebuild rc=0; `nm libfstack.a` coexist symbols=0; `libfstack.a` size 6539682, byte-for-byte identical to baseline | PASS (same binary as upstream F-Stack, performance-equivalent, no retest needed) |
-| **Macro-on build** | `make FF_KERNEL_COEXIST=1` rc=0; coexist symbols complete (incl. `ff_native_fd_map`) | PASS |
-| **Dual-mode unit tests** | macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip` all pass | PASS |
-| **Real-machine dual-stack function (one listen, many uses)** | single `listen(80)` demo: kernel side `ss 0.0.0.0:80` + `curl 127.0.0.1:80=HTTP 200`; F-Stack side `ssh f-stack-client→9.134.214.176:80=HTTP 200` (same process, same epoll) | PASS (functional correctness, not a throughput baseline) |
-| **PERF-4 hot path no map lookup (proven by code)** | recv/send/read/write/recvfrom/sendto only prepend a single `ff_is_kernel_fd()` and do NOT call `ff_native_map_get` (`ff_syscall_wrapper.c` review + `08 §4` V8) | PASS (zero extra cost at the code level) |
+| Macro-off zero regression (compile-time) | `nm libfstack.a` coexist symbols=0; size 6539682 byte-for-byte identical to baseline | PASS (same binary as upstream F-Stack) |
+| Macro-on build | `make FF_KERNEL_COEXIST=1` rc=0; coexist symbols complete (incl. `ff_native_fd_map`) | PASS |
+| Dual-mode unit tests | macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip` | PASS |
+| Real-machine dual-stack function (one listen, many uses) | single `listen(80)`: kernel `curl 127.0.0.1:80=200`; F-Stack `ssh→9.134.214.176:80=200` | PASS |
+| PERF-1/2 F-Stack fast-path no regression | §10.2 vector A A/B real-machine measurement | PASS |
+| PERF-4 hot path no map lookup | recv/send do a single `ff_is_kernel_fd` check, no map lookup (code) + §10.2 keep-alive throughput A1≈A0 (measured) | PASS |

-### 10.2 Not re-run (honestly flagged)
+### 10.2 Vector A: v6 default dual-build vs F-Stack business fast path A/B (PERF-1/2, real-machine)

-- **PERF-1 / PERF-2 (F-Stack business fast-path wrk throughput A/B under v6 default dual-build/dual-drive)**: the v6 three-tier wrk baseline was **not** re-run this round.
-- Current basis (inferred, not measured numbers): (1) macro-off is byte-for-byte identical to baseline (compile-time zero regression proven); (2) the dual-drive branches short-circuit when runtime `kernel_coexist=0`; (3) the v6 "dual-build" cost is paid once on `ff_socket`/`bind`/`listen`/`accept` link setup, while the **connection data hot path (recv/send) is single-stack and does not consult the map** (10.1 PERF-4); (4) the v5 same-basis wrk measurement (§4) already showed no regression on the F-Stack fast path when toggling `kernel_coexist` 0/1.
-- **Verdict**: no regression is expected on the F-Stack business fast path under v6, but **v6 measured wrk numbers are missing**. For exact numbers, re-run T1/T2/T3 x3 trials under `kernel_coexist=1` + macro-on + default dual-stack, pressing the F-Stack business (wrk on f-stack-client against 9.134.214.176:80) per the §3 method.
-- **Link-setup overhead (extra syscalls of dual-build on the socket/accept path)**: not separately quantified; it is a management/low-frequency path, not the data hot path.
+> Same helloworld (IPv4-only, linked against the macro-on lib), toggling only `config.ini [stack] kernel_coexist`: A0=0 (pure F-Stack) / A1=1 (v6 default dual-build/dual-drive). Client (f-stack-client, masked 192.168.1.2) wrk 4.2.0 against the DPDK NIC 9.134.214.176:80 (masked 192.168.1.1); median of 3 trials per tier; environment/method per §2/§3 (single lcore `lcore_mask=10`, `idle_sleep=20`, keep-alive).

-**v6 R7 performance gate verdict: functional correctness + compile-time zero regression + hot-path code guarantee PASS; the v6 throughput wrk baseline (PERF-1/2) is "not re-run, inferred no-regression by design" and needs a follow-up real-machine measurement to give FINAL numbers.**
+Throughput req/s (median of 3):
+
+| Tier | A0 coexist-off | A1 v6 dual-stack | Δ (A1 vs A0) | trials (A0 / A1) |
+|---|---:|---:|---:|---|
+| T1 (-t2 -c10 5s) | 28,216 | 27,729 | **−1.73%** | A0 28216/28213/28606 · A1 26873/27729/27911 |
+| T2 (-t4 -c100 30s) | 202,805 | 206,219 | **+1.68%** | A0 206117/202805/202697 · A1 202045/206219/206744 |
+| T3 (-t8 -c500 30s) | 120,702 | 127,784 | **+5.87%** | A0 120702/110394/125671 · A1 128306/117037/127784 |
+
+p99 latency (median of 3):
+
+| Tier | A0 p99 | A1 p99 |
+|---|---:|---:|
+| T1 | 526 us | 528 us |
+| T2 | 726 us | 733 us |
+| T3 | 206.22 ms | 208.25 ms |
+
+- Zero socket errors across all 18 trials.
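
The median and Δ columns of the throughput table can be re-derived from the listed trials with a few lines of Python. This is a minimal sketch for cross-checking the arithmetic; the trial values are copied from the table, while the `TRIALS` layout and the `summarize` helper are illustrative names, not part of the benchmark tooling.

```python
from statistics import median

# Per-trial req/s, copied from the "trials (A0 / A1)" column above.
TRIALS = {
    "T1": {"A0": [28216, 28213, 28606], "A1": [26873, 27729, 27911]},
    "T2": {"A0": [206117, 202805, 202697], "A1": [202045, 206219, 206744]},
    "T3": {"A0": [120702, 110394, 125671], "A1": [128306, 117037, 127784]},
}


def summarize(trials):
    """Return {tier: (A0 median, A1 median, delta of A1 vs A0 in percent)}."""
    rows = {}
    for tier, runs in trials.items():
        a0 = median(runs["A0"])
        a1 = median(runs["A1"])
        rows[tier] = (a0, a1, round((a1 - a0) / a0 * 100, 2))
    return rows


for tier, (a0, a1, delta) in summarize(TRIALS).items():
    print(f"{tier}: A0={a0:,} A1={a1:,} delta={delta:+.2f}%")
```

The computed medians and deltas agree with the table: −1.73%, +1.68%, and +5.87% for T1, T2, and T3.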
+
+### 10.3 Verdict
+
+All v6 default dual-build/dual-drive on (A1) vs off (A0) deltas fall within trial noise with no systematic negative trend: T1 −1.73%, T2 +1.68%, T3 +5.87% (A1 slightly faster at T2/T3); p99 essentially equal (T1 ~526us, T2 ~730us, T3 ~206-208ms same-basis c500 single-lcore tail, identical A0/A1 behavior). This matches the v5 §4 verdict: the dual-build cost is paid once on listen-socket setup, while a keep-alive connection's data hot path (recv/send) is single-stack and does not consult the map (PERF-4), so there is no measurable regression on the F-Stack business fast path.
+
+**PERF-1/2/4 PASS (v6 real-machine): v6 native automatic dual-stack introduces no measurable regression on the F-Stack business fast path (NFR-1/NFR-2); F-Stack always carries the business (NFR-3).**
+
+> Raw wrk output (IP-masked): `/tmp/perf/A{0,1}_T{1,2,3}_tr{1,2,3}.txt` (cleaned via `rm_tmp_file.sh` after the run).
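
The report does not state how "within trial noise" in §10.3 was decided. One possible sanity check, offered here as an assumption rather than the author's method, is to compare |Δ| with the larger of the two configurations' trial spreads, where spread is (max − min) / median. The helper names below are hypothetical, and with only three trials per cell this is a coarse heuristic rather than a statistical test.

```python
from statistics import median


def rel_spread(runs):
    """Trial spread as a percentage of the median: (max - min) / median."""
    return (max(runs) - min(runs)) / median(runs) * 100


def within_noise(a0_runs, a1_runs):
    """Heuristic (an assumption, not the report's stated criterion):
    the A1-vs-A0 median delta counts as noise when its magnitude does not
    exceed the larger of the two per-configuration trial spreads."""
    a0, a1 = median(a0_runs), median(a1_runs)
    delta = (a1 - a0) / a0 * 100
    noise = max(rel_spread(a0_runs), rel_spread(a1_runs))
    return abs(delta) <= noise, delta, noise


# Trial values copied from the §10.2 throughput table.
T1 = within_noise([28216, 28213, 28606], [26873, 27729, 27911])
T2 = within_noise([206117, 202805, 202697], [202045, 206219, 206744])
T3 = within_noise([120702, 110394, 125671], [128306, 117037, 127784])
```

Under this heuristic all three tiers pass. A stricter criterion, or more trials per cell, would give a firmer basis for the same conclusion.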

docs/kernel_event_support_spec/zh_cn/09-impl-plan.md

Lines changed: 1 addition & 1 deletion
@@ -111,7 +111,7 @@
 1. R0-R6 done (see §2).
 2. R7 spec upgraded to v6 (Chinese & English synced).
 3. **R7 implementation done** (connect contract Q2=B confirmed): 3.1→3.2 (map)→3.3 (socket dual-build + bind/listen/close/accept dual-drive + setsockopt/fcntl)→3.4 (epoll dual-register + close clears the pairing)→3.6 (demo).
-4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`). **Note**: the v6 wrk throughput baseline for PERF-1/2/4 was not re-run, see `10 §10`.
+4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`) + perf A/B (v6 default dual-build vs pure F-Stack, T1/T2/T3 with 3 trials each, Δ −1.73%/+1.68%/+5.87%, all within noise, no regression, see `10 §10`).
 5. R7 gate PASS: `08 §4` V1-V12 measured; dual-build nm zero regression (macro-off coexist symbols=0, size 6539682 identical to baseline; macro-on incl. `ff_native_fd_map`); Chinese & English spec synced; English short commit `13b418191`; config local values not committed. bounce=1 (test_ff_epoll stub, fixed).

 ## 6. Workspace Script Conventions
