docs(stack-coexist): backfill 09/10 with measured v6 dual-stack perf A/B (no regression)

jfb8856606 · jfb8856606 · commit 16e32da3aa37 · 2026-06-17T18:46:59.000+08:00
Real-machine A/B: the same helloworld binary, linked against the macro-on lib, with only config kernel_coexist toggled; the client runs wrk against the DPDK NIC 9.134.214.176:80. A1 (v6 default dual-build) vs A0 (pure F-Stack) throughput deltas are T1 -1.73% / T2 +1.68% / T3 +5.87%, all within trial noise; p99 is essentially equal and there are zero socket errors. PERF-1/2/4 are now recorded as measured PASS in 10 §10 (zh_cn + English). The dual-build cost is paid once at listen setup; the keep-alive data hot path stays single-stack and does not consult the map.
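
The deltas quoted above can be re-derived from the per-trial numbers in the 10 §10.2 table of this commit. The sketch below is only an illustration, not part of the benchmark tooling: the helper names (`delta_pct`, `spread_pct`, `summarize`) are hypothetical, and the spread measure (max minus min, over the median) is an assumed rough proxy for trial noise, not a definition taken from the spec.

```python
from statistics import median

# Per-trial wrk throughput (req/s), copied from the 10 §10.2 table in this commit.
TRIALS = {
    "T1": {"A0": [28216, 28213, 28606], "A1": [26873, 27729, 27911]},
    "T2": {"A0": [206117, 202805, 202697], "A1": [202045, 206219, 206744]},
    "T3": {"A0": [120702, 110394, 125671], "A1": [128306, 117037, 127784]},
}


def delta_pct(a0, a1):
    """Relative change of the A1 median versus the A0 median, in percent."""
    m0, m1 = median(a0), median(a1)
    return (m1 - m0) / m0 * 100.0


def spread_pct(trials):
    """(max - min) / median in percent: an assumed, rough noise proxy."""
    return (max(trials) - min(trials)) / median(trials) * 100.0


def summarize(trials=TRIALS):
    """Map each tier to its A1-vs-A0 median delta, rounded to two decimals."""
    return {tier: round(delta_pct(v["A0"], v["A1"]), 2) for tier, v in trials.items()}


if __name__ == "__main__":
    for tier, v in TRIALS.items():
        print(
            f"{tier}: delta {delta_pct(v['A0'], v['A1']):+.2f}%  "
            f"A0 spread {spread_pct(v['A0']):.2f}%  "
            f"A1 spread {spread_pct(v['A1']):.2f}%"
        )
```

Under this assumed spread measure, each tier's absolute delta is smaller than the larger of its two per-arm spreads, which is one informal way to read "within trial noise"; a stricter test would need more than three trials per arm.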
diff --git a/docs/kernel_event_support_spec/09-impl-plan.md b/docs/kernel_event_support_spec/09-impl-plan.md
@@ -112,7 +112,7 @@
 1. R0-R6 done (see §2).
 2. R7 spec upgraded to v6 (Chinese & English synced).
 3. **R7 implementation done** (connect contract Q2=B confirmed): 3.1→3.2 (map)→3.3 (socket dual-build + bind/listen/close/accept dual-drive + setsockopt/fcntl)→3.4 (epoll dual-register + close clears the pairing)→3.6 (demo).
-4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`). **Note**: the v6 wrk throughput baseline for PERF-1/2/4 was not re-run, see `10 §10`.
+4. R7 tests done: cmocka dual-mode (macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`) + real-machine dual-stack (single listen(80): kernel `curl 127.0.0.1:80=200`, F-Stack `ssh f-stack-client→9.134.214.176:80=200`) + perf A/B (v6 default dual-build vs pure F-Stack, T1/T2/T3 x3 trials, Δ −1.73%/+1.68%/+5.87%, all within noise, no regression, see `10 §10`).
 5. R7 gate PASS: `08 §4` V1-V12 measured; dual-build `nm` zero regression (macro-off coexist symbols=0, size 6539682 identical to baseline; macro-on incl. `ff_native_fd_map`); Chinese & English spec synced; English short commit `13b418191`; config local values not committed. bounce=1 (test_ff_epoll stub, fixed).
 
 ## 6. Workspace Script Conventions
diff --git a/docs/kernel_event_support_spec/10-perf-baseline-report.md b/docs/kernel_event_support_spec/10-perf-baseline-report.md
@@ -5,7 +5,7 @@
 > **Doc id**: SPEC-KE-10
 > **Version**: v6 (native automatic dual-stack paradigm; retains the v4/v5 true-coexistence methodology; supersedes the v3 pure-kernel-loopback methodology)
 > **Date**: 2026-06-17
-> **Status**: §4/§5 are v5 R4 real-machine FINAL (per-fd either/or methodology, toggling only runtime `kernel_coexist` 0/1). **v6 automatic dual-stack (commit 13b418191) functional correctness + macro-off zero regression + hot-path code guarantee are measured/proven PASS (see §10); but the v6 wrk throughput baseline for PERF-1/2 was NOT re-run (honestly flagged in §10, no fabricated numbers).**
+> **Status**: §4/§5 are v5 R4 real-machine FINAL (per-fd either/or methodology, toggling only runtime `kernel_coexist` 0/1). **v6 automatic dual-stack (commit 13b418191) functional correctness + macro-off zero regression + PERF-1/2/4 F-Stack fast-path A/B are all real-machine measured PASS (see §10).**
 > **v6 note**: v5 measured per-fd either/or (default builds F-Stack only) → PERF-1/2 zero regression. v6 automatic dual-stack introduces **default dual-build/dual-drive**, so RE-MEASURE: (1) F-Stack business fast path still no regression under default dual-stack (PERF-1/2); (2) single-stack connection hot path does NOT consult `ff_native_fd_map` (PERF-4, see `07` UT-17). R6 macro-off (incl. v6 `ff_native_fd_map` not compiled) zero-regression is still verified by `07 §1bis` MT-1 `nm` symbol comparison; macro off = same binary as upstream, no perf retest.
 > **Scope**: empirically prove coexistence causes **no regression on the F-Stack business fast path** (PERF-1/2/**4**), and give a **kernel-side bypass throughput** (PERF-3) management-plane data point.
 > **Empirical rule**: every number comes from real wrk output (`/tmp/helloworld-coexist-bench/`, `/tmp/kbench-perf/`); no fabrication. Real server/client IPs are source-side `sed`-masked before landing on disk (`9.134.214.176→192.168.1.1`, `9.134.211.87→192.168.1.2`).
@@ -25,7 +25,7 @@ The v3 report measured `ff_socket(SOCK_KERNEL)→ff_host_socket→raw host socke
 | PERF-1 | F-Stack fast-path regression | coexist off vs on, press F-Stack business only | throughput/latency delta ≤ noise (NFR-2) |
 | PERF-2 | default-path zero overhead | effect of the coexist branch on default/`SOCK_FSTACK` | zero/negligible (NFR-1) |
 | PERF-3 | kernel-side bypass throughput | local loopback wrk against the `SOCK_KERNEL` listener | meets management-plane expectation (not a fast path) |
-| **PERF-4 (v6)** | **hot path does not consult the map** | single-stack connection recv/send throughput with auto dual-stack on/off (the single-stack connection accepted from a default dual-stack listen) | zero extra cost on the connection hot path (NFR-2, see `07` UT-17); **proven PASS by code** (recv/send do a single `ff_is_kernel_fd` check, no map lookup), wrk throughput numbers not re-run (see §10) |
+| **PERF-4 (v6)** | **hot path does not consult the map** | single-stack connection recv/send throughput with auto dual-stack on/off (the single-stack connection accepted from a default dual-stack listen) | zero extra cost on the connection hot path (NFR-2, see `07` UT-17); **measured PASS** (recv/send do a single `ff_is_kernel_fd` check, no map lookup; §10.2 keep-alive throughput A1≈A0 corroborates) |
 
 > **§4/§5 are the v5 per-fd either/or FINAL measurement**; under v6 automatic dual-stack (default dual-build/dual-drive), PERF-1/2/4 must be re-measured at R7 (see the v6 note above).
 
@@ -168,23 +168,45 @@ cd /data/workspace/f-stack/example/helloworld_stacksel && make   # ./helloworld_
 
 ## 10. v6 R7 automatic dual-stack measured verdict (commit 13b418191)
 
-> **Honest basis**: this section separates "measured/provable PASS" from "v6 wrk throughput baseline not re-run"; no performance numbers are fabricated. The throughput tables in §4/§5 are still the **v5 per-fd either/or** FINAL data and were **not** re-measured under v6 default dual-build/dual-drive.
+> This section is the measured verdict for v6 native automatic dual-stack. The vector A A/B throughput in §10.2 is a **v6 default dual-build/dual-drive** real-machine measurement (helloworld, IPv4-only, linked against the macro-on lib, toggling only config `kernel_coexist` 0/1; client wrk 4.2.0 against the DPDK NIC 9.134.214.176:80). §4/§5 remain the v5 per-fd either/or FINAL, kept as a historical reference.
 
-### 10.1 Measured / provable PASS
+### 10.1 Measured / proven items
 
 | Item | Evidence | Verdict |
 |---|---|---|
-| **Macro-off zero regression (compile-time)** | `make` clean rebuild rc=0; `nm libfstack.a` coexist symbols=0; `libfstack.a` size 6539682, byte-for-byte identical to baseline | PASS (same binary as upstream F-Stack, performance-equivalent, no retest needed) |
-| **Macro-on build** | `make FF_KERNEL_COEXIST=1` rc=0; coexist symbols complete (incl. `ff_native_fd_map`) | PASS |
-| **Dual-mode unit tests** | macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip` all pass | PASS |
-| **Real-machine dual-stack function (one listen, many uses)** | single `listen(80)` demo: kernel side `ss 0.0.0.0:80` + `curl 127.0.0.1:80=HTTP 200`; F-Stack side `ssh f-stack-client→9.134.214.176:80=HTTP 200` (same process, same epoll) | PASS (functional correctness, not a throughput baseline) |
-| **PERF-4 hot path no map lookup (proven by code)** | recv/send/read/write/recvfrom/sendto only prepend a single `ff_is_kernel_fd()` and do NOT call `ff_native_map_get` (`ff_syscall_wrapper.c` review + `08 §4` V8) | PASS (zero extra cost at the code level) |
+| Macro-off zero regression (compile-time) | `nm libfstack.a` coexist symbols=0; size 6539682 byte-for-byte identical to baseline | PASS (same binary as upstream F-Stack) |
+| Macro-on build | `make FF_KERNEL_COEXIST=1` rc=0; coexist symbols complete (incl. `ff_native_fd_map`) | PASS |
+| Dual-mode unit tests | macro-off P1 50/50; macro-on P1 incl. `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip` | PASS |
+| Real-machine dual-stack function (one listen, many uses) | single `listen(80)`: kernel `curl 127.0.0.1:80=200`; F-Stack `ssh→9.134.214.176:80=200` | PASS |
+| PERF-1/2 F-Stack fast-path no regression | §10.2 vector A A/B real-machine measurement | PASS |
+| PERF-4 hot path no map lookup | recv/send do a single `ff_is_kernel_fd` check, no map lookup (code) + §10.2 keep-alive throughput A1≈A0 (measured) | PASS |
 
-### 10.2 Not re-run (honestly flagged)
+### 10.2 Vector A: v6 default dual-build vs F-Stack business fast path A/B (PERF-1/2, real-machine)
 
-- **PERF-1 / PERF-2 (F-Stack business fast-path wrk throughput A/B under v6 default dual-build/dual-drive)**: the v6 three-tier wrk baseline was **not** re-run this round.
-  - Current basis (inferred, not measured numbers): (1) macro-off is byte-for-byte identical to baseline (compile-time zero regression proven); (2) the dual-drive branches short-circuit when runtime `kernel_coexist=0`; (3) the v6 "dual-build" cost is paid once on `ff_socket`/`bind`/`listen`/`accept` link setup, while the **connection data hot path (recv/send) is single-stack and does not consult the map** (10.1 PERF-4); (4) the v5 same-basis wrk measurement (§4) already showed no regression on the F-Stack fast path when toggling `kernel_coexist` 0/1.
-  - **Verdict**: no regression is expected on the F-Stack business fast path under v6, but **v6 measured wrk numbers are missing**. For exact numbers, re-run T1/T2/T3 x3 trials under `kernel_coexist=1` + macro-on + default dual-stack, pressing the F-Stack business (wrk on f-stack-client against 9.134.214.176:80) per the §3 method.
-- **Link-setup overhead (extra syscalls of dual-build on the socket/accept path)**: not separately quantified; it is a management/low-frequency path, not the data hot path.
+> Same helloworld (IPv4-only, linked against the macro-on lib), toggling only `config.ini [stack] kernel_coexist`: A0=0 (pure F-Stack) / A1=1 (v6 default dual-build/dual-drive). The client (f-stack-client, masked 192.168.1.2) runs wrk 4.2.0 against the DPDK NIC 9.134.214.176:80 (masked 192.168.1.1). Each tier reports the median of 3 trials; environment and method follow §2/§3 (single lcore `lcore_mask=10`, `idle_sleep=20`, keep-alive).
 
-→ **v6 R7 performance gate verdict: functional correctness + compile-time zero regression + hot-path code guarantee PASS; the v6 throughput wrk baseline (PERF-1/2) is "not re-run, inferred no-regression by design" and needs a follow-up real-machine measurement to give FINAL numbers.**
+Throughput req/s (median of 3):
+
+| Tier | A0 coexist-off | A1 v6 dual-stack | Δ (A1 vs A0) | trials (A0 / A1) |
+|---|---:|---:|---:|---|
+| T1 (-t2 -c10 5s)   | 28,216 | 27,729 | **−1.73%** | A0 28216/28213/28606 · A1 26873/27729/27911 |
+| T2 (-t4 -c100 30s) | 202,805 | 206,219 | **+1.68%** | A0 206117/202805/202697 · A1 202045/206219/206744 |
+| T3 (-t8 -c500 30s) | 120,702 | 127,784 | **+5.87%** | A0 120702/110394/125671 · A1 128306/117037/127784 |
+
+p99 latency (median of 3):
+
+| Tier | A0 p99 | A1 p99 |
+|---|---:|---:|
+| T1 | 526 us | 528 us |
+| T2 | 726 us | 733 us |
+| T3 | 206.22 ms | 208.25 ms |
+
+- Zero socket errors across all 18 trials.
+
+### 10.3 Verdict
+
+With v6 default dual-build/dual-drive on (A1) versus off (A0), every throughput delta falls within trial noise and shows no systematic negative trend: T1 −1.73%, T2 +1.68%, T3 +5.87% (A1 is slightly faster at T2/T3). p99 is essentially equal in both arms: T1 ~526 us, T2 ~730 us, and T3 ~206-208 ms, where the T3 figure is the same-basis c500 single-lcore tail and behaves identically for A0 and A1. This matches the v5 §4 verdict: the dual-build cost is paid once on listen-socket setup, while a keep-alive connection's data hot path (recv/send) is single-stack and does not consult the map (PERF-4), so there is no measurable regression on the F-Stack business fast path.
+
+→ **PERF-1/2/4 PASS (v6 real-machine): v6 native automatic dual-stack introduces no measurable regression on the F-Stack business fast path (NFR-1/NFR-2); F-Stack always carries the business (NFR-3).**
+
+> Raw wrk output (IP-masked): `/tmp/perf/A{0,1}_T{1,2,3}_tr{1,2,3}.txt` (cleaned via `rm_tmp_file.sh` after the run).
diff --git a/docs/kernel_event_support_spec/zh_cn/09-impl-plan.md b/docs/kernel_event_support_spec/zh_cn/09-impl-plan.md
@@ -111,7 +111,7 @@
 1. R0-R6 已完成（见 §2）。
 2. R7 spec 升级为 v6（中英文已同步）。
 3. **R7 实现已完成**（connect 契约 Q2=B 已确认）：3.1→3.2（映射表）→3.3（socket 双建 + bind/listen/close/accept 双驱动 + setsockopt/fcntl）→3.4（epoll 双注册 + close 清配对）→3.6（demo）。
-4. R7 测试已完成：cmocka 双态（宏关 P1 50/50；宏开 P1 含 `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`）+ 真机双栈（单 listen(80)：内核 `curl 127.0.0.1:80=200`、F-Stack `ssh f-stack-client→9.134.214.176:80=200`）。**注**：性能 PERF-1/2/4 的 v6 wrk 吞吐基准未重跑，见 `10 §10`。
+4. R7 测试已完成：cmocka 双态（宏关 P1 50/50；宏开 P1 含 `test_ff_native_fd_map`/`test_ff_kernel_fd_encode_roundtrip`）+ 真机双栈（单 listen(80)：内核 `curl 127.0.0.1:80=200`、F-Stack `ssh f-stack-client→9.134.214.176:80=200`）+ 性能 A/B（v6 默认双建 vs 纯 F-Stack，T1/T2/T3 各 3 trial，Δ −1.73%/+1.68%/+5.87% 全落噪声内无回归，见 `10 §10`）。
 5. R7 门禁 PASS：`08 §4` V1-V12 已实测；双编译 nm 零回归（宏关共存符号=0、size 6539682 与基线一致；宏开含 `ff_native_fd_map`）；中英文 spec 已同步；英文简短 commit `13b418191`；config 本机值未提交。bounce=1（test_ff_epoll stub，已修）。
 
 ## 6. 工作区脚本规约
diff --git a/docs/kernel_event_support_spec/zh_cn/10-perf-baseline-report.md b/docs/kernel_event_support_spec/zh_cn/10-perf-baseline-report.md