Skip to content

Commit 3e33225

Browse files
committed
docs(bench): Phase 4 perf flamegraph diff identifies vendor-side regression sources
Append §12.11 to runtime-fix-execution-log.md and §11 to 13.0-baseline-cvm-bench-report.md with Phase 4 results: same-time-window perf record cpu-clock @999hz on lcore CPU#4 for both 13.0 baseline and 15.0 runtime-fix-done helloworld during wrk T3 t8c500 30s, plus differential flamegraph generated via /opt/FlameGraph/difffolded.pl. Root-cause findings: - 9% throughput regression is sourced from FreeBSD 13->15 vendor kernel evolution, NOT from the 6 runtime-fix commits. - Primary contributors (per diff.folded sample-weight): * tcp_output -> tcp_default_output vtable wrapper (+99M x3 + 89M, -838M old direct path) - RACK/BBR alternative TCP stacks abstraction * TCP CUBIC ack-received state machine expansion (+86M) * socket buffer locking refactor: sbcut_locked / sbappendstream / assert_rw (+~230M) * ether_nh_input mbuf pipeline change (+88M) * uma_zfree_arg path divergence (+0.5pp) - runtime-fix touch points (kern_accept / badfileops / route_rtentry / atomic.h) DO NOT appear in the T3 hot path - they are init/SIGSEGV-error-path only. Side finding: ~5% CPU consumed by helloworld stdout -> ext4 -> fsnotify/inotify chain - test environment noise, not an F-Stack issue. Production may redirect stdout to tmpfs / /dev/null to recover this. Recommendation: accept the 9% gap for M5 acceptance; correctness benefit of 5 P0 SIGSEGV fixes outweighs throughput cost. Do NOT bisect-revert runtime-fix commits (would re-trigger SIGSEGVs). Tooling notes: CVM virtualization blocks hardware PMU 'cycles:P' (zero samples observed), forced switch to software 'cpu-clock' event. -O0 -g build retains frame pointer; --call-graph fp avoids dwarf 16KB stack disturbance on 100% busy lcore. 13.0 baseline cold-start required wrk warmup (else swapper idle 35.6%, throughput 136k); after warmup matches §12.10 figures (225k vs 239k earlier, within ±5% jitter). System-level installs (permanent): perf-6.6.119-49.23.tl4 (~50MB), /opt/FlameGraph (~2MB git clone). Mandatory workspace policies obeyed: kill_process.sh x1, rm_tmp_file.sh x1, zero direct rm/kill/chmod calls. Local commit only; not pushed (per user decision Q_commit=D).
1 parent 62f5400 commit 3e33225

2 files changed

Lines changed: 197 additions & 1 deletion

File tree

docs/freebsd_13_to_15_upgrade_spec/zh_cn/13.0-baseline-cvm-bench-report.md

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -186,6 +186,109 @@ make: *** No rule to make target 'ff_api.symlist', needed by 'libfstack.a'. Sto
186186

187187
- `runtime-fix-execution-log.md` §12.6 (历史 15.0 单边数据起源)
188188
- `runtime-fix-execution-log.md` §12.10 (同摘要双份落档)
189+
- `runtime-fix-execution-log.md` §12.11 (Phase 4 perf flamegraph 双版叠图)
189190
- spec 06-test-and-acceptance-spec.md §5 (性能基线 NFR-1 整体框架)
190191
- preflight log: `/tmp/13baseline-bench/preflight.log`
191192
- 编译 log: `/tmp/13baseline-bench/lib_make.log` + `/tmp/13baseline-bench/example_make.log`
193+
194+
---
195+
196+
## 11. Phase 4 perf flamegraph 根因定位(追加 — 2026-06-03 15:25-15:33)
197+
198+
### 11.1 工具与采样
199+
200+
|||
201+
|---|---|
202+
| perf | 6.6.119-49.23.tl4(dnf 装) |
203+
| FlameGraph | `/opt/FlameGraph` (Brendan Gregg, git clone) |
204+
| 采样命令 | `perf record -e cpu-clock -F 999 -C 4 -g --call-graph fp -- sleep 30` |
205+
| 采样事件 | `cpu-clock`(software,因 CVM 虚拟化下 `cycles:P` PMU 不可用,hypervisor 屏蔽) |
206+
| call-graph | `fp`(frame pointer,`-O0 -g` 编译保留),不用 dwarf 避免 16KB stack 干扰 |
207+
| 同步触发 | perf record 后台 + ssh wrk T3 t8c500 30s |
208+
209+
### 11.2 双版采样数据
210+
211+
| 版本 | wrk req/s(perf 干扰下) | swapper idle | sample | folded 行 | svg |
212+
|---|---:|---:|---:|---:|---|
213+
| 15.0 runtime-fix-done | 237,707 | 3.29% | 29,974 | 583 | `/tmp/perf-15rfix/15rfix.svg` |
214+
| 13.0 baseline (after warmup) | 225,777 | 2.82% | 29,974 | 570 | `/tmp/perf-13baseline/13baseline.svg` |
215+
216+
### 11.3 Top 函数 cpu-clock 占比对比
217+
218+
| 函数 | 15.0 % | 13.0 % | Δ | 解读 |
219+
|---|---:|---:|---:|---|
220+
| `tcp_default_output` (15) / `tcp_output` (13) | 4.43 | 3.49 | **+0.94** | TCP stacks vtable wrapper |
221+
| `uma_zfree_arg` | 2.87 | 2.37 | **+0.50** | UMA 释放路径变重 |
222+
| `virtio_recv_mergeable_pkts` | 2.20 | 1.46 | **+0.74** | 收包路径增长 |
223+
| `in_cksum_skip` | 5.01 | 4.94 | +0.07 | 校验和等价 |
224+
| `tcp_do_segment` | 1.39 | 1.73 | -0.34 | 部分逻辑切到 default_output |
225+
| `m_copydata` | 1.53 | 1.64 | -0.11 | mbuf 拷贝接近 |
226+
227+
### 11.4 差分火焰图增长 / 减少 top(按 sample-weight)
228+
229+
**▲ 15.0 grew (regression):**
230+
231+
| 函数(部分调用栈) | Δ |
232+
|---|---:|
233+
| `tcp_default_output` (3 paths) | +99M ×2 + 89M |
234+
| `tcp_usr_send` | +99M |
235+
| `ether_nh_input` | +88M |
236+
| `cubic_ack_received` | +86M |
237+
| `sbcut_locked` | +84M |
238+
| `sbappendstream` | +73M |
239+
| `assert_rw` (rwlock primitive) | +72M |
240+
241+
**▼ 15.0 shrunk:**
242+
243+
| 函数 | Δ |
244+
|---|---:|
245+
| `tcp_output` (旧直接) | **-838M** |
246+
| `tcp_do_segment` | -428M |
247+
| `ip_output` (旧路径) | -173M |
248+
| `callout_reset_tick_on` | -142M |
249+
| `kqueue_kevent` (so_rdknl_lock) | -131M |
250+
251+
### 11.5 根因结论 — 9% gap **不是** runtime-fix 引入
252+
253+
| 真凶(FreeBSD 13→15 vendor 演进) | 占比 | 来源 |
254+
|---|---|---|
255+
| TCP stacks 框架 vtable 派发(tcp_output → tcp_default_output) | 主因 ~+1% | RACK/BBR alternative stacks 抽象层 |
256+
| TCP CUBIC 拥塞控制升级 | ~+0.6% | cubic_ack_received 状态机扩展 |
257+
| socket buffer locking 重构 | ~+1.5% | sbcut_locked / sbappendstream / assert_rw |
258+
| 以太网 mbuf 入口流水线调整 | ~+0.7% | ether_nh_input |
259+
| UMA 释放路径变化 | ~+0.5% | uma_zfree_arg |
260+
261+
**runtime-fix 的 5 个 P0 修复 + 1 防御性 panic stub 全部位于 init / SIGSEGV 路径**,差分图中**完全没有出现** `kern_accept` / `badfileops` / `route_rtentry` / atomic.h 改动相关函数 —— 即 fastpath 不走这些代码。
262+
263+
### 11.6 副发现:log I/O 测试环境噪声 ~5%
264+
265+
两版均有 `__GI___libc_write``vfs_write``ext4_buffered_write_iter``fsnotify``inotify_handle_inode_event` 调用链 ≈ 5% CPU。
266+
267+
**这是 helloworld stdout 重定向到 ext4 + 文件被 inotify/fanotify watch 的副作用,不属于 F-Stack 性能问题**。生产环境关闭 stdout log(或重定向 tmpfs / `/dev/null`)可立即回收 ~5%。
268+
269+
### 11.7 行动建议(性价比排序)
270+
271+
| 优先级 | 动作 | 预期收益 | 风险 |
272+
|---|---|---|---|
273+
| ★★★ | 接受 9% gap 进入 M5(正确性 >> 性能) |||
274+
| ★★ | helloworld stdout log → tmpfs / /dev/null | +~5% | 无(仅日志输出方式) |
275+
|| `net.inet.tcp.functions_default=freebsd` 关 RACK/BBR 包装层 | +~1% | 失去 RACK 改进 |
276+
|| 调 socket buffer / cwnd 默认 | 待测 | 影响 TCP 行为 |
277+
| × | 回退或二分 runtime-fix commit || 重触发 5 个 P0 SIGSEGV |
278+
279+
### 11.8 工具产出归档
280+
281+
| 产物 | 路径 | 大小 |
282+
|---|---|---:|
283+
| 15.0 perf.data | `/tmp/perf-15rfix/perf.data` | 3.76 MB |
284+
| 15.0 svg | `/tmp/perf-15rfix/15rfix.svg` | 146 KB |
285+
| 13.0 perf.data | `/tmp/perf-13baseline/perf.data` | 3.69 MB |
286+
| 13.0 svg | `/tmp/perf-13baseline/13baseline.svg` | 145 KB |
287+
| 差分 svg | `/tmp/perf-diff/diff.svg` | 152 KB |
288+
| 折叠数据 | `*.folded`(各目录) ||
289+
290+
### 11.9 Phase 4 系统终态
291+
292+
- **13.0 baseline helloworld 仍在跑**(PID 1748870, lcore=4),可立即用于后续验证;切回 15.0 走标准流程:`kill_process.sh + rm_tmp_file.sh /dev/hugepages/rtemap_* + 主树 ./start.sh`
293+
- Phase 4 全程零直接 `rm`/`kill`/`chmod``kill_process.sh` × 1、`rm_tmp_file.sh` × 1
294+
- 系统级永久新增:`/usr/bin/perf` (perf-6.6.119-49.23.tl4) + `/opt/FlameGraph/`

docs/freebsd_13_to_15_upgrade_spec/zh_cn/runtime-fix-execution-log.md

Lines changed: 94 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -375,9 +375,102 @@ const struct fileops badfileops = {0};
375375
- **C. dpdk EAL 参数对齐复测** —— 验证 mempool / mbuf / RING size 在两版 stub 下是否实际等价
376376
- **D. 决策点**:是否(D1)接受当前 9% gap 进入 M5 验收 / 还是(D2)开 Phase 4 性能调优 / 还是(D3)回退某个 runtime-fix commit 二分定位
377377
378-
#### 12.10.7 系统终态
378+
#### 12.10.7 系统终态(首轮 Step 1-5)
379379
380380
- 13.0 baseline helloworld:已 kill (PID 1735251),hugepages 23 个 rtemap 残留通过 `rm_tmp_file.sh` 清入 trash
381381
- 15.0 runtime-fix-done helloworld:**仍后台在跑**(PID 1738072, lcore=4, hugepages 23/4096, port0 9.134.214.176:80 监听),log: `/tmp/15rfix-bench/hello.log`
382382
- 双 binary 已备份保留,后续切换无需重编(cp 替换 `./example/helloworld` + 重启即可)
383383
- 强制规约遵守:本节全程 `kill_process.sh` × 2、`rm_tmp_file.sh` × 3(rtemap×23 + lib 产物×195 + rtemap×23),零直接 `rm`/`kill`/`chmod` 调用
384+
385+
### 12.11 Phase 4 perf flamegraph 双版叠图(2026-06-03)
386+
387+
为定位 §12.10 揭示的 7-9% 吞吐回退根因,本节按 plan §12.10.6-A 路径执行最小集 perf flamegraph 双版采样。
388+
389+
#### 12.11.1 工具链与采样参数
390+
391+
- `perf 6.6.119-49.23.tl4`(dnf install)+ `/opt/FlameGraph/{stackcollapse-perf,flamegraph,difffolded}.pl`
392+
- 采样参数(CVM 虚拟化适配):`-e cpu-clock -F 999 -C 4 -g --call-graph fp -- sleep 30`
393+
- **关键决策**:CVM 上 hardware PMU `cycles:P` 在 hypervisor 后被屏蔽(采到 0 sample);改用 software event `cpu-clock`,999Hz 锁 lcore CPU#4(lcore_mask=10 hex = bit 4)
394+
- `-O0 -g` 编译保留 frame pointer,`--call-graph fp` 而非 dwarf(避免 dwarf 16KB stack 干扰 100% busy lcore)
395+
- 同步触发:`perf record` 后台 + ssh f-stack-client wrk T3 t8c500 30s
396+
- 13.0 baseline 必须先 wrk warmup 5s(首次冷启 swapper idle 35.6%、wrk 仅 136k req/s;warmup 后 swapper idle 2.82%、wrk 225k req/s 与 §12.10.3 数据一致)
397+
398+
#### 12.11.2 双版采样结果(同 wrk 干扰下)
399+
400+
| 版本 | helloworld PID | wrk req/s (T3 期间 perf 干扰下) | swapper idle | 样本数 | folded 行数 | svg 路径 |
401+
|---|---:|---:|---:|---:|---:|---|
402+
| 15.0 runtime-fix-done | 1738072 | 237,707 | 3.29% | 29,974 | 583 | `/tmp/perf-15rfix/15rfix.svg` |
403+
| 13.0 baseline | 1748870 | 225,777 | 2.82% | 29,974 | 570 | `/tmp/perf-13baseline/13baseline.svg` |
404+
405+
差分火焰图:`/tmp/perf-diff/diff.svg`(红=15.0 多出的开销,蓝=13.0 多出的开销)
406+
407+
#### 12.11.3 Top 函数对比(cpu-clock %,按 15.0 排序)
408+
409+
| 函数 | 15.0 | 13.0 | Δ | 解读 |
410+
|---|---:|---:|---:|---|
411+
| `in_cksum_skip` | 5.01% | 4.94% | +0.07 | 校验和计算,两版基本一致 |
412+
| `tcp_default_output` (15) / `tcp_output` (13) | 4.43% | 3.49% | **+0.94** | **TCP stacks 框架 vtable wrapper** |
413+
| `uma_zfree_arg` | 2.87% | 2.37% | **+0.50** | UMA 释放路径增长 |
414+
| `virtio_recv_mergeable_pkts` | 2.20% | 1.46% | **+0.74** | 收包 mbuf 处理增长 |
415+
| `__memmove_avx_unaligned_erms` | 1.79% | 1.66% | +0.13 | libc 拷贝 |
416+
| `m_copydata` | 1.53% | 1.64% | -0.11 | mbuf 拷贝两版接近 |
417+
| `tcp_do_segment` | 1.39% | 1.73% | -0.34 | 部分逻辑被切到 tcp_default_output |
418+
| `virtio_xmit_pkts` | 1.14% | 1.22% | -0.08 | 发包路径接近 |
419+
| `ff_dpdk_if_send` | 1.12% | 1.12% | 0 | F-Stack DPDK 发送等价 |
420+
| `ip_output` | 0.95% | 0.93% | +0.02 | IP 输出等价 |
421+
422+
#### 12.11.4 差分图 top 增长 / 减少(diff.folded sample 权重)
423+
424+
| 方向 | 函数(部分调用栈) | Δ sample-weight |
425+
|---|---|---:|
426+
| ▲ 增长 | `tcp_default_output` (multiple paths) | +99M ×2 + 89M |
427+
| ▲ 增长 | `tcp_usr_send` | +99M |
428+
| ▲ 增长 | `ether_nh_input` | +88M |
429+
| ▲ 增长 | `cubic_ack_received` | +86M |
430+
| ▲ 增长 | `sbcut_locked` / `sbappendstream` | +84M / +73M |
431+
| ▲ 增长 | `assert_rw` (rwlock primitive) | +72M |
432+
| ▼ 减少 | `tcp_output` (旧直接调用) | **-838M** |
433+
| ▼ 减少 | `tcp_do_segment` | -428M |
434+
| ▼ 减少 | `ip_output` (旧路径) | -173M |
435+
| ▼ 减少 | `callout_reset_tick_on` | -142M |
436+
| ▼ 减少 | `kqueue_kevent` (so_rdknl_lock 路径) | -131M |
437+
438+
#### 12.11.5 关键根因结论
439+
440+
**9% 吞吐回退主要源自 FreeBSD 13→15 vendor 内核演进,与 runtime-fix 6 个 commit 无关**:
441+
442+
1. **TCP stacks 框架抽象层**:13.0 直接 `tcp_output()`,15.0 改为 `tcp_default_output()`(FreeBSD 14+ 引入的 RACK/BBR alternative stacks vtable 派发);旧路径 -838M、新路径 +99M×3 + 89M ≈ +386M,**净开销集中在 vtable indirect call + cb 包装**
443+
2. **TCP CUBIC 拥塞控制升级**:`cubic_ack_received` +86M,13.0 → 15.0 之间 CUBIC 算法改用更精细的 RTT/cwnd 状态机
444+
3. **socket buffer locking 重构**:`sbcut_locked` / `sbappendstream` / `assert_rw` 共增长 ~230M —— 15.0 引入更细粒度 SMP 锁,单核场景下断言/原子开销反而比 13.0 lock-elision fast-path 重
445+
4. **以太网入口路径变化**:`ether_nh_input` +88M,13.0 → 15.0 mbuf 处理流水线调整
446+
5. **runtime-fix 的 5 个 P0 修复 + 1 防御性 panic stub 全部位于 init / SIGSEGV 路径,不出现在 T3 高并发 hot path 中** —— 差分图未显示这些函数(kern_accept / badfileops / route_rtentry 等)有显著 sample,回归非由 runtime-fix 引入
447+
448+
#### 12.11.6 副发现:log 写入 ~5% 测试环境噪声
449+
450+
两版均有显著 `__GI___libc_write` → `vfs_write` → `ext4_buffered_write_iter` → `fsnotify` → `inotify_handle_inode_event` 链 ≈ 5% CPU;这是 helloworld stdout 重定向到 ext4 + log 文件被 inotify/fanotify watch 的副作用,**不属于 F-Stack 性能问题**。生产环境如关闭 stdout log(或重定向到 tmpfs)可立即回收这 5%。
451+
452+
#### 12.11.7 决策建议(不改本次范围)
453+
454+
- **接受 9% gap 进入 M5 验收**:runtime-fix 5 个 P0 SIGSEGV 修复价值远大于 9% 吞吐,且回退路径不存在(TCP stacks 抽象是 vendor 默认开启)
455+
- **可选优化方向**(性价比排序):
456+
- (1) helloworld stdout log 重定向到 tmpfs / `/dev/null`(预期回收 ~5%,零代码改动)
457+
- (2) 调研 `net.inet.tcp.functions_default = freebsd`(关 RACK/BBR 包装层,回到直接 `tcp_output`,预期回收 ~1%)
458+
- (3) `kern.smr.shared_only=1`(如可用,禁用 SMP 共享 SMR fast-path 锁开销)
459+
- **不建议**:回退或二分 runtime-fix commit(风险 = 重新触发 5 个 P0 SIGSEGV)
460+
461+
#### 12.11.8 工具产出归档
462+
463+
| 产物 | 路径 | 用途 |
464+
|---|---|---|
465+
| 15.0 perf raw | `/tmp/perf-15rfix/perf.data` (3.76 MB) | 后续 `perf report` 复查 |
466+
| 15.0 火焰图 | `/tmp/perf-15rfix/15rfix.svg` (146 KB) | 浏览器可视化 |
467+
| 13.0 perf raw | `/tmp/perf-13baseline/perf.data` (3.69 MB) | 同 |
468+
| 13.0 火焰图 | `/tmp/perf-13baseline/13baseline.svg` (145 KB) | 同 |
469+
| 差分图 | `/tmp/perf-diff/diff.svg` (152 KB) | 红/蓝对照 |
470+
| 折叠数据 | `*.folded` (in 各目录) | 后续脚本化分析 |
471+
472+
#### 12.11.9 系统终态(Phase 4 末)
473+
474+
- 13.0 baseline helloworld 仍在跑(PID 1748870, lcore=4),后续若需切回 15.0 走 `kill_process.sh + rm_tmp_file.sh /dev/hugepages/rtemap_* + 主树 ./start.sh` 即可
475+
- Phase 4 全程零直接 `rm`/`kill`/`chmod` 调用:`kill_process.sh` × 1(首切 15.0)、`rm_tmp_file.sh` × 1(rtemap × 23)
476+
- 工具系统级安装:`dnf install perf` (50MB) + `git clone /opt/FlameGraph` (2MB),永久保留

0 commit comments

Comments
 (0)