docs: update README (Chinese and English): fix compiler support, workflow structure, and benchmark data

wicyn · wicyn · commit 3804d2f97db9 · 2026-06-02T20:20:04.000+08:00
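- Requirements table: mark Apple Clang as not yet supported and note that Apple's
  libc++ does not implement std::stop_token / std::jthread (P0660); add build
  instructions for using Homebrew LLVM on macOS. Add the standard-library
  prerequisite for Clang (libstdc++ 11+ or libc++ 18+), and note that the
  standard library dependency must provide <stop_token>.
- Project structure: correct the ".github/workflows/ci.yml: CI matrix" line to
  cover the whole workflows directory (ubuntu/windows/macos build and test +
  codeql + lint); ci.yml itself is lint only.
- Benchmark data: update to results measured with empty task bodies (pure
  scheduling overhead); geometric mean 1.58× → 1.85×. Rewrite the summary and
  state the empty-body basis; correctness is verified separately by cases that
  use atomic accumulation.
- Apply the same changes to both the Chinese and English versions.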
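
As a cross-check on the updated geometric mean, the sketch below recomputes it from the 25 timing pairs in the new table (speedup = Taskflow time / TaskflowLite time). The names `geometric_mean_speedup` and `kTimingsMs` are illustrative only and do not exist in the benchmark sources.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper: geometric mean of per-scenario speedups,
// where speedup = Taskflow ms / TaskflowLite ms.
// Averaging logarithms is equivalent to taking the n-th root of the product.
double geometric_mean_speedup(const std::vector<std::pair<double, double>>& timings_ms) {
    double log_sum = 0.0;
    for (const auto& [lite_ms, taskflow_ms] : timings_ms) {
        log_sum += std::log(taskflow_ms / lite_ms);
    }
    return std::exp(log_sum / static_cast<double>(timings_ms.size()));
}

// {TaskflowLite ms, Taskflow ms}, copied from the rows added to the README table
// (scenarios 01, 02, 03, 04a-04f, 05-20).
const std::vector<std::pair<double, double>> kTimingsMs = {
    {893, 1321},  {483, 1223},  {219, 357},   {429, 611},   {1435, 1779},
    {933, 1258},  {1080, 1496}, {1115, 1627}, {548, 715},   {1349, 2980},
    {3181, 4096}, {389, 2452},  {653, 2722},  {1815, 3799}, {25, 50},
    {49, 75},     {130, 183},   {94, 159},    {406, 633},   {580, 1159},
    {346, 693},   {170, 488},   {68, 236},    {746, 873},   {786, 1115},
};
```

Calling `geometric_mean_speedup(kTimingsMs)` from a small `main` should give a value close to the reported ≈ 1.85×. A geometric mean is used rather than an arithmetic one so that the outliers (07 and 08) do not dominate the average of ratios.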
diff --git a/README.en.md b/README.en.md
@@ -385,16 +385,28 @@ Paste the output into the [D2 Playground](https://play.d2lang.com) to render. Le
 
 ### Requirements
 
-| Compiler | Minimum Version |
-|----------|----------------|
-| GCC | 12+ |
-| Clang | 15+ |
-| MSVC | 2022+ (17.0+) |
-| Apple Clang | 15+ (Xcode 15+) |
+| Compiler | Minimum Version | Notes |
+|----------|----------------|-------|
+| GCC | 12+ | libstdc++ provides `std::stop_token` |
+| Clang | 15+ | requires libstdc++ 11+ or libc++ 18+ |
+| MSVC | 2022+ (17.0+) | |
+| Apple Clang | ⚠️ Not supported | Apple's libc++ does not implement `std::stop_token` / `std::jthread` (P0660); use Homebrew LLVM on macOS |
 
 - **C++ Standard**: C++23
 - **CMake**: 3.21+
-- **Dependencies**: C++ standard library + pthread (Unix)
+- **Dependencies**: C++ standard library (must provide `<stop_token>`) + pthread (Unix)
+
+> **macOS note:** This library's cancellation mechanism relies on `std::stop_token`,
+> which the libc++ bundled with Apple Clang does not yet provide, so the library cannot
+> be compiled with the default macOS toolchain. Install Homebrew LLVM and point CMake at it:
+>
+> ```bash
+> brew install llvm
+> cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
+>   -DCMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
+>   -DCMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++"
+> cmake --build build --parallel
+> ```
 
 ### CMake
 
@@ -470,7 +482,7 @@ taskflowlite/
 ├── test/                                  # 23 test files (Catch2 v3)
 ├── examples/                              # 25 examples
 ├── benchmarks/                            # Performance comparison (vs Taskflow)
-├── .github/workflows/ci.yml               # CI matrix
+├── .github/workflows/                     # CI: ubuntu/windows/macos build+test + codeql + lint (ci.yml)
 ├── CMakeLists.txt
 ├── LICENSE (MIT)
 └── README.md
@@ -481,39 +493,50 @@ taskflowlite/
 ## Benchmark Results
 
 TaskflowLite vs Taskflow — **same hardware, threads, topology, and total iterations**.
+Task bodies are **empty function calls**, so the figures measure **pure scheduling
+overhead** (excluding per-node computation/atomics); correctness is verified separately
+by a counter-based suite.
 
 **Test Environment:** Intel Core i7-9750H @ 2.60GHz (6C/12T), Windows 11, MSVC 2022 /O2
 
 | # | Scenario | Config | TaskflowLite | Taskflow | Speedup |
 |--:|---------|------|--------:|------------:|------:|
-| 01 | 32 parallel | 8 thr · 500k | 1009 ms | 1479 ms | **1.47×** |
-| 02 | 32 serial | 1 thr · 1M | 662 ms | 1323 ms | **2.00×** |
-| 03 | Diamond DAG | 2 thr · 1M | 255 ms | 400 ms | **1.57×** |
-| 04a | 4×2 full | 2 thr · 1M | 504 ms | 663 ms | **1.32×** |
-| 04b | 6×4 full | 4 thr · 500k | 1737 ms | 1964 ms | **1.13×** |
-| 04c | 8×8 full | 8 thr · 100k | 1076 ms | 1309 ms | **1.22×** |
-| 04d | 8×16 full | 8 thr · 50k | 1250 ms | 1531 ms | **1.22×** |
-| 04e | 8×32 full | 8 thr · 20k | 1210 ms | 1795 ms | **1.48×** |
-| 04f | 6×100 full | 8 thr · 2k | 516 ms | 778 ms | **1.51×** |
-| 05 | Binary tree | 8 thr · 500k | 1969 ms | 3278 ms | **1.66×** |
-| 06 | 1→256→1 fan | 8 thr · 100k | 3395 ms | 4167 ms | **1.23×** |
-| 07 | 16 pipelines | 8 thr · 200k | 911 ms | 2591 ms | **2.84×** |
-| 08 | 16×16 grid | 8 thr · 100k | 1228 ms | 2978 ms | **2.43×** |
-| 09 | Sparse DAG | 8 thr · 500k | 2508 ms | 4042 ms | **1.61×** |
-| 10 | Jump loop | 1 thr · 1M | 30 ms | 53 ms | **1.77×** |
-| 11 | MultiJump loop | 4 thr · 200k | 58 ms | 82 ms | **1.41×** |
-| 12 | Subflow once | 4 thr · 200k | 160 ms | 210 ms | **1.31×** |
-| 13 | Subflow loop | 2 thr · 500k | 105 ms | 168 ms | **1.60×** |
-| 14 | Empty task | 1 thr · 10M | 473 ms | 642 ms | **1.36×** |
-| 15 | Parallel for | 8 thr · 1024×10k | 734 ms | 1221 ms | **1.66×** |
-| 16 | Reduce tree | 8 thr · 127×50k | 465 ms | 828 ms | **1.78×** |
-| 17 | Scan chain | 1 thr · 128×100k | 235 ms | 570 ms | **2.43×** |
-| 18 | Wavefront | 8 thr · 210×10k | 115 ms | 262 ms | **2.28×** |
-| 19 | Heterogeneous | 8 thr · 18×100k | 851 ms | 878 ms | **1.03×** |
-| 20 | Memory stress | 8 thr · 2000×500 | 774 ms | 1144 ms | **1.48×** |
-| | **Geometric mean** | | | | **≈ 1.58×** |
+| 01 | 32 parallel | 8 thr · 500k | 893 ms | 1321 ms | **1.48×** |
+| 02 | 32 serial | 1 thr · 1M | 483 ms | 1223 ms | **2.53×** |
+| 03 | Diamond DAG | 2 thr · 1M | 219 ms | 357 ms | **1.63×** |
+| 04a | 4×2 full | 2 thr · 1M | 429 ms | 611 ms | **1.42×** |
+| 04b | 6×4 full | 4 thr · 500k | 1435 ms | 1779 ms | **1.24×** |
+| 04c | 8×8 full | 8 thr · 100k | 933 ms | 1258 ms | **1.35×** |
+| 04d | 8×16 full | 8 thr · 50k | 1080 ms | 1496 ms | **1.39×** |
+| 04e | 8×32 full | 8 thr · 20k | 1115 ms | 1627 ms | **1.46×** |
+| 04f | 6×100 full | 8 thr · 2k | 548 ms | 715 ms | **1.30×** |
+| 05 | Binary tree | 8 thr · 500k | 1349 ms | 2980 ms | **2.21×** |
+| 06 | 1→256→1 fan | 8 thr · 100k | 3181 ms | 4096 ms | **1.29×** |
+| 07 | 16 pipelines | 8 thr · 200k | 389 ms | 2452 ms | **6.30×** |
+| 08 | 16×16 grid | 8 thr · 100k | 653 ms | 2722 ms | **4.17×** |
+| 09 | Sparse DAG | 8 thr · 500k | 1815 ms | 3799 ms | **2.09×** |
+| 10 | Jump loop | 1 thr · 1M | 25 ms | 50 ms | **2.00×** |
+| 11 | MultiJump loop | 4 thr · 200k | 49 ms | 75 ms | **1.53×** |
+| 12 | Subflow once | 4 thr · 200k | 130 ms | 183 ms | **1.41×** |
+| 13 | Subflow loop | 2 thr · 500k | 94 ms | 159 ms | **1.69×** |
+| 14 | Empty task | 1 thr · 10M | 406 ms | 633 ms | **1.56×** |
+| 15 | Parallel for | 8 thr · 1024×10k | 580 ms | 1159 ms | **2.00×** |
+| 16 | Reduce tree | 8 thr · 127×50k | 346 ms | 693 ms | **2.00×** |
+| 17 | Scan chain | 1 thr · 128×100k | 170 ms | 488 ms | **2.87×** |
+| 18 | Wavefront | 8 thr · 210×10k | 68 ms | 236 ms | **3.47×** |
+| 19 | Heterogeneous | 8 thr · 18×100k | 746 ms | 873 ms | **1.17×** |
+| 20 | Memory stress | 8 thr · 2000×500 | 786 ms | 1115 ms | **1.42×** |
+| | **Geometric mean** | | | | **≈ 1.85×** |
+
+**Summary:** TaskflowLite is faster in all 25 scenarios, with a geometric mean of ≈ **1.85×**.
+Because task bodies are empty, these figures measure pure scheduling overhead, so the ratios
+run higher than they would under real workloads, where computation common to both libraries
+pulls the ratio toward 1.0. The largest gains are in dependency-dense topologies: pipelines
+(07, 6.30×), grid (08, 4.17×), wavefront (18, 3.47×), and scan chain (17, 2.87×); the closest result is heterogeneous load (19, 1.17×).
 
 > Full benchmark source code in [benchmarks/](benchmarks/).
+> Figures are for empty task bodies (pure scheduling overhead); the atomic-counter
+> correctness suite is in the benchmark source.
 
 ---
 
@@ -567,4 +590,4 @@ cmake --build build --config Release
 
 [MIT License](LICENSE)
 
-*TaskflowLite — built for developers who demand extreme performance and modern C++ aesthetics.*
+*TaskflowLite — built for developers who demand extreme performance and modern C++ aesthetics.*
diff --git a/README.md b/README.md
@@ -385,16 +385,28 @@ std::cout << d2;
 
 ### 系统要求
 
-| 编译器 | 最低版本 |
-|--------|---------|
-| GCC | 12+ |
-| Clang | 15+ |
-| MSVC | 2022+ (17.0+) |
-| Apple Clang | 15+ (Xcode 15+) |
+| 编译器 | 最低版本 | 说明 |
+|--------|---------|------|
+| GCC | 12+ | libstdc++ 自带 `std::stop_token` |
+| Clang | 15+ | 需搭配 libstdc++ 11+ 或 libc++ 18+ |
+| MSVC | 2022+ (17.0+) | |
+| Apple Clang | ⚠️ 暂不支持 | Apple 的 libc++ 未实现 `std::stop_token` / `std::jthread`（P0660）；macOS 请改用 Homebrew LLVM |
 
 - **C++ 标准**：C++23
 - **CMake**：3.21+
-- **依赖**：仅 C++ 标准库 + pthread（Unix）
+- **依赖**：仅 C++ 标准库（需实现 `<stop_token>`）+ pthread（Unix）
+
+> **macOS 说明**：本库的取消机制依赖 `std::stop_token`，而 Apple 自带的
+> Apple Clang/libc++ 至今未提供该特性，因此无法用系统默认工具链编译。
+> 请安装 Homebrew LLVM 并指定为编译器：
+>
+> ```bash
+> brew install llvm
+> cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
+>   -DCMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
+>   -DCMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++"
+> cmake --build build --parallel
+> ```
 
 ### CMake
 
@@ -468,7 +480,7 @@ taskflowlite/
 ├── test/                                  # 23 个测试文件 (Catch2 v3)
 ├── examples/                              # 25 个示例
 ├── benchmarks/                            # vs Taskflow 性能对比
-├── .github/workflows/ci.yml               # CI 矩阵
+├── .github/workflows/                     # CI：ubuntu/windows/macos 构建测试 + codeql + lint(ci.yml)
 ├── CMakeLists.txt
 ├── LICENSE (MIT)
 └── README.md
@@ -479,41 +491,47 @@ taskflowlite/
 ## 性能数据
 
 TaskflowLite vs Taskflow，**相同硬件、相同线程数、相同拓扑、相同总迭代次数**。
+任务体为**空函数调用**，因此测得的是**纯调度开销**（不含每节点的计算/原子操作）——
+正确性由另一组带原子累加的用例单独验证。
 
 **测试环境：** Intel Core i7-9750H @ 2.60GHz (6C/12T), Windows 11, MSVC 2022 /O2
 
 | # | 测试场景 | 参数 | TaskflowLite | Taskflow | 加速比 |
 |--:|---------|------|--------:|------------:|------:|
-| 01 | 32 并行 | 8 线 · 500k | 1009 ms | 1479 ms | **1.47×** |
-| 02 | 32 串行 | 1 线 · 1M | 662 ms | 1323 ms | **2.00×** |
-| 03 | 菱形 DAG | 2 线 · 1M | 255 ms | 400 ms | **1.57×** |
-| 04a | 4×2 全连接 | 2 线 · 1M | 504 ms | 663 ms | **1.32×** |
-| 04b | 6×4 全连接 | 4 线 · 500k | 1737 ms | 1964 ms | **1.13×** |
-| 04c | 8×8 全连接 | 8 线 · 100k | 1076 ms | 1309 ms | **1.22×** |
-| 04d | 8×16 全连接 | 8 线 · 50k | 1250 ms | 1531 ms | **1.22×** |
-| 04e | 8×32 全连接 | 8 线 · 20k | 1210 ms | 1795 ms | **1.48×** |
-| 04f | 6×100 全连接 | 8 线 · 2k | 516 ms | 778 ms | **1.51×** |
-| 05 | 二叉归约树 | 8 线 · 500k | 1969 ms | 3278 ms | **1.66×** |
-| 06 | 1→256→1 扇出 | 8 线 · 100k | 3395 ms | 4167 ms | **1.23×** |
-| 07 | 16 条管线 | 8 线 · 200k | 911 ms | 2591 ms | **2.84×** |
-| 08 | 16×16 网格 | 8 线 · 100k | 1228 ms | 2978 ms | **2.43×** |
-| 09 | 稀疏 DAG | 8 线 · 500k | 2508 ms | 4042 ms | **1.61×** |
-| 10 | Jump 循环 | 1 线 · 1M | 30 ms | 53 ms | **1.77×** |
-| 11 | MultiJump 循环 | 4 线 · 200k | 58 ms | 82 ms | **1.41×** |
-| 12 | Subflow 单次 | 4 线 · 200k | 160 ms | 210 ms | **1.31×** |
-| 13 | Subflow 循环 | 2 线 · 500k | 105 ms | 168 ms | **1.60×** |
-| 14 | 空任务 | 1 线 · 10M | 473 ms | 642 ms | **1.36×** |
-| 15 | 并行 for | 8 线 · 1024×10k | 734 ms | 1221 ms | **1.66×** |
-| 16 | 归约树（带计算） | 8 线 · 127×50k | 465 ms | 828 ms | **1.78×** |
-| 17 | 扫描链 | 1 线 · 128×100k | 235 ms | 570 ms | **2.43×** |
-| 18 | 三角波前 | 8 线 · 210×10k | 115 ms | 262 ms | **2.28×** |
-| 19 | 异构负载 | 8 线 · 18×100k | 851 ms | 878 ms | **1.03×** |
-| 20 | 内存压力 | 8 线 · 2000×500 | 774 ms | 1144 ms | **1.48×** |
-| | **几何平均** | | | | **≈ 1.58×** |
-
-**结论：** 全部 25 项 TaskflowLite 均快于 Taskflow，最大优势在管线（07，2.84×）、网格（08，2.43×）、扫描链（17，2.43×）等依赖链密集的拓扑。异构负载（19，1.03×）接近持平——该场景含大量计算，调度开销占比低，恰好说明 tfl 的优势集中在调度本身。
+| 01 | 32 并行 | 8 线 · 500k | 893 ms | 1321 ms | **1.48×** |
+| 02 | 32 串行 | 1 线 · 1M | 483 ms | 1223 ms | **2.53×** |
+| 03 | 菱形 DAG | 2 线 · 1M | 219 ms | 357 ms | **1.63×** |
+| 04a | 4×2 全连接 | 2 线 · 1M | 429 ms | 611 ms | **1.42×** |
+| 04b | 6×4 全连接 | 4 线 · 500k | 1435 ms | 1779 ms | **1.24×** |
+| 04c | 8×8 全连接 | 8 线 · 100k | 933 ms | 1258 ms | **1.35×** |
+| 04d | 8×16 全连接 | 8 线 · 50k | 1080 ms | 1496 ms | **1.39×** |
+| 04e | 8×32 全连接 | 8 线 · 20k | 1115 ms | 1627 ms | **1.46×** |
+| 04f | 6×100 全连接 | 8 线 · 2k | 548 ms | 715 ms | **1.30×** |
+| 05 | 二叉归约树 | 8 线 · 500k | 1349 ms | 2980 ms | **2.21×** |
+| 06 | 1→256→1 扇出 | 8 线 · 100k | 3181 ms | 4096 ms | **1.29×** |
+| 07 | 16 条管线 | 8 线 · 200k | 389 ms | 2452 ms | **6.30×** |
+| 08 | 16×16 网格 | 8 线 · 100k | 653 ms | 2722 ms | **4.17×** |
+| 09 | 稀疏 DAG | 8 线 · 500k | 1815 ms | 3799 ms | **2.09×** |
+| 10 | Jump 循环 | 1 线 · 1M | 25 ms | 50 ms | **2.00×** |
+| 11 | MultiJump 循环 | 4 线 · 200k | 49 ms | 75 ms | **1.53×** |
+| 12 | Subflow 单次 | 4 线 · 200k | 130 ms | 183 ms | **1.41×** |
+| 13 | Subflow 循环 | 2 线 · 500k | 94 ms | 159 ms | **1.69×** |
+| 14 | 空任务 | 1 线 · 10M | 406 ms | 633 ms | **1.56×** |
+| 15 | 并行 for | 8 线 · 1024×10k | 580 ms | 1159 ms | **2.00×** |
+| 16 | 归约树 | 8 线 · 127×50k | 346 ms | 693 ms | **2.00×** |
+| 17 | 扫描链 | 1 线 · 128×100k | 170 ms | 488 ms | **2.87×** |
+| 18 | 三角波前 | 8 线 · 210×10k | 68 ms | 236 ms | **3.47×** |
+| 19 | 异构负载 | 8 线 · 18×100k | 746 ms | 873 ms | **1.17×** |
+| 20 | 内存压力 | 8 线 · 2000×500 | 786 ms | 1115 ms | **1.42×** |
+| | **几何平均** | | | | **≈ 1.85×** |
+
+**结论：** 全部 25 项 TaskflowLite 均快于 Taskflow，几何平均约 **1.85×**。因本组用空任务体，
+测的是纯调度开销，比值普遍高于含实际负载的场景——负载越重，两边共担的计算占比越大，
+比值越向 1.0 收敛。优势最大的是依赖链密集的拓扑：管线（07，6.30×）、网格（08，4.17×）、
+三角波前（18，3.47×）、扫描链（17，2.87×）；最接近的是异构负载（19，1.17×）。
 
 > 完整 benchmark 代码见 [benchmarks/](benchmarks/) 目录，可在目标机器上自行复跑。
+> 表中数字为空任务体（纯调度开销）；带原子累加的正确性用例另见 benchmark 源码。
 
 ---
 
@@ -566,4 +584,4 @@ cmake --build build --config Release
 
 [MIT License](LICENSE)
 
-*TaskflowLite — 为追求极致性能与现代 C++ 审美的开发者而生。*
+*TaskflowLite — 为追求极致性能与现代 C++ 审美的开发者而生。*
diff --git a/benchmarks/bench_taskflow.cpp b/benchmarks/bench_taskflow.cpp
@@ -9,7 +9,7 @@
 #include <vector>
 
 static std::atomic<int> g_counter{0};
-static void add_one() { g_counter.fetch_add(1, std::memory_order_relaxed); }
+static void add_one() { /* intentionally empty to measure pure scheduling overhead; was: g_counter.fetch_add(1, std::memory_order_relaxed); */ }
 
 class Timer {
 public:
diff --git a/benchmarks/bench_taskflowlite.cpp b/benchmarks/bench_taskflowlite.cpp
@@ -9,7 +9,7 @@
 #include <vector>
 
 static std::atomic<int> g_counter{0};
-static void add_one() { g_counter.fetch_add(1, std::memory_order_relaxed); }
+static void add_one() { /* intentionally empty to measure pure scheduling overhead; was: g_counter.fetch_add(1, std::memory_order_relaxed); */ }
 
 class Timer {
 public: