@@ -385,16 +385,28 @@ Paste the output into the [D2 Playground](https://play.d2lang.com) to render. Le
 
 ### Requirements
 
-| Compiler | Minimum Version |
-|----------|----------------|
-| GCC | 12+ |
-| Clang | 15+ |
-| MSVC | 2022+ (17.0+) |
-| Apple Clang | 15+ (Xcode 15+) |
+| Compiler | Minimum Version | Notes |
+|----------|----------------|-------|
+| GCC | 12+ | libstdc++ provides `std::stop_token` |
+| Clang | 15+ | requires libstdc++ 11+ or libc++ 18+ |
+| MSVC | 2022+ (17.0+) | |
+| Apple Clang | ⚠️ Not supported | Apple's libc++ does not implement `std::stop_token` / `std::jthread` (P0660); use Homebrew LLVM on macOS |
 
 - **C++ Standard**: C++23
 - **CMake**: 3.21+
-- **Dependencies**: C++ standard library + pthread (Unix)
+- **Dependencies**: C++ standard library (must provide `<stop_token>`) + pthread (Unix)
+
+> **macOS note:** This library's cancellation mechanism relies on `std::stop_token`,
+> which Apple's bundled Apple Clang / libc++ still does not provide, so it cannot be
+> compiled with the default macOS toolchain. Install Homebrew LLVM and point CMake at it:
+>
+> ```bash
+> brew install llvm
+> cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
+> -DCMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
+> -DCMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++"
+> cmake --build build --parallel
+> ```
 
 ### CMake
 
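The Requirements hunk above hinges on whether the standard library ships `<stop_token>`. As a rough illustration, a compile-time probe could look like the sketch below. It assumes the standard feature-test macro `__cpp_lib_jthread` (which covers both `std::stop_token` and `std::jthread`) is an acceptable signal and that the code is compiled as C++20 or later. The names `kHasStopToken` and `stop_token_support` are hypothetical and are not part of TaskflowLite.

```cpp
#include <string>

// <version> (C++20) carries the library feature-test macros. Probe for it
// first so the snippet still compiles against older standard libraries.
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>
#  endif
#endif

// __cpp_lib_jthread is the feature-test macro for std::stop_token and
// std::jthread (P0660). kHasStopToken is a hypothetical name.
#if defined(__cpp_lib_jthread) && __cpp_lib_jthread >= 201911L
constexpr bool kHasStopToken = true;
#else
constexpr bool kHasStopToken = false;
#endif

// Hypothetical helper: a human-readable verdict, e.g. for a build log.
inline std::string stop_token_support() {
    return kHasStopToken ? "available" : "missing";
}
```

A project that wanted a hard failure instead of a report could place `static_assert(kHasStopToken, "...")` next to the probe. Whether TaskflowLite does anything like this is not stated in the diff.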
@@ -470,7 +482,7 @@ taskflowlite/
 ├── test/ # 23 test files (Catch2 v3)
 ├── examples/ # 25 examples
 ├── benchmarks/ # Performance comparison (vs Taskflow)
-├── .github/workflows/ci.yml # CI matrix
+├── .github/workflows/ # CI: ubuntu/windows/macos build+test + codeql + lint (ci.yml)
 ├── CMakeLists.txt
 ├── LICENSE (MIT)
 └── README.md
@@ -481,39 +493,50 @@ taskflowlite/
 ## Benchmark Results
 
 TaskflowLite vs Taskflow — **same hardware, threads, topology, and total iterations**.
+Task bodies are **empty function calls**, so the figures measure **pure scheduling
+overhead** (excluding per-node computation/atomics); correctness is verified separately
+by a counter-based suite.
 
 **Test Environment:** Intel Core i7-9750H @ 2.60GHz (6C/12T), Windows 11, MSVC 2022 /O2
 
 | # | Scenario | Config | TaskflowLite | Taskflow | Speedup |
 |--:|---------|------|--------:|------------:|------:|
-| 01 | 32 parallel | 8 thr · 500k | 1009 ms | 1479 ms | **1.47×** |
-| 02 | 32 serial | 1 thr · 1M | 662 ms | 1323 ms | **2.00×** |
-| 03 | Diamond DAG | 2 thr · 1M | 255 ms | 400 ms | **1.57×** |
-| 04a | 4×2 full | 2 thr · 1M | 504 ms | 663 ms | **1.32×** |
-| 04b | 6×4 full | 4 thr · 500k | 1737 ms | 1964 ms | **1.13×** |
-| 04c | 8×8 full | 8 thr · 100k | 1076 ms | 1309 ms | **1.22×** |
-| 04d | 8×16 full | 8 thr · 50k | 1250 ms | 1531 ms | **1.22×** |
-| 04e | 8×32 full | 8 thr · 20k | 1210 ms | 1795 ms | **1.48×** |
-| 04f | 6×100 full | 8 thr · 2k | 516 ms | 778 ms | **1.51×** |
-| 05 | Binary tree | 8 thr · 500k | 1969 ms | 3278 ms | **1.66×** |
-| 06 | 1→256→1 fan | 8 thr · 100k | 3395 ms | 4167 ms | **1.23×** |
-| 07 | 16 pipelines | 8 thr · 200k | 911 ms | 2591 ms | **2.84×** |
-| 08 | 16×16 grid | 8 thr · 100k | 1228 ms | 2978 ms | **2.43×** |
-| 09 | Sparse DAG | 8 thr · 500k | 2508 ms | 4042 ms | **1.61×** |
-| 10 | Jump loop | 1 thr · 1M | 30 ms | 53 ms | **1.77×** |
-| 11 | MultiJump loop | 4 thr · 200k | 58 ms | 82 ms | **1.41×** |
-| 12 | Subflow once | 4 thr · 200k | 160 ms | 210 ms | **1.31×** |
-| 13 | Subflow loop | 2 thr · 500k | 105 ms | 168 ms | **1.60×** |
-| 14 | Empty task | 1 thr · 10M | 473 ms | 642 ms | **1.36×** |
-| 15 | Parallel for | 8 thr · 1024×10k | 734 ms | 1221 ms | **1.66×** |
-| 16 | Reduce tree | 8 thr · 127×50k | 465 ms | 828 ms | **1.78×** |
-| 17 | Scan chain | 1 thr · 128×100k | 235 ms | 570 ms | **2.43×** |
-| 18 | Wavefront | 8 thr · 210×10k | 115 ms | 262 ms | **2.28×** |
-| 19 | Heterogeneous | 8 thr · 18×100k | 851 ms | 878 ms | **1.03×** |
-| 20 | Memory stress | 8 thr · 2000×500 | 774 ms | 1144 ms | **1.48×** |
-| | **Geometric mean** | | | | **≈ 1.58×** |
+| 01 | 32 parallel | 8 thr · 500k | 893 ms | 1321 ms | **1.48×** |
+| 02 | 32 serial | 1 thr · 1M | 483 ms | 1223 ms | **2.53×** |
+| 03 | Diamond DAG | 2 thr · 1M | 219 ms | 357 ms | **1.63×** |
+| 04a | 4×2 full | 2 thr · 1M | 429 ms | 611 ms | **1.42×** |
+| 04b | 6×4 full | 4 thr · 500k | 1435 ms | 1779 ms | **1.24×** |
+| 04c | 8×8 full | 8 thr · 100k | 933 ms | 1258 ms | **1.35×** |
+| 04d | 8×16 full | 8 thr · 50k | 1080 ms | 1496 ms | **1.39×** |
+| 04e | 8×32 full | 8 thr · 20k | 1115 ms | 1627 ms | **1.46×** |
+| 04f | 6×100 full | 8 thr · 2k | 548 ms | 715 ms | **1.30×** |
+| 05 | Binary tree | 8 thr · 500k | 1349 ms | 2980 ms | **2.21×** |
+| 06 | 1→256→1 fan | 8 thr · 100k | 3181 ms | 4096 ms | **1.29×** |
+| 07 | 16 pipelines | 8 thr · 200k | 389 ms | 2452 ms | **6.30×** |
+| 08 | 16×16 grid | 8 thr · 100k | 653 ms | 2722 ms | **4.17×** |
+| 09 | Sparse DAG | 8 thr · 500k | 1815 ms | 3799 ms | **2.09×** |
+| 10 | Jump loop | 1 thr · 1M | 25 ms | 50 ms | **2.00×** |
+| 11 | MultiJump loop | 4 thr · 200k | 49 ms | 75 ms | **1.53×** |
+| 12 | Subflow once | 4 thr · 200k | 130 ms | 183 ms | **1.41×** |
+| 13 | Subflow loop | 2 thr · 500k | 94 ms | 159 ms | **1.69×** |
+| 14 | Empty task | 1 thr · 10M | 406 ms | 633 ms | **1.56×** |
+| 15 | Parallel for | 8 thr · 1024×10k | 580 ms | 1159 ms | **2.00×** |
+| 16 | Reduce tree | 8 thr · 127×50k | 346 ms | 693 ms | **2.00×** |
+| 17 | Scan chain | 1 thr · 128×100k | 170 ms | 488 ms | **2.87×** |
+| 18 | Wavefront | 8 thr · 210×10k | 68 ms | 236 ms | **3.47×** |
+| 19 | Heterogeneous | 8 thr · 18×100k | 746 ms | 873 ms | **1.17×** |
+| 20 | Memory stress | 8 thr · 2000×500 | 786 ms | 1115 ms | **1.42×** |
+| | **Geometric mean** | | | | **≈ 1.85×** |
+
+**Summary:** All 25 scenarios favor TaskflowLite, geometric mean ≈ **1.85×**. Because task
+bodies are empty, these figures measure pure scheduling overhead — ratios run higher than
+workload-heavy runs, where shared per-task computation dilutes the ratio toward 1.0. The
+largest gains are in dependency-dense topologies: pipelines (07, 6.30×), grid (08, 4.17×),
+wavefront (18, 3.47×), scan chain (17, 2.87×); the closest is heterogeneous load (19, 1.17×).
 
 > Full benchmark source code in [benchmarks/](benchmarks/).
+> Figures are for empty task bodies (pure scheduling overhead); the atomic-counter
+> correctness suite is in the benchmark source.
 
 ---
 
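The "Geometric mean" row in the benchmark hunk above can be checked by hand. The sketch below assumes the figure is the plain geometric mean of the 25 per-scenario speedups in the updated table. The diff does not say how the authors computed it, so the helper names `geometric_mean` and `updated_speedups` are hypothetical. Under that assumption the updated column evaluates to roughly 1.85, which agrees with the stated value.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean computed as exp(mean(log(x))), which avoids multiplying
// many ratios into one large product. All inputs must be positive.
inline double geometric_mean(const std::vector<double>& ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        assert(r > 0.0);
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}

// Speedup column of the updated table, scenarios 01 through 20
// (25 rows, because scenario 04 has six variants a-f).
inline std::vector<double> updated_speedups() {
    return {1.48, 2.53, 1.63, 1.42, 1.24, 1.35, 1.39, 1.46, 1.30,
            2.21, 1.29, 6.30, 4.17, 2.09, 2.00, 1.53, 1.41, 1.69,
            1.56, 2.00, 2.00, 2.87, 3.47, 1.17, 1.42};
}
```

Calling `geometric_mean(updated_speedups())` gives the table's summary figure. A geometric mean is the usual choice for averaging ratios, since a 2× gain and a 0.5× loss cancel out to 1.0.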
@@ -567,4 +590,4 @@ cmake --build build --config Release
 
 [MIT License](LICENSE)
 
-*TaskflowLite — built for developers who demand extreme performance and modern C++ aesthetics.*
+*TaskflowLite — built for developers who demand extreme performance and modern C++ aesthetics.*