[NFC] cache repeated tree walks to avoid O(N^2) in optimizeTerminatingTails in CodeFolding (#8602)

Changqing-JING · web-flow · commit e9b4b4c45684 · 2026-05-08T10:30:57.000-07:00
Cache the result of getBranchTargets(getFunction()->body) in optimizeTerminatingTails so that recursive calls share one computed set instead of each re-walking the entire function body. This avoids O(N²) behavior, where N is the size of the function body: previously each recursive call performed its own O(N) tree walk. The cached targets are computed lazily on first need and passed through to the canMove overload that accepts pre-computed branch targets.

## Benchmark data

For the test case in #7319 (comment)

Main head:

```shell
time ./build/bin/wasm-opt --code-folding --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm

real 5m45.996s
user 6m6.267s
sys  0m3.798s
```

This PR:

```shell
time ./build/bin/wasm-opt --code-folding --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm

real 2m2.380s
user 2m25.700s
sys  0m2.449s
```

## Benchmark regression test

Test case: https://jetbrains.github.io/kotlinconf-app/73cbe24d7cf5a54d37ad.wasm

On main:

```shell
 Performance counter stats for 'build/bin/wasm-opt 73cbe24d7cf5a54d37ad.wasm -all --code-folding -o /dev/null' (10 runs):

     4837936912  task-clock        #  1.445 CPUs utilized   ( +-  0.51% )
            114  context-switches  # 23.564 /sec            ( +-  7.58% )
              7  cpu-migrations    #  1.447 /sec            ( +- 16.88% )
          46271  page-faults       #  9.564 K/sec           ( +-  0.00% )
    13431328103  instructions      #  1.21 insn per cycle   ( +-  0.01% )
    11125222873  cycles            #  2.300 GHz             ( +-  0.51% )
       64641504  branch-misses                              ( +-  1.26% )

         3.3484 +- 0.0221 seconds time elapsed  ( +- 0.66% )
```

On current PR:

```shell
 Performance counter stats for 'build/bin/wasm-opt 73cbe24d7cf5a54d37ad.wasm -all --code-folding -o /dev/null' (10 runs):

     4802304211  task-clock        #  1.437 CPUs utilized   ( +-  0.47% )
            125  context-switches  # 26.029 /sec            ( +-  6.50% )
              8  cpu-migrations    #  1.666 /sec            ( +- 14.20% )
          46272  page-faults       #  9.635 K/sec           ( +-  0.00% )
    13391520427  instructions      #  1.21 insn per cycle   ( +-  0.01% )
    11043221889  cycles            #  2.300 GHz             ( +-  0.47% )
       59021679  branch-misses                              ( +-  1.24% )

         3.3427 +- 0.0207 seconds time elapsed  ( +- 0.62% )
```
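
The lazy-cache pattern described in the commit message can be illustrated in isolation. The following is a minimal sketch, not Binaryen code: `Walker`, `computeTargets`, `recurseUncached`, `recurseCached`, and the two `walks...` helpers are hypothetical names, and the "walk" is only simulated by a counter. It assumes one recursion level per call, which is a simplification of the real pass.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for BranchUtils::NameSet.
using NameSet = std::set<std::string>;

struct Walker {
  int walks = 0; // how many simulated O(N) body walks were performed

  // Stand-in for BranchUtils::getBranchTargets(getFunction()->body).
  NameSet computeTargets() {
    ++walks;
    return {"outer", "inner"};
  }

  // Before: every recursion level walks the whole body again.
  void recurseUncached(int depth) {
    NameSet targets = computeTargets();
    (void)targets; // a real pass would consult the set here
    if (depth > 0) {
      recurseUncached(depth - 1);
    }
  }

  // After: the first frame that needs to recurse computes the set into its
  // own local storage, and deeper frames reuse it through the pointer.
  void recurseCached(int depth, NameSet* cache = nullptr) {
    NameSet local; // declared up front so it outlives the pointer to it
    if (depth == 0) {
      return; // nothing deeper to explore, so nothing is computed
    }
    if (!cache) {
      local = computeTargets();
      cache = &local;
    }
    assert(cache);
    recurseCached(depth - 1, cache);
  }
};

// Helpers that report the number of simulated walks for a given depth.
int walksUncached(int depth) {
  Walker w;
  w.recurseUncached(depth);
  return w.walks;
}

int walksCached(int depth) {
  Walker w;
  w.recurseCached(depth);
  return w.walks;
}
```

In this sketch the uncached variant performs one walk per level, while the cached variant performs at most one walk in total, and none if it never recurses. Keeping the storage in the frame that first computes the set means no member variable has to be reset between invocations.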
diff --git a/src/passes/CodeFolding.cpp b/src/passes/CodeFolding.cpp
@@ -398,7 +398,14 @@ struct CodeFolding
   // if one of the items has a branch to something inside outOf that is not
   // inside that item
   bool canMove(const std::vector<Expression*>& items, Expression* outOf) {
-    auto allTargets = BranchUtils::getBranchTargets(outOf);
+    return canMove(items, outOf, BranchUtils::getBranchTargets(outOf));
+  }
+
+  // Overload that accepts pre-computed branch targets to avoid redundant
+  // O(N) getBranchTargets calls.
+  bool canMove(const std::vector<Expression*>& items,
+               Expression* outOf,
+               const BranchUtils::NameSet& allTargets) {
     for (auto* item : items) {
       auto exiting = BranchUtils::getExitingBranches(item);
       std::vector<Name> intersection;
@@ -632,11 +639,18 @@ struct CodeFolding
   // we are just starting; num > 0 means that tails is guaranteed to be
   // equal in the last num items, so we can merge there, but we look for
   // deeper merges first.
+  // bodyTargets is lazily computed on first need and then passed to recursive
+  // calls to avoid repeated O(N) getBranchTargets walks over the function body.
   // returns whether we optimized something.
-  bool optimizeTerminatingTails(std::vector<Tail>& tails, Index num = 0) {
+  bool optimizeTerminatingTails(std::vector<Tail>& tails,
+                                Index num = 0,
+                                BranchUtils::NameSet* bodyTargets = nullptr) {
     if (tails.size() < 2) {
       return false;
     }
+    // Storage for body branch targets, declared here so it outlives the
+    // pointer stored in bodyTargets.
+    BranchUtils::NameSet localBodyTargets;
     // remove things that are untoward and cannot be optimized
     tails.erase(
       std::remove_if(tails.begin(),
@@ -697,9 +711,11 @@ struct CodeFolding
       // can be removed, though
       cost += WORTH_ADDING_BLOCK_TO_REMOVE_THIS_MUCH;
       // if we cannot merge to the end, then we definitely need 2 blocks,
-      // and a branch
-      // TODO: efficiency, entire body
-      if (!canMove(items, getFunction()->body)) {
+      // and a branch. Use the pre-computed bodyTargets to avoid repeated
+      // O(N) getBranchTargets calls.
+      assert(bodyTargets);
+      bool canMoveItems = canMove(items, getFunction()->body, *bodyTargets);
+      if (!canMoveItems) {
         cost += 1 + WORTH_ADDING_BLOCK_TO_REMOVE_THIS_MUCH;
         // TODO: to do this, we need to maintain a map of element=>parent,
         //       so that we can insert the new blocks in the right place
@@ -795,7 +811,14 @@ struct CodeFolding
             // as the changes may influence us. we leave further opts to further
             // passes (as this is rare in practice, it's generally not a perf
             // issue, but TODO optimize)
-            if (optimizeTerminatingTails(explore, num + 1)) {
+            // Compute body branch targets once and share across recursive
+            // calls to avoid repeated O(N) tree walks.
+            if (!bodyTargets) {
+              localBodyTargets =
+                BranchUtils::getBranchTargets(getFunction()->body);
+              bodyTargets = &localBodyTargets;
+            }
+            if (optimizeTerminatingTails(explore, num + 1, bodyTargets)) {
               return true;
             }
           }
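
The delegating `canMove` overload from the first hunk can be sketched outside Binaryen as follows. This is an illustration under stated assumptions: `Item` and `Scope` are hypothetical types standing in for `Expression*`, names are plain strings, and the check is assumed to reject a move whenever a name that an item branches out to is also a branch target inside `outOf`, as the comment above `canMove` describes. The part of the loop body that the hunk does not show is filled in here with `std::set_intersection` purely for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for BranchUtils::NameSet.
using NameSet = std::set<std::string>;

// Hypothetical item: records only the labels it branches to outside itself.
struct Item {
  NameSet exiting;
};

// Hypothetical enclosing scope: records only the branch targets it defines.
struct Scope {
  NameSet targets;
};

// Overload taking pre-computed targets, so callers that already hold the
// set do not trigger another walk.
bool canMove(const std::vector<Item>& items, const NameSet& allTargets) {
  for (const auto& item : items) {
    std::vector<std::string> intersection;
    // Both ranges come from std::set, so they are sorted as required.
    std::set_intersection(allTargets.begin(),
                          allTargets.end(),
                          item.exiting.begin(),
                          item.exiting.end(),
                          std::back_inserter(intersection));
    if (!intersection.empty()) {
      return false; // the item branches to something inside the scope
    }
  }
  return true;
}

// Convenience overload: obtains the targets itself and delegates.
bool canMove(const std::vector<Item>& items, const Scope& outOf) {
  return canMove(items, outOf.targets);
}
```

Keeping the original two-argument form as a thin wrapper means existing callers are unchanged, while hot paths can pass a set they computed once.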