optimize CPU inference with Array-Based Tree Traversal by razdoburdin · Pull Request #11519 · dmlc/xgboost

razdoburdin · 2025-06-20T13:50:25Z

This PR introduces an optimization for CPU inference. For each tree, the top N levels are transformed into a compact array-based layout. This allows for a branchless node indexing rule: idx = 2 * idx + int(val < split_cond). To minimize memory overhead, the transformation from the standard tree structure to the array layout is performed on the fly for each block of data being processed. Even with these additional calculations, the improved data locality of the cache-friendly array layout speeds up inference by up to ~2x (1.4x on average).
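The indexing rule can be illustrated with a minimal sketch. This is not the code from the PR: `ArrayLevels`, `Descend`, `MakeExampleTree`, `kDepth`, and `kSlots` are hypothetical names, the root is assumed to sit at index 1 so that the children of node i are 2 * i and 2 * i + 1, and missing values and categorical splits are ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the XGBoost implementation.
constexpr int kDepth = 2;                                  // levels kept in array form
constexpr std::size_t kSlots = std::size_t{1} << kDepth;   // slot 0 unused, nodes 1..kSlots-1

struct ArrayLevels {
  std::array<int, kSlots> split_index{};   // feature tested at each node
  std::array<float, kSlots> split_cond{};  // threshold at each node
  std::array<float, kSlots> leaf_value{};  // value for exit slot j (node kSlots + j)
};

// Walk kDepth levels using idx = 2 * idx + int(val < split_cond).
// There is no data-dependent branch inside the loop body.
inline std::size_t Descend(const ArrayLevels& t, const float* row) {
  std::size_t idx = 1;  // root
  for (int level = 0; level < kDepth; ++level) {
    const float val = row[t.split_index[idx]];
    idx = 2 * idx + static_cast<std::size_t>(val < t.split_cond[idx]);
  }
  assert(idx >= kSlots && idx < 2 * kSlots);
  return idx - kSlots;  // position within the exit level
}

// A two-level example tree over two features.
inline ArrayLevels MakeExampleTree() {
  ArrayLevels t;
  t.split_index = {0, 0, 1, 1};             // node 1 tests feature 0, nodes 2 and 3 test feature 1
  t.split_cond = {0.0f, 0.5f, 2.0f, 1.0f};
  t.leaf_value = {10.0f, 20.0f, 30.0f, 40.0f};
  return t;
}
```

With this convention a row for which `val < split_cond` holds moves to slot 2 * idx + 1, so the "less than" child is stored in the odd slot. Calling `Descend(MakeExampleTree(), row)` returns the exit slot, and `leaf_value` at that slot is the prediction for this toy tree.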

trivialfis · 2025-06-21T01:33:21Z

Thank you for the optimization on the inference. Please unmark the "draft" status and ping me when the PR is ready for testing.

Vika-F

Cosmetic changes.

The next possible step would be to convert the trees into array-based representation only once, and not to do it for each block of data.

razdoburdin · 2025-07-21T12:29:58Z

> Still trying to understand the code, will give it a try later. In the meantime, could you please craft some specific unit tests for the new inference algorithm?

I added some unit tests.

trivialfis

I'm still trying to understand the code. In the meantime, let me do some refactoring this week and next to accommodate the new optimization. We need a better structure to handle all of these cases:

Predict with scalar leaf.
Predict with vector leaf.
Array predict with scalar leaf.
Array predict with vector leaf.
Column split with scalar leaf.

I think I will split up the CPU predictor into multiple pieces.

trivialfis · 2025-07-31T18:09:34Z

+   */
+  std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_;
+
+  static bool IsLeaf(const RegTree& tree, bst_node_t nidx) {


Is there a benefit to doing this C++ overloading rather than using the simpler tree.IsLeaf? How much of a speedup are we seeing?

I did the overload to handle both RegTree and MultiTargetTree cases. Is there a better option?

Use RegTree without extracting the Multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using the RegTree::Node::LeftChild. There's a check inside the RegTree::LeftChild:

[[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
  if (IsMultiTarget()) {
    return this->p_mt_tree_->LeftChild(nidx);
  }
  return (*this)[nidx].LeftChild();
}

trivialfis · 2025-08-05T19:17:25Z

I'm trying to clean up the CPU predictor. I will update this PR once it is finished.

trivialfis · 2025-08-07T18:54:04Z

I need to fix a perf regression caused by the new ordinal encoder.

trivialfis · 2025-08-20T20:57:50Z

> I need to fix a perf regression caused by the new ordinal encoder.

This has been fixed. I will look deeper into this PR.

trivialfis · 2025-08-20T21:14:46Z

Use RegTree without extracting the Multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using the RegTree::Node::LeftChild. There's a check inside the RegTree::LeftChild:

[[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
  if (IsMultiTarget()) {
    return this->p_mt_tree_->LeftChild(nidx);
  }
  return (*this)[nidx].LeftChild();
}

trivialfis · 2025-08-20T21:28:50Z

Thank you for expanding the tree layout. In the future (when you can prioritize it), do you think it's possible to create and store the layout inside the RegTree structure as an opt-in method call? My reasoning is as follows:

The existing RegTree and the multi-target tree already use a very similar layout, minus the dummy nodes. It might be easier/cleaner to do it there.
We can avoid complicating the predictor too much.
We can cache the result in the regtree structure to avoid repeated initialization.

You can define a std::unique_ptr<ArrayTree> inside the RegTree, set it to nullptr. Define a method to create the array tree when needed, and reset it back to nullptr if any non-const method is called.
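A minimal sketch of that caching pattern follows. It is an illustration under stated assumptions, not XGBoost code: `Tree`, `ArrayTree`, `GetArrayTree`, `SetSplitCond`, `HasArrayTree`, and `NumBuilds` are hypothetical names, and the "layout" is just a copy of the split conditions standing in for the real array representation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the compact array layout.
struct ArrayTree {
  std::vector<float> split_cond;
};

// Hypothetical stand-in for a tree class that caches the layout.
class Tree {
 public:
  // A non-const mutation invalidates the cached layout.
  void SetSplitCond(std::size_t nidx, float cond) {
    if (nidx >= split_cond_.size()) {
      split_cond_.resize(nidx + 1);
    }
    split_cond_[nidx] = cond;
    array_tree_.reset();  // back to nullptr
  }

  // Opt-in call: build the layout on first request, reuse it afterwards.
  const ArrayTree& GetArrayTree() {
    if (!array_tree_) {
      array_tree_ = std::make_unique<ArrayTree>(ArrayTree{split_cond_});
      ++n_builds_;
    }
    return *array_tree_;
  }

  bool HasArrayTree() const { return array_tree_ != nullptr; }
  int NumBuilds() const { return n_builds_; }

 private:
  std::vector<float> split_cond_;
  std::unique_ptr<ArrayTree> array_tree_;  // nullptr until requested
  int n_builds_{0};                        // only for demonstrating the cache
};
```

Note that the lazy build in this sketch is not thread-safe: two threads calling `GetArrayTree` at the same time could both construct the layout, so a real implementation would need synchronization or an explicit build step before parallel prediction. References returned by `GetArrayTree` also become invalid after any mutating call.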

razdoburdin · 2025-09-04T06:38:20Z

> Thank you for expanding the tree layout. In the future (when you can prioritize it), do you think it's possible to create and store the layout inside the RegTree structure as an opt-in method call? My reasoning is as follows:
>
> The existing RegTree and the multi-target tree already use a very similar layout, minus the dummy nodes. It might be easier/cleaner to do it there.
>
> We can avoid complicating the predictor too much.
>
> We can cache the result in the regtree structure to avoid repeated initialization.
>
> You can define a std::unique_ptr<ArrayTree> inside the RegTree, set it to nullptr. Define a method to create the array tree when needed, and reset it back to nullptr if any non-const method is called.

Do you think memory overhead (about 1KB per tree) is acceptable for storing the layout? If so, it would be the natural next optimization step.

trivialfis · 2025-09-04T10:57:37Z

> Do you think memory overhead (about 1KB per tree) is acceptable for storing the layout?

I think this should be fine since the size of the layout is bounded by the depth. The implementation here falls back to the original tree after a certain level is reached.
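The fallback can be sketched as follows. This is a hypothetical illustration rather than the PR's code: `Node`, `TopLevels`, `Fill`, `Predict`, and `exit_node` are invented names, the root is assumed to sit at slot 1, a leaf reached before the array depth is padded with dummy nodes that always lead back to the same leaf, and missing values are ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node-based tree; left == -1 marks a leaf.
struct Node {
  int left{-1};
  int right{-1};
  int feat{0};
  float cond{0.0f};
  float value{0.0f};
};

constexpr int kDepth = 2;                                 // levels stored in array form
constexpr std::size_t kSlots = std::size_t{1} << kDepth;  // slot 0 unused

struct TopLevels {
  std::array<int, kSlots> feat{};
  std::array<float, kSlots> cond{};
  std::array<int, kSlots> exit_node{};  // original node id to resume from, per exit slot
};

// Fill slot `idx` (root = 1) from original node `nid`.
void Fill(const std::vector<Node>& tree, int nid, std::size_t idx, TopLevels* out) {
  if (idx >= kSlots) {  // below the array levels: remember where to resume
    out->exit_node[idx - kSlots] = nid;
    return;
  }
  const Node& n = tree[nid];
  const bool leaf = n.left < 0;
  // Dummy node for an early leaf: either comparison outcome leads to the same leaf.
  out->feat[idx] = leaf ? 0 : n.feat;
  out->cond[idx] = leaf ? 0.0f : n.cond;
  // Convention: val < cond selects slot 2 * idx + 1, which holds the left child.
  Fill(tree, leaf ? nid : n.right, 2 * idx, out);
  Fill(tree, leaf ? nid : n.left, 2 * idx + 1, out);
}

float Predict(const std::vector<Node>& tree, const TopLevels& top, const float* row) {
  std::size_t idx = 1;
  for (int level = 0; level < kDepth; ++level) {  // branchless part
    idx = 2 * idx + static_cast<std::size_t>(row[top.feat[idx]] < top.cond[idx]);
  }
  int nid = top.exit_node[idx - kSlots];
  while (tree[nid].left >= 0) {  // fallback: ordinary traversal of the original tree
    const Node& n = tree[nid];
    nid = row[n.feat] < n.cond ? n.left : n.right;
  }
  return tree[nid].value;
}
```

Because the array part has a fixed number of slots (`kSlots` here), its size depends only on the chosen depth and not on how deep the original tree grows, which is the sense in which the overhead is bounded.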

razdoburdin · 2025-09-05T16:08:37Z

> Do you think memory overhead (about 1KB per tree) is acceptable for storing the layout?
>
> I think this should be fine since the size of the layout is bounded by the depth. The implementation here falls back to the original tree after a certain level is reached.

Can we merge the current implementation and postpone buffering of the layout?

trivialfis · 2025-09-05T19:27:21Z

> Can we merge the current implementation and postpone buffering of the layout?

We can. I will look into this PR.

trivialfis

Thank you for the excellent optimization!

I can understand the code (mostly), and it should be cleaner after merging into the regtree. I will merge this PR once the CI is green.

trivialfis · 2025-10-16T18:45:33Z

Revisiting this thread since I have been looking into the predictor.

I think fast inference should be left to a dedicated library, or at least a dedicated class in XGBoost. There are too many potential memory layouts with different tradeoffs. Also, we need a better way to ensure thread safety and thread-safe memory allocation. The booster is not the right place to do it, as it's mutable and contains too much data that inference does not need.

This note is mostly for future reference if there's a plan to continue the optimization. (No change request)

trivialfis · 2025-10-16T19:42:37Z

That said, optimizations on inference are welcome. Just mentioning that if more memory layouts or caches are in the pipeline, consider creating something outside of the booster class.

Dmitry Razdoburdin and others added 12 commits May 28, 2025 04:53

basic implementation

e64e20c

optimisations

60c2ffe

fix compilation error

8f6dfe3

perf optimzation

bd13491

add categorial

3827a49

add multitarget

7334bd2

linting

8356855

perf

165b34a

fix perf

52eee0c

refactoring

cb28530

add comments

7ae3a42

more comments

2799644

razdoburdin marked this pull request as draft June 20, 2025 13:50

fix and tildy

a8bb91e

Vika-F reviewed Jun 23, 2025

Comment thread src/predictor/array_tree_layout.h Outdated

razdoburdin and others added 7 commits June 23, 2025 15:22

Update src/predictor/array_tree_layout.h

6d94176

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

add static assertions

e34becc

fix randome state usage in sycl training_continuation test

a2f2c75

Merge branch 'master' into dev/cpu/eytzinger_layout

2afad25

check if right child is valid

92ac69e

Merge branch 'dev/cpu/eytzinger_layout' of https://github.com/razdobu…

e2b0f05

…rdin/xgboost into dev/cpu/eytzinger_layout

use signed ints for node indxes

87bee15

Vika-F reviewed Jun 24, 2025

razdoburdin and others added 6 commits June 24, 2025 12:53

Update src/predictor/array_tree_layout.h

c3c1c85

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

Update src/predictor/array_tree_layout.h

d270ee7

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

Update src/predictor/array_tree_layout.h

2a7e575

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

Update src/predictor/array_tree_layout.h

3539ec0

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

Update src/predictor/array_tree_layout.h

709d233

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

Update src/predictor/array_tree_layout.h

40be7e2

Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>

lint

92b5069

trivialfis reviewed Jul 31, 2025

razdoburdin and others added 2 commits August 4, 2025 13:35

Update src/predictor/cpu_predictor.cc

b0eaa85

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>

Merge branch 'master' into dev/cpu/eytzinger_layout

790a98e

trivialfis reviewed Aug 20, 2025

razdoburdin and others added 2 commits September 3, 2025 17:01

Update src/predictor/array_tree_layout.h

89e56b7

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>

Inplace predict always use block.

2f88dce

Dmitry Razdoburdin added 7 commits September 5, 2025 06:06

Merge branch 'master' into dev/cpu/eytzinger_layout

bcbb223

merge master

bb322c6

clean up

32ed633

clean up

0269d3c

fix

13b2011

include <array>

6d26173

remove overloading

8b89b91

trivialfis added 3 commits September 10, 2025 07:56

Small cleanup.

db37a3c

Cleanup inline.

d7cf260

comments.

b8cd8c0

trivialfis approved these changes Sep 10, 2025

trivialfis merged commit 446e3b9 into dmlc:master Sep 10, 2025
82 of 84 checks passed
