optimize CPU inference with Array-Based Tree Traversal#11519
Conversation
|
Thank you for the optimization on the inference. Please unmark the "draft" status and ping me when the PR is ready for testing. |
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
…rdin/xgboost into dev/cpu/eytzinger_layout
Vika-F
left a comment
There was a problem hiding this comment.
Cosmetic changes.
The next possible step would be to convert the trees into array-based representation only once, and not to do it for each block of data.
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
I added some unit-tests. |
trivialfis
left a comment
There was a problem hiding this comment.
I'm still trying to understand the code, in the meantime, let me do some refactoring in this and the next week to accommodate the new optimization. We need a better structure to handle all these:
- Predict with scalar leaf.
- Predict with vector leaf.
- Array predict with scalar leaf.
- Array predict with vector leaf.
- Column split with scalar leaf.
I think I will split up the CPU predictor into multiple pieces.
| */ | ||
| std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_; | ||
|
|
||
| static bool IsLeaf(const RegTree& tree, bst_node_t nidx) { |
There was a problem hiding this comment.
Is there a benefit of doing this C++ overloading rather than the simpler tree.IsLeaf? How much faster are we seeing?
There was a problem hiding this comment.
I did the overload to handle both RegTree and MultiTargetTree cases. Is there a better option?
There was a problem hiding this comment.
Use RegTree without extracting the Multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using the RegTree::Node::LeftChild. There's a check inside the RegTree::LeftChild:
[[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
if (IsMultiTarget()) {
return this->p_mt_tree_->LeftChild(nidx);
}
return (*this)[nidx].LeftChild();
}Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
|
I'm trying to cleanup the CPU predictor. I will update this PR once it is finished. |
|
I need to fix a perf regression caused by the new ordinal encoder. |
This has been fixed. I will look deeper into this PR. |
| */ | ||
| std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_; | ||
|
|
||
| static bool IsLeaf(const RegTree& tree, bst_node_t nidx) { |
There was a problem hiding this comment.
Use RegTree without extracting the Multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using the RegTree::Node::LeftChild. There's a check inside the RegTree::LeftChild:
[[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
if (IsMultiTarget()) {
return this->p_mt_tree_->LeftChild(nidx);
}
return (*this)[nidx].LeftChild();
}|
Thank you for expanding the tree layout. In the future (when you can prioritize it), do you think it's possible to create and store the layout inside the
You can define a |
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Do you think memory overhead (about 1KB per tree) is acceptable for storing the layout? If so, it would be the natural next optimization step. |
I think this should be fine since the size of the layout is bound by depth. The implementation here falls back to the original tree after certain level is reached. |
Can we merge the current implementation and postpone buffering of the layout? |
We can. I will look into this PR. |
trivialfis
left a comment
There was a problem hiding this comment.
Thank you for the excellent optimization!
I can understand the code (mostly), and it should be cleaner after merging into the regtree. I will merge this PR once the CI is green.
|
Revisiting this thread since I have been looking into the predictor. I think fast inference should be left to a dedicated library, or at least a dedicated class in XGBoost. There are too many potential memory layouts with different tradeoff. Also, we need a better way to ensure thread safety and thread-safe memory allocation. The booster is not the right place to do it as it's mutable and contains too many unneeded data for inference. This note is mostly for future reference if there's a plan to continue the optimization. (No change request) |
|
That said, optimizations on inference are welcome. Just mentioning that if more memory layout or cache is on the pipeline, consider creating something outside of the booster class. |
This PR introduces optimization for CPU inference. For each tree, the top N levels are transformed into a compact array-based layout. This allows for a branchless node indexing rule: idx = 2 * idx + int(val < split_cond). To minimize memory overhead, this transformation from the standard tree structure to the array layout is performed on-the-fly for each block of data being processed. Even with this additional calculations, improved data locality in the cache-friendly array layout leads to inference speed up to ~2x (x1.4 on average).
