Add `Booster.compute_leaf_similarity()` method (#11926)
Conversation
Compute similarity between observations based on leaf-node co-occurrence across trees, similar to Random Forest proximity matrices.
- Two weight types: 'gain' (default) and 'cover'
- Returns a similarity matrix with values in [0, 1]
- Self-similarity is 1.0

Closes dmlc#11919
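The co-occurrence idea can be illustrated with a small self-contained sketch (synthetic leaf indices and uniform tree weights; this is not the PR's implementation):

```python
import numpy as np

# Synthetic leaf assignments: rows = observations, columns = trees.
# Entry (i, t) is the index of the leaf that observation i reaches in tree t.
# In XGBoost such a matrix can be obtained with booster.predict(dmat, pred_leaf=True).
leaves = np.array([
    [2, 5, 1],
    [2, 5, 3],
    [4, 5, 1],
])

# Proximity: the fraction of trees in which two observations land in the
# same leaf (uniform weights; the PR generalizes this to per-tree weights).
similarity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

print(similarity)
# Diagonal is 1.0 (an observation always shares its own leaves),
# off-diagonal entries lie in [0, 1].
```

Observations 0 and 1 share leaves in 2 of 3 trees, so their proximity is 2/3; the matrix is symmetric by construction.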
Thank you for opening the PR. I will see if I can run some experiments with it and avoid the use of tree-to-dataframe (that method itself is quite difficult to extend).
note to myself:
@mfdel Could you please grant me permission to push to this branch for some code refactoring?
I am willing to further expand and optimize this method. The issues I observed during testing are that multi-target models are not currently supported and that a dependency on pandas is introduced. I hope to design a more suitable method for calculating tree weights, to achieve better compatibility and performance. One question: is it acceptable to add new C API interfaces to output the weights? I believe a minimal C API addition is necessary, because there is currently no interface that exposes the "weight" information of individual trees. With the new C API interface in place, the method should be compatible with the vast majority of configurations, only …
#12157 is the new PR I submitted. It includes a minimal C API and a faster similarity-calculation implementation that does not use tree-to-dataframe. I also conducted performance tests: a comparison with Random Forest proximities, and an evaluation of the various weighting methods.


Description
This PR adds a new method, `Booster.compute_leaf_similarity()`, to compute similarity between observations based on leaf-node co-occurrence across trees, similar to Random Forest proximity matrices.
Closes #11919
API
Parameters:
- `data` (DMatrix): query dataset (m samples)
- `reference` (DMatrix): reference dataset (n samples)
- `weight_type` (str): `"gain"` (default) or `"cover"`

Returns:
- `ndarray` of shape (m, n) with values in [0, 1]

Formula

similarity(i, j) = Σ_t w_t · 1[leaf_t(i) = leaf_t(j)] / Σ_t w_t

Where:
- leaf_t(x) is the leaf that observation x reaches in tree t
- w_t is the weight of tree t under the chosen `weight_type`
- 1[·] is the indicator function (1 if the leaves match, 0 otherwise)
Weight Types
Following the suggestion in #11919 to reuse feature importance definitions:
- `"gain"` (default): sum of loss reduction across all splits in the tree. Trees that contribute more to model improvement are weighted higher.
- `"cover"`: sum of Hessian values across all splits. For regression (hessian = 1) this equals the sample count; for binary classification (hessian = p(1 − p)) it emphasizes trees that process more uncertain samples.

Implementation Notes
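The 'gain' and 'cover' definitions above amount to summing a per-split statistic over each tree. A minimal sketch, using synthetic split records whose column names mirror `trees_to_dataframe()` output (the numbers are made up; this is not the PR's code):

```python
import numpy as np

# Synthetic per-split records, one dict per split, mimicking the
# Tree / Gain / Cover columns of trees_to_dataframe().
splits = [
    {"Tree": 0, "Gain": 10.0, "Cover": 40.0},
    {"Tree": 0, "Gain": 4.0,  "Cover": 25.0},
    {"Tree": 1, "Gain": 2.0,  "Cover": 40.0},
]

def tree_weights(splits, weight_type="gain"):
    """Aggregate the chosen per-split statistic over each tree."""
    key = {"gain": "Gain", "cover": "Cover"}.get(weight_type)
    if key is None:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    n_trees = max(s["Tree"] for s in splits) + 1
    w = np.zeros(n_trees)
    for s in splits:
        w[s["Tree"]] += s[key]
    return w

print(tree_weights(splits, "gain"))   # per-tree summed gain
print(tree_weights(splits, "cover"))  # per-tree summed cover
```

With weights w_t in hand, the similarity in the formula above becomes a weighted (rather than uniform) average of the per-tree leaf matches.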
- Uses `pred_leaf=True` and `trees_to_dataframe()`

Tests
Added `tests/python/test_leaf_similarity.py` with tests for:
- `gain`
- `ValueError`