Add `Booster.compute_leaf_similarity()` method (#11926)
Conversation
Compute similarity between observations based on leaf-node co-occurrence across trees, similar to Random Forest proximity matrices.
- Two weight types: 'gain' (default) and 'cover'
- Returns a similarity matrix with values in [0, 1]
- Self-similarity is 1.0

Closes dmlc#11919
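The co-occurrence idea can be illustrated with a small self-contained sketch (synthetic leaf indices and uniform tree weights; this is not the PR's implementation):

```python
import numpy as np

# Synthetic leaf assignments: rows = observations, columns = trees.
# Entry (i, t) is the index of the leaf that observation i reaches in tree t.
# In XGBoost such a matrix can be obtained with booster.predict(dmat, pred_leaf=True).
leaves = np.array([
    [2, 5, 1],
    [2, 5, 3],
    [4, 5, 1],
])

# Proximity: the fraction of trees in which two observations land in the
# same leaf (uniform weights; the PR generalizes this to per-tree weights).
similarity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

print(similarity)
# Diagonal is 1.0 (an observation always shares its own leaves),
# off-diagonal entries lie in [0, 1].
```

Observations 0 and 1 share leaves in 2 of 3 trees, so their proximity is 2/3; the matrix is symmetric by construction.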
Thank you for opening the PR. I will see if I can run some experiments with it and avoid the use of tree-to-dataframe (that method itself is quite difficult to extend).
note to myself:
@mfdel Could you please grant me permission to push to this branch for some code refactoring?
I am willing to further expand and optimize this method. The issues I observed during testing are that multi-target models are not currently supported and that a dependency on pandas is introduced. I hope to design a more suitable method for calculating tree weights, to achieve better compatibility and performance. One question: is it acceptable to add new C API interfaces to output the weights? I believe a minimal C API addition is necessary, because there is currently no interface that exposes the "weight" information of individual trees. With the new C API interface in place, the method should be compatible with the vast majority of configurations, only …
#12157 is the new PR I submitted. It includes a minimal C API and a faster similarity-calculation implementation that does not use tree-to-dataframe. I also conducted performance tests: a comparison with Random Forest proximities, and an evaluation of the various weighting methods.


Description
This PR adds a new method, `Booster.compute_leaf_similarity()`, to compute similarity between observations based on leaf-node co-occurrence across trees, similar to Random Forest proximity matrices.
Closes #11919
API
Parameters:
- `data` (DMatrix): query dataset (m samples)
- `reference` (DMatrix): reference dataset (n samples)
- `weight_type` (str): `"gain"` (default) or `"cover"`

Returns:
- `ndarray` of shape (m, n) with values in [0, 1]

Formula

similarity(i, j) = Σ_t w_t · 1[leaf_t(i) = leaf_t(j)] / Σ_t w_t

Where:
- leaf_t(x) is the leaf that observation x reaches in tree t
- w_t is the weight of tree t under the chosen `weight_type`
- 1[·] is the indicator function (1 if the leaves match, 0 otherwise)
Weight Types
Following the suggestion in #11919 to reuse feature importance definitions:
- `"gain"` (default): sum of loss reduction across all splits in the tree. Trees that contribute more to model improvement are weighted higher.
- `"cover"`: sum of Hessian values across all splits. For regression (hessian = 1) this equals the sample count; for binary classification (hessian = p(1 − p)) it emphasizes trees that process more uncertain samples.

Implementation Notes
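The 'gain' and 'cover' definitions above amount to summing a per-split statistic over each tree. A minimal sketch, using synthetic split records whose column names mirror `trees_to_dataframe()` output (the numbers are made up; this is not the PR's code):

```python
import numpy as np

# Synthetic per-split records, one dict per split, mimicking the
# Tree / Gain / Cover columns of trees_to_dataframe().
splits = [
    {"Tree": 0, "Gain": 10.0, "Cover": 40.0},
    {"Tree": 0, "Gain": 4.0,  "Cover": 25.0},
    {"Tree": 1, "Gain": 2.0,  "Cover": 40.0},
]

def tree_weights(splits, weight_type="gain"):
    """Aggregate the chosen per-split statistic over each tree."""
    key = {"gain": "Gain", "cover": "Cover"}.get(weight_type)
    if key is None:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    n_trees = max(s["Tree"] for s in splits) + 1
    w = np.zeros(n_trees)
    for s in splits:
        w[s["Tree"]] += s[key]
    return w

print(tree_weights(splits, "gain"))   # per-tree summed gain
print(tree_weights(splits, "cover"))  # per-tree summed cover
```

With weights w_t in hand, the similarity in the formula above becomes a weighted (rather than uniform) average of the per-tree leaf matches.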
- Uses `pred_leaf=True` and `trees_to_dataframe()`

Tests
Added `tests/python/test_leaf_similarity.py` with tests for:
- `gain`
- `ValueError`