Optimizations for wide histogram building by siqi-he · Pull Request #12158 · dmlc/xgboost

siqi-he · 2026-04-13T00:06:14Z

Optimizations for wide histogram building using column block tiling

Motivation

For wide datasets (e.g. >500 features), the per-thread histogram buffer in CPU hist tree building exceeds L2 cache. Each row scatters gradient updates across the full buffer, causing heavy cache misses. The existing ColsWiseBuildHistKernel mitigates this for dense data by iterating column-by-column, but suffers from poor gradient-pair reuse (reloads gpair for every column). There is no mitigation for the sparse (any-missing) row-wise path.

Changes

Add column-block tiling with a thread-local local buffer to both histogram kernels. Instead of scattering into the full histogram, each thread accumulates into a small buffer covering ~32 columns worth of bins (~128 KB, fits in L2), then flushes to the full histogram. This localizes writes and amortizes gradient-pair loads across multiple columns per row.

Benchmark methodology

All benchmarks use tree_method='hist', max_depth=8, max_bin=256, 100 rounds, 3 repeats (average of runs 2-3). CPU pinning via taskset. The benchmarks were run using aws ec2 c6i.32xl instance, using all physical cores (i.e. nthread=64).

Datasets include real (Epsilon, Bosch, Santander) and synthetic (HIGGS with PolynomialFeatures expansion, various sparsity levels via injected NaN at fixed seed). Sparse datasets force the row-wise kernel path (IsDense()=false). Predictions were verified to be identical (measured by np.allclose) between master (b2f15e6) and tiling branch across all datasets at 100 rounds.

Results

Dataset	Features	Hist size	Sparsity	Baseline	Tiled	Speedup
epsilon-dense	2000	7.8 MB	0%	38.75s	36.42s	1.06x
epsilon-sparse0.1%	2000	7.8 MB	0.1%	138.74s	46.94s	2.95x
higgs3-sparse0.1%	4494	17.6 MB	0.1%	74.58s	41.25s	1.81x
higgs3-sparse1%	4494	17.6 MB	1%	77.58s	43.48s	1.78x
higgs3-1kf-sparse0.1%	1000	3.9 MB	0.1%	12.41s	10.03s	1.24x
higgs3-1kf-sparse1%	1000	3.9 MB	1%	11.03s	9.49s	1.16x
higgs3-1kf-sparse10%	1000	3.9 MB	10%	12.43s	10.01s	1.24x
higgs3-1kf-sparse49.99%	1000	3.9 MB	49.99%	9.99s	9.23s	1.08x
higgs3-500f-dense	500	2.0 MB	0%	3.50s	3.38s	1.04x
santander	200	0.8 MB	0%	1.65s	1.74s	0.95x
bosch	968	3.8 MB	81%	5.17s	5.16s	1.00x

full benchmark script

#!/usr/bin/env python
"""Benchmark script for tiling PR.
Reproduces the results table in the PR description.

Usage:
    taskset -c 0-63 python bench_pr.py <label> [nthreads]

Datasets required:
    - HIGGS.csv (Kaggle Higgs)
    - data/epsilon.bz2 (LIBSVM format, from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)
    - data/santander_train.csv (Kaggle Santander)
    - train_numeric.csv (Kaggle Bosch)
"""
import datetime
import gc
import os
import sys

import numpy as np
import xgboost as xgb

LABEL = sys.argv[1] if len(sys.argv) > 1 else "unknown"
N_THREADS = int(sys.argv[2]) if len(sys.argv) > 2 else os.cpu_count() // 2
NUM_ROUND = 100
N_REPEATS = 3

PARAM = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 8,
    "nthread": N_THREADS,
    "tree_method": "hist",
    "max_bin": 256,
}


def log(msg):
    print(msg, flush=True)


def inject_sparse(X, fraction, seed=123):
    rng = np.random.RandomState(seed)
    mask = rng.random(X.shape) < fraction
    X_sp = X.copy()
    X_sp[mask] = np.nan
    return X_sp


def bench(name, dm, n_features):
    hist_mb = n_features * 256 * 16 / (1024 * 1024)
    log(f"\n  {name}: hist={hist_mb:.1f}MB")
    xgb.train(PARAM, dm, 5)
    times = []
    for r in range(N_REPEATS):
        st = datetime.datetime.now()
        xgb.train(PARAM, dm, NUM_ROUND)
        elapsed = (datetime.datetime.now() - st).total_seconds()
        times.append(elapsed)
        log(f"    Run {r + 1}: {elapsed:.2f}s")
    avg = np.mean(times[1:])
    std = np.std(times[1:])
    log(f"    >> {name}: {avg:.2f}s +/-{std:.2f}")


log(f"{'=' * 70}")
log(f"Benchmark: {LABEL}")
log(f"Threads: {N_THREADS}, Rounds: {NUM_ROUND}, Repeats: {N_REPEATS}")
log(f"{'=' * 70}")

# ---------- Epsilon ----------
log("\nLoading Epsilon...")
from sklearn.datasets import load_svmlight_file

X_eps, y_eps = load_svmlight_file("data/epsilon.bz2", n_features=2000)
X_eps = X_eps.toarray().astype(np.float32)
y_eps = ((y_eps + 1) / 2).astype(np.float32)

bench("epsilon-dense", xgb.DMatrix(X_eps, label=y_eps), 2000)
bench(
    "epsilon-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_eps, 0.001), label=y_eps, missing=np.nan),
    2000,
)
del X_eps, y_eps
gc.collect()

# ---------- HIGGS poly-3 ----------
log("\nLoading HIGGS poly-3...")
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("HIGGS.csv", header=None, nrows=100_000)
data_raw = df.values[:, 1:].astype(np.float32)
label_h = df.values[:, 0].astype(np.float32)
poly = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
X_h3 = poly.fit_transform(data_raw).astype(np.float32)
del df, data_raw, poly
gc.collect()

# 4494 features
bench(
    "higgs3-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_h3, 0.001), label=label_h, missing=np.nan),
    4494,
)
bench(
    "higgs3-sparse1%",
    xgb.DMatrix(inject_sparse(X_h3, 0.01), label=label_h, missing=np.nan),
    4494,
)

# 1000 features
X_1k = X_h3[:, :1000].copy()
del X_h3
gc.collect()

bench(
    "higgs3-1kf-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_1k, 0.001), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse1%",
    xgb.DMatrix(inject_sparse(X_1k, 0.01), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse10%",
    xgb.DMatrix(inject_sparse(X_1k, 0.10), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse49.99%",
    xgb.DMatrix(inject_sparse(X_1k, 0.4999), label=label_h, missing=np.nan),
    1000,
)

# 500 features dense
X_500 = X_1k[:, :500].copy()
del X_1k
gc.collect()
bench("higgs3-500f-dense", xgb.DMatrix(X_500, label=label_h), 500)
del X_500, label_h
gc.collect()

# ---------- Santander ----------
log("\nLoading Santander...")
df_s = pd.read_csv("data/santander_train.csv")
y_s = df_s["target"].values.astype(np.float32)
X_s = df_s.drop(columns=["ID_code", "target"]).values.astype(np.float32)
bench("santander", xgb.DMatrix(X_s, label=y_s), 200)
del X_s, y_s, df_s
gc.collect()

# ---------- Bosch ----------
log("\nLoading Bosch...")
df_b = pd.read_csv("train_numeric.csv")
y_b = df_b["Response"].values.astype(np.float32)
X_b = df_b.drop(columns=["Id", "Response"]).values.astype(np.float32)
bench("bosch", xgb.DMatrix(X_b, label=y_b, missing=np.nan), 968)
del X_b, y_b, df_b
gc.collect()

log(f"\n{'=' * 70}")
log("Done")

…ty gate

siqi-he · 2026-04-13T05:45:14Z

+      constexpr double kMinDensityForTiling = 0.5;
+      bool bin_sorted = !BuildingManager::kAnyMissing || gmat.RowsSortedByBin();


The local buffer is flushed entirely every column block. When the data is very sparse, only a few bins are actually hit. Therefore doing a full sweep would actually slow things down. The 0.5 threshold is a rough heuristic. The idea is that denser data tend to benefit more from tiling.

For the tiled kernel to work, entries within a row need to be in ascending bin order. It seems that this is the case for standard SparsePage but not guaranteed for CSRArrayAdapter as it accepts user-provided CSR data where column indices may not be sorted. This guard is thus added to avoid silent failures.

I don't think the histogram kernel needs to handle CSRArrayAdapter?

You're right, the kernel doesn't handle CSRArrayAdapter itself. I mentioned that as a possible data flow path where unsorted columns could be passed into the kernel. That said, after digging deeper, it seems that CSRArrayAdapter is probably a bad example, as it will be eventually converted to SimpleDmatrix that sorts indices. However, it seems that there are other code path where unsorted column indices could end up here, for example: IterativeDMatrix CPU path pushes adapter batches directly into GHistIndexMatrix via PushAdapterBatch → PushBatchImpl → SetIndexData (here), bypassing SimpleDMatrix.SortIndices() call. If the adapter provides unsorted column indices in those paths, the bin indices in GHistIndexMatrix would be unsorted, which would break the cursor-based sparse tiling and cause silent failures.

I did explore an order-agnostic approach (bin-range filtering — iterating all entries per row per column block and checking if (gidx >= bin_lo && gidx < bin_hi)). Unfortunately it's ~10x slower than cursors on wide data due to redundant scanning (n_blocks x nnz_per_row comparisons vs one pass with cursors). The cursor approach is lightweight and covers the common case where column indices are sorted — which is guaranteed by SimpleDMatrix and typical in practice. The RowsSortedByBin guard ensures correctness for edge cases by falling back to the original kernel.

siqi-he · 2026-04-20T21:24:12Z

@trivialfis Hi, just wanted to gently follow up on this PR and see if there is any concern about the changes. We have many use cases that would benefit from these optimizations, so we’d really appreciate it if we could move this forward. Thanks!

RAMitchell

Thank you for the PR. The speedups look nice but there are some concerns.

Adding member variables creates complexity through state.
These optimisations are implemented as special cases. If we optimise in this way the code base grows into a bunch of branches that becomes harder and harder to maintain. What I prefer is replacing existing paths with code that performs better in a general way to hit a balance between complexity and good performance.

In short, make it simpler if you can.

Copilot

Pull request overview

This PR optimizes CPU hist tree building for wide datasets by adding column-block tiling to improve cache locality and gradient-pair reuse, including for the sparse (any-missing) row-wise path. It also adds per-page metadata (RowsSortedByBin, Density) to safely gate the new sparse tiled kernel and extends the gradient-index cache format accordingly.

Changes:

Add a cache-aware wide_hist path to histogram building and introduce tiled histogram kernels (row-wise + column-wise) with a thread-local accumulation buffer.
Track RowsSortedByBin() and Density() in GHistIndexMatrix, persist RowsSortedByBin in the raw cache format, and recompute density on cache read.
Add/extend C++ tests to validate tiled sparse histogram correctness and cache round-tripping of the new metadata.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`src/common/hist_util.cc`	Implements column-block tiling kernels and gates sparse tiling on density + per-row bin-sort order.
`src/tree/hist/histogram.h`	Propagates `wide_hist` into `BuildHist` and computes it from cache-size heuristics.
`src/data/gradient_index.h`	Computes/stores `RowsSortedByBin` and `Density` during index construction; adds accessors + recompute helper.
`src/data/gradient_index_format.cc`	Appends `RowsSortedByBin` to cached pages and recomputes density after reading.
`tests/cpp/tree/hist/test_histogram.cc`	Adds a regression test comparing tiled vs non-tiled histograms on sparse data.
`tests/cpp/data/test_gradient_index_page_raw_format.cc`	Extends cache IO test to assert `RowsSortedByBin` and `Density` behavior.
`tests/cpp/data/test_gradient_index.cc`	Adds targeted tests for `RowsSortedByBin()` semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  auto const &gmat =
+      *(p_fmat->GetBatches<GHistIndexMatrix>(&ctx, BatchParam{kMaxBins, 0.5}).begin());
+


+    std::vector<size_t> row_starts(size);
+    std::vector<size_t> row_sizes(size);
+    for (size_t i = 0; i < size; ++i) {
+      row_starts[i] = get_row_ptr(rid[i]);
+      row_sizes[i] = get_row_ptr(rid[i] + 1) - row_starts[i];
+    }


+    // blocks. Since entries are sorted by global bin index and blocks are
+    // processed in ascending bin order, the cursor for block k+1 starts where
+    // block k left off. Total work across all blocks = one pass per row.
+    std::vector<uint32_t> cursors(size, 0);


… gate

trivialfis

I agree with @RAMitchell . Now that you have good results, could you please consider restarting the work in a more organic way to make the code more robust?

I think LLMs should be trained to help remove code.

trivialfis · 2026-04-27T10:26:39Z

+    max_block_bins = std::max(max_block_bins, bins);
+  }
+
+  static thread_local std::vector<double> tl_cols_buf;


It's not a good sign to just inject a tloc variable.

siqi-he · 2026-04-27T22:08:09Z

@RAMitchell @trivialfis Thanks a lot for the feedback. I've removed all the member variables previously added. Bin index ordering still needs to be enforced for the kernel to work, so I moved that into the data construction path instead.

On simplifying the kernel, I've explored two approaches and would value your guidance:

This PR: adds a helper function for tiling, gated on estimated work. The fall-through path leaves master's existing kernel untouched. Shows consistent improvements across all datasets in our benchmarks.
tiling-v2: replaces the row-wise kernel entirely with one cursor-based loop, which is probably closer to the generic-implementation idea. This implementation regresses on a few smaller workloads (notably bosch, santander) due to per-call cursor-state allocation.
From a downstream-user standpoint, the consistent gains in this PR are what we'd hope to see, and the dispatch-based-on-estimated-workload pattern feels reasonably aligned with how the kernel already chooses between row-wise and column-wise paths. That said, tiling-v2 is the more literal answer to "replace existing paths with code that performs better in a general way," and we're happy to go in either direction. Would appreciate your input on which would be a better fit for the project.

Here are the updated benchmark results:

Dataset	Master	Tiling	Tiling-v2	Tiling speedup	Tiling-v2 speedup
epsilon-dense	38.75s	33.91s	35.79s	1.14x	1.08x
epsilon-sparse0.1%	138.74s	56.02s	51.30s	2.48x	2.70x
higgs3-sparse0.1%	74.58s	43.61s	44.03s	1.71x	1.69x
higgs3-sparse1%	77.58s	46.03s	44.24s	1.69x	1.75x
higgs3-1kf-sparse0.1%	12.41s	9.46s	10.23s	1.31x	1.21x
higgs3-1kf-sparse1%	11.03s	9.75s	9.57s	1.13x	1.15x
higgs3-1kf-sparse10%	12.43s	10.95s	9.93s	1.14x	1.25x
higgs3-1kf-sparse49.99%	9.99s	10.00s	9.59s	1.00x	1.04x
higgs3-500f-dense	3.50s	3.21s	3.24s	1.09x	1.08x
santander	1.65s	1.64s	1.78s	1.01x	0.93x
bosch	5.17s	4.70s	5.51s	1.10x	0.94x

RAMitchell · 2026-04-28T08:37:21Z

In the second version the branching is just moved to inside the function - I don't think its really any simpler.

I looked deeper into the first version. After some thought, I think we can tolerate the specialisation for the improved sparse performance.

siqi-he · 2026-05-04T16:48:49Z

@RAMitchell Thanks again for accepting this direction and for the feedback along the way. We have quite a few downstream use cases that would benefit from this optimization, so we’re very interested in the time and cost savings it could unlock.

When you have a chance, could you share your sense of the timeline for getting this released? That would help us plan around it. Thanks!

siqi-he added 12 commits April 12, 2026 03:50

add tiling and local buffer to ColsWiseBuildHistKernel

bbd3759

add RowsWiseBuildHistKernelTiled with cache-aware threshold and densi…

9cde078

…ty gate

add RowsSortedByBin safety guard

185e07f

update tests

8cda8f3

add density to GHistIndexMatrix

8daf812

add RecomputeDensity

e2e62d4

update tests

7f09c18

undo unrelated changes introduced in formatting

cedf05f

clean up tiled kernel

4e89c68

replace tiled_threshold with use_tiled

17e3b53

clang format

57aeab9

rename to wide_hist for clarity

59ecfa5

siqi-he commented Apr 13, 2026

View reviewed changes

RAMitchell requested a review from Copilot April 23, 2026 07:30

Copilot started reviewing on behalf of RAMitchell April 23, 2026 07:30 View session

RAMitchell requested changes Apr 23, 2026

View reviewed changes

Copilot AI reviewed Apr 23, 2026

View reviewed changes

siqi-he added 6 commits April 27, 2026 04:48

merge tiled kernel into RowsWiseBuildHistKernel with dynamic per-node…

c9f7c21

… gate

extract helper and remove redundant states

0862915

localize changes to row kernel

accd8c2

enforce bin idx sorting

a95b474

remove format changes

37315c7

update sorting location

a4897a4

trivialfis reviewed Apr 27, 2026

View reviewed changes

remove thread local

4c42c06

RAMitchell approved these changes Apr 28, 2026

View reviewed changes

Merge branch 'master' into tiling

2c98333

RAMitchell merged commit 3284a0f into dmlc:master May 5, 2026
80 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Optimizations for wide histogram building#12158

Optimizations for wide histogram building#12158
RAMitchell merged 20 commits into
dmlc:masterfrom
siqi-he:tiling

siqi-he commented Apr 13, 2026

Uh oh!

siqi-he Apr 13, 2026 •

edited

Loading

Uh oh!

trivialfis Apr 14, 2026

Uh oh!

siqi-he Apr 14, 2026

Uh oh!

siqi-he commented Apr 20, 2026

Uh oh!

RAMitchell left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

trivialfis left a comment

Uh oh!

trivialfis Apr 27, 2026

Uh oh!

siqi-he commented Apr 27, 2026

Uh oh!

RAMitchell commented Apr 28, 2026

Uh oh!

siqi-he commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		constexpr double kMinDensityForTiling = 0.5;
		bool bin_sorted = !BuildingManager::kAnyMissing \|\| gmat.RowsSortedByBin();

		auto const &gmat =
		*(p_fmat->GetBatches<GHistIndexMatrix>(&ctx, BatchParam{kMaxBins, 0.5}).begin());

Uh oh!

Uh oh!

Conversation

siqi-he commented Apr 13, 2026

Optimizations for wide histogram building using column block tiling

Motivation

Changes

Benchmark methodology

Results

Uh oh!

siqi-he Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trivialfis Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

siqi-he Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

siqi-he commented Apr 20, 2026

Uh oh!

RAMitchell left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

trivialfis left a comment

Choose a reason for hiding this comment

Uh oh!

trivialfis Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

siqi-he commented Apr 27, 2026

Uh oh!

RAMitchell commented Apr 28, 2026

Uh oh!

siqi-he commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

siqi-he Apr 13, 2026 •

edited

Loading