Skip to content

Optimizations for wide histogram building#12158

Merged
RAMitchell merged 20 commits into
dmlc:masterfrom
siqi-he:tiling
May 5, 2026
Merged

Optimizations for wide histogram building#12158
RAMitchell merged 20 commits into
dmlc:masterfrom
siqi-he:tiling

Conversation

@siqi-he
Copy link
Copy Markdown
Contributor

@siqi-he siqi-he commented Apr 13, 2026

Optimizations for wide histogram building using column block tiling

Motivation

For wide datasets (e.g. >500 features), the per-thread histogram buffer in CPU hist tree building exceeds L2 cache. Each row scatters gradient updates across the full buffer, causing heavy cache misses. The existing ColsWiseBuildHistKernel mitigates this for dense data by iterating column-by-column, but suffers from poor gradient-pair reuse (reloads gpair for every column). There is no mitigation for the sparse (any-missing) row-wise path.

Changes

Add column-block tiling with a thread-local local buffer to both histogram kernels. Instead of scattering into the full histogram, each thread accumulates into a small buffer covering ~32 columns worth of bins (~128 KB, fits in L2), then flushes to the full histogram. This localizes writes and amortizes gradient-pair loads across multiple columns per row.

Benchmark methodology

All benchmarks use tree_method='hist', max_depth=8, max_bin=256, 100 rounds, 3 repeats (average of runs 2-3). CPU pinning via taskset. The benchmarks were run using aws ec2 c6i.32xl instance, using all physical cores (i.e. nthread=64).

Datasets include real (Epsilon, Bosch, Santander) and synthetic (HIGGS with PolynomialFeatures expansion, various sparsity levels via injected NaN at fixed seed). Sparse datasets force the row-wise kernel path (IsDense()=false). Predictions were verified to be identical (measured by np.allclose) between master (b2f15e6) and tiling branch across all datasets at 100 rounds.

Results

Dataset Features Hist size Sparsity Baseline Tiled Speedup
epsilon-dense 2000 7.8 MB 0% 38.75s 36.42s 1.06x
epsilon-sparse0.1% 2000 7.8 MB 0.1% 138.74s 46.94s 2.95x
higgs3-sparse0.1% 4494 17.6 MB 0.1% 74.58s 41.25s 1.81x
higgs3-sparse1% 4494 17.6 MB 1% 77.58s 43.48s 1.78x
higgs3-1kf-sparse0.1% 1000 3.9 MB 0.1% 12.41s 10.03s 1.24x
higgs3-1kf-sparse1% 1000 3.9 MB 1% 11.03s 9.49s 1.16x
higgs3-1kf-sparse10% 1000 3.9 MB 10% 12.43s 10.01s 1.24x
higgs3-1kf-sparse49.99% 1000 3.9 MB 49.99% 9.99s 9.23s 1.08x
higgs3-500f-dense 500 2.0 MB 0% 3.50s 3.38s 1.04x
santander 200 0.8 MB 0% 1.65s 1.74s 0.95x
bosch 968 3.8 MB 81% 5.17s 5.16s 1.00x
full benchmark script
#!/usr/bin/env python
"""Benchmark script for tiling PR.
Reproduces the results table in the PR description.

Usage:
    taskset -c 0-63 python bench_pr.py <label> [nthreads]

Datasets required:
    - HIGGS.csv (Kaggle Higgs)
    - data/epsilon.bz2 (LIBSVM format, from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)
    - data/santander_train.csv (Kaggle Santander)
    - train_numeric.csv (Kaggle Bosch)
"""
import datetime
import gc
import os
import sys

import numpy as np
import xgboost as xgb

LABEL = sys.argv[1] if len(sys.argv) > 1 else "unknown"
N_THREADS = int(sys.argv[2]) if len(sys.argv) > 2 else os.cpu_count() // 2
NUM_ROUND = 100
N_REPEATS = 3

PARAM = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 8,
    "nthread": N_THREADS,
    "tree_method": "hist",
    "max_bin": 256,
}


def log(msg):
    print(msg, flush=True)


def inject_sparse(X, fraction, seed=123):
    rng = np.random.RandomState(seed)
    mask = rng.random(X.shape) < fraction
    X_sp = X.copy()
    X_sp[mask] = np.nan
    return X_sp


def bench(name, dm, n_features):
    hist_mb = n_features * 256 * 16 / (1024 * 1024)
    log(f"\n  {name}: hist={hist_mb:.1f}MB")
    xgb.train(PARAM, dm, 5)
    times = []
    for r in range(N_REPEATS):
        st = datetime.datetime.now()
        xgb.train(PARAM, dm, NUM_ROUND)
        elapsed = (datetime.datetime.now() - st).total_seconds()
        times.append(elapsed)
        log(f"    Run {r + 1}: {elapsed:.2f}s")
    avg = np.mean(times[1:])
    std = np.std(times[1:])
    log(f"    >> {name}: {avg:.2f}s +/-{std:.2f}")


log(f"{'=' * 70}")
log(f"Benchmark: {LABEL}")
log(f"Threads: {N_THREADS}, Rounds: {NUM_ROUND}, Repeats: {N_REPEATS}")
log(f"{'=' * 70}")

# ---------- Epsilon ----------
log("\nLoading Epsilon...")
from sklearn.datasets import load_svmlight_file

X_eps, y_eps = load_svmlight_file("data/epsilon.bz2", n_features=2000)
X_eps = X_eps.toarray().astype(np.float32)
y_eps = ((y_eps + 1) / 2).astype(np.float32)

bench("epsilon-dense", xgb.DMatrix(X_eps, label=y_eps), 2000)
bench(
    "epsilon-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_eps, 0.001), label=y_eps, missing=np.nan),
    2000,
)
del X_eps, y_eps
gc.collect()

# ---------- HIGGS poly-3 ----------
log("\nLoading HIGGS poly-3...")
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("HIGGS.csv", header=None, nrows=100_000)
data_raw = df.values[:, 1:].astype(np.float32)
label_h = df.values[:, 0].astype(np.float32)
poly = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
X_h3 = poly.fit_transform(data_raw).astype(np.float32)
del df, data_raw, poly
gc.collect()

# 4494 features
bench(
    "higgs3-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_h3, 0.001), label=label_h, missing=np.nan),
    4494,
)
bench(
    "higgs3-sparse1%",
    xgb.DMatrix(inject_sparse(X_h3, 0.01), label=label_h, missing=np.nan),
    4494,
)

# 1000 features
X_1k = X_h3[:, :1000].copy()
del X_h3
gc.collect()

bench(
    "higgs3-1kf-sparse0.1%",
    xgb.DMatrix(inject_sparse(X_1k, 0.001), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse1%",
    xgb.DMatrix(inject_sparse(X_1k, 0.01), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse10%",
    xgb.DMatrix(inject_sparse(X_1k, 0.10), label=label_h, missing=np.nan),
    1000,
)
bench(
    "higgs3-1kf-sparse49.99%",
    xgb.DMatrix(inject_sparse(X_1k, 0.4999), label=label_h, missing=np.nan),
    1000,
)

# 500 features dense
X_500 = X_1k[:, :500].copy()
del X_1k
gc.collect()
bench("higgs3-500f-dense", xgb.DMatrix(X_500, label=label_h), 500)
del X_500, label_h
gc.collect()

# ---------- Santander ----------
log("\nLoading Santander...")
df_s = pd.read_csv("data/santander_train.csv")
y_s = df_s["target"].values.astype(np.float32)
X_s = df_s.drop(columns=["ID_code", "target"]).values.astype(np.float32)
bench("santander", xgb.DMatrix(X_s, label=y_s), 200)
del X_s, y_s, df_s
gc.collect()

# ---------- Bosch ----------
log("\nLoading Bosch...")
df_b = pd.read_csv("train_numeric.csv")
y_b = df_b["Response"].values.astype(np.float32)
X_b = df_b.drop(columns=["Id", "Response"]).values.astype(np.float32)
bench("bosch", xgb.DMatrix(X_b, label=y_b, missing=np.nan), 968)
del X_b, y_b, df_b
gc.collect()

log(f"\n{'=' * 70}")
log("Done")

Comment thread src/common/hist_util.cc Outdated
Comment on lines +502 to +503
constexpr double kMinDensityForTiling = 0.5;
bool bin_sorted = !BuildingManager::kAnyMissing || gmat.RowsSortedByBin();
Copy link
Copy Markdown
Contributor Author

@siqi-he siqi-he Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local buffer is flushed entirely every column block. When the data is very sparse, only a few bins are actually hit. Therefore doing a full sweep would actually slow things down. The 0.5 threshold is a rough heuristic. The idea is that denser data tend to benefit more from tiling.

For the tiled kernel to work, entries within a row need to be in ascending bin order. It seems that this is the case for standard SparsePage but not guaranteed for CSRArrayAdapter as it accepts user-provided CSR data where column indices may not be sorted. This guard is thus added to avoid silent failures.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the histogram kernel needs to handle CSRArrayAdapter?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the kernel doesn't handle CSRArrayAdapter itself. I mentioned that as a possible data flow path where unsorted columns could be passed into the kernel. That said, after digging deeper, it seems that CSRArrayAdapter is probably a bad example, as it will be eventually converted to SimpleDmatrix that sorts indices. However, it seems that there are other code path where unsorted column indices could end up here, for example: IterativeDMatrix CPU path pushes adapter batches directly into GHistIndexMatrix via PushAdapterBatchPushBatchImplSetIndexData (here), bypassing SimpleDMatrix.SortIndices() call. If the adapter provides unsorted column indices in those paths, the bin indices in GHistIndexMatrix would be unsorted, which would break the cursor-based sparse tiling and cause silent failures.

I did explore an order-agnostic approach (bin-range filtering — iterating all entries per row per column block and checking if (gidx >= bin_lo && gidx < bin_hi)). Unfortunately it's ~10x slower than cursors on wide data due to redundant scanning (n_blocks x nnz_per_row comparisons vs one pass with cursors). The cursor approach is lightweight and covers the common case where column indices are sorted — which is guaranteed by SimpleDMatrix and typical in practice. The RowsSortedByBin guard ensures correctness for edge cases by falling back to the original kernel.

@siqi-he
Copy link
Copy Markdown
Contributor Author

siqi-he commented Apr 20, 2026

@trivialfis Hi, just wanted to gently follow up on this PR and see if there is any concern about the changes. We have many use cases that would benefit from these optimizations, so we’d really appreciate it if we could move this forward. Thanks!

Copy link
Copy Markdown
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. The speedups look nice but there are some concerns.

  • Adding member variables creates complexity through state.

  • These optimisations are implemented as special cases. If we optimise in this way the code base grows into a bunch of branches that becomes harder and harder to maintain. What I prefer is replacing existing paths with code that performs better in a general way to hit a balance between complexity and good performance.

In short, make it simpler if you can.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes CPU hist tree building for wide datasets by adding column-block tiling to improve cache locality and gradient-pair reuse, including for the sparse (any-missing) row-wise path. It also adds per-page metadata (RowsSortedByBin, Density) to safely gate the new sparse tiled kernel and extends the gradient-index cache format accordingly.

Changes:

  • Add a cache-aware wide_hist path to histogram building and introduce tiled histogram kernels (row-wise + column-wise) with a thread-local accumulation buffer.
  • Track RowsSortedByBin() and Density() in GHistIndexMatrix, persist RowsSortedByBin in the raw cache format, and recompute density on cache read.
  • Add/extend C++ tests to validate tiled sparse histogram correctness and cache round-tripping of the new metadata.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/common/hist_util.cc Implements column-block tiling kernels and gates sparse tiling on density + per-row bin-sort order.
src/tree/hist/histogram.h Propagates wide_hist into BuildHist and computes it from cache-size heuristics.
src/data/gradient_index.h Computes/stores RowsSortedByBin and Density during index construction; adds accessors + recompute helper.
src/data/gradient_index_format.cc Appends RowsSortedByBin to cached pages and recomputes density after reading.
tests/cpp/tree/hist/test_histogram.cc Adds a regression test comparing tiled vs non-tiled histograms on sparse data.
tests/cpp/data/test_gradient_index_page_raw_format.cc Extends cache IO test to assert RowsSortedByBin and Density behavior.
tests/cpp/data/test_gradient_index.cc Adds targeted tests for RowsSortedByBin() semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/cpp/tree/hist/test_histogram.cc Outdated
Comment on lines +633 to +635
auto const &gmat =
*(p_fmat->GetBatches<GHistIndexMatrix>(&ctx, BatchParam{kMaxBins, 0.5}).begin());

Comment thread src/common/hist_util.cc Outdated
Comment on lines +439 to +444
std::vector<size_t> row_starts(size);
std::vector<size_t> row_sizes(size);
for (size_t i = 0; i < size; ++i) {
row_starts[i] = get_row_ptr(rid[i]);
row_sizes[i] = get_row_ptr(rid[i] + 1) - row_starts[i];
}
Comment thread src/common/hist_util.cc Outdated
// blocks. Since entries are sorted by global bin index and blocks are
// processed in ascending bin order, the cursor for block k+1 starts where
// block k left off. Total work across all blocks = one pass per row.
std::vector<uint32_t> cursors(size, 0);
Copy link
Copy Markdown
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @RAMitchell . Now that you have good results, could you please consider restarting the work in a more organic way to make the code more robust?

I think LLMs should be trained to help remove code.

Comment thread src/common/hist_util.cc Outdated
max_block_bins = std::max(max_block_bins, bins);
}

static thread_local std::vector<double> tl_cols_buf;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a good sign to just inject a tloc variable.

@siqi-he
Copy link
Copy Markdown
Contributor Author

siqi-he commented Apr 27, 2026

@RAMitchell @trivialfis Thanks a lot for the feedback. I've removed all the member variables previously added. Bin index ordering still needs to be enforced for the kernel to work, so I moved that into the data construction path instead.

On simplifying the kernel, I've explored two approaches and would value your guidance:

This PR: adds a helper function for tiling, gated on estimated work. The fall-through path leaves master's existing kernel untouched. Shows consistent improvements across all datasets in our benchmarks.
tiling-v2: replaces the row-wise kernel entirely with one cursor-based loop, which is probably closer to the generic-implementation idea. This implementation regresses on a few smaller workloads (notably bosch, santander) due to per-call cursor-state allocation.
From a downstream-user standpoint, the consistent gains in this PR are what we'd hope to see, and the dispatch-based-on-estimated-workload pattern feels reasonably aligned with how the kernel already chooses between row-wise and column-wise paths. That said, tiling-v2 is the more literal answer to "replace existing paths with code that performs better in a general way," and we're happy to go in either direction. Would appreciate your input on which would be a better fit for the project.

Here are the updated benchmark results:

Dataset Master Tiling Tiling-v2 Tiling speedup Tiling-v2 speedup
epsilon-dense 38.75s 33.91s 35.79s 1.14x 1.08x
epsilon-sparse0.1% 138.74s 56.02s 51.30s 2.48x 2.70x
higgs3-sparse0.1% 74.58s 43.61s 44.03s 1.71x 1.69x
higgs3-sparse1% 77.58s 46.03s 44.24s 1.69x 1.75x
higgs3-1kf-sparse0.1% 12.41s 9.46s 10.23s 1.31x 1.21x
higgs3-1kf-sparse1% 11.03s 9.75s 9.57s 1.13x 1.15x
higgs3-1kf-sparse10% 12.43s 10.95s 9.93s 1.14x 1.25x
higgs3-1kf-sparse49.99% 9.99s 10.00s 9.59s 1.00x 1.04x
higgs3-500f-dense 3.50s 3.21s 3.24s 1.09x 1.08x
santander 1.65s 1.64s 1.78s 1.01x 0.93x
bosch 5.17s 4.70s 5.51s 1.10x 0.94x

@RAMitchell
Copy link
Copy Markdown
Member

In the second version the branching is just moved to inside the function - I don't think its really any simpler.

I looked deeper into the first version. After some thought, I think we can tolerate the specialisation for the improved sparse performance.

@siqi-he
Copy link
Copy Markdown
Contributor Author

siqi-he commented May 4, 2026

@RAMitchell Thanks again for accepting this direction and for the feedback along the way. We have quite a few downstream use cases that would benefit from this optimization, so we’re very interested in the time and cost savings it could unlock.

When you have a chance, could you share your sense of the timeline for getting this released? That would help us plan around it. Thanks!

@RAMitchell RAMitchell merged commit 3284a0f into dmlc:master May 5, 2026
80 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants