Optimizations for wide histogram building#12158
Conversation
| constexpr double kMinDensityForTiling = 0.5; | ||
| bool bin_sorted = !BuildingManager::kAnyMissing || gmat.RowsSortedByBin(); |
There was a problem hiding this comment.
The local buffer is flushed entirely every column block. When the data is very sparse, only a few bins are actually hit. Therefore doing a full sweep would actually slow things down. The 0.5 threshold is a rough heuristic. The idea is that denser data tend to benefit more from tiling.
For the tiled kernel to work, entries within a row need to be in ascending bin order. It seems that this is the case for standard SparsePage but not guaranteed for CSRArrayAdapter as it accepts user-provided CSR data where column indices may not be sorted. This guard is thus added to avoid silent failures.
There was a problem hiding this comment.
I don't think the histogram kernel needs to handle CSRArrayAdapter?
There was a problem hiding this comment.
You're right, the kernel doesn't handle CSRArrayAdapter itself. I mentioned that as a possible data flow path where unsorted columns could be passed into the kernel. That said, after digging deeper, it seems that CSRArrayAdapter is probably a bad example, as it will be eventually converted to SimpleDmatrix that sorts indices. However, it seems that there are other code path where unsorted column indices could end up here, for example: IterativeDMatrix CPU path pushes adapter batches directly into GHistIndexMatrix via PushAdapterBatch → PushBatchImpl → SetIndexData (here), bypassing SimpleDMatrix.SortIndices() call. If the adapter provides unsorted column indices in those paths, the bin indices in GHistIndexMatrix would be unsorted, which would break the cursor-based sparse tiling and cause silent failures.
I did explore an order-agnostic approach (bin-range filtering — iterating all entries per row per column block and checking if (gidx >= bin_lo && gidx < bin_hi)). Unfortunately it's ~10x slower than cursors on wide data due to redundant scanning (n_blocks x nnz_per_row comparisons vs one pass with cursors). The cursor approach is lightweight and covers the common case where column indices are sorted — which is guaranteed by SimpleDMatrix and typical in practice. The RowsSortedByBin guard ensures correctness for edge cases by falling back to the original kernel.
|
@trivialfis Hi, just wanted to gently follow up on this PR and see if there is any concern about the changes. We have many use cases that would benefit from these optimizations, so we’d really appreciate it if we could move this forward. Thanks! |
RAMitchell
left a comment
There was a problem hiding this comment.
Thank you for the PR. The speedups look nice but there are some concerns.
-
Adding member variables creates complexity through state.
-
These optimisations are implemented as special cases. If we optimise in this way the code base grows into a bunch of branches that becomes harder and harder to maintain. What I prefer is replacing existing paths with code that performs better in a general way to hit a balance between complexity and good performance.
In short, make it simpler if you can.
There was a problem hiding this comment.
Pull request overview
This PR optimizes CPU hist tree building for wide datasets by adding column-block tiling to improve cache locality and gradient-pair reuse, including for the sparse (any-missing) row-wise path. It also adds per-page metadata (RowsSortedByBin, Density) to safely gate the new sparse tiled kernel and extends the gradient-index cache format accordingly.
Changes:
- Add a cache-aware
wide_histpath to histogram building and introduce tiled histogram kernels (row-wise + column-wise) with a thread-local accumulation buffer. - Track
RowsSortedByBin()andDensity()inGHistIndexMatrix, persistRowsSortedByBinin the raw cache format, and recompute density on cache read. - Add/extend C++ tests to validate tiled sparse histogram correctness and cache round-tripping of the new metadata.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/common/hist_util.cc |
Implements column-block tiling kernels and gates sparse tiling on density + per-row bin-sort order. |
src/tree/hist/histogram.h |
Propagates wide_hist into BuildHist and computes it from cache-size heuristics. |
src/data/gradient_index.h |
Computes/stores RowsSortedByBin and Density during index construction; adds accessors + recompute helper. |
src/data/gradient_index_format.cc |
Appends RowsSortedByBin to cached pages and recomputes density after reading. |
tests/cpp/tree/hist/test_histogram.cc |
Adds a regression test comparing tiled vs non-tiled histograms on sparse data. |
tests/cpp/data/test_gradient_index_page_raw_format.cc |
Extends cache IO test to assert RowsSortedByBin and Density behavior. |
tests/cpp/data/test_gradient_index.cc |
Adds targeted tests for RowsSortedByBin() semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto const &gmat = | ||
| *(p_fmat->GetBatches<GHistIndexMatrix>(&ctx, BatchParam{kMaxBins, 0.5}).begin()); | ||
|
|
| std::vector<size_t> row_starts(size); | ||
| std::vector<size_t> row_sizes(size); | ||
| for (size_t i = 0; i < size; ++i) { | ||
| row_starts[i] = get_row_ptr(rid[i]); | ||
| row_sizes[i] = get_row_ptr(rid[i] + 1) - row_starts[i]; | ||
| } |
| // blocks. Since entries are sorted by global bin index and blocks are | ||
| // processed in ascending bin order, the cursor for block k+1 starts where | ||
| // block k left off. Total work across all blocks = one pass per row. | ||
| std::vector<uint32_t> cursors(size, 0); |
trivialfis
left a comment
There was a problem hiding this comment.
I agree with @RAMitchell . Now that you have good results, could you please consider restarting the work in a more organic way to make the code more robust?
I think LLMs should be trained to help remove code.
| max_block_bins = std::max(max_block_bins, bins); | ||
| } | ||
|
|
||
| static thread_local std::vector<double> tl_cols_buf; |
There was a problem hiding this comment.
It's not a good sign to just inject a tloc variable.
|
@RAMitchell @trivialfis Thanks a lot for the feedback. I've removed all the member variables previously added. Bin index ordering still needs to be enforced for the kernel to work, so I moved that into the data construction path instead. On simplifying the kernel, I've explored two approaches and would value your guidance: This PR: adds a helper function for tiling, gated on estimated work. The fall-through path leaves master's existing kernel untouched. Shows consistent improvements across all datasets in our benchmarks. Here are the updated benchmark results:
|
|
In the second version the branching is just moved to inside the function - I don't think its really any simpler. I looked deeper into the first version. After some thought, I think we can tolerate the specialisation for the improved sparse performance. |
|
@RAMitchell Thanks again for accepting this direction and for the feedback along the way. We have quite a few downstream use cases that would benefit from this optimization, so we’re very interested in the time and cost savings it could unlock. When you have a chance, could you share your sense of the timeline for getting this released? That would help us plan around it. Thanks! |
Optimizations for wide histogram building using column block tiling
Motivation
For wide datasets (e.g. >500 features), the per-thread histogram buffer in CPU hist tree building exceeds L2 cache. Each row scatters gradient updates across the full buffer, causing heavy cache misses. The existing
ColsWiseBuildHistKernelmitigates this for dense data by iterating column-by-column, but suffers from poor gradient-pair reuse (reloads gpair for every column). There is no mitigation for the sparse (any-missing) row-wise path.Changes
Add column-block tiling with a thread-local local buffer to both histogram kernels. Instead of scattering into the full histogram, each thread accumulates into a small buffer covering ~32 columns worth of bins (~128 KB, fits in L2), then flushes to the full histogram. This localizes writes and amortizes gradient-pair loads across multiple columns per row.
Benchmark methodology
All benchmarks use
tree_method='hist',max_depth=8,max_bin=256, 100 rounds, 3 repeats (average of runs 2-3). CPU pinning viataskset. The benchmarks were run using aws ec2 c6i.32xl instance, using all physical cores (i.e. nthread=64).Datasets include real (Epsilon, Bosch, Santander) and synthetic (HIGGS with PolynomialFeatures expansion, various sparsity levels via injected NaN at fixed seed). Sparse datasets force the row-wise kernel path (
IsDense()=false). Predictions were verified to be identical (measured bynp.allclose) between master (b2f15e6) and tiling branch across all datasets at 100 rounds.Results
full benchmark script