Commit bf1fa05

Updating the verified data, addressed review comments on using scikit-learn, removing memory allocator section, and clarifying the scope to include all 3 methods
1 parent 71513f0 commit bf1fa05

1 file changed

Lines changed: 68 additions & 50 deletions

File tree

software/xgboost/README.md

@@ -1,10 +1,12 @@
1-
# XGBoost Optimization on Intel® Processors
1+
# Gradient Boosting Inference Optimization on Intel® Processors
22

33
## Introduction
44

5-
[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.
5+
[XGBoost](https://xgboost.readthedocs.io/), [LightGBM](https://lightgbm.readthedocs.io/), and [CatBoost](https://catboost.ai/) are among the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate inference for these frameworks on Intel® Xeon® processors using [oneDAL (oneAPI Data Analytics Library)](http://uxlfoundation.github.io/oneDAL/) via its Python interface, `daal4py`, provided through the [`scikit-learn-intelex`](https://github.com/intel/scikit-learn-intelex) package.
66

7-
By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.
7+
By converting trained models to oneDAL, you can achieve **substantially faster inference** (up to 51x at batch size 1 in the benchmarks in this guide) with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.
8+
9+
> **Note:** `daal4py` supports a specific subset of GBT model configurations (e.g., standard classification and regression trees). For model types not supported by daal4py, consider alternatives such as [ONNX Runtime](https://onnxruntime.ai/) for optimized inference.
810
911
## Contents
1012

@@ -27,28 +29,35 @@ By converting trained XGBoost models to oneDAL, you can achieve **up to 36x fast
2729
- [Faster XGBoost, LightGBM, and CatBoost Inference on the CPU (Intel Developer)](https://www.intel.com/content/www/us/en/developer/articles/technical/faster-xgboost-light-gbm-catboost-inference-on-cpu.html)
2830
- [Improving the Performance of XGBoost and LightGBM Inference (Intel Analytics Software)](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
2931
- [Fast Gradient Boosting Tree Inference for Intel Xeon Processors (Intel Analytics Software)](https://medium.com/intel-analytics-software/fast-gradient-boosting-tree-inference-for-intel-xeon-processors-35756f174f55)
30-
- [daal4py Model Builders Documentation](https://intelpython.github.io/daal4py/model-builders.html)
32+
- [scikit-learn-intelex Model Builders Documentation](https://uxlfoundation.github.io/scikit-learn-intelex/latest/model_builders.html)
33+
- [About daal4py](https://uxlfoundation.github.io/scikit-learn-intelex/latest/about_daal4py.html)
3134
- [oneDAL GitHub Repository](https://github.com/uxlfoundation/oneDAL)
32-
- [Intel Extension for Scikit-learn (sklearnex)](https://github.com/intel/scikit-learn-intelex)
35+
- [scikit-learn-intelex (sklearnex)](https://github.com/intel/scikit-learn-intelex)
3336

3437
## Prerequisites
3538

3639
- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)
37-
- Python 3.9 or higher
38-
- XGBoost installed (`xgboost` package)
40+
- Python version supported by [scikit-learn-intelex](https://github.com/intel/scikit-learn-intelex) (currently 3.10+)
41+
- One or more gradient boosting libraries: [XGBoost](https://xgboost.readthedocs.io/) (`xgboost` from PyPI or `py-xgboost` from conda-forge), [LightGBM](https://lightgbm.readthedocs.io/) (`lightgbm`), [CatBoost](https://catboost.ai/) (`catboost`)
3942

4043
## Installation
4144

42-
Install `daal4py` from PyPI:
45+
The `daal4py` module is provided through the `scikit-learn-intelex` package. Install from PyPI:
4346

4447
```bash
45-
pip install daal4py
48+
pip install scikit-learn-intelex
4649
```
4750

4851
Or from conda-forge:
4952

5053
```bash
51-
conda install -c conda-forge daal4py --override-channels
54+
conda install -c conda-forge scikit-learn-intelex --override-channels
55+
```
56+
57+
Install the gradient boosting libraries you need:
58+
59+
```bash
60+
pip install xgboost lightgbm catboost
5261
```
5362

5463
## Accelerating XGBoost Inference with oneDAL
@@ -136,27 +145,26 @@ d4p_model = d4p.mb.convert_model(reg)
136145
d4p_predictions = d4p_model.predict(X_test)
137146
```
138147

148+
139149
### Getting Prediction Probabilities
140150

141-
For classification tasks, you can request both labels and probabilities:
151+
For classification tasks, you can request both labels and probabilities using the high-level API:
142152

143153
```python
144154
import daal4py as d4p
145155

146-
# Using the lower-level API for more control
147-
daal_model = d4p.get_gbt_model_from_xgboost(clf.get_booster())
156+
# Convert the model
157+
d4p_model = d4p.mb.convert_model(clf)
148158

149-
predict_algo = d4p.gbt_classification_prediction(
150-
nClasses=n_classes,
151-
resultsToEvaluate="computeClassLabels|computeClassProbabilities"
152-
)
153-
daal_prediction = predict_algo.compute(X_test, daal_model)
159+
# Get class labels
160+
predictions = d4p_model.predict(X_test)
154161

155-
# Access results
156-
labels = daal_prediction.prediction
157-
probabilities = daal_prediction.probabilities
162+
# Get prediction probabilities
163+
probabilities = d4p_model.predict_proba(X_test)
158164
```
159165

166+
For full documentation on supported model types and options, see the [Model Builders documentation](https://uxlfoundation.github.io/scikit-learn-intelex/latest/model_builders.html).
167+
160168
### Saving and Loading Converted Models
161169

162170
Converted oneDAL models can be serialized with `pickle` for deployment:
@@ -181,24 +189,34 @@ predictions = model.predict(X_test)
181189

182190
## Performance Results
183191

184-
### daal4py (oneDAL) Inference Speedup over Native Libraries
192+
### daal4py (oneDAL) Inference Speedup over Native Libraries (Batch Size = 1)
185193

186-
The following results were measured on an Intel® Xeon® Platinum 8592+ (Emerald Rapids), 2 sockets, 64 cores/socket, 256 threads, 503 GB RAM. Benchmarks were pinned to a single NUMA node (cores 0–31) using `numactl --localalloc --physcpubind=0-31`. Each model was trained with 100 estimators at max depth 8. Inference was measured over 100 iterations after warmup. Speedup = native library inference time / daal4py inference time.
194+
The following results were measured on an AWS r8i.12xlarge instance (Intel® Xeon® Scalable Processor, Granite Rapids, 48 vCPUs, 384 GB RAM). Each model was trained with 1,000 estimators. Inference was measured at batch size = 1 (single-row prediction). Speedup = native library inference time / daal4py inference time.
187195

188196
| Dataset | Rows | Features | Task | daal4py vs XGBoost | daal4py vs LightGBM | daal4py vs CatBoost |
189197
|:--------|-----:|---------:|:-----|-------------------:|--------------------:|--------------------:|
190-
| Abalone | 4,177 | 8 | Regression | 2.66x | 3.53x | 6.12x |
191-
| HIGGS-1M | 940,160 | 24 | Classification | 1.87x | 6.10x | 9.25x |
192-
| MLSR | 203 | 12,600 | Regression | 8.02x | 2.51x | 25.91x |
193-
| Mortgage-1Q | 500,000 | 45 | Regression | 1.24x | 1.66x | 5.27x |
194-
| PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x |
195-
| Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x |
198+
| Abalone | 4,177 | 8 | Regression | 12.56x | 10.06x | 4.91x |
199+
| Airline | 26,969 | 6,452 | Classification (binary) | 11.27x | 13.01x | 1.85x |
200+
| Airline-OHE | 940,160 | 24 | Classification (binary) | 5.32x | 51.03x | 46.86x |
201+
| Bosch | 6,000,960 | 136 | Classification (binary) | 10.98x | 21.84x | 15.01x |
202+
| Covtype | 500,000 | 45 | Classification (7-class) | 2.56x | 1.49x | 0.20x |
203+
| Epsilon | 200,000 | 60 | Classification (binary) | 8.69x | 28.34x | 23.19x |
204+
| Fraud | 76,020 | 370 | Classification (binary) | 15.78x | 41.55x | 3.58x |
205+
| HIGGS | 26,969 | 7 | Classification (binary) | 10.82x | 13.53x | 2.36x |
206+
| HIGGS-1M | 1,183,747 | 968 | Classification (binary) | 12.26x | 13.91x | 3.01x |
207+
| MLSR | 581,012 | 54 | Regression | 13.67x | 11.61x | 5.73x |
208+
| Mortgage-1Q | 500,000 | 2,000 | Regression | 13.05x | 8.91x | 4.09x |
209+
| PLAsTiCC | 200,000 | 60 | Classification (14-class) | 2.42x | 1.07x | 0.11x |
210+
| Santander | 940,160 | 24 | Classification (binary) | 11.07x | 17.22x | 7.42x |
211+
| Year Prediction MSD | 515,345 | 90 | Regression | 11.59x | 10.46x | 4.56x |
212+
213+
**Software versions used for benchmarking:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, scikit-learn-intelex 2024.7, Python 3.10.12, scikit-learn 1.5.2. For best results, use the latest available versions of these packages.
196214

197-
**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2
215+
**Hardware:** AWS r8i.12xlarge (Intel® Xeon® Scalable Processor, Granite Rapids, 48 vCPUs, 384 GB RAM)
198216

199-
**Hardware:** Intel® Xeon® Platinum 8592+ (Emerald Rapids), 2 sockets, 64 cores/socket, 256 threads, HT On, 503 GB DDR5, single NUMA node
217+
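Across most datasets, daal4py accelerates inference for all three gradient boosting frameworks. LightGBM sees the largest gains (up to 51x on Airline-OHE), XGBoost achieves a 5–16x speedup on the binary classification and regression workloads (about 2.5x on the multiclass Covtype and PLAsTiCC datasets), and CatBoost benefits most on high-dimensional binary classification tasks. The exception is multiclass CatBoost, where daal4py is slower than the native library (0.20x on Covtype, 0.11x on PLAsTiCC).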
200218

201-
Across all datasets, daal4py consistently accelerates inference for all three gradient boosting frameworks. CatBoost sees the largest gains (up to 25.9x on MLSR), while LightGBM and XGBoost benefit most on larger datasets and higher-dimensional feature spaces. Prediction quality is preserved — match rates are 99.7–100% across all tests.
219+
For multiclass classification, XGBoost and LightGBM (with default settings) and daal4py all use one tree per class per boosting round. CatBoost, on the other hand, uses vectorized trees. As a result, the other approaches process `num_classes` times as many trees as CatBoost, e.g., 7,000 vs. 1,000 for Covtype. For a smaller `num_estimators` such as `100`, `daal4py` outperforms CatBoost, but as `num_estimators` grows, CatBoost provides better inference latency.
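
The tree-count arithmetic above can be checked with a short sketch. The function name `trees_evaluated` is hypothetical and only restates the counting argument from this paragraph; it does not call any of the libraries.

```python
def trees_evaluated(num_estimators: int, num_classes: int, one_tree_per_class: bool) -> int:
    """Return how many trees a single prediction has to traverse.

    one_tree_per_class=True models the layout described for XGBoost,
    LightGBM, and daal4py; False models CatBoost's vectorized trees.
    """
    if one_tree_per_class:
        return num_estimators * num_classes
    return num_estimators


# Covtype: 7 classes, 1,000 estimators (the benchmark configuration)
per_class = trees_evaluated(1000, 7, one_tree_per_class=True)
vectorized = trees_evaluated(1000, 7, one_tree_per_class=False)
print(per_class, vectorized)
```

With the benchmark settings this gives 7,000 trees for the one-tree-per-class layout against 1,000 for the vectorized layout, matching the Covtype figures quoted above.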
202220

203221
### Reproducing the Benchmark
204222

@@ -243,13 +261,21 @@ print(f"Speedup: {speedup:.2f}x")
243261

244262
## How It Works
245263

246-
oneDAL achieves faster GBT inference through two key optimizations:
264+
The speedup from oneDAL comes from three primary factors:
247265

248-
### AVX-512 Vectorized Tree Traversal
249-
oneDAL uses Intel AVX-512 vector instructions (`vpgatherd` and `vcmpp`) to process multiple observations through decision trees simultaneously. Instead of traversing one observation at a time, it processes a block of rows through each tree in parallel using SIMD operations for node comparisons and index computations.
266+
### 1. Python/Framework Overhead Elimination
250267

251-
### Cache-Optimized Memory Access
252-
Tree structures are blocked in memory so that a subset of trees and a block of observations fit in the L1 data cache. This ensures the majority of memory accesses are served from L1 cache at maximum bandwidth, rather than incurring costly main memory accesses.
268+
Native Python-based prediction (XGBoost, LightGBM, CatBoost) incurs significant per-prediction overhead: interpreter dispatch, type checking, array conversion, reference counting, and Python-to-C++ data marshalling. The majority of CPU time in native inference is spent in this framework glue code rather than actual tree traversal.
269+
270+
By converting the model to a native C++ representation, oneDAL removes this overhead from the prediction hot path, which runs without any Python interpreter involvement.
271+
272+
### 2. Vectorized Tree Traversal
273+
274+
oneDAL uses SIMD instructions (AVX2/AVX-512) to traverse decision trees. Instead of scalar node-by-node comparisons, it processes multiple tree nodes or observations in parallel using vector gather and compare operations. This means the actual tree traversal computation is concentrated in a tight, optimized loop rather than being spread across many small framework functions.
275+
276+
### 3. Reduced Kernel and Synchronization Overhead
277+
278+
Native frameworks spend a notable portion of time in kernel space due to Python GIL contention and threading layer interactions (syscalls, thread scheduling, locks). oneDAL minimizes this by keeping execution in user space with efficient thread parallelism.
253279

254280
## Configuration Recommendations
255281

@@ -259,7 +285,7 @@ Tree structures are blocked in memory so that a subset of trees and a block of o
259285
| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
260286
| Batch Size | oneDAL performs well across batch sizes, with the largest advantage at batch size = 1 (online inference) |
261287
| NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access |
262-
| daal4py Version | Use daal4py 2023.2 or newer (required for missing values support). Each release includes additional optimizations and bug fixes, so the latest version is recommended |
288+
| scikit-learn-intelex Version | Use the latest version of `scikit-learn-intelex` for best performance, newest model support, and bug fixes |
263289

264290
### Scaling Inference on Multi-Socket Systems
265291

@@ -269,7 +295,7 @@ On multi-socket Intel Xeon systems, there are two key decisions that significant
269295

270296
A single daal4py process uses internal threading (TBB/OpenMP) to parallelize across available cores. Alternatively, you can run multiple independent OS-level processes, each pinned to a separate NUMA node with its own copy of the model and data. These approaches offer different tradeoffs.
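
As a minimal sketch of the multi-process option, the helper below builds one `numactl` command line per NUMA node. The function name `numactl_commands` is hypothetical, and the sketch assumes that the physical cores of node `k` are numbered contiguously starting at `k * cores_per_node`. That assumption does not hold on every system, so check the actual layout (for example with `numactl --hardware`) before relying on it. The script name `my_inference.py` and its `--shard` flag follow the launch commands used in this guide.

```python
def numactl_commands(num_nodes: int, cores_per_node: int, script: str = "my_inference.py") -> list[str]:
    """Build one pinned launch command per NUMA node.

    Assumes contiguous physical core IDs per node (an assumption of this
    sketch, not something numactl guarantees).
    """
    commands = []
    for node in range(num_nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        commands.append(
            f"numactl --localalloc --physcpubind={first}-{last} "
            f"python {script} --shard={node}"
        )
    return commands


# Four workers with 32 physical cores each
for command in numactl_commands(4, 32):
    print(command)
```

Each command can then be started in the background from a shell, so that every worker holds its own copy of the model and data on its local node.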
271297

272-
Testing on a 4-NUMA-node Intel Xeon Platinum 8592+ (200K rows, 24 features, 100 trees, `numactl --localalloc`) showed:
298+
Testing on a 4-NUMA-node Intel Xeon Platinum 8592+ (`airline-ohe` dataset, 200K rows, 24 features, 100 trees, `numactl --localalloc`) showed:
273299

274300
| Configuration | Throughput (rows/s) | p50 Latency (us) | Scaling |
275301
|:--------------|--------------------:|------------------:|:--------|
@@ -287,10 +313,12 @@ Key observations:
287313
- **Thread scaling is sub-linear** — using 4x the cores in a single process yields only **2.1x** throughput, because cross-socket memory coherency traffic limits scaling.
288314
- **The tradeoff is latency**: thread scaling achieves **lower per-request latency** (1,230 us at 128 cores) because all cores collaborate on each prediction. Process scaling maintains a fixed latency (~2,000 us per worker, 32 cores each) but delivers **higher aggregate throughput**.
289315

290-
#### Hyper-threading Hurts Performance
316+
#### Hyper-threading Can Hurt Performance
291317

292318
daal4py's AVX-512 vectorized tree traversal is [backend-bound](https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html) — whether the bottleneck is core execution units or memory bandwidth, adding hyperthreads increases resource contention on the shared physical core, harming performance.
293319

320+
> **Cloud instance note:** On AWS and GCP, each vCPU does not necessarily map to a hyperthread. Smaller instance sizes use soft partitioning, so you may not know how many physical cores vs. hyperthreads you are getting. The guidance below applies most directly to bare-metal or dedicated-host instances where the physical topology is known. On shared instances, benchmark with your specific instance size to determine whether pinning provides a benefit.
321+
294322
| Configuration (1 NUMA node) | Throughput (rows/s) | p50 Latency (us) |
295323
|:-----------------------------|--------------------:|------------------:|
296324
| 32 physical cores only (`--physcpubind=0-31`) | ~18M | ~2,000 |
@@ -319,14 +347,4 @@ numactl --localalloc --physcpubind=96-127 python my_inference.py --shard=3 &
319347

320348
**Always pin to physical cores** — use `--physcpubind` with physical core IDs, not `--cpunodebind` which includes hyperthread siblings. On systems where HT cannot be disabled in BIOS, explicit `--physcpubind` ranges are essential.
321349

322-
#### Memory Allocator
323-
324-
Alternative memory allocators such as jemalloc or tcmalloc can sometimes improve performance over the default glibc malloc. It is recommended to test with these enabled to see if either provides a benefit for your workload:
325350

326-
```bash
327-
# jemalloc
328-
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python my_inference.py
329-
330-
# tcmalloc
331-
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python my_inference.py
332-
```
