[enc] Support training continuation. #11598
Merged
Pull Request Overview
This PR implements training continuation support for categorical encoders, enabling XGBoost to handle category encodings that change between training and prediction. It adds support for the `__arrow_c_device_array__` interface in cuDF 25.06 and handles all unsigned integer types for categorical features.
- Introduce support for training continuation with categorical features by allowing reference categories for re-coding
- Add support for all unsigned integer types (uint8_t, uint16_t, uint32_t, uint64_t) in categorical encoding
- Implement `__arrow_c_device_array__` interface support for cuDF 25.06 compatibility
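
The re-coding idea in the first bullet can be illustrated with a small, library-free sketch. It is not XGBoost's implementation: the function name `recode`, the `missing` sentinel of `-1`, and the choice to map unseen categories to that sentinel are all assumptions made for illustration. It shows how codes assigned against a new category list can be translated into the codes of a reference (training-time) category list.

```python
from typing import Hashable, List, Sequence


def recode(
    ref_categories: Sequence[Hashable],
    new_categories: Sequence[Hashable],
    new_codes: Sequence[int],
    missing: int = -1,  # hypothetical sentinel for missing or unseen values
) -> List[int]:
    """Translate codes that index `new_categories` into codes that index
    `ref_categories`."""
    # Position of every reference category, e.g. {"apple": 0, "banana": 1, ...}
    ref_index = {cat: i for i, cat in enumerate(ref_categories)}
    # mapping[j] is the reference code of the j-th new category,
    # or `missing` when the reference has never seen that category.
    mapping = [ref_index.get(cat, missing) for cat in new_categories]
    # Negative input codes are treated as missing values and passed through.
    return [mapping[c] if c >= 0 else missing for c in new_codes]


# Categories seen during training (the reference encoding).
train_cats = ["apple", "banana", "cherry"]
# The same feature at a later stage, with a different order and a new value.
test_cats = ["cherry", "apple", "durian"]
test_codes = [0, 1, 2, 1, -1]

recoded = recode(train_cats, test_cats, test_codes)
print(recoded)
```

Here `cherry` and `apple` are mapped back to their training-time codes, while `durian` and the missing entry both become the sentinel. How the actual encoder treats unseen categories is not stated in this summary.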
Reviewed Changes
Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/python/test_with_polars.py | Update test to use new export_to_arrow=True parameter |
| tests/python/test_ordinal.py | Fix typo in function name and add new test functions |
| tests/python-gpu/test_gpu_ordinal.py | Add mixed device tests and new ordinal test functions |
| tests/cpp/test_serialization.cc | Add test coverage for new unsigned integer array types |
| tests/cpp/data/test_cat_container.h | Update constructor call to use Context parameter |
| src/tree/gpu_hist/evaluate_splits.cu | Remove deprecated CUB version check |
| src/predictor/predict_fn.h | Move accessor classes to cat_container.h |
| src/predictor/gpu_predictor.cu | Refactor to use shared accessor functions |
| src/predictor/cpu_predictor.cc | Update sparse page view interface and use shared accessors |
| src/gbm/gbtree_model.h | Add CatsShared() method for shared pointer access |
| src/encoder/ordinal.h | Add support for all unsigned integer types |
| src/encoder/ordinal.cuh | Improve type checking and error messages |
| src/data/sparse_page_dmatrix.cc | Update CatContainer constructor call |
| src/data/simple_dmatrix.cu | Handle reference categorical data and encoding |
| src/data/simple_dmatrix.cc | Add support for encoded columnar batches |
| src/data/quantile_dmatrix.cu | Update CatContainer constructor |
| src/data/proxy_dmatrix.h | Add reference categories support |
| src/data/proxy_dmatrix.cuh | Handle reference categorical encoding |
| src/data/proxy_dmatrix.cu | Add type utilities and reference categories |
| src/data/proxy_dmatrix.cc | Improve DMatrix creation from proxy |
| src/data/gradient_index.cc | Add template instantiation for new batch type |
| src/data/entry.h | Move entry-related structures to dedicated header |
| src/data/ellpack_page.cu | Add specialization for encoded cuDF adapter |
| src/data/device_adapter.cuh | Implement encoded adapter batch and cuDF improvements |
| src/data/device_adapter.cu | Add reference categories parsing for cuDF |
| src/data/data.cc | Update includes and add template instantiation |
| src/data/columnar.h | Add helper functions for arrow-based categorical data |
| src/data/cat_container.h | Add accessor classes and CPU implementation |
| src/data/cat_container.cuh | Add CUDA implementation for category accessors |
| src/data/cat_container.cu | Update constructor and improve memory handling |
| src/data/cat_container.cc | Add support for all unsigned integer types |
| src/data/array_interface.h | Add TypeStr method declaration |
| src/data/array_interface.cc | Implement TypeStr method for better error messages |
| src/data/adapter.h | Refactor adapter interfaces and add encoding support |
| src/data/adapter.cc | Add reference categories parsing for columnar adapter |
| src/common/type.h | Add GetValueT utility type alias |
| src/common/quantile.cc | Add template instantiation for encoded batch |
| src/common/json.cc | Add support for new unsigned integer array types |
| src/common/hist_util.cuh | Remove deprecated CUB version checks |
| src/common/device_vector.cuh | Add missing include |
| src/common/column_matrix.h | Update includes for moved structures |
| src/c_api/c_api.cc | Update category API function signatures |
| python-package/xgboost/testing/utils.py | Add assert_allclose utility function |
| python-package/xgboost/testing/ordinal.py | Comprehensive test suite for categorical encoding |
| python-package/xgboost/testing/federated.py | Update type annotations |
| python-package/xgboost/testing/dask.py | Update type annotations |
| python-package/xgboost/data.py | Add Categories type support and reference handling |
| python-package/xgboost/core.py | Update Categories class and API calls |
| python-package/xgboost/callback.py | Update type annotations |
| python-package/xgboost/_typing.py | Add new type definitions |
| python-package/xgboost/_data_utils.py | Major refactor for reference categories and arrow support |
| python-package/pyproject.toml | Add CUDA to extension whitelist |
| include/xgboost/json_io.h | Add visitor methods for new array types |
| include/xgboost/json.h | Add new unsigned integer array value kinds |
| doc/python/python_api.rst | Add Categories class to documentation |
cc @rongou.
rongou approved these changes on Jul 31, 2025
arrow. sketch of the container. mapping. aif. Pass it down. Extract the names. store it.rename. sketch of the accessor. Pass in the batch. Cleanup. work on test. copy work on pandas. Use list to keep the order. no print. Start the work on QDM. Cleanup. typing. alias. Fix. import skip. cleanup. outdated test. note. Move initialization. Get cats cudf. In the adapter. Copy. Work on CUDA test. Setters. rename. static. Work on cuDF acc. fix. cleanup. assert. doc string. dispatch. get cats. Check. cleanup. Work on numeric index. Numeric. typo. Remove. Notes. Split up arrow utilities. Notes. Reference. Fix note. Hide. typo. Notes. hide. Test. Work on removing pyarrow as dep. Merge. Move. Move. Use the handle for the host. Work on device. Use host storage. Cleanup. Work on training continuation. device. More. Cleanup the type hints. more checks. Work on predict check. type hints. More. Tests. params. small optimization. modin series.
arrow schema. check. Device array. debug fix. Wait. cleanup. lint. cleanup. Cleanup. Lost track. lint. Note. cleanup. Work on update. cleanup. Polars.
e5d30df to 85bb3a5
Ref #11088
- `__arrow_c_device_array__` in cuDF 25.06.
- `feature_types` for reference encoding.
- `DMatrix`.

todos