Fix encoding OOV category. #12193

Merged
trivialfis merged 2 commits into dmlc:master from trivialfis:fix-cat-oov
Apr 30, 2026

Conversation

@trivialfis
Member

Close #12189

Contributor

Copilot AI left a comment

Pull request overview

Fixes a correctness bug in the ordinal (auto) re-coder for native categorical features where out-of-vocabulary (OOV) categories could be silently mapped to an in-range training category when lower_bound returned an insertion point rather than an exact match.
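
The failure mode described above can be reproduced outside XGBoost. The following is a minimal Python sketch, not the project's C++ code: `bisect_left` stands in for `lower_bound`, and the category list, the function name `buggy_encode`, and the `-1` sentinel are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical sorted training categories (not taken from the PR).
train_cats = ["apple", "cherry", "mango"]


def buggy_encode(value, sorted_cats):
    """Return the lower-bound index without verifying an exact match."""
    idx = bisect_left(sorted_cats, value)
    # Only values that sort past the end are flagged as unknown;
    # an unseen value that sorts inside the range slips through.
    if idx == len(sorted_cats):
        return -1
    return idx


known = buggy_encode("cherry", train_cats)         # exact match
oov_in_range = buggy_encode("banana", train_cats)  # unseen, sorts between apple and cherry
oov_past_end = buggy_encode("zebra", train_cats)   # unseen, sorts after mango
```

Here `"banana"` was never seen in training, yet it receives the same code as `"cherry"` because the insertion point happens to be a valid index.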

Changes:

  • Add exact-match verification after lower_bound in the CPU and CUDA SearchSorted implementations (string and numeric) so that OOV categories whose insertion point falls inside the training range are detected.
  • Add CPU and GPU Python tests covering OOV categories whose sorted insertion point falls within the training set.
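
The shape of the fix can be sketched in Python under the same assumptions: `bisect_left` stands in for `lower_bound`, and `search_sorted` and `NOT_FOUND` are illustrative names rather than the identifiers used in `ordinal.h` or `ordinal.cuh`. How the real caller reacts to a miss (the tests validate error behavior) is not modeled here.

```python
from bisect import bisect_left

NOT_FOUND = -1  # hypothetical sentinel for "category not seen in training"


def search_sorted(value, sorted_cats):
    """Binary search that reports a hit only on an exact match."""
    idx = bisect_left(sorted_cats, value)
    # The equality check is the essential step: a lower bound is only an
    # insertion point until the element at that index is confirmed equal.
    if idx < len(sorted_cats) and sorted_cats[idx] == value:
        return idx
    return NOT_FOUND


# Hypothetical training categories, one numeric set and one string set.
numeric_cats = [3, 7, 11]
string_cats = ["apple", "cherry", "mango"]
```

With this check, `search_sorted(5, numeric_cats)` and `search_sorted("banana", string_cats)` both return `NOT_FOUND`, while values present in training keep their sorted index.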

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/encoder/ordinal.h | Adds post-lower_bound equality check to avoid returning an insertion point as a match (CPU path). |
| src/encoder/ordinal.cuh | Mirrors the CPU fix on the CUDA path for both string and numeric category searches. |
| python-package/xgboost/testing/ordinal.py | Introduces run_cat_oov_in_range test helper validating error behavior for in-range OOV categories. |
| tests/python/test_ordinal.py | Wires the new OOV-in-range test into the CPU test suite. |
| tests/python-gpu/test_gpu_ordinal.py | Wires the new OOV-in-range test into the GPU test suite. |

@trivialfis trivialfis requested a review from RAMitchell April 29, 2026 19:49
Member

@RAMitchell RAMitchell left a comment

I am still not super happy about us doing recoding.

@trivialfis
Member Author

> I am still not super happy about us doing recoding.

Yeah, at some point I gave up arguing against that.

@trivialfis trivialfis merged commit 661e2ec into dmlc:master Apr 30, 2026
84 checks passed
@trivialfis trivialfis deleted the fix-cat-oov branch April 30, 2026 11:27

Development

Successfully merging this pull request may close these issues.

[bug] SearchSorted missing equality check causes OOV categories to silently map to wrong training category

3 participants