MAINT: Don't transfer data to host for train_test_split#3147
MAINT: Don't transfer data to host for train_test_split#3147david-cortes-intel wants to merge 1 commit into
train_test_split#3147Conversation
|
/azp run Nightly |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Nightly |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/intelci: run |
Codecov Report✅ All modified and coverable lines are covered by tests.
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 2 files with indirect coverage changes 🚀 New features to boost your workflow:
|
|
The train test split code doesnt nicely patch over sklearn, id give the daal4py code a look first, and make sure this change has sklearn_check_version usage. i wont throw a changes requested, @ethanglaser could you please give this another look. |
|
I am not deeply familiar with this, just going based on description, so thanks for clarifying @icfaust. In that case do you think it makes sense to not make this change @david-cortes-intel ? |
I don't see why it'd need a version check. Under these changes, it would be supporting array API whenever scikit-learn does so.
If this is not merged, there are errors in scikit-learn tests due to forced data movements from GPU to CPU and back, plus cases where data ends up with different queues than it originally had. Plus it makes the whole thing slower compared to scikit-learn due to the data movements. |
|
This information can be discovered via git blame and via github: The function From the description of the first PR it states 'Brings extension for the device_offload logic and extends support_usm_ndarray', which implicitly references the second PR. To be clear: These purposes were not the same and continue not to be the same. They exhibit different prioritization in support in the codebase dictated by management choices in the past (i.e. DPNP (and previously dpctl tensors) have received prioritized interface support in sklearnex. If they use codepaths which use If you want to follow through on this, I recommend getting a discussion going with @napetrov and @syakov-intel about this prioritization and whether we should deprecate it. As of now this is a removal of a feature which could be solved with What we should not do is do this blindly or in a ham-handed fashion. I do not like having to step in while on parental leave, but I had to as it is a feature loss which can be discovered with a little bit of research into the codebase. Though it seems that this was already changed piecemeal for some daal4py estimators (cats out of the bag as of your PR #2996). cc @ethanglaser ps a lot of this derives from the fact that dpnp did not support array API until recently (IntelPython/dpnp#1755). However we need to maintain |
|
@icfaust Regardless of the scikit-learn version, DPNP arrays support slicing them with numpy arrays by this point, so they work under this function without patching and there's no advantage in doing additional data transfers. As of 2026.0, version constraints on dependencies would not allow installing the latest sklearnex together with older DPNP, so there's no sense in supporting those. |
Don't make assumptions how users may be using thirdy party packages - that's isnt our job.
I think you misunderstood my text. usm_ndarray is still supported by sklearnex outside the array api system, especially the dispatch config (even if sklearn does), and should continue to be until management says otherwise. |
|
@icfaust This function
By the time this code reaches a public release, it will have a requirement on MKL>=2026.0, which has a higher .so version number than previous ones. Older DPNP versions (where slicing with numpy arrays is not implemented) would not work with that new MKL, so it's fair to assume that if DPNP is present, it will be from a newer version, unless users start trying to build things from source and use custom options for DPNP, which I do not think is a common use-case. |
|
again @david-cortes-intel , its not about older dpnp support, its about supporting usm_ndarray work without the array api dispatch. |
@icfaust Are you able to provide an example where there would be usm_ndarray passed to a public-facing function without involving a DPNP array? As it stands currently, there is nothing in the documentation about it, nor tests or examples, without which it's hard to figure out if something is supported. |
not sure what that provides, if you use coverage tools youll see that the non array api path is used for dpnp. i will apologize for the state of the documentation when it comes to this, this is something that was implemented by others where i was as diplomatic as possible in getting it integrated. the standard can be found here https://github.com/IntelPython/DPPY-Spec. . if you want to deprecate it to just array api, again we need management to sign off. in terms of testing, we use the dataframes and queues functionality for this circumstance, which has become so pervasive it has become standard. have you ever thought why none of those are wrapped in an array_api_dispatch? in fact i refactored test_patching.py in order to limit bad developer behavor and guarantee that we were delivering on the promises we were making to users, as testing and green CI is our strongest rules. if you go back and look at those PRs that may help. let me know if you have additional questions. |
That's precisely what this PR is meant to solve.
You can see that the tests still pass after this PR. There's been changes in libraries like DPNP since the time the workaround being removed here was introduced. |
Description
Function
train_test_splitis wrapper insupport_input_format, but by this point, scikit-learn already supports array API for it.Checklist:
Completeness and readability
Testing