Skip to content

MAINT: Don't transfer data to host for train_test_split#3147

Open
david-cortes-intel wants to merge 1 commit into
uxlfoundation:mainfrom
david-cortes-intel:split_array_api
Open

MAINT: Don't transfer data to host for train_test_split#3147
david-cortes-intel wants to merge 1 commit into
uxlfoundation:mainfrom
david-cortes-intel:split_array_api

Conversation

@david-cortes-intel
Copy link
Copy Markdown
Contributor

Description

Function train_test_split is wrapper in support_input_format, but by this point, scikit-learn already supports array API for it.


Checklist:

Completeness and readability

  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

/azp run Nightly

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

/azp run Nightly

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

/intelci: run

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
azure 79.53% <ø> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sklearnex/model_selection/split.py 100.00% <ø> (ø)

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@david-cortes-intel david-cortes-intel marked this pull request as ready for review April 30, 2026 11:13
@icfaust
Copy link
Copy Markdown
Contributor

icfaust commented Apr 30, 2026

The train test split code doesnt nicely patch over sklearn, id give the daal4py code a look first, and make sure this change has sklearn_check_version usage.

i wont throw a changes requested, @ethanglaser could you please give this another look.

@ethanglaser
Copy link
Copy Markdown
Contributor

I am not deeply familiar with this, just going based on description, so thanks for clarifying @icfaust. In that case do you think it makes sense to not make this change @david-cortes-intel ?

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

The train test split code doesnt nicely patch over sklearn, id give the daal4py code a look first, and make sure this change has sklearn_check_version usage.

i wont throw a changes requested, @ethanglaser could you please give this another look.

I don't see why it'd need a version check. Under these changes, it would be supporting array API whenever scikit-learn does so.

I am not deeply familiar with this, just going based on description, so thanks for clarifying @icfaust. In that case do you think it makes sense to not make this change @david-cortes-intel ?

If this is not merged, there are errors in scikit-learn tests due to forced data movements from GPU to CPU and back, plus cases where data ends up with different queues than it originally had. Plus it makes the whole thing slower compared to scikit-learn due to the data movements.

@icfaust
Copy link
Copy Markdown
Contributor

icfaust commented May 4, 2026

This information can be discovered via git blame and via github:

The function support_input_format was renamed from support_usm_ndarray in this PR: #1861
support_usm_ndarray was introduced in this PR: #1216

From the description of the first PR it states 'Brings extension for the device_offload logic and extends support_usm_ndarray', which implicitly references the second PR.

To be clear: support_input_format serves two purposes 1) it gives higher priority to usm_ndarray datatypes (those with __sycl_usm_array_interface__, the original implementation) and 2) support array api inputs (follows this standard https://data-apis.org/array-api/latest/ and is implemented in the second).

These purposes were not the same and continue not to be the same. They exhibit different prioritization in support in the codebase dictated by management choices in the past (i.e. __sycl_usm_array_interface__ is always evaluated before any array API on purpose). Through multiple maintainers there has been a misunderstanding that seemingly continues to propagate, so to be clear- by supporting usm_ndarray we are NOT doing array API. Its also why we don't use an out-of-the-box implementation of get_namespace (it was just convenient to piggy back on it after sklearn1.2).

DPNP (and previously dpctl tensors) have received prioritized interface support in sklearnex. If they use codepaths which use __sycl_usm_array_interface__ then sklearnex makes no promise about array API. By assuming that it is array API related, we have removed functional capability with older sklearn installs with usm_ndarray data. The way to check this is to check if all methods and estimators work for sklearn 1.0 with a dpnp/dpctl tensor GPU queue. If that doesn't work we are breaking a promise to users and management.

If you want to follow through on this, I recommend getting a discussion going with @napetrov and @syakov-intel about this prioritization and whether we should deprecate it. As of now this is a removal of a feature which could be solved with sklearn_check_version matching when array API was introduced. I am perfectly fine with dropping dpnp support for older sklearn versions, but I think it should be done with their explicit sign off. Ideally we drop this prioritization entirely and stick to array API. As of now I have not seen this discussion or any written confirmation that we are switching tack.

What we should not do is do this blindly or in a ham-handed fashion. I do not like having to step in while on parental leave, but I had to as it is a feature loss which can be discovered with a little bit of research into the codebase. Though it seems that this was already changed piecemeal for some daal4py estimators (cats out of the bag as of your PR #2996).

cc @ethanglaser

ps a lot of this derives from the fact that dpnp did not support array API until recently (IntelPython/dpnp#1755). However we need to maintain __sycl_usm_array_interface__ because it is more exact with SYCL queue usage in a way that array API is not, which could be important in some edge cases.

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

@icfaust Regardless of the scikit-learn version, DPNP arrays support slicing them with numpy arrays by this point, so they work under this function without patching and there's no advantage in doing additional data transfers. As of 2026.0, version constraints on dependencies would not allow installing the latest sklearnex together with older DPNP, so there's no sense in supporting those.

@icfaust
Copy link
Copy Markdown
Contributor

icfaust commented May 5, 2026

@icfaust Regardless of the scikit-learn version, DPNP arrays support slicing them with numpy arrays by this point, so they work under this function without patching and there's no advantage in doing additional data transfers.

Don't make assumptions how users may be using thirdy party packages - that's isnt our job.

As of 2026.0, version constraints on dependencies would not allow installing the latest sklearnex together with older DPNP, so there's no sense in supporting those.

I think you misunderstood my text. usm_ndarray is still supported by sklearnex outside the array api system, especially the dispatch config (even if sklearn does), and should continue to be until management says otherwise.

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

@icfaust This function train_test_split (from scikit-learn) works with DPNP arrays without patching and without array API dispatch already.

Don't make assumptions how users may be using thirdy party packages - that's isnt our job.

By the time this code reaches a public release, it will have a requirement on MKL>=2026.0, which has a higher .so version number than previous ones. Older DPNP versions (where slicing with numpy arrays is not implemented) would not work with that new MKL, so it's fair to assume that if DPNP is present, it will be from a newer version, unless users start trying to build things from source and use custom options for DPNP, which I do not think is a common use-case.

@icfaust
Copy link
Copy Markdown
Contributor

icfaust commented May 5, 2026

again @david-cortes-intel , its not about older dpnp support, its about supporting usm_ndarray work without the array api dispatch.

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

again @david-cortes-intel , its not about older dpnp support, its about supporting usm_ndarray work without the array api dispatch.

@icfaust Are you able to provide an example where there would be usm_ndarray passed to a public-facing function without involving a DPNP array? As it stands currently, there is nothing in the documentation about it, nor tests or examples, without which it's hard to figure out if something is supported.

@icfaust
Copy link
Copy Markdown
Contributor

icfaust commented May 10, 2026

again @david-cortes-intel , its not about older dpnp support, its about supporting usm_ndarray work without the array api dispatch.

@icfaust Are you able to provide an example where there would be usm_ndarray passed to a public-facing function without involving a DPNP array? As it stands currently, there is nothing in the documentation about it, nor tests or examples, without which it's hard to figure out if something is supported.

not sure what that provides, if you use coverage tools youll see that the non array api path is used for dpnp. i will apologize for the state of the documentation when it comes to this, this is something that was implemented by others where i was as diplomatic as possible in getting it integrated. the standard can be found here https://github.com/IntelPython/DPPY-Spec. . if you want to deprecate it to just array api, again we need management to sign off. in terms of testing, we use the dataframes and queues functionality for this circumstance, which has become so pervasive it has become standard. have you ever thought why none of those are wrapped in an array_api_dispatch? in fact i refactored test_patching.py in order to limit bad developer behavor and guarantee that we were delivering on the promises we were making to users, as testing and green CI is our strongest rules. if you go back and look at those PRs that may help. let me know if you have additional questions.

@david-cortes-intel
Copy link
Copy Markdown
Contributor Author

not sure what that provides, if you use coverage tools youll see that the non array api path is used for dpnp.

That's precisely what this PR is meant to solve.

in terms of testing, we use the dataframes and queues functionality for this circumstance, which has become so pervasive it has become standard.

You can see that the tests still pass after this PR. There's been changes in libraries like DPNP since the time the workaround being removed here was introduced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants