ENH make Random*Sampler accept dask array and dataframe #777

glemaitre wants to merge 32 commits into scikit-learn-contrib:master from
Conversation
Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-11-08 19:22:54 UTC
This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 (view new alerts on LGTM.com).
TomAugspurger left a comment:
Looks good overall. I think the comment about dask.compute(...) rather than x.compute(), y.compute() is the most important.
Other than that I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.
    _REGISTERED_DASK_CONTAINER = []

    try:
        from dask import array, dataframe
People can have just dask[array] installed (not dataframe) so it's possible to have the array import succeed, but the dataframe import fail. So if you want to support that case those would need to be in separate try / except blocks.
Maybe you instead want from dask import is_dask_collection? That's a bit broader though (it also covers anything implementing dask's collection interface like dask.Bag, xarray.DataArray and xarray.Dataset).
That seems what I wanted :)
    if hasattr(y, "unique"):
        labels = np.asarray(y.unique())
    else:
        labels = np.unique(y).compute()
I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).
Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to fit (dunno if that applies here).
I thought about it. Do you think that having a context manager outside would make sense:

    with set_config(avoid_check=True):
        # some imblearn/scikit-learn/dask code

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.
It might just be easier to have an optional class parameter that applies only for dask arrays.
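That per-estimator alternative can be sketched as follows. Everything here is hypothetical: the class name and the validate_y parameter are illustrative, not part of the imbalanced-learn API; np.unique stands in for the real (potentially expensive) target check.

```python
import numpy as np

class SamplerSketch:
    """Toy sampler with an opt-out flag for the expensive target check.

    ``validate_y`` is a hypothetical parameter name, not part of the
    real imbalanced-learn API.
    """

    def __init__(self, validate_y=True):
        self.validate_y = validate_y
        self.n_checks_run_ = 0

    def fit_resample(self, X, y):
        if self.validate_y:
            # Stand-in for the multi-label / target-type check, which is
            # cheap on numpy arrays but expensive on dask collections.
            self.n_checks_run_ += 1
            np.unique(np.asarray(y))
        return X, y

X, y = np.zeros((4, 2)), np.array([0, 1, 0, 1])
sampler = SamplerSketch(validate_y=False)
sampler.fit_resample(X, y)
print(sampler.n_checks_run_)  # 0: the check was skipped
```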
    force_all_finite=False,

    if is_dask_container(y) and hasattr(y, "to_dask_array"):
        y = y.to_dask_array()
        y.compute_chunk_sizes()
In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a cross_val_score, then I think this would end up loading data multiple times just to compute the chunk sizes.
This is something that I was unsure of, here. If I recall, the issue was that I could not call ravel on the Series, and the easiest way was to always have an array and convert back to a Series reusing the metadata.
However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.
So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.
    if is_dask_container(unique):
        unique, counts = unique.compute(), counts.compute()
As written, this will fully execute the task graph of y twice. Once to compute unique and once to compute counts.
Suggested change:

    -    if is_dask_container(unique):
    -        unique, counts = unique.compute(), counts.compute()
    +    if is_dask_container(unique):
    +        unique, counts = dask.compute(unique, counts)
You'll need to import dask.
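The difference can be sketched as follows (guarded so it also runs without dask installed): dask.compute(unique, counts) merges both results into a single traversal of the task graph, whereas calling .compute() on each result would execute the shared graph of y twice.

```python
import numpy as np

try:
    import dask
    import dask.array as da
except ImportError:  # dask is an optional dependency
    dask = None

y = np.array([0, 0, 0, 1, 1])
if dask is not None:
    dy = da.from_array(y, chunks=2)
    unique, counts = da.unique(dy, return_counts=True)
    # One traversal of the graph for both results, instead of
    # unique.compute() followed by counts.compute().
    unique, counts = dask.compute(unique, counts)
else:
    unique, counts = np.unique(y, return_counts=True)

print(dict(zip(unique.tolist(), counts.tolist())))  # {0: 3, 1: 2}
```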
This pull request introduces 5 alerts when merging d4aabf8 into edd7522 (view new alerts on LGTM.com).

This pull request introduces 5 alerts when merging 58acdf2 into edd7522 (view new alerts on LGTM.com).

This pull request introduces 2 alerts when merging e54c772 into edd7522 (view new alerts on LGTM.com).

This pull request introduces 2 alerts when merging 167fc2a into edd7522 (view new alerts on LGTM.com).
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff            @@
    ##           master     #777      +/-  ##
    =========================================
    + Coverage   96.55%   98.18%    +1.63%
    =========================================
      Files          82       94       +12
      Lines        5140     5900      +760
      Branches        0      515      +515
    =========================================
    + Hits         4963     5793      +830
    + Misses        177      100       -77
    - Partials        0        7        +7

☔ View full report in Codecov by Sentry.
This pull request introduces 2 alerts when merging 20b44c6 into edd7522 (view new alerts on LGTM.com).

This pull request introduces 2 alerts when merging a6e975b into edd7522 (view new alerts on LGTM.com).

This pull request introduces 2 alerts when merging 456c3eb into edd7522 (view new alerts on LGTM.com).
@glemaitre Why didn't you merge this branch with master? Everything seems alright, isn't it?
It is one year old so I don't recall the details. It was only a POC to see what the issues to deal with would be. It would need further work to go ahead.
I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)
POC to see if we can make at least the RandomUnderSampler and RandomOverSampler accept some dask array and dataframe.
Note: