fix: improve the error message provided by deduplicate#1996
fix: improve the error message provided by deduplicate#1996MarieSacksick wants to merge 8 commits into
Conversation
52e0c98 to
4508b7d
Compare
8fc774b to
f2a7d6a
Compare
rcap107
left a comment
There was a problem hiding this comment.
Thanks for the PR @MarieSacksick!
| average distance between data points in the first and second cluster. | ||
| n_jobs : int, default=None | ||
| The number of jobs to run in parallel. | ||
| warn : bool, default=False |
There was a problem hiding this comment.
I'm wondering if the warning is needed, we almost never use warnings in skrub 🤔
There was a problem hiding this comment.
True that it's a bit unusual for skrub. But it seems to me that it would be nice to warn the user that no deduplication was done. And I don't want to add a systematic warning that will be ignored and will push the user to remove all warnings...
There was a problem hiding this comment.
i also wouldn't add the warn parameter. I would say either we consider that there are no duplicates to find and return the original list without warnings, or we consider it is a genuine failure and raise a (informative) exception
There was a problem hiding this comment.
This situation might happen regularly, and I don't want to break pipelines while deduplicate could be used to clean data "in case". I'll remove the warn parameter, but add this behavior in documentation.
There was a problem hiding this comment.
Now that I think of it, why is it a function and not a transformer? Is there a way for the user to know the clusters or not?
It looks like I can only deduplicate a whole dataset, and not fit on train and apply on a predict.
Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
|
Now that I think of it, why is it a function and not a transformer? Is there a way for the user to know the clusters or not?
It looks like I can only deduplicate a whole dataset, and not fit on train and apply on a predict.
You are completely right but I would consider it as low-priority because AFAICT deduplicate mostly works on synthetic examples crafted for deduplicate. On the 2-3 real use cases I tried I could not get anything useful out of it.
|
I second this. I don't trust deduplicate either, and I'd prefer if the function were not used or publicized as much as it is. As for the historical reasons, this was added as a function in #339. The discussion includes more context on why it was a function rather than a class. |
Bug Fix Pull Request
Description
Addresses #673
Checklist
How Has This Been Tested?
AI Disclosure