Skip to content

fix: improve the error message provided by deduplicate#1996

Open
MarieSacksick wants to merge 8 commits into
skrub-data:mainfrom
MarieSacksick:fix_error_deduplicate
Open

fix: improve the error message provided by deduplicate#1996
MarieSacksick wants to merge 8 commits into
skrub-data:mainfrom
MarieSacksick:fix_error_deduplicate

Conversation

@MarieSacksick
Copy link
Copy Markdown
Contributor

@MarieSacksick MarieSacksick commented Mar 28, 2026

Bug Fix Pull Request

Description

Addresses #673

Checklist

  • I have read the contributing guidelines
  • I have added tests that verify the bug fix
  • I have added an entry to CHANGES.rst describing the fix
  • My code follows the code style of this project
  • I have checked my code and corrected any misspellings

How Has This Been Tested?

AI Disclosure

  • This PR contains AI-generated code
    • I have tested the code generated in my PR
    • I have read and understood every line that has been generated by the AI agent
    • I can explain what the AI-generated code does

@MarieSacksick MarieSacksick force-pushed the fix_error_deduplicate branch from 52e0c98 to 4508b7d Compare March 28, 2026 14:58
@MarieSacksick MarieSacksick linked an issue Mar 28, 2026 that may be closed by this pull request
@MarieSacksick MarieSacksick force-pushed the fix_error_deduplicate branch from 8fc774b to f2a7d6a Compare April 25, 2026 13:36
@MarieSacksick MarieSacksick marked this pull request as ready for review April 25, 2026 20:15
@rcap107 rcap107 added this to the Release 0.10 milestone May 4, 2026
Copy link
Copy Markdown
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @MarieSacksick!

Comment thread skrub/_deduplicate.py Outdated
Comment thread skrub/_deduplicate.py
average distance between data points in the first and second cluster.
n_jobs : int, default=None
The number of jobs to run in parallel.
warn : bool, default=False
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the warning is needed, we almost never use warnings in skrub 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that it's a bit unusual for skrub. But it seems to me that it would be nice to warn the user that no deduplication was done. And I don't want to add a systematic warning that will be ignored and will push the user to remove all warnings...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also wouldn't add the warn parameter. I would say either we consider that there are no duplicates to find and return the original list without warnings, or we consider it is a genuine failure and raise a (informative) exception

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This situation might happen regularly, and I don't want to break pipelines while deduplicate could be used to clean data "in case". I'll remove the warn parameter, but add this behavior in documentation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that solution 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think of it, why is it a function and not a transformer? Is there a way for the user to know the clusters or not?
It looks like I can only deduplicate a whole dataset, and not fit on train and apply on a predict.

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
@jeromedockes
Copy link
Copy Markdown
Member

jeromedockes commented May 13, 2026 via email

@rcap107
Copy link
Copy Markdown
Member

rcap107 commented May 13, 2026

Now that I think of it, why is it a function and not a transformer? Is there a way for the user to know the clusters or not? It looks like I can only deduplicate a whole dataset, and not fit on train and apply on a predict.
You are completely right but I would consider it as low-priority because AFAICT deduplicate mostly works on synthetic examples crafted for deduplicate. On the 2-3 real use cases I tried I could not get anything useful out of it.

I second this. I don't trust deduplicate either, and I'd prefer if the function were not used or publicized as much as it is.
If anything, I'd rather make the shortcomings clearer, because I feel like users are getting an overly optimistic view of the capabilities of this feature.

As for the historical reasons, this was added as a function in #339. The discussion includes more context on why it was a function rather than a class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DOC - Improve the error message provided by deduplicate

3 participants