Skip to content

ENH - Add configurable distance metric for fuzzy join nearest‑neighbor matching#1861

Merged
rcap107 merged 6 commits into
skrub-data:mainfrom
sabasiddique1:feat-distance-metric-fuzzy-join
Apr 23, 2026
Merged

ENH - Add configurable distance metric for fuzzy join nearest‑neighbor matching#1861
rcap107 merged 6 commits into
skrub-data:mainfrom
sabasiddique1:feat-distance-metric-fuzzy-join

Conversation

@sabasiddique1
Copy link
Copy Markdown
Contributor

@sabasiddique1 sabasiddique1 commented Jan 26, 2026

REF: #1847
Goal
Allow users to choose the distance metric used for nearest‑neighbor matching in fuzzy_join / Joiner, enabling alternatives like cosine or manhattan for different data types.

Context / Issue
Addresses enhancement request to support non‑Euclidean distance metrics for fuzzy join nearest neighbors.

What changed (step‑by‑step)

  1. Added a metric parameter to fuzzy_join and Joiner with default 'euclidean'.
  2. Passed the metric through the matching classes to sklearn.neighbors.NearestNeighbors.
  3. Documented the new parameter in docstrings with common metric examples.
  4. Added tests covering multiple metrics and invalid metric handling.

Tests

  • pytest skrub/tests/test_fuzzy_join.py -v

@MarieSacksick
Copy link
Copy Markdown
Contributor

MarieSacksick commented Jan 28, 2026

hello @sabasiddique1!
Welcome :).

The CI doesn't pass, could you fix it please? It looks like it's because of the styling, it's explained how to deal with it here in the contribution guide.
In the mean time, you can put your PR in draft mode, so that it's clear that it's ongoing work. You can do it at the top right of the PR menu:
image

(I assumed it was still ongoing to I didn't have a look at the full code changes, but I still put a couple of comments on the tests in hope that it will alleviate the number of back and forth to iterate on the PR)

Comment thread skrub/tests/test_fuzzy_join.py Outdated
Comment thread skrub/tests/test_fuzzy_join.py Outdated

c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True)
assert ns.shape(c)[0] == len(b)
def test_fuzzy_join_distance_metrics(df_module):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of repeating the code, you can parameterize it, for instance as it's done here or here.
Also, we want to test the values and not only the shape, to make sure that the input "metric" is taken into account, and that the function doesn't always use the default one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I’ll parameterize the metric cases and assert the actual matched values (not just shape) so we verify that the metric parameter changes the result.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was marked as resolved, but the actual problem was not addressed

Comment thread skrub/tests/test_fuzzy_join.py Outdated
left, right, on="A", suffix="r", metric='euclidean', add_match_info=False
)
assert ns.shape(result_euclidean)[0] == 2
assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment # A, Ar, Br is unclear to me, what do you mean here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out, I’ll remove or reword the ‘# A, Ar, Br’ comment for clarity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was also resolved without addressing the problem

Copy link
Copy Markdown
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR, @sabasiddique1

I wrote some additional early comments to add to what @MarieSacksick has already mentioned

Comment thread skrub/_fuzzy_join.py Outdated
Comment thread skrub/_fuzzy_join.py Outdated
Comment thread skrub/_joiner.py Outdated
Comment thread skrub/_matching.py Outdated
@rcap107 rcap107 changed the title Add configurable distance metric for fuzzy join nearest‑neighbor matching ENH - Add configurable distance metric for fuzzy join nearest‑neighbor matching Jan 29, 2026
@sabasiddique1 sabasiddique1 marked this pull request as draft January 30, 2026 14:16
@sabasiddique1 sabasiddique1 force-pushed the feat-distance-metric-fuzzy-join branch from 040157a to 1a599ab Compare January 30, 2026 16:36
@sabasiddique1
Copy link
Copy Markdown
Contributor Author

sabasiddique1 commented Jan 30, 2026

Can we add the ‘no changelog needed’ label for this PR, or should I add file?

@sabasiddique1 sabasiddique1 marked this pull request as ready for review January 30, 2026 17:03
@rcap107
Copy link
Copy Markdown
Member

rcap107 commented Jan 30, 2026

Can we add the ‘no changelog needed’ label for this PR, or should I add file?

There should be an entry in the "Changes" section of the changelog. All user-facing changes should be reported, and adding a parameter is a user-facing change.

@rcap107 rcap107 added this to the Release 0.8.0 milestone Feb 9, 2026
Copy link
Copy Markdown
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sabasiddique1, there are still a few things to fix. Given the other comments, I thought some of the point had already been addressed. Maybe you forgot to commit your changes?

Comment thread skrub/tests/test_fuzzy_join.py Outdated

c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True)
assert ns.shape(c)[0] == len(b)
def test_fuzzy_join_distance_metrics(df_module):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was marked as resolved, but the actual problem was not addressed

Comment thread skrub/tests/test_fuzzy_join.py Outdated
left, right, on="A", suffix="r", metric='euclidean', add_match_info=False
)
assert ns.shape(result_euclidean)[0] == 2
assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was also resolved without addressing the problem

Comment thread skrub/tests/test_fuzzy_join.py Outdated
assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br

# Test with cosine
result_cosine = fuzzy_join(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual value of the result should be tested against the expected value as a sanity check, for this and all the other metrics.

Comment thread skrub/_joiner.py Outdated
(using the specified metric) computed to find, for each main table row, its
nearest neighbor within the auxiliary table.

Insert some columns whose names start with skrub_Joiner containing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section of the docstring looks like an incorrect copypaste, why is it here?

Comment thread CHANGES.rst Outdated
:class:`TfidfVectorizer`.
:pr:`1819` by :user:`Eloi Massoulié <emassoulie>`
- Added a ``metric`` parameter to :func:`fuzzy_join` and :class:`Joiner` to configure
the nearest-neighbor distance used for matching.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention that metric can be one of the metrics supported by sklearn.neighbors.NearestNeighbors, with a reference to the its docstring.

@sabasiddique1 sabasiddique1 force-pushed the feat-distance-metric-fuzzy-join branch from 07646ba to bccc4d9 Compare February 15, 2026 11:06
@rcap107 rcap107 modified the milestones: Release 0.8.0, Release 0.8.1 Mar 18, 2026
@rcap107
Copy link
Copy Markdown
Member

rcap107 commented Apr 14, 2026

Hi @sabasiddique1, I've updated the pr manually to avoid trouble with possible merge conflicts. I'm going to review this again today (or tomorrow at the latest). After that we can see if there is still work to do, or if it can be merged.

Copy link
Copy Markdown
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me, thanks a lot @sabasiddique1!

I'll leave this open for a few more days in case other maintainers want to take a look, then I'll merge it.

@rcap107
Copy link
Copy Markdown
Member

rcap107 commented Apr 23, 2026

Merging this, thanks @sabasiddique1

@rcap107 rcap107 merged commit bc2c689 into skrub-data:main Apr 23, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH - Allow using a different distance for the nearest neighbors in fuzzy join

3 participants