ENH - Add configurable distance metric for fuzzy join nearest‑neighbor matching#1861
Conversation
|
hello @sabasiddique1! The CI doesn't pass, could you fix it please? It looks like it's because of the styling, it's explained how to deal with it here in the contribution guide. (I assumed it was still ongoing to I didn't have a look at the full code changes, but I still put a couple of comments on the tests in hope that it will alleviate the number of back and forth to iterate on the PR) |
|
|
||
| c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True) | ||
| assert ns.shape(c)[0] == len(b) | ||
| def test_fuzzy_join_distance_metrics(df_module): |
There was a problem hiding this comment.
There was a problem hiding this comment.
Got it, I’ll parameterize the metric cases and assert the actual matched values (not just shape) so we verify that the metric parameter changes the result.
There was a problem hiding this comment.
This comment was marked as resolved, but the actual problem was not addressed
| left, right, on="A", suffix="r", metric='euclidean', add_match_info=False | ||
| ) | ||
| assert ns.shape(result_euclidean)[0] == 2 | ||
| assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br |
There was a problem hiding this comment.
The comment # A, Ar, Br is unclear to me, what do you mean here?
There was a problem hiding this comment.
Thanks for pointing that out, I’ll remove or reword the ‘# A, Ar, Br’ comment for clarity.
There was a problem hiding this comment.
This comment was also resolved without addressing the problem
There was a problem hiding this comment.
Thanks a lot for the PR, @sabasiddique1
I wrote some additional early comments to add to what @MarieSacksick has already mentioned
040157a to
1a599ab
Compare
|
Can we add the ‘no changelog needed’ label for this PR, or should I add file? |
There should be an entry in the "Changes" section of the changelog. All user-facing changes should be reported, and adding a parameter is a user-facing change. |
rcap107
left a comment
There was a problem hiding this comment.
Hi @sabasiddique1, there are still a few things to fix. Given the other comments, I thought some of the point had already been addressed. Maybe you forgot to commit your changes?
|
|
||
| c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True) | ||
| assert ns.shape(c)[0] == len(b) | ||
| def test_fuzzy_join_distance_metrics(df_module): |
There was a problem hiding this comment.
This comment was marked as resolved, but the actual problem was not addressed
| left, right, on="A", suffix="r", metric='euclidean', add_match_info=False | ||
| ) | ||
| assert ns.shape(result_euclidean)[0] == 2 | ||
| assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br |
There was a problem hiding this comment.
This comment was also resolved without addressing the problem
| assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br | ||
|
|
||
| # Test with cosine | ||
| result_cosine = fuzzy_join( |
There was a problem hiding this comment.
The actual value of the result should be tested against the expected value as a sanity check, for this and all the other metrics.
| (using the specified metric) computed to find, for each main table row, its | ||
| nearest neighbor within the auxiliary table. | ||
|
|
||
| Insert some columns whose names start with skrub_Joiner containing |
There was a problem hiding this comment.
This section of the docstring looks like an incorrect copypaste, why is it here?
| :class:`TfidfVectorizer`. | ||
| :pr:`1819` by :user:`Eloi Massoulié <emassoulie>` | ||
| - Added a ``metric`` parameter to :func:`fuzzy_join` and :class:`Joiner` to configure | ||
| the nearest-neighbor distance used for matching. |
There was a problem hiding this comment.
Please mention that metric can be one of the metrics supported by sklearn.neighbors.NearestNeighbors, with a reference to the its docstring.
07646ba to
bccc4d9
Compare
|
Hi @sabasiddique1, I've updated the pr manually to avoid trouble with possible merge conflicts. I'm going to review this again today (or tomorrow at the latest). After that we can see if there is still work to do, or if it can be merged. |
rcap107
left a comment
There was a problem hiding this comment.
The PR looks good to me, thanks a lot @sabasiddique1!
I'll leave this open for a few more days in case other maintainers want to take a look, then I'll merge it.
|
Merging this, thanks @sabasiddique1 |

REF: #1847
Goal
Allow users to choose the distance metric used for nearest‑neighbor matching in
fuzzy_join/Joiner, enabling alternatives like cosine or manhattan for different data types.Context / Issue
Addresses enhancement request to support non‑Euclidean distance metrics for fuzzy join nearest neighbors.
What changed (step‑by‑step)
metricparameter tofuzzy_joinandJoinerwith default'euclidean'.metricthrough the matching classes tosklearn.neighbors.NearestNeighbors.Tests
pytest skrub/tests/test_fuzzy_join.py -v