SimHash encapsulation#77
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR encapsulates the internal configuration details of SimHash by making previously public constants private and providing a more user-friendly API. Instead of requiring users to manually calculate the minimum number of matching SimHashes needed for searching, the sim_hash_search function now returns this value directly.
- Made
SIM_BUCKET_SIZEandSIM_BUCKETSconstants private to hide internal implementation details - Enhanced
sim_hashes_searchmethod to return both the iterator and the calculated minimum matches required - Implemented
ExactSizeIteratorforSimHashIteratorto enable length calculations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/geo_filters/src/diff_count/sim_hash.rs | Made constants private, enhanced search API to return minimum matches, implemented ExactSizeIterator |
| crates/geo_filters/src/diff_count.rs | Removed public exports of now-private constants |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| let expected_min_matches = actual_count | ||
| .saturating_sub(expected_diff) | ||
| .max(SIM_BUCKETS / 2); | ||
| assert_eq!(min_matches, expected_min_matches) |
There was a problem hiding this comment.
? This is exactly the same code as in the implementation.
I don't think that test is very useful then...
There was a problem hiding this comment.
I was using it when experimenting if the length of the iterator differs to the length of the range. I left it in because I thought it might catch regressions in changes to sim_hashes_search.
Will remove it for now.
Motivation
We previously exposed some internal details with how the sim hashes are configured in the
geo_filterscreate. It would be preferable to not expose this configuration as constants (which need to have math applied to them to be useful 😄) and instead provide functions which express some more concrete usage patterns.We needed to expose the
SIM_BUCKETSconstant so that users could calculate the number of required matches when searching for geofilters usingSimHashes. Rather than requiring them to manually calculate this I've changed thesim_hash_searchfunction to return the minimum number of matchingSimHashes required before another filter can be considered a match.Implementation
Given the
SimHashIteratoralways decrements itsprev_bucket_idby 1 until it reaches zero then the initial value of this is equivalent to the length of the iterator. I've just put this into asize_hintand implementedExactSizeIteratorforSimHashIterator. This allows us to get the number ofSimHashes that the iterator will produce.Use this length or the length of the range produced by
sim_hash_rangeto get the number ofSimHashes the iterator will produce.Question to reviewer
Is the math such that the length of the
sim_hash_rangeis always shorter than the number ofSimHashes produces bysim_hashesfor all possible configurations of a geofilter?