Skip to content

SimHash encapsulation#77

Merged
itsibitzi merged 7 commits into
mainfrom
sc-20250730-sim-hash
Aug 1, 2025
Merged

SimHash encapsulation#77
itsibitzi merged 7 commits into
mainfrom
sc-20250730-sim-hash

Conversation

@itsibitzi
Copy link
Copy Markdown
Contributor

@itsibitzi itsibitzi commented Jul 30, 2025

Motivation

We previously exposed some internal details with how the sim hashes are configured in the geo_filters create. It would be preferable to not expose this configuration as constants (which need to have math applied to them to be useful 😄) and instead provide functions which express some more concrete usage patterns.

We needed to expose the SIM_BUCKETS constant so that users could calculate the number of required matches when searching for geofilters using SimHashes. Rather than requiring them to manually calculate this I've changed the sim_hash_search function to return the minimum number of matching SimHashes required before another filter can be considered a match.

Implementation

Given the SimHashIterator always decrements its prev_bucket_id by 1 until it reaches zero then the initial value of this is equivalent to the length of the iterator. I've just put this into a size_hint and implemented ExactSizeIterator for SimHashIterator. This allows us to get the number of SimHashes that the iterator will produce.

Use this length or the length of the range produced by sim_hash_range to get the number of SimHashes the iterator will produce.

Question to reviewer

Is the math such that the length of the sim_hash_range is always shorter than the number of SimHashes produces by sim_hashes for all possible configurations of a geofilter?

@itsibitzi itsibitzi marked this pull request as ready for review August 1, 2025 12:51
@itsibitzi itsibitzi requested a review from a team as a code owner August 1, 2025 12:51
Copilot AI review requested due to automatic review settings August 1, 2025 12:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR encapsulates the internal configuration details of SimHash by making previously public constants private and providing a more user-friendly API. Instead of requiring users to manually calculate the minimum number of matching SimHashes needed for searching, the sim_hash_search function now returns this value directly.

  • Made SIM_BUCKET_SIZE and SIM_BUCKETS constants private to hide internal implementation details
  • Enhanced sim_hashes_search method to return both the iterator and the calculated minimum matches required
  • Implemented ExactSizeIterator for SimHashIterator to enable length calculations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/geo_filters/src/diff_count/sim_hash.rs Made constants private, enhanced search API to return minimum matches, implemented ExactSizeIterator
crates/geo_filters/src/diff_count.rs Removed public exports of now-private constants

Comment thread crates/geo_filters/src/diff_count/sim_hash.rs
Comment thread crates/geo_filters/src/diff_count/sim_hash.rs Outdated
@itsibitzi itsibitzi marked this pull request as draft August 1, 2025 12:57
@itsibitzi itsibitzi marked this pull request as ready for review August 1, 2025 13:10
let expected_min_matches = actual_count
.saturating_sub(expected_diff)
.max(SIM_BUCKETS / 2);
assert_eq!(min_matches, expected_min_matches)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? This is exactly the same code as in the implementation.
I don't think that test is very useful then...

Copy link
Copy Markdown
Contributor Author

@itsibitzi itsibitzi Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using it when experimenting if the length of the iterator differs to the length of the range. I left it in because I thought it might catch regressions in changes to sim_hashes_search.

Will remove it for now.

@itsibitzi itsibitzi merged commit 80ae11c into main Aug 1, 2025
7 checks passed
@itsibitzi itsibitzi deleted the sc-20250730-sim-hash branch August 1, 2025 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants