@@ -16,18 +16,14 @@ Maximum-performance **MinHash** library for .NET 10, designed for fast approxima
1616- **Thread-safe reads** — `MinHasher` and `MinHashSignature` are safe to share across threads
1717- **`MinHashIndex`** — in-memory similarity-search index with threshold filtering and sorted results
1818
19- ---
20-
2119## Installation
2220
2321```shell
2422dotnet add package Atulin.MinHash
25- ```
23+ ````
2624
2725Requires **.NET 10.0** or later.
2826
29- ---
30-
3127## Quick Start
3228
3329```csharp
@@ -42,11 +38,9 @@ uint[] sigB = hasher.ComputeSignature("the quick brown fox leaps over the sleepy
4238
4339// 3. Estimate Jaccard similarity
4440double similarity = MinHasher.EstimateJaccard(sigA, sigB);
45- Console.WriteLine($"Similarity: {similarity:P1}"); // e.g. "Similarity: 62.5%"
41+ Console.WriteLine($"Similarity: {similarity:P1}");
4642```
4743
48- ---
49-
5044## Usage
5145
5246### `MinHasher` — Core Engine
@@ -57,9 +51,9 @@ var hasher = new MinHasher();
5751
5852// Custom configuration
5953var hasher = new MinHasher(
60-     signatureSize: 256, // More functions → higher accuracy, more memory
61-     shingleSize: 3, // Character n-gram length; 3–5 works well for most text
62-     seed: 42); // Deterministic parameter generation
54+     signatureSize: 256,
55+     shingleSize: 3,
56+     seed: 42);
6357```
6458
6559#### `ComputeSignature` — allocating overload
@@ -71,7 +65,6 @@ uint[] signature = hasher.ComputeSignature("hello world");
7165#### `ComputeSignatureTo` — zero-allocation overload
7266
7367```csharp
74- // Reuse a pre-allocated buffer — no heap allocation in the hot path
7568uint[] buffer = new uint[hasher.SignatureSize];
7669hasher.ComputeSignatureTo("hello world", buffer);
7770```
@@ -82,14 +75,8 @@ hasher.ComputeSignatureTo("hello world", buffer);
8275double j = MinHasher.EstimateJaccard(sigA, sigB); // 0.0 – 1.0
8376```
8477
85- Both spans must have the same length; an `ArgumentException` is thrown otherwise.
86-
87- ---
88-
8978### `MinHashSignature` — Immutable Wrapper
9079
91- Wraps a raw `uint[]` signature to provide an ergonomic, value-type API:
92-
9380```csharp
9481var hasher = new MinHasher(signatureSize: 128);
9582
@@ -102,94 +89,71 @@ Console.WriteLine(sigA.Length); // 128
10289ReadOnlySpan<uint> raw = sigA.Span;
10390```
10491
105- ---
106-
10792### `MinHashIndex` — Similarity Search
10893
109- Index a collection of documents and query by approximate Jaccard similarity:
110-
11194```csharp
11295var hasher = new MinHasher(signatureSize: 128);
11396var index = new MinHashIndex(hasher);
11497
115- // Add documents
11698index.Add("doc-1", "the quick brown fox jumps over the lazy dog");
11799index.Add("doc-2", "a fast auburn fox leaps across a sleepy hound");
118100index.Add("doc-3", "completely unrelated text about cooking pasta");
119101
120- // Query: returns all entries with similarity ≥ threshold, sorted descending
121102var results = index.Query("quick fox jumps over dog", threshold: 0.3);
122103
123104foreach (var (key, similarity) in results)
124105    Console.WriteLine($"{key}: {similarity:P1}");
125-
126- // Example output:
127- // doc-1: 68.8%
128- // doc-2: 35.9%
129106```
130107
131- > **Note:** `MinHashIndex` is not thread-safe for concurrent writes. Reads (`Query`) may be parallelised safely once the index is fully populated.
132-
133- ---
134-
135108## Algorithm
136109
137- 1. Decompose text into overlapping **k-character shingles** (character n-grams).
110+ 1. Decompose text into overlapping **$k$-character shingles** (character n-grams).
1381112. Hash each shingle with **xxHash32** over its raw UTF-16 bytes.
139- 3. Apply **`numHashFunctions` universal hashes**: `h_i(x) = (aᵢ·x + bᵢ) mod (2³¹−1)`.
140- 4. `Signature[i]` = **minimum** over all shingles of `hᵢ(shingle)`.
141- 5. **Jaccard(A, B) ≈ |{i : sigA[i] == sigB[i]}| / numHashFunctions**.
142-
143- The Mersenne-prime modulo `(2³¹−1)` is computed with a fast bitwise fold instead of integer division.
112+ 3. Apply **$n$ universal hash functions**: $h_i(x) = (a_i x + b_i) \bmod (2^{31} - 1)$
113+ 4. Compute signature: $\text{Signature}[i] = \min_{s \in S} h_i(s)$
114+ 5. Estimate similarity: $J(A, B) \approx \frac{|\{i : \text{sig}_A[i] = \text{sig}_B[i]\}|}{n}$
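
The five steps above can be sketched in plain C#. This is an illustration under stated assumptions, not the library's implementation: `MinHashSketch` and its members are hypothetical names, FNV-1a stands in for xxHash32 so the block has no package dependency, the coefficients $a_i$, $b_i$ are drawn from `System.Random`, and a plain `%` replaces the bitwise fold.

```csharp
using System;

// Hypothetical illustration type; not part of the Atulin.MinHash package.
public static class MinHashSketch
{
    private const ulong MersennePrime = (1UL << 31) - 1; // 2^31 - 1

    // Stand-in shingle hash: FNV-1a over the UTF-16 bytes of the shingle.
    // The README names xxHash32 for this step; FNV-1a is swapped in here.
    private static uint HashShingle(ReadOnlySpan<char> shingle)
    {
        uint h = 2166136261;
        foreach (char c in shingle)
        {
            unchecked
            {
                h = (h ^ (byte)c) * 16777619;        // low byte
                h = (h ^ (byte)(c >> 8)) * 16777619; // high byte
            }
        }
        return h;
    }

    public static uint[] ComputeSignature(string text, int n = 128, int k = 3, int seed = 42)
    {
        // Step 3 parameters: a_i in [1, p-1], b_i in [0, p-1], fixed by the seed.
        var rng = new Random(seed);
        var a = new ulong[n];
        var b = new ulong[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = (ulong)rng.Next(1, int.MaxValue);
            b[i] = (ulong)rng.Next(0, int.MaxValue);
        }

        var signature = new uint[n];
        Array.Fill(signature, uint.MaxValue);

        // Steps 1-2: slide a k-character window over the text and hash each shingle.
        for (int start = 0; start + k <= text.Length; start++)
        {
            ulong x = HashShingle(text.AsSpan(start, k));

            // Steps 3-4: keep the minimum of h_i over all shingles.
            for (int i = 0; i < n; i++)
            {
                uint h = (uint)((a[i] * x + b[i]) % MersennePrime); // fits in 64 bits
                if (h < signature[i]) signature[i] = h;
            }
        }
        return signature;
    }

    // Step 5: fraction of positions where the two signatures agree.
    public static double EstimateJaccard(ReadOnlySpan<uint> x, ReadOnlySpan<uint> y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Signatures must have the same length.");
        if (x.Length == 0) return 0.0;

        int matches = 0;
        for (int i = 0; i < x.Length; i++)
            if (x[i] == y[i]) matches++;
        return (double)matches / x.Length;
    }
}
```

Two signatures are only comparable when they were computed with the same `n`, `k`, and `seed`, because the seed fixes the hash family. Text shorter than `k` characters produces no shingles, so every slot keeps its initial `uint.MaxValue`.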
144115
145- ---
116+ The Mersenne-prime modulo $2^{31} - 1$ is computed with a fast bitwise fold instead of integer division.
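
One common way to write such a fold relies on $2^{31} \equiv 1 \pmod{2^{31} - 1}$, so the bits above position 31 can simply be added back onto the low 31 bits. The sketch below shows that general technique; `MersenneFold` is a hypothetical name and the library's actual code may differ.

```csharp
using System;

// Hypothetical illustration type; shows the general folding technique only.
public static class MersenneFold
{
    private const ulong P = (1UL << 31) - 1; // 2^31 - 1

    // Reduces any 64-bit value modulo 2^31 - 1 without a division instruction.
    public static uint Mod(ulong x)
    {
        // Because 2^31 ≡ 1 (mod P), x ≡ (x & P) + (x >> 31).
        x = (x & P) + (x >> 31); // now below 2^34
        x = (x & P) + (x >> 31); // now at most 2^31 + 6, i.e. below 2P
        return x >= P ? (uint)(x - P) : (uint)x; // single conditional subtract
    }
}
```

Two folds are enough for a full 64-bit input, which covers $a_i x + b_i$ when $a_i, b_i < 2^{31}$ and $x < 2^{32}$.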
146117
147118## Benchmarks
148119
149120Measured on **AMD Ryzen 9 9900X**, .NET 10.0.6, x64 RyuJIT (AVX512/AVX2 available), BenchmarkDotNet v0.15.8.
150121
151122### Benchmarks on 151-character text
152123
153- | Method | Signature Size | Mean | Allocated |
154- |---|---:|---:|---:|
155- | `ComputeCharSignature` | 128 | 16.72 µs | 536 B |
156- | `ComputeCharSignatureInto` | 128 | 51.53 µs | - |
157- | `ComputeWordSignature` | 128 | 3.37 µs | 1144 B |
158- | `ComputeWordSignatureInto` | 128 | 9.70 µs | 608 B |
159- | `EstimateJaccard` | 128 | 5.43 ns | - |
160- | `ComputeCharSignature` | 256 | 32.45 µs | 1048 B |
161- | `ComputeCharSignatureInto` | 256 | 101.96 µs | - |
162- | `ComputeWordSignature` | 256 | 6.07 µs | 1656 B |
163- | `ComputeWordSignatureInto` | 256 | 18.06 µs | 608 B |
164- | `EstimateJaccard` | 256 | 9.67 ns | - |
165-
166- `EstimateJaccard` is fully SIMD-accelerated and **zero-allocation** at any signature size. The jump from 128→256 reflects processing two AVX2 vector-width batches instead of one.
167-
168- > All benchmarks were run with `dotnet run -c Release`. Source: [`MinHash.Benchmark/MinHasherBench.cs`](MinHash.Benchmark/MinHashBench.cs).
169-
170- ---
124+ | Method | Signature Size | Mean | Allocated |
125+ | -------------------------- | -------------: | --------: | --------: |
126+ | `ComputeCharSignature` | 128 | 16.72 µs | 536 B |
127+ | `ComputeCharSignatureInto` | 128 | 51.53 µs | - |
128+ | `ComputeWordSignature` | 128 | 3.37 µs | 1144 B |
129+ | `ComputeWordSignatureInto` | 128 | 9.70 µs | 608 B |
130+ | `EstimateJaccard` | 128 | 5.43 ns | - |
131+ | `ComputeCharSignature` | 256 | 32.45 µs | 1048 B |
132+ | `ComputeCharSignatureInto` | 256 | 101.96 µs | - |
133+ | `ComputeWordSignature` | 256 | 6.07 µs | 1656 B |
134+ | `ComputeWordSignatureInto` | 256 | 18.06 µs | 608 B |
135+ | `EstimateJaccard` | 256 | 9.67 ns | - |
171136
172137## Configuration Guide
173138
174- | Parameter | Default | Recommendation |
175- | ----------------- | ---------| -------------------------------------------------------- |
176- | `signatureSize` | 128 | 128 for ~97% accuracy; 256 for ~99% accuracy |
177- | `shingleSize` | 3 | 3–4 for short texts; 5 for paragraphs/documents |
178- | `seed` | 0xDEADBEEF | Change only if you need independent hash families |
139+ | Parameter | Default | Recommendation |
140+ | --------------- | ------------ | ------------------------------------------------- |
141+ | `signatureSize` | 128 | 128 for ~97% accuracy; 256 for ~99% accuracy |
142+ | `shingleSize` | 3 | 3–4 for short texts; 5 for longer documents |
143+ | `seed` | `0xDEADBEEF` | Change only if you need independent hash families |
179144
180- **Accuracy vs. size trade-off:** Jaccard estimation error is approximately `1/√(signatureSize)`. At 128 functions the expected error is ≈ 8.8%; at 256 it drops to ≈ 6.25%.
145+ **Accuracy vs. size trade-off:** Jaccard estimation error is approximately $\frac{1}{\sqrt{n}}$:
181146
182- ---
147+ * At $n = 128$: error $\approx 8.8\%$
148+ * At $n = 256$: error $\approx 6.25\%$
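
The two figures above follow directly from the $\frac{1}{\sqrt{n}}$ rule, and a short helper makes it easy to check other signature sizes. This is a sketch with hypothetical names (`MinHashError`, `RuleOfThumb`, `StandardError`), not part of the package. The second method assumes the idealized model in which each signature slot matches independently with probability $J$; under that model the estimator's standard deviation is $\sqrt{J(1-J)/n}$, which is at most half of the rule-of-thumb value.

```csharp
using System;

// Hypothetical helper for exploring the accuracy vs. size trade-off.
public static class MinHashError
{
    // Rule of thumb quoted in the README: error ≈ 1 / sqrt(n).
    public static double RuleOfThumb(int n) => 1.0 / Math.Sqrt(n);

    // Standard deviation of the estimate when each of the n slots
    // matches independently with probability j (idealized assumption).
    public static double StandardError(double j, int n) => Math.Sqrt(j * (1.0 - j) / n);
}
```

For example, `MinHashError.RuleOfThumb(256)` evaluates to `0.0625`, matching the 6.25% figure, while `MinHashError.StandardError(0.5, 256)` gives `0.03125` for the worst case $J = 0.5$.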
183149
184150## Thread Safety
185151
186- | Type | Read | Write |
187- | -------------------| -------| -------|
188- | `MinHasher` | ✅ Safe | ✅ Safe (stateless after construction) |
189- | `MinHashSignature` | ✅ Safe | N/A (immutable) |
190- | `MinHashIndex` | ✅ Safe | ❌ Not safe for concurrent `Add` calls |
191-
192- ---
152+ | Type | Read | Write |
153+ | ------------------ | ------ | ------------------------------- |
154+ | `MinHasher` | ✅ Safe | ✅ Safe |
155+ | `MinHashSignature` | ✅ Safe | N/A |
156+ | `MinHashIndex` | ✅ Safe | ❌ Not safe for concurrent `Add` |
193157
194158## License
195159