Commit 22f25ef

Update README for formatting and content clarity
1 parent 3f8f7a2 commit 22f25ef

1 file changed

Lines changed: 35 additions & 71 deletions

README.md

@@ -16,18 +16,14 @@ Maximum-performance **MinHash** library for .NET 10, designed for fast approxima
1616
- **Thread-safe reads**: `MinHasher` and `MinHashSignature` are safe to share across threads
1717
- **`MinHashIndex`** — in-memory similarity-search index with threshold filtering and sorted results
1818

19-
---
20-
2119
## Installation
2220

2321
```shell
2422
dotnet add package Atulin.MinHash
25-
```
23+
````
2624

2725
Requires **.NET 10.0** or later.
2826

29-
---
30-
3127
## Quick Start
3228

3329
```csharp
@@ -42,11 +38,9 @@ uint[] sigB = hasher.ComputeSignature("the quick brown fox leaps over the sleepy
4238
4339
// 3. Estimate Jaccard similarity
4440
double similarity = MinHasher.EstimateJaccard(sigA, sigB);
45-
Console.WriteLine($"Similarity: {similarity:P1}"); // e.g. "Similarity: 62.5%"
41+
Console.WriteLine($"Similarity: {similarity:P1}");
4642
```
4743

48-
---
49-
5044
## Usage
5145

5246
### `MinHasher` — Core Engine
@@ -57,9 +51,9 @@ var hasher = new MinHasher();
5751
5852
// Custom configuration
5953
var hasher = new MinHasher(
60-
signatureSize: 256, // More functions → higher accuracy, more memory
61-
shingleSize: 3, // Character n-gram length; 3–5 works well for most text
62-
seed: 42); // Deterministic parameter generation
54+
signatureSize: 256,
55+
shingleSize: 3,
56+
seed: 42);
6357
```
6458

6559
#### `ComputeSignature` — allocating overload
@@ -71,7 +65,6 @@ uint[] signature = hasher.ComputeSignature("hello world");
7165
#### `ComputeSignatureTo` — zero-allocation overload
7266

7367
```csharp
74-
// Reuse a pre-allocated buffer — no heap allocation in the hot path
7568
uint[] buffer = new uint[hasher.SignatureSize];
7669
hasher.ComputeSignatureTo("hello world", buffer);
7770
```
@@ -82,14 +75,8 @@ hasher.ComputeSignatureTo("hello world", buffer);
8275
double j = MinHasher.EstimateJaccard(sigA, sigB); // 0.0 – 1.0
8376
```
8477

85-
Both spans must have the same length; an `ArgumentException` is thrown otherwise.
86-
87-
---
88-
8978
### `MinHashSignature` — Immutable Wrapper
9079

91-
Wraps a raw `uint[]` signature to provide an ergonomic, value-type API:
92-
9380
```csharp
9481
var hasher = new MinHasher(signatureSize: 128);
9582
@@ -102,94 +89,71 @@ Console.WriteLine(sigA.Length); // 128
10289
ReadOnlySpan<uint> raw = sigA.Span;
10390
```
10491

105-
---
106-
10792
### `MinHashIndex` — Similarity Search
10893

109-
Index a collection of documents and query by approximate Jaccard similarity:
110-
11194
```csharp
11295
var hasher = new MinHasher(signatureSize: 128);
11396
var index = new MinHashIndex(hasher);
11497
115-
// Add documents
11698
index.Add("doc-1", "the quick brown fox jumps over the lazy dog");
11799
index.Add("doc-2", "a fast auburn fox leaps across a sleepy hound");
118100
index.Add("doc-3", "completely unrelated text about cooking pasta");
119101
120-
// Query: returns all entries with similarity ≥ threshold, sorted descending
121102
var results = index.Query("quick fox jumps over dog", threshold: 0.3);
122103
123104
foreach (var (key, similarity) in results)
124105
Console.WriteLine($"{key}: {similarity:P1}");
125-
126-
// Example output:
127-
// doc-1: 68.8%
128-
// doc-2: 35.9%
129106
```
130107

131-
> **Note:** `MinHashIndex` is not thread-safe for concurrent writes. Reads (`Query`) may be parallelised safely once the index is fully populated.
132-
133-
---
134-
135108
## Algorithm
136109

137-
1. Decompose text into overlapping **k-character shingles** (character n-grams).
110+
1. Decompose text into overlapping **$k$-character shingles** (character n-grams).
138111
2. Hash each shingle with **xxHash32** over its raw UTF-16 bytes.
139-
3. Apply **`numHashFunctions` universal hashes**: `h_i(x) = (aᵢ·x + bᵢ) mod (2³¹−1)`.
140-
4. `Signature[i]` = **minimum** over all shingles of `hᵢ(shingle)`.
141-
5. **Jaccard(A, B) ≈ |{i : sigA[i] == sigB[i]}| / numHashFunctions**.
142-
143-
The Mersenne-prime modulo `(2³¹−1)` is computed with a fast bitwise fold instead of integer division.
112+
3. Apply **$n$ universal hash functions**: $h_i(x) = (a_i x + b_i) \bmod (2^{31} - 1)$
113+
4. Compute signature: $\text{Signature}[i] = \min_{s \in S} h_i(s)$
114+
5. Estimate similarity: $J(A, B) \approx \frac{|\lbrace i : \text{sig}_A[i] = \text{sig}_B[i] \rbrace|}{n}$
144115

145-
---
116+
The Mersenne-prime modulo $2^{31} - 1$ is computed with a fast bitwise fold instead of integer division.
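
The steps above can be sketched in a few dozen lines of C#. This is an illustrative sketch, not the library's implementation: the names `MinHashSketch`, `ModMersenne31`, and `Fnv1a` are hypothetical, FNV-1a stands in for xxHash32 to keep the example dependency-free, and drawing $a_i$, $b_i$ from `System.Random` is an assumption about how parameters might be generated.

```csharp
using System;

// Illustrative sketch only; names and parameter generation are assumptions.
public static class MinHashSketch
{
    private const ulong MersennePrime = (1UL << 31) - 1; // 2^31 - 1

    // Reduces x modulo 2^31 - 1 without division.
    // Works because 2^31 ≡ 1 (mod 2^31 - 1), so high bits can be added to low bits.
    public static uint ModMersenne31(ulong x)
    {
        x = (x & MersennePrime) + (x >> 31); // first fold
        x = (x & MersennePrime) + (x >> 31); // second fold absorbs the carry
        return (uint)(x >= MersennePrime ? x - MersennePrime : x);
    }

    // FNV-1a over the UTF-16 code units (low byte, then high byte).
    // Stand-in for xxHash32, which the README names as the shingle hash.
    public static uint Fnv1a(ReadOnlySpan<char> s)
    {
        uint h = 2166136261u;
        foreach (char c in s)
        {
            h = (h ^ (uint)(c & 0xFF)) * 16777619u;
            h = (h ^ (uint)(c >> 8)) * 16777619u;
        }
        return h;
    }

    // n = number of hash functions, k = shingle length in characters.
    public static uint[] ComputeSignature(string text, int n, int k, int seed)
    {
        var rng = new Random(seed);
        var a = new ulong[n];
        var b = new ulong[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = (ulong)rng.Next(1, int.MaxValue); // a_i in [1, p - 1]
            b[i] = (ulong)rng.Next(0, int.MaxValue); // b_i in [0, p - 1]
        }

        var signature = new uint[n];
        Array.Fill(signature, uint.MaxValue);

        // Slide a k-character window over the text (steps 1 and 2).
        for (int start = 0; start + k <= text.Length; start++)
        {
            ulong x = ModMersenne31(Fnv1a(text.AsSpan(start, k)));
            for (int i = 0; i < n; i++)
            {
                // Steps 3 and 4: h_i(x) = (a_i * x + b_i) mod p, keep the minimum.
                uint h = ModMersenne31(a[i] * x + b[i]);
                if (h < signature[i]) signature[i] = h;
            }
        }
        return signature;
    }

    // Step 5: fraction of signature slots that agree.
    public static double EstimateJaccard(uint[] sigA, uint[] sigB)
    {
        if (sigA.Length != sigB.Length)
            throw new ArgumentException("Signatures must have the same length.");
        int matches = 0;
        for (int i = 0; i < sigA.Length; i++)
            if (sigA[i] == sigB[i]) matches++;
        return (double)matches / sigA.Length;
    }
}
```

Two signatures built with the same `n`, `k`, and `seed` from identical text agree in every slot, so `EstimateJaccard` returns 1.0 for them. The sketch omits the SIMD and zero-allocation work that the benchmarks measure.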
146117

147118
## Benchmarks
148119

149120
Measured on **AMD Ryzen 9 9900X**, .NET 10.0.6, x64 RyuJIT (AVX512/AVX2 available), BenchmarkDotNet v0.15.8.
150121

151122
### Benchmarks on 151-character text
152123

153-
| Method | Signature Size | Mean | Allocated |
154-
|---|---:|---:|---:|
155-
| `ComputeCharSignature` | 128 | 16.72 µs | 536 B |
156-
| `ComputeCharSignatureInto` | 128 | 51.53 µs | - |
157-
| `ComputeWordSignature` | 128 | 3.37 µs | 1144 B |
158-
| `ComputeWordSignatureInto` | 128 | 9.70 µs | 608 B |
159-
| `EstimateJaccard` | 128 | 5.43 ns | - |
160-
| `ComputeCharSignature` | 256 | 32.45 µs | 1048 B |
161-
| `ComputeCharSignatureInto` | 256 | 101.96 µs | - |
162-
| `ComputeWordSignature` | 256 | 6.07 µs | 1656 B |
163-
| `ComputeWordSignatureInto` | 256 | 18.06 µs | 608 B |
164-
| `EstimateJaccard` | 256 | 9.67 ns | - |
165-
166-
`EstimateJaccard` is fully SIMD-accelerated and **zero-allocation** at any signature size. The jump from 128→256 reflects processing two AVX2 vector-width batches instead of one.
167-
168-
> All benchmarks were run with `dotnet run -c Release`. Source: [`MinHash.Benchmark/MinHasherBench.cs`](MinHash.Benchmark/MinHashBench.cs).
169-
170-
---
124+
| Method | Signature Size | Mean | Allocated |
125+
| -------------------------- | -------------: | --------: | --------: |
126+
| `ComputeCharSignature` | 128 | 16.72 µs | 536 B |
127+
| `ComputeCharSignatureInto` | 128 | 51.53 µs | - |
128+
| `ComputeWordSignature` | 128 | 3.37 µs | 1144 B |
129+
| `ComputeWordSignatureInto` | 128 | 9.70 µs | 608 B |
130+
| `EstimateJaccard` | 128 | 5.43 ns | - |
131+
| `ComputeCharSignature` | 256 | 32.45 µs | 1048 B |
132+
| `ComputeCharSignatureInto` | 256 | 101.96 µs | - |
133+
| `ComputeWordSignature` | 256 | 6.07 µs | 1656 B |
134+
| `ComputeWordSignatureInto` | 256 | 18.06 µs | 608 B |
135+
| `EstimateJaccard` | 256 | 9.67 ns | - |
171136

172137
## Configuration Guide
173138

174-
| Parameter | Default | Recommendation |
175-
|-----------------|---------|--------------------------------------------------------|
176-
| `signatureSize` | 128 | 128 for ~97% accuracy; 256 for ~99% accuracy |
177-
| `shingleSize` | 3 | 3–4 for short texts; 5 for paragraphs/documents |
178-
| `seed` | 0xDEADBEEF | Change only if you need independent hash families |
139+
| Parameter | Default | Recommendation |
140+
| --------------- | ------------ | ------------------------------------------------- |
141+
| `signatureSize` | 128 | 128 for ~97% accuracy; 256 for ~99% accuracy |
142+
| `shingleSize` | 3 | 3–4 for short texts; 5 for longer documents |
143+
| `seed` | `0xDEADBEEF` | Change only if you need independent hash families |
179144

180-
**Accuracy vs. size trade-off:** Jaccard estimation error is approximately `1/√(signatureSize)`. At 128 functions the expected error is ≈ 8.8%; at 256 it drops to ≈ 6.25%.
145+
**Accuracy vs. size trade-off:** Jaccard estimation error is approximately $\frac{1}{\sqrt{n}}$, where $n$ is `signatureSize`:
181146

182-
---
147+
* At $n = 128$: error $\approx 8.8%$
148+
* At $n = 256$: error $\approx 6.25%$
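
As a rough sketch of where the $\frac{1}{\sqrt{n}}$ figure comes from, assume an idealized model in which each of the $n$ signature slots matches independently with probability equal to the true Jaccard similarity $J$ (this independence is an assumption, not something the README states). The estimate $\hat{J}$ is then a binomial proportion:

$$
\operatorname{Var}[\hat{J}] = \frac{J(1 - J)}{n}
$$

$$
\sigma_{\hat{J}} = \sqrt{\frac{J(1 - J)}{n}} \le \frac{1}{2\sqrt{n}}
$$

The bound is reached at $J = 0.5$, so under this model $\frac{1}{\sqrt{n}}$ is about two standard deviations in the worst case. At $n = 128$ one standard deviation is at most about $0.044$, and at $n = 256$ at most about $0.031$.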
183149

184150
## Thread Safety
185151

186-
| Type | Read | Write |
187-
|-------------------|-------|-------|
188-
| `MinHasher` | ✅ Safe | ✅ Safe (stateless after construction) |
189-
| `MinHashSignature`| ✅ Safe | N/A (immutable) |
190-
| `MinHashIndex` | ✅ Safe | ❌ Not safe for concurrent `Add` calls |
191-
192-
---
152+
| Type | Read | Write |
153+
| ------------------ | ------ | ------------------------------- |
154+
| `MinHasher` | ✅ Safe | ✅ Safe |
155+
| `MinHashSignature` | ✅ Safe | N/A |
156+
| `MinHashIndex` | ✅ Safe | ❌ Not safe for concurrent `Add` |
193157

194158
## License
195159
