|
| 1 | +# sparse-ngrams |
| 2 | + |
| 3 | +Fast sparse n-gram extraction from byte slices. |
| 4 | + |
| 5 | +Sparse n-gram extraction selects variable-length n-grams (2–8 bytes) instead of extracting every possible substring. Selection is deterministic and depends only on the bytes inside a candidate substring, so a query string yields the same n-grams as the matching region of an indexed document. This makes it suitable for substring search indexes. |
| 6 | + |
| 7 | +For background, see: |
| 8 | +- [The technology behind GitHub's new code search](https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/#fn-69904-bignote) |
| 9 | +- [Sparse n-grams: smarter trigram selection](https://cursor.com/blog/fast-regex-search#sparse-n-grams-smarter-trigram-selection) |
| 10 | + |
| 11 | +## Caveats |
| 12 | + |
| 13 | +The built-in bigram table contains only lowercase ASCII bigrams. Callers should normalize input before extraction (e.g. fold uppercase to lowercase and map non-ASCII bytes to a single sentinel value). As a result, the implementation is best suited to case-insensitive search indexes. |
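
The normalization described above could look roughly like the sketch below. It is not part of the crate: the function name `normalize` and the sentinel value `0x80` are assumptions chosen for illustration, and a real index may prefer a different sentinel or additional folding rules.

```rust
/// Hypothetical sentinel that replaces every non-ASCII byte.
/// The concrete value is an assumption, not something the crate prescribes.
const NON_ASCII_SENTINEL: u8 = 0x80;

/// Folds ASCII uppercase to lowercase and maps each non-ASCII byte
/// to the single sentinel value, leaving the length unchanged.
fn normalize(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&b| {
            if b.is_ascii() {
                b.to_ascii_lowercase()
            } else {
                NON_ASCII_SENTINEL
            }
        })
        .collect()
}
```

Because the mapping is byte-wise, a multi-byte UTF-8 character such as `é` (two bytes) becomes two sentinel bytes, and byte offsets in the normalized buffer still line up with the original input.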
| 14 | + |
| 15 | +## How it works |
| 16 | + |
| 17 | +Each consecutive byte pair (bigram) is assigned a frequency-based priority from a precomputed table. An n-gram boundary occurs wherever a bigram has lower priority than all bigrams between it and the previous boundary. This is computed efficiently using a monotone deque or a scan-based approach. |
| 18 | + |
| 19 | +For a document of N bytes, this produces at most 3(N−1) n-grams: N−1 bigrams, plus up to 2(N−1) algorithmically selected longer n-grams (up to 8 bytes). |
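
To make the boundary rule and the size bound concrete, here is a minimal sketch that uses a monotone stack (a simplification of the deque mentioned above). It is not the crate's implementation. It assumes the bigram priorities are already available as a slice, where `prio[k]` is the priority of the bigram covering bytes `k` and `k + 1`, and it reports only the n-grams longer than a bigram as `(start, end)` byte ranges. The name `sparse_ranges_stack` is hypothetical.

```rust
/// Returns `(start, end)` byte ranges of the selected n-grams of length >= 3.
/// `prio[k]` is the priority of the bigram covering bytes `k` and `k + 1`.
fn sparse_ranges_stack(prio: &[u32], max_len: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    // Candidate left boundaries; their priorities strictly increase
    // from the bottom of the stack to the top.
    let mut stack: Vec<usize> = Vec::new();
    for j in 0..prio.len() {
        while let Some(&t) = stack.last() {
            // Every bigram strictly between t and j has a higher priority
            // than both of them, so only the length cap can reject (t, j).
            if j - t + 2 <= max_len {
                out.push((t, j + 2));
            }
            if prio[t] < prio[j] {
                break; // t blocks every candidate below it on the stack
            }
            stack.pop(); // t can never be a left boundary again
            if prio[t] == prio[j] {
                break; // a tie is not "strictly greater", so stop here
            }
        }
        stack.push(j);
    }
    out
}
```

Each bigram is pushed once and popped at most once, and each step of the outer loop emits at most one range per pop plus one more, which is where a bound of roughly two longer n-grams per bigram comes from.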
| 20 | + |
| 21 | +### Selection criterion |
| 22 | + |
| 23 | +A substring of length 3–8 is emitted as a sparse n-gram if and only if every interior bigram priority is strictly greater than the maximum of the left and right boundary bigram priorities. |
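
The criterion can be checked directly with a brute-force reference, which is handy for testing faster variants. The sketch below is illustrative only and is not taken from the crate: the names `is_sparse_gram` and `sparse_ranges_bruteforce` are hypothetical, and priorities are passed in as a slice where `prio[k]` belongs to the bigram covering bytes `k` and `k + 1`.

```rust
/// True if the substring spanning bigrams `i..=j` (with `i < j`) satisfies
/// the criterion: every interior bigram priority is strictly greater than
/// the larger of the two boundary priorities.
fn is_sparse_gram(prio: &[u32], i: usize, j: usize) -> bool {
    let bound = prio[i].max(prio[j]);
    // For a trigram (j == i + 1) the interior is empty and the check passes.
    prio[i + 1..j].iter().all(|&p| p > bound)
}

/// Enumerates every candidate of 3..=max_len bytes and keeps the ones
/// that pass the criterion, as `(start, end)` byte ranges.
fn sparse_ranges_bruteforce(prio: &[u32], max_len: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for i in 0..prio.len() {
        for j in (i + 1)..prio.len() {
            // A span from bigram i to bigram j covers j - i + 2 bytes.
            if j - i + 2 > max_len {
                break;
            }
            if is_sparse_gram(prio, i, j) {
                out.push((i, j + 2));
            }
        }
    }
    out
}
```

Note that a 3-byte substring has no interior bigram, so under this reading of the criterion every trigram qualifies.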
| 24 | + |
| 25 | +## Usage |
| 26 | + |
| 27 | +```rust |
| 28 | +use sparse_ngrams::{collect_sparse_grams, MAX_SPARSE_GRAM_SIZE}; |
| 29 | + |
| 30 | +let input = b"hello world"; |
| 31 | +let grams = collect_sparse_grams(input); |
| 32 | +for gram in &grams { |
| 33 | + assert!(gram.len() >= 2); |
| 34 | + assert!(gram.len() <= MAX_SPARSE_GRAM_SIZE as usize); |
| 35 | +} |
| 36 | +``` |
| 37 | + |
| 38 | +## Performance |
| 39 | + |
| 40 | +Benchmarks on an Apple M1 (15 KB input, `lib.rs` source file): |
| 41 | + |
| 42 | +| Variant | Throughput | |
| 43 | +|---------|-----------| |
| 44 | +| `deque` | ~3.5 GB/s | |
| 45 | +| `scan` | ~4.9 GB/s | |
| 46 | + |
| 47 | +The `scan` variant is ~40% faster than the `deque` variant because it replaces the monotone deque with a fixed-size circular buffer and a suffix-minimum scan. |
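
The idea behind a scan-based variant can be sketched as a bounded backward walk with a running minimum. This is an illustration under stated assumptions, not the crate's `scan` code: it indexes a priority slice directly instead of using a circular buffer, the name `sparse_ranges_scan` is hypothetical, and `prio[k]` is assumed to be the priority of the bigram covering bytes `k` and `k + 1`.

```rust
/// For each right boundary `j`, walks left over at most `max_len - 2`
/// bigrams while tracking the minimum interior priority seen so far.
/// Returns `(start, end)` byte ranges of the n-grams of length >= 3.
fn sparse_ranges_scan(prio: &[u32], max_len: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for j in 1..prio.len() {
        // Leftmost bigram whose span up to j still fits in max_len bytes.
        let lo = (j + 2).saturating_sub(max_len);
        // Minimum priority strictly between i and j; None while empty.
        let mut interior_min: Option<u32> = None;
        for i in (lo..j).rev() {
            let bound = prio[i].max(prio[j]);
            if interior_min.map_or(true, |m| m > bound) {
                out.push((i, j + 2));
            }
            if prio[i] <= prio[j] {
                break; // i would be a too-low interior for anything further left
            }
            interior_min = Some(interior_min.map_or(prio[i], |m| m.min(prio[i])));
        }
    }
    out
}
```

With the maximum length fixed at 8, the inner loop runs at most six times per position, so there is no dynamic data structure to maintain, which is one plausible reason a scan can beat a deque in practice.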
| 48 | + |
| 49 | +## Bigram table size |
| 50 | + |
| 51 | +The priority table maps byte pairs to frequency-based priorities. Increasing the table size (number of ranked bigrams) produces more distinct longer n-grams, but saturates quickly: |
| 52 | + |
| 53 | + |
| 54 | + |
| 55 | +| Table size | Unique n-grams | % of max | |
| 56 | +|-----------|-----------------|----------| |
| 57 | +| 100 | 5.8M | 77.0% | |
| 58 | +| 200 | 6.4M | 84.4% | |
| 59 | +| 400 | 6.8M | 90.2% | |
| 60 | +| 800 | 7.3M | 96.0% | |
| 61 | +| 1,600 | 7.5M | 99.2% | |
| 62 | +| 3,200 | 7.6M | 99.9% | |
| 63 | +| 5,845 | 7.6M | 100% | |
| 64 | + |
| 65 | +The current bigram table contains the 5,845 most frequent bigrams from a large code corpus. |
| 66 | +The first ~1,600 bigrams already capture over 99% of the unique n-grams. |
| 67 | + |
| 68 | +## Maximum n-gram length |
| 69 | + |
| 70 | +Increasing the maximum n-gram length produces more unique longer grams, with diminishing returns: |
| 71 | + |
| 72 | + |
| 73 | + |
| 74 | +| Max length | Unique n-grams | vs. len=8 | |
| 75 | +|-----------|---------------|-----------| |
| 76 | +| 2 | 1.2M | 16% | |
| 77 | +| 3 | 4.1M | 54% | |
| 78 | +| 4 | 5.3M | 70% | |
| 79 | +| 6 | 6.8M | 89% | |
| 80 | +| 8 | 7.6M | 100% | |
| 81 | +| 12 | 8.5M | 113% | |
| 82 | +| 16 | 9.1M | 120% | |
| 83 | +| 24 | 9.7M | 128% | |
| 84 | +| 32 | 10.1M | 133% | |
| 85 | +| 48 | 10.4M | 137% | |
| 86 | +| 64 | 10.5M | 139% | |
| 87 | + |
| 88 | +The default of 8 captures most of the discriminative power. Going from 8 to 16 adds ~20% more unique grams but doubles the scan window; going to 64 adds only ~39% in total for an 8× larger window. |