Commit f460d14

Merge pull request #113 from github/aneubeck/sparse_ngrams
Add sparse gram extraction
2 parents 4b28588 + f7c1877 commit f460d14

16 files changed

Lines changed: 879 additions & 468 deletions

Cargo.toml

Lines changed: 0 additions & 1 deletion
@@ -4,7 +4,6 @@ members = [
     "crates/*",
     "crates/bpe/benchmarks",
     "crates/bpe/tests",
-    "crates/hash-sorted-map/benchmarks",
 ]
 resolver = "2"

Makefile

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ build:

 .PHONY: build-js
 build-js:
+	which wasm-pack || cargo install wasm-pack
 	npm --prefix crates/string-offsets/js install
 	npm --prefix crates/string-offsets/js run compile

README.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ A collection of useful algorithms written in Rust. Currently contains:
 - [`geo_filters`](crates/geo_filters): probabilistic data structures that solve the [Distinct Count Problem](https://en.wikipedia.org/wiki/Count-distinct_problem) using geometric filters.
 - [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
 - [`bpe-openai`](crates/bpe-openai): Fast tokenizers for OpenAI token sets based on the `bpe` crate.
+- [`sparse-ngrams`](crates/sparse-ngrams): fast sparse n-gram extraction from byte slices. Selects variable-length n-grams (2–8 bytes) deterministically using bigram frequency priorities, suitable for substring search indexes.
 - [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.

 ## Background

crates/sparse-ngrams/Cargo.toml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[package]
name = "sparse-ngrams"
version = "0.1.0"
edition = "2021"
description = "Fast sparse n-gram extraction from byte slices."
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["ngram", "algorithm", "search", "index"]
categories = ["algorithms", "data-structures", "text-processing"]

[lib]
bench = false

[[bench]]
name = "performance"
path = "benchmarks/performance.rs"
harness = false

[dev-dependencies]
criterion = "0.7"

crates/sparse-ngrams/README.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# sparse-ngrams

Fast sparse n-gram extraction from byte slices.

Sparse grams select variable-length n-grams (2–8 bytes) without extracting all possible substrings. The algorithm is deterministic: the same extraction logic applies to every substring, making it suitable for substring search indexes.

For background, see:
- [The technology behind GitHub's new code search](https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/#fn-69904-bignote)
- [Sparse n-grams: smarter trigram selection](https://cursor.com/blog/fast-regex-search#sparse-n-grams-smarter-trigram-selection)

## Caveats

The integrated bigram table contains only lowercase ASCII bigrams. Callers should lowercase and normalize input before extraction (e.g. fold uppercase to lowercase, map non-ASCII bytes to a single sentinel value). This makes the implementation suitable for case-insensitive search indexes.
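
The normalization step described above could look roughly like the following. This is a minimal sketch, not part of the crate: the helper name `normalize` and the sentinel value `0x80` are assumptions chosen for illustration.

```rust
/// Hypothetical sentinel for every non-ASCII byte (an assumption, not a
/// value defined by the crate).
const NON_ASCII_SENTINEL: u8 = 0x80;

/// Folds ASCII uppercase to lowercase and maps each non-ASCII byte to a
/// single sentinel value, so that the result only contains bytes a
/// lowercase-ASCII bigram table can be expected to handle.
fn normalize(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&b| {
            if b.is_ascii() {
                b.to_ascii_lowercase()
            } else {
                NON_ASCII_SENTINEL
            }
        })
        .collect()
}
```

Queries need the same normalization as indexed documents, otherwise the extracted grams will not match.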

## How it works

Each consecutive byte pair (bigram) is assigned a frequency-based priority from a precomputed table. An n-gram boundary occurs wherever a bigram has lower priority than all bigrams between it and the previous boundary. This is computed efficiently using a monotone deque or a scan-based approach.
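
For readers unfamiliar with monotone deques, here is the generic sliding-window-minimum technique using the standard library's `VecDeque`. It is only a sketch of the data structure idea, not the crate's extraction loop; the function name `sliding_window_min` and the use of plain `u16` values in place of bigram priorities are assumptions.

```rust
use std::collections::VecDeque;

/// For each window of `width` consecutive values, returns the index of a
/// minimum element (the rightmost one on ties). Each index is pushed and
/// popped at most once, so the total work is linear in `values.len()`.
fn sliding_window_min(values: &[u16], width: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if width == 0 {
        return out;
    }
    // Indices whose values are strictly increasing from front to back.
    let mut deque: VecDeque<usize> = VecDeque::new();
    for (i, &v) in values.iter().enumerate() {
        // Entries not smaller than the new value can never be a minimum again.
        while let Some(&back) = deque.back() {
            if values[back] >= v {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(i);
        // Drop the front once it has left the window ending at `i`.
        if let Some(&front) = deque.front() {
            if front + width <= i {
                deque.pop_front();
            }
        }
        if i + 1 >= width {
            // The deque is non-empty here because `i` was just pushed.
            out.push(*deque.front().unwrap());
        }
    }
    out
}
```

The front of the deque is always the current window minimum, which is what makes "lower than everything since the previous boundary" cheap to query.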

For a document of N bytes, this produces at most 3(N−1) n-grams: N−1 bigrams, plus up to 2(N−1) algorithmically selected longer n-grams (up to 8 bytes).
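
The bound is simple arithmetic and can be used to size an output buffer. The helper below is a hypothetical illustration named `sparse_gram_upper_bound`; the crate exports its own `max_sparse_grams` function, whose exact definition is not shown in this diff and may differ.

```rust
/// Upper bound from the text: (N - 1) bigrams plus up to 2 * (N - 1)
/// longer n-grams, i.e. 3 * (N - 1). Inputs shorter than two bytes
/// contain no bigram and therefore yield 0.
fn sparse_gram_upper_bound(n_bytes: usize) -> usize {
    3 * n_bytes.saturating_sub(1)
}
```

For the 11-byte input `b"hello world"`, this gives 10 bigrams plus at most 20 longer n-grams, so at most 30 in total.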

### Selection criterion

A substring of length 3–8 is emitted as a sparse n-gram if and only if every interior bigram priority is strictly greater than the maximum of the left and right boundary bigram priorities.
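
The criterion can be checked directly with a quadratic brute-force reference, which is useful for testing a faster implementation. The sketch below is an assumption-laden illustration: `select_brute_force` is a hypothetical name, and priorities are passed in as a slice (`prio[k]` for the bigram formed by bytes `k` and `k + 1`) instead of being looked up in the crate's table.

```rust
const MIN_LEN: usize = 3;
const MAX_LEN: usize = 8;

/// Returns the byte ranges `(start, end)` (end exclusive) of all substrings
/// of length 3..=8 whose interior bigram priorities are all strictly greater
/// than the larger of the two boundary bigram priorities.
fn select_brute_force(prio: &[u16]) -> Vec<(usize, usize)> {
    let n_bytes = prio.len() + 1;
    let mut out = Vec::new();
    for start in 0..n_bytes {
        for len in MIN_LEN..=MAX_LEN {
            let end = start + len;
            if end > n_bytes {
                break;
            }
            // The boundary bigrams are the first and last bigram of the substring.
            let bound = prio[start].max(prio[end - 2]);
            // Interior bigrams lie strictly between the two boundary bigrams.
            if prio[start + 1..end - 2].iter().all(|&p| p > bound) {
                out.push((start, end));
            }
        }
    }
    out
}
```

Under this literal reading, a substring of length 3 has no interior bigram, so the condition holds vacuously for it.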

## Usage

```rust
use sparse_ngrams::{collect_sparse_grams, NGram, MAX_SPARSE_GRAM_SIZE};

let input = b"hello world";
let grams = collect_sparse_grams(input);
for gram in &grams {
    assert!(gram.len() >= 2);
    assert!(gram.len() <= MAX_SPARSE_GRAM_SIZE as usize);
}
```

## Performance

Benchmarks on an Apple M1 (15 KB input, `lib.rs` source file):

| Variant | Throughput |
|---------|-----------|
| `deque` | ~3.5 GB/s |
| `scan` | ~4.9 GB/s |

The `scan` variant is ~40% faster than the `deque` variant because it replaces the monotone deque with a fixed-size circular buffer and a suffix-minimum scan.
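
As a rough illustration of what a suffix-minimum scan computes, the sketch below walks a window from right to left and records the running minimum. How the crate combines this with its circular buffer is not shown in this README, so the function name `suffix_minima` and the `Vec`-based interface are assumptions.

```rust
/// Returns a vector where element `i` is the minimum of `window[i..]`,
/// computed in a single right-to-left pass.
fn suffix_minima(window: &[u16]) -> Vec<u16> {
    let mut out = vec![0; window.len()];
    let mut running = u16::MAX;
    for i in (0..window.len()).rev() {
        running = running.min(window[i]);
        out[i] = running;
    }
    out
}
```

Positions where `window[i]` equals its suffix minimum are the ones with nothing smaller to their right, which is roughly the set of entries a monotone deque would still be holding.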

## Bigram table size

The priority table maps byte pairs to frequency-based priorities. Increasing the table size (number of ranked bigrams) produces more distinct longer n-grams, but saturates quickly:

![Unique n-grams vs. table size](images/unique_ngrams_vs_table_size.png)

| Table size | Unique n-grams | % of max |
|-----------|-----------------|----------|
| 100 | 5.8M | 77.0% |
| 200 | 6.4M | 84.4% |
| 400 | 6.8M | 90.2% |
| 800 | 7.3M | 96.0% |
| 1,600 | 7.5M | 99.2% |
| 3,200 | 7.6M | 99.9% |
| 5,845 | 7.6M | 100% |

The current bigram table contains the 5,845 most frequent bigrams from a large code corpus. The first ~1,600 bigrams already capture 99% of the unique n-grams.
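
A frequency-ranked bigram list of a given size could be derived from a corpus along the following lines. This is a sketch under stated assumptions: the function name `rank_bigrams` is hypothetical, ties are broken by byte order only to keep the output deterministic, and how the crate turns ranks into priority values is not shown in this diff.

```rust
use std::collections::HashMap;

/// Counts every bigram in `corpus` and returns the `table_size` most
/// frequent ones with their counts, most frequent first.
fn rank_bigrams(corpus: &[u8], table_size: usize) -> Vec<([u8; 2], usize)> {
    let mut counts: HashMap<[u8; 2], usize> = HashMap::new();
    for pair in corpus.windows(2) {
        *counts.entry([pair[0], pair[1]]).or_insert(0) += 1;
    }
    let mut ranked: Vec<([u8; 2], usize)> = counts.into_iter().collect();
    // Sort by descending count, then by bigram bytes for a stable order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.truncate(table_size);
    ranked
}
```

Truncating to smaller values of `table_size` is one way to produce the different table sizes compared above.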

## Maximum n-gram length

Increasing the maximum n-gram length produces more unique longer grams, with diminishing returns:

![Unique n-grams vs. max length](images/unique_ngrams_vs_max_length.png)

| Max length | Unique n-grams | vs. len=8 |
|-----------|---------------|-----------|
| 2 | 1.2M | 16% |
| 3 | 4.1M | 54% |
| 4 | 5.3M | 70% |
| 6 | 6.8M | 89% |
| 8 | 7.6M | 100% |
| 12 | 8.5M | 113% |
| 16 | 9.1M | 120% |
| 24 | 9.7M | 128% |
| 32 | 10.1M | 133% |
| 48 | 10.4M | 137% |
| 64 | 10.5M | 139% |

The default of 8 captures most of the discriminative power. Going to 16 adds ~20% more unique grams but doubles the scan window; going to 64 adds only ~39% total.
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use sparse_ngrams::{
    collect_sparse_grams_deque, collect_sparse_grams_scan, max_sparse_grams, NGram,
};

fn bench_collect(c: &mut Criterion) {
    let inputs: Vec<(&str, Vec<u8>)> = vec![
        ("small_11B", b"hello world".to_vec()),
        (
            "medium_900B",
            "the quick brown fox jumps over the lazy dog. "
                .repeat(20)
                .into_bytes(),
        ),
        (
            "large_15KB",
            include_str!("../src/lib.rs").as_bytes().to_vec(),
        ),
    ];

    let mut group = c.benchmark_group("collect");
    for (name, input) in &inputs {
        let mut buf = vec![NGram::from_bytes(b"xx"); max_sparse_grams(input.len())];
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(BenchmarkId::new("deque", name), input, |b, input| {
            b.iter(|| collect_sparse_grams_deque(black_box(input), &mut buf))
        });
        group.bench_with_input(BenchmarkId::new("scan", name), input, |b, input| {
            b.iter(|| collect_sparse_grams_scan(black_box(input), &mut buf))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_collect);
criterion_main!(benches);
Binary files not shown (40 KB, 37.3 KB, 17.1 KB).

crates/sparse-ngrams/src/deque.rs

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
//! Stack-allocated circular buffer (monotone deque).
2+
3+
use std::mem::MaybeUninit;
4+
5+
/// Deque element representing two neighboring bytes in the input.
6+
#[derive(Debug, Clone, Copy)]
7+
pub(crate) struct PosStateBytes {
8+
/// Absolute index position between the two bigram characters.
9+
/// I.e. 1 references the very first bigram.
10+
pub index: u32,
11+
pub value: u16,
12+
}
13+
14+
/// Stack-allocated circular buffer holding up to `CAP` elements.
15+
/// Replaces `VecDeque<PosStateBytes>` — avoids heap allocation and fits in a
16+
/// single cache line for small CAP values.
17+
pub(crate) struct FixedDeque<const CAP: usize> {
18+
data: [MaybeUninit<PosStateBytes>; CAP],
19+
start: u8,
20+
len: u8,
21+
}
22+
23+
impl<const CAP: usize> FixedDeque<CAP> {
24+
pub fn new() -> Self {
25+
Self {
26+
data: [MaybeUninit::uninit(); CAP],
27+
start: 0,
28+
len: 0,
29+
}
30+
}
31+
32+
#[inline]
33+
pub fn front(&self) -> Option<&PosStateBytes> {
34+
if self.len == 0 {
35+
None
36+
} else {
37+
Some(unsafe { self.data[self.start as usize].assume_init_ref() })
38+
}
39+
}
40+
41+
#[inline]
42+
pub fn back(&self) -> Option<&PosStateBytes> {
43+
if self.len == 0 {
44+
None
45+
} else {
46+
let idx = (self.start + self.len - 1) as usize % CAP;
47+
Some(unsafe { self.data[idx].assume_init_ref() })
48+
}
49+
}
50+
51+
#[inline]
52+
pub fn pop_front(&mut self) {
53+
debug_assert!(self.len > 0);
54+
self.start = (self.start + 1) % CAP as u8;
55+
self.len -= 1;
56+
}
57+
58+
#[inline]
59+
pub fn pop_back(&mut self) {
60+
debug_assert!(self.len > 0);
61+
self.len -= 1;
62+
}
63+
64+
#[inline]
65+
pub fn push_back(&mut self, val: PosStateBytes) {
66+
debug_assert!((self.len as usize) < CAP);
67+
let idx = (self.start + self.len) as usize % CAP;
68+
self.data[idx] = MaybeUninit::new(val);
69+
self.len += 1;
70+
}
71+
}
