Commit c276254

Merge pull request #107 from github/aneubeck/prefixmap
First part of a HashSortedMap
2 parents 15a20e9 + 2a5c666 commit c276254

12 files changed

Lines changed: 1640 additions & 2 deletions

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ members = [
     "crates/*",
     "crates/bpe/benchmarks",
     "crates/bpe/tests",
+    "crates/hash-sorted-map/benchmarks",
 ]
 resolver = "2"
 

crates/bpe/benchmarks/equivalence.rs

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ fn test_compare_dictionary() {
         hugging_tokens.remove(added_token);
     }
     let mut hugging_tokens: Vec<_> = hugging_tokens.into_iter().collect();
-    hugging_tokens.sort_by(|(_, a), (_, b)| a.cmp(b));
+    hugging_tokens.sort_by_key(|(_, a)| *a);
     let hugging_tokens: Vec<_> = hugging_tokens
         .into_iter()
         .map(|(token, _)| token.chars().map(char_to_byte).collect())

crates/hash-sorted-map/Cargo.toml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
[package]
name = "hash-sorted-map"
authors = ["The blackbird team <support@github.com>"]
version = "0.1.0"
edition = "2021"
description = "A hash map with hash-ordered iteration and linear-time merge, designed for search-index term maps."
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["hashmap", "sorted", "merge", "simd"]
categories = ["algorithms", "data-structures"]
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# HashSortedMap vs. Rust Swiss Table (hashbrown): Optimization Analysis

## Executive Summary

`HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow
chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2),
a **slot-hint fast path**, and an **optimized growth strategy**. It is generic
over key type, value type, and hash builder.

This document analyzes the design trade-offs versus
[hashbrown](https://github.com/rust-lang/hashbrown) and records the
experimental results that guided the current design.

---

## Architecture Comparison

```
┌──────────────────────────────────────────────────────────────────┐
│                      hashbrown Swiss Table                       │
│                                                                  │
│  Single contiguous allocation (SoA):                             │
│  [Padding] [T_n ... T_1 T_0] [CT_0 CT_1 ... CT_n] [CT_extra]     │
│                  data           control bytes     (mirrored)     │
│                                                                  │
│  • Open addressing, triangular probing                           │
│  • 16-byte groups (SSE2) or 8-byte groups (NEON/generic)         │
│  • EMPTY / DELETED / FULL tag states                             │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                          HashSortedMap                           │
│                                                                  │
│  Vec<Group<K,V>> where each Group (AoS):                         │
│  { ctrl: [u8; 8], keys: [MaybeUninit<K>; 8],                     │
│    values: [MaybeUninit<V>; 8], overflow: u32 }                  │
│                                                                  │
│  • Overflow chaining (linked groups)                             │
│  • 8-byte groups with NEON/SSE2/scalar SIMD scan                 │
│  • EMPTY / FULL tag states only (insertion-only, no deletion)    │
│  • Slot-hint fast path                                           │
└──────────────────────────────────────────────────────────────────┘
```

---

## Optimizations Investigated

### 1. SIMD Group Scanning ✅ Implemented

Platform-specific SIMD for control byte matching:
- **aarch64**: NEON `vceq_u8` + `vreinterpret_u64_u8` (8-byte groups)
- **x86_64**: SSE2 `_mm_cmpeq_epi8` + `_mm_movemask_epi8` (16-byte groups)
- **Fallback**: Scalar u64 zero-byte detection trick

**Benchmark result**: ~5% faster than scalar on Apple M-series. The gain is
modest because the slot-hint fast path often skips the group scan entirely.
59+
### 2. Open Addressing with Triangular Probing ❌ Rejected
60+
61+
This is not really an option for this hash map, since it would prevent efficient sorting.
62+
Additionally, we didn't observe any performance improvement in comparison to the linked overflow buffer approach.
63+
The biggest benefit of triangular probing is that it allows a much higher load factor, i.e. reduces memory consumption which isn't our main concern though.
64+
65+
**Benchmark result**: **40% slower** than overflow chaining. With the AoS
66+
layout, each group is ~112 bytes, so probing to the next group jumps over
67+
large memory regions. Overflow chaining with the slot-hint fast path is
68+
faster because most inserts land in the first group.
69+
70+
### 3. SoA Memory Layout ❌ Rejected
71+
72+
Tested a SoA variant (`SoaHashSortedMap`) with separate control byte and
73+
key/value arrays, combined with triangular probing.
74+
75+
**Benchmark result**: **Slowest variant** — even slower than AoS open
76+
addressing. The two-Vec SoA layout doubles TLB/cache pressure versus
77+
hashbrown's single-allocation layout. Without the single-allocation trick,
78+
SoA is worse than AoS for this use case.
79+
80+
### 4. Capacity Sizing ✅ Implemented
81+
82+
Without the correct sizing, there was always the penality of a grow operation.
83+
84+
**Fix**: Changed to ~70% max load factor. This was the **single biggest improvement** — HashSortedMap went from 2× slower to matching hashbrown.
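
As a rough illustration of sizing for a ~70% load factor, the sketch below computes how many primary groups a requested capacity needs. The function name `primary_groups_for`, the 7/10 ratio, and the rounding up to a power of two (convenient when the group index is a hash prefix) are all assumptions for this example, not the crate's actual sizing code.

```rust
/// Slots per group, matching the 8-byte control array described above.
const SLOTS_PER_GROUP: usize = 8;

/// Number of primary groups so that `expected_len` entries occupy at most
/// ~70% of the slots. Hypothetical helper; the power-of-two rounding is an
/// assumption of this sketch.
pub fn primary_groups_for(expected_len: usize) -> usize {
    // Slots needed so that expected_len / slots <= 7/10.
    let slots = (expected_len * 10).div_ceil(7);
    // Round up to whole groups, then to a power of two.
    slots.div_ceil(SLOTS_PER_GROUP).next_power_of_two()
}
```

Under these assumptions, a request for 1000 entries gives 256 groups (2048 slots), so the map starts below 50% load and does not grow while the 1000 entries are inserted.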

### 5. Optimized Growth ✅ Implemented

The original `grow()` called the full `insert()` for each element (including
duplicate checking and overflow traversal). hashbrown uses:
- `find_insert_index` (skip duplicate check)
- `ptr::copy_nonoverlapping` (raw memory copy)
- Bulk counter updates

**Fix**: Added `insert_for_grow()` that skips duplicate checking, uses raw
pointer copies, and iterates occupied slots via bitmask.

**Benchmark result**: Growth is now **2× faster** than hashbrown (4.8 µs vs
9.8 µs for 3 resize rounds).

### 6. Branch Prediction Hints ⚠️ Mixed Results

Added `likely()`/`unlikely()` annotations and `#[cold] #[inline(never)]` on
the overflow path.

**Benchmark result**: Helped the scalar version (~2–6% faster) but **hurt the
SIMD version** by pessimizing NEON code generation. Removed from the SIMD
implementation, kept in the scalar version.

### 7. Slot Hint Fast Path (Unique to HashSortedMap)

HashSortedMap checks a preferred slot before scanning the group:
```rust
let hint = slot_hint(hash);  // 3 bits from hash → slot index
if ctrl[hint] == EMPTY { /* direct insert */ }
if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ }
```

hashbrown does **not** have this optimization; it always does a full SIMD
group scan. The performance difference is probably due to the different overflow strategies and load factors.

### 8. Overflow Reserve Sizing ✅ Validated

Tested overflow reserves from 0% to 100% of primary groups:

| Reserve              | Growth scenario (µs) |
|----------------------|----------------------|
| m/8 (12.5%, default) | 8.04                 |
| m/4 (25%)            | 8.33                 |
| m/2 (50%)            | 8.93                 |
| m/1 (100%)           | 10.31                |
| 0 (grow immediately) | 6.96                 |

**Conclusion**: Smaller reserves are faster: growing early is cheaper than
traversing overflow chains.

### 9. IdentityHasher Fix ✅ Implemented

The original `IdentityHasher` zero-extended u32 to u64, putting zeros in the
top 32 bits. Since hashbrown derives the 7-bit tag from `hash >> 57`, every
entry got the same tag, completely defeating control byte filtering.

**Fix**: Use `folded_multiply` to expand u32 keys to u64 with independent
entropy in both halves. Also changed trigram generation to use
`folded_multiply` instead of murmur3.
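
The tag collapse can be reproduced with a small sketch. It reuses `folded_multiply` and the `ARBITRARY0` constant from the benchmark helpers in this commit; `tag` and `distinct_tags` are hypothetical helpers written for this illustration, with `tag` following the `hash >> 57` rule described above.

```rust
const ARBITRARY0: u64 = 0x243f6a8885a308d3;

/// Folded multiply: full u64×u64→u128 product, then XOR of the two halves.
pub fn folded_multiply(x: u64, y: u64) -> u64 {
    let full = (x as u128) * (y as u128);
    (full as u64) ^ ((full >> 64) as u64)
}

/// 7-bit tag taken from the top bits of the hash.
pub fn tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Counts how many distinct tags a set of hashes produces
/// (hypothetical helper for this example).
pub fn distinct_tags(hashes: impl Iterator<Item = u64>) -> usize {
    let mut seen = [false; 128];
    for h in hashes {
        seen[tag(h) as usize] = true;
    }
    seen.iter().filter(|&&s| s).count()
}
```

Zero-extended keys such as `(0..1000u32).map(|k| k as u64)` all produce tag 0, so `distinct_tags` returns 1, while passing the same keys through `folded_multiply(k as u64, ARBITRARY0)` spreads them over many tags.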

---

## Optimizations Not Implemented (and Why)

| Optimization                    | Reason                                             |
|---------------------------------|----------------------------------------------------|
| **Tombstone / DELETED support** | Insertion-only map, no deletions needed            |
| **In-place rehashing**          | No tombstones to reclaim                           |
| **Control byte mirroring**      | Not needed with overflow chaining (no wrap-around) |
| **Custom allocator support**    | Out of scope for benchmarking                      |
| **Over-allocation utilization** | Uses `Vec` (no raw allocator control)              |

---

## Summary of Impact

| Change                     | Effect on insert time        |
|----------------------------|------------------------------|
| Capacity sizing fix        | **−50%** (biggest win)       |
| Optimized growth path      | **−10%** on growth scenarios |
| SIMD group scanning        | **−5%**                      |
| Branch hints (scalar only) | **−2–6%**                    |
| IdentityHasher fix         | Enabled fair comparison      |

The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts,
**beats all hashbrown variants** on overwrites, and has **2× faster growth**.

crates/hash-sorted-map/README.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
# hash-sorted-map

A hash map whose groups are ordered by hash prefix, enabling efficient
sorted-order iteration and linear-time merging of two maps.

## Motivation

In a search index, each document produces a **term map** (term → frequency).
At index time, term maps from many documents must be **merged** into a single
posting list, and the result is **serialized in hash-key order** so that
lookups can use a skip-list approach, leveraging the hash ordering to
efficiently jump to the right region of the serialized data.

A conventional hash map stores entries in arbitrary order, so merging two maps
requires collecting, sorting, and reshuffling all entries, an expensive step
that dominates indexing time for large term maps typical of code search, where
documents contain massive numbers of tokens.

`HashSortedMap` avoids this by organizing its groups by hash prefix.
Iterating through the groups in order yields entries sorted by their hashed
keys, which means:

- **Merging** two maps is a single linear scan (like merge-sort's merge step).
- **Serialization** in hash-key order requires no extra sorting or copying.
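
The linear merge can be sketched on plain slices. The example assumes each map has already been iterated into a sequence of `(hash, frequency)` pairs in ascending hash order with unique hashes; `merge_term_maps` is a hypothetical function for illustration, not part of the crate's API.

```rust
use std::cmp::Ordering;

/// Merges two hash-ordered term lists in one linear pass, summing the
/// frequencies of terms that occur in both. Both inputs must be sorted
/// by hash and contain each hash at most once.
pub fn merge_term_maps(a: &[(u64, u32)], b: &[(u64, u32)]) -> Vec<(u64, u32)> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            Ordering::Equal => {
                // Same term in both maps: combine the frequencies.
                out.push((a[i].0, a[i].1 + b[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of the inputs still has entries left.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

The output is again in hash order, so it can be merged with a further map or serialized directly without a sorting step.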

## Design

`HashSortedMap<K, V, S>` is a Swiss-table-inspired hash map that uses:

- **Overflow chaining** instead of open addressing: groups that fill up link
  to overflow groups rather than probing into neighbours.
- **Slot hint**: a preferred slot index derived from the hash, checked before
  scanning the group. Gives a direct hit on most inserts at low load.
- **SIMD group scanning**: uses NEON on aarch64, SSE2 on x86\_64, and a
  scalar fallback elsewhere to scan 8–16 control bytes in parallel.
- **AoS group layout**: each group stores its control bytes, keys, and values
  together, keeping a single insert's data within 1–2 cache lines.
- **Optimized growth**: during resize, elements are re-inserted without
  duplicate checking and copied via raw pointers.
- **Generic key/value/hasher**: supports any `K: Hash + Eq`, any
  `S: BuildHasher`, and `Borrow<Q>`-based lookups.
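
How a hash might be split into a group index and a slot hint can be sketched as follows. The exact bit positions are assumptions made for this example (top bits for the group, lowest 3 bits for the hint), and `group_index` and `slot_hint` are illustrative stand-ins rather than the crate's real functions.

```rust
/// Group index taken from the top `group_bits` bits of the hash, so that
/// larger hashes never map to an earlier group. Assumes `group_bits <= 64`.
pub fn group_index(hash: u64, group_bits: u32) -> usize {
    if group_bits == 0 {
        0 // a single group; avoids shifting a u64 by 64 bits
    } else {
        (hash >> (64 - group_bits)) as usize
    }
}

/// Preferred slot within the 8-slot group. Using the lowest 3 bits is an
/// assumption of this sketch.
pub fn slot_hint(hash: u64) -> usize {
    (hash & 0b111) as usize
}
```

Because the group index is a prefix of the hash, hashes in ascending order map to non-decreasing group indices. That property is what lets an in-order walk over the groups produce hash-ordered output.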

## Benchmark results

All benchmarks insert 1000 random trigram hashes (scrambled with
`folded_multiply`) into maps with various configurations. Measured on Apple
M-series (aarch64).

### Insert 1000 trigrams (pre-sized, no growth)

| Rank | Map | Time (µs) | vs best |
|------|-----|-----------|---------|
| 🥇 | FoldHashMap | 2.44 | |
| 🥈 | FxHashMap | 2.61 | +7% |
| 🥉 | hashbrown::HashMap | 2.67 | +9% |
| 4 | **HashSortedMap** | **2.71** | +11% |
| 5 | hashbrown+Identity | 2.74 | +12% |
| 6 | AHashMap | 3.22 | +32% |
| 7 | std::HashMap+FNV | 3.27 | +34% |
| 8 | std::HashMap | 8.49 | +248% |

### Re-insert same keys (all overwrites)

| Map | Time (µs) |
|-----|-----------|
| **HashSortedMap** | **2.36** |
| hashbrown+Identity | 2.58 |

### Growth from small (`with_capacity(128)`, 3 resize rounds)

| Map | Time (µs) | Growth penalty (µs) |
|-----|-----------|---------------------|
| **HashSortedMap** | **4.85** | +2.14 |
| hashbrown+Identity | 9.77 | +7.03 |

### Key takeaways

- **HashSortedMap matches the fastest hashbrown configurations** on pre-sized
  first-time inserts and is **the fastest for overwrites**.
- **Growth is ~2× faster** than hashbrown thanks to the optimized
  `insert_for_grow` path that skips duplicate checking and uses raw copies.
- The remaining gap to FoldHashMap (~11%) comes from foldhash's extremely
  efficient hash function that pipelines well with hashbrown's SIMD scan.

## Running

```sh
cargo bench --bench hashmap_insert
```
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
[package]
name = "hash-sorted-map-benchmarks"
edition = "2021"

[lib]
path = "lib.rs"
test = false

[[bench]]
name = "performance"
path = "performance.rs"
harness = false
test = false

[dependencies]
hash-sorted-map = { path = ".." }
criterion = "0.8"
rand = "0.10"
rustc-hash = "2"
ahash = "0.8"
hashbrown = "0.15"
foldhash = "0.1"
fnv = "1"
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
use std::hash::{BuildHasherDefault, Hasher};

use rand::RngExt;

const ARBITRARY0: u64 = 0x243f6a8885a308d3;

/// Folded multiply: full u64×u64→u128, then XOR the two halves.
#[inline(always)]
pub fn folded_multiply(x: u64, y: u64) -> u64 {
    let full = (x as u128).wrapping_mul(y as u128);
    (full as u64) ^ ((full >> 64) as u64)
}

/// A hasher that passes through u32 keys without hashing, suitable for
/// keys that are already well-distributed.
#[derive(Default)]
pub struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("IdentityHasher only supports write_u32");
    }
    fn write_u32(&mut self, i: u32) {
        // Copy the key into both halves so the top bits of the hash are
        // not all zero.
        self.0 = (i as u64) | ((i as u64) << 32);
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

pub type IdentityBuildHasher = BuildHasherDefault<IdentityHasher>;

/// Generate `n` random trigrams as well-distributed u32 hashes.
/// Each trigram is packed into a u32, then scrambled with folded_multiply.
pub fn random_trigram_hashes(n: usize) -> Vec<u32> {
    let mut rng = rand::rng();
    (0..n)
        .map(|_| {
            let a = rng.random_range(b'a'..=b'z') as u32;
            let b = rng.random_range(b'a'..=b'z') as u32;
            let c = rng.random_range(b'a'..=b'z') as u32;
            let packed = a | (b << 8) | (c << 16);
            folded_multiply(packed as u64, ARBITRARY0) as u32
        })
        .collect()
}
