|
4 | 4 |
|
5 | 5 | `HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow |
6 | 6 | chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2), |
7 | | -a **slot-hint fast path**, and an **optimized growth strategy**. It is generic |
8 | | -over key type, value type, and hash builder. |
| 7 | +and an **optimized growth strategy**. It is generic over key type, value type, |
| 8 | +and hash builder. |
9 | 9 |
|
10 | 10 | This document analyzes the design trade-offs versus |
11 | 11 | [hashbrown](https://github.com/rust-lang/hashbrown) and records the |
@@ -38,7 +38,6 @@ experimental results that guided the current design. |
38 | 38 | │ • Overflow chaining (linked groups) │ |
39 | 39 | │ • 8-byte groups with NEON/SSE2/scalar SIMD scan │ |
40 | 40 | │ • EMPTY / FULL tag states only (insertion-only, no deletion) │ |
41 | | -│ • Slot-hint fast path │ |
42 | 41 | └──────────────────────────────────────────────────────────────────┘ |
43 | 42 | ``` |
44 | 43 |
|
@@ -106,17 +105,32 @@ the overflow path. |
106 | 105 | SIMD version** by pessimizing NEON code generation. Removed from the SIMD |
107 | 106 | implementation, kept in the scalar version. |
108 | 107 |
|
109 | | -### 7. Slot Hint Fast Path (Unique to HashSortedMap) |
| 108 | +### 7. Slot Hint Fast Path ⚠️ Removed from Lookup Paths |
110 | 109 |
|
111 | | -HashSortedMap checks a preferred slot before scanning the group: |
| 110 | +Originally, HashSortedMap checked a preferred slot before scanning the group: |
112 | 111 | ```rust |
113 | 112 | let hint = slot_hint(hash); // 3 bits from hash → slot index |
114 | 113 | if ctrl[hint] == EMPTY { /* direct insert */ } |
115 | 114 | if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ } |
116 | 115 | ``` |
117 | 116 |
|
118 | | -hashbrown does **not** have this optimization — it always does a full SIMD |
119 | | -group scan. The reason why the performance is different is probably due to the different overflow strategies and the different load factors. |
| 117 | +**Experimental finding**: This scalar check **hurts performance** on random |
| 118 | +workloads. Random keys map to random slots, so the outcome of the hint check |
| 119 | +is effectively unpredictable and the branch mispredicts frequently. |
| 120 | +SIMD-only scanning (match_tag + match_empty) is uniformly fast |
| 121 | +regardless of key distribution. |
| 122 | + |
| 123 | +**Results of removing slot_hint from different paths:** |
| 124 | +- `find_or_insertion_slot` (entry API): **−25% latency** on the merge benchmark |
| 125 | +- `get_hashed`: **−4.4% latency** (the SIMD scan is faster than branch + scalar check) |
| 126 | +- `insert_hashed`: **+7% latency** (a regression) on pre-sized insert, because the hint |
| 127 | + genuinely helps when inserting into a mostly-empty group; accepted for code |
| 128 | + simplicity since the merge workload matters more |
| 129 | + |
| 130 | +**Current state**: slot_hint is **only** used in `insert_for_grow()`, where |
| 131 | +the map is guaranteed sparse after a resize (groups are mostly empty, so the |
| 132 | +hint slot is very likely free). For all other paths, SIMD-only scanning is |
| 133 | +used. |
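
The contrast between the hinted probe and the scan-only probe described in this section can be sketched as follows. This is a minimal illustration, not the code from the PR: it uses a portable SWAR scan over one 8-byte group in place of NEON/SSE2, and the control-byte encoding (`EMPTY = 0x00`, FULL tags with the high bit set), `tag_of`, `scan_group`, `hinted_probe`, and the choice of the low 3 hash bits for `slot_hint` are all assumptions made for the example.

```rust
// Hypothetical control-byte encoding, assumed for this sketch only:
// 0x00 marks an EMPTY slot; a FULL slot stores 0x80 | (7 hash bits).
const EMPTY: u8 = 0x00;
const LANES_LO7: u64 = 0x7f7f_7f7f_7f7f_7f7f;
const LANES_ONE: u64 = 0x0101_0101_0101_0101;

/// Returns a mask with 0x80 set in every byte lane of `group` equal to `byte`.
fn match_byte(group: u64, byte: u8) -> u64 {
    // XOR against the broadcast byte turns matching lanes into 0x00.
    let x = group ^ LANES_ONE.wrapping_mul(byte as u64);
    // Exact zero-lane test: bit 7 of each lane is set iff that lane is non-zero.
    // The addition cannot carry across lanes (0x7f + 0x7f = 0xfe).
    let nonzero = ((x & LANES_LO7) + LANES_LO7) | x;
    !(nonzero | LANES_LO7)
}

/// Index of the lowest matching lane in a mask produced by `match_byte`.
fn first_lane(mask: u64) -> Option<usize> {
    if mask == 0 { None } else { Some(mask.trailing_zeros() as usize / 8) }
}

/// 3 bits from the hash select a preferred slot (low bits are an assumption).
fn slot_hint(hash: u64) -> usize {
    (hash & 7) as usize
}

/// Assumed tag derivation: top 7 hash bits with the high bit forced on.
fn tag_of(hash: u64) -> u8 {
    0x80 | ((hash >> 57) as u8)
}

/// Scan-only probe of one group: (first slot holding `tag`, first EMPTY slot).
/// Both masks are computed without a data-dependent branch.
fn scan_group(ctrl: &[u8; 8], tag: u8) -> (Option<usize>, Option<usize>) {
    let group = u64::from_le_bytes(*ctrl);
    (first_lane(match_byte(group, tag)), first_lane(match_byte(group, EMPTY)))
}

/// Hinted probe: test the preferred slot first, then fall back to the scan.
/// The early EMPTY return is only sound if inserts always take a free hint
/// slot, which is assumed here. The key comparison is omitted.
fn hinted_probe(ctrl: &[u8; 8], hash: u64) -> (Option<usize>, Option<usize>) {
    let (hint, tag) = (slot_hint(hash), tag_of(hash));
    if ctrl[hint] == EMPTY {
        return (None, Some(hint)); // direct insert
    }
    if ctrl[hint] == tag {
        return (Some(hint), None); // candidate hit
    }
    scan_group(ctrl, tag)
}
```

In `hinted_probe` the two early returns depend on which slot a random hash selects, which is the hard-to-predict branch the finding above refers to. `scan_group` does the same amount of work for every key.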
120 | 134 |
|
121 | 135 | ### 8. Overflow Reserve Sizing ✅ Validated |
122 | 136 |
|
@@ -159,13 +173,85 @@ entropy in both halves. Also changed trigram generation to use |
159 | 173 |
|
160 | 174 | ## Summary of Impact |
161 | 175 |
|
162 | | -| Change | Effect on insert time | |
163 | | -|----------------------------|------------------------------| |
164 | | -| Capacity sizing fix | **−50%** (biggest win) | |
165 | | -| Optimized growth path | **−10%** on growth scenarios | |
166 | | -| SIMD group scanning | **−5%** | |
167 | | -| Branch hints (scalar only) | **−2–6%** | |
168 | | -| IdentityHasher fix | Enabled fair comparison | |
| 176 | +| Change | Effect | |
| 177 | +|---------------------------------|-------------------------------------| |
| 178 | +| Capacity sizing fix | **−50%** insert time (biggest win) | |
| 179 | +| Optimized growth path | **2× faster** growth than hashbrown | |
| 180 | +| SIMD group scanning | **−5%** insert time | |
| 181 | +| Slot hint removal (entry/get) | **−25%** merge latency | |
| 182 | +| Branch hints (scalar only) | **−2–6%** | |
| 183 | +| IdentityHasher fix | Enabled fair comparison | |
| 184 | + |
| 185 | +--- |
169 | 186 |
|
170 | | -The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts, |
171 | | -**beats all hashbrown variants** on overwrites, and has **2× faster growth**. |
| 187 | +## Benchmark Results (Apple M-series, aarch64 NEON) |
| 188 | + |
| 189 | +### Insert (1000 trigrams, pre-sized) |
| 190 | + |
| 191 | +| Implementation | Time (µs) | vs hashbrown | |
| 192 | +|----------------------|-----------|--------------| |
| 193 | +| FoldHashMap | 2.44 | −7% | |
| 194 | +| FxHashMap | 2.61 | −1% | |
| 195 | +| hashbrown+Identity | 2.63 | baseline | |
| 196 | +| hashbrown::HashMap | 2.74 | +4% | |
| 197 | +| std::HashMap+FNV | 3.18 | +21% | |
| 198 | +| AHashMap | 3.38 | +29% | |
| 199 | +| **HashSortedMap** | **3.46** | **+32%** | |
| 200 | +| std::HashMap | 8.65 | +229% | |
| 201 | + |
| 202 | +### Reinsert (1000 trigrams, all keys exist) |
| 203 | + |
| 204 | +| Implementation | Time (µs) | |
| 205 | +|----------------------|-----------| |
| 206 | +| hashbrown+Identity | 2.50 | |
| 207 | +| **HashSortedMap** | **2.70** | |
| 208 | + |
| 209 | +### Growth (128 → 1000 trigrams, 3 resize rounds) |
| 210 | + |
| 211 | +| Implementation | Time (µs) | |
| 212 | +|----------------------|-----------| |
| 213 | +| **HashSortedMap** | **5.35** | |
| 214 | +| hashbrown+Identity | 10.12 | |
| 215 | + |
| 216 | +### Count (4000 trigrams, mixed insert/update) |
| 217 | + |
| 218 | +| Implementation | Time (µs) | |
| 219 | +|----------------------------------|-----------| |
| 220 | +| hashbrown+Identity entry() | 4.89 | |
| 221 | +| **HashSortedMap entry().or_default()** | **5.44** | |
| 222 | +| **HashSortedMap get_or_default** | **5.48** | |
| 223 | + |
| 224 | +### Iteration (1000 trigrams) |
| 225 | + |
| 226 | +| Implementation | Time (ns) | |
| 227 | +|-------------------------------|-----------| |
| 228 | +| **HashSortedMap iter()** | **794** | |
| 229 | +| **HashSortedMap into_iter()** | **998** | |
| 230 | +| hashbrown+Identity iter() | 1,067 | |
| 231 | +| hashbrown+Identity into_iter()| 1,060 | |
| 232 | + |
| 233 | +### Sort (100K trigrams) |
| 234 | + |
| 235 | +| Implementation | Time (µs) | |
| 236 | +|-----------------------------|-----------| |
| 237 | +| **HashSortedMap sort_by_hash** | **706** | |
| 238 | +| Vec::sort_unstable | 984 | |
| 239 | + |
| 240 | +### Merge (100 maps × 100K keys each → sorted output) |
| 241 | + |
| 242 | +| Implementation | Time (ms) | vs HSM merge+sort | |
| 243 | +|-----------------------------------|-----------|--------------------| |
| 244 | +| hashbrown merge presized | 30.4 | −46% | |
| 245 | +| **HashSortedMap merge presized** | **37.3** | **−33%** | |
| 246 | +| **HashSortedMap merge (no sort)** | **44.0** | **−21%** | |
| 247 | +| hashbrown merge | 45.4 | −19% | |
| 248 | +| **HashSortedMap merge + sort** | **55.9** | **baseline** | |
| 249 | +| hashbrown merge + Vec sort | 58.7 | +5% | |
| 250 | +| k-way merge sorted vecs | 445 | +696% | |
| 251 | + |
| 252 | +**Key takeaways:** |
| 253 | +- HashSortedMap has **2× faster growth** than hashbrown |
| 254 | +- **25% faster iteration** than hashbrown (dense group layout) |
| 255 | +- **sort_by_hash is 28% faster** than Vec::sort_unstable (data is partially sorted by group) |
| 256 | +- **merge + sort is 5% faster** than hashbrown merge + Vec sort (the primary use case) |
| 257 | +- Pre-sized insert is 32% slower than hashbrown (trade-off for sort/merge efficiency) |
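
The percentages in the takeaways follow from the timings in the tables. A small sketch of the arithmetic, with hypothetical helper names `relative_change_pct` and `speedup` introduced only for this example:

```rust
/// Relative change of `time` versus `baseline`, in percent.
/// Negative means faster than the baseline, positive means slower.
fn relative_change_pct(time: f64, baseline: f64) -> f64 {
    (time / baseline - 1.0) * 100.0
}

/// Speedup factor of the faster timing over the slower one.
fn speedup(slow: f64, fast: f64) -> f64 {
    slow / fast
}
```

Applied to the tables: `speedup(10.12, 5.35)` is about 1.9 (growth), `relative_change_pct(794.0, 1067.0)` is about −25.6 (iteration), `relative_change_pct(706.0, 984.0)` is about −28.3 (sort), `relative_change_pct(55.9, 58.7)` is about −4.8 (merge + sort), and `relative_change_pct(3.46, 2.63)` is about +31.6 (pre-sized insert versus hashbrown+Identity).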