| 1 | +# HashSortedMap vs. Rust Swiss Table (hashbrown): Optimization Analysis |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +`HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow |
| 6 | +chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2), |
| 7 | +a **slot-hint fast path**, and an **optimized growth strategy**. It is generic |
| 8 | +over key type, value type, and hash builder. |
| 9 | + |
| 10 | +This document analyzes the design trade-offs versus |
| 11 | +[hashbrown](https://github.com/rust-lang/hashbrown) and records the |
| 12 | +experimental results that guided the current design. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Architecture Comparison |
| 17 | + |
| 18 | +``` |
| 19 | +┌──────────────────────────────────────────────────────────────────┐ |
| 20 | +│ hashbrown Swiss Table │ |
| 21 | +│ │ |
| 22 | +│ Single contiguous allocation (SoA): │ |
| 23 | +│ [Padding] [T_n ... T_1 T_0] [CT_0 CT_1 ... CT_n] [CT_extra] │ |
| 24 | +│ data control bytes (mirrored) │ |
| 25 | +│ │ |
| 26 | +│ • Open addressing, triangular probing │ |
| 27 | +│ • 16-byte groups (SSE2) or 8-byte groups (NEON/generic) │ |
| 28 | +│ • EMPTY / DELETED / FULL tag states │ |
| 29 | +└──────────────────────────────────────────────────────────────────┘ |
| 30 | + |
| 31 | +┌──────────────────────────────────────────────────────────────────┐ |
| 32 | +│ HashSortedMap │ |
| 33 | +│ │ |
| 34 | +│ Vec<Group<K,V>> where each Group (AoS): │ |
| 35 | +│ { ctrl: [u8; 8], keys: [MaybeUninit<K>; 8], │ |
| 36 | +│ values: [MaybeUninit<V>; 8], overflow: u32 } │ |
| 37 | +│ │ |
| 38 | +│ • Overflow chaining (linked groups) │ |
| 39 | +│ • 8-byte groups with NEON/SSE2/scalar SIMD scan │ |
| 40 | +│ • EMPTY / FULL tag states only (insertion-only, no deletion) │ |
| 41 | +│ • Slot-hint fast path │ |
| 42 | +└──────────────────────────────────────────────────────────────────┘ |
| 43 | +``` |
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +## Optimizations Investigated |
| 48 | + |
| 49 | +### 1. SIMD Group Scanning ✅ Implemented |
| 50 | + |
| 51 | +Platform-specific SIMD for control byte matching: |
| 52 | +- **aarch64**: NEON `vceq_u8` + `vreinterpret_u64_u8` (8-byte groups) |
| 53 | +- **x86_64**: SSE2 `_mm_cmpeq_epi8` + `_mm_movemask_epi8` (16-byte groups) |
| 54 | +- **Fallback**: Scalar u64 zero-byte detection trick |
| 55 | + |
| 56 | +**Benchmark result**: ~5% faster than scalar on Apple M-series. The gain is |
| 57 | +modest because the slot-hint fast path often skips the group scan entirely. |
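
To make the scalar fallback concrete, here is a minimal sketch of a u64 zero-byte detection over one 8-byte control group. It is an illustration under stated assumptions, not the map's actual code: the names `match_tag` and `matching_slots` are hypothetical, and this variant uses the exact form of the trick (no false positives), which may differ from what the implementation does.

```rust
/// Low 7 bits of every byte lane.
const LOW7: u64 = 0x7F7F_7F7F_7F7F_7F7F;
/// The value 1 in every byte lane, used to broadcast the tag.
const ONES: u64 = 0x0101_0101_0101_0101;

/// Returns a mask with bit 7 of byte lane `i` set iff `ctrl[i] == tag`.
/// (Hypothetical helper name; exact variant of the zero-byte trick.)
pub fn match_tag(ctrl: [u8; 8], tag: u8) -> u64 {
    let word = u64::from_le_bytes(ctrl);
    // XOR with the broadcast tag: matching lanes become 0x00.
    let x = word ^ ONES.wrapping_mul(u64::from(tag));
    // Per lane: the high bit of `t` is set iff the low 7 bits are non-zero.
    // No carry crosses lanes because 0x7F + 0x7F = 0xFE.
    let t = (x & LOW7) + LOW7;
    // A lane is zero iff neither `t` nor `x` has its high bit set.
    !(t | x | LOW7)
}

/// Expands the lane mask into slot indices, lowest slot first.
pub fn matching_slots(mut mask: u64) -> Vec<usize> {
    let mut slots = Vec::new();
    while mask != 0 {
        slots.push((mask.trailing_zeros() / 8) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    slots
}
```

Calling `matching_slots(match_tag(ctrl, tag))` yields the candidate slots whose keys still have to be compared, which is the same role the NEON and SSE2 comparisons play.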
| 58 | + |
| 59 | +### 2. Open Addressing with Triangular Probing ❌ Rejected |
| 60 | + |
| 61 | +Open addressing is not a real option for this hash map because it would prevent efficient sorting. |
| 62 | +It also brought no performance improvement over the linked overflow buffer approach. |
| 63 | +The main benefit of triangular probing is that it tolerates a much higher load factor and therefore reduces memory consumption, which is not our main concern. |
| 64 | + |
| 65 | +**Benchmark result**: **40% slower** than overflow chaining. With the AoS |
| 66 | +layout, each group is ~112 bytes, so probing to the next group jumps over |
| 67 | +large memory regions. Overflow chaining with the slot-hint fast path is |
| 68 | +faster because most inserts land in the first group. |
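
For reference, the probe order of triangular probing can be sketched as follows. This is an illustrative helper (`probe_sequence` is a hypothetical name), assuming a power-of-two group count and a stride that grows by one group per step, which is the usual formulation of the technique.

```rust
/// Returns the order in which triangular probing visits the groups,
/// starting from `start`. Assumes `num_groups` is a power of two.
pub fn probe_sequence(start: usize, num_groups: usize) -> Vec<usize> {
    assert!(num_groups.is_power_of_two());
    let mask = num_groups - 1;
    let mut pos = start & mask;
    let mut stride = 0;
    let mut seq = Vec::with_capacity(num_groups);
    for _ in 0..num_groups {
        seq.push(pos);
        // The stride grows by one group per step, so the offsets from the
        // start are the triangular numbers 0, 1, 3, 6, 10, ...
        stride += 1;
        pos = (pos + stride) & mask;
    }
    seq
}
```

With 8 groups and a start of 3, the sequence visits every group exactly once. With groups of roughly 112 bytes, each later step lands progressively further away in memory, which is consistent with the slowdown measured above.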
| 69 | + |
| 70 | +### 3. SoA Memory Layout ❌ Rejected |
| 71 | + |
| 72 | +Tested a SoA variant (`SoaHashSortedMap`) with separate control byte and |
| 73 | +key/value arrays, combined with triangular probing. |
| 74 | + |
| 75 | +**Benchmark result**: **Slowest variant** — even slower than AoS open |
| 76 | +addressing. The two-Vec SoA layout doubles TLB/cache pressure versus |
| 77 | +hashbrown's single-allocation layout. Without the single-allocation trick, |
| 78 | +SoA is worse than AoS for this use case. |
| 79 | + |
| 80 | +### 4. Capacity Sizing ✅ Implemented |
| 81 | + |
| 82 | +Without correct capacity sizing, inserts always paid the penalty of a grow operation. |
| 83 | + |
| 84 | +**Fix**: Changed to ~70% max load factor. This was the **single biggest improvement**: HashSortedMap went from 2× slower than hashbrown to matching it. |
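
A minimal sketch of what sizing for a ~70% load factor could look like, assuming 8 slots per group and the default m/8 overflow reserve described in this document. The function names are hypothetical, and the real implementation may round differently (for example, to a power of two).

```rust
/// Slots per group, matching the 8-slot `Group` layout.
const SLOTS_PER_GROUP: usize = 8;

/// Primary group count that keeps `n` entries at or below a 70% load factor.
/// Integer form of ceil(n / (8 * 0.7)) = ceil(10 * n / 56).
/// (Hypothetical helper; the rounding policy is an assumption.)
pub fn primary_groups_for(n: usize) -> usize {
    let divisor = SLOTS_PER_GROUP * 7;
    let groups = (n * 10 + divisor - 1) / divisor;
    groups.max(1) // always allocate at least one group
}

/// Overflow reserve of m/8 groups for `m` primary groups.
pub fn overflow_reserve(primary_groups: usize) -> usize {
    primary_groups / 8
}
```

For 1000 expected entries this gives 179 primary groups (1432 slots, about 69.8% full) plus 22 overflow groups, so a pre-sized map does not need to grow while it is being filled.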
| 85 | + |
| 86 | +### 5. Optimized Growth ✅ Implemented |
| 87 | + |
| 88 | +The original `grow()` called the full `insert()` for each element (including |
| 89 | +duplicate checking and overflow traversal). hashbrown uses: |
| 90 | +- `find_insert_index` (skip duplicate check) |
| 91 | +- `ptr::copy_nonoverlapping` (raw memory copy) |
| 92 | +- Bulk counter updates |
| 93 | + |
| 94 | +**Fix**: Added `insert_for_grow()` that skips duplicate checking, uses raw |
| 95 | +pointer copies, and iterates occupied slots via bitmask. |
| 96 | + |
| 97 | +**Benchmark result**: Growth is now **2× faster** than hashbrown (4.8 µs vs |
| 98 | +9.8 µs for 3 resize rounds). |
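
The idea behind the optimized growth path can be sketched with a toy table. Everything here is a simplified assumption for illustration: keys are plain `u32`, there is no overflow chain, the `EMPTY` value and `hash_key` function are hypothetical, and safe array writes stand in for the raw pointer copies.

```rust
/// Assumed EMPTY marker: high bit set, so it never equals a 7-bit tag.
const EMPTY: u8 = 0x80;
const SLOTS: usize = 8;

#[derive(Clone)]
pub struct Group {
    pub ctrl: [u8; SLOTS],
    pub keys: [u32; SLOTS],
}

impl Group {
    pub fn new() -> Self {
        Group { ctrl: [EMPTY; SLOTS], keys: [0; SLOTS] }
    }
}

/// Toy hash (assumption): multiply by an odd 64-bit constant.
pub fn hash_key(key: u32) -> u64 {
    u64::from(key).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Grow-only insert: takes the first EMPTY slot and never compares keys,
/// because entries of an existing table are already unique.
/// Returns false if the group is full (a real map would use overflow here).
pub fn insert_for_grow(groups: &mut [Group], key: u32) -> bool {
    let hash = hash_key(key);
    let tag = (hash >> 57) as u8; // top 7 bits
    let idx = (hash as usize) & (groups.len() - 1);
    let group = &mut groups[idx];
    match group.ctrl.iter().position(|&c| c == EMPTY) {
        Some(slot) => {
            group.ctrl[slot] = tag;
            group.keys[slot] = key;
            true
        }
        None => false,
    }
}

/// Doubles the table, visiting only occupied slots via a per-group bitmask.
pub fn grow(old: &[Group]) -> Vec<Group> {
    let mut new_groups = vec![Group::new(); old.len() * 2];
    for group in old {
        let mut mask: u8 = 0;
        for (i, &c) in group.ctrl.iter().enumerate() {
            if c != EMPTY {
                mask |= 1 << i;
            }
        }
        while mask != 0 {
            let slot = mask.trailing_zeros() as usize;
            mask &= mask - 1; // clear the lowest set bit
            let placed = insert_for_grow(&mut new_groups, group.keys[slot]);
            assert!(placed, "toy example has no overflow chain");
        }
    }
    new_groups
}

/// Counts occupied slots across all groups.
pub fn count_entries(groups: &[Group]) -> usize {
    groups
        .iter()
        .map(|g| g.ctrl.iter().filter(|&&c| c != EMPTY).count())
        .sum()
}
```

The point of the sketch is the shape of the loop: one bitmask per group, one slot write per entry, and no duplicate check or chain traversal on the way.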
| 99 | + |
| 100 | +### 6. Branch Prediction Hints ⚠️ Mixed Results |
| 101 | + |
| 102 | +Added `likely()`/`unlikely()` annotations and `#[cold] #[inline(never)]` on |
| 103 | +the overflow path. |
| 104 | + |
| 105 | +**Benchmark result**: Helped the scalar version (~2–6% faster) but **hurt the |
| 106 | +SIMD version** by pessimizing NEON code generation. Removed from the SIMD |
| 107 | +implementation, kept in the scalar version. |
| 108 | + |
| 109 | +### 7. Slot Hint Fast Path (Unique to HashSortedMap) |
| 110 | + |
| 111 | +HashSortedMap checks a preferred slot before scanning the group: |
| 112 | +```rust |
| 113 | +let hint = slot_hint(hash); // 3 bits from hash → slot index |
| 114 | +if ctrl[hint] == EMPTY { /* direct insert */ } |
| 115 | +if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ } |
| 116 | +``` |
| 117 | + |
| 118 | +hashbrown does **not** have this optimization; it always does a full SIMD |
| 119 | +group scan. The resulting performance difference is probably due to the different overflow strategies and load factors. |
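
A fuller, runnable version of the hint check might look like the sketch below. It covers a single group only, and several details are assumptions rather than the map's actual choices: which hash bits feed `slot_hint`, the tag derivation, the `EMPTY` value, the `u32` key and value types, and the `Outcome` enum (added only to make the taken path visible).

```rust
/// Assumed EMPTY marker: high bit set, so it never equals a 7-bit tag.
const EMPTY: u8 = 0x80;

pub struct Group {
    pub ctrl: [u8; 8],
    pub keys: [u32; 8],
    pub values: [u32; 8],
}

impl Group {
    pub fn new() -> Self {
        Group { ctrl: [EMPTY; 8], keys: [0; 8], values: [0; 8] }
    }
}

/// Which path an insert took (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    InsertedAtHint,
    UpdatedAtHint,
    InsertedAfterScan,
    UpdatedAfterScan,
    GroupFull,
}

/// Preferred slot: 3 hash bits (the choice of bits 54..57 is an assumption).
pub fn slot_hint(hash: u64) -> usize {
    ((hash >> 54) & 7) as usize
}

/// 7-bit tag from the top hash bits (assumption).
pub fn tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}

pub fn insert(group: &mut Group, hash: u64, key: u32, value: u32) -> Outcome {
    let hint = slot_hint(hash);
    let t = tag(hash);
    // Fast path 1: the preferred slot is free, so write directly.
    if group.ctrl[hint] == EMPTY {
        group.ctrl[hint] = t;
        group.keys[hint] = key;
        group.values[hint] = value;
        return Outcome::InsertedAtHint;
    }
    // Fast path 2: the preferred slot already holds this key.
    if group.ctrl[hint] == t && group.keys[hint] == key {
        group.values[hint] = value;
        return Outcome::UpdatedAtHint;
    }
    // Slow path: scan the group for the key, then for a free slot.
    for i in 0..8 {
        if group.ctrl[i] == t && group.keys[i] == key {
            group.values[i] = value;
            return Outcome::UpdatedAfterScan;
        }
    }
    for i in 0..8 {
        if group.ctrl[i] == EMPTY {
            group.ctrl[i] = t;
            group.keys[i] = key;
            group.values[i] = value;
            return Outcome::InsertedAfterScan;
        }
    }
    Outcome::GroupFull // a real map would follow the overflow link here
}
```

The direct insert on an empty hint slot is safe in this sketch because the map is insertion-only: slots never become empty again, so a key that was inserted earlier would have found its hint slot either free (and taken it) or occupied.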
| 120 | + |
| 121 | +### 8. Overflow Reserve Sizing ✅ Validated |
| 122 | + |
| 123 | +Tested overflow reserves from 0% to 100% of primary groups: |
| 124 | + |
| 125 | +| Reserve | Growth scenario (µs) | |
| 126 | +|---------|----------------------| |
| 127 | +| m/8 (12.5%, default) | 8.04 | |
| 128 | +| m/4 (25%) | 8.33 | |
| 129 | +| m/2 (50%) | 8.93 | |
| 130 | +| m/1 (100%) | 10.31 | |
| 131 | +| 0 (grow immediately) | 6.96 | |
| 132 | + |
| 133 | +**Conclusion**: Smaller reserves are faster — growing early is cheaper than |
| 134 | +traversing overflow chains. |
| 135 | + |
| 136 | +### 9. IdentityHasher Fix ✅ Implemented |
| 137 | + |
| 138 | +The original `IdentityHasher` zero-extended u32 to u64, putting zeros in the |
| 139 | +top 32 bits. Since hashbrown derives the 7-bit tag from `hash >> 57`, every |
| 140 | +entry got the same tag — completely defeating control byte filtering. |
| 141 | + |
| 142 | +**Fix**: Use `folded_multiply` to expand u32 keys to u64 with independent |
| 143 | +entropy in both halves. Also changed trigram generation to use |
| 144 | +`folded_multiply` instead of murmur3. |
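
The tag collapse and its fix can be reproduced with a short sketch. The `folded_multiply` shown here is the common 128-bit multiply-and-fold formulation; the multiplier constant and the helper names `zero_extend`, `expand`, and `distinct_tags` are assumptions chosen for illustration and may not match the benchmark code.

```rust
/// Assumed odd 64-bit multiplier (hypothetical choice of constant).
const MULTIPLIER: u64 = 0x9E37_79B9_7F4A_7C15;

/// Multiplies to 128 bits and XORs the high half into the low half.
pub fn folded_multiply(a: u64, b: u64) -> u64 {
    let product = u128::from(a) * u128::from(b);
    (product as u64) ^ ((product >> 64) as u64)
}

/// 7-bit tag taken from the top of the hash (`hash >> 57`).
pub fn tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// The original behaviour: zero-extend, leaving the top 32 bits empty.
pub fn zero_extend(key: u32) -> u64 {
    u64::from(key)
}

/// The fixed behaviour: spread the key's entropy over all 64 bits.
pub fn expand(key: u32) -> u64 {
    folded_multiply(u64::from(key), MULTIPLIER)
}

/// Counts how many distinct tags the keys `0..n` produce under `hash_fn`.
pub fn distinct_tags(hash_fn: fn(u32) -> u64, n: u32) -> usize {
    let mut seen = [false; 128];
    for key in 0..n {
        seen[tag(hash_fn(key)) as usize] = true;
    }
    seen.iter().filter(|&&s| s).count()
}
```

With `zero_extend`, every key in `0..1000` maps to tag 0, so the control bytes filter nothing. With `expand`, the same keys spread over multiple tags and the control byte comparison becomes useful again.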
| 145 | + |
| 146 | +--- |
| 147 | + |
| 148 | +## Optimizations Not Implemented (and Why) |
| 149 | + |
| 150 | +| Optimization | Reason | |
| 151 | +|---------------------------------|------------------------------------------| |
| 152 | +| **Tombstone / DELETED support** | Insertion-only map — no deletions needed | |
| 153 | +| **In-place rehashing** | No tombstones to reclaim | |
| 154 | +| **Control byte mirroring** | Not needed with overflow chaining (no wrap-around) | |
| 155 | +| **Custom allocator support** | Out of scope for benchmarking | |
| 156 | +| **Over-allocation utilization** | Uses `Vec` (no raw allocator control) | |
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## Summary of Impact |
| 161 | + |
| 162 | +| Change | Effect on insert time | |
| 163 | +|----------------------------|------------------------------| |
| 164 | +| Capacity sizing fix | **−50%** (biggest win) | |
| 165 | +| Optimized growth path | **−10%** on growth scenarios | |
| 166 | +| SIMD group scanning | **−5%** | |
| 167 | +| Branch hints (scalar only) | **−2–6%** | |
| 168 | +| IdentityHasher fix | Enabled fair comparison | |
| 169 | + |
| 170 | +The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts, |
| 171 | +**beats all hashbrown variants** on overwrites, and has **2× faster growth**. |