Remove slot_hint from most paths and update OPTIMIZATIONS.md

aneubeck · aneubeck · commit 170870e0e8bc · 2026-05-07T19:05:24.000+02:00
diff --git a/crates/hash-sorted-map/OPTIMIZATIONS.md b/crates/hash-sorted-map/OPTIMIZATIONS.md
@@ -4,8 +4,8 @@
 
 `HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow
 chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2),
-a **slot-hint fast path**, and an **optimized growth strategy**. It is generic
-over key type, value type, and hash builder.
+and an **optimized growth strategy**. It is generic over key type, value type,
+and hash builder.
 
 This document analyzes the design trade-offs versus
 [hashbrown](https://github.com/rust-lang/hashbrown) and records the
@@ -38,7 +38,6 @@ experimental results that guided the current design.
 │  • Overflow chaining (linked groups)                             │
 │  • 8-byte groups with NEON/SSE2/scalar SIMD scan                 │
 │  • EMPTY / FULL tag states only (insertion-only, no deletion)    │
-│  • Slot-hint fast path                                           │
 └──────────────────────────────────────────────────────────────────┘
 ```
 
@@ -106,17 +105,32 @@ the overflow path.
 SIMD version** by pessimizing NEON code generation. Removed from the SIMD
 implementation, kept in the scalar version.
 
-### 7. Slot Hint Fast Path (Unique to HashSortedMap)
+### 7. Slot Hint Fast Path ⚠️ Removed from Lookup Paths
 
-HashSortedMap checks a preferred slot before scanning the group:
+Originally, HashSortedMap checked a preferred slot before scanning the group:
 ```rust
 let hint = slot_hint(hash);  // 3 bits from hash → slot index
 if ctrl[hint] == EMPTY { /* direct insert */ }
 if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ }
 ```
 
-hashbrown does **not** have this optimization — it always does a full SIMD
-group scan. The reason why the performance is different is probably due to the different overflow strategies and the different load factors.
+**Experimental finding**: This scalar check **hurts performance** on random
+workloads. Random keys map to random slots, so the hint check becomes an
+unpredictable, roughly 50/50 branch that the branch predictor cannot learn.
+SIMD-only scanning (`match_tag` + `match_empty`) is uniformly fast
+regardless of key distribution.
+
+**Results of removing slot_hint from different paths:**
+- `find_or_insertion_slot` (entry API): **−25% latency** on merge benchmark
+- `get_hashed`: **−4.4% latency** (SIMD scan is faster than branch+scalar)
+- `insert_hashed`: **+7% latency** on presized insert (the hint genuinely
+  helps when inserting into a mostly-empty group), a regression accepted for
+  code simplicity since the merge workload matters more
+
+**Current state**: `slot_hint` is used **only** in `insert_for_grow()`, where
+the map is guaranteed sparse after a resize (groups are mostly empty, so the
+hint slot is very likely free). For all other paths, SIMD-only scanning is
+used.
 
 ### 8. Overflow Reserve Sizing ✅ Validated
 
@@ -159,13 +173,85 @@ entropy in both halves. Also changed trigram generation to use
 
 ## Summary of Impact
 
-| Change                     | Effect on insert time        |
-|----------------------------|------------------------------|
-| Capacity sizing fix        | **−50%** (biggest win)       |
-| Optimized growth path      | **−10%** on growth scenarios |
-| SIMD group scanning        | **−5%**                      |
-| Branch hints (scalar only) | **−2–6%**                    |
-| IdentityHasher fix         | Enabled fair comparison      |
+| Change                          | Effect                              |
+|---------------------------------|-------------------------------------|
+| Capacity sizing fix             | **−50%** insert time (biggest win)  |
+| Optimized growth path           | **2× faster** growth than hashbrown |
+| SIMD group scanning             | **−5%** insert time                 |
+| Slot hint removal (entry/get)   | **−25%** merge latency              |
+| Branch hints (scalar only)      | **−2–6%** insert time               |
+| IdentityHasher fix              | Enabled fair comparison             |
+
+---
 
-The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts,
-**beats all hashbrown variants** on overwrites, and has **2× faster growth**.
+## Benchmark Results (Apple M-series, aarch64 NEON)
+
+### Insert (1000 trigrams, pre-sized)
+
+| Implementation       | Time (µs) | vs hashbrown |
+|----------------------|-----------|--------------|
+| FoldHashMap          | 2.44      | −7%          |
+| FxHashMap            | 2.61      | −1%          |
+| hashbrown+Identity   | 2.63      | baseline     |
+| hashbrown::HashMap   | 2.74      | +4%          |
+| std::HashMap+FNV     | 3.18      | +21%         |
+| AHashMap             | 3.38      | +29%         |
+| **HashSortedMap**    | **3.46**  | **+32%**     |
+| std::HashMap         | 8.65      | +229%        |
+
+### Reinsert (1000 trigrams, all keys exist)
+
+| Implementation       | Time (µs) |
+|----------------------|-----------|
+| hashbrown+Identity   | 2.50      |
+| **HashSortedMap**    | **2.70**  |
+
+### Growth (128 → 1000 trigrams, 3 resize rounds)
+
+| Implementation       | Time (µs) |
+|----------------------|-----------|
+| **HashSortedMap**    | **5.35**  |
+| hashbrown+Identity   | 10.12     |
+
+### Count (4000 trigrams, mixed insert/update)
+
+| Implementation                         | Time (µs) |
+|----------------------------------------|-----------|
+| hashbrown+Identity entry()             | 4.89      |
+| **HashSortedMap entry().or_default()** | **5.44**  |
+| **HashSortedMap get_or_default**       | **5.48**  |
+
+### Iteration (1000 trigrams)
+
+| Implementation                 | Time (ns) |
+|--------------------------------|-----------|
+| **HashSortedMap iter()**       | **794**   |
+| **HashSortedMap into_iter()**  | **998**   |
+| hashbrown+Identity iter()      | 1,067     |
+| hashbrown+Identity into_iter() | 1,060     |
+
+### Sort (100K trigrams)
+
+| Implementation                 | Time (µs) |
+|--------------------------------|-----------|
+| **HashSortedMap sort_by_hash** | **706**   |
+| Vec::sort_unstable             | 984       |
+
+### Merge (100 maps × 100K keys each → sorted output)
+
+| Implementation                    | Time (ms) | vs HSM merge+sort  |
+|-----------------------------------|-----------|--------------------|
+| hashbrown merge presized          | 30.4      | −46%               |
+| **HashSortedMap merge presized**  | **37.3**  | **−33%**           |
+| **HashSortedMap merge (no sort)** | **44.0**  | **−21%**           |
+| hashbrown merge                   | 45.4      | −19%               |
+| **HashSortedMap merge + sort**    | **55.9**  | **baseline**       |
+| hashbrown merge + Vec sort        | 58.7      | +5%                |
+| k-way merge sorted vecs           | 445       | +696%              |
+
+**Key takeaways:**
+- HashSortedMap has **2× faster growth** than hashbrown
+- **~25% faster `iter()`** than hashbrown (dense group layout); `into_iter()` is ~6% faster
+- **sort_by_hash is 28% faster** than Vec::sort_unstable (data is partially sorted by group)
+- **merge + sort is 5% faster** than hashbrown merge + Vec sort (the primary use case)
+- Pre-sized insert is 32% slower than hashbrown+Identity (trade-off for sort/merge efficiency)
diff --git a/crates/hash-sorted-map/src/hash_sorted_map.rs b/crates/hash-sorted-map/src/hash_sorted_map.rs
@@ -215,26 +215,11 @@ impl<K: Hash + Eq, V, S: BuildHasher> HashSortedMap<K, V, S> {
 
     fn insert_hashed(&mut self, hash: u64, key: K, value: V) -> Option<V> {
         let tag = tag(hash);
-        let hint = slot_hint(hash);
         let mut gi = self.container.group_index(hash);
         loop {
             let group = &mut self.container.groups[gi];
-            // Fast path: check preferred slot.
-            let c = group.ctrl[hint];
-            if c == CTRL_EMPTY {
-                group.ctrl[hint] = tag;
-                group.keys[hint] = MaybeUninit::new(key);
-                group.values[hint] = MaybeUninit::new(value);
-                self.container.len += 1;
-                return None;
-            }
-            if c == tag && unsafe { group.keys[hint].assume_init_ref() } == &key {
-                let old = std::mem::replace(unsafe { group.values[hint].assume_init_mut() }, value);
-                return Some(old);
-            }
-            // Slow path: SIMD scan group for tag match.
+            // SIMD scan group for tag match.
             let mut tag_mask = group_ops::match_tag(&group.ctrl, tag);
-            tag_mask = group_ops::clear_slot(tag_mask, hint);
             while let Some(i) = group_ops::next_match(&mut tag_mask) {
                 if unsafe { group.keys[i].assume_init_ref() } == &key {
                     let old =
@@ -267,9 +252,9 @@ impl<K: Hash + Eq, V, S: BuildHasher> HashSortedMap<K, V, S> {
                 self.container.num_groups += 1;
                 self.container.groups[gi].overflow = new_gi as u32;
                 let group = &mut self.container.groups[new_gi];
-                group.ctrl[hint] = tag;
-                group.keys[hint] = MaybeUninit::new(key);
-                group.values[hint] = MaybeUninit::new(value);
+                group.ctrl[0] = tag;
+                group.keys[0] = MaybeUninit::new(key);
+                group.values[0] = MaybeUninit::new(value);
                 self.container.len += 1;
                 return None;
             }
@@ -282,31 +267,20 @@ impl<K: Hash + Eq, V, S: BuildHasher> HashSortedMap<K, V, S> {
         Q: Eq + ?Sized,
     {
         let tag = tag(hash);
-        let hint = slot_hint(hash);
         let mut gi = self.container.group_index(hash);
 
         loop {
             let group = &self.container.groups[gi];
-
-            // Fast path: preferred slot.
-            let c = group.ctrl[hint];
-            if c == tag && unsafe { group.keys[hint].assume_init_ref() }.borrow() == key {
-                return Some(unsafe { group.values[hint].assume_init_ref() });
-            }
-
-            // Slow path: SIMD scan group.
+            // SIMD scan group for tag match.
             let mut tag_mask = group_ops::match_tag(&group.ctrl, tag);
-            tag_mask = group_ops::clear_slot(tag_mask, hint);
             while let Some(i) = group_ops::next_match(&mut tag_mask) {
                 if unsafe { group.keys[i].assume_init_ref() }.borrow() == key {
                     return Some(unsafe { group.values[i].assume_init_ref() });
                 }
             }
-
             if group_ops::match_empty(&group.ctrl) != 0 {
                 return None;
             }
-
             if group.overflow == NO_OVERFLOW {
                 return None;
             }
@@ -334,7 +308,6 @@ impl<K: Hash + Eq, V, S: BuildHasher> HashSortedMap<K, V, S> {
                     return FindResult::Found(group.values[i].as_mut_ptr());
                 }
             }
-
             // Check for empty slot in this group.
             let empty_mask = group_ops::match_empty(&group.ctrl);
             if empty_mask != 0 {
@@ -344,7 +317,6 @@ impl<K: Hash + Eq, V, S: BuildHasher> HashSortedMap<K, V, S> {
                     slot: i,
                 });
             }
-
             // Group full — follow or report end of chain.
             if group.overflow == NO_OVERFLOW {
                 return FindResult::Vacant(Insertion::NeedsOverflow {
@@ -626,7 +598,7 @@ impl<'a, K: Hash + Eq, V, S: BuildHasher> VacantEntry<'a, K, V, S> {
                     // `entry()` and now (we hold the only `&mut self`).
                     (*tail).overflow = new_gi as u32;
                 }
-                (new_group, slot_hint(hash))
+                (new_group, 0)
             }
         };
 
@@ -644,57 +616,18 @@ impl<'a, K: Hash + Eq, V, S: BuildHasher> VacantEntry<'a, K, V, S> {
 }
 
 /// Cold path: the chain was full, the table is at capacity, and we need to
-/// grow before inserting. Re-walks via the slow path after grow.
-///
-/// With clustered hash functions (e.g. identity hashing), the new primary
-/// group may still be full after grow, so we handle `NeedsOverflow` by
-/// allocating an overflow group.
+/// grow before inserting. Grows the map, then re-walks via `entry()` to find
+/// the new insertion slot.
 #[cold]
 #[inline(never)]
 fn insert_after_grow<K: Hash + Eq, V, S: BuildHasher>(
     map: &mut HashSortedMap<K, V, S>,
-    hash: u64,
+    _hash: u64,
     key: K,
     value: V,
 ) -> &mut V {
     map.grow();
-    let tag = tag(hash);
-    match map.find_or_insertion_slot(hash, &key) {
-        FindResult::Vacant(Insertion::Empty { group, slot }) => {
-            // SAFETY: `group` points into `map.container.groups` and is valid for `'a`.
-            unsafe {
-                let g = &mut *group;
-                g.ctrl[slot] = tag;
-                g.keys[slot] = MaybeUninit::new(key);
-                g.values[slot] = MaybeUninit::new(value);
-                map.container.len += 1;
-                g.values[slot].assume_init_mut()
-            }
-        }
-        FindResult::Vacant(Insertion::NeedsOverflow { tail }) => {
-            // Primary group chain is full even after grow (possible with
-            // clustered identity hashes). Allocate an overflow group.
-            debug_assert!(
-                (map.container.num_groups as usize) < map.container.groups.len(),
-                "overflow pool exhausted right after grow"
-            );
-            let new_gi = map.container.num_groups as usize;
-            map.container.num_groups += 1;
-            unsafe {
-                (*tail).overflow = new_gi as u32;
-            }
-            let slot = slot_hint(hash);
-            let group = &mut map.container.groups[new_gi];
-            group.ctrl[slot] = tag;
-            group.keys[slot] = MaybeUninit::new(key);
-            group.values[slot] = MaybeUninit::new(value);
-            map.container.len += 1;
-            unsafe { group.values[slot].assume_init_mut() }
-        }
-        FindResult::Found(_) => {
-            unreachable!("key was not in the table before grow")
-        }
-    }
+    map.entry(key).or_insert(value)
 }
 
 // No custom Drop needed for HashSortedMap — dropping `container` handles entries.
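
Note on the scan-only lookup: the paths touched above now rely solely on `match_tag`, `match_empty`, and `next_match`, with no preferred-slot branch. The sketch below shows that control flow on a single 8-byte group using plain scalar code. It is an illustration under assumptions, not the crate's implementation: the real `group_ops` helpers use NEON/SSE2, and the `u8` mask type, the `CTRL_EMPTY` value, the `u32` key type, and the `find_in_group` helper are all hypothetical.

```rust
// Hypothetical scalar stand-ins for the `group_ops` helpers named in the diff.
// Assumptions: 8 slots per group, a u8 bitmask (bit i = slot i), CTRL_EMPTY == 0.
const GROUP_SIZE: usize = 8;
const CTRL_EMPTY: u8 = 0;

/// Returns a bitmask with bit `i` set when `ctrl[i] == tag`.
fn match_tag(ctrl: &[u8; GROUP_SIZE], tag: u8) -> u8 {
    let mut mask = 0u8;
    for (i, &c) in ctrl.iter().enumerate() {
        if c == tag {
            mask |= 1 << i;
        }
    }
    mask
}

/// Returns a bitmask of the empty slots in the group.
fn match_empty(ctrl: &[u8; GROUP_SIZE]) -> u8 {
    match_tag(ctrl, CTRL_EMPTY)
}

/// Pops the lowest set bit of `mask` and returns its slot index.
fn next_match(mask: &mut u8) -> Option<usize> {
    if *mask == 0 {
        return None;
    }
    let i = mask.trailing_zeros() as usize;
    *mask &= *mask - 1; // clear the lowest set bit
    Some(i)
}

/// Scan-only lookup within one group: every candidate comes from the tag
/// mask, so there is no separate data-dependent check of a hinted slot.
fn find_in_group(
    ctrl: &[u8; GROUP_SIZE],
    keys: &[u32; GROUP_SIZE],
    tag: u8,
    key: u32,
) -> Option<usize> {
    let mut mask = match_tag(ctrl, tag);
    while let Some(i) = next_match(&mut mask) {
        if keys[i] == key {
            return Some(i);
        }
    }
    None
}
```

In this sketch a miss is reported by `find_in_group` returning `None`; the caller would then consult `match_empty` to decide whether the chain ends here or continues into an overflow group, mirroring the order of checks in `get_hashed`.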