2 changes: 2 additions & 0 deletions crates/hash-sorted-map/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,5 @@ repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["hashmap", "sorted", "merge", "simd"]
categories = ["algorithms", "data-structures"]

[dependencies]
118 changes: 102 additions & 16 deletions crates/hash-sorted-map/OPTIMIZATIONS.md
Expand Up @@ -4,8 +4,8 @@

`HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow
chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2),
a **slot-hint fast path**, and an **optimized growth strategy**. It is generic
over key type, value type, and hash builder.
and an **optimized growth strategy**. It is generic over key type, value type,
and hash builder.

This document analyzes the design trade-offs versus
[hashbrown](https://github.com/rust-lang/hashbrown) and records the
Expand Down Expand Up @@ -38,7 +38,6 @@ experimental results that guided the current design.
│ • Overflow chaining (linked groups) │
│ • 8-byte groups with NEON/SSE2/scalar SIMD scan │
│ • EMPTY / FULL tag states only (insertion-only, no deletion) │
│ • Slot-hint fast path │
└──────────────────────────────────────────────────────────────────┘
```

Expand Down Expand Up @@ -106,17 +105,32 @@ the overflow path.
SIMD version** by pessimizing NEON code generation. Removed from the SIMD
implementation, kept in the scalar version.

### 7. Slot Hint Fast Path (Unique to HashSortedMap)
### 7. Slot Hint Fast Path ⚠️ Removed from Lookup Paths

HashSortedMap checks a preferred slot before scanning the group:
Originally, HashSortedMap checked a preferred slot before scanning the group:
```rust
let hint = slot_hint(hash); // 3 bits from hash → slot index
if ctrl[hint] == EMPTY { /* direct insert */ }
if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ }
```

hashbrown does **not** have this optimization — it always does a full SIMD
group scan. The reason why the performance is different is probably due to the different overflow strategies and the different load factors.
**Experimental finding**: This scalar check **hurts performance** on random
Contributor

This makes so much sense. And removing it from lookup paths explains why we can sort the map and it's still a map. Pretty great outcome!

How valuable is it for resizing? Even if it usually hits, surely it's equally fast to find the first empty slot in the group with SIMD, and that will hit even more often.

Collaborator Author

Changed the grow code as well. As a result, all occupied slots are at the beginning, so no more special treatment is needed when sorting.

workloads. Random keys map to random slots, so the hint check becomes a
roughly 50/50 branch that the branch predictor cannot learn and that pollutes
its history. SIMD-only scanning (`match_tag` + `match_empty`) is uniformly
fast regardless of key distribution.

**Results of removing slot_hint from different paths:**
- `find_or_insertion_slot` (entry API): **−25% latency** on merge benchmark
- `get_hashed`: **−4.4% latency** (SIMD scan is faster than branch + scalar check)
- `insert_hashed`: **+7% latency** (regression) on presized insert. The hint
  genuinely helps when inserting into a mostly-empty group, but the regression
  was accepted for code simplicity since the merge workload matters more

**Current state**: slot_hint is **only** used in `insert_for_grow()`, where
the map is guaranteed sparse after a resize (groups are mostly empty, so the
hint slot is very likely free). For all other paths, SIMD-only scanning is
used.
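
As an illustration of what SIMD-only scanning does, here is a minimal scalar (SWAR) sketch of `match_tag` and `match_empty` over one 8-byte control group. It is not the crate's implementation: the `EMPTY = 0x00` encoding, the little-endian `u64` group load, and the helper names `load_group`, `match_zero_bytes`, and `slots` are assumptions made for this example, and the real code uses NEON/SSE2 intrinsics.

```rust
/// Hypothetical encoding: 0x00 marks an empty slot, occupied slots hold a non-zero tag.
const EMPTY: u8 = 0x00;
const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// Load 8 control bytes so that slot `i` occupies bits `8*i .. 8*i+8`.
fn load_group(ctrl: [u8; 8]) -> u64 {
    u64::from_le_bytes(ctrl)
}

/// Sets the high bit of every byte of `x` that is exactly zero.
/// Adding 0x7F to the low 7 bits cannot carry into the next byte,
/// so there are no false positives.
fn match_zero_bytes(x: u64) -> u64 {
    !(((x & !HI) + !HI) | x) & HI
}

/// Mask of slots whose control byte equals `tag`.
fn match_tag(group: u64, tag: u8) -> u64 {
    // Broadcast `tag` to all 8 bytes; XOR turns matching bytes into zero.
    match_zero_bytes(group ^ (LO * u64::from(tag)))
}

/// Mask of empty slots.
fn match_empty(group: u64) -> u64 {
    match_tag(group, EMPTY)
}

/// Expand a match mask into slot indices, lowest slot first.
fn slots(mut mask: u64) -> Vec<usize> {
    let mut out = Vec::new();
    while mask != 0 {
        out.push((mask.trailing_zeros() / 8) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    out
}
```

For a group with control bytes `[0x91, 0xA2, 0x00, 0x91, 0x00, 0xB3, 0xC4, 0x91]`, `slots(match_tag(group, 0x91))` yields `[0, 3, 7]` and `slots(match_empty(group))` yields `[2, 4]`. Both masks come from the same branch-free arithmetic, so the cost does not depend on which slot a key lands in.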

### 8. Overflow Reserve Sizing ✅ Validated

Expand Down Expand Up @@ -159,13 +173,85 @@ entropy in both halves. Also changed trigram generation to use

## Summary of Impact

| Change | Effect on insert time |
|----------------------------|------------------------------|
| Capacity sizing fix | **−50%** (biggest win) |
| Optimized growth path | **−10%** on growth scenarios |
| SIMD group scanning | **−5%** |
| Branch hints (scalar only) | **−2–6%** |
| IdentityHasher fix | Enabled fair comparison |
| Change | Effect |
|---------------------------------|-------------------------------------|
| Capacity sizing fix | **−50%** insert time (biggest win) |
| Optimized growth path | **2× faster** growth than hashbrown |
| SIMD group scanning | **−5%** insert time |
| Slot hint removal (entry/get) | **−25%** merge latency |
| Branch hints (scalar only)      | **−2–6%** insert time               |
| IdentityHasher fix | Enabled fair comparison |

---

The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts,
**beats all hashbrown variants** on overwrites, and has **2× faster growth**.
## Benchmark Results (Apple M-series, aarch64 NEON)

### Insert (1000 trigrams, pre-sized)

| Implementation       | Time (µs) | vs hashbrown+Identity |
|----------------------|-----------|-----------------------|
| FoldHashMap          | 2.44      | −7%                   |
| FxHashMap            | 2.61      | −1%                   |
| hashbrown+Identity   | 2.63      | baseline              |
| hashbrown::HashMap   | 2.74      | +4%                   |
| std::HashMap+FNV     | 3.18      | +21%                  |
| AHashMap             | 3.38      | +29%                  |
| **HashSortedMap**    | **3.46**  | **+32%**              |
| std::HashMap         | 8.65      | +229%                 |

### Reinsert (1000 trigrams, all keys exist)

| Implementation | Time (µs) |
|----------------------|-----------|
| hashbrown+Identity | 2.50 |
| **HashSortedMap** | **2.70** |

### Growth (128 → 1000 trigrams, 3 resize rounds)

| Implementation | Time (µs) |
|----------------------|-----------|
| **HashSortedMap** | **5.35** |
| hashbrown+Identity | 10.12 |

### Count (4000 trigrams, mixed insert/update)

| Implementation                         | Time (µs) |
|----------------------------------------|-----------|
| hashbrown+Identity entry()             | 4.89      |
| **HashSortedMap entry().or_default()** | **5.44**  |
| **HashSortedMap get_or_default**       | **5.48**  |

### Iteration (1000 trigrams)

| Implementation                 | Time (ns) |
|--------------------------------|-----------|
| **HashSortedMap iter()**       | **794**   |
| **HashSortedMap into_iter()**  | **998**   |
| hashbrown+Identity iter()      | 1,067     |
| hashbrown+Identity into_iter() | 1,060     |

### Sort (100K trigrams)

| Implementation                 | Time (µs) |
|--------------------------------|-----------|
| **HashSortedMap sort_by_hash** | **706**   |
| Vec::sort_unstable             | 984       |

### Merge (100 maps × 100K keys each → sorted output)

| Implementation | Time (ms) | vs HSM merge+sort |
|-----------------------------------|-----------|--------------------|
| hashbrown merge presized | 30.4 | −46% |
| **HashSortedMap merge presized** | **37.3** | **−33%** |
| **HashSortedMap merge (no sort)** | **44.0** | **−21%** |
| hashbrown merge | 45.4 | −19% |
| **HashSortedMap merge + sort** | **55.9** | **baseline** |
| hashbrown merge + Vec sort | 58.7 | +5% |
| k-way merge sorted vecs | 445 | +696% |

**Key takeaways:**
- HashSortedMap has **~1.9× faster growth** than hashbrown+Identity (5.35 µs vs 10.12 µs)
- **25% faster iteration** than hashbrown+Identity (dense group layout)
- **sort_by_hash is 28% faster** than Vec::sort_unstable (data is partially sorted by group)
- **merge + sort is 5% faster** than hashbrown merge + Vec sort (the primary use case)
- Pre-sized insert is 32% slower than hashbrown+Identity (trade-off for sort/merge efficiency)
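
The percentages in these tables and takeaways are relative time differences against a baseline. Below is a small sketch of that arithmetic. The helper names `relative_change_pct` and `speedup` are hypothetical and are not part of the crate or the benchmark harness.

```rust
/// Relative time difference of `time` versus `baseline`, in percent.
/// Negative means `time` is faster than the baseline.
fn relative_change_pct(time: f64, baseline: f64) -> f64 {
    (time - baseline) / baseline * 100.0
}

/// How many times faster `time` is compared with `baseline`.
fn speedup(time: f64, baseline: f64) -> f64 {
    baseline / time
}
```

With the numbers above, `relative_change_pct(3.46, 2.63)` is about +31.6 (pre-sized insert), `relative_change_pct(706.0, 984.0)` is about −28.3 (sort), and `speedup(5.35, 10.12)` is about 1.89 (growth).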
57 changes: 18 additions & 39 deletions crates/hash-sorted-map/README.md
Expand Up @@ -42,45 +42,24 @@ keys, which means:

## Benchmark results

All benchmarks insert 1000 random trigram hashes (scrambled with
`folded_multiply`) into maps with various configurations. Measured on Apple
M-series (aarch64).

### Insert 1000 trigrams — pre-sized, no growth

| Rank | Map | Time (µs) | vs best |
|------|-----|-----------|---------|
| 🥇 | FoldHashMap | 2.44 | — |
| 🥈 | FxHashMap | 2.61 | +7% |
| 🥉 | hashbrown::HashMap | 2.67 | +9% |
| 4 | **HashSortedMap** | **2.71** | +11% |
| 5 | hashbrown+Identity | 2.74 | +12% |
| 6 | std::HashMap+FNV | 3.27 | +34% |
| 7 | AHashMap | 3.22 | +32% |
| 8 | std::HashMap | 8.49 | +248% |

### Re-insert same keys (all overwrites)

| Map | Time (µs) |
|-----|-----------|
| **HashSortedMap** | **2.36** ✅ |
| hashbrown+Identity | 2.58 |

### Growth from small (`with_capacity(128)`, 3 resize rounds)

| Map | Time (µs) | Growth penalty |
|-----|-----------|----------------|
| **HashSortedMap** | **4.85** | +2.14 |
| hashbrown+Identity | 9.77 | +7.03 |

### Key takeaways

- **HashSortedMap matches the fastest hashbrown configurations** on pre-sized
first-time inserts and is **the fastest for overwrites**.
- **Growth is ~2× faster** than hashbrown thanks to the optimized
`insert_for_grow` path that skips duplicate checking and uses raw copies.
- The remaining gap to FoldHashMap (~11%) comes from foldhash's extremely
efficient hash function that pipelines well with hashbrown's SIMD scan.
Latest local Criterion snapshot from this repository's
`target/criterion` outputs (lower is better):

| Scenario | HashSortedMap | Comparison | Result |
| :------------------------------------------- | ------------: | :------------------------------------- | :---------- |
| Insert 1000 trigrams (pre-sized) | 7.34 µs | hashbrown::HashMap: 12.88 µs | ~43% faster |
Contributor

This doesn't agree with the other file, which puts us 32% slower than hashbrown on the same microbenchmark. Different architecture?

Contributor

It would be best to have all the benchmarks in one place and up to date, and preferably all on Intel if that's what we think most cloud servers have.

Collaborator Author

Reran on a Codespaces Intel CPU.
Runs sometimes vary pretty drastically, and the Xeon CPUs are generally much slower than my M4 machine :(

We might want to retest on a dedicated production machine at some point.

| Grow from capacity 128 | 20.54 µs | hashbrown+Identity: 23.17 µs | ~11% faster |
| Count 4000 trigrams (`entry().or_default()`) | 12.70 µs | hashbrown+Identity `entry()`: 13.53 µs | ~6% faster |
| Iterate 1000 trigrams (`iter()`) | 3.93 µs | hashbrown+Identity `iter()`: 2.87 µs | ~37% slower |
| Sort 100000 trigrams by hash | 1.83 ms | `Vec::sort_unstable`: 2.09 ms | ~12% faster |
| Merge 100 sorted maps + final sort | 161.93 ms | hashbrown merge + vec sort: 234.70 ms | ~31% faster |

Key takeaways:

- `HashSortedMap` is strongest on insert-heavy and merge/sort-heavy paths.
- Iteration throughput is currently behind `hashbrown+Identity`.
- In workloads that need deterministic hash-order serialization, the merge and
sort advantages can outweigh the iteration gap.
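
The merge-then-sort baseline in the table can be sketched with standard-library types. This only illustrates the shape of the workload: it uses `std::collections::HashMap` rather than `HashSortedMap` or hashbrown, and the function name `merge_and_sort`, the `u32` key and count types, and the summing merge rule are assumptions for this example.

```rust
use std::collections::HashMap;

/// Merge several count maps keyed by a precomputed hash, then return the
/// entries in ascending key (hash) order.
fn merge_and_sort(maps: &[HashMap<u32, u32>]) -> Vec<(u32, u32)> {
    // Pre-size for the worst case where no keys overlap.
    let total: usize = maps.iter().map(|m| m.len()).sum();
    let mut merged: HashMap<u32, u32> = HashMap::with_capacity(total);
    for map in maps {
        for (&key, &count) in map {
            *merged.entry(key).or_default() += count;
        }
    }
    // A plain hash map has no useful iteration order, so a separate sort
    // pass over a Vec is needed to get deterministic output.
    let mut out: Vec<(u32, u32)> = merged.into_iter().collect();
    out.sort_unstable_by_key(|&(key, _)| key);
    out
}
```

The separate `Vec` collection and sort pass is the step that the "Merge 100 sorted maps + final sort" row compares against.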

## Running

Expand Down
1 change: 1 addition & 0 deletions crates/hash-sorted-map/benchmarks/Cargo.toml
Expand Up @@ -21,3 +21,4 @@ ahash = "0.8"
hashbrown = "0.15"
foldhash = "0.1"
fnv = "1"
itertools = "0.14"