Chapter 22: Algorithms & Data Structures for Real Work

Part I — Foundations | Prerequisites: None | Difficulty: Intermediate → Advanced

Not LeetCode grinding. This is the set of data structures and algorithms you will actually reach for in production — the ones hiding inside Redis, Postgres, Cassandra, and Elasticsearch, doing the heavy lifting you've been taking for granted. Once you see them, you can't unsee them.

In This Chapter

Complexity Analysis That Matters
Hash-Based Structures
Tree Structures in Databases & Systems
Graph Algorithms in Infrastructure
Probabilistic Data Structures
Practical Implementations
Sorting & Searching in the Real World
Search Engineering

Related Chapters

Ch 1 (consistent hashing, distributed systems)
Ch 24 (database internals — B-trees, LSM trees in depth)
Ch 6 (lock-free data structures)

1. Complexity Analysis That Matters

Here is the truth about Big-O: it is not an academic exercise designed to torment junior engineers before coding interviews. It is how you predict whether your system will survive at 10x the current traffic. The difference between O(n) and O(n^2) is the difference between a 200ms response and a 20-minute timeout. When your on-call pager goes off at 3am and the logs show a query that used to take 80ms now taking 45 seconds, Big-O is how you diagnose the problem in five minutes instead of three hours.

1.1 Big-O Refresher with Real-World Examples

Big-O describes how an operation's cost grows as the input size (n) increases. Here is every complexity class you will actually encounter in backend work:

Complexity	Name	Real-World Example
O(1)	Constant	Hash map lookup, array index access, checking if a bit is set
O(log n)	Logarithmic	Binary search, B-tree index lookup in Postgres, finding an element in a balanced BST
O(n)	Linear	Scanning every row in a table without an index, iterating a linked list, `grep` through a file
O(n log n)	Linearithmic	Sorting query results (ORDER BY without an index), building a B-tree index from scratch
O(n^2)	Quadratic	Nested loop join without indexes, comparing every pair of items (naive deduplication)

To put these in perspective with concrete numbers:

n = 1,000,000 (one million rows)

O(1)        →  1 operation
O(log n)    →  20 operations
O(n)        →  1,000,000 operations
O(n log n)  →  20,000,000 operations
O(n²)       →  1,000,000,000,000 operations  ← this kills your server

The jump from O(n) to O(n^2) is where systems fall over. A query that scans 1M rows in 100ms will take 100,000 seconds with an O(n^2) algorithm. This is the number one reason to add database indexes. Every nested loop over a large dataset is a landmine waiting to detonate at scale.

1.2 Amortized Analysis

Some operations are expensive occasionally but cheap on average. Amortized analysis accounts for this — and it matters more than most engineers realize.

Dynamic arrays (ArrayList, Go slices, Python lists) are the canonical example:

Append operation:
- Usually O(1): write to the next empty slot
- Occasionally O(n): array is full, allocate 2x the space, copy everything over

But if the array doubles each time:
- After n appends, total work = n + n/2 + n/4 + n/8 + ... ≈ 2n
- Average cost per append = 2n / n = O(1) amortized

This matters in practice. When someone says "appending to a slice is O(1)," they mean amortized O(1). If you are building a system that cannot tolerate occasional latency spikes — say, a real-time trading system where a 10ms hiccup triggers a margin call — you care about the worst case, not the amortized case. Pre-allocate your arrays.

Hash map resize follows the same pattern. When the load factor exceeds a threshold (typically 0.75), the map doubles its bucket count and rehashes every key. Individual inserts are O(1) amortized but occasionally O(n). In Go, if you know upfront how big a map will get, make(map[K]V, expectedSize) avoids those resize pauses entirely.

1.3 Space-Time Trade-offs

Almost every design decision in backend engineering is a space-time trade-off. You are constantly negotiating between how much memory you spend and how fast things run.

Trading space for time (caching):

Database query cache: store result sets to avoid recomputation
CDN: replicate content closer to users (more storage, less latency)
Denormalization: store redundant data to avoid JOINs
Materialized views: precompute expensive aggregations

Trading accuracy for space (probabilistic structures):

Bloom filters: 1% false positive rate in 1MB vs exact set membership in 100MB
HyperLogLog: count distinct elements with 0.81% error using 12KB vs exact count requiring O(n) memory
Count-Min Sketch: approximate frequency counts in bounded memory

Trading time for space (compression):

Gzip responses: spend CPU cycles to reduce bandwidth
Column-oriented storage: compress similar values together
Delta encoding: store differences instead of absolute values

The probabilistic structures in particular are underused by most engineers. You'll meet them properly in Section 5 — they're one of the most satisfying parts of this chapter.

1.4 Why Constants Matter

Big-O hides constant factors, but constants dominate at practical scales. This is the most commonly misunderstood nuance about complexity analysis.

Algorithm A: O(n)     with constant factor 1000  → 1000n operations
Algorithm B: O(n²)    with constant factor 1     → n² operations

Crossover point: 1000n = n²  →  n = 1000

For n < 1000: Algorithm B is faster
For n > 1000: Algorithm A is faster

Real examples where constants matter:

Insertion sort vs quicksort for small arrays: Insertion sort is O(n^2) but has tiny constants (no recursion, good cache locality). Most standard library sort implementations switch to insertion sort for arrays under 10-20 elements. That's not a bug — it's intentional.
Hash map vs linear scan for small collections: For fewer than ~10 elements, linear scan through an array beats a hash map because hash computation, memory indirection, and cache misses cost more than scanning a handful of elements.
B-tree fan-out: A B-tree with node size matching a disk page (4KB) can store ~500 keys per node. A binary search tree stores 1 key per node. Same O(log n) lookup, but the B-tree does log_500(n) disk reads vs log_2(n). For 1 billion keys, that is 3 disk reads vs 30. The algorithm is the same; the constant is everything. (More on this in Section 3.3 and Ch 24.)

1.5 The Operations That Matter

For backend work, these are the operations you evaluate data structures against:

Operation	Why It Matters
Insert	How fast can you write new data? (API writes, log ingestion, event processing)
Lookup (point query)	Find one record by key. The bread and butter of web applications.
Range scan	Find all records in a range. Time-series queries, pagination, analytics.
Delete	Remove data. Harder than it sounds — tombstones, compaction, fragmentation.
Iterate (full scan)	Process every element. Batch jobs, migrations, analytics.

Different data structures optimize for different operations. There is no structure that is best at everything — if there were, you would only need one:

               Insert    Lookup    Range Scan    Delete    Iterate
Hash Map       O(1)*     O(1)*     O(n)          O(1)*     O(n)
B-Tree         O(log n)  O(log n)  O(log n + k)  O(log n)  O(n)
LSM-Tree       O(1)*     O(log n)  O(log n + k)  O(1)*     O(n)
Skip List      O(log n)  O(log n)  O(log n + k)  O(log n)  O(n)
Sorted Array   O(n)      O(log n)  O(log n + k)  O(n)      O(n)

* = amortized
k = number of elements in the range

This table is the cheat sheet for every "which database should I use?" conversation. Hash maps win on point lookups. B-trees win on range scans. LSM trees win on writes. You choose based on your workload.

2. Hash-Based Structures

Hash maps are the most important data structure in backend engineering. If you understand nothing else in this chapter, understand hash maps. They are in Redis (literally every data type uses hashing under the hood), in your programming language's standard library, in Kafka's partition assignment, and in the lookup table your CPU uses to find memory pages. They are everywhere, and they are almost always the right first choice.

2.1 How Hash Maps Work Internally

Think of a hash map as a perfect filing cabinet. You walk up, say the key, and a very fast clerk computes exactly which drawer to open. No scanning, no searching — just compute and retrieve.

put(key, value):
    1. Compute hash = hash_function(key)         # e.g., "user:123" → 0xA3F2...
    2. Compute index = hash % num_buckets         # e.g., 0xA3F2... % 16 = 7
    3. Store (key, value) in bucket[7]

get(key):
    1. Compute hash = hash_function(key)
    2. Compute index = hash % num_buckets
    3. Look in bucket[index] for matching key
    4. Return value (or "not found")

The hash function must be:

Deterministic: same key always produces the same hash
Uniform: keys should spread evenly across buckets
Fast: hashing should be cheaper than the alternative (linear scan)

Common hash functions in practice:

MurmurHash3: fast, good distribution, used by many hash map implementations
SipHash: DoS-resistant (Python dicts, Rust HashMaps use it to prevent hash-flooding attacks)
xxHash: extremely fast, used for checksums and non-cryptographic hashing
CityHash/FarmHash: Google's fast hash families, optimized for short strings

2.2 Why Hash Maps Are O(1) Amortized

With a good hash function and reasonable load factor, most buckets contain 0 or 1 entries. The hash computation is O(1) (fixed work regardless of map size), and jumping to a bucket index is O(1) (array index access). Therefore, the expected lookup time is O(1).

But this is an average. In the worst case, every key hashes to the same bucket, and lookup degenerates to O(n) — scanning a linked list of all entries. This is why hash function quality matters, and why language runtimes use randomized hash seeds to prevent attackers from crafting collision-heavy inputs (hash-flooding DoS attacks). Python's dict randomizes its seed at startup for exactly this reason.

Load factor = number of entries / number of buckets. When the load factor exceeds a threshold (typically 0.75 in Java, 6.5 average per bucket in Go), the map resizes — doubles the bucket count and rehashes every key. This is the O(n) operation that makes the amortized cost still O(1).

2.3 Hash Collisions: Chaining vs Open Addressing

When two keys hash to the same bucket, you have a collision. Two strategies for dealing with it, and the right choice depends on your hardware:

Chaining (separate chaining):

bucket[7] → (key1, val1) → (key2, val2) → (key3, val3)

Each bucket is a linked list (or in Java 8+, a red-black tree if
the chain exceeds 8 elements).

Pros: Simple, load factor can exceed 1.0, deletion is easy
Cons: Pointer chasing (cache-unfriendly), extra memory for pointers
Used by: Java HashMap, Go map (with overflow buckets)

Open addressing:

Collision at bucket[7]?
- Linear probing:   try 7, 8, 9, 10, ...
- Quadratic probing: try 7, 7+1², 7+2², 7+3², ...
- Robin Hood hashing: linear probing, but swap entries to minimize
                      max probe distance (more fair distribution)

Pros: Cache-friendly (data in contiguous memory), no pointer overhead
Cons: Clustering (long probe sequences), deletion requires tombstones,
      load factor must stay below ~0.7
Used by: Python dict (open addressing with perturbation),
         Rust HashMap (Robin Hood → SwissTable),
         Google's Swiss Tables (SIMD-accelerated probing)

In practice: Modern high-performance hash maps use open addressing with SIMD instructions to probe multiple slots simultaneously (Google's SwissTable, adopted by Rust and Abseil C++). The cache-friendliness of open addressing wins on modern hardware because cache misses are far more expensive than a few extra comparisons.

2.4 Hash Sets

A hash set is simply a hash map where you only care about the keys (values are ignored or boolean). All the same internals apply. You already use these constantly, probably without thinking about them.

Use cases you encounter constantly:

Deduplication: seen = set() to skip already-processed records
Membership testing: is this user ID in the blocklist?
Set operations: intersection (common friends), union (merge results), difference (what changed?)

# Deduplication during event processing
processed_ids = set()
for event in event_stream:
    if event.id in processed_ids:    # O(1) lookup
        continue                      # skip duplicate
    process(event)
    processed_ids.add(event.id)       # O(1) insert

The alternative — a sorted list with binary search — gives O(log n) membership tests. For a million events, that is the difference between 1 operation and 20. Hash sets are the right tool for this job almost every time.

2.5 Consistent Hashing

Standard hashing (hash(key) % N) falls apart when you add or remove servers. If N changes from 4 to 5, nearly every key maps to a different server — cache miss storm, mass data migration. This is the problem that brought down caches across entire tech companies before consistent hashing became standard.

Consistent hashing fixes this by mapping both keys and servers onto a ring (0 to 2^32 - 1):

The Ring (0 to 2³²-1):

        ServerA(hash=1000)
           /
    ------*-----------
   /                   \
  |    key1(hash=800)   |
  |         ↓           |
  |    maps to ServerA  |
  |    (next server     |
  |     clockwise)      |
   \                   /
    ------*-----------
           \
        ServerB(hash=5000)

Rule: A key maps to the first server found going clockwise from
      the key's hash position on the ring.

Virtual nodes solve the problem of uneven distribution. Instead of placing each server once on the ring, place it 100-200 times (with different hash values). This smooths out the distribution so no single server holds a disproportionate share of the keyspace:

class ConsistentHashRing:
    def __init__(self, nodes=None, num_virtual=150):
        self.ring = {}              # hash_value → node
        self.sorted_keys = []       # sorted hash values for binary search
        self.num_virtual = num_virtual
        if nodes:
            for node in nodes:
                self.add_node(node)

    def add_node(self, node):
        for i in range(self.num_virtual):
            virtual_key = f"{node}:vn{i}"
            hash_val = self._hash(virtual_key)
            self.ring[hash_val] = node
            self.sorted_keys.append(hash_val)
        self.sorted_keys.sort()

    def remove_node(self, node):
        for i in range(self.num_virtual):
            virtual_key = f"{node}:vn{i}"
            hash_val = self._hash(virtual_key)
            del self.ring[hash_val]
            self.sorted_keys.remove(hash_val)

    def get_node(self, key):
        if not self.ring:
            return None
        hash_val = self._hash(key)
        # Find first server clockwise (binary search for next largest hash)
        idx = bisect.bisect_right(self.sorted_keys, hash_val)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

Why it minimizes redistribution: When a server is added, it takes ownership of a portion of the ring from its clockwise neighbor. Only keys in that arc are reassigned — roughly 1/N of all keys, not all of them. When a server is removed, its keys move to the next clockwise server.

Used by: Amazon DynamoDB (partition assignment), Apache Cassandra (token ring), Memcached client-side sharding, Nginx upstream consistent hashing, Akka Cluster Sharding. When you're configuring a Cassandra cluster and wondering how it decides which node owns which rows — this is it.

2.6 Hash-Based Algorithms in Practice

Deduplication at scale:

Problem: Process 1 billion events, skip duplicates
Solution 1: Hash set in memory (requires ~16GB for 1B 128-bit hashes)
Solution 2: Bloom filter (1% false positive rate in ~1.2GB)
Solution 3: Hash to partition, deduplicate per partition (distributed)

Counting distinct elements:

Problem: How many unique users visited in the last 24 hours?
Exact: Store every user ID in a set → O(n) memory
Approximate: HyperLogLog → 12KB memory, 0.81% error (Redis PFCOUNT)

Partitioning (how Kafka and DynamoDB assign data to shards):

partition = hash(key) % num_partitions

Kafka: Messages with the same key always go to the same partition
       (guarantees ordering per key)
DynamoDB: hash(partition_key) determines which storage node holds the item

Every time you set a partition key in Kafka, you are doing distributed hash routing. The same algorithm, at scale, across a cluster.

3. Tree Structures in Databases & Systems

Trees are how databases organize data on disk. If you understand B-trees and LSM-trees, you understand how virtually every database works under the hood. And because Ch 24 covers database internals in depth, this chapter will give you the mental model — the "why" — and Ch 24 will go deep on implementation details, page formats, and tuning.

3.1 Binary Search Trees (BST)

A BST maintains the invariant: left child < parent < right child. This gives O(log n) lookup, insert, and delete — but only if the tree is balanced.

Balanced BST (height = 3):          Degenerate BST (height = 5):

        4                            1
       / \                            \
      2   6                            2
     / \ / \                            \
    1  3 5  7                            3
                                          \
                                           4
                                            \
                                             5

Lookup: O(log n) = O(3)             Lookup: O(n) = O(5)

An unbalanced BST is just a linked list with extra steps. Insert sorted data into a naive BST and you get O(n) everything — which is why databases would never use a plain BST. This is why self-balancing trees exist.

3.2 Red-Black Trees and AVL Trees

Both are self-balancing BSTs that guarantee O(log n) height through rotation operations on insert and delete. They are the workhorses of in-memory ordered data structures.

Red-Black Trees:

Every node is red or black
Root is black, leaves (NIL) are black
Red nodes cannot have red children
Every path from root to leaf has the same number of black nodes
Guarantees: height <= 2 * log2(n+1)
Slightly less balanced than AVL but fewer rotations on insert/delete

AVL Trees:

Heights of left and right subtrees differ by at most 1
Stricter balancing = faster lookups, but more rotations on modification
Better for read-heavy workloads

Where you encounter them:

java.util.TreeMap, java.util.TreeSet — Red-Black Tree
std::map, std::set in C++ — typically Red-Black Tree
Linux kernel CFS (Completely Fair Scheduler) — Red-Black Tree for process scheduling by virtual runtime
In-memory ordered indexes in databases
Redis sorted sets' backing tree (alongside the skip list — they keep both)

You almost never implement these yourself. You use them through standard library ordered maps/sets when you need sorted iteration, range queries, or finding the min/max efficiently. But understanding them helps you understand why TreeMap is slower than HashMap on point lookups.

3.3 B-Trees: How Database Indexes Actually Work

B-Trees are the most important data structure in databases. Every time you create an index in Postgres or MySQL, you are building a B-Tree (technically, a B+ Tree). That CREATE INDEX statement you run in your migration? This is what happens.

Why not a binary tree? Disk I/O. A binary tree with 1 billion entries has height log2(10^9) = 30. That is 30 disk reads per lookup. A B-Tree with a branching factor of 500 (typical for 4KB pages) has height log500(10^9) = 3. Three disk reads to find any record among a billion.

The B-tree is designed around one key insight: disk reads are catastrophically slow compared to memory operations, but reading a large chunk at once is almost free. So pack as many keys as possible into each page and minimize the number of pages you read.

B-Tree node structure:

Each node (= one disk page, typically 4KB or 8KB):

┌────────────────────────────────────────────────────┐
│ key1 │ key2 │ key3 │ ... │ keyN │                  │
│  ↓      ↓      ↓      ↓     ↓     ↓               │
│ ptr0  ptr1   ptr2   ptr3   ptrN  ptrN+1            │
└────────────────────────────────────────────────────┘

ptr0 → all keys < key1
ptr1 → all keys >= key1 and < key2
ptr2 → all keys >= key2 and < key3
...

Lookup walks from root to leaf, doing binary search within each node to pick the right child pointer. With pages cached in the buffer pool, the root and first few levels are almost always in memory, so a lookup typically does 1-2 actual disk reads.

Insert finds the correct leaf, inserts the key. If the leaf is full, it splits into two nodes and pushes the middle key up to the parent. Splits can cascade up but are rare.

B+ Trees (what databases actually use):

B-Tree: data stored in internal nodes AND leaves
B+ Tree: data stored ONLY in leaves; internal nodes are pure index

Internal nodes:  [  10  |  20  |  30  ]
                 /    |      |      \
Leaf nodes:  [1,5,8] → [10,12,15] → [20,22,28] → [30,35,40]
                 ↑           ↑            ↑            ↑
              linked list connecting all leaves

Advantages of B+ Trees:
1. Internal nodes hold more keys (no data pointers) → higher fan-out → shorter tree
2. Leaves form a linked list → range scans are sequential reads (fast!)
3. All lookups go to leaves → more predictable performance

This is why WHERE id BETWEEN 100 AND 200 is fast with an index: the database finds leaf 100 via the tree, then follows the linked list to leaf 200, reading sequential pages. No random access — just a scan.

This is why SELECT * FROM users ORDER BY created_at is fast with an index on created_at: the leaves are already in order, just scan the linked list. The sort is free.

See Ch 24 for a deep dive into how Postgres and MySQL implement B+ trees, their page formats, and how the buffer pool caches work.

3.4 LSM Trees: Write-Optimized Databases

B-Trees are optimized for reads (O(log n) with high fan-out). But every write to a B-Tree requires a random disk seek to find the right page. For write-heavy workloads — logging, time-series, event ingestion, IoT sensor data — this becomes a bottleneck. You're burning through I/O budget just finding the right place to write.

Log-Structured Merge Trees (LSM Trees) flip the script. Instead of writing to the right place immediately, they write everything sequentially and sort it out later:

Write path:

1. Write to WAL (Write-Ahead Log) on disk           ← sequential write (fast)
2. Write to MemTable (in-memory sorted structure)    ← in-memory (fast)
3. When MemTable is full → flush to SSTable on disk  ← sequential write (fast)

                ┌─────────────┐
  Writes ──→    │  MemTable   │  (sorted, in-memory, typically a skip list or red-black tree)
                │  (4MB-64MB) │
                └──────┬──────┘
                       │ flush when full
                       ▼
              ┌──────────────────┐
    Level 0:  │ SSTable │ SSTable │  (sorted, immutable files on disk)
              └────────┬─────────┘
                       │ compaction (merge sorted files)
                       ▼
              ┌───────────────────────────┐
    Level 1:  │ SSTable │ SSTable │ SSTable │  (10x larger than Level 0)
              └────────┬──────────────────┘
                       │ compaction
                       ▼
              ┌──────────────────────────────────────┐
    Level 2:  │ SSTable │ SSTable │ SSTable │ SSTable │  (10x larger)
              └──────────────────────────────────────┘

Read path (slower than B-Tree):

Check MemTable
Check each SSTable level (newest first), using Bloom filters to skip SSTables that definitely do not contain the key
Merge results

Compaction merges SSTables to:

Remove deleted keys (tombstones)
Remove older versions of updated keys
Reduce the number of files to check on reads

Write amplification: a single write may be rewritten multiple times as it moves through compaction levels. If there are L levels and each is 10x larger, write amplification is ~10 * L. This is the price you pay for write speed.

The Bloom filter integration is elegant: before reading any SSTable, Cassandra checks a Bloom filter that tells it "this key is definitely not here" with zero disk I/O. You'll see Bloom filters again in Section 5.1.

3.5 B-Tree vs LSM-Tree

Property	B-Tree	LSM-Tree
Write throughput	Moderate (random I/O)	High (sequential I/O)
Read latency	Low (1-3 disk reads)	Higher (check multiple levels)
Range scans	Fast (B+ tree leaf list)	Fast after compaction
Space amplification	Low (~page fill factor ~70%)	Can be high (multiple copies during compaction)
Write amplification	Moderate (page rewrites)	High (compaction rewrites)
Predictable latency	Yes	No (compaction spikes)
Used by	Postgres, MySQL (InnoDB), SQL Server	Cassandra, RocksDB, LevelDB, HBase, ScyllaDB

Rule of thumb: B-Tree for read-heavy, mixed workloads (most web apps). LSM-Tree for write-heavy workloads (logging, metrics, time-series, IoT ingestion). When your write rate is so high that Postgres is struggling, the answer is often "switch to a database built on RocksDB." Now you know why.

Ch 24 covers both structures in depth — including page-level B-tree operations, compaction strategies (leveled vs tiered vs FIFO), and how to tune them for your workload.

3.6 Tries (Prefix Trees)

A trie stores strings character-by-character along tree edges. Common prefix = shared path. It is the reason autocomplete feels instant — the work is done at index time, not query time.

Storing: "cat", "car", "card", "do", "dog"

        (root)
        /    \
       c      d
       |      |
       a      o
      / \      \
     t   r      g
         |
         d

Lookup "car":  root → c → a → r  ✓ (3 steps, regardless of how many words are stored)
Lookup "cab":  root → c → a → ?  ✗ (no 'b' child)

Where you encounter tries:

Autocomplete / typeahead: traverse the trie to the prefix, then enumerate all descendants
IP routing tables: longest-prefix match on IP addresses (trie variant called a Patricia/Radix tree) — your router is running trie lookups on every packet
Spell checking: traverse the trie to find similar words within edit distance
HTTP routers: many web frameworks use radix trees for route matching (Go's httprouter) — this is how your GET /users/:id routes are matched

3.7 Skip Lists

A skip list is a probabilistic alternative to balanced BSTs. It is a layered linked list where higher layers skip over elements, enabling O(log n) search — and it is dramatically simpler to implement correctly than a red-black tree.

Level 3: HEAD ─────────────────────────────────── 50 ──────────── NIL
Level 2: HEAD ────────── 20 ───────────────────── 50 ──── 70 ─── NIL
Level 1: HEAD ──── 10 ── 20 ──── 30 ──── 40 ──── 50 ──── 70 ─── NIL
Level 0: HEAD ─ 5─ 10 ── 20 ─ 25─ 30 ─ 35─ 40 ── 50 ─60─ 70 ─80 NIL

To find 35:
1. Start at Level 3 HEAD → 50 (too far) → stay at HEAD
2. Drop to Level 2 HEAD → 20 (ok) → 50 (too far) → stay at 20
3. Drop to Level 1: 20 → 30 (ok) → 40 (too far) → stay at 30
4. Drop to Level 0: 30 → 35 (found!)

Each element is promoted to the next level with probability p (typically 0.5 or 0.25). This gives O(log n) expected height and O(log n) expected search time.

Why skip lists matter in practice:

Redis sorted sets (ZSET): Uses a skip list for the ordered index. Chosen over red-black trees because skip lists are simpler to implement, easier to reason about concurrently, and range queries are natural (just walk the bottom level). Next time you call ZRANGEBYSCORE, you're walking a skip list.
LevelDB / RocksDB MemTable: The in-memory sorted structure before flushing to SSTables. Skip lists allow concurrent reads and writes without global locks.
MemSQL / SingleStore: Uses lock-free skip lists for in-memory indexes.

Skip lists are easier to make concurrent than balanced BSTs because they avoid complex tree rotations — inserts only modify local pointers. When you see a database advertise "lock-free reads," there is often a skip list underneath.

4. Graph Algorithms in Infrastructure

Graphs are everywhere in backend systems, even when you do not think of them as graphs. Once you start seeing the world as graphs, you cannot stop:

Service dependency graph: Service A calls Service B, which calls Service C
Database entity-relationship diagram: Users have Orders, Orders have Items
Network topology: Routers and switches connected by links
Permission model: User is member of Group, Group has Role, Role grants Permission
Task dependencies: Migration B depends on Migration A

That last one is something most engineers have actually debugged — when a migration fails because it ran in the wrong order. That's a topological sort problem.

4.1 Graph Representations

Two ways to store a graph:

Adjacency List (most common in practice):

# Each node stores a list of its neighbors
graph = {
    "service-a": ["service-b", "service-c"],
    "service-b": ["service-d", "database"],
    "service-c": ["service-d", "cache"],
    "service-d": ["database"],
    "database":  [],
    "cache":     []
}

# Space: O(V + E)  — vertices + edges
# Check if edge exists: O(degree of vertex) — scan neighbor list
# Iterate neighbors: O(degree of vertex)

Adjacency Matrix:

              a  b  c  d  db  cache
service-a  [  0  1  1  0   0    0 ]
service-b  [  0  0  0  1   1    0 ]
service-c  [  0  0  0  1   0    1 ]
service-d  [  0  0  0  0   1    0 ]
database   [  0  0  0  0   0    0 ]
cache      [  0  0  0  0   0    0 ]

# Space: O(V²)
# Check if edge exists: O(1) — direct array lookup
# Iterate neighbors: O(V) — scan entire row

When to use which:

Adjacency list: sparse graphs (most real systems), memory-efficient, better for traversal
Adjacency matrix: dense graphs, or when you need O(1) edge existence checks (rare in backend work)

In practice, you almost always use adjacency lists. Most real graphs — service dependencies, social graphs, permission hierarchies — are sparse. An adjacency matrix for 1000 services would be 1000×1000 = 1M cells, most of them zero.

4.2 BFS (Breadth-First Search)

BFS explores all nodes at distance 1, then distance 2, then distance 3, and so on. It finds the shortest path in unweighted graphs. Think of it as ripples spreading out from a stone dropped in water.

from collections import deque

def bfs_shortest_path(graph, start, target):
    """Find shortest path (fewest hops) between two nodes."""
    queue = deque([(start, [start])])
    visited = {start}

    while queue:
        node, path = queue.popleft()

        if node == target:
            return path

        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [neighbor]))

    return None  # no path exists

Where BFS is used:

Social networks: "degrees of separation" between users (LinkedIn's connection degree, "You and Alice share 3 mutual connections")
Service dependency blast radius: "if service X goes down, what is affected within 2 hops?" — BFS from X to enumerate the impact
Network routing: finding the fewest hops between two hosts
Garbage collection: mark-and-sweep (BFS from root objects to find all reachable objects — everything not found is garbage)

4.3 DFS (Depth-First Search)

DFS explores as deep as possible along each branch before backtracking. It is simpler to implement (naturally recursive) and uses less memory than BFS for deep graphs. Where BFS thinks in rings, DFS thinks in paths.

def dfs(graph, node, visited=None):
    """Traverse all reachable nodes depth-first."""
    if visited is None:
        visited = set()
    visited.add(node)
    for neighbor in graph[node]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

def has_cycle(graph):
    """Detect cycles in a directed graph using DFS coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs_visit(node):
        color[node] = GRAY           # currently being explored
        for neighbor in graph[node]:
            if color[neighbor] == GRAY:
                return True           # back edge = cycle!
            if color[neighbor] == WHITE:
                if dfs_visit(neighbor):
                    return True
        color[node] = BLACK           # fully explored
        return False

    return any(
        dfs_visit(node)
        for node in graph
        if color[node] == WHITE
    )

The three-color trick is elegant: WHITE means unvisited, GRAY means currently in the DFS stack, BLACK means done. If you ever reach a GRAY node, you have found a back edge — a cycle.

Where DFS is used:

Deadlock detection: model lock waits as a directed graph; a cycle means deadlock
Circular dependency detection: in module imports, service dependencies, database foreign keys
Topological sorting (see below)
Connected components: finding isolated clusters in a system

4.4 Topological Sort

A topological sort of a directed acyclic graph (DAG) produces a linear ordering where for every edge A → B, A appears before B. Only possible if the graph has no cycles.

You run topological sort every time you run npm install — the package manager is solving the dependency graph to determine build order.

def topological_sort(graph):
    """Kahn's algorithm: BFS-based topological sort."""
    # Count incoming edges for each node
    in_degree = {node: 0 for node in graph}
    for node in graph:
        for neighbor in graph[node]:
            in_degree[neighbor] += 1

    # Start with nodes that have no dependencies
    queue = deque([node for node in graph if in_degree[node] == 0])
    result = []

    while queue:
        node = queue.popleft()
        result.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)

    if len(result) != len(graph):
        raise ValueError("Graph has a cycle — topological sort impossible")

    return result

Where topological sort is used:

Database migrations: Migration 003 depends on 002 depends on 001. Run them in topological order.
Build systems: make, Bazel, and Gradle use topological sort to determine build order
Package managers: npm, pip, cargo resolve dependency trees with topological sort
Task schedulers: Airflow DAGs, CI/CD pipelines with step dependencies
Spreadsheet recalculation: cells depend on other cells; recalculate in topological order

If len(result) != len(graph), you have a cycle. This is how npm detects circular dependencies.

4.5 Dijkstra's Algorithm

BFS finds shortest paths when all edges have equal weight. Dijkstra handles weighted edges (as long as weights are non-negative). Replace "hops" with "latency in milliseconds" and suddenly you're routing packets.

import heapq

def dijkstra(graph, start):
    """
    graph[node] = [(neighbor, weight), ...]
    Returns shortest distance from start to every reachable node.
    """
    distances = {start: 0}
    priority_queue = [(0, start)]      # (distance, node)

    while priority_queue:
        dist, node = heapq.heappop(priority_queue)

        if dist > distances.get(node, float('inf')):
            continue  # we already found a shorter path

        for neighbor, weight in graph[node]:
            new_dist = dist + weight
            if new_dist < distances.get(neighbor, float('inf')):
                distances[neighbor] = new_dist
                heapq.heappush(priority_queue, (new_dist, neighbor))

    return distances

Where Dijkstra is used:

Network routing: OSPF (Open Shortest Path First) protocol uses Dijkstra to compute shortest paths between routers — this runs inside every enterprise network
Cost optimization: finding the cheapest cloud region-to-region data transfer path
Load balancing: routing requests through the lowest-latency path
Game servers: geographic routing to minimize player latency

4.6 Cycle Detection

Beyond the DFS coloring approach shown above, cycle detection is critical in:

Deadlock detection in databases: Postgres maintains a wait-for graph. When a transaction waits for a lock held by another, an edge is added. A cycle means deadlock — Postgres aborts one of the transactions. The next time you see ERROR: deadlock detected, Postgres ran a cycle detection algorithm.
Circular dependency prevention: module systems, microservice dependency validation
Reference counting garbage collection: cycles of objects pointing to each other will never reach refcount 0. This is why Python uses a separate cycle-detecting GC alongside reference counting.

5. Probabilistic Data Structures

Sometimes you do not need an exact answer. Probabilistic data structures trade a small, bounded error rate for massive savings in memory and computation. They are some of the most elegant engineering tools in existence — the kind of thing that makes you say "that's clever" when you first understand how they work.

The trade-off is always the same: you sacrifice exactness to gain scale. A 0.81% error rate is worth it when it's the difference between 12KB and 12GB.

5.1 Bloom Filters

A Bloom filter is like a bouncer with a bad memory — they'll never let the wrong person through, but they might wave someone in who shouldn't be there. More precisely:

"No" — definitely not in the set (100% certain)
"Yes" — probably in the set (might be a false positive)

No false negatives, but possible false positives. This asymmetry is what makes Bloom filters useful: you can always trust a "no."

How it works:

Initialize: m-bit array, all zeros
            k hash functions (h1, h2, ..., hk)

Insert("hello"):
    h1("hello") % m = 3   → set bit 3
    h2("hello") % m = 7   → set bit 7
    h3("hello") % m = 11  → set bit 11

    Bit array: [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]

Check("hello"):
    Check bits 3, 7, 11 → all set → "probably yes" ✓

Check("world"):
    h1("world") % m = 3   → set ✓
    h2("world") % m = 5   → NOT set ✗
    → "definitely no" (if any bit is 0, element was never inserted)

Check("xyz"):
    h1("xyz") % m = 3     → set ✓
    h2("xyz") % m = 7     → set ✓
    h3("xyz") % m = 11    → set ✓
    → "probably yes" — but this is a FALSE POSITIVE
       (all bits happen to be set by other elements)

Tuning: For a target false positive rate p with n elements:

Optimal number of bits: m = -n * ln(p) / (ln 2)^2
Optimal number of hash functions: k = (m/n) * ln 2
For p = 1%, n = 1M: m = 9.6M bits (1.2MB), k = 7

Compare that to storing 1M 128-bit hashes exactly: 16MB. The Bloom filter gives you 99% accuracy in 7.5% of the space.

import mmh3  # MurmurHash3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def add(self, item):
        for i in range(self.num_hashes):
            idx = mmh3.hash(item, i) % self.size
            self.bit_array[idx] = 1

    def might_contain(self, item):
        return all(
            self.bit_array[mmh3.hash(item, i) % self.size]
            for i in range(self.num_hashes)
        )

Where Bloom filters are used:

System	Use Case
Cassandra	Checks Bloom filter before reading an SSTable — avoids disk I/O for SSTables that definitely do not contain the key
Google Chrome	Checks URLs against a Bloom filter of known malicious URLs (a few MB instead of a multi-GB database)
Akamai CDN	Avoids caching one-hit-wonder URLs — only cache if the URL has been seen before (Bloom filter check)
Medium	Avoids recommending articles a user has already read
Bitcoin	SPV (Simplified Payment Verification) nodes use Bloom filters to request relevant transactions without downloading the full blockchain

The Cassandra case connects directly to LSM trees (Section 3.4): every SSTable has a Bloom filter. Before doing any disk I/O to search an SSTable, Cassandra asks the filter "is this key definitely not here?" If yes, skip the file entirely. This is how Cassandra makes reads tolerable despite having to check multiple levels.

5.2 Count-Min Sketch

Estimates the frequency of elements in a stream using bounded memory. Where a hash map would grow unboundedly as you track more unique keys, a Count-Min Sketch uses a fixed-size grid of counters.

Structure: d hash functions, each mapping to a row of w counters

Insert("cat"):
    h1("cat") % w = 2  → row1[2] += 1
    h2("cat") % w = 5  → row2[5] += 1
    h3("cat") % w = 1  → row3[1] += 1

Query frequency("cat"):
    return min(row1[2], row2[5], row3[1])
    (take the minimum to reduce over-counting from collisions)

         0   1   2   3   4   5   6
row 1: [ 0   0   3   1   0   0   0 ]  ← h1("cat")=2
row 2: [ 1   0   0   0   0   2   0 ]  ← h2("cat")=5
row 3: [ 0   4   0   0   1   0   0 ]  ← h3("cat")=1

min(3, 2, 4) = 2  (true count might be 2; never underestimates)

The minimum is the key insight — collisions can only inflate counts, never deflate them. By taking the minimum across all rows, you get the tightest upper bound.

Where Count-Min Sketch is used:

Heavy hitters detection: finding the most frequent API endpoints, most active users, hottest cache keys — "which 1% of users are generating 50% of our traffic?"
Network monitoring: detecting DDoS by finding source IPs with abnormally high request counts
Database query optimization: approximate frequency statistics for query planning

5.3 HyperLogLog

Estimates the number of distinct elements (cardinality) in a dataset. This is the algorithm powering SELECT COUNT(DISTINCT user_id) in systems that care about performance.

The core insight: if you hash elements uniformly and count the maximum number of leading zeros in any hash, that correlates with the log2 of the number of distinct elements.

Intuition:
- Hash every element to a uniform random binary string
- If you see a hash starting with 0...   → probably >= 2 distinct elements
- If you see a hash starting with 00...  → probably >= 4 distinct elements
- If you see a hash starting with 000... → probably >= 8 distinct elements
- If you see a hash starting with k zeros → probably >= 2^k distinct elements

Problem: High variance from a single estimate
Solution: Split into m buckets (registers), estimate per bucket, take harmonic mean

HyperLogLog with m=16384 (2^14) registers:
- Each register: 6 bits (max leading zeros in 64-bit hash)
- Total memory: 16384 * 6 bits = 12KB
- Error rate: 1.04 / sqrt(m) ≈ 0.81%

12KB. For any cardinality, any number of elements. The memory footprint doesn't change whether you've seen 1,000 or 1,000,000,000 distinct values.

Where HyperLogLog is used:

System	Use Case
Redis `PFADD` / `PFCOUNT`	Count unique visitors, unique IPs, unique events — 12KB per counter regardless of cardinality
Google BigQuery	`APPROX_COUNT_DISTINCT()` uses HyperLogLog++
Presto / Trino	`approx_distinct()` for fast analytics
Elasticsearch	Cardinality aggregation

# Redis example: count unique daily visitors
PFADD visitors:2026-03-24 "user:123"
PFADD visitors:2026-03-24 "user:456"
PFADD visitors:2026-03-24 "user:123"    # duplicate, no effect
PFCOUNT visitors:2026-03-24              # returns ~2

# Merge multiple days
PFMERGE visitors:week visitors:2026-03-18 visitors:2026-03-19 ... visitors:2026-03-24
PFCOUNT visitors:week                    # unique visitors across the entire week

The merge operation is the killer feature: you can compute weekly/monthly uniques from daily HyperLogLogs without storing all the user IDs.

5.4 Cuckoo Filters

Like Bloom filters but with two advantages:

Support deletion (Bloom filters do not — you cannot unset a bit shared by multiple elements)
Better space efficiency at low false positive rates (below ~3%)

Named after the cuckoo bird that displaces other birds' eggs from nests:

Two hash functions, two possible buckets per element.

Insert("hello"):
    bucket_a = h1("hello") = 4
    bucket_b = h2("hello") = 9

    If bucket 4 has space → store fingerprint of "hello" in bucket 4
    If bucket 9 has space → store fingerprint of "hello" in bucket 9
    If both full → kick out the existing element in bucket 4,
                   relocate it to its alternate bucket,
                   store "hello" in bucket 4
                   (may cascade, like cuckoo hashing)

Use when: you need Bloom filter functionality but also need to remove elements (e.g., a distributed blocklist that entries can be removed from). Bloom filters are static; cuckoo filters are dynamic.

6. Practical Implementations

These are the "system design building blocks" that come up repeatedly. Each one is a small system unto itself, combining multiple data structures into something that solves a real problem. When you see these in system design interviews, the interviewer wants to know that you understand the underlying mechanics — not just that the pattern exists.

6.1 LRU Cache

An LRU (Least Recently Used) cache evicts the least recently accessed entry when it reaches capacity. It requires O(1) for both get and put — and that constraint is what makes the implementation non-obvious.

The trick: combine a hash map (O(1) key lookup) with a doubly linked list (O(1) move-to-front and remove-from-tail). Neither structure alone is sufficient. Together, they're O(1) for everything.

Hash Map:  key → pointer to linked list node
Doubly Linked List:  most recent ←→ ... ←→ least recent

get(key):
    1. Look up node in hash map          O(1)
    2. Move node to front of linked list  O(1) — just relink pointers
    3. Return value

put(key, value):
    1. If key exists: update value, move to front
    2. If key doesn't exist:
       a. If at capacity: remove tail node, delete from hash map
       b. Create new node, add to front, add to hash map

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}            # key → Node
        # Sentinel nodes simplify edge cases
        self.head = Node(0, 0)     # dummy head (most recent side)
        self.tail = Node(0, 0)     # dummy tail (least recent side)
        self.head.next = self.tail
        self.tail.prev = self.head

    def get(self, key):
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._add_to_front(node)
        return node.value

    def put(self, key, value):
        if key in self.cache:
            self._remove(self.cache[key])
        node = Node(key, value)
        self._add_to_front(node)
        self.cache[key] = node
        if len(self.cache) > self.capacity:
            lru = self.tail.prev
            self._remove(lru)
            del self.cache[lru.key]

    def _add_to_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def _remove(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

Where LRU caches are used:

Database buffer pool: Postgres/MySQL keep frequently accessed disk pages in memory using an LRU-like policy (Postgres uses a clock-sweep algorithm, a cheaper approximation — the full LRU tracking overhead isn't worth it at that scale)
Web application caching: in-process caches (Guava Cache, Caffeine in Java; lru-cache in Node.js)
CPU caches: hardware LRU approximations for L1/L2/L3 cache eviction
Operating system page cache: which disk pages to keep in RAM

6.2 Rate Limiter

Token Bucket

The token bucket is the most common rate limiting algorithm. Imagine a bucket that fills with tokens at a steady rate. Each request consumes a token. If the bucket is empty, the request is denied (HTTP 429 Too Many Requests).

Bucket:
  - capacity: 10 tokens (max burst)
  - refill_rate: 2 tokens/second
  - tokens: current number of tokens
  - last_refill: timestamp of last refill

allow_request():
    now = current_time()
    elapsed = now - last_refill

    # Add tokens based on elapsed time
    tokens = min(capacity, tokens + elapsed * refill_rate)
    last_refill = now

    if tokens >= 1:
        tokens -= 1
        return True     # request allowed
    else:
        return False    # rate limited (429 Too Many Requests)

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Properties: allows bursts (up to capacity requests at once), then rate-limits to refill_rate sustained throughput. Most APIs use this because it naturally handles bursty traffic. A mobile app that wakes from sleep and fires 5 requests immediately won't be rejected as long as tokens have accumulated.

Sliding Window Counter

Provides smoother rate limiting than fixed windows, without the memory overhead of tracking every individual request timestamp.

Fixed window problem:
    Window 1 (0:00-1:00): 100 requests (limit: 100/min)
    Window 2 (1:00-2:00): 100 requests
    But: 100 requests at 0:59 + 100 requests at 1:01 = 200 requests in 2 seconds!

Sliding window counter solution:
    Current window count + (previous window count * overlap percentage)

    At time 1:15 (15 seconds into window 2):
    Previous window (0:00-1:00): 84 requests
    Current window  (1:00-2:00): 36 requests so far

    Overlap of previous window: (60 - 15) / 60 = 75%
    Estimated count: 36 + (84 * 0.75) = 36 + 63 = 99

    If limit is 100 → allow (99 < 100)

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window_start = 0

    def allow(self):
        now = time.monotonic()
        window_start = now - (now % self.window)

        # Roll over to new window if needed
        if window_start != self.curr_window_start:
            self.prev_count = self.curr_count
            self.curr_count = 0
            self.curr_window_start = window_start

        # Calculate weighted count
        elapsed_in_window = now - window_start
        weight = (self.window - elapsed_in_window) / self.window
        estimated = self.curr_count + (self.prev_count * weight)

        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False

Distributed Rate Limiting with Redis

For multi-server deployments, rate limit state must be shared. Redis is the standard solution — every application server talks to the same Redis, so rate limit state is consistent across the fleet:

# Token bucket in Redis using a Lua script for atomicity
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

-- Try to consume a token
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
    return 0  -- denied
end
"""

The Lua script runs atomically in Redis — no race conditions between multiple application servers checking and decrementing the counter. This is critical: without atomicity, two servers could both read "1 token remaining," both allow a request, and you'd end up with -1 tokens.

6.3 Consistent Hashing Ring

(Detailed implementation covered in Section 2.5 above. Here we focus on the operational aspects.)

Adding a node:

Before: Nodes A, B, C on the ring
        A handles keys in range (C, A]
        B handles keys in range (A, B]
        C handles keys in range (B, C]

Add node D between A and B:
        A handles keys in range (C, A]     ← unchanged
        D handles keys in range (A, D]     ← takes from B
        B handles keys in range (D, B]     ← smaller range now
        C handles keys in range (B, C]     ← unchanged

Only keys in range (A, D] need to migrate from B to D.
With virtual nodes, this is ~1/N of all keys (evenly distributed).

Removing a node:

Remove node D:
        All keys in (A, D] transfer to B (the next node clockwise)
        Again, only ~1/N of keys move.

Used in practice by:

Amazon DynamoDB: partition assignment across storage nodes
Apache Cassandra: token ring for data distribution
Nginx: upstream consistent hashing for sticky sessions
HAProxy: balance uri consistent for cache-friendly load balancing

6.4 URL Shortener

Two approaches to generating short URLs. This is a classic system design question, but the interesting part is the trade-off between the two approaches.

Approach 1: Base62 Encoding from Auto-Increment ID

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(num):
    """Convert integer to base62 string."""
    if num == 0:
        return ALPHABET[0]
    result = []
    while num > 0:
        result.append(ALPHABET[num % 62])
        num //= 62
    return ''.join(reversed(result))

def decode_base62(s):
    """Convert base62 string back to integer."""
    num = 0
    for char in s:
        num = num * 62 + ALPHABET.index(char)
    return num

# Database auto-increment ID → short code
# ID 1000000 → encode_base62(1000000) → "4c92"
# 6-character base62 → 62^6 = 56.8 billion possible URLs

Write path: Insert URL into database, get auto-increment ID, encode as base62, return short URL.

Read path: Decode base62 to ID, look up in database by primary key (O(1) with index), redirect.

Pros: Simple, guaranteed unique, predictable length. Cons: Sequential IDs are predictable (user can enumerate URLs), requires centralized ID generation.

Approach 2: Hash-Based with Collision Handling

import hashlib

def shorten(long_url):
    """Generate short code from URL hash."""
    hash_hex = hashlib.sha256(long_url.encode()).hexdigest()
    short_code = encode_base62(int(hash_hex[:12], 16))[:7]  # take first 7 chars

    # Check for collision
    existing = db.get(short_code)
    if existing and existing.long_url != long_url:
        # Collision! Append a counter and retry
        for i in range(1, 100):
            candidate = encode_base62(int(hash_hex[:12], 16) + i)[:7]
            if not db.get(candidate):
                short_code = candidate
                break

    db.put(short_code, long_url)
    return short_code

Pros: No centralized counter, same URL always generates same short code (idempotent). Cons: Collision handling adds complexity, requires a read-before-write.

6.5 Distributed ID Generation

In distributed systems, you cannot rely on a single database auto-increment. Multiple machines need to generate unique IDs independently, often with ordering guarantees. The naive solution — a centralized ID generator — is a single point of failure and a bottleneck. Here are three approaches that scale.

Snowflake IDs (Twitter's approach)

64-bit ID:

┌─ 1 bit ─┬─── 41 bits ────┬── 10 bits ──┬── 12 bits ──┐
│  unused  │  timestamp(ms) │  machine ID │  sequence   │
└──────────┴────────────────┴─────────────┴─────────────┘

- Timestamp: milliseconds since custom epoch
  → 2^41 ms ≈ 69 years before overflow
- Machine ID: 1024 unique workers (datacenter ID + worker ID)
- Sequence: 4096 IDs per millisecond per machine

Total capacity: 4096 * 1000 * 1024 ≈ 4 billion IDs per second cluster-wide

import time

class SnowflakeGenerator:
    EPOCH = 1609459200000  # 2021-01-01 00:00:00 UTC in ms

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.sequence = 0
        self.last_timestamp = -1

    def next_id(self):
        timestamp = int(time.time() * 1000) - self.EPOCH

        if timestamp == self.last_timestamp:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
            if self.sequence == 0:
                # Exhausted 4096 IDs this millisecond — wait for next ms
                while timestamp == self.last_timestamp:
                    timestamp = int(time.time() * 1000) - self.EPOCH
        else:
            self.sequence = 0

        self.last_timestamp = timestamp

        return (
            (timestamp << 22) |
            (self.machine_id << 12) |
            self.sequence
        )

Properties: Roughly time-sorted (not perfectly — clock skew between machines), 64-bit (fits in a bigint), no coordination needed between machines.

Used by: Twitter (original Snowflake), Discord (modified Snowflake), Instagram (similar scheme).

UUIDv7 (Timestamp-Ordered UUIDs)

128-bit UUID (standard UUID format, but timestamp-prefixed):

xxxxxxxx-xxxx-7xxx-yxxx-xxxxxxxxxxxx

First 48 bits: Unix timestamp in milliseconds
Next 4 bits:   version (7)
Next 12 bits:  random
Next 2 bits:   variant (RFC 4122)
Next 62 bits:  random

Example: 018e0a3c-5b00-7123-a456-789012345678
         ^^^^^^^^^^^^
         timestamp portion → IDs are naturally sorted by creation time

Advantages over UUIDv4:

Time-sortable: B-tree indexes stay sequential (no random page splits)
Standard format: works with any system that accepts UUIDs
No coordination: pure random component ensures uniqueness without a central authority

UUIDv4's problem with databases: Random UUIDs cause random B-tree inserts, leading to poor cache utilization, excessive page splits, and index fragmentation. UUIDv7 fixes this by being monotonically increasing. If you've ever seen a Postgres table with UUID primary keys perform worse than expected — this is why.

ULID (Universally Unique Lexicographically Sortable Identifier)

128-bit, encoded as 26-character Crockford's Base32:

01ARZ3NDEKTSV4RRFFQ69G5FAV

┌──── 48 bits ────┬──── 80 bits ────┐
│  timestamp(ms)  │    randomness   │
└─────────────────┴─────────────────┘

- Timestamp: milliseconds since Unix epoch
- Randomness: 80 bits of cryptographic randomness
- Lexicographically sortable (string comparison = time comparison)
- Case-insensitive, no special characters (URL-safe)

Comparison

Property	Snowflake	UUIDv7	ULID	UUIDv4
Size	64 bits	128 bits	128 bits	128 bits
Sortable by time	Yes	Yes	Yes	No
Coordination needed	Machine ID assignment	No	No	No
DB index friendly	Yes	Yes	Yes	No (random)
Standard format	Custom	UUID	Custom (26 chars)	UUID
IDs per ms per node	4096	Unlimited (random)	Unlimited (random)	N/A
Uniqueness guarantee	Machine ID + sequence	Probabilistic	Probabilistic	Probabilistic

Guidance: Use Snowflake (or a variant) when you need compact 64-bit IDs and can manage machine ID assignment. Use UUIDv7 when you need standard UUID compatibility. Use ULID when you want string-sortable IDs without UUID format constraints. Avoid UUIDv4 as a primary key in B-tree indexed databases — every insert is a random B-tree page access.

7. Sorting & Searching in the Real World

You will almost never implement a sorting algorithm from scratch. But you must understand them because they determine how your database executes queries, how your files are organized, and where your performance bottlenecks are. When Postgres says Sort Method: external merge Disk: 145MB, you need to know what that means and how to fix it.

7.1 The Sorting Algorithms That Matter

QuickSort

1. Pick a pivot element
2. Partition: move elements < pivot to left, > pivot to right
3. Recursively sort left and right partitions

Average: O(n log n)   Worst: O(n²) — when pivot is always min/max
Space:   O(log n) stack depth
Stable:  No (equal elements may change relative order)

Why it is the default: Excellent cache locality (operates on contiguous memory), low constant factors, in-place (no extra array allocation). Most standard library sort() functions use QuickSort or a variant.

The worst case: If you always pick the first element as pivot and the array is already sorted, you get O(n^2). Modern implementations use median-of-three or random pivots to avoid this. IntroSort (used by C++ std::sort) switches to HeapSort if recursion depth exceeds 2*log(n), guaranteeing O(n log n) worst case.

MergeSort

1. Split array in half
2. Recursively sort each half
3. Merge the two sorted halves

Always:  O(n log n)   (no worst case degradation)
Space:   O(n) — needs a temporary array for merging
Stable:  Yes (equal elements maintain relative order)

Why it matters:

Stable sorting: when you ORDER BY created_at, name, stability ensures that rows with the same created_at maintain their name ordering from the second sort pass
External sorting: when data does not fit in memory, MergeSort is the only practical algorithm (see below)
Linked lists: MergeSort is optimal for linked lists (no random access needed, O(1) extra space)

TimSort

1. Divide array into "runs" (already-sorted subsequences)
2. Extend short runs using insertion sort (to minimum run length, usually 32-64)
3. Merge runs using a modified merge sort with galloping mode

Always:  O(n log n) worst case
Best:    O(n) when data is already sorted or nearly sorted
Space:   O(n)
Stable:  Yes

Why it is used by Python and Java: Real-world data is often partially sorted (log entries, time-series data, nearly-sorted lists after a small update). TimSort exploits existing order and achieves near-O(n) performance on such inputs. If you're sorting events by timestamp and they're mostly in order already, TimSort will be dramatically faster than pure QuickSort.

7.2 Binary Search

Binary search is the most useful algorithm you will actually write (or debug) in production code. It applies anywhere you have sorted data and need to find a boundary. Once you see it as "finding the boundary," you'll recognize it everywhere.

Standard Binary Search

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2      # avoid integer overflow
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # not found

Finding Boundaries (Lower Bound / Upper Bound)

More useful than exact search in practice. "Find the first element >= X" or "find the last element <= X." This is what database indexes actually do.

def lower_bound(arr, target):
    """Find the index of the first element >= target."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def upper_bound(arr, target):
    """Find the index of the first element > target."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo

Where boundary search is used:

Database range queries: B-tree index finds the lower bound of the range, then scans forward through the leaf linked list
Time-series data: "find the first event after timestamp T"
Rate limiting sliding windows: "find all requests in the last 60 seconds"
Version ranges: "find the latest version <= 2.5"

Binary Search on Answer Space

This is the most powerful and under-appreciated application. Instead of searching in an array, you binary search on the possible answer range. The key insight: if you can ask "is X too small?" as a yes/no question, you can binary search for the minimum X that works.

def min_capacity_to_ship(weights, days):
    """
    Given packages with given weights and a deadline of D days,
    find the minimum ship capacity to deliver all packages on time.
    (This is a capacity planning problem.)
    """
    def can_ship_in_time(capacity):
        current_load = 0
        days_needed = 1
        for w in weights:
            if current_load + w > capacity:
                days_needed += 1
                current_load = 0
            current_load += w
        return days_needed <= days

    lo = max(weights)          # must carry at least the heaviest package
    hi = sum(weights)          # worst case: ship everything in one day
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if can_ship_in_time(mid):
            hi = mid           # try smaller capacity
        else:
            lo = mid + 1       # need more capacity
    return lo

Real applications of binary search on answer space:

Capacity planning: "What is the minimum number of servers to handle this traffic?" (binary search on server count, simulate load for each)
SLA optimization: "What is the lowest latency budget per service that still meets our end-to-end SLA?"
Batch size tuning: "What is the largest batch size that still processes within the timeout?"
Resource allocation: "What is the minimum memory allocation that avoids OOM for this workload?"

7.3 External Sorting

When your dataset does not fit in memory (sorting a 100GB file on a machine with 8GB RAM), you use external merge sort. This is what your database does when you don't give it enough memory for a sort operation:

Phase 1: Create sorted runs
    1. Read 8GB chunk from disk into memory
    2. Sort in memory (quicksort)
    3. Write sorted chunk ("run") to disk
    4. Repeat until all data is processed
    → Result: ~13 sorted runs of 8GB each

Phase 2: K-way merge
    1. Open all 13 runs simultaneously
    2. Read a small buffer (e.g., 100MB) from each run
    3. Use a min-heap (priority queue) to find the smallest element
       across all runs
    4. Output the smallest, refill buffer when empty
    5. Continue until all runs are exhausted

    Heap size: 13 elements (one per run) — fits easily in memory
    I/O pattern: sequential reads from each run — disk-friendly

import heapq

def external_sort(input_file, output_file, memory_limit):
    # Phase 1: Create sorted runs
    runs = []
    chunk = []
    chunk_size = 0

    for record in read_records(input_file):
        chunk.append(record)
        chunk_size += record.size
        if chunk_size >= memory_limit:
            chunk.sort(key=lambda r: r.sort_key)
            run_file = write_run(chunk)
            runs.append(run_file)
            chunk = []
            chunk_size = 0

    if chunk:  # remaining records
        chunk.sort(key=lambda r: r.sort_key)
        runs.append(write_run(chunk))

    # Phase 2: K-way merge
    readers = [open_run_reader(run) for run in runs]
    heap = []
    for i, reader in enumerate(readers):
        record = next(reader, None)
        if record:
            heapq.heappush(heap, (record.sort_key, i, record))

    with open(output_file, 'w') as out:
        while heap:
            _, run_idx, record = heapq.heappop(heap)
            out.write(record)
            next_record = next(readers[run_idx], None)
            if next_record:
                heapq.heappush(heap, (next_record.sort_key, run_idx, next_record))

Where external sorting is used:

Database ORDER BY on large result sets: when the sort buffer (work_mem in Postgres, sort_buffer_size in MySQL) is too small, the database switches to external sort
MapReduce shuffle phase: sorting intermediate key-value pairs before the reduce step
Log aggregation: merging sorted log files from multiple servers into a single sorted stream
sort command: Unix sort uses external merge sort for large files

7.4 How Database Sorting Works

When you write SELECT * FROM orders ORDER BY created_at, the database has three strategies, and understanding which one it chooses tells you a lot about your query's performance:

1. Index-ordered scan (best case):

If there's a B+ tree index on created_at:
→ Just scan the leaf nodes in order (they're already sorted via the linked list)
→ No sorting needed!
→ Cost: O(n) sequential I/O

EXPLAIN: "Index Scan using idx_orders_created_at"

2. In-memory sort (small result set):

If the result fits in work_mem (default 4MB in Postgres):
→ Load all rows into memory
→ QuickSort
→ Return

EXPLAIN: "Sort  Sort Method: quicksort  Memory: 3412kB"

3. External merge sort (large result set):

If the result exceeds work_mem:
→ External merge sort using temp files on disk
→ Much slower due to disk I/O

EXPLAIN: "Sort  Sort Method: external merge  Disk: 145MB"

Fix: Increase work_mem (per-query) or add an index
SET work_mem = '256MB';

This is why "add an index" is the answer to most performance questions. An index transforms a sort operation from O(n log n) with potential disk I/O into a simple O(n) sequential scan. The sort already happened at insert time — you're just reading it.

8. Search Engineering

Search is one of the most common features in production applications, yet most engineers treat it as a black box you configure by copying Stack Overflow answers. Understanding how search engines work — from text analysis to ranking to scaling — is what separates engineers who can build search that actually works from engineers who cargo-cult Elasticsearch configs and wonder why results are bad.

8.1 How Search Engines Work

The search pipeline has six stages, and understanding each one is the key to debugging why search results are bad:

Document Ingestion → Analysis → Indexing → Query Processing → Ranking → Results

1. INGESTION:    Receive documents (API call, database sync, web crawl)
2. ANALYSIS:     Break text into tokens, normalize them (lowercase, stem, remove stop words)
3. INDEXING:     Build an inverted index: term → list of documents containing that term
4. QUERY:        Parse the user's query through the same analysis pipeline
5. RANKING:      Score each matching document by relevance (TF-IDF, BM25, vector similarity)
6. RESULTS:      Return top-K documents, optionally with highlights and facets

The inverted index is the core data structure of search. It maps every analyzed term to the documents (and positions) where that term appears. The name "inverted" means you've flipped the relationship: instead of "document → words it contains," you have "word → documents that contain it."

Inverted Index:

"database"   → [doc1:pos3, doc2:pos7, doc5:pos1, doc12:pos22]
"optimize"   → [doc1:pos4, doc3:pos1, doc5:pos8]
"postgresql" → [doc2:pos1, doc5:pos3, doc12:pos5]
"index"      → [doc1:pos9, doc2:pos12, doc3:pos6, doc5:pos2, doc7:pos1]

Query: "database optimization"
→ Analyze: ["database", "optimiz"] (stemmed)
→ Look up both terms in inverted index
→ Find documents that match both (intersection): doc1, doc5
→ Score and rank them

This is why search is fast: instead of scanning every document for your query (O(total_words)), you do a dictionary lookup per query term and intersect posting lists (O(matches)).

Index segments, merging, and immutability (Lucene architecture):

Lucene (the library behind Elasticsearch and Solr) uses an append-only segment architecture — notice the similarity to LSM trees from Section 3.4:

Write path (similar to LSM-Trees):
1. New documents go into an in-memory buffer
2. Buffer is periodically flushed to an immutable segment on disk
3. Each segment is a self-contained inverted index
4. Segments are periodically merged into larger segments (compaction)

Search path:
1. Query runs against ALL segments in parallel
2. Results are merged across segments
3. Deleted documents are excluded via a deletion bitmap

Why immutable segments?
- No locking needed for reads (concurrent search is trivial)
- Filesystem cache is very effective (segments don't change)
- Write-ahead log protects against crashes
- Trade-off: deletes/updates require marking old doc as deleted + reindexing

The LSM-tree pattern — write sequentially, compact asynchronously — appears everywhere. Search engines independently arrived at the same architecture as write-optimized databases.

8.2 Text Analysis Pipeline

Text analysis is where most search quality problems live. If your analyzer is wrong, no amount of ranking tuning will fix bad results. Garbage in, garbage out — except the garbage is invisible because everything still appears to work.

Tokenization splits text into individual terms:

Input: "New York City's best coffee shops (2024 edition)"

Standard tokenizer:  ["new", "york", "city's", "best", "coffee", "shops", "2024", "edition"]
Whitespace tokenizer: ["New", "York", "City's", "best", "coffee", "shops", "(2024", "edition)"]
Keyword tokenizer:   ["New York City's best coffee shops (2024 edition)"]  ← entire string as one token

The choice matters enormously:
- Standard: good default for full-text search
- Whitespace: when you want to preserve case and punctuation
- Keyword: for exact-match fields (email addresses, product SKUs, status codes)

The full analysis chain: Analyzer = Character Filters → Tokenizer → Token Filters

Input text: "The <b>Quick</b> Brown FOX jumped over 2 lazy dogs"

Character filter:  Strip HTML     → "The Quick Brown FOX jumped over 2 lazy dogs"
Tokenizer:         Standard       → ["The", "Quick", "Brown", "FOX", "jumped", "over", "2", "lazy", "dogs"]
Token filter 1:    Lowercase      → ["the", "quick", "brown", "fox", "jumped", "over", "2", "lazy", "dogs"]
Token filter 2:    Stop words     → ["quick", "brown", "fox", "jumped", "lazy", "dogs"]
Token filter 3:    Stemming       → ["quick", "brown", "fox", "jump", "lazi", "dog"]

Final indexed tokens: ["quick", "brown", "fox", "jump", "lazi", "dog"]

Stemming vs Lemmatization:

Stemming (algorithmic, fast, sometimes aggressive):
  "running"    → "run"
  "runs"       → "run"
  "better"     → "better"  (stemmer doesn't know this)
  "studies"    → "studi"   (overstemming — "studio" also becomes "studi")
  "universal"  → "univers"
  "university" → "univers" (conflation — different meanings, same stem)

Lemmatization (dictionary-based, slower, more accurate):
  "running"    → "run"
  "runs"       → "run"
  "better"     → "good"    (understands irregular forms)
  "studies"    → "study"   (correct base form)
  "universal"  → "universal"
  "university" → "university"  (no conflation)

Rule of thumb: Use stemming for most search use cases (recall is more important).
Use lemmatization when precision matters (medical, legal, scientific search).

Stop words — common words like "the", "is", "at", "which":

When to REMOVE stop words:
- General full-text search (they add noise without meaning)
- Index size matters (stop words can be 25-30% of all tokens)

When to KEEP stop words:
- Phrase queries: "to be or not to be" — every word matters
- Song/movie titles: "The Who", "Let It Be", "It"
- Technical content: "C++" tokenized without "+" is just "c"

Synonyms expand queries to match equivalent terms:

Synonym mappings:
  "couch" ↔ "sofa"
  "laptop" ↔ "notebook computer"
  "NYC" → "New York City"  (one-way: NYC matches New York City, but not vice versa)

Apply at index time, query time, or both:
- Index time: larger index, but faster queries. Cannot change synonyms without reindex.
- Query time: smaller index, flexible. But multi-word synonyms are tricky.
- Best practice: query-time synonyms for most cases (flexibility > speed).

N-grams and edge n-grams for autocomplete:

Edge n-grams (index time) — for "search-as-you-type":

Input: "elasticsearch"
min_gram: 2, max_gram: 10

Tokens: ["el", "ela", "elas", "elast", "elasti", "elastic", "elastics", "elasticse", "elasticsea"]

When user types "elas" → exact match on the "elas" token → instant results

N-grams (sliding window) — for fuzzy substring matching:

Input: "hello"
n: 3 (trigrams)

Tokens: ["hel", "ell", "llo"]

Useful for partial matching and language-agnostic search (CJK languages, compound words).

8.3 Ranking Algorithms

Ranking determines result quality. A search engine that finds documents but ranks them badly is useless — worse than useless, because it makes people distrust the search. The ranking algorithm is where you compete.

TF-IDF (Term Frequency - Inverse Document Frequency):

TF(term, doc) = count of term in document / total terms in document
IDF(term) = log(total documents / documents containing term)
Score = TF × IDF

Example with 10,000 documents:
  Query: "database optimization"

  Document A (1000 words): "database" appears 5 times, "optimization" appears 3 times
  Document B (100 words): "database" appears 5 times, "optimization" appears 3 times

  "database" appears in 2000 docs:     IDF = log(10000/2000) = 1.61
  "optimization" appears in 200 docs:  IDF = log(10000/200) = 3.91

  Score A = (5/1000 × 1.61) + (3/1000 × 3.91) = 0.008 + 0.012 = 0.020
  Score B = (5/100 × 1.61)  + (3/100 × 3.91)  = 0.081 + 0.117 = 0.198

  Document B scores higher — same term counts but shorter document, so higher term density.
  "optimization" contributes more because it's rarer across the corpus (higher IDF).

BM25 (Okapi BM25) — the current standard:

BM25 improves on TF-IDF with two critical refinements that eliminate its worst behaviors:

BM25(query, doc) = Σ IDF(term) × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × docLength/avgDocLength))

Where:
  k1 = 1.2 (default) — controls term frequency saturation
  b  = 0.75 (default) — controls document length normalization

Key improvements over TF-IDF:

1. Term frequency saturation:
   TF-IDF:  5 occurrences scores 5x higher than 1 occurrence
   BM25:    5 occurrences scores maybe 2.5x higher than 1 occurrence (diminishing returns)

   This prevents keyword-stuffed documents from dominating results.

2. Document length normalization:
   b = 0.75 means long documents are penalized, but not severely.
   b = 0: no length normalization (long documents treated same as short)
   b = 1: full length normalization (strongly penalize long documents)

BM25 is the default ranking algorithm in Elasticsearch, Solr, and most modern search engines.

Vector search / semantic search:

Traditional (lexical) search:
  Query: "car maintenance tips"
  Misses: "automobile service guide"  (no matching terms!)

Vector (semantic) search:
  1. Encode query into a vector:    "car maintenance tips"     → [0.23, -0.45, 0.12, ...]
  2. Encode document into a vector:  "automobile service guide" → [0.21, -0.43, 0.15, ...]
  3. Compute cosine similarity:      0.94 (very similar!)

How it works:
- Use an embedding model (OpenAI ada-002, Cohere embed, sentence-transformers)
- Documents and queries are projected into the same high-dimensional vector space
- Similar meanings cluster together, regardless of exact wording
- Find nearest neighbors using approximate nearest neighbor (ANN) algorithms:
  HNSW (Hierarchical Navigable Small World) — most common, used by Elasticsearch, pgvector
  IVF (Inverted File Index) — faster indexing, slightly less accurate

Trade-offs:
- Catches semantic meaning that lexical search misses
- But loses exact keyword matching ("error code XJ-4521" won't work well)
- Embedding model quality matters enormously
- Vectors are expensive to compute and store (768-1536 dimensions × 4 bytes each)

Hybrid search — current best practice:

Combine BM25 (lexical) + vector (semantic) for the best of both worlds:

Strategy 1: Reciprocal Rank Fusion (RRF)
  BM25 results:    [docA:rank1, docB:rank2, docC:rank3, ...]
  Vector results:  [docC:rank1, docA:rank2, docD:rank3, ...]

  RRF_score(doc) = Σ 1/(k + rank_in_list)  where k = 60 (constant)

  docA: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325  ← top result
  docC: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
  docB: 1/(60+2) + 0         = 0.0161

Strategy 2: Weighted linear combination
  final_score = α × BM25_score + (1-α) × vector_score
  Tune α based on your use case (start with 0.5, adjust based on relevance testing)

Hybrid search is the current industry standard for production search systems.

Learning to Rank (LTR):

Use machine learning to optimize ranking based on user behavior:

1. Collect features per query-document pair:
   - BM25 score, vector similarity, document freshness, popularity, click count
   - Query-specific: exact title match, category match, price range match

2. Collect training labels from user behavior:
   - Click-through rate, dwell time, add-to-cart, purchase
   - Explicit relevance judgments from human raters

3. Train a model (LambdaMART, neural LTR) to predict relevance
4. Use the model to re-rank results at query time

Used by: Google, Amazon, Netflix, Airbnb — any search with enough traffic to generate training data.
LTR is overkill for most applications. Start with BM25, add vectors if needed, then consider LTR.

8.4 Elasticsearch / OpenSearch Deep Dive

Elasticsearch is the most widely-deployed search engine. Understanding its architecture is essential for operating it without surprises.

Architecture:

Cluster → Nodes → Indices → Shards → Segments

Cluster: "production-search" (group of nodes)
  │
  ├── Node 1 (master-eligible, data)
  │     ├── Index: "products"
  │     │     ├── Shard 0 (primary)
  │     │     │     ├── Segment 0 (immutable, contains 50K docs)
  │     │     │     ├── Segment 1 (immutable, contains 30K docs)
  │     │     │     └── Segment 2 (immutable, contains 10K docs, recently flushed)
  │     │     └── Shard 2 (primary)
  │     └── Index: "orders"
  │           └── Shard 1 (replica)
  │
  ├── Node 2 (data)
  │     ├── Index: "products"
  │     │     ├── Shard 1 (primary)
  │     │     └── Shard 0 (replica)
  │     └── Index: "orders"
  │           └── Shard 0 (primary)
  │
  └── Node 3 (coordinating only — routes queries, merges results)

Node types:
- Master:       Manages cluster state (index creation, shard allocation). 3 dedicated masters minimum.
- Data:         Stores data, executes searches and aggregations. Scale horizontally.
- Ingest:       Preprocesses documents (transform, enrich) before indexing.
- Coordinating: Routes requests, merges results. Useful for heavy aggregation workloads.

Index mappings — defining your schema:

PUT /products
{
  "mappings": {
    "properties": {
      "title":       { "type": "text", "analyzer": "english" },
      "title_exact": { "type": "keyword" },
      "description": { "type": "text" },
      "price":       { "type": "float" },
      "category":    { "type": "keyword" },
      "tags":        { "type": "keyword" },
      "created_at":  { "type": "date" },
      "location":    { "type": "geo_point" },
      "ratings":     { "type": "nested",
        "properties": {
          "user_id": { "type": "keyword" },
          "score":   { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

text vs keyword — the most important distinction:

"text" field (analyzed):
  Input: "Quick Brown Fox"
  Stored as tokens: ["quick", "brown", "fox"]
  Supports: full-text search (match query), relevance scoring
  Does NOT support: exact match, sorting, aggregations (use .keyword sub-field)

"keyword" field (not analyzed):
  Input: "Quick Brown Fox"
  Stored as: "Quick Brown Fox" (exact string)
  Supports: exact match (term query), sorting, aggregations, filtering
  Does NOT support: full-text search

Common pattern — use both:
  "title": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword" }   ← multi-field mapping
    }
  }
  Search: match on "title" (full-text)
  Filter/sort/aggregate: term on "title.keyword" (exact)

Query DSL essentials:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "database optimization" } }
      ],
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "created_at": { "gte": "2024-01-01" } } }
      ],
      "should": [
        { "match": { "tags": "postgresql" } }
      ],
      "must_not": [
        { "term": { "archived": true } }
      ]
    }
  }
}

bool query breakdown:
  must:     Documents MUST match. Contributes to relevance score.
  filter:   Documents MUST match. Does NOT contribute to score (faster, cacheable).
  should:   Documents SHOULD match. Boosts score if they do. (Optional match.)
  must_not: Documents MUST NOT match. Does not affect score.

Rule of thumb:
  - Use "must" for the main search terms (what the user typed)
  - Use "filter" for structured constraints (status, date range, category)
  - Use "should" for boosting signals (prefer recent docs, prefer certain categories)
  - Use "must_not" for exclusions (hide archived, deleted, blocked content)

The filter vs must distinction is a performance lever. Filter clauses are cached at the node level and can be reused across queries. must clauses compute scores and can't be cached. Put your date ranges, status filters, and category filters in filter — not must.

Query types — when to use which:

match:         Full-text search. Analyzes the query, finds docs with matching terms.
               { "match": { "title": "database optimization" } }
               Matches: "Optimization techniques for databases"

term:          Exact match. No analysis. Use for keyword fields only.
               { "term": { "status": "published" } }
               Does NOT match "Published" (case-sensitive, no analysis)

match_phrase:  All terms must appear in order, adjacent.
               { "match_phrase": { "title": "New York" } }
               Matches: "Visit New York City"
               Does NOT match: "York is New"

multi_match:   Search across multiple fields.
               { "multi_match": {
                   "query": "database optimization",
                   "fields": ["title^3", "description", "tags^2"],
                   "type": "best_fields"
               }}
               title matches score 3x, tags 2x, description 1x

Aggregations — analytics on search results:

{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": { "field": "category", "size": 20 }
    },
    "price_ranges": {
      "histogram": { "field": "price", "interval": 50 }
    },
    "avg_rating": {
      "avg": { "field": "rating" }
    },
    "monthly_sales": {
      "date_histogram": {
        "field": "sold_at",
        "calendar_interval": "month"
      }
    }
  }
}

Aggregations power: faceted navigation (sidebar filters with counts), dashboards, analytics, and monitoring. They run on the same query scope, so filtering narrows both results and aggregation counts simultaneously.

Fuzzy matching for typo tolerance:

{
  "query": {
    "match": {
      "title": {
        "query": "databse optimizaton",
        "fuzziness": "AUTO"
      }
    }
  }
}

Fuzziness = Levenshtein edit distance (insertions, deletions, substitutions):
  "databse"     → "database"     (1 edit: insert 'a')
  "optimizaton" → "optimization" (1 edit: insert 'i')

AUTO fuzziness (recommended):
  0-2 characters:  exact match only
  3-5 characters:  1 edit allowed
  6+ characters:   2 edits allowed

Autocomplete — three approaches:

1. Completion Suggester (FST-based, fastest):
   - Uses a Finite State Transducer (compact, in-memory)
   - Prefix matching only
   - Best for: search box suggestions with known terms

2. Edge n-grams (flexible, good for search-as-you-type):
   - Index "elasticsearch" as ["el", "ela", "elas", "elast", ...]
   - Regular match query works for autocomplete
   - Best for: full-text autocomplete on document content

3. search_as_you_type field type (built-in, simplest):
   - Automatically creates edge n-gram and shingle sub-fields
   - Just use multi_match with type: "bool_prefix"
   - Best for: quick setup, good-enough autocomplete

   { "match_bool_prefix": { "title.search_as_you_type": "data optim" } }

Highlighting — show matched terms in context:

{
  "query": { "match": { "description": "database optimization" } },
  "highlight": {
    "fields": {
      "description": {
        "pre_tags": ["<mark>"],
        "post_tags": ["</mark>"],
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}

Returns: "...techniques for <mark>database</mark> <mark>optimization</mark> in production systems..."

Faceted search — the filter sidebar:

Faceted search combines a query with aggregations to show users what filters are available and how many results each filter would produce:

User searches "laptop":

Results: 2,345 laptops

Facets (from aggregations):
  Brand:     Apple (543)  |  Dell (421)  |  Lenovo (389)  |  HP (312)  | ...
  Price:     $0-500 (234) |  $500-1000 (876) | $1000-2000 (901) | $2000+ (334)
  RAM:       8GB (567)    |  16GB (1,102) |  32GB (543)  |  64GB (133)
  Rating:    ★★★★+ (1,203) | ★★★+ (1,890) | ★★+ (2,100)

User clicks "Apple" → filter applied → results narrow to 543 → facet counts update

8.5 Performance & Scaling

Shard sizing:

Rules of thumb:
- Target 10-50GB per shard (sweet spot for most workloads)
- Avoid too many small shards (each shard has overhead: memory, file descriptors, threads)
- Avoid too-large shards (slow recovery, long GC pauses)
- Number of shards = expected_data_size / target_shard_size

Example:
  Expected data: 500GB
  Target shard size: 25GB
  Primary shards: 20
  Replicas: 1 (so 40 total shards across the cluster)

Over-sharding is the #1 operational mistake. 1000 tiny 100MB shards
is far worse than 10 shards of 10GB each.

Index Lifecycle Management (ILM):

Manage time-series data (logs, metrics, events) across hardware tiers:

Hot phase:    (fast SSDs, lots of RAM)
  → Actively indexed and searched
  → Recent data (last 7 days)

Warm phase:   (cheaper SSDs, less RAM)
  → Read-only, still searchable
  → Force-merge to 1 segment (fewer file handles, faster search)
  → Older data (7-30 days)

Cold phase:   (HDDs or object storage)
  → Rarely searched, slow is acceptable
  → Searchable snapshots (data in S3, cached locally)
  → Historical data (30-90 days)

Delete phase:
  → Data past retention period is deleted automatically

This is how teams manage petabytes of log data without going bankrupt on storage.

Caching:

Elasticsearch caches at three levels:

1. Filter cache (node query cache):
   - Caches the results of filter clauses (term, range in filter context)
   - Bitmap of matching doc IDs — extremely fast for repeated filters
   - Why filters should use "filter" context, not "must"

2. Request cache:
   - Caches the entire response for a search request
   - Only for size:0 requests (aggregations, count)
   - Invalidated when the index is refreshed

3. Fielddata / doc values:
   - Column-oriented data for sorting and aggregations
   - Doc values are built at index time (disk-based, memory-mapped)
   - Fielddata is built at query time (heap-heavy, avoid for text fields)

Performance tuning checklist:

□ Use filter context for non-scoring queries (cacheable, faster)
□ Increase refresh_interval from 1s to 30s for write-heavy indices
□ Bulk index with _bulk API (not individual document puts)
□ Denormalize data — don't model like a relational DB (avoid parent-child, minimize nested)
□ Use routing to colocate related documents on the same shard
□ Monitor slow query log: index.search.slowlog.threshold.query.warn: 10s
□ Force-merge read-only indices to 1 segment
□ Use doc values instead of fielddata for sorting/aggregations
□ Right-size your shards (10-50GB each)

Reindexing strategies:

Zero-downtime reindex with alias swap:

1. Current state:
   Alias "products" → Index "products-v1"

2. Create new index with updated mappings:
   PUT products-v2 { new mappings, new analyzers }

3. Reindex data:
   POST _reindex { "source": { "index": "products-v1" }, "dest": { "index": "products-v2" } }

4. Swap alias atomically:
   POST _aliases {
     "actions": [
       { "remove": { "index": "products-v1", "alias": "products" } },
       { "add":    { "index": "products-v2", "alias": "products" } }
     ]
   }

5. Application never knows the index changed — it queries the "products" alias.
6. Delete products-v1 when confident.

Always use aliases. Never point application code at concrete index names.

8.6 Alternatives to Elasticsearch

Tool	Best For	Architecture
Elasticsearch/OpenSearch	Full-featured search + analytics	Distributed, JVM-based, scales to petabytes
Meilisearch	Instant search, product catalogs	Single binary, Rust, sub-50ms responses, simple config
Typesense	Similar to Meilisearch, typo-tolerant	C++, simple to operate, built-in clustering
PostgreSQL FTS	Good-enough search within Postgres	`tsvector` + GIN index, no extra infrastructure
pgvector + tsvector	Hybrid search in Postgres	Combine full-text + vector search, no extra infra
Algolia	Managed, instant search	SaaS, great DX, expensive at scale

Decision framework:

"Do I need a separate search engine?"

< 100K documents + simple search → PostgreSQL full-text search (tsvector + GIN)
  - No extra infrastructure
  - Transactionally consistent with your data
  - Supports ranking, highlighting, fuzzy matching
  - Example:
    SELECT title, ts_rank(search_vector, query) AS rank
    FROM products, plainto_tsquery('english', 'database optimization') query
    WHERE search_vector @@ query
    ORDER BY rank DESC LIMIT 20;

100K-10M documents + fast autocomplete → Meilisearch or Typesense
  - Drop-in search engine with great defaults
  - Sub-50ms response times, typo tolerance, faceting built-in
  - Much simpler to operate than Elasticsearch

10M+ documents OR complex analytics → Elasticsearch / OpenSearch
  - Full-featured: aggregations, geo search, nested objects, custom analyzers
  - Scales horizontally to petabytes
  - Higher operational complexity (JVM tuning, shard management, cluster ops)

Any scale + semantic search needed → Add vector search
  - pgvector if already on Postgres
  - Elasticsearch kNN if already on ES
  - Pinecone / Weaviate / Qdrant for dedicated vector DB

Start simple. Adding Elasticsearch to a system with 10K documents is operational overhead for zero benefit.

8.7 Search UX Patterns

Good search engineering is wasted without good search UX. These patterns are battle-tested:

Autocomplete / typeahead:

Implementation checklist:
□ Debounce input: 200-300ms (don't fire on every keystroke)
□ Minimum query length: 2-3 characters (don't search on "a")
□ Show suggestions after 2+ characters typed
□ Keyboard navigation (arrow keys + Enter)
□ Highlight the matching portion in suggestions
□ Cancel in-flight requests when user types more (AbortController)
□ Cache recent queries client-side

const controller = new AbortController();
const response = await fetch(`/api/search?q=${query}`, {
  signal: controller.signal
});
// On new keystroke: controller.abort() → cancels previous request

"Did you mean?" suggestions:

Three approaches:
1. Fuzzy matching: find terms within edit distance 1-2 of the query
2. Phonetic matching: Metaphone/Soundex ("Jon" → "John")
3. Popularity-weighted: suggest the most common similar query from search logs

Best practice: combine all three.
If a query returns 0 results but a similar query returns 1000+, show "Did you mean: [similar query]?"

Faceted navigation:

The filter sidebar with counts. Rules:

1. Show facets relevant to current results (not all possible values)
2. Update counts dynamically as filters are applied
3. Use OR within a facet, AND between facets:
   Category: [Laptops OR Tablets] AND Brand: [Apple] AND Price: [$500-$1000]
4. Show count of results each filter value would produce
5. Don't show filter values with 0 results (or show them grayed out)
6. Allow removing individual filters (breadcrumbs / pills)

Zero-results page:

When search returns nothing, don't show a blank page:
1. Check for typos and suggest corrections
2. Show results for a relaxed query (drop least important terms)
3. Suggest popular/trending searches
4. Show category browse options
5. Log zero-result queries — they reveal gaps in your content or analysis pipeline

Search analytics — measure and improve:

Track these metrics:
1. Query volume:          what people search for (query distribution)
2. Zero-result rate:      % of queries with no results (target: < 5%)
3. Click-through rate:    % of searches where user clicks a result
4. Click position:        which result position gets clicked (lower is better)
5. Refinement rate:       % of searches followed by another search (user didn't find what they wanted)
6. Time to first click:   how long users spend before clicking
7. Search exit rate:      % of users who leave after searching (worst case)

Zero-result queries are a gold mine: they tell you exactly what your search is failing at.
Analyze them weekly and fix the top offenders (add synonyms, fix analysis, add content).

Summary: When to Use What

Problem	Data Structure / Algorithm	Why
Key-value lookup	Hash map	O(1) average lookup
Ordered data, range queries	B+ Tree (database index)	O(log n) lookup + sequential range scan
Write-heavy ingestion	LSM Tree (Cassandra, RocksDB)	O(1) amortized writes, sequential I/O
Set membership at scale	Bloom filter	O(1) check, tiny memory footprint
Count distinct at scale	HyperLogLog	12KB for billions of elements
Frequency estimation in streams	Count-Min Sketch	Bounded memory, no false negatives on count
In-memory cache with eviction	LRU Cache	O(1) get/put with bounded memory
Rate limiting	Token bucket / sliding window	Smooth rate enforcement with burst support
Distribute data across nodes	Consistent hashing	Minimal redistribution on node add/remove
Find boundary in sorted data	Binary search	O(log n), universally applicable
Sort data larger than memory	External merge sort	Disk-friendly, O(n log n) I/O operations
Task/build ordering	Topological sort	Respects dependencies, detects cycles
Shortest path (unweighted)	BFS	O(V + E), guarantees shortest
Shortest path (weighted)	Dijkstra	O((V + E) log V) with priority queue
Cycle/deadlock detection	DFS with coloring	O(V + E), finds back edges
Autocomplete, prefix matching	Trie / Radix tree	O(k) lookup where k = key length
Ordered set with concurrency	Skip list	O(log n), simpler concurrent access than trees
Distributed unique IDs	Snowflake / UUIDv7 / ULID	Time-sorted, no coordination, index-friendly
Full-text search	Inverted index (Elasticsearch, Solr)	O(query_terms) lookup, sub-second on millions of docs
Lexical relevance ranking	BM25	Industry standard, handles term saturation and doc length
Semantic / similarity search	Vector search (HNSW, IVF)	Catches meaning, not just keywords
Search within Postgres	tsvector + GIN index	No extra infra, good enough for < 100K docs
Typo-tolerant instant search	Meilisearch / Typesense	Simple ops, sub-50ms, great defaults

The structures in this chapter are not interview trivia. They are the building blocks of every database, cache, load balancer, and distributed system you work with every day. The B+ tree in Postgres's indexes, the skip list in Redis sorted sets, the Bloom filter in Cassandra's LSM read path, the consistent hashing ring in DynamoDB's partition layer, the HyperLogLog behind PFCOUNT — you interact with these data structures constantly, whether you know it or not. Understanding them means you can read a Postgres EXPLAIN plan and know what is happening, choose the right database for your workload, debug performance problems from first principles, and design systems that scale. That is what separates engineers who understand their tools from engineers who cargo-cult configurations.

Try It Yourself

Want to put this into practice? The TicketPulse course has hands-on modules that build on these concepts:

L1-M07: Indexing & Query Performance — Apply B-tree and hash index knowledge to make TicketPulse's ticket queries 10x faster
L2-M40: Search Engineering — Build full-text search for TicketPulse events using inverted indexes and relevance scoring
L3-M65: Consistent Hashing & Distributed Cache — Implement consistent hashing to distribute TicketPulse's cache layer across nodes without thundering-herd on reshards

Quick Exercises

Implement an LRU cache in your language of choice using only a hash map and a doubly-linked list — no library calls — and verify it evicts the correct entry.
Profile a slow function in your codebase: add timing instrumentation, measure it under realistic load, then write down its Big-O complexity for both time and space.
Find where your codebase uses a bloom filter, B-tree, or hash map — hint: your database uses all three internally. Check EXPLAIN output, your cache client docs, and any "exists" check in your application layer.

FilesExpand file tree

22-algorithms-data-structures.md

Latest commit

History