Part I — Foundations | Prerequisites: None | Difficulty: Intermediate → Advanced
Not LeetCode grinding. This is the set of data structures and algorithms you will actually reach for in production — the ones hiding inside Redis, Postgres, Cassandra, and Elasticsearch, doing the heavy lifting you've been taking for granted. Once you see them, you can't unsee them.
- Complexity Analysis That Matters
- Hash-Based Structures
- Tree Structures in Databases & Systems
- Graph Algorithms in Infrastructure
- Probabilistic Data Structures
- Practical Implementations
- Sorting & Searching in the Real World
- Search Engineering
- Ch 1 (consistent hashing, distributed systems)
- Ch 24 (database internals — B-trees, LSM trees in depth)
- Ch 6 (lock-free data structures)
Here is the truth about Big-O: it is not an academic exercise designed to torment junior engineers before coding interviews. It is how you predict whether your system will survive at 10x the current traffic. The difference between O(n) and O(n^2) is the difference between a 200ms response and a 20-minute timeout. When your on-call pager goes off at 3am and the logs show a query that used to take 80ms now taking 45 seconds, Big-O is how you diagnose the problem in five minutes instead of three hours.
Big-O describes how an operation's cost grows as the input size (n) increases. Here is every complexity class you will actually encounter in backend work:
| Complexity | Name | Real-World Example |
|---|---|---|
| O(1) | Constant | Hash map lookup, array index access, checking if a bit is set |
| O(log n) | Logarithmic | Binary search, B-tree index lookup in Postgres, finding an element in a balanced BST |
| O(n) | Linear | Scanning every row in a table without an index, iterating a linked list, grep through a file |
| O(n log n) | Linearithmic | Sorting query results (ORDER BY without an index), building a B-tree index from scratch |
| O(n^2) | Quadratic | Nested loop join without indexes, comparing every pair of items (naive deduplication) |
To put these in perspective with concrete numbers:
n = 1,000,000 (one million rows)
O(1) → 1 operation
O(log n) → 20 operations
O(n) → 1,000,000 operations
O(n log n) → 20,000,000 operations
O(n²) → 1,000,000,000,000 operations ← this kills your server
The jump from O(n) to O(n^2) is where systems fall over. A query that scans 1M rows in 100ms will take 100,000 seconds with an O(n^2) algorithm. This is the number one reason to add database indexes. Every nested loop over a large dataset is a landmine waiting to detonate at scale.
Some operations are expensive occasionally but cheap on average. Amortized analysis accounts for this — and it matters more than most engineers realize.
Dynamic arrays (ArrayList, Go slices, Python lists) are the canonical example:
Append operation:
- Usually O(1): write to the next empty slot
- Occasionally O(n): array is full, allocate 2x the space, copy everything over
But if the array doubles each time:
- After n appends, total work = n + n/2 + n/4 + n/8 + ... ≈ 2n
- Average cost per append = 2n / n = O(1) amortized
This matters in practice. When someone says "appending to a slice is O(1)," they mean amortized O(1). If you are building a system that cannot tolerate occasional latency spikes — say, a real-time trading system where a 10ms hiccup triggers a margin call — you care about the worst case, not the amortized case. Pre-allocate your arrays.
Hash map resize follows the same pattern. When the load factor exceeds a threshold (typically 0.75), the map doubles its bucket count and rehashes every key. Individual inserts are O(1) amortized but occasionally O(n). In Go, if you know upfront how big a map will get, make(map[K]V, expectedSize) avoids those resize pauses entirely.
Almost every design decision in backend engineering is a space-time trade-off. You are constantly negotiating between how much memory you spend and how fast things run.
Trading space for time (caching):
- Database query cache: store result sets to avoid recomputation
- CDN: replicate content closer to users (more storage, less latency)
- Denormalization: store redundant data to avoid JOINs
- Materialized views: precompute expensive aggregations
Trading accuracy for space (probabilistic structures):
- Bloom filters: 1% false positive rate in 1MB vs exact set membership in 100MB
- HyperLogLog: count distinct elements with 0.81% error using 12KB vs exact count requiring O(n) memory
- Count-Min Sketch: approximate frequency counts in bounded memory
Trading time for space (compression):
- Gzip responses: spend CPU cycles to reduce bandwidth
- Column-oriented storage: compress similar values together
- Delta encoding: store differences instead of absolute values
The probabilistic structures in particular are underused by most engineers. You'll meet them properly in Section 5 — they're one of the most satisfying parts of this chapter.
Big-O hides constant factors, but constants dominate at practical scales. This is the most commonly misunderstood nuance about complexity analysis.
Algorithm A: O(n) with constant factor 1000 → 1000n operations
Algorithm B: O(n²) with constant factor 1 → n² operations
Crossover point: 1000n = n² → n = 1000
For n < 1000: Algorithm B is faster
For n > 1000: Algorithm A is faster
Real examples where constants matter:
- Insertion sort vs quicksort for small arrays: Insertion sort is O(n^2) but has tiny constants (no recursion, good cache locality). Most standard library sort implementations switch to insertion sort for arrays under 10-20 elements. That's not a bug — it's intentional.
- Hash map vs linear scan for small collections: For fewer than ~10 elements, linear scan through an array beats a hash map because hash computation, memory indirection, and cache misses cost more than scanning a handful of elements.
- B-tree fan-out: A B-tree with node size matching a disk page (4KB) can store ~500 keys per node. A binary search tree stores 1 key per node. Same O(log n) lookup, but the B-tree does log_500(n) disk reads vs log_2(n). For 1 billion keys, that is 3 disk reads vs 30. The algorithm is the same; the constant is everything. (More on this in Section 3.3 and Ch 24.)
For backend work, these are the operations you evaluate data structures against:
| Operation | Why It Matters |
|---|---|
| Insert | How fast can you write new data? (API writes, log ingestion, event processing) |
| Lookup (point query) | Find one record by key. The bread and butter of web applications. |
| Range scan | Find all records in a range. Time-series queries, pagination, analytics. |
| Delete | Remove data. Harder than it sounds — tombstones, compaction, fragmentation. |
| Iterate (full scan) | Process every element. Batch jobs, migrations, analytics. |
Different data structures optimize for different operations. There is no structure that is best at everything — if there were, you would only need one:
Insert Lookup Range Scan Delete Iterate
Hash Map O(1)* O(1)* O(n) O(1)* O(n)
B-Tree O(log n) O(log n) O(log n + k) O(log n) O(n)
LSM-Tree O(1)* O(log n) O(log n + k) O(1)* O(n)
Skip List O(log n) O(log n) O(log n + k) O(log n) O(n)
Sorted Array O(n) O(log n) O(log n + k) O(n) O(n)
* = amortized
k = number of elements in the range
This table is the cheat sheet for every "which database should I use?" conversation. Hash maps win on point lookups. B-trees win on range scans. LSM trees win on writes. You choose based on your workload.
Hash maps are the most important data structure in backend engineering. If you understand nothing else in this chapter, understand hash maps. They are in Redis (literally every data type uses hashing under the hood), in your programming language's standard library, in Kafka's partition assignment, and in the lookup table your CPU uses to find memory pages. They are everywhere, and they are almost always the right first choice.
Think of a hash map as a perfect filing cabinet. You walk up, say the key, and a very fast clerk computes exactly which drawer to open. No scanning, no searching — just compute and retrieve.
put(key, value):
1. Compute hash = hash_function(key) # e.g., "user:123" → 0xA3F2...
2. Compute index = hash % num_buckets # e.g., 0xA3F2... % 16 = 7
3. Store (key, value) in bucket[7]
get(key):
1. Compute hash = hash_function(key)
2. Compute index = hash % num_buckets
3. Look in bucket[index] for matching key
4. Return value (or "not found")
The hash function must be:
- Deterministic: same key always produces the same hash
- Uniform: keys should spread evenly across buckets
- Fast: hashing should be cheaper than the alternative (linear scan)
Common hash functions in practice:
- MurmurHash3: fast, good distribution, used by many hash map implementations
- SipHash: DoS-resistant (Python dicts, Rust HashMaps use it to prevent hash-flooding attacks)
- xxHash: extremely fast, used for checksums and non-cryptographic hashing
- CityHash/FarmHash: Google's fast hash families, optimized for short strings
With a good hash function and reasonable load factor, most buckets contain 0 or 1 entries. The hash computation is O(1) (fixed work regardless of map size), and jumping to a bucket index is O(1) (array index access). Therefore, the expected lookup time is O(1).
But this is an average. In the worst case, every key hashes to the same bucket, and lookup degenerates to O(n) — scanning a linked list of all entries. This is why hash function quality matters, and why language runtimes use randomized hash seeds to prevent attackers from crafting collision-heavy inputs (hash-flooding DoS attacks). Python's dict randomizes its seed at startup for exactly this reason.
Load factor = number of entries / number of buckets. When the load factor exceeds a threshold (typically 0.75 in Java, 6.5 average per bucket in Go), the map resizes — doubles the bucket count and rehashes every key. This is the O(n) operation that makes the amortized cost still O(1).
When two keys hash to the same bucket, you have a collision. Two strategies for dealing with it, and the right choice depends on your hardware:
Chaining (separate chaining):
bucket[7] → (key1, val1) → (key2, val2) → (key3, val3)
Each bucket is a linked list (or in Java 8+, a red-black tree if
the chain exceeds 8 elements).
Pros: Simple, load factor can exceed 1.0, deletion is easy
Cons: Pointer chasing (cache-unfriendly), extra memory for pointers
Used by: Java HashMap, Go map (with overflow buckets)
Open addressing:
Collision at bucket[7]?
- Linear probing: try 7, 8, 9, 10, ...
- Quadratic probing: try 7, 7+1², 7+2², 7+3², ...
- Robin Hood hashing: linear probing, but swap entries to minimize
max probe distance (more fair distribution)
Pros: Cache-friendly (data in contiguous memory), no pointer overhead
Cons: Clustering (long probe sequences), deletion requires tombstones,
load factor must stay below ~0.7
Used by: Python dict (open addressing with perturbation),
Rust HashMap (Robin Hood → SwissTable),
Google's Swiss Tables (SIMD-accelerated probing)
In practice: Modern high-performance hash maps use open addressing with SIMD instructions to probe multiple slots simultaneously (Google's SwissTable, adopted by Rust and Abseil C++). The cache-friendliness of open addressing wins on modern hardware because cache misses are far more expensive than a few extra comparisons.
A hash set is simply a hash map where you only care about the keys (values are ignored or boolean). All the same internals apply. You already use these constantly, probably without thinking about them.
Use cases you encounter constantly:
- Deduplication:
seen = set()to skip already-processed records - Membership testing: is this user ID in the blocklist?
- Set operations: intersection (common friends), union (merge results), difference (what changed?)
# Deduplication during event processing
processed_ids = set()
for event in event_stream:
if event.id in processed_ids: # O(1) lookup
continue # skip duplicate
process(event)
processed_ids.add(event.id) # O(1) insertThe alternative — a sorted list with binary search — gives O(log n) membership tests. For a million events, that is the difference between 1 operation and 20. Hash sets are the right tool for this job almost every time.
Standard hashing (hash(key) % N) falls apart when you add or remove servers. If N changes from 4 to 5, nearly every key maps to a different server — cache miss storm, mass data migration. This is the problem that brought down caches across entire tech companies before consistent hashing became standard.
Consistent hashing fixes this by mapping both keys and servers onto a ring (0 to 2^32 - 1):
The Ring (0 to 2³²-1):
ServerA(hash=1000)
/
------*-----------
/ \
| key1(hash=800) |
| ↓ |
| maps to ServerA |
| (next server |
| clockwise) |
\ /
------*-----------
\
ServerB(hash=5000)
Rule: A key maps to the first server found going clockwise from
the key's hash position on the ring.
Virtual nodes solve the problem of uneven distribution. Instead of placing each server once on the ring, place it 100-200 times (with different hash values). This smooths out the distribution so no single server holds a disproportionate share of the keyspace:
class ConsistentHashRing:
def __init__(self, nodes=None, num_virtual=150):
self.ring = {} # hash_value → node
self.sorted_keys = [] # sorted hash values for binary search
self.num_virtual = num_virtual
if nodes:
for node in nodes:
self.add_node(node)
def add_node(self, node):
for i in range(self.num_virtual):
virtual_key = f"{node}:vn{i}"
hash_val = self._hash(virtual_key)
self.ring[hash_val] = node
self.sorted_keys.append(hash_val)
self.sorted_keys.sort()
def remove_node(self, node):
for i in range(self.num_virtual):
virtual_key = f"{node}:vn{i}"
hash_val = self._hash(virtual_key)
del self.ring[hash_val]
self.sorted_keys.remove(hash_val)
def get_node(self, key):
if not self.ring:
return None
hash_val = self._hash(key)
# Find first server clockwise (binary search for next largest hash)
idx = bisect.bisect_right(self.sorted_keys, hash_val)
if idx == len(self.sorted_keys):
idx = 0 # wrap around the ring
return self.ring[self.sorted_keys[idx]]
def _hash(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)Why it minimizes redistribution: When a server is added, it takes ownership of a portion of the ring from its clockwise neighbor. Only keys in that arc are reassigned — roughly 1/N of all keys, not all of them. When a server is removed, its keys move to the next clockwise server.
Used by: Amazon DynamoDB (partition assignment), Apache Cassandra (token ring), Memcached client-side sharding, Nginx upstream consistent hashing, Akka Cluster Sharding. When you're configuring a Cassandra cluster and wondering how it decides which node owns which rows — this is it.
Deduplication at scale:
Problem: Process 1 billion events, skip duplicates
Solution 1: Hash set in memory (requires ~16GB for 1B 128-bit hashes)
Solution 2: Bloom filter (1% false positive rate in ~1.2GB)
Solution 3: Hash to partition, deduplicate per partition (distributed)
Counting distinct elements:
Problem: How many unique users visited in the last 24 hours?
Exact: Store every user ID in a set → O(n) memory
Approximate: HyperLogLog → 12KB memory, 0.81% error (Redis PFCOUNT)
Partitioning (how Kafka and DynamoDB assign data to shards):
partition = hash(key) % num_partitions
Kafka: Messages with the same key always go to the same partition
(guarantees ordering per key)
DynamoDB: hash(partition_key) determines which storage node holds the item
Every time you set a partition key in Kafka, you are doing distributed hash routing. The same algorithm, at scale, across a cluster.
Trees are how databases organize data on disk. If you understand B-trees and LSM-trees, you understand how virtually every database works under the hood. And because Ch 24 covers database internals in depth, this chapter will give you the mental model — the "why" — and Ch 24 will go deep on implementation details, page formats, and tuning.
A BST maintains the invariant: left child < parent < right child. This gives O(log n) lookup, insert, and delete — but only if the tree is balanced.
Balanced BST (height = 3): Degenerate BST (height = 5):
4 1
/ \ \
2 6 2
/ \ / \ \
1 3 5 7 3
\
4
\
5
Lookup: O(log n) = O(3) Lookup: O(n) = O(5)
An unbalanced BST is just a linked list with extra steps. Insert sorted data into a naive BST and you get O(n) everything — which is why databases would never use a plain BST. This is why self-balancing trees exist.
Both are self-balancing BSTs that guarantee O(log n) height through rotation operations on insert and delete. They are the workhorses of in-memory ordered data structures.
Red-Black Trees:
- Every node is red or black
- Root is black, leaves (NIL) are black
- Red nodes cannot have red children
- Every path from root to leaf has the same number of black nodes
- Guarantees: height <= 2 * log2(n+1)
- Slightly less balanced than AVL but fewer rotations on insert/delete
AVL Trees:
- Heights of left and right subtrees differ by at most 1
- Stricter balancing = faster lookups, but more rotations on modification
- Better for read-heavy workloads
Where you encounter them:
java.util.TreeMap,java.util.TreeSet— Red-Black Treestd::map,std::setin C++ — typically Red-Black Tree- Linux kernel
CFS(Completely Fair Scheduler) — Red-Black Tree for process scheduling by virtual runtime - In-memory ordered indexes in databases
- Redis sorted sets' backing tree (alongside the skip list — they keep both)
You almost never implement these yourself. You use them through standard library ordered maps/sets when you need sorted iteration, range queries, or finding the min/max efficiently. But understanding them helps you understand why TreeMap is slower than HashMap on point lookups.
B-Trees are the most important data structure in databases. Every time you create an index in Postgres or MySQL, you are building a B-Tree (technically, a B+ Tree). That CREATE INDEX statement you run in your migration? This is what happens.
Why not a binary tree? Disk I/O. A binary tree with 1 billion entries has height log2(10^9) = 30. That is 30 disk reads per lookup. A B-Tree with a branching factor of 500 (typical for 4KB pages) has height log500(10^9) = 3. Three disk reads to find any record among a billion.
The B-tree is designed around one key insight: disk reads are catastrophically slow compared to memory operations, but reading a large chunk at once is almost free. So pack as many keys as possible into each page and minimize the number of pages you read.
B-Tree node structure:
Each node (= one disk page, typically 4KB or 8KB):
┌────────────────────────────────────────────────────┐
│ key1 │ key2 │ key3 │ ... │ keyN │ │
│ ↓ ↓ ↓ ↓ ↓ ↓ │
│ ptr0 ptr1 ptr2 ptr3 ptrN ptrN+1 │
└────────────────────────────────────────────────────┘
ptr0 → all keys < key1
ptr1 → all keys >= key1 and < key2
ptr2 → all keys >= key2 and < key3
...
Lookup walks from root to leaf, doing binary search within each node to pick the right child pointer. With pages cached in the buffer pool, the root and first few levels are almost always in memory, so a lookup typically does 1-2 actual disk reads.
Insert finds the correct leaf, inserts the key. If the leaf is full, it splits into two nodes and pushes the middle key up to the parent. Splits can cascade up but are rare.
B+ Trees (what databases actually use):
B-Tree: data stored in internal nodes AND leaves
B+ Tree: data stored ONLY in leaves; internal nodes are pure index
Internal nodes: [ 10 | 20 | 30 ]
/ | | \
Leaf nodes: [1,5,8] → [10,12,15] → [20,22,28] → [30,35,40]
↑ ↑ ↑ ↑
linked list connecting all leaves
Advantages of B+ Trees:
1. Internal nodes hold more keys (no data pointers) → higher fan-out → shorter tree
2. Leaves form a linked list → range scans are sequential reads (fast!)
3. All lookups go to leaves → more predictable performance
This is why WHERE id BETWEEN 100 AND 200 is fast with an index: the database finds leaf 100 via the tree, then follows the linked list to leaf 200, reading sequential pages. No random access — just a scan.
This is why SELECT * FROM users ORDER BY created_at is fast with an index on created_at: the leaves are already in order, just scan the linked list. The sort is free.
See Ch 24 for a deep dive into how Postgres and MySQL implement B+ trees, their page formats, and how the buffer pool caches work.
B-Trees are optimized for reads (O(log n) with high fan-out). But every write to a B-Tree requires a random disk seek to find the right page. For write-heavy workloads — logging, time-series, event ingestion, IoT sensor data — this becomes a bottleneck. You're burning through I/O budget just finding the right place to write.
Log-Structured Merge Trees (LSM Trees) flip the script. Instead of writing to the right place immediately, they write everything sequentially and sort it out later:
Write path:
1. Write to WAL (Write-Ahead Log) on disk ← sequential write (fast)
2. Write to MemTable (in-memory sorted structure) ← in-memory (fast)
3. When MemTable is full → flush to SSTable on disk ← sequential write (fast)
┌─────────────┐
Writes ──→ │ MemTable │ (sorted, in-memory, typically a skip list or red-black tree)
│ (4MB-64MB) │
└──────┬──────┘
│ flush when full
▼
┌──────────────────┐
Level 0: │ SSTable │ SSTable │ (sorted, immutable files on disk)
└────────┬─────────┘
│ compaction (merge sorted files)
▼
┌───────────────────────────┐
Level 1: │ SSTable │ SSTable │ SSTable │ (10x larger than Level 0)
└────────┬──────────────────┘
│ compaction
▼
┌──────────────────────────────────────┐
Level 2: │ SSTable │ SSTable │ SSTable │ SSTable │ (10x larger)
└──────────────────────────────────────┘
Read path (slower than B-Tree):
- Check MemTable
- Check each SSTable level (newest first), using Bloom filters to skip SSTables that definitely do not contain the key
- Merge results
Compaction merges SSTables to:
- Remove deleted keys (tombstones)
- Remove older versions of updated keys
- Reduce the number of files to check on reads
Write amplification: a single write may be rewritten multiple times as it moves through compaction levels. If there are L levels and each is 10x larger, write amplification is ~10 * L. This is the price you pay for write speed.
The Bloom filter integration is elegant: before reading any SSTable, Cassandra checks a Bloom filter that tells it "this key is definitely not here" with zero disk I/O. You'll see Bloom filters again in Section 5.1.
| Property | B-Tree | LSM-Tree |
|---|---|---|
| Write throughput | Moderate (random I/O) | High (sequential I/O) |
| Read latency | Low (1-3 disk reads) | Higher (check multiple levels) |
| Range scans | Fast (B+ tree leaf list) | Fast after compaction |
| Space amplification | Low (~page fill factor ~70%) | Can be high (multiple copies during compaction) |
| Write amplification | Moderate (page rewrites) | High (compaction rewrites) |
| Predictable latency | Yes | No (compaction spikes) |
| Used by | Postgres, MySQL (InnoDB), SQL Server | Cassandra, RocksDB, LevelDB, HBase, ScyllaDB |
Rule of thumb: B-Tree for read-heavy, mixed workloads (most web apps). LSM-Tree for write-heavy workloads (logging, metrics, time-series, IoT ingestion). When your write rate is so high that Postgres is struggling, the answer is often "switch to a database built on RocksDB." Now you know why.
Ch 24 covers both structures in depth — including page-level B-tree operations, compaction strategies (leveled vs tiered vs FIFO), and how to tune them for your workload.
A trie stores strings character-by-character along tree edges. Common prefix = shared path. It is the reason autocomplete feels instant — the work is done at index time, not query time.
Storing: "cat", "car", "card", "do", "dog"
(root)
/ \
c d
| |
a o
/ \ \
t r g
|
d
Lookup "car": root → c → a → r ✓ (3 steps, regardless of how many words are stored)
Lookup "cab": root → c → a → ? ✗ (no 'b' child)
Where you encounter tries:
- Autocomplete / typeahead: traverse the trie to the prefix, then enumerate all descendants
- IP routing tables: longest-prefix match on IP addresses (trie variant called a Patricia/Radix tree) — your router is running trie lookups on every packet
- Spell checking: traverse the trie to find similar words within edit distance
- HTTP routers: many web frameworks use radix trees for route matching (Go's
httprouter) — this is how yourGET /users/:idroutes are matched
A skip list is a probabilistic alternative to balanced BSTs. It is a layered linked list where higher layers skip over elements, enabling O(log n) search — and it is dramatically simpler to implement correctly than a red-black tree.
Level 3: HEAD ─────────────────────────────────── 50 ──────────── NIL
Level 2: HEAD ────────── 20 ───────────────────── 50 ──── 70 ─── NIL
Level 1: HEAD ──── 10 ── 20 ──── 30 ──── 40 ──── 50 ──── 70 ─── NIL
Level 0: HEAD ─ 5─ 10 ── 20 ─ 25─ 30 ─ 35─ 40 ── 50 ─60─ 70 ─80 NIL
To find 35:
1. Start at Level 3 HEAD → 50 (too far) → stay at HEAD
2. Drop to Level 2 HEAD → 20 (ok) → 50 (too far) → stay at 20
3. Drop to Level 1: 20 → 30 (ok) → 40 (too far) → stay at 30
4. Drop to Level 0: 30 → 35 (found!)
Each element is promoted to the next level with probability p (typically 0.5 or 0.25). This gives O(log n) expected height and O(log n) expected search time.
Why skip lists matter in practice:
- Redis sorted sets (ZSET): Uses a skip list for the ordered index. Chosen over red-black trees because skip lists are simpler to implement, easier to reason about concurrently, and range queries are natural (just walk the bottom level). Next time you call
ZRANGEBYSCORE, you're walking a skip list. - LevelDB / RocksDB MemTable: The in-memory sorted structure before flushing to SSTables. Skip lists allow concurrent reads and writes without global locks.
- MemSQL / SingleStore: Uses lock-free skip lists for in-memory indexes.
Skip lists are easier to make concurrent than balanced BSTs because they avoid complex tree rotations — inserts only modify local pointers. When you see a database advertise "lock-free reads," there is often a skip list underneath.
Graphs are everywhere in backend systems, even when you do not think of them as graphs. Once you start seeing the world as graphs, you cannot stop:
- Service dependency graph: Service A calls Service B, which calls Service C
- Database entity-relationship diagram: Users have Orders, Orders have Items
- Network topology: Routers and switches connected by links
- Permission model: User is member of Group, Group has Role, Role grants Permission
- Task dependencies: Migration B depends on Migration A
That last one is something most engineers have actually debugged — when a migration fails because it ran in the wrong order. That's a topological sort problem.
Two ways to store a graph:
Adjacency List (most common in practice):
# Each node stores a list of its neighbors
graph = {
"service-a": ["service-b", "service-c"],
"service-b": ["service-d", "database"],
"service-c": ["service-d", "cache"],
"service-d": ["database"],
"database": [],
"cache": []
}
# Space: O(V + E) — vertices + edges
# Check if edge exists: O(degree of vertex) — scan neighbor list
# Iterate neighbors: O(degree of vertex)Adjacency Matrix:
a b c d db cache
service-a [ 0 1 1 0 0 0 ]
service-b [ 0 0 0 1 1 0 ]
service-c [ 0 0 0 1 0 1 ]
service-d [ 0 0 0 0 1 0 ]
database [ 0 0 0 0 0 0 ]
cache [ 0 0 0 0 0 0 ]
# Space: O(V²)
# Check if edge exists: O(1) — direct array lookup
# Iterate neighbors: O(V) — scan entire row
When to use which:
- Adjacency list: sparse graphs (most real systems), memory-efficient, better for traversal
- Adjacency matrix: dense graphs, or when you need O(1) edge existence checks (rare in backend work)
In practice, you almost always use adjacency lists. Most real graphs — service dependencies, social graphs, permission hierarchies — are sparse. An adjacency matrix for 1000 services would be 1000×1000 = 1M cells, most of them zero.
BFS explores all nodes at distance 1, then distance 2, then distance 3, and so on. It finds the shortest path in unweighted graphs. Think of it as ripples spreading out from a stone dropped in water.
from collections import deque
def bfs_shortest_path(graph, start, target):
"""Find shortest path (fewest hops) between two nodes."""
queue = deque([(start, [start])])
visited = {start}
while queue:
node, path = queue.popleft()
if node == target:
return path
for neighbor in graph[node]:
if neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, path + [neighbor]))
return None # no path existsWhere BFS is used:
- Social networks: "degrees of separation" between users (LinkedIn's connection degree, "You and Alice share 3 mutual connections")
- Service dependency blast radius: "if service X goes down, what is affected within 2 hops?" — BFS from X to enumerate the impact
- Network routing: finding the fewest hops between two hosts
- Garbage collection: mark-and-sweep (BFS from root objects to find all reachable objects — everything not found is garbage)
DFS explores as deep as possible along each branch before backtracking. It is simpler to implement (naturally recursive) and uses less memory than BFS for deep graphs. Where BFS thinks in rings, DFS thinks in paths.
def dfs(graph, node, visited=None):
"""Traverse all reachable nodes depth-first."""
if visited is None:
visited = set()
visited.add(node)
for neighbor in graph[node]:
if neighbor not in visited:
dfs(graph, neighbor, visited)
return visited
def has_cycle(graph):
"""Detect cycles in a directed graph using DFS coloring."""
WHITE, GRAY, BLACK = 0, 1, 2
color = {node: WHITE for node in graph}
def dfs_visit(node):
color[node] = GRAY # currently being explored
for neighbor in graph[node]:
if color[neighbor] == GRAY:
return True # back edge = cycle!
if color[neighbor] == WHITE:
if dfs_visit(neighbor):
return True
color[node] = BLACK # fully explored
return False
return any(
dfs_visit(node)
for node in graph
if color[node] == WHITE
)The three-color trick is elegant: WHITE means unvisited, GRAY means currently in the DFS stack, BLACK means done. If you ever reach a GRAY node, you have found a back edge — a cycle.
Where DFS is used:
- Deadlock detection: model lock waits as a directed graph; a cycle means deadlock
- Circular dependency detection: in module imports, service dependencies, database foreign keys
- Topological sorting (see below)
- Connected components: finding isolated clusters in a system
A topological sort of a directed acyclic graph (DAG) produces a linear ordering where for every edge A → B, A appears before B. Only possible if the graph has no cycles.
You run topological sort every time you run npm install — the package manager is solving the dependency graph to determine build order.
def topological_sort(graph):
"""Kahn's algorithm: BFS-based topological sort."""
# Count incoming edges for each node
in_degree = {node: 0 for node in graph}
for node in graph:
for neighbor in graph[node]:
in_degree[neighbor] += 1
# Start with nodes that have no dependencies
queue = deque([node for node in graph if in_degree[node] == 0])
result = []
while queue:
node = queue.popleft()
result.append(node)
for neighbor in graph[node]:
in_degree[neighbor] -= 1
if in_degree[neighbor] == 0:
queue.append(neighbor)
if len(result) != len(graph):
raise ValueError("Graph has a cycle — topological sort impossible")
return resultWhere topological sort is used:
- Database migrations: Migration 003 depends on 002 depends on 001. Run them in topological order.
- Build systems:
make, Bazel, and Gradle use topological sort to determine build order - Package managers:
npm,pip,cargoresolve dependency trees with topological sort - Task schedulers: Airflow DAGs, CI/CD pipelines with step dependencies
- Spreadsheet recalculation: cells depend on other cells; recalculate in topological order
If len(result) != len(graph), you have a cycle. This is how npm detects circular dependencies.
BFS finds shortest paths when all edges have equal weight. Dijkstra handles weighted edges (as long as weights are non-negative). Replace "hops" with "latency in milliseconds" and suddenly you're routing packets.
import heapq
def dijkstra(graph, start):
"""
graph[node] = [(neighbor, weight), ...]
Returns shortest distance from start to every reachable node.
"""
distances = {start: 0}
priority_queue = [(0, start)] # (distance, node)
while priority_queue:
dist, node = heapq.heappop(priority_queue)
if dist > distances.get(node, float('inf')):
continue # we already found a shorter path
for neighbor, weight in graph[node]:
new_dist = dist + weight
if new_dist < distances.get(neighbor, float('inf')):
distances[neighbor] = new_dist
heapq.heappush(priority_queue, (new_dist, neighbor))
return distancesWhere Dijkstra is used:
- Network routing: OSPF (Open Shortest Path First) protocol uses Dijkstra to compute shortest paths between routers — this runs inside every enterprise network
- Cost optimization: finding the cheapest cloud region-to-region data transfer path
- Load balancing: routing requests through the lowest-latency path
- Game servers: geographic routing to minimize player latency
Beyond the DFS coloring approach shown above, cycle detection is critical in:
- Deadlock detection in databases: Postgres maintains a wait-for graph. When a transaction waits for a lock held by another, an edge is added. A cycle means deadlock — Postgres aborts one of the transactions. The next time you see
ERROR: deadlock detected, Postgres ran a cycle detection algorithm. - Circular dependency prevention: module systems, microservice dependency validation
- Reference counting garbage collection: cycles of objects pointing to each other will never reach refcount 0. This is why Python uses a separate cycle-detecting GC alongside reference counting.
Sometimes you do not need an exact answer. Probabilistic data structures trade a small, bounded error rate for massive savings in memory and computation. They are some of the most elegant engineering tools in existence — the kind of thing that makes you say "that's clever" when you first understand how they work.
The trade-off is always the same: you sacrifice exactness to gain scale. A 0.81% error rate is worth it when it's the difference between 12KB and 12GB.
A Bloom filter is like a bouncer with a bad memory — they'll never let the wrong person through, but they might wave someone in who shouldn't be there. More precisely:
- "No" — definitely not in the set (100% certain)
- "Yes" — probably in the set (might be a false positive)
No false negatives, but possible false positives. This asymmetry is what makes Bloom filters useful: you can always trust a "no."
How it works:
Initialize: m-bit array, all zeros
k hash functions (h1, h2, ..., hk)
Insert("hello"):
h1("hello") % m = 3 → set bit 3
h2("hello") % m = 7 → set bit 7
h3("hello") % m = 11 → set bit 11
Bit array: [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
Check("hello"):
Check bits 3, 7, 11 → all set → "probably yes" ✓
Check("world"):
h1("world") % m = 3 → set ✓
h2("world") % m = 5 → NOT set ✗
→ "definitely no" (if any bit is 0, element was never inserted)
Check("xyz"):
h1("xyz") % m = 3 → set ✓
h2("xyz") % m = 7 → set ✓
h3("xyz") % m = 11 → set ✓
→ "probably yes" — but this is a FALSE POSITIVE
(all bits happen to be set by other elements)
Tuning: For a target false positive rate p with n elements:
- Optimal number of bits:
m = -n * ln(p) / (ln 2)^2 - Optimal number of hash functions:
k = (m/n) * ln 2 - For p = 1%, n = 1M: m = 9.6M bits (1.2MB), k = 7
Compare that to storing 1M 128-bit hashes exactly: 16MB. The Bloom filter gives you 99% accuracy in 7.5% of the space.
import mmh3 # MurmurHash3
from bitarray import bitarray
class BloomFilter:
def __init__(self, size, num_hashes):
self.size = size
self.num_hashes = num_hashes
self.bit_array = bitarray(size)
self.bit_array.setall(0)
def add(self, item):
for i in range(self.num_hashes):
idx = mmh3.hash(item, i) % self.size
self.bit_array[idx] = 1
def might_contain(self, item):
return all(
self.bit_array[mmh3.hash(item, i) % self.size]
for i in range(self.num_hashes)
)Where Bloom filters are used:
| System | Use Case |
|---|---|
| Cassandra | Checks Bloom filter before reading an SSTable — avoids disk I/O for SSTables that definitely do not contain the key |
| Google Chrome | Checks URLs against a Bloom filter of known malicious URLs (a few MB instead of a multi-GB database) |
| Akamai CDN | Avoids caching one-hit-wonder URLs — only cache if the URL has been seen before (Bloom filter check) |
| Medium | Avoids recommending articles a user has already read |
| Bitcoin | SPV (Simplified Payment Verification) nodes use Bloom filters to request relevant transactions without downloading the full blockchain |
The Cassandra case connects directly to LSM trees (Section 3.4): every SSTable has a Bloom filter. Before doing any disk I/O to search an SSTable, Cassandra asks the filter "is this key definitely not here?" If yes, skip the file entirely. This is how Cassandra makes reads tolerable despite having to check multiple levels.
Estimates the frequency of elements in a stream using bounded memory. Where a hash map would grow unboundedly as you track more unique keys, a Count-Min Sketch uses a fixed-size grid of counters.
Structure: d hash functions, each mapping to a row of w counters
Insert("cat"):
h1("cat") % w = 2 → row1[2] += 1
h2("cat") % w = 5 → row2[5] += 1
h3("cat") % w = 1 → row3[1] += 1
Query frequency("cat"):
return min(row1[2], row2[5], row3[1])
(take the minimum to reduce over-counting from collisions)
0 1 2 3 4 5 6
row 1: [ 0 0 3 1 0 0 0 ] ← h1("cat")=2
row 2: [ 1 0 0 0 0 2 0 ] ← h2("cat")=5
row 3: [ 0 4 0 0 1 0 0 ] ← h3("cat")=1
min(3, 2, 4) = 2 (true count might be 2; never underestimates)
The minimum is the key insight — collisions can only inflate counts, never deflate them. By taking the minimum across all rows, you get the tightest upper bound.
Where Count-Min Sketch is used:
- Heavy hitters detection: finding the most frequent API endpoints, most active users, hottest cache keys — "which 1% of users are generating 50% of our traffic?"
- Network monitoring: detecting DDoS by finding source IPs with abnormally high request counts
- Database query optimization: approximate frequency statistics for query planning
Estimates the number of distinct elements (cardinality) in a dataset. This is the algorithm powering SELECT COUNT(DISTINCT user_id) in systems that care about performance.
The core insight: if you hash elements uniformly and count the maximum number of leading zeros in any hash, that correlates with the log2 of the number of distinct elements.
Intuition:
- Hash every element to a uniform random binary string
- If you see a hash starting with 0... → probably >= 2 distinct elements
- If you see a hash starting with 00... → probably >= 4 distinct elements
- If you see a hash starting with 000... → probably >= 8 distinct elements
- If you see a hash starting with k zeros → probably >= 2^k distinct elements
Problem: High variance from a single estimate
Solution: Split into m buckets (registers), estimate per bucket, take harmonic mean
HyperLogLog with m=16384 (2^14) registers:
- Each register: 6 bits (max leading zeros in 64-bit hash)
- Total memory: 16384 * 6 bits = 12KB
- Error rate: 1.04 / sqrt(m) ≈ 0.81%
12KB. For any cardinality, any number of elements. The memory footprint doesn't change whether you've seen 1,000 or 1,000,000,000 distinct values.
Where HyperLogLog is used:
| System | Use Case |
|---|---|
Redis PFADD / PFCOUNT |
Count unique visitors, unique IPs, unique events — 12KB per counter regardless of cardinality |
| Google BigQuery | APPROX_COUNT_DISTINCT() uses HyperLogLog++ |
| Presto / Trino | approx_distinct() for fast analytics |
| Elasticsearch | Cardinality aggregation |
# Redis example: count unique daily visitors
PFADD visitors:2026-03-24 "user:123"
PFADD visitors:2026-03-24 "user:456"
PFADD visitors:2026-03-24 "user:123" # duplicate, no effect
PFCOUNT visitors:2026-03-24 # returns ~2
# Merge multiple days
PFMERGE visitors:week visitors:2026-03-18 visitors:2026-03-19 ... visitors:2026-03-24
PFCOUNT visitors:week # unique visitors across the entire week
The merge operation is the killer feature: you can compute weekly/monthly uniques from daily HyperLogLogs without storing all the user IDs.
Like Bloom filters but with two advantages:
- Support deletion (Bloom filters do not — you cannot unset a bit shared by multiple elements)
- Better space efficiency at low false positive rates (below ~3%)
Named after the cuckoo bird that displaces other birds' eggs from nests:
Two hash functions, two possible buckets per element.
Insert("hello"):
bucket_a = h1("hello") = 4
bucket_b = h2("hello") = 9
If bucket 4 has space → store fingerprint of "hello" in bucket 4
If bucket 9 has space → store fingerprint of "hello" in bucket 9
If both full → kick out the existing element in bucket 4,
relocate it to its alternate bucket,
store "hello" in bucket 4
(may cascade, like cuckoo hashing)
Use when: you need Bloom filter functionality but also need to remove elements (e.g., a distributed blocklist that entries can be removed from). Bloom filters are static; cuckoo filters are dynamic.
These are the "system design building blocks" that come up repeatedly. Each one is a small system unto itself, combining multiple data structures into something that solves a real problem. When you see these in system design interviews, the interviewer wants to know that you understand the underlying mechanics — not just that the pattern exists.
An LRU (Least Recently Used) cache evicts the least recently accessed entry when it reaches capacity. It requires O(1) for both get and put — and that constraint is what makes the implementation non-obvious.
The trick: combine a hash map (O(1) key lookup) with a doubly linked list (O(1) move-to-front and remove-from-tail). Neither structure alone is sufficient. Together, they're O(1) for everything.
Hash Map: key → pointer to linked list node
Doubly Linked List: most recent ←→ ... ←→ least recent
get(key):
1. Look up node in hash map O(1)
2. Move node to front of linked list O(1) — just relink pointers
3. Return value
put(key, value):
1. If key exists: update value, move to front
2. If key doesn't exist:
a. If at capacity: remove tail node, delete from hash map
b. Create new node, add to front, add to hash map
class Node:
def __init__(self, key, value):
self.key = key
self.value = value
self.prev = None
self.next = None
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = {} # key → Node
# Sentinel nodes simplify edge cases
self.head = Node(0, 0) # dummy head (most recent side)
self.tail = Node(0, 0) # dummy tail (least recent side)
self.head.next = self.tail
self.tail.prev = self.head
def get(self, key):
if key not in self.cache:
return -1
node = self.cache[key]
self._remove(node)
self._add_to_front(node)
return node.value
def put(self, key, value):
if key in self.cache:
self._remove(self.cache[key])
node = Node(key, value)
self._add_to_front(node)
self.cache[key] = node
if len(self.cache) > self.capacity:
lru = self.tail.prev
self._remove(lru)
del self.cache[lru.key]
def _add_to_front(self, node):
node.prev = self.head
node.next = self.head.next
self.head.next.prev = node
self.head.next = node
def _remove(self, node):
node.prev.next = node.next
node.next.prev = node.prevWhere LRU caches are used:
- Database buffer pool: Postgres/MySQL keep frequently accessed disk pages in memory using an LRU-like policy (Postgres uses a clock-sweep algorithm, a cheaper approximation — the full LRU tracking overhead isn't worth it at that scale)
- Web application caching: in-process caches (Guava Cache, Caffeine in Java;
lru-cachein Node.js) - CPU caches: hardware LRU approximations for L1/L2/L3 cache eviction
- Operating system page cache: which disk pages to keep in RAM
The token bucket is the most common rate limiting algorithm. Imagine a bucket that fills with tokens at a steady rate. Each request consumes a token. If the bucket is empty, the request is denied (HTTP 429 Too Many Requests).
Bucket:
- capacity: 10 tokens (max burst)
- refill_rate: 2 tokens/second
- tokens: current number of tokens
- last_refill: timestamp of last refill
allow_request():
now = current_time()
elapsed = now - last_refill
# Add tokens based on elapsed time
tokens = min(capacity, tokens + elapsed * refill_rate)
last_refill = now
if tokens >= 1:
tokens -= 1
return True # request allowed
else:
return False # rate limited (429 Too Many Requests)
import time
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.refill_rate = refill_rate # tokens per second
self.tokens = capacity
self.last_refill = time.monotonic()
def allow(self):
now = time.monotonic()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= 1:
self.tokens -= 1
return True
return FalseProperties: allows bursts (up to capacity requests at once), then rate-limits to refill_rate sustained throughput. Most APIs use this because it naturally handles bursty traffic. A mobile app that wakes from sleep and fires 5 requests immediately won't be rejected as long as tokens have accumulated.
Provides smoother rate limiting than fixed windows, without the memory overhead of tracking every individual request timestamp.
Fixed window problem:
Window 1 (0:00-1:00): 100 requests (limit: 100/min)
Window 2 (1:00-2:00): 100 requests
But: 100 requests at 0:59 + 100 requests at 1:01 = 200 requests in 2 seconds!
Sliding window counter solution:
Current window count + (previous window count * overlap percentage)
At time 1:15 (15 seconds into window 2):
Previous window (0:00-1:00): 84 requests
Current window (1:00-2:00): 36 requests so far
Overlap of previous window: (60 - 15) / 60 = 75%
Estimated count: 36 + (84 * 0.75) = 36 + 63 = 99
If limit is 100 → allow (99 < 100)
class SlidingWindowCounter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window = window_seconds
self.prev_count = 0
self.curr_count = 0
self.curr_window_start = 0
def allow(self):
now = time.monotonic()
window_start = now - (now % self.window)
# Roll over to new window if needed
if window_start != self.curr_window_start:
self.prev_count = self.curr_count
self.curr_count = 0
self.curr_window_start = window_start
# Calculate weighted count
elapsed_in_window = now - window_start
weight = (self.window - elapsed_in_window) / self.window
estimated = self.curr_count + (self.prev_count * weight)
if estimated < self.limit:
self.curr_count += 1
return True
return FalseFor multi-server deployments, rate limit state must be shared. Redis is the standard solution — every application server talks to the same Redis, so rate limit state is consistent across the fleet:
# Token bucket in Redis using a Lua script for atomicity
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
-- Try to consume a token
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return 1 -- allowed
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return 0 -- denied
end
"""The Lua script runs atomically in Redis — no race conditions between multiple application servers checking and decrementing the counter. This is critical: without atomicity, two servers could both read "1 token remaining," both allow a request, and you'd end up with -1 tokens.
(Detailed implementation covered in Section 2.5 above. Here we focus on the operational aspects.)
Adding a node:
Before: Nodes A, B, C on the ring
A handles keys in range (C, A]
B handles keys in range (A, B]
C handles keys in range (B, C]
Add node D between A and B:
A handles keys in range (C, A] ← unchanged
D handles keys in range (A, D] ← takes from B
B handles keys in range (D, B] ← smaller range now
C handles keys in range (B, C] ← unchanged
Only keys in range (A, D] need to migrate from B to D.
With virtual nodes, this is ~1/N of all keys (evenly distributed).
Removing a node:
Remove node D:
All keys in (A, D] transfer to B (the next node clockwise)
Again, only ~1/N of keys move.
Used in practice by:
- Amazon DynamoDB: partition assignment across storage nodes
- Apache Cassandra: token ring for data distribution
- Nginx:
upstreamconsistent hashing for sticky sessions - HAProxy:
balance uri consistentfor cache-friendly load balancing
Two approaches to generating short URLs. This is a classic system design question, but the interesting part is the trade-off between the two approaches.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode_base62(num):
"""Convert integer to base62 string."""
if num == 0:
return ALPHABET[0]
result = []
while num > 0:
result.append(ALPHABET[num % 62])
num //= 62
return ''.join(reversed(result))
def decode_base62(s):
"""Convert base62 string back to integer."""
num = 0
for char in s:
num = num * 62 + ALPHABET.index(char)
return num
# Database auto-increment ID → short code
# ID 1000000 → encode_base62(1000000) → "4c92"
# 6-character base62 → 62^6 = 56.8 billion possible URLsWrite path: Insert URL into database, get auto-increment ID, encode as base62, return short URL.
Read path: Decode base62 to ID, look up in database by primary key (O(1) with index), redirect.
Pros: Simple, guaranteed unique, predictable length. Cons: Sequential IDs are predictable (user can enumerate URLs), requires centralized ID generation.
import hashlib
def shorten(long_url):
"""Generate short code from URL hash."""
hash_hex = hashlib.sha256(long_url.encode()).hexdigest()
short_code = encode_base62(int(hash_hex[:12], 16))[:7] # take first 7 chars
# Check for collision
existing = db.get(short_code)
if existing and existing.long_url != long_url:
# Collision! Append a counter and retry
for i in range(1, 100):
candidate = encode_base62(int(hash_hex[:12], 16) + i)[:7]
if not db.get(candidate):
short_code = candidate
break
db.put(short_code, long_url)
return short_codePros: No centralized counter, same URL always generates same short code (idempotent). Cons: Collision handling adds complexity, requires a read-before-write.
In distributed systems, you cannot rely on a single database auto-increment. Multiple machines need to generate unique IDs independently, often with ordering guarantees. The naive solution — a centralized ID generator — is a single point of failure and a bottleneck. Here are three approaches that scale.
64-bit ID:
┌─ 1 bit ─┬─── 41 bits ────┬── 10 bits ──┬── 12 bits ──┐
│ unused │ timestamp(ms) │ machine ID │ sequence │
└──────────┴────────────────┴─────────────┴─────────────┘
- Timestamp: milliseconds since custom epoch
→ 2^41 ms ≈ 69 years before overflow
- Machine ID: 1024 unique workers (datacenter ID + worker ID)
- Sequence: 4096 IDs per millisecond per machine
Total capacity: 4096 * 1000 * 1024 ≈ 4 billion IDs per second cluster-wide
import time
class SnowflakeGenerator:
EPOCH = 1609459200000 # 2021-01-01 00:00:00 UTC in ms
def __init__(self, machine_id):
assert 0 <= machine_id < 1024
self.machine_id = machine_id
self.sequence = 0
self.last_timestamp = -1
def next_id(self):
timestamp = int(time.time() * 1000) - self.EPOCH
if timestamp == self.last_timestamp:
self.sequence = (self.sequence + 1) & 0xFFF # 12-bit wrap
if self.sequence == 0:
# Exhausted 4096 IDs this millisecond — wait for next ms
while timestamp == self.last_timestamp:
timestamp = int(time.time() * 1000) - self.EPOCH
else:
self.sequence = 0
self.last_timestamp = timestamp
return (
(timestamp << 22) |
(self.machine_id << 12) |
self.sequence
)Properties: Roughly time-sorted (not perfectly — clock skew between machines), 64-bit (fits in a bigint), no coordination needed between machines.
Used by: Twitter (original Snowflake), Discord (modified Snowflake), Instagram (similar scheme).
128-bit UUID (standard UUID format, but timestamp-prefixed):
xxxxxxxx-xxxx-7xxx-yxxx-xxxxxxxxxxxx
First 48 bits: Unix timestamp in milliseconds
Next 4 bits: version (7)
Next 12 bits: random
Next 2 bits: variant (RFC 4122)
Next 62 bits: random
Example: 018e0a3c-5b00-7123-a456-789012345678
^^^^^^^^^^^^
timestamp portion → IDs are naturally sorted by creation time
Advantages over UUIDv4:
- Time-sortable: B-tree indexes stay sequential (no random page splits)
- Standard format: works with any system that accepts UUIDs
- No coordination: pure random component ensures uniqueness without a central authority
UUIDv4's problem with databases: Random UUIDs cause random B-tree inserts, leading to poor cache utilization, excessive page splits, and index fragmentation. UUIDv7 fixes this by being monotonically increasing. If you've ever seen a Postgres table with UUID primary keys perform worse than expected — this is why.
128-bit, encoded as 26-character Crockford's Base32:
01ARZ3NDEKTSV4RRFFQ69G5FAV
┌──── 48 bits ────┬──── 80 bits ────┐
│ timestamp(ms) │ randomness │
└─────────────────┴─────────────────┘
- Timestamp: milliseconds since Unix epoch
- Randomness: 80 bits of cryptographic randomness
- Lexicographically sortable (string comparison = time comparison)
- Case-insensitive, no special characters (URL-safe)
| Property | Snowflake | UUIDv7 | ULID | UUIDv4 |
|---|---|---|---|---|
| Size | 64 bits | 128 bits | 128 bits | 128 bits |
| Sortable by time | Yes | Yes | Yes | No |
| Coordination needed | Machine ID assignment | No | No | No |
| DB index friendly | Yes | Yes | Yes | No (random) |
| Standard format | Custom | UUID | Custom (26 chars) | UUID |
| IDs per ms per node | 4096 | Unlimited (random) | Unlimited (random) | N/A |
| Uniqueness guarantee | Machine ID + sequence | Probabilistic | Probabilistic | Probabilistic |
Guidance: Use Snowflake (or a variant) when you need compact 64-bit IDs and can manage machine ID assignment. Use UUIDv7 when you need standard UUID compatibility. Use ULID when you want string-sortable IDs without UUID format constraints. Avoid UUIDv4 as a primary key in B-tree indexed databases — every insert is a random B-tree page access.
You will almost never implement a sorting algorithm from scratch. But you must understand them because they determine how your database executes queries, how your files are organized, and where your performance bottlenecks are. When Postgres says Sort Method: external merge Disk: 145MB, you need to know what that means and how to fix it.
1. Pick a pivot element
2. Partition: move elements < pivot to left, > pivot to right
3. Recursively sort left and right partitions
Average: O(n log n) Worst: O(n²) — when pivot is always min/max
Space: O(log n) stack depth
Stable: No (equal elements may change relative order)
Why it is the default: Excellent cache locality (operates on contiguous memory), low constant factors, in-place (no extra array allocation). Most standard library sort() functions use QuickSort or a variant.
The worst case: If you always pick the first element as pivot and the array is already sorted, you get O(n^2). Modern implementations use median-of-three or random pivots to avoid this. IntroSort (used by C++ std::sort) switches to HeapSort if recursion depth exceeds 2*log(n), guaranteeing O(n log n) worst case.
1. Split array in half
2. Recursively sort each half
3. Merge the two sorted halves
Always: O(n log n) (no worst case degradation)
Space: O(n) — needs a temporary array for merging
Stable: Yes (equal elements maintain relative order)
Why it matters:
- Stable sorting: when you
ORDER BY created_at, name, stability ensures that rows with the samecreated_atmaintain theirnameordering from the second sort pass - External sorting: when data does not fit in memory, MergeSort is the only practical algorithm (see below)
- Linked lists: MergeSort is optimal for linked lists (no random access needed, O(1) extra space)
1. Divide array into "runs" (already-sorted subsequences)
2. Extend short runs using insertion sort (to minimum run length, usually 32-64)
3. Merge runs using a modified merge sort with galloping mode
Always: O(n log n) worst case
Best: O(n) when data is already sorted or nearly sorted
Space: O(n)
Stable: Yes
Why it is used by Python and Java: Real-world data is often partially sorted (log entries, time-series data, nearly-sorted lists after a small update). TimSort exploits existing order and achieves near-O(n) performance on such inputs. If you're sorting events by timestamp and they're mostly in order already, TimSort will be dramatically faster than pure QuickSort.
Binary search is the most useful algorithm you will actually write (or debug) in production code. It applies anywhere you have sorted data and need to find a boundary. Once you see it as "finding the boundary," you'll recognize it everywhere.
def binary_search(arr, target):
lo, hi = 0, len(arr) - 1
while lo <= hi:
mid = lo + (hi - lo) // 2 # avoid integer overflow
if arr[mid] == target:
return mid
elif arr[mid] < target:
lo = mid + 1
else:
hi = mid - 1
return -1 # not foundMore useful than exact search in practice. "Find the first element >= X" or "find the last element <= X." This is what database indexes actually do.
def lower_bound(arr, target):
"""Find the index of the first element >= target."""
lo, hi = 0, len(arr)
while lo < hi:
mid = lo + (hi - lo) // 2
if arr[mid] < target:
lo = mid + 1
else:
hi = mid
return lo
def upper_bound(arr, target):
"""Find the index of the first element > target."""
lo, hi = 0, len(arr)
while lo < hi:
mid = lo + (hi - lo) // 2
if arr[mid] <= target:
lo = mid + 1
else:
hi = mid
return loWhere boundary search is used:
- Database range queries: B-tree index finds the lower bound of the range, then scans forward through the leaf linked list
- Time-series data: "find the first event after timestamp T"
- Rate limiting sliding windows: "find all requests in the last 60 seconds"
- Version ranges: "find the latest version <= 2.5"
This is the most powerful and under-appreciated application. Instead of searching in an array, you binary search on the possible answer range. The key insight: if you can ask "is X too small?" as a yes/no question, you can binary search for the minimum X that works.
def min_capacity_to_ship(weights, days):
"""
Given packages with given weights and a deadline of D days,
find the minimum ship capacity to deliver all packages on time.
(This is a capacity planning problem.)
"""
def can_ship_in_time(capacity):
current_load = 0
days_needed = 1
for w in weights:
if current_load + w > capacity:
days_needed += 1
current_load = 0
current_load += w
return days_needed <= days
lo = max(weights) # must carry at least the heaviest package
hi = sum(weights) # worst case: ship everything in one day
while lo < hi:
mid = lo + (hi - lo) // 2
if can_ship_in_time(mid):
hi = mid # try smaller capacity
else:
lo = mid + 1 # need more capacity
return loReal applications of binary search on answer space:
- Capacity planning: "What is the minimum number of servers to handle this traffic?" (binary search on server count, simulate load for each)
- SLA optimization: "What is the lowest latency budget per service that still meets our end-to-end SLA?"
- Batch size tuning: "What is the largest batch size that still processes within the timeout?"
- Resource allocation: "What is the minimum memory allocation that avoids OOM for this workload?"
When your dataset does not fit in memory (sorting a 100GB file on a machine with 8GB RAM), you use external merge sort. This is what your database does when you don't give it enough memory for a sort operation:
Phase 1: Create sorted runs
1. Read 8GB chunk from disk into memory
2. Sort in memory (quicksort)
3. Write sorted chunk ("run") to disk
4. Repeat until all data is processed
→ Result: ~13 sorted runs of 8GB each
Phase 2: K-way merge
1. Open all 13 runs simultaneously
2. Read a small buffer (e.g., 100MB) from each run
3. Use a min-heap (priority queue) to find the smallest element
across all runs
4. Output the smallest, refill buffer when empty
5. Continue until all runs are exhausted
Heap size: 13 elements (one per run) — fits easily in memory
I/O pattern: sequential reads from each run — disk-friendly
import heapq
def external_sort(input_file, output_file, memory_limit):
# Phase 1: Create sorted runs
runs = []
chunk = []
chunk_size = 0
for record in read_records(input_file):
chunk.append(record)
chunk_size += record.size
if chunk_size >= memory_limit:
chunk.sort(key=lambda r: r.sort_key)
run_file = write_run(chunk)
runs.append(run_file)
chunk = []
chunk_size = 0
if chunk: # remaining records
chunk.sort(key=lambda r: r.sort_key)
runs.append(write_run(chunk))
# Phase 2: K-way merge
readers = [open_run_reader(run) for run in runs]
heap = []
for i, reader in enumerate(readers):
record = next(reader, None)
if record:
heapq.heappush(heap, (record.sort_key, i, record))
with open(output_file, 'w') as out:
while heap:
_, run_idx, record = heapq.heappop(heap)
out.write(record)
next_record = next(readers[run_idx], None)
if next_record:
heapq.heappush(heap, (next_record.sort_key, run_idx, next_record))Where external sorting is used:
- Database
ORDER BYon large result sets: when the sort buffer (work_memin Postgres,sort_buffer_sizein MySQL) is too small, the database switches to external sort - MapReduce shuffle phase: sorting intermediate key-value pairs before the reduce step
- Log aggregation: merging sorted log files from multiple servers into a single sorted stream
sortcommand: Unixsortuses external merge sort for large files
When you write SELECT * FROM orders ORDER BY created_at, the database has three strategies, and understanding which one it chooses tells you a lot about your query's performance:
1. Index-ordered scan (best case):
If there's a B+ tree index on created_at:
→ Just scan the leaf nodes in order (they're already sorted via the linked list)
→ No sorting needed!
→ Cost: O(n) sequential I/O
EXPLAIN: "Index Scan using idx_orders_created_at"
2. In-memory sort (small result set):
If the result fits in work_mem (default 4MB in Postgres):
→ Load all rows into memory
→ QuickSort
→ Return
EXPLAIN: "Sort Sort Method: quicksort Memory: 3412kB"
3. External merge sort (large result set):
If the result exceeds work_mem:
→ External merge sort using temp files on disk
→ Much slower due to disk I/O
EXPLAIN: "Sort Sort Method: external merge Disk: 145MB"
Fix: Increase work_mem (per-query) or add an index
SET work_mem = '256MB';
This is why "add an index" is the answer to most performance questions. An index transforms a sort operation from O(n log n) with potential disk I/O into a simple O(n) sequential scan. The sort already happened at insert time — you're just reading it.
Search is one of the most common features in production applications, yet most engineers treat it as a black box you configure by copying Stack Overflow answers. Understanding how search engines work — from text analysis to ranking to scaling — is what separates engineers who can build search that actually works from engineers who cargo-cult Elasticsearch configs and wonder why results are bad.
The search pipeline has six stages, and understanding each one is the key to debugging why search results are bad:
Document Ingestion → Analysis → Indexing → Query Processing → Ranking → Results
1. INGESTION: Receive documents (API call, database sync, web crawl)
2. ANALYSIS: Break text into tokens, normalize them (lowercase, stem, remove stop words)
3. INDEXING: Build an inverted index: term → list of documents containing that term
4. QUERY: Parse the user's query through the same analysis pipeline
5. RANKING: Score each matching document by relevance (TF-IDF, BM25, vector similarity)
6. RESULTS: Return top-K documents, optionally with highlights and facets
The inverted index is the core data structure of search. It maps every analyzed term to the documents (and positions) where that term appears. The name "inverted" means you've flipped the relationship: instead of "document → words it contains," you have "word → documents that contain it."
Inverted Index:
"database" → [doc1:pos3, doc2:pos7, doc5:pos1, doc12:pos22]
"optimize" → [doc1:pos4, doc3:pos1, doc5:pos8]
"postgresql" → [doc2:pos1, doc5:pos3, doc12:pos5]
"index" → [doc1:pos9, doc2:pos12, doc3:pos6, doc5:pos2, doc7:pos1]
Query: "database optimization"
→ Analyze: ["database", "optimiz"] (stemmed)
→ Look up both terms in inverted index
→ Find documents that match both (intersection): doc1, doc5
→ Score and rank them
This is why search is fast: instead of scanning every document for your query (O(total_words)), you do a dictionary lookup per query term and intersect posting lists (O(matches)).
Index segments, merging, and immutability (Lucene architecture):
Lucene (the library behind Elasticsearch and Solr) uses an append-only segment architecture — notice the similarity to LSM trees from Section 3.4:
Write path (similar to LSM-Trees):
1. New documents go into an in-memory buffer
2. Buffer is periodically flushed to an immutable segment on disk
3. Each segment is a self-contained inverted index
4. Segments are periodically merged into larger segments (compaction)
Search path:
1. Query runs against ALL segments in parallel
2. Results are merged across segments
3. Deleted documents are excluded via a deletion bitmap
Why immutable segments?
- No locking needed for reads (concurrent search is trivial)
- Filesystem cache is very effective (segments don't change)
- Write-ahead log protects against crashes
- Trade-off: deletes/updates require marking old doc as deleted + reindexing
The LSM-tree pattern — write sequentially, compact asynchronously — appears everywhere. Search engines independently arrived at the same architecture as write-optimized databases.
Text analysis is where most search quality problems live. If your analyzer is wrong, no amount of ranking tuning will fix bad results. Garbage in, garbage out — except the garbage is invisible because everything still appears to work.
Tokenization splits text into individual terms:
Input: "New York City's best coffee shops (2024 edition)"
Standard tokenizer: ["new", "york", "city's", "best", "coffee", "shops", "2024", "edition"]
Whitespace tokenizer: ["New", "York", "City's", "best", "coffee", "shops", "(2024", "edition)"]
Keyword tokenizer: ["New York City's best coffee shops (2024 edition)"] ← entire string as one token
The choice matters enormously:
- Standard: good default for full-text search
- Whitespace: when you want to preserve case and punctuation
- Keyword: for exact-match fields (email addresses, product SKUs, status codes)
The full analysis chain: Analyzer = Character Filters → Tokenizer → Token Filters
Input text: "The <b>Quick</b> Brown FOX jumped over 2 lazy dogs"
Character filter: Strip HTML → "The Quick Brown FOX jumped over 2 lazy dogs"
Tokenizer: Standard → ["The", "Quick", "Brown", "FOX", "jumped", "over", "2", "lazy", "dogs"]
Token filter 1: Lowercase → ["the", "quick", "brown", "fox", "jumped", "over", "2", "lazy", "dogs"]
Token filter 2: Stop words → ["quick", "brown", "fox", "jumped", "lazy", "dogs"]
Token filter 3: Stemming → ["quick", "brown", "fox", "jump", "lazi", "dog"]
Final indexed tokens: ["quick", "brown", "fox", "jump", "lazi", "dog"]
Stemming vs Lemmatization:
Stemming (algorithmic, fast, sometimes aggressive):
"running" → "run"
"runs" → "run"
"better" → "better" (stemmer doesn't know this)
"studies" → "studi" (overstemming — "studio" also becomes "studi")
"universal" → "univers"
"university" → "univers" (conflation — different meanings, same stem)
Lemmatization (dictionary-based, slower, more accurate):
"running" → "run"
"runs" → "run"
"better" → "good" (understands irregular forms)
"studies" → "study" (correct base form)
"universal" → "universal"
"university" → "university" (no conflation)
Rule of thumb: Use stemming for most search use cases (recall is more important).
Use lemmatization when precision matters (medical, legal, scientific search).
Stop words — common words like "the", "is", "at", "which":
When to REMOVE stop words:
- General full-text search (they add noise without meaning)
- Index size matters (stop words can be 25-30% of all tokens)
When to KEEP stop words:
- Phrase queries: "to be or not to be" — every word matters
- Song/movie titles: "The Who", "Let It Be", "It"
- Technical content: "C++" tokenized without "+" is just "c"
Synonyms expand queries to match equivalent terms:
Synonym mappings:
"couch" ↔ "sofa"
"laptop" ↔ "notebook computer"
"NYC" → "New York City" (one-way: NYC matches New York City, but not vice versa)
Apply at index time, query time, or both:
- Index time: larger index, but faster queries. Cannot change synonyms without reindex.
- Query time: smaller index, flexible. But multi-word synonyms are tricky.
- Best practice: query-time synonyms for most cases (flexibility > speed).
N-grams and edge n-grams for autocomplete:
Edge n-grams (index time) — for "search-as-you-type":
Input: "elasticsearch"
min_gram: 2, max_gram: 10
Tokens: ["el", "ela", "elas", "elast", "elasti", "elastic", "elastics", "elasticse", "elasticsea"]
When user types "elas" → exact match on the "elas" token → instant results
N-grams (sliding window) — for fuzzy substring matching:
Input: "hello"
n: 3 (trigrams)
Tokens: ["hel", "ell", "llo"]
Useful for partial matching and language-agnostic search (CJK languages, compound words).
Ranking determines result quality. A search engine that finds documents but ranks them badly is useless — worse than useless, because it makes people distrust the search. The ranking algorithm is where you compete.
TF-IDF (Term Frequency - Inverse Document Frequency):
TF(term, doc) = count of term in document / total terms in document
IDF(term) = log(total documents / documents containing term)
Score = TF × IDF
Example with 10,000 documents:
Query: "database optimization"
Document A (1000 words): "database" appears 5 times, "optimization" appears 3 times
Document B (100 words): "database" appears 5 times, "optimization" appears 3 times
"database" appears in 2000 docs: IDF = log(10000/2000) = 1.61
"optimization" appears in 200 docs: IDF = log(10000/200) = 3.91
Score A = (5/1000 × 1.61) + (3/1000 × 3.91) = 0.008 + 0.012 = 0.020
Score B = (5/100 × 1.61) + (3/100 × 3.91) = 0.081 + 0.117 = 0.198
Document B scores higher — same term counts but shorter document, so higher term density.
"optimization" contributes more because it's rarer across the corpus (higher IDF).
BM25 (Okapi BM25) — the current standard:
BM25 improves on TF-IDF with two critical refinements that eliminate its worst behaviors:
BM25(query, doc) = Σ IDF(term) × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × docLength/avgDocLength))
Where:
k1 = 1.2 (default) — controls term frequency saturation
b = 0.75 (default) — controls document length normalization
Key improvements over TF-IDF:
1. Term frequency saturation:
TF-IDF: 5 occurrences scores 5x higher than 1 occurrence
BM25: 5 occurrences scores maybe 2.5x higher than 1 occurrence (diminishing returns)
This prevents keyword-stuffed documents from dominating results.
2. Document length normalization:
b = 0.75 means long documents are penalized, but not severely.
b = 0: no length normalization (long documents treated same as short)
b = 1: full length normalization (strongly penalize long documents)
BM25 is the default ranking algorithm in Elasticsearch, Solr, and most modern search engines.
Vector search / semantic search:
Traditional (lexical) search:
Query: "car maintenance tips"
Misses: "automobile service guide" (no matching terms!)
Vector (semantic) search:
1. Encode query into a vector: "car maintenance tips" → [0.23, -0.45, 0.12, ...]
2. Encode document into a vector: "automobile service guide" → [0.21, -0.43, 0.15, ...]
3. Compute cosine similarity: 0.94 (very similar!)
How it works:
- Use an embedding model (OpenAI ada-002, Cohere embed, sentence-transformers)
- Documents and queries are projected into the same high-dimensional vector space
- Similar meanings cluster together, regardless of exact wording
- Find nearest neighbors using approximate nearest neighbor (ANN) algorithms:
HNSW (Hierarchical Navigable Small World) — most common, used by Elasticsearch, pgvector
IVF (Inverted File Index) — faster indexing, slightly less accurate
Trade-offs:
- Catches semantic meaning that lexical search misses
- But loses exact keyword matching ("error code XJ-4521" won't work well)
- Embedding model quality matters enormously
- Vectors are expensive to compute and store (768-1536 dimensions × 4 bytes each)
Hybrid search — current best practice:
Combine BM25 (lexical) + vector (semantic) for the best of both worlds:
Strategy 1: Reciprocal Rank Fusion (RRF)
BM25 results: [docA:rank1, docB:rank2, docC:rank3, ...]
Vector results: [docC:rank1, docA:rank2, docD:rank3, ...]
RRF_score(doc) = Σ 1/(k + rank_in_list) where k = 60 (constant)
docA: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325 ← top result
docC: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
docB: 1/(60+2) + 0 = 0.0161
Strategy 2: Weighted linear combination
final_score = α × BM25_score + (1-α) × vector_score
Tune α based on your use case (start with 0.5, adjust based on relevance testing)
Hybrid search is the current industry standard for production search systems.
Learning to Rank (LTR):
Use machine learning to optimize ranking based on user behavior:
1. Collect features per query-document pair:
- BM25 score, vector similarity, document freshness, popularity, click count
- Query-specific: exact title match, category match, price range match
2. Collect training labels from user behavior:
- Click-through rate, dwell time, add-to-cart, purchase
- Explicit relevance judgments from human raters
3. Train a model (LambdaMART, neural LTR) to predict relevance
4. Use the model to re-rank results at query time
Used by: Google, Amazon, Netflix, Airbnb — any search with enough traffic to generate training data.
LTR is overkill for most applications. Start with BM25, add vectors if needed, then consider LTR.
Elasticsearch is the most widely-deployed search engine. Understanding its architecture is essential for operating it without surprises.
Architecture:
Cluster → Nodes → Indices → Shards → Segments
Cluster: "production-search" (group of nodes)
│
├── Node 1 (master-eligible, data)
│ ├── Index: "products"
│ │ ├── Shard 0 (primary)
│ │ │ ├── Segment 0 (immutable, contains 50K docs)
│ │ │ ├── Segment 1 (immutable, contains 30K docs)
│ │ │ └── Segment 2 (immutable, contains 10K docs, recently flushed)
│ │ └── Shard 2 (primary)
│ └── Index: "orders"
│ └── Shard 1 (replica)
│
├── Node 2 (data)
│ ├── Index: "products"
│ │ ├── Shard 1 (primary)
│ │ └── Shard 0 (replica)
│ └── Index: "orders"
│ └── Shard 0 (primary)
│
└── Node 3 (coordinating only — routes queries, merges results)
Node types:
- Master: Manages cluster state (index creation, shard allocation). 3 dedicated masters minimum.
- Data: Stores data, executes searches and aggregations. Scale horizontally.
- Ingest: Preprocesses documents (transform, enrich) before indexing.
- Coordinating: Routes requests, merges results. Useful for heavy aggregation workloads.
Index mappings — defining your schema:
PUT /products
{
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "english" },
"title_exact": { "type": "keyword" },
"description": { "type": "text" },
"price": { "type": "float" },
"category": { "type": "keyword" },
"tags": { "type": "keyword" },
"created_at": { "type": "date" },
"location": { "type": "geo_point" },
"ratings": { "type": "nested",
"properties": {
"user_id": { "type": "keyword" },
"score": { "type": "integer" },
"comment": { "type": "text" }
}
}
}
}
}text vs keyword — the most important distinction:
"text" field (analyzed):
Input: "Quick Brown Fox"
Stored as tokens: ["quick", "brown", "fox"]
Supports: full-text search (match query), relevance scoring
Does NOT support: exact match, sorting, aggregations (use .keyword sub-field)
"keyword" field (not analyzed):
Input: "Quick Brown Fox"
Stored as: "Quick Brown Fox" (exact string)
Supports: exact match (term query), sorting, aggregations, filtering
Does NOT support: full-text search
Common pattern — use both:
"title": {
"type": "text",
"fields": {
"keyword": { "type": "keyword" } ← multi-field mapping
}
}
Search: match on "title" (full-text)
Filter/sort/aggregate: term on "title.keyword" (exact)
Query DSL essentials:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "database optimization" } }
],
"filter": [
{ "term": { "status": "published" } },
{ "range": { "created_at": { "gte": "2024-01-01" } } }
],
"should": [
{ "match": { "tags": "postgresql" } }
],
"must_not": [
{ "term": { "archived": true } }
]
}
}
}bool query breakdown:
must: Documents MUST match. Contributes to relevance score.
filter: Documents MUST match. Does NOT contribute to score (faster, cacheable).
should: Documents SHOULD match. Boosts score if they do. (Optional match.)
must_not: Documents MUST NOT match. Does not affect score.
Rule of thumb:
- Use "must" for the main search terms (what the user typed)
- Use "filter" for structured constraints (status, date range, category)
- Use "should" for boosting signals (prefer recent docs, prefer certain categories)
- Use "must_not" for exclusions (hide archived, deleted, blocked content)
The filter vs must distinction is a performance lever. Filter clauses are cached at the node level and can be reused across queries. must clauses compute scores and can't be cached. Put your date ranges, status filters, and category filters in filter — not must.
Query types — when to use which:
match: Full-text search. Analyzes the query, finds docs with matching terms.
{ "match": { "title": "database optimization" } }
Matches: "Optimization techniques for databases"
term: Exact match. No analysis. Use for keyword fields only.
{ "term": { "status": "published" } }
Does NOT match "Published" (case-sensitive, no analysis)
match_phrase: All terms must appear in order, adjacent.
{ "match_phrase": { "title": "New York" } }
Matches: "Visit New York City"
Does NOT match: "York is New"
multi_match: Search across multiple fields.
{ "multi_match": {
"query": "database optimization",
"fields": ["title^3", "description", "tags^2"],
"type": "best_fields"
}}
title matches score 3x, tags 2x, description 1x
Aggregations — analytics on search results:
{
"size": 0,
"aggs": {
"categories": {
"terms": { "field": "category", "size": 20 }
},
"price_ranges": {
"histogram": { "field": "price", "interval": 50 }
},
"avg_rating": {
"avg": { "field": "rating" }
},
"monthly_sales": {
"date_histogram": {
"field": "sold_at",
"calendar_interval": "month"
}
}
}
}Aggregations power: faceted navigation (sidebar filters with counts), dashboards, analytics, and monitoring. They run on the same query scope, so filtering narrows both results and aggregation counts simultaneously.
Fuzzy matching for typo tolerance:
{
"query": {
"match": {
"title": {
"query": "databse optimizaton",
"fuzziness": "AUTO"
}
}
}
}Fuzziness = Levenshtein edit distance (insertions, deletions, substitutions):
"databse" → "database" (1 edit: insert 'a')
"optimizaton" → "optimization" (1 edit: insert 'i')
AUTO fuzziness (recommended):
0-2 characters: exact match only
3-5 characters: 1 edit allowed
6+ characters: 2 edits allowed
Autocomplete — three approaches:
1. Completion Suggester (FST-based, fastest):
- Uses a Finite State Transducer (compact, in-memory)
- Prefix matching only
- Best for: search box suggestions with known terms
2. Edge n-grams (flexible, good for search-as-you-type):
- Index "elasticsearch" as ["el", "ela", "elas", "elast", ...]
- Regular match query works for autocomplete
- Best for: full-text autocomplete on document content
3. search_as_you_type field type (built-in, simplest):
- Automatically creates edge n-gram and shingle sub-fields
- Just use multi_match with type: "bool_prefix"
- Best for: quick setup, good-enough autocomplete
{ "match_bool_prefix": { "title.search_as_you_type": "data optim" } }
Highlighting — show matched terms in context:
{
"query": { "match": { "description": "database optimization" } },
"highlight": {
"fields": {
"description": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fragment_size": 150,
"number_of_fragments": 3
}
}
}
}Returns: "...techniques for <mark>database</mark> <mark>optimization</mark> in production systems..."
Faceted search — the filter sidebar:
Faceted search combines a query with aggregations to show users what filters are available and how many results each filter would produce:
User searches "laptop":
Results: 2,345 laptops
Facets (from aggregations):
Brand: Apple (543) | Dell (421) | Lenovo (389) | HP (312) | ...
Price: $0-500 (234) | $500-1000 (876) | $1000-2000 (901) | $2000+ (334)
RAM: 8GB (567) | 16GB (1,102) | 32GB (543) | 64GB (133)
Rating: ★★★★+ (1,203) | ★★★+ (1,890) | ★★+ (2,100)
User clicks "Apple" → filter applied → results narrow to 543 → facet counts update
Shard sizing:
Rules of thumb:
- Target 10-50GB per shard (sweet spot for most workloads)
- Avoid too many small shards (each shard has overhead: memory, file descriptors, threads)
- Avoid too-large shards (slow recovery, long GC pauses)
- Number of shards = expected_data_size / target_shard_size
Example:
Expected data: 500GB
Target shard size: 25GB
Primary shards: 20
Replicas: 1 (so 40 total shards across the cluster)
Over-sharding is the #1 operational mistake. 1000 tiny 100MB shards
is far worse than 10 shards of 10GB each.
Index Lifecycle Management (ILM):
Manage time-series data (logs, metrics, events) across hardware tiers:
Hot phase: (fast SSDs, lots of RAM)
→ Actively indexed and searched
→ Recent data (last 7 days)
Warm phase: (cheaper SSDs, less RAM)
→ Read-only, still searchable
→ Force-merge to 1 segment (fewer file handles, faster search)
→ Older data (7-30 days)
Cold phase: (HDDs or object storage)
→ Rarely searched, slow is acceptable
→ Searchable snapshots (data in S3, cached locally)
→ Historical data (30-90 days)
Delete phase:
→ Data past retention period is deleted automatically
This is how teams manage petabytes of log data without going bankrupt on storage.
Caching:
Elasticsearch caches at three levels:
1. Filter cache (node query cache):
- Caches the results of filter clauses (term, range in filter context)
- Bitmap of matching doc IDs — extremely fast for repeated filters
- Why filters should use "filter" context, not "must"
2. Request cache:
- Caches the entire response for a search request
- Only for size:0 requests (aggregations, count)
- Invalidated when the index is refreshed
3. Fielddata / doc values:
- Column-oriented data for sorting and aggregations
- Doc values are built at index time (disk-based, memory-mapped)
- Fielddata is built at query time (heap-heavy, avoid for text fields)
Performance tuning checklist:
□ Use filter context for non-scoring queries (cacheable, faster)
□ Increase refresh_interval from 1s to 30s for write-heavy indices
□ Bulk index with _bulk API (not individual document puts)
□ Denormalize data — don't model like a relational DB (avoid parent-child, minimize nested)
□ Use routing to colocate related documents on the same shard
□ Monitor slow query log: index.search.slowlog.threshold.query.warn: 10s
□ Force-merge read-only indices to 1 segment
□ Use doc values instead of fielddata for sorting/aggregations
□ Right-size your shards (10-50GB each)
Reindexing strategies:
Zero-downtime reindex with alias swap:
1. Current state:
Alias "products" → Index "products-v1"
2. Create new index with updated mappings:
PUT products-v2 { new mappings, new analyzers }
3. Reindex data:
POST _reindex { "source": { "index": "products-v1" }, "dest": { "index": "products-v2" } }
4. Swap alias atomically:
POST _aliases {
"actions": [
{ "remove": { "index": "products-v1", "alias": "products" } },
{ "add": { "index": "products-v2", "alias": "products" } }
]
}
5. Application never knows the index changed — it queries the "products" alias.
6. Delete products-v1 when confident.
Always use aliases. Never point application code at concrete index names.
| Tool | Best For | Architecture |
|---|---|---|
| Elasticsearch/OpenSearch | Full-featured search + analytics | Distributed, JVM-based, scales to petabytes |
| Meilisearch | Instant search, product catalogs | Single binary, Rust, sub-50ms responses, simple config |
| Typesense | Similar to Meilisearch, typo-tolerant | C++, simple to operate, built-in clustering |
| PostgreSQL FTS | Good-enough search within Postgres | tsvector + GIN index, no extra infrastructure |
| pgvector + tsvector | Hybrid search in Postgres | Combine full-text + vector search, no extra infra |
| Algolia | Managed, instant search | SaaS, great DX, expensive at scale |
Decision framework:
"Do I need a separate search engine?"
< 100K documents + simple search → PostgreSQL full-text search (tsvector + GIN)
- No extra infrastructure
- Transactionally consistent with your data
- Supports ranking, highlighting, fuzzy matching
- Example:
SELECT title, ts_rank(search_vector, query) AS rank
FROM products, plainto_tsquery('english', 'database optimization') query
WHERE search_vector @@ query
ORDER BY rank DESC LIMIT 20;
100K-10M documents + fast autocomplete → Meilisearch or Typesense
- Drop-in search engine with great defaults
- Sub-50ms response times, typo tolerance, faceting built-in
- Much simpler to operate than Elasticsearch
10M+ documents OR complex analytics → Elasticsearch / OpenSearch
- Full-featured: aggregations, geo search, nested objects, custom analyzers
- Scales horizontally to petabytes
- Higher operational complexity (JVM tuning, shard management, cluster ops)
Any scale + semantic search needed → Add vector search
- pgvector if already on Postgres
- Elasticsearch kNN if already on ES
- Pinecone / Weaviate / Qdrant for dedicated vector DB
Start simple. Adding Elasticsearch to a system with 10K documents is operational overhead for zero benefit.
Good search engineering is wasted without good search UX. These patterns are battle-tested:
Autocomplete / typeahead:
Implementation checklist:
□ Debounce input: 200-300ms (don't fire on every keystroke)
□ Minimum query length: 2-3 characters (don't search on "a")
□ Show suggestions after 2+ characters typed
□ Keyboard navigation (arrow keys + Enter)
□ Highlight the matching portion in suggestions
□ Cancel in-flight requests when user types more (AbortController)
□ Cache recent queries client-side
const controller = new AbortController();
const response = await fetch(`/api/search?q=${query}`, {
signal: controller.signal
});
// On new keystroke: controller.abort() → cancels previous request
"Did you mean?" suggestions:
Three approaches:
1. Fuzzy matching: find terms within edit distance 1-2 of the query
2. Phonetic matching: Metaphone/Soundex ("Jon" → "John")
3. Popularity-weighted: suggest the most common similar query from search logs
Best practice: combine all three.
If a query returns 0 results but a similar query returns 1000+, show "Did you mean: [similar query]?"
Faceted navigation:
The filter sidebar with counts. Rules:
1. Show facets relevant to current results (not all possible values)
2. Update counts dynamically as filters are applied
3. Use OR within a facet, AND between facets:
Category: [Laptops OR Tablets] AND Brand: [Apple] AND Price: [$500-$1000]
4. Show count of results each filter value would produce
5. Don't show filter values with 0 results (or show them grayed out)
6. Allow removing individual filters (breadcrumbs / pills)
Zero-results page:
When search returns nothing, don't show a blank page:
1. Check for typos and suggest corrections
2. Show results for a relaxed query (drop least important terms)
3. Suggest popular/trending searches
4. Show category browse options
5. Log zero-result queries — they reveal gaps in your content or analysis pipeline
Search analytics — measure and improve:
Track these metrics:
1. Query volume: what people search for (query distribution)
2. Zero-result rate: % of queries with no results (target: < 5%)
3. Click-through rate: % of searches where user clicks a result
4. Click position: which result position gets clicked (lower is better)
5. Refinement rate: % of searches followed by another search (user didn't find what they wanted)
6. Time to first click: how long users spend before clicking
7. Search exit rate: % of users who leave after searching (worst case)
Zero-result queries are a gold mine: they tell you exactly what your search is failing at.
Analyze them weekly and fix the top offenders (add synonyms, fix analysis, add content).
| Problem | Data Structure / Algorithm | Why |
|---|---|---|
| Key-value lookup | Hash map | O(1) average lookup |
| Ordered data, range queries | B+ Tree (database index) | O(log n) lookup + sequential range scan |
| Write-heavy ingestion | LSM Tree (Cassandra, RocksDB) | O(1) amortized writes, sequential I/O |
| Set membership at scale | Bloom filter | O(1) check, tiny memory footprint |
| Count distinct at scale | HyperLogLog | 12KB for billions of elements |
| Frequency estimation in streams | Count-Min Sketch | Bounded memory, no false negatives on count |
| In-memory cache with eviction | LRU Cache | O(1) get/put with bounded memory |
| Rate limiting | Token bucket / sliding window | Smooth rate enforcement with burst support |
| Distribute data across nodes | Consistent hashing | Minimal redistribution on node add/remove |
| Find boundary in sorted data | Binary search | O(log n), universally applicable |
| Sort data larger than memory | External merge sort | Disk-friendly, O(n log n) I/O operations |
| Task/build ordering | Topological sort | Respects dependencies, detects cycles |
| Shortest path (unweighted) | BFS | O(V + E), guarantees shortest |
| Shortest path (weighted) | Dijkstra | O((V + E) log V) with priority queue |
| Cycle/deadlock detection | DFS with coloring | O(V + E), finds back edges |
| Autocomplete, prefix matching | Trie / Radix tree | O(k) lookup where k = key length |
| Ordered set with concurrency | Skip list | O(log n), simpler concurrent access than trees |
| Distributed unique IDs | Snowflake / UUIDv7 / ULID | Time-sorted, no coordination, index-friendly |
| Full-text search | Inverted index (Elasticsearch, Solr) | O(query_terms) lookup, sub-second on millions of docs |
| Lexical relevance ranking | BM25 | Industry standard, handles term saturation and doc length |
| Semantic / similarity search | Vector search (HNSW, IVF) | Catches meaning, not just keywords |
| Search within Postgres | tsvector + GIN index | No extra infra, good enough for < 100K docs |
| Typo-tolerant instant search | Meilisearch / Typesense | Simple ops, sub-50ms, great defaults |
The structures in this chapter are not interview trivia. They are the building blocks of every database, cache, load balancer, and distributed system you work with every day. The B+ tree in Postgres's indexes, the skip list in Redis sorted sets, the Bloom filter in Cassandra's LSM read path, the consistent hashing ring in DynamoDB's partition layer, the HyperLogLog behind PFCOUNT — you interact with these data structures constantly, whether you know it or not. Understanding them means you can read a Postgres EXPLAIN plan and know what is happening, choose the right database for your workload, debug performance problems from first principles, and design systems that scale. That is what separates engineers who understand their tools from engineers who cargo-cult configurations.
Want to put this into practice? The TicketPulse course has hands-on modules that build on these concepts:
- L1-M07: Indexing & Query Performance — Apply B-tree and hash index knowledge to make TicketPulse's ticket queries 10x faster
- L2-M40: Search Engineering — Build full-text search for TicketPulse events using inverted indexes and relevance scoring
- L3-M65: Consistent Hashing & Distributed Cache — Implement consistent hashing to distribute TicketPulse's cache layer across nodes without thundering-herd on reshards
- Implement an LRU cache in your language of choice using only a hash map and a doubly-linked list — no library calls — and verify it evicts the correct entry.
- Profile a slow function in your codebase: add timing instrumentation, measure it under realistic load, then write down its Big-O complexity for both time and space.
- Find where your codebase uses a bloom filter, B-tree, or hash map — hint: your database uses all three internally. Check
EXPLAINoutput, your cache client docs, and any "exists" check in your application layer.